Configuring Algolia Search

The mx-space documentation has a fairly detailed configuration tutorial; other blog frameworks are probably much the same.

Index Size Limit

Unfortunately, after configuring everything according to the documentation, an error showed up in the log:

16:40:40  ERROR   [AlgoliaSearch]  algolia 推送错误
16:40:40  ERROR   [Event]  Record at the position 10 objectID=xxxxxxxx is too big size=12097/10000 bytes. Please have a look at
  https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits

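The cause is clear: one blog post is too long, and the free Algolia plan allows only about 10 KB per record (10000 bytes, according to the log). As someone who wants to keep using it for free, I was not going to put up with that, so I went looking for a workaround.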
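
To see which records would trip this limit before pushing, the size of each record can be approximated locally. This is a minimal sketch, assuming Algolia measures roughly the UTF-8 byte length of the JSON-serialized record (the exact accounting on Algolia's side may differ slightly). `record_size` and `oversized` are hypothetical helper names, the sample records are made up, and the 10000-byte limit is taken from the error message above.

```python
import json

LIMIT = 10000  # bytes, taken from the "size=12097/10000" error message


def record_size(record):
    """Approximate record size: UTF-8 byte length of its JSON serialization."""
    return len(json.dumps(record, ensure_ascii=False).encode("utf-8"))


def oversized(records, limit=LIMIT):
    """Return (objectID, size) for every record above the limit."""
    sizes = [(r.get("objectID"), record_size(r)) for r in records]
    return [(oid, size) for oid, size in sizes if size > limit]


# Made-up sample data: one short record, one long record.
records = [
    {"objectID": "short", "text": "hello"},
    {"objectID": "long", "text": "字" * 4000},  # each CJK character is 3 bytes in UTF-8
]
print(oversized(records))
```

Note that the byte count, not the character count, is what matters: 4000 Chinese characters already occupy 12000 bytes.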

Solution

Approach

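For mx-space, once an API Token is configured, the JSON file used for manually submitting records to the Algolia index can be fetched from /api/v2/search/algolia/import-json.
It contains a list of posts, pages, and notes. A sample entry looks like this: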

{
  "title": "南京大学IPv4地址范围",
  "text": "# 动机\n\n<details>\n<summary>动机来自于搭建的网页。由于校内和公网都有搭建....",
  "slug": "nju-ipv4",
  "categoryId": "abcdefg",
  "category": {
    "_id": "abcdefg",
    "name": "其他",
    "slug": "others",
    "id": "abcdefg"
  },
  "id": "1234567",
  "objectID": "1234567",
  "type": "post"
},

The objectID field is the critical one: every value submitted to Algolia must be unique.
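
Because uniqueness matters here, it can be worth checking a list of records for duplicate objectID values before pushing it. A minimal sketch: `duplicate_object_ids` is a hypothetical helper, not part of mx-space or the Algolia client, and the sample records are made up for illustration.

```python
from collections import Counter


def duplicate_object_ids(records):
    """Return the sorted list of objectID values that occur more than once."""
    counts = Counter(r["objectID"] for r in records)
    return sorted(oid for oid, n in counts.items() if n > 1)


# Made-up sample data with one deliberate collision.
sample = [
    {"objectID": "1234567", "type": "post"},
    {"objectID": "7654321", "type": "note"},
    {"objectID": "1234567", "type": "page"},
]
print(duplicate_object_ids(sample))
```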

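The idea I came up with was pagination: split any article whose text is too long into several records and give each one a modified objectID. Problem solved, right?! (Obviously, at this point I had not yet realized how serious the problem was.)
Some of my pages also contain <style> and <script> blocks, which can simply be stripped with a regular expression.
That led to the following Python code, which edits the JSON downloaded from the endpoint above and submits it to Algolia.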

from algoliasearch.search.client import SearchClientSync
import requests
import json
import math
import os
from copy import deepcopy
import re

MAXSIZE = 9990
APPID = "..."
APPKey = "..."
MXSPACETOKEN = "..."
url = "https://www.do1e.cn/api/v2/search/algolia/import-json"
headers = {
  "Authorization": MXSPACETOKEN,
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0",
}

ret = requests.get(url, headers=headers)
ret = ret.json()
with open("data.json", "w", encoding="utf-8") as f:
  json.dump(ret, f, ensure_ascii=False, indent=2)
to_push = []

def json_length(item):
  content = json.dumps(item, ensure_ascii=False).encode("utf-8")
  return len(content)

def right_text(text):
  # Return True if the byte string is valid UTF-8 (no character cut in half)
  try:
    text.decode("utf-8")
    return True
  except UnicodeDecodeError:
    return False

def cut_json(item):
  length = json_length(item)
  text_length = len(item["text"].encode("utf-8"))
  # Number of chunks: text bytes divided by the byte budget left for text
  n = math.ceil(text_length / (MAXSIZE - length + text_length))
  start = 0
  text_content = item["text"].encode("utf-8")
  for i in range(n):
    new_item = deepcopy(item)
    new_item["objectID"] = f"{item['objectID']}_{i}"
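    # The last chunk takes everything that remains, so no trailing text is dropped
    end = len(text_content) if i == n - 1 else start + text_length // n
    # Make sure the cut lands on a character boundary (a Chinese character is 3 bytes in UTF-8)
    while not right_text(text_content[start:end]):
      end -= 1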
    new_item["text"] = text_content[start:end].decode("utf-8")
    start = end
    to_push.append(new_item)

for item in ret:
  # Strip <style> and <script> blocks
  item["text"] = re.sub(r"<style.*?>.*?</style>", "", item["text"], flags=re.DOTALL)
  item["text"] = re.sub(r"<script.*?>.*?</script>", "", item["text"], flags=re.DOTALL)
  if json_length(item) > MAXSIZE: # Over the limit: split into chunks
    print(f"{item['title']} is too large, cut it")
    cut_json(item)
  else: # Under the limit: still rewrite objectID so the format is consistent
    item["objectID"] = f"{item['objectID']}_0"
    to_push.append(item)

with open("topush.json", "w", encoding="utf-8") as f:
  json.dump(to_push, f, ensure_ascii=False, indent=2)

client = SearchClientSync(APPID, APPKey)
resp = client.replace_all_objects("mx-space", to_push)
print(resp)
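
The trial-decode loop in cut_json can also be written by inspecting bytes directly: in UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so a cut point is valid whenever the byte at that position is not a continuation byte. Here is a minimal sketch of that alternative. `split_utf8` is a hypothetical helper and is not what the script above uses; it cuts greedily at a fixed byte budget rather than into n equal parts.

```python
def split_utf8(text, max_bytes):
    """Split text into pieces of at most max_bytes UTF-8 bytes each,
    cutting only on character boundaries."""
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # If the byte at the cut point is a continuation byte (0b10xxxxxx),
        # the cut would split a character, so move the cut point back.
        while end > start and end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            raise ValueError("max_bytes is too small to hold a single character")
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


text = "中文abc" * 10
pieces = split_utf8(text, 7)
print(len(pieces), all(len(p.encode("utf-8")) <= 7 for p in pieces))
```

Joining the pieces gives back the original text, which makes this easy to check before pushing. As in cut_json, the budget here applies to the raw text bytes only, so the JSON overhead of the other fields and of escaped characters still has to be subtracted from the limit first.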

If you use a different blog framework, you can stop reading here. I hope this gave you some ideas.

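Great. With the search index modified in Python, resubmitted to Algolia, and search enabled in the mx-space admin panel, let's try searching for the post on JPEG encoding details that exceeded the limit.
Why are there no results? And why is the backend reporting errors again?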

17:03:46  ERROR   [Catch]  Cast to ObjectId failed for value "1234567_0" (type string) at path "_id" for model "posts"
  at SchemaObjectId.cast (entrypoints.js:1073:883)
  at SchemaType.applySetters (entrypoints.js:1187:226)
  at SchemaType.castForQuery (entrypoints.js:1199:338)
  at cast (entrypoints.js:159:5360)
  at Query.cast (entrypoints.js:799:583)
  at Query._castConditions (entrypoints.js:765:9879)
  at Hr.Query._findOne (entrypoints.js:768:4304)
  at Hr.Query.exec (entrypoints.js:784:5145)
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
  at async Promise.all (index 0)