Migrating from the legacy scraper
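Introduction
With the new version of the DocSearch UI now released, we wanted to go further and give you better tooling to create and maintain your config file, along with the Algolia features you have been requesting for a long time!
What's new?
Scraper
The DocSearch infrastructure now uses the Algolia Crawler. We teamed up with our partners to build a new DocSearch helper, which extracts records the same way our beloved legacy DocSearch scraper did!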
The best part is that you no longer need to install any tooling on your side if you want to maintain or update your index!
- Start, schedule, and monitor your crawls
- Edit your config file in the live editor
- Test your results directly with DocSearch v3 or DocSearch v4
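Algolia application and credentials
We received a lot of feature requests, including:
- Team management
- Browsing how your Algolia records are indexed
- Viewing and subscribing to other Algolia features
All of these are now available, for free, in your own Algolia application :D
FAQ
You can find answers related to the DocSearch migration on the Crawler FAQ page.
Useful links
Config file key mapping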
Below are the keys that can be found in the legacy DocSearch configs and their translation to an Algolia Crawler config. For more detailed information on the Algolia Crawler, see the official documentation.
| legacy | current | description |
|---|---|---|
| start_urls | startUrls | Now accepts URLs only; see helpers.docsearch to handle custom variables |
| page_rank | pageRank | Can be added to the recordProps in helpers.docsearch; should be passed as a string |
| js_render | renderJavaScript | Unchanged |
| js_wait | renderJavaScript.waitTime | See the documentation of renderJavaScript |
| index_name | removed, see actions | Handled directly in the actions |
| sitemap_urls | sitemaps | Unchanged |
| stop_urls | exclusionPatterns | Supports micromatch |
| selectors_exclude | removed | Should be handled in the recordExtractor and helpers.docsearch |
| custom_settings | initialIndexSettings | Unchanged |
| scrape_start_urls | removed | Can be handled with exclusionPatterns |
| strip_chars | removed | The # character is removed automatically from anchor links; edge cases should be handled in the recordExtractor and helpers.docsearch |
| conversation_id | removed | Not needed anymore |
| nb_hits | removed | Not needed anymore |
| sitemap_alternate_links | removed | Not needed anymore |
| stop_content | removed | Should be handled in the recordExtractor and helpers.docsearch |
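
As a rough illustration of the table above, the sketch below renames the keys of a legacy config object. The migrateLegacyConfig function and its return shape are hypothetical and not part of DocSearch or the Algolia Crawler. Keying initialIndexSettings by index name is an assumption worth checking against the Crawler documentation. The sample URLs and index name are placeholders.

```javascript
// Hypothetical helper, not part of DocSearch or the Algolia Crawler.
// It applies the key mapping from the table to a legacy config object.

// Keys that are renamed and copied over as-is.
const RENAMED_KEYS = {
  start_urls: "startUrls",
  js_render: "renderJavaScript",
  sitemap_urls: "sitemaps",
  stop_urls: "exclusionPatterns",
};

// Keys the table marks as removed; they are reported, never copied.
const REMOVED_KEYS = [
  "selectors_exclude",
  "scrape_start_urls",
  "strip_chars",
  "conversation_id",
  "nb_hits",
  "sitemap_alternate_links",
  "stop_content",
];

function migrateLegacyConfig(legacy) {
  const config = {};
  const pageRanks = {}; // url -> pageRank string, for use in recordProps
  const notes = []; // items that still need manual attention

  for (const [oldKey, newKey] of Object.entries(RENAMED_KEYS)) {
    if (oldKey in legacy) config[newKey] = legacy[oldKey];
  }

  // startUrls accepts URLs only, so object entries are reduced to their url.
  // page_rank is kept aside as a string, as the table requires.
  if (Array.isArray(legacy.start_urls)) {
    config.startUrls = legacy.start_urls.map((entry) => {
      if (typeof entry === "string") return entry;
      if (entry.page_rank !== undefined) {
        pageRanks[entry.url] = String(entry.page_rank);
      }
      return entry.url;
    });
  }

  // index_name is handled in the actions. The action produced here is only
  // a skeleton: it still needs a recordExtractor written by hand.
  if (legacy.index_name) {
    config.actions = [{ indexName: legacy.index_name }];
    if (legacy.custom_settings) {
      // Assumption: settings are keyed by index name.
      config.initialIndexSettings = {
        [legacy.index_name]: legacy.custom_settings,
      };
    }
  }

  if ("stop_urls" in legacy) {
    notes.push("stop_urls: review the copied patterns against micromatch syntax");
  }
  if ("js_wait" in legacy) {
    notes.push("js_wait: set renderJavaScript.waitTime by hand");
  }
  for (const key of REMOVED_KEYS) {
    if (key in legacy) notes.push(`${key}: removed, not copied`);
  }

  return { config, pageRanks, notes };
}

// Example run with placeholder values.
const migrated = migrateLegacyConfig({
  index_name: "example_docs",
  start_urls: [
    "https://example.com/docs/",
    { url: "https://example.com/api/", page_rank: 5 },
  ],
  sitemap_urls: ["https://example.com/sitemap.xml"],
  nb_hits: 1200,
});
console.log(JSON.stringify(migrated, null, 2));
```

The helper returns notes instead of guessing, because several legacy keys (stop_urls patterns, js_wait, and everything moved into the recordExtractor) have no mechanical translation and need a manual decision.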