Record Extractor
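Introduction
This page only covers the helpers.docsearch method. For complete information about the Algolia Crawler, see the Algolia Crawler documentation.
Page content is extracted by a recordExtractor. Extractors are assigned to actions through the recordExtractor parameter. This parameter points to a function that returns the data you want to index, organized as an array of JSON objects.
The helpers are a collection of functions that help you extract content and generate Algolia records.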
Usage
The most common way to use the DocSearch helper is to return its result from the recordExtractor function.
recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},
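The same lvl0 to lvl5 and content selectors recur in most extractors, so it can help to keep them in one place. The following is a minimal sketch of that idea, not part of the DocSearch API: `baseRecordProps` and `makeRecordExtractor` are hypothetical names introduced here, and the only crawler call used is `helpers.docsearch` as shown above.

```javascript
// Hypothetical shared selectors; adjust them to your site's markup.
const baseRecordProps = {
  lvl0: { selectors: "header h1" },
  lvl1: "article h2",
  lvl2: "article h3",
  lvl3: "article h4",
  lvl4: "article h5",
  lvl5: "article h6",
  content: "main p, main li",
};

// Builds a recordExtractor that merges per-action overrides into the base.
// The base object is never mutated because the spread creates a new object.
function makeRecordExtractor(overrides = {}) {
  return ({ helpers }) =>
    helpers.docsearch({
      recordProps: { ...baseRecordProps, ...overrides },
    });
}

// Example: an extractor that reads content from <article> instead of <main>.
const apiExtractor = makeRecordExtractor({
  content: "article p, article li",
});
```

The returned function can then be assigned to the recordExtractor parameter of an action, for example `recordExtractor: apiExtractor`.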
Manipulate the DOM with Cheerio
The Cheerio instance ($) lets you manipulate the DOM:
recordExtractor: ({ $, helpers }) => {
  // Remove DOM elements we don't want to crawl
  $(".my-warning-message").remove();
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},
Provide fallback selectors
Fallback selectors are useful when you want to retrieve content that might not exist on some pages:
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      // `.exists h1` will be selected if `.exists-probably h1` does not exist.
      lvl0: {
        selectors: [".exists-probably h1", ".exists h1"],
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      // `.exists p, .exists li` will be selected.
      content: [
        ".does-not-exists p, .does-not-exists li",
        ".exists p, .exists li",
      ],
    },
  });
},
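Provide raw text (defaultValue)
Only lvl0 and custom variables support this option.
You might want to structure your search results differently from your website, or provide a defaultValue for a selector that might not exist: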
recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably h1",
        defaultValue: "myRawTextIfDoesNotExists",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably .language",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
    },
  });
},
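To make the resolution order concrete, here is a small model of how fallback selectors and defaultValue combine: selectors are tried in order, and defaultValue is used only when none of them matches. This is an illustrative sketch, not the crawler's implementation. `resolveProp`, `textFor`, and `pageText` are hypothetical names, and the page is simulated with a plain object instead of a real DOM.

```javascript
// Resolves a property to the text of the first matching selector,
// falling back to defaultValue when no selector matches.
// `textFor` is a hypothetical lookup: selector -> matched text or undefined.
function resolveProp(prop, textFor) {
  const { selectors = [], defaultValue } =
    typeof prop === "string" || Array.isArray(prop)
      ? { selectors: prop }
      : prop;
  const candidates = Array.isArray(selectors) ? selectors : [selectors];
  for (const selector of candidates) {
    const text = textFor(selector);
    if (text !== undefined) return text;
  }
  return defaultValue;
}

// Simulated page: only these selectors match anything.
const pageText = { "article h2": "Install", ".exists h1": "Guide" };
const textFor = (selector) => pageText[selector];

// No selector matches, so the raw defaultValue is used.
const lvl0 = resolveProp(
  { selectors: ".exists-probably h1", defaultValue: "myRawTextIfDoesNotExists" },
  textFor
);

// The first selector is missing, so the second one is used.
const title = resolveProp(
  { selectors: [".exists-probably h1", ".exists h1"] },
  textFor
);

// Raw values without selectors are returned as-is.
const version = resolveProp({ defaultValue: ["latest", "stable"] }, textFor);
```

In this model a matching selector always wins over defaultValue, which mirrors the description above of defaultValue as a value for selectors that might not exist.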
Index content for faceting
These selectors also support defaultValue and fallback selectors.
You might want to index content that your frontend uses as filters (for example, version or lang). To add such values to your Algolia records, define custom variables in the recordProps object:
recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      foo: ".bar",
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".does-not-exists",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
      version: {
        // You can send raw values without `selectors`
        defaultValue: ["latest", "stable"],
      },
    },
  });
},
The following version, language, and foo attributes will be available in your records:
foo: "valueFromBarSelector",
language: ["en", "en-US"],
version: ["latest", "stable"]
You can now use these attributes to filter search results in your frontend.
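On the frontend, such filters are usually expressed with Algolia's facetFilters search parameter, where top-level entries are combined with AND and values inside a nested array are combined with OR. The sketch below builds that structure from a plain object. `buildFacetFilters` is a hypothetical helper introduced here, and it assumes the attributes have been declared for faceting in your index settings.

```javascript
// Turns { attribute: value | value[] } into the facetFilters format.
// A single value becomes "attribute:value"; an array becomes a nested
// array of "attribute:value" strings, which Algolia treats as an OR group.
function buildFacetFilters(filters) {
  return Object.entries(filters).map(([attribute, value]) =>
    Array.isArray(value)
      ? value.map((item) => `${attribute}:${item}`)
      : `${attribute}:${value}`
  );
}

const facetFilters = buildFacetFilters({
  language: "en",
  version: ["latest", "stable"],
});
// facetFilters is ["language:en", ["version:latest", "version:stable"]]
```

With the DocSearch frontend, the resulting array could be passed as `searchParameters: { facetFilters }`; check the option name against the version of DocSearch you use.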
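Boost search results with pageRank
This parameter lets you boost records using a custom ranking attribute built from the current pathsToMatch. Pages with a higher pageRank are returned before pages with a lower pageRank. The default value is 0, and you can pass any numeric value as a string, including negative values.
Search results are sorted by weight in descending order, so boosted and non-boosted results can appear together. The weight of each result is computed for a given query from several factors, such as match level and position, and the pageRank value is added to that final weight. Depending on your overall ranking settings, pageRank alone might not be enough to change the results of a query. If changing pageRank values, even to large ones, does not noticeably affect your search results, move weight.pageRank higher in the "Ranking and Sorting" page of your index.
You can view the computed weight directly in the Algolia dashboard (dashboard.algolia.com → Search → run a search → hover over the "ranking criteria" icon at the bottom right of each record). This helps you determine a suitable range of pageRank values.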
{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers, url }) => {
    const isDocPage = /\/[\w-]+\/docs\//.test(url.pathname);
    const isBlogPage = /\/[\w-]+\/blog\//.test(url.pathname);
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
        pageRank: isDocPage ? "-2000" : isBlogPage ? "-1000" : "0",
      },
    });
  },
},
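The path tests in this example are easy to get subtly wrong, so it can be worth pulling them into a small function that can be checked in isolation. The sketch below reuses the regular expressions and values from the example above; `pageRankFor` is a hypothetical name, not part of the crawler API.

```javascript
// Returns the pageRank string for a URL pathname, using the same
// patterns and values as the example above.
function pageRankFor(pathname) {
  // Both patterns require a leading segment, such as "/en/docs/".
  if (/\/[\w-]+\/docs\//.test(pathname)) return "-2000";
  if (/\/[\w-]+\/blog\//.test(pathname)) return "-1000";
  return "0";
}

const docRank = pageRankFor("/en/docs/intro");
const blogRank = pageRankFor("/en/blog/release-notes");
const rootDocRank = pageRankFor("/docs/intro");
```

Note that a root-level path such as `/docs/intro` does not match the first pattern, because the pattern expects another path segment before `/docs/`. In the extractor, the function would be called as `pageRank: pageRankFor(url.pathname)`.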
Reduce the number of records
If a page outputs more than 750 records, you may encounter the "Extractors returned too many records" error. The aggregateContent option helps reduce the number of records at the content level of the extractor.
{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
      },
      aggregateContent: true,
    });
  },
},
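As a rough mental model of why aggregation lowers the record count, the sketch below merges content entries that share a heading into a single entry. It is purely illustrative: the `{ heading, content }` shape, the `aggregateByHeading` name, and joining with a space are all assumptions made here, and the crawler's actual record format and merging rules may differ.

```javascript
// Merges content entries that share a heading into one entry per heading.
// Input and output use a simplified, hypothetical record shape.
function aggregateByHeading(entries) {
  const merged = new Map();
  for (const { heading, content } of entries) {
    const previous = merged.get(heading);
    merged.set(heading, previous === undefined ? content : `${previous} ${content}`);
  }
  return [...merged].map(([heading, content]) => ({ heading, content }));
}

const aggregated = aggregateByHeading([
  { heading: "Install", content: "Run the installer." },
  { heading: "Install", content: "Restart your shell." },
  { heading: "Usage", content: "Call the CLI." },
]);
// Three content entries become two, one per heading.
```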
Reduce the record size
If you encounter the "Records extracted are too big" error when crawling your website, it is usually because your records contain too much information or because your page is too large. The recordVersion option helps reduce record size by removing information that is only used by DocSearch v2.
{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
      },
      recordVersion: "v3",
    });
  },
},
recordProps API Reference
lvl0
type: Lvl0 | required
type Lvl0 = {
  selectors: string | string[];
  defaultValue?: string;
};
lvl1, content
type: string | string[] | required
lvl2, lvl3, lvl4, lvl5, lvl6
type: string | string[] | optional
pageRank
type: number | optional
See the example in "Boost search results with pageRank".
Custom variables
type: string | string[] | CustomVariable | optional
type CustomVariable =
  | {
      defaultValue: string | string[];
    }
  | {
      selectors: string | string[];
      defaultValue?: string | string[];
    };
Custom variables are used to filter your search. You define them in recordProps.
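helpers.docsearch API Reference
aggregateContent
type: boolean | default: true | optional
This option groups the content-level records under a heading into a single record, which reduces the number of records.
recordVersion
type: 'v3' | 'v2' | default: v2 | optional
This option removes content from the Algolia records that is only used by DocSearch v2. If you are using the latest version of DocSearch, you can set it to v3.
indexHeadings
type: boolean | { from: number, to: number } | default: true | optional
This option controls whether headings (lvlX) are indexed.
- When set to false, only records for the content level are created.
- When a from, to range is provided, only records for the lvlX to lvlY levels are created.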
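For reference, this is how indexHeadings might sit next to the other helpers.docsearch options in an action. It is a sketch that follows the shape of the earlier examples. The range values are arbitrary, the `headingsAction` name is introduced here for illustration, and whether the bounds are inclusive is not stated on this page, so verify the resulting records in your index.

```javascript
// Hypothetical action; index name and URL are placeholders as above.
const headingsAction = {
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        content: "article p, article li",
      },
      // Restrict heading records to a range of levels.
      // Use `indexHeadings: false` to create content-level records only.
      indexHeadings: { from: 1, to: 3 },
    });
  },
};
```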