Version: Legacy (v3.x)

Record Extractor

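Introduction

Info

This documentation only covers the helpers.docsearch method. For complete information about the Algolia Crawler, see the Algolia Crawler documentation.

Pages are extracted by a recordExtractor. These extractors are assigned to actions via the recordExtractor parameter. This parameter links to a function that returns the data you want to index, organized as an array of JSON objects.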
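
To make the shape of that return value concrete, here is a minimal sketch of a hand-written extractor that does not use any helper. The record fields (`url`, `title`) and the `fake$` stub are assumptions for illustration only; inside the crawler, `$` is a real Cheerio instance and `url` is a URL object.

```javascript
// Hand-written extractor: returns an array of plain, JSON-serializable objects.
const recordExtractor = ({ url, $ }) => {
  return [
    {
      url: url.href,
      title: $("head > title").text(),
    },
  ];
};

// Stand-in for the Cheerio instance, only so this sketch runs locally.
const fake$ = (selector) => ({
  text: () => (selector === "head > title" ? "Getting started" : ""),
});

const records = recordExtractor({
  url: new URL("https://example.com/docs/getting-started"),
  $: fake$,
});
console.log(JSON.stringify(records, null, 2));
```

The DocSearch helper described on this page builds such an array for you from a set of selectors.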

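The helpers are a collection of functions that help you extract content and generate Algolia records.

Useful links

Usage

The most common way to use the DocSearch helper is to return its result from the recordExtractor function.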

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},

Manipulate the DOM with Cheerio

The Cheerio instance ($) lets you manipulate the DOM:

recordExtractor: ({ $, helpers }) => {
  // Removing DOM elements we don't want to crawl
  $(".my-warning-message").remove();

  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},

Provide fallback selectors

Fallback selectors are useful when retrieving content that might not exist on some pages:

recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      // `.exists h1` will be selected if `.exists-probably h1` does not exist.
      lvl0: {
        selectors: [".exists-probably h1", ".exists h1"],
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      // `.exists p, .exists li` will be selected.
      content: [
        ".does-not-exists p, .does-not-exists li",
        ".exists p, .exists li",
      ],
    },
  });
},
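
The fallback behavior can be pictured as "use the first selector that matches something". The sketch below illustrates that rule only; `firstMatchingSelector`, `countMatches`, and the `matchCounts` lookup are hypothetical stand-ins and not part of the DocSearch helpers.

```javascript
// Returns the first selector that matches at least one element, or null.
// `countMatches` is a hypothetical callback; with Cheerio it could be
// (selector) => $(selector).length.
function firstMatchingSelector(selectors, countMatches) {
  const list = Array.isArray(selectors) ? selectors : [selectors];
  for (const selector of list) {
    if (countMatches(selector) > 0) return selector;
  }
  return null;
}

// Pretend page: only `.exists h1` is present.
const matchCounts = { ".exists-probably h1": 0, ".exists h1": 1 };
const countMatches = (selector) => matchCounts[selector] ?? 0;

const chosen = firstMatchingSelector(
  [".exists-probably h1", ".exists h1"],
  countMatches
);
console.log(chosen);
```

Ordering the array from most specific to most general keeps the preferred selector in control whenever it is present.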

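Provide raw text (`defaultValue`)

Only lvl0 and custom variables support this option.

You might want to structure your search results differently from your website, or provide a defaultValue for a selector that might not exist: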

recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably h1",
        defaultValue: "myRawTextIfDoesNotExists",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably .language",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
    },
  });
},
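
One way to read the defaultValue option is as a last resort once every selector has come up empty. The following sketch models that reading; `resolveProp`, `extractText`, and `pageText` are hypothetical names, and the real helper may differ in its details.

```javascript
// Resolves one recordProps entry: try each selector in order, then defaultValue.
// `extractText` is a hypothetical callback returning the text for a selector,
// or an empty string when nothing matches.
function resolveProp(prop, extractText) {
  const selectors = [].concat(prop.selectors ?? []);
  for (const selector of selectors) {
    const text = extractText(selector);
    if (text) return text;
  }
  return prop.defaultValue ?? null;
}

// Pretend page: only `.exists h1` yields any text.
const pageText = { ".exists h1": "API Reference" };
const extractText = (selector) => pageText[selector] ?? "";

const lvl0 = resolveProp(
  { selectors: ".exists-probably h1", defaultValue: "myRawTextIfDoesNotExists" },
  extractText
);
const version = resolveProp({ defaultValue: ["latest", "stable"] }, extractText);
console.log(lvl0, version);
```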

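Index content for faceting

These selectors also support defaultValue and fallback selectors.

You might want to index content that will be used as filters in your frontend (for example, version or lang). To include it in your Algolia records, add any custom variable to the recordProps object: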

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      foo: ".bar",
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".does-not-exists",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
      version: {
        // You can send raw values without `selectors`
        defaultValue: ["latest", "stable"],
      },
    },
  });
},

The following foo, language, and version attributes will be available in your records:

foo: "valueFromBarSelector",
language: ["en", "en-US"],
version: ["latest", "stable"]

You can now use these attributes to filter your search results in the frontend.
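
On the frontend, such attributes are typically passed as Algolia `facetFilters`, written as `attribute:value` strings. The helper below is a sketch for building that array; `toFacetFilters` is a hypothetical name, and it assumes the attributes have been declared as facets in the index settings.

```javascript
// Builds Algolia facetFilters from a plain object.
// An array value becomes a nested array, which Algolia treats as an OR group;
// top-level entries are combined with AND.
function toFacetFilters(filters) {
  return Object.entries(filters).map(([attribute, value]) =>
    Array.isArray(value)
      ? value.map((v) => `${attribute}:${v}`)
      : `${attribute}:${value}`
  );
}

const facetFilters = toFacetFilters({
  language: "en",
  version: ["latest", "stable"],
});
console.log(JSON.stringify(facetFilters));
```

The resulting array can then be handed to the DocSearch frontend through its searchParameters option.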

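Boost search results with `pageRank`

This parameter lets you boost records using a custom ranking attribute built from the current pathsToMatch. Pages with a higher pageRank are returned before pages with a lower pageRank. The default value is 0, and you can pass any numeric value as a string, including negative values.

Search results are sorted by weight in descending order, so boosted and non-boosted results can appear together. The weight of each result is computed for a given query from multiple factors (match level, position, and so on), and the pageRank value is added to this final weight. Depending on how your overall ranking is configured, the pageRank value alone may not be enough to influence the results of your query. If changing the pageRank value, even to a large value, does not influence your search results enough, move weight.pageRank higher on the "Ranking and Sorting" page of your index.

You can view the computed weight directly in the Algolia dashboard (dashboard.algolia.com → Search → perform a search → hover over the "ranking criteria" icon at the bottom right of each record). This helps you work out which range of pageRank values is appropriate.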

{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers, url }) => {
    const isDocPage = /\/[\w-]+\/docs\//.test(url.pathname);
    const isBlogPage = /\/[\w-]+\/blog\//.test(url.pathname);
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
        pageRank: isDocPage ? "-2000" : isBlogPage ? "-1000" : "0",
      },
    });
  },
},
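
Because the pageRank expression in this configuration depends only on the URL path, it can be pulled into a small function and checked before a crawl. This is a sketch that reuses the regular expressions and values from the configuration above; `pageRankFor` is a hypothetical name and the sample paths are made up.

```javascript
// Maps a URL pathname to the pageRank string used in the example config.
function pageRankFor(pathname) {
  const isDocPage = /\/[\w-]+\/docs\//.test(pathname);
  const isBlogPage = /\/[\w-]+\/blog\//.test(pathname);
  return isDocPage ? "-2000" : isBlogPage ? "-1000" : "0";
}

for (const path of ["/en/docs/intro/", "/en/blog/release/", "/api/search/"]) {
  console.log(path, pageRankFor(path));
}
```

Note that both patterns require a path segment before `docs/` or `blog/`, so a path such as `/docs/intro/` falls through to the default of "0". With these particular values, docs and blog pages receive a lower pageRank than all other pages.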

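Reduce the number of records

If you encounter the Extractors returned too many records error because a page outputs more than 750 records, the aggregateContent option helps you reduce the number of records at the content level of the extractor.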

{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
      },
      aggregateContent: true,
    });
  },
},
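
To give an intuition for what aggregation does to the record count, the sketch below merges content entries that share the same heading into one entry. It is only an illustration of the idea; `aggregateByHeading` and the `{ heading, content }` shape are hypothetical and do not mirror the helper's internal record format.

```javascript
// Merges content entries that share a heading into a single entry per heading.
function aggregateByHeading(entries) {
  const byHeading = new Map();
  for (const { heading, content } of entries) {
    const parts = byHeading.get(heading) ?? [];
    parts.push(content);
    byHeading.set(heading, parts);
  }
  return [...byHeading].map(([heading, parts]) => ({
    heading,
    content: parts.join(" "),
  }));
}

const entries = [
  { heading: "Install", content: "Run the installer." },
  { heading: "Install", content: "Restart your shell." },
  { heading: "Usage", content: "Call the CLI." },
];
const aggregated = aggregateByHeading(entries);
console.log(entries.length, "->", aggregated.length);
```

Pages with many short paragraphs or list items under each heading benefit the most from this kind of grouping.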

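Reduce the record size

If you encounter the Records extracted are too big error when crawling your website, it is usually because the records contain too much information or the page is too long. The recordVersion option removes fields specific to DocSearch v2, which reduces the record size.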

{
  indexName: "YOUR_INDEX_NAME",
  pathsToMatch: ["https://YOUR_WEBSITE_URL/api/**"],
  recordExtractor: ({ $, helpers }) => {
    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "header h1",
        },
        lvl1: "article h2",
        lvl2: "article h3",
        lvl3: "article h4",
        lvl4: "article h5",
        lvl5: "article h6",
        content: "article p, article li",
      },
      recordVersion: "v3",
    });
  },
},
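
When tracking down which records are too large, it can help to measure their serialized size. The sketch below uses the UTF-8 byte length of the JSON form as an approximation; `recordSizeInBytes` is a hypothetical helper, and the exact limit and measurement applied by the crawler are not covered here.

```javascript
// Approximates a record's size as the UTF-8 byte length of its JSON form.
function recordSizeInBytes(record) {
  return new TextEncoder().encode(JSON.stringify(record)).length;
}

// Made-up record, only to exercise the function.
const record = { url: "https://example.com/docs/", content: "héllo" };
console.log(recordSizeInBytes(record));
```

Counting bytes rather than characters matters for non-ASCII content, where a single character can take several bytes.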

`recordProps` API Reference

`lvl0`

type: Lvl0 | required

type Lvl0 = {
  selectors: string | string[];
  defaultValue?: string;
};

`lvl1`, `content`

type: string | string[] | required

`lvl2`, `lvl3`, `lvl4`, `lvl5`, `lvl6`

type: string | string[] | optional

`pageRank`

type: number | optional

See the pageRank example earlier on this page.

Custom variables

type: string | string[] | CustomVariable | optional

type CustomVariable =
  | {
      defaultValue: string | string[];
    }
  | {
      selectors: string | string[];
      defaultValue?: string | string[];
    };

Custom variables are used to filter your search. You can define them in recordProps.

`helpers.docsearch` API Reference

`aggregateContent`

type: boolean | default: true | optional

This option groups the content-level records under a heading into a single record, which reduces the number of records.

`recordVersion`

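type: 'v3' | 'v2' | default: v2 | optional

This option removes fields specific to DocSearch v2. If you use the latest version of DocSearch, you can set it to v3 to reduce the record size.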

`indexHeadings`

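type: boolean | { from: number, to: number } | default: true | optional

Controls whether headings (lvlX) are indexed.

When false, only records for the content level are created.
When a from, to range is provided, only records for the lvlX to lvlY levels are created.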
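
As a sketch of the range form, the extractor below passes indexHeadings with from and to. The values 1 and 3 are illustrative assumptions, read here as limiting heading records to lvl1 through lvl3; the `helpers` stub exists only so that the snippet can run outside the crawler.

```javascript
const recordExtractor = ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: { selectors: "header h1" },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      content: "article p, article li",
    },
    // Assumed reading: only create heading records for lvl1 to lvl3.
    indexHeadings: { from: 1, to: 3 },
  });
};

// Stub that echoes its options, standing in for the crawler's helpers.
const options = recordExtractor({ helpers: { docsearch: (opts) => opts } });
console.log(options.indexHeadings);
```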
