Elasticsearch Series: Exploring Data with the Term Vector Tool

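Overview

This article introduces the concept of term vectors and their basic usage.

What is a term vector?

Whenever a document is indexed, Elasticsearch stores the document data and updates the inverted index. If a field in the index mapping also sets the term_vector parameter, Elasticsearch additionally computes and records the analysis details of that field for each document: which terms the field value was split into, the df (document frequency) and ttf (total term frequency) of each term, and the position and character offsets at which each term occurs. Collectively, this information is called the term vector.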

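The term_vector parameter has five commonly used values

no: do not store term vector information. This is the default.

yes: store only the terms of the field, without position or offset information.

with_positions: store terms and positions.

with_offsets: store terms and character offsets.

with_positions_offsets: store the complete term vector, including terms, positions, and offsets.

Term vector information can be generated in two ways: at index time or at query time. Index-time generation computes and stores the term vectors while documents are indexed; query-time generation computes them on the fly when they are requested. The former trades space for time, the latter trades time for space.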

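What are term vectors used for?

A term vector is essentially a data-exploration tool (think of it as a debugger). It records how each field of a document was analyzed: how many terms the value was split into, the position of each term within the field, and each term's df and ttf values. It is typically used to investigate suspected data problems. For example, when sorting or search results differ from what you expect and you need the root cause, you can use this tool to inspect the data by hand and narrow down where the problem comes from.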

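Reading term vector information

Let's look at a complete term vector response and the information it contains. The sample below is for a field named text whose value analyzes into the three terms hello, java, and elasticsearch: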

{
  "_index": "music",
  "_type": "children",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 3
      },
      "terms": {
        "elasticsearch": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 24
            }
          ]
        },
        "hello": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5
            }
          ]
        },
        "java": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}
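
One way to read a response like this programmatically is to flatten the terms object into a list of tokens ordered by position. The sketch below is only an illustration: it embeds a trimmed copy of the sample response instead of calling Elasticsearch, and the helper name tokens_in_order is made up for this example.

```python
import json

# Trimmed copy of the sample response above (only the parts used here).
RESPONSE = json.loads("""
{
  "term_vectors": {
    "text": {
      "terms": {
        "elasticsearch": {"term_freq": 1, "tokens": [
          {"position": 2, "start_offset": 11, "end_offset": 24}]},
        "hello": {"term_freq": 1, "tokens": [
          {"position": 0, "start_offset": 0, "end_offset": 5}]},
        "java": {"term_freq": 1, "tokens": [
          {"position": 1, "start_offset": 6, "end_offset": 10}]}
      }
    }
  }
}
""")


def tokens_in_order(response, field):
    """Return (position, term, start_offset, end_offset) tuples sorted by position."""
    terms = response["term_vectors"][field]["terms"]
    rows = []
    for term, info in terms.items():
        # "tokens" may be missing if positions/offsets were not requested.
        for token in info.get("tokens", []):
            rows.append(
                (token["position"], term, token["start_offset"], token["end_offset"])
            )
    return sorted(rows)


for position, term, start, end in tokens_in_order(RESPONSE, "text"):
    print(position, term, start, end)
```

Sorting by position recovers the order in which the analyzer emitted the terms (hello, java, elasticsearch), which is often the first thing to check when a search does not match as expected.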

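A term vector is reported per field, and the information for each field has three main parts:

field statistics
term statistics
term information

field statistics

Statistics over all terms of this field across all documents in the index and type. Note the scope: not a single document, but every document under the given index/type. Elasticsearch gathers these numbers from the shard that holds the requested document, so on a multi-shard index they are approximate.

sum_doc_freq (sum of document frequencies): the sum of the df values of all terms in this field.

doc_count (document count): the number of documents that contain this field. Some documents may not have it.

sum_ttf (sum of total term frequencies): the sum of the ttf values of all terms in this field.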
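
To make the three field-level numbers concrete, here is a small sketch that computes them by hand for one field. The three tokenized documents are hypothetical, and the code only mirrors the definitions above; it is not how Elasticsearch computes them internally.

```python
from collections import Counter

# Hypothetical, already-tokenized values of one field in three documents.
# Document "3" has no value for the field at all.
DOCS = {
    "1": ["hello", "java", "elasticsearch"],
    "2": ["hello", "hello", "world"],
    "3": None,
}


def field_statistics(docs):
    """Compute doc_count, sum_doc_freq and sum_ttf for a single field."""
    doc_freq = Counter()  # term -> number of documents containing it
    ttf = Counter()       # term -> total occurrences across all documents
    doc_count = 0
    for tokens in docs.values():
        if not tokens:
            continue  # documents without the field are not counted
        doc_count += 1
        ttf.update(tokens)            # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per term
    return {
        "doc_count": doc_count,
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
    }


print(field_statistics(DOCS))
```

For this data the result is doc_count 2, sum_doc_freq 5, and sum_ttf 6: hello appears in two documents but three times in total, which is exactly the difference between the two sums.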

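term statistics

In the sample, hello is one of the terms produced by analyzing the text field of the current document. The doc_freq and ttf values are returned only when the request sets term_statistics=true.

doc_freq (document frequency): the number of documents that contain this term.

ttf (total term frequency): the total number of times this term occurs across all documents.

term_freq (term frequency in the field): the number of times this term occurs in this field of the current document.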
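
The difference between the three per-term counters is easiest to see on a term that repeats. The following sketch uses hypothetical tokenized documents and a made-up helper, term_statistics, that simply applies the definitions above.

```python
# Hypothetical, already-tokenized values of one field in three documents.
DOCS = {
    "1": ["hello", "java", "elasticsearch"],
    "2": ["hello", "hello", "world"],
    "3": ["brush", "your", "teeth"],
}


def term_statistics(docs, doc_id, term):
    """Return doc_freq, ttf and term_freq of `term` as seen from document `doc_id`."""
    return {
        # Number of documents that contain the term at least once.
        "doc_freq": sum(1 for tokens in docs.values() if term in tokens),
        # Total occurrences of the term over all documents.
        "ttf": sum(tokens.count(term) for tokens in docs.values()),
        # Occurrences of the term in the requested document only.
        "term_freq": docs[doc_id].count(term),
    }


print(term_statistics(DOCS, "2", "hello"))
```

Seen from document 2, hello has doc_freq 2, ttf 3, and term_freq 2. Seen from document 1, only term_freq changes (to 1), because doc_freq and ttf describe the whole collection.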

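term information

This is the content of tokens in the sample. tokens is an array with one entry per occurrence of the term, so a term that occurs several times in the field has several entries.

position: the ordinal position of this token in the field's token stream, starting at 0.

start_offset: the character offset in the field value at which the token starts.

end_offset: the character offset in the field value at which the token ends (exclusive).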
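
As a rough illustration of where position, start_offset, and end_offset come from, the sketch below tokenizes a string with a regular expression. This is an assumption-laden stand-in: splitting on \w+ and lowercasing only approximates the standard analyzer for simple ASCII text, and the function name simple_tokens is invented for this example.

```python
import re


def simple_tokens(text):
    """Split text into lowercase word tokens with position and character offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "term": match.group().lower(),
            "position": position,          # ordinal number of the token
            "start_offset": match.start(), # index of the first character
            "end_offset": match.end(),     # index just past the last character
        })
    return tokens


for token in simple_tokens("hello java elasticsearch"):
    print(token)

# A repeated term yields one entry per occurrence.
print([t["position"] for t in simple_tokens("wake me, shark me") if t["term"] == "me"])
```

For "hello java elasticsearch" this reproduces the positions and offsets of the sample response, and for "wake me, shark me" the term me shows up at positions 1 and 3.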

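Term vector example

Create the index music with a type named children. The content field stores term vectors at index time; the fullname field has no term_vector setting, so its term vectors are generated at query time. The requests below use mapping types (children), as in Elasticsearch 6.x and earlier.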

PUT /music
{
  "mappings": {
    "children": {
      "properties": {
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "store": true,
          "analyzer": "standard"
        },
        "fullname": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

Add three sample documents

PUT /music/children/1
{
  "fullname" : "Jean Ritchie",
  "content" : "Love Somebody"
}
PUT /music/children/2
{
  "fullname" : "John Smith",
  "content" : "wake me, shark me ..."
}
PUT /music/children/3
{
  "fullname" : "Peter Raffi",
  "content" : "brush your teeth"
}

Request the term vector of the document with id 1

GET /music/children/1/_termvectors
{
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

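The response has the same structure as the sample shown earlier, with content as the field and love and somebody as its terms.

It is also worth noting that the field_statistics section is the same whichever of the three document ids you query, as long as the documents reside on the same shard.

Common term vector usage

Besides the standard request in the previous section, several parameters extend what a term vector request can do.

doc parameter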

GET /music/children/_termvectors
{
  "doc" : {
    "fullname" : "Peter Raffi",
    "content" : "brush your teeth"
  },
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

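This request analyzes the document supplied in doc rather than one stored in the index. The content of doc can be anything you like, which makes this form especially handy.

per_field_analyzer parameter

Analyzes a field with the specified analyzer instead of the one in its mapping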

GET /music/children/_termvectors
{
  "doc" : {
    "fullname" : "Jimmie Davis",
    "content" : "you are my sunshine"
  },
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "content" : "standard"
  }
}

filter parameter

Filters the terms returned in the term vector

GET /music/children/_termvectors
{
  "doc" : {
    "fullname" : "Jimmie Davis",
    "content" : "you are my sunshine"
  },
  "fields" : ["content"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
  }
}

The filter selects which terms are returned based on their statistics. This is quite useful: for example, when exploring data you can filter out terms that occur too rarely.
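
The effect of the three filter settings can be approximated offline. The sketch below is a simplified simulation under stated assumptions: the term statistics and TOTAL_DOCS are invented, and the score is a textbook tf-idf used as a stand-in, not Elasticsearch's actual scoring.

```python
import math

# Hypothetical statistics for the terms of one field in one document.
TERMS = {
    "you":      {"term_freq": 1, "doc_freq": 40},
    "are":      {"term_freq": 1, "doc_freq": 50},
    "my":       {"term_freq": 1, "doc_freq": 30},
    "sunshine": {"term_freq": 2, "doc_freq": 2},
}
TOTAL_DOCS = 100  # hypothetical number of documents in the index


def filter_terms(terms, total_docs, max_num_terms, min_term_freq=1, min_doc_freq=1):
    """Drop terms below the thresholds, then keep the highest-scoring ones."""
    scored = []
    for term, stats in terms.items():
        if stats["term_freq"] < min_term_freq or stats["doc_freq"] < min_doc_freq:
            continue  # term is too rare in the document or in the index
        # Textbook tf-idf, used here only as a stand-in score.
        score = stats["term_freq"] * math.log(total_docs / stats["doc_freq"])
        scored.append((score, term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_num_terms]]


print(filter_terms(TERMS, TOTAL_DOCS, max_num_terms=3))
print(filter_terms(TERMS, TOTAL_DOCS, max_num_terms=3, min_doc_freq=3))
```

With the defaults, the rare but repeated term sunshine ranks first. Raising min_doc_freq to 3 removes it, because it occurs in only two of the hypothetical documents.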

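docs parameter

Lets you request term vectors for several documents at once through the _mtermvectors endpoint. How often you need it is a matter of personal habit.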

GET _mtermvectors
{
   "docs": [
      {
         "_index": "music",
         "_type": "children",
         "_id": "2",
         "term_statistics": true
      },
      {
         "_index": "music",
         "_type": "children",
         "_id": "1",
         "fields": [
            "content"
         ]
      }
   ]
}
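
When the list of ids comes from a script, the docs array can be generated instead of written by hand. This is a minimal sketch that only builds the request body shown above as a Python dict; the helper name build_mtermvectors_body is hypothetical, and sending the body to Elasticsearch is left out on purpose.

```python
import json


def build_mtermvectors_body(index, doc_type, ids, fields=None, term_statistics=False):
    """Build the JSON body of a _mtermvectors request for the given document ids."""
    docs = []
    for doc_id in ids:
        entry = {"_index": index, "_type": doc_type, "_id": str(doc_id)}
        if fields:
            entry["fields"] = list(fields)  # restrict the response to these fields
        if term_statistics:
            entry["term_statistics"] = True
        docs.append(entry)
    return {"docs": docs}


body = build_mtermvectors_body(
    "music", "children", [1, 2, 3], fields=["content"], term_statistics=True
)
print(json.dumps(body, indent=2))
```

Each entry carries its own options, so different documents in one request can ask for different fields or statistics, as in the hand-written example above.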

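Recommendations for using term vectors

There are two ways to obtain term vector information: configure it at index time, as in the example above, or have it generated when you query.

index-time: configured in the mapping, so the term information is generated and stored when documents are indexed. With term_vector set to with_positions_offsets, the field's index takes roughly twice the space it would without term vectors.

query-time: no term vector information has been stored in advance. When you request a term vector, the statistics are computed on the fly and returned to you.

Which one to use depends on what you expect from term vectors. Query-time generation is the more common choice: the tool is mainly used to help locate problems, so computing on demand is enough.

Summary

Term vectors are a practical tool, particularly for analyzing production data and helping to pin down problems.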


Source: https://www.cnblogs.com/huangying2124/p/12854592.html
