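Elasticsearch Series: Structured Search

Overview

Structured search targets structured data such as dates, times, and numbers. These values have a defined format, so we can apply logical operations to them such as range checks and comparisons. The result of such an operation is black and white: a document either satisfies the condition and is in the result set, or it does not and is left out. There is no notion of similarity.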

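Preface

Structured search comes with a large number of search examples. We use a "music app" as the running scenario and build searches and data analyses around it, which keeps a practical goal in view and reinforces what we have learned along the way.

For now, a song has the following fields: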

| name | code | type | remark |
| :---- | :--: | :--: | -----: |
| ID | id | keyword | Document ID |

The index mapping, defined by hand, is as follows:

PUT /music
{
  "mappings": {
      "children": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "author_first_name": {
            "type": "text",
            "analyzer": "english"
          },
          "author_last_name": {
            "type": "text",
            "analyzer": "english"
          },
          "author": {
            "type": "text",
            "analyzer": "english",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "language": {
            "type": "text",
            "analyzer": "english",
            "fielddata": true
          },
          "tags": {
            "type": "text",
            "analyzer": "english"
          },
          "length": {
            "type": "long"
          },
          "likes": {
            "type": "long"
          },
          "isRelease": {
            "type": "boolean"
          },
          "releaseDate": {
            "type": "date"
          }
        }
      }
  }
}

We import a batch of data in advance:

POST /music/children/_bulk
{"index": {"_id": 1}}
{"id" : "34116101-7fa2-5630-a1a4-1735e19d2834", "author_first_name":"Peter", "author_last_name":"Gymbo", "author" : "Peter Gymbo", "name": "gymbo", "content":"I hava a friend who loves smile, gymbo is his name", "language":"english", "tags":["enlighten","gymbo","friend"], "length":53, "likes": 5, "isRelease":true, "releaseDate": "2019-12-20"}
{"index": {"_id": 2}}
{"id" : "34117101-54cb-59a1-9b7a-82adb46fa58d", "author_first_name":"John", "author_last_name":"Smith", "author" : "John Smith", "name": "wake me, shark me", "content":"don't let me sleep too late, gonna get up brightly early in the morning","language":"english","tags":["wake","early","morning"],"length":55,"likes": 8,"isRelease":true,"releaseDate":"2019-12-21"}
{"index": {"_id": 3}}
{"id" : "34117201-8d01-49d4-a495-69634ae67017", "author_first_name":"Jimmie", "author_last_name":"Davis", "author" : "Jimmie Davis", "name": "you are my sunshine", "content":"you are my sunshine, my only sunshine, you make me happy, when skies are gray", "language":"english", "tags":["sunshine","happy"], "length":65,"likes": 12, "isRelease":true, "releaseDate": "2019-12-22"}
{"index": {"_id": 4}}
{"id" : "55fa74f7-35f3-4313-a678-18c19c918a78", "author_first_name":"Peter", "author_last_name":"Raffi", "author" : "Peter Raffi", "name": "brush your teeth", "content":"When you wake up in the morning it's a quarter to one, and you want to have a little fun You brush your teeth","language":"english","tags":"teeth","length":45,"likes": 17,"isRelease":true,"releaseDate":"2019-12-22"}
{"index": {"_id": 5}}
{"id" : "1740e61c-63da-474f-9058-c2ab3c4f0b0a", "author_first_name":"Jean", "author_last_name":"Ritchie", "author" : "Jean Ritchie", "name": "love somebody", "content":"love somebody, yes I do", "language":"english", "tags":"love", "length":38, "likes": 3,"isRelease":true, "releaseDate": "2019-12-22"}
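
The bulk endpoint reads its body as newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. As a minimal sketch of how such a body could be assembled from application code, the following uses only the Python standard library. The helper name `build_bulk_body` and the trimmed-down song documents are illustrative assumptions, not part of the original data set.

```python
import json

def build_bulk_body(docs):
    """Return an NDJSON bulk body: an action line, then a source line, per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    # The body has to end with a newline character.
    return "\n".join(lines) + "\n"

# Shortened versions of two sample songs (illustrative only).
songs = [
    (1, {"name": "gymbo", "length": 53, "likes": 5}),
    (2, {"name": "wake me, shark me", "length": 55, "likes": 8}),
]

body = build_bulk_body(songs)
print(body)
```

Each call to `json.dumps` produces a single line, which is what keeps the action and source lines from spilling across several lines.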

Exact-value lookup

Based on the mapping, we can look up documents by exact values such as an ID or a date.

Search for a song by ID

GET /music/children/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "id" : "34116101-7fa2-5630-a1a4-1735e19d2834"
                }
            }
        }
    }
}

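Note that the id field is mapped as keyword, so the ID is not tokenized at index time. If the type were text, the UUID value would be split into tokens at index time, and the term query would then fail to find it.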
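
To illustrate why the field type matters, here is a rough sketch of the two indexing behaviours. The function `toy_analyze` is a deliberately simplified stand-in (lowercase, then split on non-alphanumeric characters) and is an assumption made for illustration; it is not the actual Elasticsearch analyzer.

```python
import re
from collections import defaultdict

def toy_analyze(text):
    """Toy stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_index(values, analyzed):
    """Map each indexed term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_no, value in enumerate(values):
        terms = toy_analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_no)
    return index

ids = ["34116101-7fa2-5630-a1a4-1735e19d2834"]
keyword_index = build_index(ids, analyzed=False)  # keyword-like: whole value is one term
text_index = build_index(ids, analyzed=True)      # text-like: value is split into tokens

target = "34116101-7fa2-5630-a1a4-1735e19d2834"
print(keyword_index.get(target))  # the full UUID is found as a single term
print(text_index.get(target))     # None: only the fragments were indexed
print(sorted(text_index))         # the fragments that were indexed instead
```

A term lookup compares the search value against indexed terms as they are, so it only succeeds when the whole UUID was stored as one term.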

Search for songs by date

GET /music/children/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "releaseDate" : "2019-12-21"
                }
            }
        }
    }
}

Search by song length

GET /music/children/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "length" : 53
                }
            }
        }
    }
}

Search for released songs

GET /music/children/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "isRelease" : true
                }
            }
        }
    }
}

The 4 small examples above show that exact-value search is natively supported for keyword, date, numeric, and boolean values.
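
Since the four requests differ only in field name and value, the request body could be produced by one small helper. The following is a sketch under that assumption; `term_filter` is a hypothetical name, and the code only builds the JSON body without sending it anywhere.

```python
import json

def term_filter(field, value):
    """Wrap one term clause in constant_score so that no relevance score is computed."""
    return {"query": {"constant_score": {"filter": {"term": {field: value}}}}}

examples = [
    ("id", "34116101-7fa2-5630-a1a4-1735e19d2834"),  # keyword
    ("releaseDate", "2019-12-21"),                   # date
    ("length", 53),                                  # long
    ("isRelease", True),                             # boolean
]

for field, value in examples:
    print(json.dumps(term_filter(field, value)))
```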

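Combining filters

The 4 examples above each filter on a single condition. Real requirements usually involve several conditions, but the principle stays the same: however complex a search requirement is, it is composed of basic conditions one by one. Let's look at a few simple examples of combined filters.

A quick review of the logic covered earlier:

bool: combines multiple conditions and can be nested

must: must match

should: may match (similar to OR, with the alternative conditions listed inside should)

must_not: must not match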

Search for songs whose release date is 2019-12-20 or whose ID is 2a8f4288-c0a9-5c9b-8f99-67339b66f4c0, but whose release date is not 2019-12-21

GET /music/children/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {"term":{
              "releaseDate":"2019-12-20"
            }},
            {"term":{
              "id":"2a8f4288-c0a9-5c9b-8f99-67339b66f4c0"
            }}
          ],
          "must_not": {
            "term": {
              "releaseDate":"2019-12-21"
            }
          }
        }
      }
    }
  }
}
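
The semantics of this bool filter can be checked against in-memory copies of the sample songs. The evaluator below is a sketch that covers only term and bool clauses. It assumes that when a bool has no must clause, at least one should clause has to match. The names `matches` and `_as_list` are illustrative.

```python
def _as_list(value):
    """Accept a single clause or a list of clauses, as the query DSL does."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def matches(doc, clause):
    """Evaluate a tiny subset of the query DSL (term and bool) against one dict."""
    if "term" in clause:
        (field, value), = clause["term"].items()
        return doc.get(field) == value
    if "bool" in clause:
        spec = clause["bool"]
        must = _as_list(spec.get("must"))
        should = _as_list(spec.get("should"))
        must_not = _as_list(spec.get("must_not"))
        if not all(matches(doc, c) for c in must):
            return False
        if any(matches(doc, c) for c in must_not):
            return False
        # Assumption: without a must clause, one of the should clauses has to match.
        if should and not must:
            return any(matches(doc, c) for c in should)
        return True
    raise ValueError("unsupported clause")

songs = [
    {"name": "gymbo", "id": "34116101-7fa2-5630-a1a4-1735e19d2834", "releaseDate": "2019-12-20"},
    {"name": "wake me, shark me", "id": "34117101-54cb-59a1-9b7a-82adb46fa58d", "releaseDate": "2019-12-21"},
    {"name": "you are my sunshine", "id": "34117201-8d01-49d4-a495-69634ae67017", "releaseDate": "2019-12-22"},
]

query = {"bool": {
    "should": [
        {"term": {"releaseDate": "2019-12-20"}},
        {"term": {"id": "2a8f4288-c0a9-5c9b-8f99-67339b66f4c0"}},
    ],
    "must_not": {"term": {"releaseDate": "2019-12-21"}},
}}

hits = [song["name"] for song in songs if matches(song, query)]
print(hits)  # ['gymbo']
```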

Search for songs whose ID is 2a8f4288-c0a9-5c9b-8f99-67339b66f4c0, or whose ID is 34116101-7fa2-5630-a1a4-1735e19d2834 and whose release date is 2019-12-20

GET /music/children/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {"term":{
              "id":"2a8f4288-c0a9-5c9b-8f99-67339b66f4c0"
            }},
            {
              "bool": {
                "must": [
                  {"term":{
                    "id":"34116101-7fa2-5630-a1a4-1735e19d2834"
                  }},
                  {"term":{
                    "releaseDate":"2019-12-20"
                  }}
                ]
              }
            }
          ]
        }
      }
    }
  }
}

Multi-value search

The terms query searches for several values at once, similar to the IN clause in MySQL.

GET /music/children/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "id": [
            "34116101-7fa2-5630-a1a4-1735e19d2834",
            "99268c7e-8308-569a-a975-bbce7d3f9a8e"
          ]
        }
      }
    }
  }
}
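
As a rough illustration of the IN-like behaviour, the sketch below builds the same request body from a list of IDs and shows an in-memory equivalent of the check. The helper names `terms_filter` and `terms_match` are hypothetical.

```python
def terms_filter(field, values):
    """Build a constant_score request body with a terms clause."""
    return {"query": {"constant_score": {"filter": {"terms": {field: list(values)}}}}}

def terms_match(doc, field, values):
    """In-memory equivalent: the field value must be one of the listed values."""
    return doc.get(field) in set(values)

wanted = [
    "34116101-7fa2-5630-a1a4-1735e19d2834",
    "99268c7e-8308-569a-a975-bbce7d3f9a8e",
]
songs = [
    {"name": "gymbo", "id": "34116101-7fa2-5630-a1a4-1735e19d2834"},
    {"name": "love somebody", "id": "1740e61c-63da-474f-9058-c2ab3c4f0b0a"},
]

print(terms_filter("id", wanted))
print([s["name"] for s in songs if terms_match(s, "id", wanted)])  # ['gymbo']
```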

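Range queries

Fields of type long and date support range queries, using gt, lt, gte, and lte to express the bounds. They work much like >, <, >=, <=, and BETWEEN ... AND in MySQL.

Search for songs between 45 and 60 seconds long

For a range query on a long field, use the range expression directly: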

GET /music/children/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "length": {
            "gte": 45,
            "lte": 60
          }
        }
      }
    }
  }
}
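
The four bound keywords map directly onto comparison operators. A minimal in-memory sketch of the same range check, run against the lengths of the five sample songs, might look like this. `RANGE_OPS` and `in_range` are illustrative names.

```python
import operator

# Each range keyword corresponds to one comparison operator.
RANGE_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """Return True when value satisfies every bound, e.g. {"gte": 45, "lte": 60}."""
    return all(RANGE_OPS[name](value, limit) for name, limit in bounds.items())

lengths = {
    "gymbo": 53,
    "wake me, shark me": 55,
    "you are my sunshine": 65,
    "brush your teeth": 45,
    "love somebody": 38,
}

hits = [name for name, length in lengths.items() if in_range(length, {"gte": 45, "lte": 60})]
print(hits)  # ['gymbo', 'wake me, shark me', 'brush your teeth']
```

Note that gte and lte include the bounds themselves, which is why the 45-second song is part of the result.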

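Date range search

For range searches on dates, besides writing the date directly with the usual range expressions, you can use +1d and -1d to add to or subtract from a given date. For example, "2019-12-21||-1d" means "2019-12-20". You can also write now-1d for one day before the current time, that is, yesterday, which is quite handy.

An example: search for songs released after the day before 2019-12-21, that is, after 2019-12-20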

GET /music/children/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "releaseDate": {
            "gt": "2019-12-21||-1d"
          }
        }
      }
    }
  }
}
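
To make the date arithmetic concrete, here is a small sketch that resolves expressions of the form date||+Nd or date||-Nd with the Python standard library. It handles only day units and only this one format, which is a simplification made for illustration. `resolve_day_math` is a hypothetical helper, not an Elasticsearch API.

```python
import re
from datetime import date, timedelta

# Only the 'YYYY-MM-DD||+Nd' and 'YYYY-MM-DD||-Nd' forms are handled here.
DATE_MATH = re.compile(r"^(\d{4}-\d{2}-\d{2})\|\|([+-])(\d+)d$")

def resolve_day_math(expression):
    """Resolve a day-based date math expression to a date object."""
    match = DATE_MATH.match(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    anchor = date.fromisoformat(match.group(1))
    days = int(match.group(3))
    if match.group(2) == "-":
        days = -days
    return anchor + timedelta(days=days)

lower_bound = resolve_day_math("2019-12-21||-1d")
print(lower_bound)  # 2019-12-20

releases = {
    "gymbo": date(2019, 12, 20),
    "wake me, shark me": date(2019, 12, 21),
    "you are my sunshine": date(2019, 12, 22),
}
# gt is a strict comparison, so a song released on the bound itself is excluded.
print([name for name, day in releases.items() if day > lower_bound])
```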

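Handling null values

An inverted index does not accept empty values when it is built. This means that null values in any form, such as null, [], or [null], cannot be stored in the inverted index. So how do we query for them?

Elasticsearch offers two queries for this, similar to IS NOT NULL and IS NULL in MySQL.

Exists query

The exists query returns documents in which the specified field has a value, similar to IS NOT NULL in MySQL.

In our example the tags field is optional, so some records may hold a null value. To find all records that have a tags value, the request is: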

GET /music/children/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}

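Missing query

A missing query used to exist under the keyword missing. Its effect is the opposite of exists, similar to IS NULL in MySQL, but it is no longer available in 6.x. The same effect can be achieved by combining must_not with exists.

Using the tags field again, find documents whose tags are empty: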

GET /music/children/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}
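
The null forms mentioned earlier (null, [], [null], and a field that is absent altogether) can be illustrated with a small in-memory check. This is a sketch of the matching rule only. `has_value` is an illustrative name, and the sample documents are made up for the demonstration.

```python
def has_value(doc, field):
    """Mimic exists: the field counts as present only if it holds a non-null value."""
    value = doc.get(field)
    if value is None:
        return False
    if isinstance(value, list):
        # An empty list, or a list holding only nulls, has nothing to index.
        return any(item is not None for item in value)
    return True

# Made-up documents covering each null form.
docs = [
    {"name": "a", "tags": ["sunshine", "happy"]},
    {"name": "b", "tags": None},
    {"name": "c", "tags": []},
    {"name": "d", "tags": [None]},
    {"name": "e"},
]

exists_hits = [d["name"] for d in docs if has_value(d, "tags")]
missing_hits = [d["name"] for d in docs if not has_value(d, "tags")]
print(exists_hits)   # ['a']
print(missing_hits)  # ['b', 'c', 'd', 'e']
```

The second list corresponds to the must_not plus exists combination: it is simply the negation of the first check.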

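Filter caching

Why are filters so efficient? Apart from a design built around set operations for fast filtering, Elasticsearch also caches filter results where appropriate.

How a filter executes

Here is a simplified view of how Elasticsearch processes filters: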

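Find the documents that match the filter condition and obtain a document list. If there are several filter conditions on several fields, there are several document lists. The document lists come from the inverted index.

Build a bitset (an array of 0s and 1s) from each document list: 1 for a match, 0 for no match, for example [1,0,0,0].

Iterate over all the bitsets, starting with the sparsest one (which rules out the most documents), and keep the records whose value is 1 at the same position in every bitset.

Cache the bitsets in memory to improve performance.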
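
The steps above can be sketched in a few lines of Python. Plain lists of 0s and 1s stand in for the real bitset structure, which is a simplification, and the document lists correspond to the five sample songs numbered 0 to 4. The function names are illustrative.

```python
def to_bitset(document_list, doc_count):
    """Turn a list of matching document numbers into an array of 0s and 1s."""
    bits = [0] * doc_count
    for doc_no in document_list:
        bits[doc_no] = 1
    return bits

def intersect(bitsets):
    """AND the bitsets together, starting with the sparsest one."""
    ordered = sorted(bitsets, key=sum)  # fewest 1s first
    result = list(ordered[0])
    for bits in ordered[1:]:
        result = [a & b for a, b in zip(result, bits)]
    return result

# Document lists for two filter conditions over the five sample songs.
released = [0, 1, 2, 3, 4]   # isRelease: true
on_dec_22 = [2, 3, 4]        # releaseDate: 2019-12-22

bitsets = [to_bitset(docs, 5) for docs in (released, on_dec_22)]
print(bitsets)             # [[1, 1, 1, 1, 1], [0, 0, 1, 1, 1]]
print(intersect(bitsets))  # [0, 0, 1, 1, 1]
```

Starting from the sparsest bitset means most positions are already 0 after the first step, which is the intuition behind the ordering described in the text.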

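The advantage of a filter over a query is that it is cached, so the inverted index does not have to be consulted again next time. In most cases filters run before queries. A query computes each document's relevance score against the search condition and sorts the results by that score.

A filter simply selects the wanted data; it computes no relevance score and does no sorting.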

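The filter cache

Caching conditions

Among the most recent 256 filters, a filter that has been used more than a certain number of times (the threshold is not fixed) has its bitset cached automatically.

Results that a filter obtains from a small segment may go uncached: segments with fewer than 1,000 documents, or segments smaller than 3% of the total index size. Such segments hold little data and are quick to rescan, and segments that are too small are merged into larger ones in the background, so caching them brings little benefit.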
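
The two conditions can be expressed as a toy policy object. This is only a sketch of the rules as described here, not the engine's actual implementation. The usage threshold `min_uses=5` is an assumed value (the text says the number is not fixed), segment size is approximated by document count, and the class and method names are hypothetical.

```python
from collections import Counter, deque

class FilterCachePolicy:
    """Toy caching policy following the two conditions described in the text."""

    def __init__(self, history_size=256, min_uses=5,
                 min_segment_docs=1000, min_segment_ratio=0.03):
        self.history = deque(maxlen=history_size)  # keeps only the most recent filters
        self.min_uses = min_uses                    # assumed threshold, not a real default
        self.min_segment_docs = min_segment_docs
        self.min_segment_ratio = min_segment_ratio

    def record(self, filter_key):
        """Remember that a filter was used."""
        self.history.append(filter_key)

    def should_cache(self, filter_key, segment_docs, index_docs):
        """Cache only frequently used filters, and only on large enough segments."""
        too_small = (segment_docs < self.min_segment_docs
                     or segment_docs < self.min_segment_ratio * index_docs)
        if too_small:
            return False
        return Counter(self.history)[filter_key] >= self.min_uses

policy = FilterCachePolicy()
for _ in range(5):
    policy.record("isRelease:true")
policy.record("length:53")

print(policy.should_cache("isRelease:true", 50_000, 100_000))  # True
print(policy.should_cache("isRelease:true", 500, 100_000))     # False: segment too small
print(policy.should_cache("length:53", 50_000, 100_000))       # False: used only once
```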

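Cache updates

Cache updates are smart and incremental: when a document is added or modified, the new document is added to the bitset, rather than the cache being dropped or recomputed from scratch.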

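Summary

The first half of this article consists mostly of examples and can be skimmed. The second half introduces how filters execute and how their results are cached, which is worth a look. Thanks for reading.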

Source: https://www.cnblogs.com/huangying2124/p/12230175.html