当前位置：

首页
/
IT
/
数据库
/
用 PostgreSQL 做实时高效搜索引擎 - 全文检索, 模糊查询, 正则查询, 相似查询, ADHOC 查询

用 PostgreSQL 做实时高效搜索引擎 - 全文检索, 模糊查询, 正则查询, 相似查询, ADHOC 查询

作者

digoal

日期

2017-12-05

标签

PostgreSQL , 搜索引擎 , GIN , ranking , high light , 全文检索 , 模糊查询 , 正则查询 , 相似查询 , ADHOC 查询

背景

字符串搜索是非常常见的业务需求, 它包括:

1, 前缀 + 模糊查询.(可以使用 b-tree 索引)

select * from tbl where col like 'ab%';

或

select * from tbl where col ~ '^ab';

2, 后缀 + 模糊查询.(可以使用 reverse(col) 表达式 b-tree 索引)

select * from tbl where col like '%ab';

或

select * from tbl where col ~ 'ab$';

写法

select * from tbl where reverse(col) like 'ba%';

或

select * from tbl where reverse(col) ~ '^ba';

3, 前后模糊查询.(可以使用 pg_trgm 和 gin 索引)

select * from tbl where col like '%ab%';

或

select * from tbl where col ~ 'ab';

4, 全文检索.(可以使用全文检索类型以及 gin 或 rum 索引)

select * from tbl where tsvector_col @@ 'postgres & china | digoal:A' order by ts_rank(tsvector_col, 'postgres & china | digoal:A') limit xx;

详细语法后面介绍

5, 正则查询.(可以使用 pg_trgm 和 gin 索引)

select * from tbl where col ~ '^a[0-9]{1,5}\ +digoal$';

6, 相似查询.(可以使用 pg_trgm 和 gin 索引)

select * from tbl order by similarity(col, 'postgre') desc limit 10;

7,ADHOC 查询, 任意字段组合查询.(通过 bloom index, multi-index bitmap scan, gin-index bitmap scan 等索引都可以实现)

select * from tbl where a=? and b=? or c=? and d=? or e between ? and ? and f in (?);

通常来说, 数据库并不具备 3 以后的加速能力, 但是 PostgreSQL 的功能非常强大, 它可以非常完美的支持这类查询的加速.(是指查询和写入不冲突的, 并且索引 BUILD 是实时的.)

用户完全不需要将数据同步到搜索引擎, 再来查询, 而且搜索引擎也只能做到全文检索, 并不你做到正则, 相似, 前后模糊这几个需求.

使用 PostgreSQL 可以大幅度的简化用户的架构, 开发成本, 同时保证数据查询的绝对实时性.

一, 全文检索

全文检索中几个核心的功能:

词典, 分词语法, 搜索语法, 排序算法, 效率, 命中词高亮等.

PostgreSQL 都已经实现, 并支持扩展. 例如扩展词典, 扩展排序算法等.

支持 4 种文档结构 (标题, 作者, 摘要, 内容), 可以在生成 tsvector 时指定. 在一个 tsvector 中允许多个文档结构.

文档结构在 ranking 算法中, 被用于计算权值, 例如在标题中命中的词权值可以设更大一些.

支持掩码, 主要用于调和很长的文本, 调和 ranking 的输出.

通过设置不同文档结构权值, 调和 ranking 的输出.

词典

默认 PG 没有中文分词, 但是好在我们可以基于 text search 框架扩展, 例如开源的 zhparser, jieba 等中文分词插件.

https://github.com/jaiminpan/pg_jieba
https://github.com/jaiminpan/pg_scws

甚至可以通过 pljava, plpython 等来实现对中文的分词, 这个实际上是对应编程体系内的分词能力, 通过 PostgreSQL 的过程语言引入, 是不是很炫酷.

《使用阿里云 PostgreSQL zhparser 中文分词时不可不知的几个参数》

《如何加快 PostgreSQL 结巴分词加载速度》

《PostgreSQL Greenplum 结巴分词 (by plpython)》

分词介绍

1,parser, 功能是将字符串转换为 token(可以自定义 parser).

default parser 的 token 类别如下:

例子

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.HTML
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.HTML

创建 text parser 的语法

2, 配合 text search configuration 和 dictionary, 将 token 转换为 lexemes

例如创建了一个同义词字典

postgres        pgsql
PostgreSQL      pgsql
postgre pgsql
gogle   googl
indices index*

然后用这个字典来将 token 转换为 lexemes, 转换后得到的是 lexeme. (tsvector 中存储的也是 lexeme, 并不是原始 token)

mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn','indices');
 ts_lexize
-----------
 {index}
(1 row)
mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst','indices');
 to_tsvector
-------------
 'index':1
(1 row)
mydb=# SELECT to_tsquery('tst','indices');
 to_tsquery
------------
 'index':*
(1 row)
mydb=# SELECT 'indexes are very useful'::tsvector;
            tsvector
---------------------------------
 'are' 'indexes' 'useful' 'very'
(1 row)
mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
 ?column?
----------
 t
(1 row)

创建 text dictionary 的语法

3, 将 lexemes 存储为 tsvector

text search configuration 决定了要存哪些东西.

convert 过程中, parser 得到的 token 依次与 configuration 配置的 dictionary 匹配, 并存储从 dictionary 中对应的 lexeme.

ALTER TEXT SEARCH CONFIGURATION tsconfig 名

ADD MAPPING FOR token 类型 1 WITH 字典 1, 字典 2, 字典 3;

如果使用这个 tsconfig 来转换文本为 tsvector, 那么对于 token 类型 1, 首先与字典 1 匹配, 如果匹配上了, 会存储字典 1 中对应的 lexeme, 如果没有对应上, 则继续搜索字典 2......

创建 text search configuration 的语法

创建 text search template 的语法

4, 控制参数

通常 parser 有一些控制参数, 例如是否输出单字, 双字等. 例如 zhparser 这个 parser 的参数如下:

《使用阿里云 PostgreSQL zhparser 中文分词时不可不知的几个参数》

5, 文档结构

标题, 作者, 摘要, 内容

使用 ABCD 来表示.

搜索语法

1,tsquery 为搜索输入, 支持与, 或, 反, 距离语法, 如下

to_tsquery creates a tsquery value from querytext, which must consist of
single tokens separated by the tsquery operators
& (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY) and <?> (距离多少?),

possibly grouped using parentheses.

In other words, the input to to_tsquery must already follow the general

rules for tsquery input, as described in Section 8.11.2.

The difference is that while basic tsquery input takes the tokens at face value,
to_tsquery normalizes each token into a lexeme using the specified or default configuration,

and discards any tokens that are stop words according to the configuration.

For example:

c 有两个位置, 在匹配距离时, 两个都可以.

postgres=# select to_tsvector('a b c c');
     to_tsvector
---------------------
 'a':1 'b':2 'c':3,4
(1 row)

相邻

postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <-> b');
 ?column?
----------
 t
(1 row)

相邻, 实际上就是 position 相减 = 1

postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <1> b');
 ?column?
----------
 t
(1 row)

距离为 2, 实际上就是 position 相减 = 2

postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> c');
 ?column?
----------
 t
(1 row)

距离为 3, 实际上就是 position 相减 = 3

postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <3> c');
 ?column?
----------
 t
(1 row)

距离为 2, 实际上就是 position 相减 = 2

postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> b');
 ?column?
----------
 f
(1 row)

2, 支持文档结构, 语法如下

As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:
SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB

3, 支持前缀匹配, 语法如下

Also, * can be attached to a lexeme to specify prefix matching:
SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery
--------------------------
 'supern':*A & 'star':*AB

4, 支持 thesaurus 字典嵌套, 自动翻译. 例如

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'

5, 搜索操作符为 @@

select * from tbl where $tsvector_col @@ $tsquery;

排序算法

通常全文检索分为两个层面, 一个层面是匹配, 另一个层面是亲和 (rank).

1, 内置了几个 RANKING 算法, 代码如下

src/backend/utils/adt/tsrank.c

2,ts_rank 和 ts_rank_cd 介绍

PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.

You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

ts_rank_cd 为 cover density ranking

described in Clarke, Cormack, and Tudhope's"Relevance Ranking for One to Three Term Queries"in the journal"Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

2.1, 以上两个 ranking 计算函数都支持文档结构权重. 支持权重微调, 作为数组参数输入. 不输入则使用默认值.

For both these functions, the optional weights argument offers the ability to weigh Word instances more or Less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of Word, in the order:
{
	D-weight, C-weight, B-weight, A-weight	
}
If no weights are provided, then these defaults are used:
{
	0.1, 0.2, 0.4, 1.0	
}

2.2, 支持掩码参数, 对付长文本

Since a longer document has a greater chance of containing a query term it is reasonable to take into account document size,
e.g.,
a hundred-Word document with five instances of a search Word is probably more relevant than a thousand-Word document with five instances.

Both ranking functions take an integer normalization option that specifies whether and how a document's length should impact its rank.

The integer option controls several behaviors, so it is a bit mask:
you can specify one or more behaviors using | (for example, 2|4).

掩码如下

0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1
If more than one flag bit is specified, the transformations are applied in the order listed.
Here is an example that selects only the ten highest-ranked matches:

例子

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
This is the same example using normalized ranking:
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.

效率

1, 写入效率

500 万个词的词库, 随机提取 64 个, 组成一个含 64 个词的分词字符串. 包含全文检索 GIN 索引.

56 个并发写入, 每秒写入 93955 行, 平均响应时间 0.6 毫秒.

《HTAP 数据库 PostgreSQL 场景与性能测试之 7 - (OLTP) 全文检索 - 含索引实时写入》

2, 查询效率

1 亿条文本记录, 并发全文检索查询.

56 个并发查询, 每秒查询 51369 次, 平均响应时间 1.1 毫秒.

《HTAP 数据库 PostgreSQL 场景与性能测试之 14 - (OLTP) 字符串搜索 - 全文检索》

特殊功能 - HTML 高亮

对于匹配上的文本, 将其高亮显示.

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

例子

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'));
                        ts_headline
------------------------------------------------------------
 containing given <b>query</b> terms
 and return them in order of their <b>similarity</b> to the
 <b>query</b>.
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('query & similarity'),
  'StartSel = <, StopSel =>');
                      ts_headline
-------------------------------------------------------
 containing given <query> terms
 and return them in order of their <similarity> to the
 <query>.

特殊功能 - 生成文档统计信息

sqlquery 返回 tsvector 一列, 统计这一列中, 有哪些 lexeme, 每个 lexeme 出现在多少文本中, 每个 lexeme 总共出现了多少次.

ts_stat(sqlquery text, [ weights text, ]
        OUT Word text, OUT ndoc integer,
        OUT nentry integer) returns setof record

返回值

Word text - the value of a lexeme
ndoc integer - number of documents (tsvectors) the Word occurred in
nentry integer - total number of occurrences of the Word

例子

For example, to find the ten most frequent words in a document collection:
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, Word
LIMIT 10;
The same, but counting only Word occurrences with weight A or B:
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, Word
LIMIT 10;

特殊功能 - 设置文档结构

设置 tsvector 属于哪个文档结构 (标题, 作者, 摘要, 内容).

1,
setweight(vector tsvector, weight "char")
assign weight to each element of vector
setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A')
'cat':3A 'fat':2A,4A 'rat':5A
2,
setweight(vector tsvector, weight "char", lexemes text[])
assign weight to elements of vector that are listed in lexemes
setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}')
'cat':3A 'fat':2,4 'rat':5A

特殊功能 - 调试文本

1, 调试 token

例子

ts_debug([ config regconfig, ] document text,
         OUT alias text,
         OUT description text,
         OUT token text,
         OUT dictionaries regdictionary[],
         OUT dictionary regdictionary,
         OUT lexemes text[])
         returns setof record
alias text - short name of the token type
description text - description of the token type
token text - text of the token
dictionaries regdictionary[] - the dictionaries selected by the configuration for this token type
dictionary regdictionary - the dictionary that recognized the token, or NULL if none did
lexemes text[] - the lexeme(s) produced by the dictionary that recognized the token,
or NULL if none did; an empty array ({}) means it was recognized as a stop Word

2, 调试 lexeme, 可以使用 ts_lexize 判断某个 token 是否在某个字典里面有与之匹配的条目 (lexeme).

ts_lexize(dict regdictionary, token text) returns text[]

例子

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}
SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}

限制

The length of each lexeme must be Less than 2K bytes, 单个 lexeme 不能大于 2K 字节.

The length of a tsvector (lexemes + positions) must be Less than 1 megabyte, 单个 tsvector 不能大于 1MB.

postgres=# select length(to_tsvector(string_agg(md5(random()::text), ' '))) from generate_series(1,100000);
ERROR:  54000: string is too long for tsvector (3624424 bytes, max 1048575 bytes)
LOCATION:  make_tsvector, to_tsany.c:185

The number of lexemes must be Less than 2^64, 单个 tsvector 中不能存储超过 2 的 64 次方个 lexeme.

Position values in tsvector must be greater than 0 and no more than 16,383, 单个 tsvector 中, lexeme 的位置值不能大于 16,383

The match distance in a (FOLLOWED BY) tsquery operator cannot be more than 16,384, 单个 tsquery 中, lexeme 的距离值不能大于 16,383

No more than 256 positions per lexeme, 同一个 lexeme 不能超过 256 个位置.

The number of nodes (lexemes + operators) in a tsquery must be Less than 32,768, 单个 tsvector 不能存储超过 32765 个 node(lexemes + 位置).

超过限制的话, 可以使用多个字段.

二, 模糊查询正则查询相似查询

模糊查询, 相似查询, 正则匹配查询, 都属于文本匹配的范畴, PostgreSQL 通过 gin 和 pg_trgm 实现了这三种搜索的索引加速.

性能如下:

下面是实际的案例:

《HTAP 数据库 PostgreSQL 场景与性能测试之 13 - (OLTP) 字符串搜索 - 相似查询》

《HTAP 数据库 PostgreSQL 场景与性能测试之 12 - (OLTP) 字符串搜索 - 前后模糊查询》

《HTAP 数据库 PostgreSQL 场景与性能测试之 11 - (OLTP) 字符串搜索 - 后缀查询》

《HTAP 数据库 PostgreSQL 场景与性能测试之 10 - (OLTP) 字符串搜索 - 前缀查询》

《HTAP 数据库 PostgreSQL 场景与性能测试之 9 - (OLTP) 字符串模糊查询, 相似查询索引实时 BUILD - 含索引实时写入》

三, ADHOC 搜索

ADHOC 搜索, 指任意字段组合搜索. 例如一张表有 N 个列, N 个列都可能被作为查询条件, 因此有 N! 种组合, 为了实现这种场景的高效率搜索, 在 PostgreSQL 中提供了几种方法.

1, 分区表, 可以实现一些字段搜索时的收敛.

2,BLOOM 过滤索引, 支持任意组合的等值搜索. lossy 过滤.

3,GIN 多列复合索引, BITMAP SCAN. 支持任意组合搜索, 过滤到 BLOCK 级别.

4, 多个单列索引, BITMAP SCAN. 数据库自动优化, 根据 COST 评估选择 index scan 或 bitmap scan, 支持任意组合搜索, 过滤到 BLOCK 级别.

性能如下:

下面是实际的案例:

《HTAP 数据库 PostgreSQL 场景与性能测试之 20 - (OLAP) 用户画像圈人场景 - 多个字段任意组合条件筛选与透视》

《PostgreSQL 多字段任意组合搜索的性能》

《时间, 空间, 对象多维属性海量数据任意多维高效检索 - 阿里云 RDS PostgreSQL 最佳实践》

《多字段, 任意组合条件查询 (无需建模) - 毫秒级实时圈人最佳实践》

《宝剑赠英雄 - 任意组合字段等效查询, 探探 PostgreSQL 多列展开式 B 树 (GIN)》

《PostgreSQL 如何高效解决按任意字段分词检索的问题 - case 1》

《PostgreSQL 9.6 黑科技 bloom 算法索引, 一个索引支撑任意列组合查询》

《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》

参考

1, 全文检索

2, 模糊, 正则, 相似查询

3,ADHOC 查询

https://www.postgresql.org/docs/10/static/gin.html
4,

《从难缠的模糊查询聊开 - PostgreSQL 独门绝招之一 GIN , Gist , SP-Gist , RUM 索引原理与技术背景》

《PostgreSQL 全文检索加速快到没有朋友 - RUM 索引接口 (潘多拉魔盒)》

《PostgreSQL 全表全字段模糊查询的毫秒级高效实现 - 搜索引擎颤抖了》

来源: https://yq.aliyun.com/articles/672261

与本文相关文章

暂无,快来抢沙发吧！

用 PostgreSQL 做实时高效 搜索引擎 - 全文检索, 模糊查询, 正则查询, 相似查询, ADHOC 查询

与本文相关文章

用 PostgreSQL 做实时高效搜索引擎 - 全文检索, 模糊查询, 正则查询, 相似查询, ADHOC 查询