[搜索引擎] PostgreSQL 10 实时全文检索和分词, 相似搜索, 模糊匹配实现类似 Google 搜索自动提示

需求分析

要通过 PostgreSQL 实现类似 Google 搜索自动提示的功能, 例如要实现一个查询海量数据中的商品名字, 每次输入就提示用户各种相关搜索选项, 例如淘宝, 京东等电商查询

思路

这个功能可以用 PostgreSQL 的实时全文检索和分词, 相似搜索, 前模糊匹配等特性实现. 具体策略是, 定义一个搜索提示的最大数量. 首先通过前模糊匹配查询获取数据, 如果未满最大数量, 则进行全文检索分词查询补偿, 如果全文检索补偿查询的数据未满最大提示数量, 最后就加入相识查询的结果. 当然这里是一个简单的思路, 复杂的还得根据实际需求实现.

构造数据

新建一张商品表, 插入一千万条数据, name 就是商品名字.

create table goods(id int, name varchar);
insert into goods select generate_series(1,10000000),md5(random()::varchar);

一, 前模糊匹配及优化

实现 SQL, 每次输入就作为前缀模糊查询:

select * from goods where name like '123%' ;

这个简单的前模糊匹配 SQL, 可以使用 B-Tree 来加速优化模糊查询.

未建立索引时查询 "123%" 的商品名字, 执行计划显示耗时大约为 575ms:

explain (analyze,verbose,timing,costs,buffers) select * from goods where name like '123%' ;
========================================
Gather  (cost=1000.00..136516.59 rows=1000 width=37) (actual time=1.390..572.857 rows=2364 loops=1)
  Output: id, name
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=83334
  ->  Parallel Seq Scan on public.goods  (cost=0.00..135416.59 rows=417 width=37) (actual time=0.750..528.116 rows=788 loops=3)
        Output: id, name
        Filter: ((goods.name)::text ~~ '123%'::text)
        Rows Removed by Filter: 3332545
        Buffers: shared hit=83334
        Worker 0: actual time=1.032..511.776 rows=676 loops=1
          Buffers: shared hit=24201
        Worker 1: actual time=0.145..511.737 rows=755 loops=1
          Buffers: shared hit=26101
Planning time: 0.065 ms
Execution time: 573.157 ms

优化 1, 建立索引 (lc_collate 方式)

通过 lc_collate 方式建立索引, 也就是 B-Tree 索引.

lc_collate (string) 是指报告文本数据排序使用的区域

建立索引脚本如下

create index idx_c on goods(name collate "C");

执行计划显示耗时为 10ms 以内:

explain (analyze,verbose,timing,costs,buffers) select * from goods where name like '123%' collate "C";

优化 2, 建立索引 (操作符类 varchar_pattern_ops 方式)

建立索引脚本如下

create index idx_varchar on goods(name varchar_pattern_ops);

执行计划显示耗时为 5ms 以内:

explain (analyze,verbose,timing,costs,buffers) select * from goods where name like '123%' collate "C";
======================================
Bitmap Heap Scan on public.goods  (cost=86.60..7681.10 rows=1000 width=37) (actual time=0.740..4.628 rows=2364 loops=1)
  Output: id, name
  Filter: ((goods.name)::text ~~ '123%'::text)
  Heap Blocks: exact=2330
  Buffers: shared hit=2351
  ->  Bitmap Index Scan on idx_varchar  (cost=0.00..86.35 rows=2179 width=0) (actual time=0.487..0.487 rows=2364 loops=1)
        Index Cond: (((goods.name)::text ~>=~ '123'::text) AND ((goods.name)::text ~<~ '124'::text))
        Buffers: shared hit=21
Planning time: 0.139 ms
Execution time: 4.891 ms

二, 全文检索和分词 (通过 gin 索引优化加速)

注意: 全文检索和下面的相识搜索都需要 pg_trgm 插件. 所以先要执行:

create extension pg_trgm;

具体 SQL 如下, 每次输入空格用 & 符号代替, 最后接:* 表示模糊检索. to_tsvector ,to_tsquery 参阅 PostgreSQL 全文检索文档.

SELECT name FROM goods WHERE to_tsvector('English',name) @@ to_tsquery('English','aaa&bbb&cc:*')

通过执行计划查看速度: 接近 8 秒

.......
Planning time: 0.129 ms
Execution time: 7986.176 ms

通过 gin 索引来优化加速, 这里 to_tsvector('English',name) 就是一个表达式索引.

CREATE INDEX name_idx ON goods USING GIN(to_tsvector('English',name));

优化后后的执行计划, 速度为 13 毫秒左右:

explain (analyze,verbose,timing,costs,buffers) SELECT name FROM goods WHERE to_tsvector('English',name) @@  to_tsquery('English','aaa&bbb&cc:*')
=================================================
Bitmap Heap Scan on public.goods  (cost=88.04..109.24 rows=5 width=33) (actual time=17.343..17.353 rows=4 loops=1)
  Output: name
  Recheck Cond: (to_tsvector('english'::regconfig, (goods.name)::text) @@ '''aaa'' & ''bbb'' & ''cc'':*'::tsquery)
  Heap Blocks: exact=1
  Buffers: shared hit=473
  ->  Bitmap Index Scan on name_idx  (cost=0.00..88.04 rows=5 width=0) (actual time=17.334..17.334 rows=4 loops=1)
        Index Cond: (to_tsvector('english'::regconfig, (goods.name)::text) @@ '''aaa'' & ''bbb'' & ''cc'':*'::tsquery)
        Buffers: shared hit=472
Planning time: 0.222 ms
Execution time: 13.381 ms

三, 相似搜索

具体实现 SQL, 通过查询结果可以看到越相似, 相似度越小, 可以看到, 在搜索 aaa bbb 的时候搜索出了 aaa b6b, 这就是相似搜索.

SELECT name ,name <-> 'aaa bbb' FROM goods WHERE name <-> 'aaa bbb' < 0.7 LIMIT 10
aaa bbb             0
aaa bbb ccc         0.333333
aaa ccc bbb         0.333333
aaa bbb ccc ddd     0.5
aaa b6b ccc         0.666667
aaa bbb ccsdsd      0.466667
aaa                 0.5

PostgreSQL 扩展知识

在第一种模糊查询中, 可以使用关键字 ILIKE 替换 LIKE, ILIKE 表示字符串匹配时与大小写无关. 这是一个 PostgreSQL 扩展, 并不是标准 SQL 语法.

参考

PostgreSQL 索引官方文档

来源: https://www.cnblogs.com/monkjavaer/p/11173169.html

与本文相关文章

暂无,快来抢沙发吧！