原理介绍
PG 中所有的索引都是二级索引, 即在进行索引查询的过程中, 需要同时访问索引数据和源表数据. Index Only Scan 按照字面意思理解, 即在查询过程中只需要扫描索引数据. 这种扫描方式需要一个前提就是索引中包含了查询所需要的所有数据 (也叫覆盖索引), 如出现在 SELECT, WHERE, ORDER BY 中所引用的列.
由于 PG 的 MVCC 机制, 在没有 Index only scan 之前, 任何索引查询都需要经过通过源表数据进行可见性检查, 如图所示:
在索引扫描到过程中, 需要通过源表获取每个 Record 的可见性信息.
在 PG9.2 版本以后, 支持了 Index Only Scan, 如果一个查询所需要的数据能够完全可以被索引覆盖, 那么 Index Only Scan 就会成为一种新的扫描路径, 并且通过 Visibility map 避免了通过获取源表进行可见性检查, 提升了查询性能, 如果所示:
这里主要依赖了 Visibility map 的机制, Visibility map 中有一个标记位, 标记了 Page 中的元组是否都是可见的, 也就意味着如果表没有被 delete,update 过或者已经被 vacuum 过了.
如果 Visibility map 能够确认该 Index entry 所对应的 Page 都是可见的, 那么就不再获取源表 Record 进行可见性判断了, 否则还需要获取源表元组并进行可见性判断.
使用示例
GP6 版本集成了 PG9.4 版本, 因此也支持了 Index Only Scan 的特性.
例如存在一张表, 并在其中一个列上创建了索引:
- postgres=# \d customer_reviews_hp
- Table "public.customer_reviews_hp"
- Column | Type | Modifiers
- ----------------------+-----------------+-----------
- customer_id | text |
- review_date | date |
- review_rating | integer |
- review_votes | integer |
- review_helpful_votes | integer |
- product_id | character(10) |
- product_title | text |
- product_sales_rank | bigint |
- product_group | text |
- product_category | text |
- product_subcategory | text |
- similar_product_ids | character(10)[] |
- Indexes:
- "c_review_rating" btree (review_rating)
- Distributed by: (customer_id)
查询:
- postgres=# explain analyze select count(*), review_rating from customer_reviews_hp where review_rating> 1 group by 2;
- QUERY PLAN
- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- -------
- Gather Motion 4:1 (slice2; segments: 4) (cost=49979.36..49979.50 rows=5 width=12) (actual time=782.673..782.726 rows=4 loops=1)
- -> GroupAggregate (cost=49979.36..49979.50 rows=2 width=12) (actual time=782.384..782.385 rows=2 loops=1)
- Group Key: customer_reviews_hp.review_rating
- -> Sort (cost=49979.36..49979.37 rows=2 width=12) (actual time=782.376..782.377 rows=8 loops=1)
- Sort Key: customer_reviews_hp.review_rating
- Sort Method: quicksort Memory: 132kB
- -> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.18..49979.30 rows=2 width=12) (actual time=76.538..782.345 rows=8 loops=1)
- Hash Key: customer_reviews_hp.review_rating
- -> GroupAggregate (cost=0.18..49979.20 rows=2 width=12) (actual time=5.102..73.709 rows=4 loops=1)
- Group Key: customer_reviews_hp.review_rating
- -> Index Only Scan using c_review_rating on customer_reviews_hp (cost=0.18..41742.09 rows=411854 width=4) (actual time=0.128..643.718 rows=1061311 lo
- ops=1)
- Index Cond: (review_rating> 1)
- Heap Fetches: 0
- Planning time: 0.212 ms
- (slice0) Executor memory: 220K bytes.
- (slice1) Executor memory: 156K bytes avg x 4 workers, 156K bytes max (seg0).
- (slice2) Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0). Work_mem: 33K bytes max.
- Memory used: 2047000kB
- Optimizer: Postgres query optimizer
- Execution time: 783.308 ms
- (20 rows)
由此可见启用了 Index Only Scan.
可以通过 enable_indexonlyscan 来控制是否使用 Index Only Scan, 例如同样上面的查询设置 enable_indexonlyscan 为 off 后, 再次执行:
- postgres=# explain analyze select count(*), review_rating from customer_reviews_hp where review_rating> 1 group by 2;
- QUERY PLAN
- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- --
- Gather Motion 4:1 (slice2; segments: 4) (cost=49979.36..49979.50 rows=5 width=12) (actual time=951.830..951.840 rows=4 loops=1)
- -> GroupAggregate (cost=49979.36..49979.50 rows=2 width=12) (actual time=951.566..951.567 rows=2 loops=1)
- Group Key: customer_reviews_hp.review_rating
- -> Sort (cost=49979.36..49979.37 rows=2 width=12) (actual time=951.556..951.556 rows=8 loops=1)
- Sort Key: customer_reviews_hp.review_rating
- Sort Method: quicksort Memory: 132kB
- -> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.18..49979.30 rows=2 width=12) (actual time=75.010..951.527 rows=8 loops=1)
- Hash Key: customer_reviews_hp.review_rating
- -> GroupAggregate (cost=0.18..49979.20 rows=2 width=12) (actual time=5.211..77.359 rows=4 loops=1)
- Group Key: customer_reviews_hp.review_rating
- -> Index Scan using c_review_rating on customer_reviews_hp (cost=0.18..41742.09 rows=411854 width=4) (actual time=0.118..817.460 rows=1061311 loops=1
- )
- Index Cond: (review_rating> 1)
- Planning time: 0.217 ms
- (slice0) Executor memory: 156K bytes.
- (slice1) Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0).
- (slice2) Executor memory: 92K bytes avg x 4 workers, 92K bytes max (seg0). Work_mem: 33K bytes max.
- Memory used: 2047000kB
- Optimizer: Postgres query optimizer
- Execution time: 952.473 ms
- (19 rows)
只是用到了索引, 不是 Index Only Scan, 执行时间上增加了将近 200ms, 下降了 20% 左右.
但同时需要注意的是, Index Only Scan 并不是银弹, 做到 Index Only Scan 往往需要创建联合索引, 联合索引本身也会有性能问题, 例如影响写入, 更新性能等. 需要具体问题具体分析, Index Only Scan 只是多了一种可优化路径选择.
GP 的限制
Orca 优化器不支持 Index Only Scan,GP6 版本中, 只有 PG 原生的优化器支持 Index Only Scan.
列存表也不支持 Index Only Scan,Index Only Scan 依赖 Visibility map 机制实现, 列存表显然做不到 Index Only Scan.
GP 上的 Index Only Scan 在 explain analyze 时, Heap Fetches 显示不准确, 例如:
- create table test (a , b ,c);
- create table test (a int, b int ,c int);
- insert into test values(generate_series(1,100000),generate_series(1,100000),generate_series(1,100000));
- create index a_ind on test(a,b,c);
-- Master 上执行:
- postgres=# explain analyze select * from test where a> 1 order by a;
- QUERY PLAN
- ------------------------------------------------------------------------------------------------------------------------------------
- Gather Motion 1:1 (slice1; segments: 1) (cost=0.17..2463.87 rows=99990 width=12) (actual time=1.169..84.196 rows=99999 loops=1)
- Merge Key: a
- -> Index Only Scan using a_ind on test (cost=0.17..2463.87 rows=99990 width=12) (actual time=0.116..44.373 rows=99999 loops=1)
- Index Cond: (a> 1)
- Heap Fetches: 0
- Planning time: 0.685 ms
- (slice0) Executor memory: 216K bytes.
- (slice1) Executor memory: 148K bytes (seg0).
- Memory used: 128000kB
- Optimizer: Postgres query optimizer
- Execution time: 96.809 ms
- (11 rows)
显示 Heap Fetchs 为 0, 而直接连上 segment 进行 explain analyze:
- postgres=# explain analyze select * from test where a> 1 order by a;
- QUERY PLAN
- ------------------------------------------------------------------------------------------------------------------------------
- Index Only Scan using a_ind on test (cost=0.29..1255.62 rows=33334 width=12) (actual time=0.072..39.561 rows=99999 loops=1)
- Index Cond: (a> 1)
- Heap Fetches: 99999
- Planning time: 0.148 ms
- (slice0)
- Optimizer: Postgres query optimizer
- Execution time: 47.481 ms
- (7 rows)
其实是存在 Heap Fetches 的, 从执行时间上看, Master 上的 Heap Fetches 项显示不对.
这种情况需要依赖 Vacuum 来做 Visibility Map 的清理工作了. 正常情况下做下 Vacuum analyze 就能保证不不需要 Heap Fetch.
参考
https://www.postgresql.org/docs/current/indexes-index-only-scans.html
来源: https://yq.aliyun.com/articles/720749