问题背景
许多开发和测试人员都可能遇到过列表的数据翻下一页的时候显示了上一页的数据, 也就是翻页会有重复的数据.
如何处理?
这个问题出现的原因是因为选择的排序字段有重复, 常见的处理办法就是排序的时候加上唯一字段, 这样在分页的过程中数据就不会重复了. 关于这个问题文档也有解释并非是一个 bug. 而是排序时需要选择唯一字段来做排序, 不然返回的结果不确定
排序返回数据重复的根本原因是什么呢?
经常优化 sql 的同学可能会发现, 执行计划里面会有 Sort Method 这个关键字, 而这个关键字就是排序选择的方法. abase 的排序分为三种
quicksort 快速排序
top-N heapsort Memory 堆排序
external merge Disk 归并排序
推测
分页重复的问题和执行计划选择排序算法的稳定性有关.
简单介绍下这三种排序算法的场景:
在有索引的情况下: 排序可以直接走索引. 在没有索引的情况下: 当表的数据量较小的时候选择快速排序 (排序所需必须内存小于 work_mem), 当排序有 limit, 且耗费的内存不超过 work_mem 时选择堆排序, 当 work_mem 不够时选择归并排序.
验证推测
1. 创建表, 初始化数据
- abase=# create table t_sort(n_int int,c_id varchar(300));
- CREATE TABLE
- abase=# insert into t_sort(n_int,c_id) select 100,generate_series(1,9);
- INSERT 0 9
- abase=# insert into t_sort(n_int,c_id) select 200,generate_series(1,9);
- INSERT 0 9
- abase=# insert into t_sort(n_int,c_id) select 300,generate_series(1,9);
- INSERT 0 9
- abase=# insert into t_sort(n_int,c_id) select 400,generate_series(1,9);
- INSERT 0 9
- abase=# insert into t_sort(n_int,c_id) select 500,generate_series(1,9);
- INSERT 0 9
- abase=# insert into t_sort(n_int,c_id) select 600,generate_series(1,9);
- INSERT 0 9
三种排序
-- 快速排序 quicksort
- abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
- QUERY PLAN
- ------------------------------------------------------------
- Sort (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)
- Sort Key: n_int
- Sort Method: quicksort Memory: 27kB
- -> Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)
- Planning time: 0.161 ms
- Execution time: 0.104 ms
- (6 rows)
-- 堆排序 top-N heapsort
- abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc limit 10;
- QUERY PLAN
- ------------------------------------------------------------
- Limit (cost=2.71..2.73 rows=10 width=12) (actual time=0.066..0.068 rows=10 loops=1)
- -> Sort (cost=2.71..2.84 rows=54 width=12) (actual time=0.065..0.066 rows=10 loops=1)
- Sort Key: n_int
- Sort Method: top-N heapsort Memory: 25kB
- -> Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.022..0.031 rows=54 loops=1
- )
- Planning time: 0.215 ms
- Execution time: 0.124 ms
- (7 rows)
-- 归并排序 external sort Disk
-- 插入大量值为 a 的数据
- abase=# insert into t_sort(n_int,c_id) select generate_series(1000,2000),'a';
- INSERT 0 1001
- abase=# set work_mem = '64kB';
- SET
- abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
- QUERY PLAN
- -------------------------------------------------------------
- Sort (cost=18.60..19.28 rows=270 width=12) (actual time=1.235..1.386 rows=1055 loops=1)
- Sort Key: n_int
- Sort Method: external sort Disk: 32kB
- -> Seq Scan on t_sort (cost=0.00..7.70 rows=270 width=12) (actual time=0.030..0.247 rows=1055 loops=1)
- Planning time: 0.198 ms
- Execution time: 1.663 ms
- (6 rows)
快速排序
-- 快速排序
- abase=# explain analyze select ctid,n_int,c_id from t_sort order by n_int asc;
- QUERY PLAN
- ------------------------------------------------------------
- Sort (cost=3.09..3.23 rows=54 width=12) (actual time=0.058..0.061 rows=54 loops=1)
- Sort Key: n_int
- Sort Method: quicksort Memory: 27kB
- -> Seq Scan on t_sort (cost=0.00..1.54 rows=54 width=12) (actual time=0.021..0.032 rows=54 loops=1)
- Planning time: 0.161 ms
- Execution time: 0.104 ms
- (6 rows)
-- 获取前 20 条数据
- abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 20;
- ctid | n_int | c_id
- --------+-------+------
- (0,7) | 100 | 7
- (0,2) | 100 | 2
- (0,4) | 100 | 4
- (0,8) | 100 | 8
- (0,3) | 100 | 3
- (0,6) | 100 | 6
- (0,5) | 100 | 5
- (0,9) | 100 | 9
- (0,1) | 100 | 1
- (0,14) | 200 | 5
- (0,13) | 200 | 4
- (0,12) | 200 | 3
- (0,10) | 200 | 1
- (0,15) | 200 | 6
- (0,16) | 200 | 7
- (0,17) | 200 | 8
- (0,11) | 200 | 2
- (0,18) | 200 | 9
- (0,20) | 300 | 2
- (0,19) | 300 | 1
(20 rows) -- 分页获取前 10 条数据
- abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 10 offset 0;
- ctid | n_int | c_id
- --------+-------+------
- (0,1) | 100 | 1
- (0,3) | 100 | 3
- (0,4) | 100 | 4
- (0,2) | 100 | 2
- (0,6) | 100 | 6
- (0,7) | 100 | 7
- (0,8) | 100 | 8
- (0,9) | 100 | 9
- (0,5) | 100 | 5
- (0,10) | 200 | 1
- (10 rows)
-- 分页从第 10 条开始获取 10 条
- abase=# select ctid,n_int,c_id from t_sort order by n_int asc limit 10 offset 10;
- ctid | n_int | c_id
- --------+-------+------
- (0,13) | 200 | 4
- (0,12) | 200 | 3
- (0,10) | 200 | 1
- (0,15) | 200 | 6
- (0,16) | 200 | 7
- (0,17) | 200 | 8
- (0,11) | 200 | 2
- (0,18) | 200 | 9
- (0,20) | 300 | 2
- (0,19) | 300 | 1
- (10 rows)
limit 10 offset 0,limit 10 offset 10, 连续取两页数据
此处可以看到 limit 10 offset 10 结果中, 第三条数据重复了第一页的最后一条: (0,10) | 200 | 1
并且 n_int = 200 and c_id = 5 这条数据被 "遗漏" 了.
堆排序
- abase=# select count(*) from t_sort;
- count
- -------
- 1055
- (1 row)
-- 设置 work_mem 4MB
- abase=# show work_mem ;
- work_mem
- ----------
- 4MB
- (1 row)
- --top-N heapsort
- abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
- QUERY PLAN
- -------------------------------------------------------------------------------------------------------------
- Limit (cost=2061.33..2061.45 rows=10 width=13) (actual time=15.247..15.251 rows=10 loops=1)
- -> Limit (cost=2061.33..2063.83 rows=1001 width=13) (actual time=15.245..15.247 rows=10 loops=1)
- -> Sort (cost=2061.33..2135.72 rows=29757 width=13) (actual time=15.244..15.245 rows=10 loops=1)
- Sort Key: test.n_int
- Sort Method: top-N heapsort Memory: 95kB
- -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.042..7.627 rows=2
- 9757 loops=1)
- Planning time: 0.376 ms
- Execution time: 15.415 ms
- (8 rows)
-- 获取 limit 1001 offset 0, 然后取 10 前 10 条数据
- abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
- ctid | n_int | c_id
- ----------+-------+------
- (0,6) | 100 | 6
- (0,2) | 100 | 2
- (0,5) | 100 | 5
- (87,195) | 100 | 888
- (0,3) | 100 | 3
- (0,1) | 100 | 1
- (0,8) | 100 | 8
- (0,55) | 100 | 888
- (44,12) | 100 | 888
- (0,9) | 100 | 9
- (10 rows)
--- 获取 limit 1001 offset 1, 然后取 10 前 10 条数据
- abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;
- ctid | n_int | c_id
- ----------+-------+------
- (44,12) | 100 | 888
- (0,8) | 100 | 8
- (0,1) | 100 | 1
- (0,5) | 100 | 5
- (0,9) | 100 | 9
- (87,195) | 100 | 888
- (0,7) | 100 | 7
- (0,6) | 100 | 6
- (0,3) | 100 | 3
- (0,4) | 100 | 4
- (10 rows)
--- 获取 limit 1001 offset 2, 然后取 10 前 10 条数据
- abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;
- ctid | n_int | c_id
- ----------+-------+------
- (0,5) | 100 | 5
- (0,55) | 100 | 888
- (0,1) | 100 | 1
- (0,9) | 100 | 9
- (0,2) | 100 | 2
- (0,3) | 100 | 3
- (44,12) | 100 | 888
- (0,7) | 100 | 7
- (87,195) | 100 | 888
- (0,4) | 100 | 4
- (10 rows)
堆排序使用内存: Sort Method: top-N heapsort Memory: 95kB
当 offset 从 0 变成 1 后, 以及变成 2 后, 会发现查询出的 10 条数据不是有顺序的.
归并排序
-- 将 work_mem 设置为 64kb 让其走归并排序.
- abase=# set work_mem ='64kB';
- SET
- abase=# show work_mem;
- work_mem
- ----------
- 64kB
- (1 row)
- -- external merge Disk
- abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
- QUERY PLAN
- ---------------------------------------------------------------------------------------------------------------------------
- Limit (cost=2061.33..2061.45 rows=10 width=13) (actual time=27.912..27.916 rows=10 loops=1)
- -> Limit (cost=2061.33..2063.83 rows=1001 width=13) (actual time=27.910..27.913 rows=10 loops=1)
- -> Sort (cost=2061.33..2135.72 rows=29757 width=13) (actual time=27.909..27.911 rows=10 loops=1)
- Sort Key: test.n_int
- Sort Method: external merge Disk: 784kB
- -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.024..6.730 rows=29757 loops=1)
- Planning time: 0.218 ms
- Execution time: 28.358 ms
- (8 rows)
-- 同堆排序一样, 获取 limit 1001 offset 0, 然后取 10 前 10 条数据
- abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 0) td limit 10;
- ctid | n_int | c_id
- --------+-------+------
- (0,1) | 100 | 1
- (0,2) | 100 | 2
- (0,4) | 100 | 4
- (0,8) | 100 | 8
- (0,9) | 100 | 9
- (0,5) | 100 | 5
- (0,3) | 100 | 3
- (0,6) | 100 | 6
- (0,55) | 100 | 888
- (0,7) | 100 | 7
- (10 rows)
-- 同堆排序一样, 获取 limit 1001 offset 1, 然后取 10 前 10 条数据
- abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 1) td limit 10;
- ctid | n_int | c_id
- ----------+-------+------
- (0,2) | 100 | 2
- (0,4) | 100 | 4
- (0,8) | 100 | 8
- (0,9) | 100 | 9
- (0,5) | 100 | 5
- (0,3) | 100 | 3
- (0,6) | 100 | 6
- (0,55) | 100 | 888
- (0,7) | 100 | 7
- (87,195) | 100 | 888
- (10 rows)
-- 同堆排序一样, 获取 limit 1001 offset 2, 然后取 10 前 10 条数据
- abase=# select * from ( select ctid,n_int,c_id from test order by n_int asc limit 1001 offset 2) td limit 10;
- ctid | n_int | c_id
- ----------+-------+------
- (0,4) | 100 | 4
- (0,8) | 100 | 8
- (0,9) | 100 | 9
- (0,5) | 100 | 5
- (0,3) | 100 | 3
- (0,6) | 100 | 6
- (0,55) | 100 | 888
- (0,7) | 100 | 7
- (87,195) | 100 | 888
- (44,12) | 100 | 888
- (10 rows)
减小 work_mem 使用归并排序的时候, offset 从 0 变成 1 后以及变成 2 后, 任然有序.
还有一种情况, 那就是在查询前面几页的时候会有重复, 但是越往后面翻就不会重复了, 现在也可以解释清楚.
如果每页 10 条数据, 当 offse 较小的时候使用的内存较少. 当 offse 不断增大, 所耗费的内存也就越多.
-- 设置 work_mem =64kb
- abase=# show work_mem;
- work_mem
- ----------
- 64kB
- (1 row)
-- 查询 limit 10 offset 10
- abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 10 offset 10) td limit 10;
- QUERY PLAN
- ---------------------------------------------------------------------------------------------------------------------------
- Limit (cost=1221.42..1221.54 rows=10 width=13) (actual time=12.881..12.884 rows=10 loops=1)
- -> Limit (cost=1221.42..1221.44 rows=10 width=13) (actual time=12.879..12.881 rows=10 loops=1)
- -> Sort (cost=1221.39..1295.79 rows=29757 width=13) (actual time=12.877..12.879 rows=20 loops=1)
- Sort Key: test.n_int
- Sort Method: top-N heapsort Memory: 25kB
- -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.058..6.363 rows=29757 loops=1)
- Planning time: 0.230 ms
- Execution time: 12.976 ms
- (8 rows)
-- 查询 limit 10 offset 1000
- abase=# explain analyze select * from ( select ctid,n_int,c_id from test order by n_int asc limit 10 offset 1000) td limit 10;
- QUERY PLAN
- ---------------------------------------------------------------------------------------------------------------------------
- Limit (cost=2065.75..2065.88 rows=10 width=13) (actual time=27.188..27.192 rows=10 loops=1)
- -> Limit (cost=2065.75..2065.78 rows=10 width=13) (actual time=27.186..27.188 rows=10 loops=1)
- -> Sort (cost=2063.25..2137.64 rows=29757 width=13) (actual time=26.940..27.138 rows=1010 loops=1)
- Sort Key: test.n_int
- Sort Method: external merge Disk: 784kB
- -> Seq Scan on test (cost=0.00..429.57 rows=29757 width=13) (actual time=0.026..6.374 rows=29757 loops=1)
- Planning time: 0.207 ms
- Execution time: 27.718 ms
- (8 rows)
可以看到当 offset 从 10 增加到 1000 的时候, 使用的内存增加, 排序的方法从堆排序变成了归并排序. 而归并排序为稳定排序, 所以后面的分页不会再有后一页出现前一页数据的情况.
参考资料: PostgreSQL - repeating rows from LIMIT OFFSET
参考资料: LIMIT and OFFSET
结语
1. 关于分页重复数据的问题主要是排序字段不唯一并且执行计划走了快速排序和堆排序导致.
2. 当排序有重复字段, 但是如果查询是归并排序, 便不会存在有重复数据的问题.
3. 当用重复字段排序, 前面的页重复, 随着 offset 的增大导致 work_mem 不足以后使用归并排序, 就不存在重复的数据了.
4. 排序和算法的稳定性有关, 当执行计划选择不同的排序算法时, 返回的结果不一样.
5. 处理重复数据的常见手段就是, 排序的时候可以在排序字段 d_larq(立案日期) 后面加上 c_bh(唯一字段) 来排序.
order by d_larq,c_bh;
来源: https://www.cnblogs.com/zhangfx01/p/10613862.html