MongoDB Pagination in Java and Thoughts on Pagination Requirements

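Preface

Traditional relational databases offer pagination based on row numbers. After switching to MongoDB, implementing pagination takes a slightly different way of thinking.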

The traditional approach to pagination

Assume a page size of 10. Then:

//page 1
1-10
//page 2
11-20
//page 3
21-30
...
//page n
10*(n-1)+1 to 10*n

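MongoDB provides the skip() and limit() methods.

skip: skips the given number of documents. It can be used to skip everything before the current page, that is, pageSize*(n-1) documents.

limit: caps the number of documents read from MongoDB, so it can serve as the page size, pageSize.

So pagination can be done like this: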

//Page 1
db.users.find().limit(10)
//Page 2
db.users.find().skip(10).limit(10)
//Page 3
db.users.find().skip(20).limit(10)
........
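
The offset arithmetic behind these queries can be sketched in Java. This is only an illustration of the pageSize*(n-1) rule; the class name SkipPaging and the method skipFor are made up for this sketch and do not come from any MongoDB driver.

```java
/** Offset arithmetic for skip/limit paging; page numbers start at 1. */
final class SkipPaging {

    /** Number of documents to skip before page pageNum: pageSize * (pageNum - 1). */
    static long skipFor(int pageNum, int pageSize) {
        if (pageNum < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNum and pageSize must be >= 1");
        }
        // Widen to long before multiplying so large page numbers cannot overflow
        return (long) pageSize * (pageNum - 1);
    }

    public static void main(String[] args) {
        int pageSize = 10;
        for (int pageNum = 1; pageNum <= 3; pageNum++) {
            // Prints the shell query that corresponds to each page
            System.out.println("db.users.find().skip(" + skipFor(pageNum, pageSize)
                    + ").limit(" + pageSize + ")");
        }
    }
}
```

For page 1 the offset is 0, which is why the first query can leave skip() out.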

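The problem

It looks as if pagination is done, but the official documentation advises against this approach: the server has to walk the result set from the beginning, past every skipped document, before it can return anything.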

The cursor.skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, cursor.skip() will become slower.

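So we need a faster way. Just as LIMIT m,n is discouraged in MySQL once the data set grows large, the solution is to locate the first record of the current page and then read pageSize records in order. This is also what the MongoDB documentation recommends.

The right way to paginate

Suppose the query compares on _id. In fact, the field used as the comparison baseline can be any ordered field you like, such as a timestamp.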

//Page 1 (sort explicitly so the page order is well defined)
db.users.find().sort({ _id: 1 }).limit(pageSize);
//Find the id of the last document in this page
last_id = ...
//Page 2
users = db.users.find({
  '_id': { "$gt": last_id }
}).sort({ _id: 1 }).limit(pageSize)
//Update the last id with the id of the last document in this page
last_id = ...

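Clearly, the first page is queried differently from the pages after it. When building a pagination API, we can require the caller to pass pageSize and lastId.

pageSize: the page size

lastId: the id of the last record on the previous page; if it is omitted, the first page is returned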
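
A minimal Java sketch of such an API is shown here, under the assumption that a sorted in-memory map stands in for the collection and plain Long keys stand in for ObjectId values. KeysetPager and nextPage are hypothetical names; the point is only how pageSize and lastId drive the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for find({_id: {$gt: lastId}}).sort({_id: 1}).limit(pageSize). */
final class KeysetPager {

    /** Returns the ids of the next page; lastId == null means "first page". */
    static List<Long> nextPage(NavigableMap<Long, String> docs, Long lastId, int pageSize) {
        // tailMap(lastId, false) keeps only the keys strictly greater than lastId
        NavigableMap<Long, String> rest = (lastId == null) ? docs : docs.tailMap(lastId, false);
        List<Long> page = new ArrayList<>();
        for (Long id : rest.keySet()) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(id);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> docs = new TreeMap<>();
        for (long id = 1; id <= 25; id++) {
            docs.put(id, "user-" + id);
        }
        Long lastId = null;
        List<Long> page;
        while (!(page = nextPage(docs, lastId, 10)).isEmpty()) {
            System.out.println(page);
            lastId = page.get(page.size() - 1); // cursor for the next request
        }
    }
}
```

The caller never sends a page number: each response carries the id that the next request passes back as lastId.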

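Descending order

With _id descending, the first page holds the largest ids, and every id on the next page is smaller than the last id of the previous page.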

function printStudents(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $lt: startValue } } )
             .sort( { _id: -1 } )
             .limit( nPerPage )
             .forEach( student => {
               print( student.name );
               endValue = student._id;
             } );
  return endValue;
}

Ascending order

With _id ascending, every id on the next page is larger than the id of the last record on the previous page.

function printStudents(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $gt: startValue } } )
             .sort( { _id: 1 } )
             .limit( nPerPage )
             .forEach( student => {
               print( student.name );
               endValue = student._id;
             } );
  return endValue;
}
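
The two shell functions differ only in the comparison operator and the sort direction. The following Java sketch shows that symmetry on an in-memory sorted set; DirectionalPager is a hypothetical name, Integer ids stand in for ObjectId values, and a null startValue is this sketch's way of asking for the first page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** One paging routine for both directions, mirroring $gt + sort(1) and $lt + sort(-1). */
final class DirectionalPager {

    static List<Integer> page(NavigableSet<Integer> ids, Integer startValue,
                              int nPerPage, boolean ascending) {
        // The descending view iterates from the largest id to the smallest
        NavigableSet<Integer> ordered = ascending ? ids : ids.descendingSet();
        // tailSet(x, false) keeps what comes strictly after x in the chosen order:
        // larger ids when ascending, smaller ids when descending
        NavigableSet<Integer> rest =
                (startValue == null) ? ordered : ordered.tailSet(startValue, false);
        List<Integer> page = new ArrayList<>();
        for (Integer id : rest) {
            if (page.size() == nPerPage) {
                break;
            }
            page.add(id);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableSet<Integer> ids = new TreeSet<>();
        for (int id = 1; id <= 10; id++) {
            ids.add(id);
        }
        List<Integer> firstDesc = page(ids, null, 3, false);
        Integer endValue = firstDesc.get(firstDesc.size() - 1);
        System.out.println(firstDesc + " then " + page(ids, endValue, 3, false));
    }
}
```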

Total count

There is also the question of how many records and how many pages there are in total. So the total count has to be queried first.

db.users.find().count();
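
Once the count is known, the number of pages is the count divided by the page size, rounded up. A small Java sketch of that calculation follows; PageCount and pages are illustrative names, and the integer-only rounding is one possible choice.

```java
/** Derives the number of pages from a total count. */
final class PageCount {

    /** Ceiling of total / pageSize, computed without floating point. */
    static long pages(long total, int pageSize) {
        if (total < 0 || pageSize < 1) {
            throw new IllegalArgumentException("total must be >= 0 and pageSize >= 1");
        }
        return (total + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(pages(25, 10)); // 25 records at 10 per page need 3 pages
    }
}
```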

Is ObjectId really ordered?

First, look at how an ObjectId is generated:

For example

"_id" : ObjectId("5b1886f8965c44c78540a4fc")

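Take the first 4 bytes of the id. Because the id is written as a hexadecimal string, 4 bytes are 32 bits, which correspond to the first 8 characters of the id: 5b1886f8. Converted to decimal this is 1528334072, the number of seconds since the Unix epoch (January 1, 1970, 00:00:00 GMT), so it gives the time at which the id was created.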
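
The same conversion can be written in a few lines of plain Java, without the BSON library. ObjectIdTime is a hypothetical helper for this sketch, and it reads the 4 bytes as an unsigned value.

```java
import java.time.Instant;

/** Reads the creation time out of the hex form of an ObjectId. */
final class ObjectIdTime {

    /** The first 8 hex characters (4 bytes) are seconds since the Unix epoch. */
    static long timestampSeconds(String hexId) {
        if (hexId == null || hexId.length() != 24) {
            throw new IllegalArgumentException("expected 24 hex characters");
        }
        return Long.parseLong(hexId.substring(0, 8), 16);
    }

    static Instant createdAt(String hexId) {
        return Instant.ofEpochSecond(timestampSeconds(hexId));
    }

    public static void main(String[] args) {
        String id = "5b1886f8965c44c78540a4fc";
        System.out.println(timestampSeconds(id) + " -> " + createdAt(id));
    }
}
```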

In fact, an easier way is to read the ObjectId class in org.mongodb:bson:3.4.3.

public ObjectId(Date date) {
    this(dateToTimestampSeconds(date), MACHINE_IDENTIFIER, PROCESS_IDENTIFIER, NEXT_COUNTER.getAndIncrement(), false);
}
//org.bson.types.ObjectId#dateToTimestampSeconds
private static int dateToTimestampSeconds(Date time) {
    return (int)(time.getTime() / 1000L);
}
//java.util.Date#getTime
/**
 * Returns the number of milliseconds since January 1, 1970, 00:00:00 GMT
 * represented by this <tt>Date</tt> object.
 *
 * @return  the number of milliseconds since January 1, 1970, 00:00:00 GMT
 *          represented by this date.
 */
public long getTime() {
    return getTimeImpl();
}

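A MongoDB ObjectId should increase over time, that is, an id inserted later is larger than the earlier ones. But given how the id is generated, the timestamp only has one-second resolution, so the order of ids created within the same second is not guaranteed in general. Ids generated by the same process on the same machine are still ordered, because the counter keeps increasing.

With distributed machines there is also the problem of clock synchronization and drift between machines. So if you have a field that is guaranteed to be ordered, sorting by that field is the best choice; _id is the fallback of last resort.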

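What if I really have to jump to a page?

The pagination above looks ideal, and it is, but one hard requirement has gone unmentioned: how do I jump to an arbitrary page?

Our pages are tied to the sort key, so there has to be a reference value of the sort key at which to cut off the records. When jumping to a page, all I know is the page number; that is not enough information, and this scheme can no longer paginate.

Real business requirements do ask for page jumping, even though almost nobody uses it. People mostly care about the beginning and the end, and the end can be turned into the beginning by reversing the sort order. So a genuine need for page jumping should hardly exist. If you are looking for a particular record, searching with query conditions is the fastest way. If you do not know the conditions and have to go through the records by eye, a "next page" button is enough.

All of this is meant to challenge the traditional notion of pagination. On today's internet most data sets are huge, and page jumping costs more memory and CPU, which shows up as slow queries.

Of course, if the data set is small and a little slowness is acceptable, skip is no real problem. It all depends on the business scenario.

The requirement I received today does call for page jumping, and the data set is tiny, so skip it is: little effort, and fast enough.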

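How the big players do it

Google is the one we use most, and it does seem to offer page jumping. Look closer, though: it only shows 10 pages, and to go further you must click "next"; it offers neither the total number of pages nor a jump to an arbitrary page. Isn't that just our find-condition-then-limit scheme, only with a larger page that the front end or back end slices into 10 pieces?

Likewise, Facebook shows a total count but only lets you go to the next page.

In other scenarios, such as Twitter, Weibo, and WeChat Moments, there is no notion of page jumping at all.

Sorting and performance

So far the focus has been on how pagination works, and sorting has been ignored. Pagination always follows some order, so a sort is required.

Combining sort and find in MongoDB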

db.bios.find().sort( { name: 1 } ).limit( 5 )
db.bios.find().limit( 5 ).sort( { name: 1 } )

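These two statements are equivalent; the order in which the methods are chained does not change the order of execution. In both cases find first selects the matching documents, the result set is sorted, and the limit is applied last.

Conditional queries are sometimes sorted by a field as well, for example by time. When querying time-series data that should be displayed in chronological order, sort by the time field first and then by _id ascending.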

db.users.find({name: "Ryan"}).sort( { birth: 1, _id: 1 } ).limit( 5 )

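We sort by birth ascending first, and records with the same birth are then sorted by _id ascending. This yields a fixed total order on which pagination can be built.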
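
With two sort fields, the "next page" condition has to compare the pair: a record comes after the last one seen if its birth is greater, or if its birth is equal and its _id is greater. In the shell this would typically be an $or over those two cases. The Java sketch below shows the comparison on an in-memory list; CompoundKeyset, User, and the use of int values for birth and id are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Keyset paging over the compound order (birth ascending, id ascending). */
final class CompoundKeyset {

    /** Minimal document: birth is the sort field, id the unique tie-breaker. */
    static final class User {
        final int birth;
        final int id;

        User(int birth, int id) {
            this.birth = birth;
            this.id = id;
        }
    }

    static final Comparator<User> ORDER =
            Comparator.comparingInt((User u) -> u.birth).thenComparingInt(u -> u.id);

    /** Page of users strictly after last in (birth, id) order; last == null means first page. */
    static List<User> nextPage(List<User> users, User last, int pageSize) {
        List<User> sorted = new ArrayList<>(users);
        sorted.sort(ORDER);
        List<User> page = new ArrayList<>();
        for (User u : sorted) {
            // Keeps u only if birth > last.birth, or birth == last.birth and id > last.id
            if (last != null && ORDER.compare(u, last) <= 0) {
                continue;
            }
            if (page.size() == pageSize) {
                break;
            }
            page.add(u);
        }
        return page;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(new User(1990, 3), new User(1990, 1),
                new User(1985, 5), new User(1990, 2), new User(2000, 4));
        User last = null;
        List<User> page;
        while (!(page = nextPage(users, last, 2)).isEmpty()) {
            for (User u : page) {
                System.out.println(u.birth + " / " + u.id);
            }
            last = page.get(page.size() - 1);
        }
    }
}
```

Comparing on birth alone would skip or repeat records that share a birth value across a page boundary, which is why the unique _id takes part in the comparison.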

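Sorting on multiple fields

db.records.find().sort({ a: 1, b: -1 })

This sorts by a ascending, then by b descending. That is, records are ordered by a ascending, and only records with the same a are ordered by b descending; it does not re-sort everything by b after sorting by a.

Example: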

db.user.find();

Result:

{
    "_id" : ObjectId("5b1886ac965c44c78540a4fb"),
    "name" : "a",
    "age" : 1.0,
    "id" : "1"
}
{
    "_id" : ObjectId("5b1886f8965c44c78540a4fc"),
    "name" : "a",
    "age" : 2.0,
    "id" : "2"
}
{
    "_id" : ObjectId("5b1886fa965c44c78540a4fd"),
    "name" : "b",
    "age" : 1.0,
    "id" : "3"
}
{
    "_id" : ObjectId("5b1886fd965c44c78540a4fe"),
    "name" : "b",
    "age" : 2.0,
    "id" : "4"
}
{
    "_id" : ObjectId("5b1886ff965c44c78540a4ff"),
    "name" : "c",
    "age" : 10.0,
    "id" : "5"
}

Sort by name ascending, then by age descending

db.user.find({}).sort({name: 1, age: -1})

Result:

{
    "_id" : ObjectId("5b1886f8965c44c78540a4fc"),
    "name" : "a",
    "age" : 2.0,
    "id" : "2"
}
{
    "_id" : ObjectId("5b1886ac965c44c78540a4fb"),
    "name" : "a",
    "age" : 1.0,
    "id" : "1"
}
{
    "_id" : ObjectId("5b1886fd965c44c78540a4fe"),
    "name" : "b",
    "age" : 2.0,
    "id" : "4"
}
{
    "_id" : ObjectId("5b1886fa965c44c78540a4fd"),
    "name" : "b",
    "age" : 1.0,
    "id" : "3"
}
{
    "_id" : ObjectId("5b1886ff965c44c78540a4ff"),
    "name" : "c",
    "age" : 10.0,
    "id" : "5"
}
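
The same ordering rule can be reproduced in Java with a chained Comparator, which may help when checking expectations in a unit test. The sketch below uses the five sample documents from above; MultiFieldSort, User, and sampleUsers are names invented for this illustration, and age is stored as an int for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Mirrors sort({name: 1, age: -1}) with a chained Comparator. */
final class MultiFieldSort {

    static final class User {
        final String id;
        final String name;
        final int age;

        User(String id, String name, int age) {
            this.id = id;
            this.name = name;
            this.age = age;
        }
    }

    // name ascending first; age descending only breaks ties between equal names
    static final Comparator<User> NAME_ASC_AGE_DESC =
            Comparator.comparing((User u) -> u.name)
                    .thenComparing(Comparator.comparingInt((User u) -> u.age).reversed());

    static List<User> sampleUsers() {
        return Arrays.asList(new User("1", "a", 1), new User("2", "a", 2),
                new User("3", "b", 1), new User("4", "b", 2), new User("5", "c", 10));
    }

    /** Returns the "id" field of each user in sorted order. */
    static List<String> sortedIds(List<User> users) {
        List<User> sorted = new ArrayList<>(users);
        sorted.sort(NAME_ASC_AGE_DESC);
        List<String> ids = new ArrayList<>();
        for (User u : sorted) {
            ids.add(u.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(sortedIds(sampleUsers())); // [2, 1, 4, 3, 5]
    }
}
```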

Using indexes to optimize sorting

At this point performance has to be considered.

$sort and Memory Restrictions

The $sort stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $sort will produce an error. To allow for the handling of large datasets, set the allowDiskUse option to true to enable $sort operations to write to temporary files. See the allowDiskUse option in db.collection.aggregate() method and the aggregate command for details.

Changed in version 2.6: The memory limit for $sort changed from 10 percent of RAM to 100 megabytes of RAM.

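Since 2.6, the $sort stage may use at most 100 MB of memory and raises an error beyond that. Setting allowDiskUse lets it sort larger data sets by writing to temporary files.

Sorting with an index is faster than sorting without one, so the official documentation recommends indexing the keys you sort on.

Indexes

For sorting on a single key, create a single-field index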

db.records.createIndex( { a: 1 } )

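An index supports sorts in its own direction and in the reverse direction

An index is either ascending (1) or descending (-1), and it can support a sort both in the direction it was defined with and in the opposite direction. The single-key index on a above supports both the ascending sort({a:1}) and the descending sort({a:-1}).

Sorting on multiple fields

To use an index here, create a compound index: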

db.records.createIndex( { a: 1, b:-1 } )

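The field order of a compound index must match the sort

The fields of a compound index must appear in the same order as in the sort for the index to be used. For example, the index {a:1, b:1} supports sort({a:1, b:1}) and the reverse sort({a:-1, b:-1}), but not a and b swapped; that is, it does not support sort({b:1, a:1}).

A compound index supports the sort in its own direction and in the reverse direction

The index {a:1, b:-1} supports sort({a:1, b:-1}) as well as sort({a:-1, b:1}).

Prefixes of a compound index can support a sort

A multi-field compound index can be broken down into several prefixes. For example, {a:1, b:1, c:1} covers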

{ a: 1 }
{ a: 1, b: 1 }
{ a: 1, b: 1, c: 1 }

Example:

Example	Index Prefix
db.data.find().sort( { a: 1 } )	{ a: 1 }
db.data.find().sort( { a: -1 } )	{ a: 1 }
db.data.find().sort( { a: 1, b: 1 } )	{ a: 1, b: 1 }
db.data.find().sort( { a: -1, b: -1 } )	{ a: 1, b: 1 }
db.data.find().sort( { a: 1, b: 1, c: 1 } )	{ a: 1, b: 1, c: 1 }
db.data.find( { a: { $gt: 4 } } ).sort( { a: 1, b: 1 } )	{ a: 1, b: 1 }
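
The direction and prefix rules described above can be condensed into a small predicate. The Java sketch below is a simplified model of those rules only, not the logic of MongoDB's query planner; IndexSortRule, supportsSort, and spec are hypothetical names, and key specs are passed as insertion-ordered maps.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model: can an index on its own provide the order a sort asks for? */
final class IndexSortRule {

    /**
     * True when the sort keys are a prefix of the index keys (same fields, same order)
     * and their directions either all match the index or are all inverted.
     * Both maps must iterate in key-spec order, e.g. LinkedHashMap.
     */
    static boolean supportsSort(Map<String, Integer> index, Map<String, Integer> sort) {
        if (sort.isEmpty() || sort.size() > index.size()) {
            return false;
        }
        Iterator<Map.Entry<String, Integer>> indexKeys = index.entrySet().iterator();
        int relation = 0; // 1 = same direction as the index, -1 = inverted, 0 = not known yet
        for (Map.Entry<String, Integer> sortKey : sort.entrySet()) {
            Map.Entry<String, Integer> indexKey = indexKeys.next();
            if (!indexKey.getKey().equals(sortKey.getKey())) {
                return false; // different field, or same fields in a different order
            }
            int r = indexKey.getValue() * sortKey.getValue();
            if (relation == 0) {
                relation = r;
            } else if (relation != r) {
                return false; // directions are mixed relative to the index
            }
        }
        return true;
    }

    /** Builds an ordered key spec from alternating field names and directions (1 or -1). */
    static Map<String, Integer> spec(Object... fieldsAndDirections) {
        Map<String, Integer> keys = new LinkedHashMap<>();
        for (int i = 0; i < fieldsAndDirections.length; i += 2) {
            keys.put((String) fieldsAndDirections[i], (Integer) fieldsAndDirections[i + 1]);
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = spec("a", 1, "b", 1);
        System.out.println(supportsSort(index, spec("a", -1, "b", -1)));
        System.out.println(supportsSort(index, spec("b", 1, "a", 1)));
    }
}
```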

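A non-prefix subset of a compound index can support a sort, provided find uses equality conditions on the index keys that precede it

This rule is a mouthful. A sort on a non-prefix subset of a compound index can still use the index as long as the fields in find and sort together form an index prefix and the find conditions on the keys that precede the sort keys are equality conditions.

Example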

Example	Index Prefix
db.data.find( { a: 5 } ).sort( { b: 1, c: 1 } )	{ a: 1 , b: 1, c: 1 }
db.data.find( { b: 3, a: 4 } ).sort( { c: 1 } )	{ a: 1, b: 1, c: 1 }
db.data.find( { a: 5, b: { $lt: 3} } ).sort( { b: 1 } )	{ a: 1, b: 1 }

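As long as the find and sort fields together form a prefix, find may additionally apply non-equality comparisons to other fields.

When the sort keys are neither an index prefix nor preceded by equality conditions in find, the index cannot be used for the sort. For example, with the index {a:1, b:1, c:1}, the following two queries cannot use the index to sort.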

db.data.find( { a: { $gt: 2 } } ).sort( { c: 1 } )
db.data.find( { c: 5 } ).sort( { c: 1 } )
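
The equality rule can be modelled in the same spirit. The sketch below ignores sort directions and looks only at field positions: the sort fields must sit next to each other in the index, and every index field in front of them must carry an equality condition in find. It is a simplified reading of the rule stated above, not the planner's actual behaviour, and EqualityPrefixRule and sortUsesIndex are made-up names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the "equality on the preceding keys" rule; directions are ignored. */
final class EqualityPrefixRule {

    static boolean sortUsesIndex(List<String> indexFields, Set<String> equalityFields,
                                 List<String> sortFields) {
        if (sortFields.isEmpty()) {
            return false;
        }
        // Position of the first sort field inside the index
        int start = indexFields.indexOf(sortFields.get(0));
        if (start < 0 || start + sortFields.size() > indexFields.size()) {
            return false;
        }
        // The sort fields must appear contiguously and in index order
        if (!indexFields.subList(start, start + sortFields.size()).equals(sortFields)) {
            return false;
        }
        // Every index field before the sort fields needs an equality condition in find
        return equalityFields.containsAll(indexFields.subList(0, start));
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("a", "b", "c");
        // find({a: 5}).sort({b: 1, c: 1})
        System.out.println(sortUsesIndex(index,
                new HashSet<>(Collections.singletonList("a")), Arrays.asList("b", "c")));
        // find({a: {$gt: 2}}).sort({c: 1}): a range condition is not an equality condition
        System.out.println(sortUsesIndex(index,
                Collections.<String>emptySet(), Collections.singletonList("c")));
    }
}
```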

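Pagination in Java

Because page jumping really is required and no performance problem has shown up so far, skip is still used for pagination, with condition-based (lastId) pagination supported as well.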

public PageResult<StatByClientRs> findByDurationPage(FindByDurationPageRq rq) {
    final Criteria criteriaDefinition = Criteria.where("duration").is(rq.getDuration());
    final Query query = new Query(criteriaDefinition).with(new Sort(Lists.newArrayList(new Order(Direction.ASC, "_id"))));
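    // Pagination logic: count the matches before any skip/limit is applied
    long total = mongoTemplate.count(query, StatByClient.class);
    Integer pageSize = rq.getPageSize();
    Integer pageNum = rq.getPageNum();
    String lastId = rq.getLastId();
    final Integer pages = (int) Math.ceil(total / (double) pageSize);
    // Fall back to the first page when the requested page is out of range
    if (pageNum <= 0 || pageNum > pages) {
        pageNum = 1;
    }
    if (StringUtils.isNotBlank(lastId)) {
        // "Next page" mode: continue after the last id of the previous page
        if (pageNum != 1) {
            criteriaDefinition.and("_id").gt(new ObjectId(lastId));
        }
        query.limit(pageSize);
    } else {
        // Page-jump mode: skip the documents of the preceding pages
        int skip = pageSize * (pageNum - 1);
        query.skip(skip).limit(pageSize);
    }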
    List<StatByClient> statByClientList = mongoTemplate.find(query, StatByClient.class);
    PageResult<StatByClientRs> pageResult = new PageResult<>();
    pageResult.setTotal(total);
    pageResult.setPages(pages);
    pageResult.setPageSize(pageSize);
    pageResult.setPageNum(pageNum);
    pageResult.setList(mapper.mapToListRs(statByClientList));
    return pageResult;
}

In this example the goal is to query a list by duration and paginate the result. If the request body contains lastId, the next-page scheme is used. To jump to a page, leave lastId out and jump wherever you like.

References

Official pagination recommendation https://docs.mongodb.com/manual/reference/method/cursor.skip/

Official sort documentation https://docs.mongodb.com/manual/reference/operator/aggregation/sort/index.html

Official guide to optimizing sort with indexes https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/

Official compound index documentation https://docs.mongodb.com/manual/core/index-compound/#index-type-compound

How to look at pagination requirements properly http://www.ovaistariq.net/404/mysql-paginated-displays-how-to-kill-performance-vs-how-to-improve-performance/#.WxiEK4huaUk

http://ian.wang/35.htm
https://cnodejs.org/topic/559a0bf493cb46f578f0a601

Source: https://www.cnblogs.com/woshimrf/p/mongodb-pagenation-performance.html
