PostgreSQL , 数据透视 , cube , grouping sets , rollup
人群透视是商业与数据结合的案例之一,比如大型商场的选址,可与分析的数据包括车流、人流量等等。
结合数据可以更深入的分析人群的组成结构,消费能力等等,给大型商场的选址带来更多的参考价值。
那么如何使用数据库透视人群数据呢?
1. 人群属性表
记载了每个人的各个属性段落,比如收入、车龄、固定资产等等。如下
- create table people(
- id serial8 primary key, -- 用户ID
- c1 int2, -- 年龄分段, 假设分5个档, 使用0,1,2,3,4表示
- c2 int2, -- 个人收入分段, 假设分3个档, 使用0,1,2表示
- c3 int2, -- 车龄分段, 假设分5个档, 使用0,1,2,3,4表示
- c4 int2, -- 家庭收入分段, 假设分3个档, 使用0,1,2表示
- c5 int2, -- 固定资产分段, 假设分3个档, 使用0,1,2表示
- c6 int2 -- 存款分段, 假设分3个档, 使用0,1,2表示
- );
2. 人群动态轨迹
记录的是人群的活动位置或轨迹
使用 PostgreSQL PostGIS 插件,可以很方便的记录轨迹数据,并且支持 GIST 索引,可以快速的根据某个区域或范围搜索对应的人群。
- create table people_loc(
- id int8, -- 用户ID
- -- loc geometry, -- 位置
- crt_time timestamp -- 时间
- );
1. 生成 1000 万人群的测试数据, 其中车龄为 4, 年龄段为 4 的不插入,制造一些空洞。
- insert into people (c1,c2,c3,c4,c5,c6)
- select
- mod((random()*10)::int,4),
- mod((random()*10)::int,3),
- mod((random()*10)::int,4),
- mod((random()*10)::int,3),
- mod((random()*10)::int,3),
- mod((random()*10)::int,3)
- from generate_series(1,10000000);
- postgres=# select * from people limit 10;
- id | c1 | c2 | c3 | c4 | c5 | c6
- ----+----+----+----+----+----+----
- 1 | 2 | 1 | 3 | 0 | 1 | 2
- 2 | 0 | 0 | 1 | 0 | 1 | 0
- 3 | 2 | 1 | 0 | 2 | 0 | 2
- 4 | 1 | 0 | 0 | 0 | 1 | 2
- 5 | 3 | 2 | 2 | 1 | 2 | 1
- 6 | 1 | 2 | 0 | 0 | 1 | 1
- 7 | 2 | 1 | 0 | 1 | 0 | 0
- 8 | 1 | 1 | 0 | 1 | 0 | 2
- 9 | 3 | 0 | 3 | 1 | 2 | 1
- 10 | 3 | 2 | 2 | 0 | 2 | 1
- (10 rows)
2. 生成 1000 万人群轨迹数据
- insert into people_loc (id, crt_time)
- select random()*10000000, now()+format('%L', (500000-random()*1000000))::interval
- from generate_series(1,10000000);
- postgres=# select * from people_loc limit 10;
- id | crt_time
- ---------+----------------------------
- 7278581 | 2017-03-05 16:35:13.828435
- 3456421 | 2017-03-07 09:08:26.853477
- 976602 | 2017-03-04 18:47:49.176176
- 1996929 | 2017-03-11 08:46:31.955573
- 6590325 | 2017-03-11 14:48:55.231263
- 7252414 | 2017-03-04 08:17:28.731733
- 8763332 | 2017-03-01 15:37:11.57363
- 9426083 | 2017-03-11 17:51:46.474757
- 4399781 | 2017-03-05 08:07:45.962599
- 9049432 | 2017-03-09 14:10:42.211882
- (10 rows)
1. 选择人群
以某个点为中心、或者根据某个闭环区域,圈一部分人群,(采用 PostGIS)
这里不举例 GIS(跟兴趣的童鞋可以使用 PostGIS 测试一下,性能杠杠的),我直接以时间为度量直接圈人。
- select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date;
有人可能要问,如果这个时间段,同一个人出现了多条轨迹,怎么处理呢?
这里使用了 IN,PostgreSQL 的优化器很强大,JOIN 时数据库会自动聚合,不必在这里 GROUP BY,原理可参考如下文章。
2. 数据透视
PostgreSQL 的 SQL 兼容性非常强大,对于数据透视,可以使用 grouping sets, cube, rollup 等语法。
- select c1,c2,c3,c4,c5,c6,count(*) cnt
- from
- people
- where id in (
- select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
- )
- GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,());
- c1 | c2 | c3 | c4 | c5 | c6 | cnt
- ----+----+----+----+----+----+---------
- | 0 | | | | | 555530
- | 1 | | | | | 555525
- | 2 | | | | | 475596
- | | | | | | 1586651
- | | | 0 | | | 554079
- | | | 1 | | | 555864
- | | | 2 | | | 476708
- | | | | | 0 | 554738
- | | | | | 1 | 554843
- | | | | | 2 | 477070
- | | | | 0 | | 554552
- | | | | 1 | | 555073
- | | | | 2 | | 477026
- 0 | | | | | | 396349
- 1 | | | | | | 475616
- 2 | | | | | | 397502
- 3 | | | | | | 317184
- | | 0 | | | | 396947
- | | 1 | | | | 475504
- | | 2 | | | | 395852
- | | 3 | | | | 318348
- (21 rows)
更多透视用法参考 cube, rollup, grouping sets 用法。
目前 PostgreSQL, HybridDB, Greenplum 都支持以上语法。
3. 结果转换
使用 WITH 语法,将以上结果进行转换
- with tmp as (
- select c1,c2,c3,c4,c5,c6,count(*) cnt
- from
- people
- where id in (
- select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
- )
- GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())
- )
- select case
- when c1 is not null then 'c1_'||c1
- when c2 is not null then 'c2_'||c2
- when c3 is not null then 'c3_'||c3
- when c4 is not null then 'c4_'||c4
- when c5 is not null then 'c5_'||c5
- when c6 is not null then 'c6_'||c6
- else 'cnt' end AS col,
- t1.cnt as private,
- t2.cnt as all,
- t1.cnt::numeric/t2.cnt as ratio
- from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2
- ;
- col | private | all | ratio
- ------+---------+---------+------------------------
- c2_0 | 555530 | 1586651 | 0.35012740672019240526
- c2_1 | 555525 | 1586651 | 0.35012425542857250901
- c2_2 | 475596 | 1586651 | 0.29974833785123508572
- cnt | 1586651 | 1586651 | 1.00000000000000000000
- c4_0 | 554079 | 1586651 | 0.34921290189209851442
- c4_1 | 555864 | 1586651 | 0.35033791300040147455
- c4_2 | 476708 | 1586651 | 0.30044918510750001103
- c6_0 | 554738 | 1586651 | 0.34962824212760083976
- c6_1 | 554843 | 1586651 | 0.34969441925161866094
- c6_2 | 477070 | 1586651 | 0.30067733862078049930
- c5_0 | 554552 | 1586651 | 0.34951101407934069937
- c5_1 | 555073 | 1586651 | 0.34983937866613388830
- c5_2 | 477026 | 1586651 | 0.30064960725452541233
- c1_0 | 396349 | 1586651 | 0.24980225645085151051
- c1_1 | 475616 | 1586651 | 0.29976094301771467071
- c1_2 | 397502 | 1586651 | 0.25052894429839958504
- c1_3 | 317184 | 1586651 | 0.19990785623303423374
- c3_0 | 396947 | 1586651 | 0.25017915092859110163
- c3_1 | 475504 | 1586651 | 0.29969035408542899478
- c3_2 | 395852 | 1586651 | 0.24948901806383382357
- c3_3 | 318348 | 1586651 | 0.20064147692214608001
- (21 rows)
- Time: 8466.507 ms
perf report
- # Events: 8K cycles
- #
- # Overhead Command Shared Object Symbol
- # ........ ........ .................. ...................................................
- #
- 6.29% postgres postgres [.] comparetup_heap
- |
- --- comparetup_heap
- |
- |--41.84%-- (nil)
- |
- |--33.36%-- 0x1
- |
- |--8.44%-- 0x23e8e
- |
- |--8.43%-- 0x2
- |
- --7.93%-- 0x3
- 5.16% postgres postgres [.] slot_deform_tuple.lto_priv.1138
- |
- --- slot_deform_tuple.lto_priv.1138
- 3.82% postgres postgres [.] mergeprereadone
- |
- --- mergeprereadone
- 3.79% postgres postgres [.] qsort_ssup
- |
- --- qsort_ssup
- 3.51% postgres postgres [.] tuplesort_gettuple_common.lto_priv.1348
- |
- --- tuplesort_gettuple_common.lto_priv.1348
- |
- |--32.14%-- 0x1
- |
- |--22.28%-- 0x2
- |
- |--18.95%-- (nil)
- |
- |--11.41%-- 0x10
- |
- |--5.72%-- 0x3
- |
- |--1.91%-- 0x3d84d9
- |
- |--1.91%-- 0xef259
- |
- |--1.91%-- get_select_query_def.lto_priv.1324
- |
- |--1.91%-- 0x95c9af
- |
- --1.88%-- 0x3a0e54
4. left join 补缺 (可选)
对于空洞值,如果你要补齐的话,使用 left join 即可
- select * from (values ('c1_0'),('c1_1'),('c1_2'),('c1_3'),('c1_4'),('c2_0'),('c2_1'),('c2_2'),('c3_0'),('c3_1'),('c3_2'),('c3_3'),('c3_4'),('c4_0'),('c4_1'),('c4_2'),('c5_0'),('c5_1'),('c5_2'),('c6_0'),('c6_1'),('c6_2')) t (col);
- col
- ------
- c1_0
- c1_1
- c1_2
- c1_3
- c1_4
- c2_0
- c2_1
- c2_2
- c3_0
- c3_1
- c3_2
- c3_3
- c3_4
- c4_0
- c4_1
- c4_2
- c5_0
- c5_1
- c5_2
- c6_0
- c6_1
- c6_2
- (22 rows)
补缺如下
- with tmp as (
- select c1,c2,c3,c4,c5,c6,count(*) cnt
- from
- people
- where id in (
- select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
- )
- GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())
- ),
- tmp2 as (
- select case
- when c1 is not null then 'c1_'||c1
- when c2 is not null then 'c2_'||c2
- when c3 is not null then 'c3_'||c3
- when c4 is not null then 'c4_'||c4
- when c5 is not null then 'c5_'||c5
- when c6 is not null then 'c6_'||c6
- else 'cnt' end AS col,
- t1.cnt as private,
- t2.cnt as all,
- t1.cnt::numeric/t2.cnt as ratio
- from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2
- )
- select t1.col,coalesce(t2.ratio,0) ratio from (values ('c1_0'),('c1_1'),('c1_2'),('c1_3'),('c1_4'),('c2_0'),('c2_1'),('c2_2'),('c3_0'),('c3_1'),('c3_2'),('c3_3'),('c3_4'),('c4_0'),('c4_1'),('c4_2'),('c5_0'),('c5_1'),('c5_2'),('c6_0'),('c6_1'),('c6_2'))
- t1 (col)
- left join tmp2 t2
- on (t1.col=t2.col)
- order by t1.col;
- col | ratio
- ------+------------------------
- c1_0 | 0.24980225645085151051
- c1_1 | 0.29976094301771467071
- c1_2 | 0.25052894429839958504
- c1_3 | 0.19990785623303423374
- c1_4 | 0
- c2_0 | 0.35012740672019240526
- c2_1 | 0.35012425542857250901
- c2_2 | 0.29974833785123508572
- c3_0 | 0.25017915092859110163
- c3_1 | 0.29969035408542899478
- c3_2 | 0.24948901806383382357
- c3_3 | 0.20064147692214608001
- c3_4 | 0
- c4_0 | 0.34921290189209851442
- c4_1 | 0.35033791300040147455
- c4_2 | 0.30044918510750001103
- c5_0 | 0.34951101407934069937
- c5_1 | 0.34983937866613388830
- c5_2 | 0.30064960725452541233
- c6_0 | 0.34962824212760083976
- c6_1 | 0.34969441925161866094
- c6_2 | 0.30067733862078049930
- (22 rows)
5. 行列变换 (可选)
如果要将以上数据,多行转换为单行,可以使用 tablefunc 插件,PostgreSQL 玩法巨多哦。
- create extension tablefunc;
- select * from
- crosstab($$
- with tmp as (
- select c1,c2,c3,c4,c5,c6,count(*) cnt
- from
- people
- where id in (
- select id from people_loc where crt_time between '2017-03-06'::date and '2017-03-08'::date
- )
- GROUP BY GROUPING SETS (c1,c2,c3,c4,c5,c6,())
- ),
- tmp2 as (
- select case
- when c1 is not null then 'c1_'||c1
- when c2 is not null then 'c2_'||c2
- when c3 is not null then 'c3_'||c3
- when c4 is not null then 'c4_'||c4
- when c5 is not null then 'c5_'||c5
- when c6 is not null then 'c6_'||c6
- else 'cnt' end AS col,
- t1.cnt as private,
- t2.cnt as all,
- t1.cnt::numeric/t2.cnt as ratio
- from tmp t1, (select cnt from tmp where tmp.c1 is null and tmp.c2 is null and tmp.c3 is null and tmp.c4 is null and tmp.c5 is null and tmp.c6 is null) t2
- )
- select 'row'::text , t1.col,coalesce(t2.ratio,0) ratio from (values ('c1_0'),('c1_1'),('c1_2'),('c1_3'),('c1_4'),('c2_0'),('c2_1'),('c2_2'),('c3_0'),('c3_1'),('c3_2'),('c3_3'),('c3_4'),('c4_0'),('c4_1'),('c4_2'),('c5_0'),('c5_1'),('c5_2'),('c6_0'),('c6_1'),('c6_2'))
- t1 (col)
- left join tmp2 t2
- on (t1.col=t2.col)
- order by t1.col
- $$
- )
- as
- (
- row text,
- c1_0 numeric,
- c1_1 numeric,
- c1_2 numeric,
- c1_3 numeric,
- c1_4 numeric,
- c2_0 numeric,
- c2_1 numeric,
- c2_2 numeric,
- c3_0 numeric,
- c3_1 numeric,
- c3_2 numeric,
- c3_3 numeric,
- c3_4 numeric,
- c4_0 numeric,
- c4_1 numeric,
- c4_2 numeric,
- c5_0 numeric,
- c5_1 numeric,
- c5_2 numeric,
- c6_0 numeric,
- c6_1 numeric,
- c6_2 numeric
- );
- row | c1_0 | c1_1 | c1_2 | c1_3 | c1_4 | c2_0 | c2_1 | c2_2 | c3_0 | c3_1
- | c3_2 | c3_3 | c3_4 | c4_0 | c4_1 | c4_2 | c5_0 | c5_1 | c5_2 | c6_0 |
- c6_1 | c6_2
- -----+------------------------+------------------------+------------------------+------------------------+------+------------------------+------------------------+------------------------+------------------------+------------------------
- +------------------------+------------------------+------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----
- --------------------+------------------------
- row | 0.24980225645085151051 | 0.29976094301771467071 | 0.25052894429839958504 | 0.19990785623303423374 | 0 | 0.35012740672019240526 | 0.35012425542857250901 | 0.29974833785123508572 | 0.25017915092859110163 | 0.29969035408542899478
- | 0.24948901806383382357 | 0.20064147692214608001 | 0 | 0.34921290189209851442 | 0.35033791300040147455 | 0.30044918510750001103 | 0.34951101407934069937 | 0.34983937866613388830 | 0.30064960725452541233 | 0.34962824212760083976 | 0.3
- 4969441925161866094 | 0.30067733862078049930
- (1 row)
1. 关于索引(BRIN, GIST, BTREE_GIST)
通常我们会限定两个维度,筛选人群,1 时间范围,2 地理位置范围。
由于轨迹数据通常是时间和堆的线性相关性很好的,所以,在索引方面,可以使用 BRIN 索引。
brin 索引详见
而对于地理位置,如果要进行快速筛选的话,可以建立 GIST 索引
如果要建立两者的复合索引,可以使用 btree_gist 插件,那么时间和地理位置就能放在一个 GIST 索引中了。
- create extension btree_gist;
2. 递归优化
如果轨迹点很多,但是大多数为重复人群,可使用递归优化 IN 查询
参考
3. case when 优化,在使用本例的 cube,grouping sets,rollup 前,或者其他不支持数据透视语法的数据库中,可以使用 case when 的方法来聚合,但是每条数据都要经过 case when 的计算,耗费很大的 CPU。
- select sum(
- case when c1 = 0 then 1
- else 0 end) / (count( * )) : :numeric as c1_0,
- sum(
- case when c1 = 1 then 1
- else 0 end) / (count( * )) : :numeric as c1_1,
- sum(
- case when c1 = 2 then 1
- else 0 end) / (count( * )) : :numeric as c1_2,
- sum(
- case when c1 = 3 then 1
- else 0 end) / (count( * )) : :numeric as c1_3,
- sum(
- case when c1 = 4 then 1
- else 0 end) / (count( * )) : :numeric as c1_4,
- sum(
- case when c2 = 0 then 1
- else 0 end) / (count( * )) : :numeric as c2_0,
- sum(
- case when c2 = 1 then 1
- else 0 end) / (count( * )) : :numeric as c2_1,
- sum(
- case when c2 = 2 then 1
- else 0 end) / (count( * )) : :numeric as c2_2,
- sum(
- case when c3 = 0 then 1
- else 0 end) / (count( * )) : :numeric as c3_0,
- sum(
- case when c3 = 1 then 1
- else 0 end) / (count( * )) : :numeric as c3_1,
- sum(
- case when c3 = 2 then 1
- else 0 end) / (count( * )) : :numeric as c3_2,
- sum(
- case when c3 = 3 then 1
- else 0 end) / (count( * )) : :numeric as c3_3,
- sum(
- case when c3 = 4 then 1
- else 0 end) / (count( * )) : :numeric as c3_4,
- sum(
- case when c4 = 0 then 1
- else 0 end) / (count( * )) : :numeric as c4_0,
- sum(
- case when c4 = 1 then 1
- else 0 end) / (count( * )) : :numeric as c4_1,
- sum(
- case when c4 = 2 then 1
- else 0 end) / (count( * )) : :numeric as c4_2,
- sum(
- case when c5 = 0 then 1
- else 0 end) / (count( * )) : :numeric as c5_0,
- sum(
- case when c5 = 1 then 1
- else 0 end) / (count( * )) : :numeric as c5_1,
- sum(
- case when c5 = 2 then 1
- else 0 end) / (count( * )) : :numeric as c5_2,
- sum(
- case when c6 = 0 then 1
- else 0 end) / (count( * )) : :numeric as c6_0,
- sum(
- case when c6 = 1 then 1
- else 0 end) / (count( * )) : :numeric as c6_1,
- sum(
- case when c6 = 2 then 1
- else 0 end) / (count( * )) : :numeric as c6_2 from people where id in (select id from people_loc where crt_time between '2017-03-06': :date and '2017-03-08': :date);
- c1_0 | c1_1 | c1_2 | c1_3 | c1_4 | c2_0 | c2_1 | c2_2 | c3_0 | c3_1 | c3_2 | c3_3 | c3_4 | c4_0 | c4_1 | c4_2 | c5_0 | c5_1 | c5_2 | c6_0 | c6_1 | c6_2------------------------+------------------------+------------------------+------------------------+----------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+----------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------+------------------------0.24980225645085151051 | 0.29976094301771467071 | 0.25052894429839958504 | 0.19990785623303423374 | 0.000000000000000000000000 | 0.35012740672019240526 | 0.35012425542857250901 | 0.29974833785123508572 | 0.25017915092859110163 | 0.29969 035408542899478 | 0.24948901806383382357 | 0.20064147692214608001 | 0.000000000000000000000000 | 0.34921290189209851442 | 0.35033791300040147455 | 0.30044918510750001103 | 0.34951101407934069937 | 0.34983937866613388830 | 0.3006496072545 2541233 | 0.34962824212760083976 | 0.34969441925161866094 | 0.30067733862078049930(1 row) Time: 8282.168 ms
perf report
- # Events: 8K cycles
- #
- # Overhead Command Shared Object Symbol
- # ........ ........ .................. ...................................................
- #
- 12.15% postgres postgres [.] ExecMakeFunctionResultNoSets
- |
- --- ExecMakeFunctionResultNoSets
- |
- --100.00%-- (nil)
- 7.11% postgres postgres [.] ExecEvalCase
- |
- --- ExecEvalCase
- |
- --100.00%-- (nil)
- 6.85% postgres postgres [.] ExecTargetList.isra.6.lto_priv.1346
- |
- --- ExecTargetList.isra.6.lto_priv.1346
- 5.43% postgres postgres [.] ExecProject.constprop.414
- |
- --- ExecProject.constprop.414
- 5.37% postgres postgres [.] ExecEvalScalarVarFast
- |
- --- ExecEvalScalarVarFast
- 4.35% postgres postgres [.] slot_getattr
- |
- --- slot_getattr
- 4.13% postgres postgres [.] advance_aggregates
- |
- --- advance_aggregates
- 3.43% postgres postgres [.] slot_deform_tuple.lto_priv.1138
- |
- --- slot_deform_tuple.lto_priv.1138
- 3.12% postgres postgres [.] ExecClearTuple
- |
- --- ExecClearTuple
- 2.82% postgres postgres [.] IndexNext
- |
- --- IndexNext
- 2.45% postgres postgres [.] ExecEvalConst
- |
- --- ExecEvalConst
- |
- --100.00%-- (nil)
1. 语法 cube, grouping sets, rollup 给数据透视提供了比较好的便利。
2. 行列变换可以使用 tablefunc 插件。
3. case when 过多时,对 CPU 的开销会比较大。
4. 结合 PostGIS 可以很方便的基于地理位置和时间维度,分析人群特性。
5. 阿里云 HybridDB, PostgreSQL 都能提供以上功能,其中 HybridDB 为分布式数据仓库。
来源: https://yq.aliyun.com/articles/71875