转载自 dachuan
随着双十一数据量的暴增,之前用 distinct 去重可以简单处理的场景,现在消耗的时间成倍增长。如果用了 multiple distinct,那就更要警惕,因为多重去重本身会带来数据量的成倍增长,很可能 10 分钟的任务,在双十一期间会跑上几个小时都没有结果。
这里介绍一个小技巧,其实在稳定性手册里面已经有过介绍,不过总感觉没有看懂。最近正好做了一次优化,于是在这里小结一下:
例如原来的代码是这样:
- select D1,
- D2,
- count(distinct
- case when A is not null then B end) as B_distinct_cnt from xxx group by D1,
- D2
那么优化方案可以是这样:
- create table tmp1 as select D1,
- D2,
- B,
- count(
- case when A is not null then B end) as B_cnt from xxx group by D1,
- D1,
- B select D1,
- D2,
- sum(
- case when B_cnt > 0 then 1
- else 0 end) as B_distinct_cnt from tmp1 group by D1,
- D2
多重去重的优化也可以采用上面的方案,只是要注意 Group By 的 Key 是以源表聚合维度为基础,根据 distinct 计算的值进行组合。
例如下面的这个例子:
- select D1,
- D2,
- count(distinct
- case when A is not null then B end) as B_distinct_cnt,
- count(distinct
- case when E is not null then C end) as C_distinct_cnt from xxx group by D1,
- D2
那么优化方案可以是:
- create table tmp1
- as
- select D1,D2,B,
- count( case when A is not null then B end ) as B_cnt
- from xxx
- group by D1, D1, B
- create table tmp1_1
- as
- select D1,D2,
- sum(case when B_cnt > 0 then 1 else 0 end) as B_distinct_cnt
- from tmp1
- group by D1,D2
- create table tmp2
- as
- select D1,D2,C,
- count( case when E is not null then C end ) as C_cnt
- from xxx
- group by D1, D1, C
- create table tmp2_1
- as
- select D1,D2,
- sum(case when C_cnt > 0 then 1 else 0 end) as C_distinct_cnt
- from tmp1
- group by D1,D2
- select
- t1.D1,t1.D2,
- t1.B_distinct_cnt,
- t2.C_distinct_cnt
- from tmp1_1 t1
- left outer join tmp2_1 t2
- on t1.D1=t2.D1 and t1.D2=t2.D2
来源: http://www.bubuko.com/infodetail-1985024.html