How to deduplicate a Hive table after using UNION ALL

The business scenario is roughly as follows. There are two Hive tables, tableA and tableB, and both have the same columns:

uid cate1 cate2

In HiveQL, union (as opposed to union all) removes duplicates automatically, but only when entire rows are identical. What we need here is deduplication by uid.

In other words, rows like these can both exist:

1234 teacher singing

1234 teacher dancing

Of these two rows, we want to keep only one.
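To make the difference concrete, here is a small Python sketch (not Hive itself) that mimics row-level deduplication versus deduplication by uid. The sample rows and variable names are made up for illustration, and the choice of which row survives for a uid (here, the first one seen) is an assumption of the sketch.

```python
# Illustrative rows (uid, cate1, cate2); the values are made up for this sketch.
rows = [
    ("1234", "teacher", "singing"),
    ("1234", "teacher", "dancing"),
    ("1234", "teacher", "dancing"),  # exact duplicate of the previous row
]

# Row-level deduplication (what union does): only exact duplicates collapse.
distinct_rows = sorted(set(rows))

# Deduplication by uid: at most one row per uid survives (here, the first seen).
by_uid = {}
for uid, cate1, cate2 in rows:
    by_uid.setdefault(uid, (uid, cate1, cate2))
rows_by_uid = list(by_uid.values())

print(len(distinct_rows))  # 2
print(len(rows_by_uid))    # 1
```

Row-level deduplication still leaves two rows for uid 1234 because they differ in the last column, which is why an extra step is needed.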

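Our approach is roughly this: when selecting from the two tables, we manually add a flag column that records which table each row came from, and then use Python code to decide which row to keep based on that flag.

To do the deduplication we wrote two pieces of code: a shell script that fetches the Hive data, and a Python script that processes it.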

vim get_data.sh
# Print the HiveQL that tags each table's rows with a flag and streams the
# sorted rows through process.py.
function merge(){
cat <<EOF
add file ./process.py;
    select transform(a.*) using 'python process.py' as uid,cate1,cate2 from
    (select * from
    (select uid,cate1,cate2,"0" as flag from tableA where dt='sth1'
    union all
    select uid,cate1,cate2,"1" as flag from tableB where dt='sth2'
    )ts
    distribute by uid sort by uid,flag asc
    )a
EOF
}

One part of this query deserves special attention:

distribute by uid sort by uid,flag asc

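To understand this clause, I specifically looked up a reference explanation of it.

Simply put, distribute by uid means that all rows with the same uid are sent to the same reducer. sort by uid,flag asc then sorts the rows inside each reducer, so rows sharing a uid reach the script next to each other, in ascending flag order.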
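The effect of this clause can be imitated in plain Python. The following is only a sketch of the idea, not how Hive implements it: the function name distribute_and_sort, the num_reducers parameter, and the CRC32-based partitioner are all assumptions made for illustration, and the sample rows are invented.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers=2):
    """Mimic `distribute by uid sort by uid,flag` on (uid, cate1, cate2, flag) rows."""
    reducers = defaultdict(list)
    for row in rows:
        uid = row[0]
        # Hypothetical partitioner: any deterministic hash of uid serves the sketch.
        reducer_id = zlib.crc32(uid.encode("utf-8")) % num_reducers
        reducers[reducer_id].append(row)
    # Each reducer sorts only its own rows, by (uid, flag).
    return {
        reducer_id: sorted(part, key=lambda row: (row[0], row[3]))
        for reducer_id, part in reducers.items()
    }


rows = [
    ("1234", "teacher", "dancing", "1"),
    ("5678", "student", "reading", "0"),
    ("1234", "teacher", "singing", "0"),
]
partitions = distribute_and_sort(rows)
for reducer_id, part in sorted(partitions.items()):
    print(reducer_id, part)
```

Whatever reducer uid 1234 lands on, both of its rows land there together and appear with flag 0 before flag 1, which is the property the processing script relies on.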

vim process.py

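#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys


def set_values(value):
    # Convert a digit-only string to int; anything else becomes 0.
    if value.isdigit():
        return int(value)
    else:
        return 0


lastuid = ""
cate1 = ""
cate2 = ""
flag = ""
# Input is sorted by uid and then flag, so rows with the same uid are adjacent.
for line in sys.stdin:
    line = line.replace("\n", "").replace(" ", "")
    v = line.split("\t")
    uid = v[0]
    # Skip malformed rows: non-numeric uid or wrong number of columns.
    if not uid.isdigit() or len(v) != 4:
        continue
    # The uid changed: emit the row remembered for the previous uid.
    if lastuid != "" and lastuid != uid:
        print(lastuid + "\t" + str(cate1) + "\t" + str(cate2))
    # Remember the current row; a later row with the same uid overwrites it,
    # so the row with the largest flag is the one that is kept.
    cate1 = v[1]
    cate2 = v[2]
    flag = v[3]
    lastuid = uid
# Emit the last uid, much like the classic Python word-count reducer example.
if lastuid != "":
    print(lastuid + "\t" + str(cate1) + "\t" + str(cate2))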
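Before running the job on the cluster, it can help to check the grouping logic locally. The sketch below is an alternative formulation built on itertools.groupby, not the script above; the function name keep_last_per_uid and the sample lines are made up for illustration. It assumes tab-separated input that is already sorted by uid and then flag, skips malformed lines, and keeps the last row of each uid.

```python
from itertools import groupby


def keep_last_per_uid(lines):
    """Yield 'uid<TAB>cate1<TAB>cate2' for the last row of each run of equal uids.

    Assumes tab-separated (uid, cate1, cate2, flag) lines that are already
    sorted by uid and then flag.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    # Drop malformed rows: wrong column count or non-numeric uid.
    valid = (v for v in parsed if len(v) == 4 and v[0].isdigit())
    for uid, group in groupby(valid, key=lambda v: v[0]):
        last = list(group)[-1]
        yield "\t".join(last[:3])


sample = [
    "1234\tteacher\tsinging\t0\n",
    "1234\tteacher\tdancing\t1\n",
    "not-a-uid\tx\ty\t0\n",
    "5678\tstudent\treading\t0\n",
]
result = list(keep_last_per_uid(sample))
print(result)  # ['1234\tteacher\tdancing', '5678\tstudent\treading']
```

To keep the row with the smallest flag instead, take the first element of each group rather than the last.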

Source: http://www.bubuko.com/infodetail-2988080.html
