案例1:
==效果等同于一行轉多行
資料表名稱:zhangb.gid_tags
資料格式,每行是2個字段,(gid,tags) ,可能有髒資料,分隔符為“\t”,
ANDROID-9de77225cadb4319adfc1b1fe51c54f0 h65010000,014200,018100,011300,0200
ANDROID-9de77dfdbbab42679eed11f4e48c0ffc 022000,026400,022400,016400,018100
要統計tags中每個tag出現的次數:使用hive python方法:
第一步:寫python處理腳本:腳本名字:hivepython.py
import sys
for line in sys.stdin:
appArray=line.strip( ).split("\t")
if len(appArray)==2:
gid=appArray[0]
tags=appArray[1].split(",")
for tag in tags:
print tag
腳本輸出是1個字段 tag
第二步,将python腳本加進hive 中:
add file /home/zhangb/hivepython.py
第三步:運作hive語句:
select b.* from (
select a.counter_attr as tag,count(1) as tag_count from (
select TRANSFORM(gid,usertags) USING 'python hivepython.py' as counter_attr from zhangb.gid_tags) a
group by a.counter_attr) b
sort by b.tag;
案例2:
案例1是python處理後輸出了1個字段,下面是輸出多個字段的形式:
【參考:http://www.cnblogs.com/Joe-T/p/4643452.html】
Hive中有一表,列分隔符為冒号(:),有一列utime是Timestamp格式,需要轉成Weekday存到新表。
利用Python寫一個Pipeline的Transform,weekday.py的代碼也很簡單:
import sys
import datetime
for line in sys.stdin:
line=line.strip()
uid,mid,rating,utime=line.split(':')
weekday=datetime.datetime.fromtimestamp(float(utime)).isoweekday()
print '\t'.join([uid,mid,rating,str(weekday)])
HQL的查詢也很簡單:
select
transform(uid,mid,rating,utime)
using 'python weekday.py' as (uid,mid,rating,weekday)
from rating