
A Worked Example of Log Analysis with Hive

The logs are in the Apache combined log format: client IP, identity, remote user, timestamp, request line, status code, response size in bytes, referer, and user agent. For example:

36.248.169.9 - - [22/Sep/2013:01:21:45 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22166 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

36.248.169.9 - - [22/Sep/2013:01:21:48 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 520 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

36.248.169.9 - - [22/Sep/2013:01:21:51 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22172 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

220.181.89.164 - - [22/Sep/2013:01:22:31 +0800] "GET /category/data-mining/ HTTP/1.1" 200 15675 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"

In a pseudo-distributed environment:

(1) a) mapper_hive.py is as follows:

#!/usr/bin/env python
# mapper_hive.py: emit ip, timestamp and request path for each log line
import sys

# input comes from STDIN
for line in sys.stdin:
    data = line.split()
    if len(data) < 7:   # skip malformed lines
        continue
    # data[0]=ip, data[3]='[dd/MMM/yyyy:HH:mm:ss' (strip the '['), data[6]=path
    print "ip=%s|time=%s|path=%s" % (data[0], data[3][1:], data[6])
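The mapper can be smoke-tested locally before submitting the job by piping a log file through it (sample.log here is a hypothetical file holding the lines shown above):

$ python mapper_hive.py < sample.log | head -1
ip=36.248.169.9|time=22/Sep/2013:01:21:45|path=/mapreduce/hadoop-terasort-analyse/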

b) Upload the local logs to HDFS (both commands below are run from the hadoop-1.0.4 installation directory, where the streaming jar lives):

$ bin/hadoop fs -put ~/Sep-2013 /user/user/hive_hwork/Sep-2013

c) Produce the filtered output with Hadoop Streaming:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -mapper ~/Desktop/mapper_hive.py \
    -input /user/user/hive_hwork/Sep-2013/*/* \
    -output /user/user/hive_hwork/Sep_2013_filter
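In pseudo-distributed mode the local path passed to -mapper happens to resolve on the single task node. On a real cluster the script must be shipped with the job via the streaming -file option; a sketch, not the original command:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file ~/Desktop/mapper_hive.py \
    -mapper mapper_hive.py \
    -input /user/user/hive_hwork/Sep-2013/*/* \
    -output /user/user/hive_hwork/Sep_2013_filter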

(2) a) Create the table s_access_log. Each filtered record has the form ip=...|time=...|path=..., so the whole line is stored as a single map<string,string> column: '|' separates map entries and '=' separates each key from its value.

create table if not exists s_access_log(
    log map<string,string>)
row format delimited
collection items terminated by '|'
map keys terminated by '='
stored as textfile;

hive> show tables;
OK
s_access_log
Time taken: 4.09 seconds
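Once data is loaded (step 3 below), individual fields are addressed by map key; a quick sketch, output omitted:

select log['ip'] as ip, log['time'] as ts, log['path'] as path
from s_access_log
limit 3;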

(3) Load the data:

a)
hive> load data inpath '/user/user/hive_hwork/Sep_2013_filter/part-00000' into table s_access_log;
Loading data to table default.s_access_log
OK
Time taken: 0.605 seconds

hive> select * from s_access_log limit 5;
OK
{"ip":"1.10.213.225","time":"25/Sep/2013:21:09:38","path":"/wp-login.php\t"}
{"ip":"1.10.213.225","time":"25/Sep/2013:21:29:55","path":"/wp-login.php\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:17","path":"/?author=1\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:18","path":"/?feed=rss2\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:19","path":"/feed/\t"}
Time taken: 0.28 seconds

Note the trailing \t on every path: the mapper's output lines contain no tab, so the streaming job treats each whole line as the key with an empty value, and TextOutputFormat appends its tab separator when writing the part files.
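Where that trailing tab would skew path-level statistics, it can be stripped on the fly with Hive's regexp_replace; a sketch:

select regexp_replace(log['path'], '\\t$', '') as path
from s_access_log
limit 1;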

(4) Statistics

1) Number of distinct IPs:

hive> select count(distinct log['ip']) from s_access_log;
OK
13349

2) Daily PV (page views; duplicates are not filtered out). substring(log["time"],0,11) keeps the first 11 characters, i.e. the dd/MMM/yyyy date part (Hive treats a start position of 0 as 1, as the output confirms):

hive> select substring(log["time"],0,11),count(*) from s_access_log group by substring(log["time"],0,11);
20/Sep/2013 78975
21/Sep/2013 81408
22/Sep/2013 97257
23/Sep/2013 121201
24/Sep/2013 115658
25/Sep/2013 129944
26/Sep/2013 149390
27/Sep/2013 123892
28/Sep/2013 103551
29/Sep/2013 150813
30/Sep/2013 777
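The rows come back in date order only because all of the data falls within a single month; dd/MMM/yyyy strings do not sort chronologically in general. A sketch that reworks the key into a sortable yyyy-MM-dd form, assuming Hive's unix_timestamp/from_unixtime UDFs:

select from_unixtime(unix_timestamp(substring(log["time"],0,11), 'dd/MMM/yyyy'), 'yyyy-MM-dd') as day,
       count(*) as pv
from s_access_log
group by from_unixtime(unix_timestamp(substring(log["time"],0,11), 'dd/MMM/yyyy'), 'yyyy-MM-dd')
order by day;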

3) Daily UV (distinct IPs per day):

hive> select count(distinct(log["ip"])),substring(log["time"],0,11) from s_access_log group by substring(log["time"],0,11);
1776 20/Sep/2013
2301 21/Sep/2013
2420 22/Sep/2013
2855 23/Sep/2013
2755 24/Sep/2013
3246 25/Sep/2013
2416 26/Sep/2013
2514 27/Sep/2013
1837 28/Sep/2013
2278 29/Sep/2013
52 30/Sep/2013
Time taken: 42.539 seconds

4) Average number of distinct URLs visited per day by the user at IP 175.44.30.93. First store the per-day distinct-URL counts in a table t_url_perday:

create table t_url_perday as
select count(distinct(log["path"])) as url, substring(log["time"],0,11) as time
from s_access_log
where log["ip"]="175.44.30.93"
group by substring(log["time"],0,11);

Then average over that table:

select avg(url) from t_url_perday;
45.0
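The intermediate table is convenient for reuse, but the same figure can be computed in a single statement with a subquery; a sketch:

select avg(t.url)
from (select count(distinct(log["path"])) as url
      from s_access_log
      where log["ip"]="175.44.30.93"
      group by substring(log["time"],0,11)) t;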

5) Distinct paths visited by 175.44.30.93 between 00:10:16 and 09:10:16 on 29/Sep/2013:

hive> select distinct(log["path"]) from s_access_log where log["ip"]="175.44.30.93" and log["time"]>="29/Sep/2013:00:10:16" and log["time"]<="29/Sep/2013:09:10:16";
/apache-mesos/meso-architecture
/apache-mesos/meso-architecture/
/cluster-managemant/puppet
/cluster-managemant/puppet/
/hadoop-corona/corona-clustermanager-details
/hadoop-corona/corona-clustermanager-details/
/mapreduce-nextgen/apache-hadoop-2-0-3-published
/mapreduce-nextgen/apache-hadoop-2-0-3-published/
/mapreduce-nextgen/client-codes
/mapreduce-nextgen/client-codes/
/mapreduce-nextgen/hadoop-2-0-terms-explained
/mapreduce-nextgen/hadoop-2-0-terms-explained/
/mapreduce-nextgen/hadoop-internals-mapreduce-reference
/mapreduce-nextgen/hadoop-internals-mapreduce-reference/
/mapreduce-nextgen/hadoop-jira-yarn-3
/mapreduce-nextgen/hadoop-jira-yarn-3/
/mapreduce-nextgen/hadoop-jira-yarn-392
/mapreduce-nextgen/hadoop-jira-yarn-392/
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system/
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator/
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager/
/mapreduce/hadoop-mapreduce-failure-tolerance
/mapreduce/hadoop-mapreduce-failure-tolerance/
/mapreduce/hadoop-rumen-introduction
/mapreduce/hadoop-rumen-introduction/
/mapreduce/hadoop-streaming-programming
/mapreduce/hadoop-streaming-programming/
/mapreduce/hadoop-terasort-analyse
/mapreduce/hadoop-terasort-analyse/
/mapreduce/hdfs-small-files-solution
/mapreduce/hdfs-small-files-solution/
/search-engine/kafka
/search-engine/kafka/
/search-engine/scribe-intro
/search-engine/scribe-intro/
/structure/heap
/structure/heap/
/structure/segment-tree
/structure/segment-tree/
/structure/suffix-array
/structure/suffix-array/
/wp-comments-post.php
Time taken: 34.821 seconds
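Lexicographic comparison of dd/MMM/yyyy:HH:mm:ss strings is safe here only because both bounds share the same date; in general the timestamps should be converted to epoch seconds first. A sketch:

select distinct(log["path"])
from s_access_log
where log["ip"]="175.44.30.93"
  and unix_timestamp(log["time"], 'dd/MMM/yyyy:HH:mm:ss')
      between unix_timestamp('29/Sep/2013:00:10:16', 'dd/MMM/yyyy:HH:mm:ss')
          and unix_timestamp('29/Sep/2013:09:10:16', 'dd/MMM/yyyy:HH:mm:ss');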
