
A Worked Example of Log Analysis with Hive

The logs are in the Apache combined log format: client IP, identity, remote user, timestamp, request line, status code, response size in bytes, referer, and user agent. For example:

36.248.169.9 - - [22/Sep/2013:01:21:45 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22166 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

36.248.169.9 - - [22/Sep/2013:01:21:48 +0800] "POST /wp-comments-post.php HTTP/1.1" 302 520 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

36.248.169.9 - - [22/Sep/2013:01:21:51 +0800] "GET /mapreduce/hadoop-terasort-analyse/ HTTP/1.1" 200 22172 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

36.248.169.9 - - [22/Sep/2013:01:22:34 +0800] "GET /mapreduce/hadoop-terasort-analyse HTTP/1.1" 301 425 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;)"

220.181.89.164 - - [22/Sep/2013:01:22:31 +0800] "GET /category/data-mining/ HTTP/1.1" 200 15675 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"

In a pseudo-distributed environment:

(1) a) mapper_hive.py is as follows:

#!/usr/bin/env python
# mapper_hive.py: emit ip, timestamp and request path for each log line
import sys

# input comes from STDIN
for line in sys.stdin:
    data = line.split()
    if len(data) < 7:   # skip malformed lines
        continue
    # data[0]=ip, data[3]='[dd/MMM/yyyy:HH:mm:ss' (strip the '['), data[6]=path
    print "ip=%s|time=%s|path=%s" % (data[0], data[3][1:], data[6])
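The mapper can be smoke-tested locally before submitting the job by piping a log file through it (sample.log here is a hypothetical file holding the lines shown above):

$ python mapper_hive.py < sample.log | head -1
ip=36.248.169.9|time=22/Sep/2013:01:21:45|path=/mapreduce/hadoop-terasort-analyse/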

b) Upload the local logs to HDFS (both commands below are run from the hadoop-1.0.4 installation directory, where the streaming jar lives):

$ bin/hadoop fs -put ~/Sep-2013 /user/user/hive_hwork/Sep-2013

c) Produce the filtered output with Hadoop Streaming:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -mapper ~/Desktop/mapper_hive.py \
    -input /user/user/hive_hwork/Sep-2013/*/* \
    -output /user/user/hive_hwork/Sep_2013_filter
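In pseudo-distributed mode the local path passed to -mapper happens to resolve on the single task node. On a real cluster the script must be shipped with the job via the streaming -file option; a sketch, not the original command:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file ~/Desktop/mapper_hive.py \
    -mapper mapper_hive.py \
    -input /user/user/hive_hwork/Sep-2013/*/* \
    -output /user/user/hive_hwork/Sep_2013_filter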

(2) a) Create the table s_access_log. Each filtered record has the form ip=...|time=...|path=..., so the whole line is stored as a single map<string,string> column: '|' separates map entries and '=' separates each key from its value.

create table if not exists s_access_log(
    log map<string,string>)
row format delimited
collection items terminated by '|'
map keys terminated by '='
stored as textfile;

hive> show tables;
OK
s_access_log
Time taken: 4.09 seconds
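Once data is loaded (step 3 below), individual fields are addressed by map key; a quick sketch, output omitted:

select log['ip'] as ip, log['time'] as ts, log['path'] as path
from s_access_log
limit 3;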

(3) Load the data:

a)
hive> load data inpath '/user/user/hive_hwork/Sep_2013_filter/part-00000' into table s_access_log;
Loading data to table default.s_access_log
OK
Time taken: 0.605 seconds

hive> select * from s_access_log limit 5;
OK
{"ip":"1.10.213.225","time":"25/Sep/2013:21:09:38","path":"/wp-login.php\t"}
{"ip":"1.10.213.225","time":"25/Sep/2013:21:29:55","path":"/wp-login.php\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:17","path":"/?author=1\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:18","path":"/?feed=rss2\t"}
{"ip":"1.162.173.111","time":"28/Sep/2013:14:37:19","path":"/feed/\t"}
Time taken: 0.28 seconds

Note the trailing \t on every path: the mapper's output lines contain no tab, so the streaming job treats each whole line as the key with an empty value, and TextOutputFormat appends its tab separator when writing the part files.
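Where that trailing tab would skew path-level statistics, it can be stripped on the fly with Hive's regexp_replace; a sketch:

select regexp_replace(log['path'], '\\t$', '') as path
from s_access_log
limit 1;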

(4) Statistics

1) Number of distinct IPs:

hive> select count(distinct log['ip']) from s_access_log;
OK
13349

2) Daily PV (page views; duplicates are not filtered out). substring(log["time"],0,11) keeps the first 11 characters, i.e. the dd/MMM/yyyy date part (Hive treats a start position of 0 as 1, as the output confirms):

hive> select substring(log["time"],0,11),count(*) from s_access_log group by substring(log["time"],0,11);
20/Sep/2013 78975
21/Sep/2013 81408
22/Sep/2013 97257
23/Sep/2013 121201
24/Sep/2013 115658
25/Sep/2013 129944
26/Sep/2013 149390
27/Sep/2013 123892
28/Sep/2013 103551
29/Sep/2013 150813
30/Sep/2013 777
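The rows come back in date order only because all of the data falls within a single month; dd/MMM/yyyy strings do not sort chronologically in general. A sketch that reworks the key into a sortable yyyy-MM-dd form, assuming Hive's unix_timestamp/from_unixtime UDFs:

select from_unixtime(unix_timestamp(substring(log["time"],0,11), 'dd/MMM/yyyy'), 'yyyy-MM-dd') as day,
       count(*) as pv
from s_access_log
group by from_unixtime(unix_timestamp(substring(log["time"],0,11), 'dd/MMM/yyyy'), 'yyyy-MM-dd')
order by day;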

3) Daily UV (distinct IPs per day):

hive> select count(distinct(log["ip"])),substring(log["time"],0,11) from s_access_log group by substring(log["time"],0,11);
1776 20/Sep/2013
2301 21/Sep/2013
2420 22/Sep/2013
2855 23/Sep/2013
2755 24/Sep/2013
3246 25/Sep/2013
2416 26/Sep/2013
2514 27/Sep/2013
1837 28/Sep/2013
2278 29/Sep/2013
52 30/Sep/2013
Time taken: 42.539 seconds

4) Average number of distinct URLs visited per day by the user at IP 175.44.30.93. First store the per-day distinct-URL counts in a table t_url_perday:

create table t_url_perday as
select count(distinct(log["path"])) as url, substring(log["time"],0,11) as time
from s_access_log
where log["ip"]="175.44.30.93"
group by substring(log["time"],0,11);

Then average over that table:

select avg(url) from t_url_perday;
45.0
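The intermediate table is convenient for reuse, but the same figure can be computed in a single statement with a subquery; a sketch:

select avg(t.url)
from (select count(distinct(log["path"])) as url
      from s_access_log
      where log["ip"]="175.44.30.93"
      group by substring(log["time"],0,11)) t;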

5) Distinct paths visited by 175.44.30.93 between 00:10:16 and 09:10:16 on 29/Sep/2013:

hive> select distinct(log["path"]) from s_access_log where log["ip"]="175.44.30.93" and log["time"]>="29/Sep/2013:00:10:16" and log["time"]<="29/Sep/2013:09:10:16";
/apache-mesos/meso-architecture
/apache-mesos/meso-architecture/
/cluster-managemant/puppet
/cluster-managemant/puppet/
/hadoop-corona/corona-clustermanager-details
/hadoop-corona/corona-clustermanager-details/
/mapreduce-nextgen/apache-hadoop-2-0-3-published
/mapreduce-nextgen/apache-hadoop-2-0-3-published/
/mapreduce-nextgen/client-codes
/mapreduce-nextgen/client-codes/
/mapreduce-nextgen/hadoop-2-0-terms-explained
/mapreduce-nextgen/hadoop-2-0-terms-explained/
/mapreduce-nextgen/hadoop-internals-mapreduce-reference
/mapreduce-nextgen/hadoop-internals-mapreduce-reference/
/mapreduce-nextgen/hadoop-jira-yarn-3
/mapreduce-nextgen/hadoop-jira-yarn-3/
/mapreduce-nextgen/hadoop-jira-yarn-392
/mapreduce-nextgen/hadoop-jira-yarn-392/
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system
/mapreduce-nextgen/use-hadoop-yarn-to-build-a-cloud-system/
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator
/mapreduce-nextgen/yarnmrv2-mrappmaster-containerallocator/
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager
/mapreduce-nextgen/yarnmrv2-resource-manager-nm-manager/
/mapreduce/hadoop-mapreduce-failure-tolerance
/mapreduce/hadoop-mapreduce-failure-tolerance/
/mapreduce/hadoop-rumen-introduction
/mapreduce/hadoop-rumen-introduction/
/mapreduce/hadoop-streaming-programming
/mapreduce/hadoop-streaming-programming/
/mapreduce/hadoop-terasort-analyse
/mapreduce/hadoop-terasort-analyse/
/mapreduce/hdfs-small-files-solution
/mapreduce/hdfs-small-files-solution/
/search-engine/kafka
/search-engine/kafka/
/search-engine/scribe-intro
/search-engine/scribe-intro/
/structure/heap
/structure/heap/
/structure/segment-tree
/structure/segment-tree/
/structure/suffix-array
/structure/suffix-array/
/wp-comments-post.php
Time taken: 34.821 seconds
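Lexicographic comparison of dd/MMM/yyyy:HH:mm:ss strings is safe here only because both bounds share the same date; in general the timestamps should be converted to epoch seconds first. A sketch:

select distinct(log["path"])
from s_access_log
where log["ip"]="175.44.30.93"
  and unix_timestamp(log["time"], 'dd/MMM/yyyy:HH:mm:ss')
      between unix_timestamp('29/Sep/2013:00:10:16', 'dd/MMM/yyyy:HH:mm:ss')
          and unix_timestamp('29/Sep/2013:09:10:16', 'dd/MMM/yyyy:HH:mm:ss');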
