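For an RDD, use keyBy to build key-line pairs. The key is whatever the supplied function returns, and the value is the original line: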
[training@localhost ~]$ cat webs.log
56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"
56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"
202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"
[training@localhost ~]$
[training@localhost ~]$ hdfs dfs -put webs.log
[training@localhost ~]$ hdfs dfs -cat webs.log
[training@localhost ~]$
In [23]: mylogs = sc.textFile("webs.log")
In [25]: mylogs001 = mylogs.keyBy(lambda line: line.split(' ')[2])
In [26]: mylogs001.take(1)
Out[26]: [(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"')]
In [28]: mylogs001.take(2)
Out[28]:
[(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"'),
(u'90700', u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"')]
For comparison, look at mylogs001.take(3) and mylogs.take(3):
In [30]: mylogs001.take(3)
Out[30]:
[(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"'),
(u'90700', u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"'),
(u'25223', u'202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"')]
In [31]: mylogs.take(3)
Out[31]:
[u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
u'202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"']
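The comparison shows that keyBy leaves each line untouched and only attaches a key to it, so keyBy(f) behaves like map(lambda x: (f(x), x)). Below is a minimal pure-Python sketch of that behavior, assuming the same three log lines as above. The helper name key_by is hypothetical and only stands in for RDD.keyBy so the idea can be tried without a Spark cluster.

```python
# Pure-Python sketch of what RDD.keyBy does; no Spark needed.
# The three lines mirror the contents of webs.log shown above.
lines = [
    '56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
    '56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
    '202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"',
]


def key_by(records, key_func):
    """Hypothetical stand-in for RDD.keyBy: pair each record with key_func(record)."""
    return [(key_func(record), record) for record in records]


# Same key function as in the session: the third space-separated field.
pairs = key_by(lines, lambda line: line.split(' ')[2])

for key, line in pairs:
    print(key, '->', line)

# Dropping the keys gives back the original lines unchanged.
values = [line for _, line in pairs]
```

Here pairs plays the role of mylogs001 and lines plays the role of mylogs: the keys are the third field of each line, and the values are identical to the input.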
This article is reposted from the 健哥的資料花園 blog on cnblogs. Original link: http://www.cnblogs.com/gaojian/p/008-Aggregating-Data-with-Pair-RDDs-keyBy.html. To republish it, please contact the original author.