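Writing a Hadoop MapReduce Program in Python

Although the Hadoop framework itself is written in Java, programs for Hadoop do not have to be: they can also be written in languages such as C++ or Python. The example on the official Hadoop website is written in Jython and packaged into a jar file, which is clearly inconvenient; take a look at the example in /src/examples/python/wordcount.py and you will see what I mean. It does not have to be done that way, because we can program against Hadoop directly in Python.

What do we want to do?

We will write a simple MapReduce program in CPython, rather than a Jython program that has to be packaged into a jar.

Our example mimics WordCount and is implemented in Python: it reads text files and counts how often each word occurs. The result is also written as text, one line per word, with the word and its count separated by a tab.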
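As a point of reference for the output format, here is a minimal single-process sketch of the same word count in plain Python. It is not part of the Hadoop job; the function name word_count_lines and the sample text are made up for illustration.

```python
from collections import Counter


def word_count_lines(text):
    """Return 'word<TAB>count' lines for whitespace-separated words, sorted by word."""
    counts = Counter(text.split())
    return ['%s\t%d' % (word, count) for word, count in sorted(counts.items())]


if __name__ == '__main__':
    # Hypothetical sample input, not taken from the e-book used later.
    for line in word_count_lines('foo foo quux labs foo bar quux'):
        print(line)
```

The MapReduce version developed below should produce the same kind of lines, only computed in separate map and reduce steps.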

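Prerequisites

Before writing this program, you need a working Hadoop cluster, so that you do not get stuck later on. If you have not set one up yet, a short tutorial later on shows how to build one on Ubuntu Linux (it applies equally to other Linux distributions and to Unix).

Python MapReduce code

The trick to writing MapReduce code in Python is Hadoop Streaming, which passes data between the map and reduce steps via stdin (standard input) and stdout (standard output). We simply read input from Python's sys.stdin and write output to sys.stdout; Hadoop Streaming takes care of everything else. Really, it does!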
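To make the data flow concrete, the sketch below imitates it inside a single Python process: a map stage that emits tab-separated lines, a sort, and a reduce stage that consumes the sorted lines. The function names (map_words, reduce_counts, simulate_streaming) are invented for this illustration, and the real framework distributes these stages across machines instead of chaining them in memory.

```python
from itertools import groupby


def map_words(lines):
    """Map stage: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word


def reduce_counts(sorted_lines):
    """Reduce stage: sum the counts of adjacent lines that share a key."""
    pairs = (line.split('\t', 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '%s\t%d' % (word, sum(int(count) for _, count in group))


def simulate_streaming(lines):
    """Chain the stages: map, sort (keeps equal keys adjacent), reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == '__main__':
    for out in simulate_streaming(['foo foo quux', 'labs foo bar quux']):
        print(out)
```

Each stage only sees lines of text, which is why the same logic can be moved into two standalone scripts that read sys.stdin and write sys.stdout.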

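Map: mapper.py

Save the following code in /usr/local/hadoop/mapper.py. It reads data from stdin, splits each line into words, and writes one line per word that maps the word to a count:

Note: make sure the script is executable (chmod +x mapper.py).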

#!/usr/bin/env python

import sys

# input comes from stdin (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to stdout (standard output);
        # what we output here will be the input for the
        # reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

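This script does not compute the total number of occurrences of each word. It immediately outputs "<word> 1" for every word it sees, even if <word> appears many times in the input; the summing is left to the subsequent reduce step (or program). You are of course free to change the coding style to suit your own habits.

Reduce: reducer.py

Save the following code in /usr/local/hadoop/reducer.py. This script reads the output of mapper.py from stdin, sums the occurrences of each word, and writes the result to stdout.

Again, make sure the script is executable: chmod +x reducer.py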

#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from stdin
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this if-switch only works because hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to stdout
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking current_word alone also covers empty input)
if current_word:
    print('%s\t%s' % (current_word, current_count))
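The comment about sorted map output is the key assumption in reducer.py. The following sketch repeats the same adjacent-key logic as a plain function (reduce_adjacent is a hypothetical name, and the input lists are made up) so that the effect of unsorted input can be seen directly.

```python
def reduce_adjacent(lines):
    """Sum counts of adjacent lines with the same key, like reducer.py does."""
    results = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        results.append((current_word, current_count))
    return results


if __name__ == '__main__':
    unsorted_lines = ['foo\t1', 'bar\t1', 'foo\t1']
    print(reduce_adjacent(unsorted_lines))          # 'foo' is reported twice
    print(reduce_adjacent(sorted(unsorted_lines)))  # 'foo' is reported once
```

Because the reducer keeps only one running total in memory, it stays cheap on large inputs, but it is correct only when equal keys arrive next to each other.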

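Test your code (cat data | map | sort | reduce)

I recommend testing mapper.py and reducer.py by hand before using them in a MapReduce job. Otherwise the job may finish without producing any results.

Here are some ideas for testing the map and reduce steps: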

hadoop@derekubun:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

hadoop@derekubun:/usr/local/hadoop$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
bar 1
foo 3
labs 1
quux 2

# using one of the ebooks as example input
# (see below on where to get the ebooks)
hadoop@derekubun:/usr/local/hadoop$ cat book/book.txt | ./mapper.py
subscribe 1
to 1
our 1
email 1
newsletter 1
hear 1
about 1
new 1
ebooks. 1

Running the Python scripts on Hadoop

For this example we need an e-book. Put it at /usr/local/hadoop/book/book.txt:

hadoop@derekubun:/usr/local/hadoop$ ls -l book
total 636
-rw-rw-r-- 1 derek derek 649669 Mar 12 12:22 book.txt

Copy local data to HDFS

Before we run the MapReduce job, we need to copy the local files to HDFS:

hadoop@derekubun:/usr/local/hadoop$ hadoop dfs -copyFromLocal /usr/local/hadoop/book book
hadoop@derekubun:/usr/local/hadoop$ hadoop dfs -ls
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2013-03-12 15:56 /user/hadoop/book

Run the MapReduce job

Now that everything is ready, we can run the Python MapReduce job on the Hadoop cluster. As mentioned above, we use Hadoop Streaming to pass data between the map and reduce steps via stdin and stdout.

hadoop@derekubun:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -mapper /usr/local/hadoop/mapper.py \
    -reducer /usr/local/hadoop/reducer.py \
    -input book/* \
    -output book-output

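If you want to change some Hadoop settings for the run, such as increasing the number of reduce tasks, you can use the "-jobconf" option. Hadoop 1.1.2 accepts this option but reports it as deprecated and recommends -D instead.

hadoop@derekubun:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -jobconf mapred.reduce.tasks=4 \
    -mapper /usr/local/hadoop/mapper.py \
    -reducer /usr/local/hadoop/reducer.py \
    -input book/* \
    -output book-output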

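If both of the commands above fail, try the following variant, which additionally ships the scripts with the job through the -file option. Note that before re-running a job you must delete its output directory from HDFS (for example with hadoop dfs -rmr url-output), because Hadoop will not write into an output directory that already exists.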

bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar \
    -mapper task1/mapper.py \
    -file task1/mapper.py \
    -reducer task1/reducer.py \
    -file task1/reducer.py \
    -input url \
    -output url-output \
    -jobconf mapred.reduce.tasks=3
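The invocations above differ only in a handful of options. If you start jobs from a script, a small helper can assemble the argument list. This is only a sketch: build_streaming_command and its parameter names are hypothetical, and it uses nothing beyond the options already shown above (-mapper, -reducer, -file, -input, -output, -jobconf).

```python
def build_streaming_command(jar, mapper, reducer, input_path, output_path,
                            ship_files=(), reduce_tasks=None):
    """Assemble the argument list for a 'hadoop jar <streaming jar> ...' call."""
    cmd = ['hadoop', 'jar', jar, '-mapper', mapper, '-reducer', reducer]
    for path in ship_files:
        # -file sends a local script along with the job submission.
        cmd += ['-file', path]
    cmd += ['-input', input_path, '-output', output_path]
    if reduce_tasks is not None:
        cmd += ['-jobconf', 'mapred.reduce.tasks=%d' % reduce_tasks]
    return cmd


if __name__ == '__main__':
    cmd = build_streaming_command(
        'contrib/streaming/hadoop-streaming-1.1.2.jar',
        'task1/mapper.py', 'task1/reducer.py', 'url', 'url-output',
        ship_files=['task1/mapper.py', 'task1/reducer.py'], reduce_tasks=3)
    print(' '.join(cmd))
    # To actually launch the job, pass cmd to subprocess.check_call(cmd).
```

Building a list rather than a single string avoids quoting problems when a path contains spaces.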

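An important note: Hadoop does not honor mapred.map.tasks (it treats the value only as a hint).

The job reads all files in the HDFS directory book, processes them, and stores the results in result files in the HDFS directory book-output. The output of the earlier run looks like this: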

hadoop@derekubun:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -jobconf mapred.reduce.tasks=4 -mapper /usr/local/hadoop/mapper.py -reducer /usr/local/hadoop/reducer.py -input book/* -output book-output

13/03/12 16:01:05 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/usr/local/hadoop/tmp/hadoop-unjar4835873410426602498/] [] /tmp/streamjob5047485520312501206.jar tmpDir=null
13/03/12 16:01:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/12 16:01:06 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/12 16:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/12 16:01:06 INFO streaming.StreamJob: getLocalDirs(): [/usr/local/hadoop/tmp/mapred/local]
13/03/12 16:01:06 INFO streaming.StreamJob: Running job: job_201303121448_0010
13/03/12 16:01:06 INFO streaming.StreamJob: To kill this job, run:
13/03/12 16:01:06 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201303121448_0010
13/03/12 16:01:06 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201303121448_0010
13/03/12 16:01:07 INFO streaming.StreamJob: map 0% reduce 0%
13/03/12 16:01:10 INFO streaming.StreamJob: map 100% reduce 0%
13/03/12 16:01:17 INFO streaming.StreamJob: map 100% reduce 8%
13/03/12 16:01:18 INFO streaming.StreamJob: map 100% reduce 33%
13/03/12 16:01:19 INFO streaming.StreamJob: map 100% reduce 50%
13/03/12 16:01:26 INFO streaming.StreamJob: map 100% reduce 67%
13/03/12 16:01:27 INFO streaming.StreamJob: map 100% reduce 83%
13/03/12 16:01:28 INFO streaming.StreamJob: map 100% reduce 100%
13/03/12 16:01:29 INFO streaming.StreamJob: Job complete: job_201303121448_0010
13/03/12 16:01:29 INFO streaming.StreamJob: Output: book-output

hadoop@derekubun:/usr/local/hadoop$

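As you can see in the output above, Hadoop also provides a basic web interface for statistics and information.

While the Hadoop cluster is running, you can open http://localhost:50030/ in a browser.

Check that the result was written to the HDFS directory book-output: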

hadoop@derekubun:/usr/local/hadoop$ hadoop dfs -ls book-output
Found 6 items
-rw-r--r-- 2 hadoop supergroup 0 2013-03-12 16:01 /user/hadoop/book-output/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2013-03-12 16:01 /user/hadoop/book-output/_logs
-rw-r--r-- 2 hadoop supergroup 33 2013-03-12 16:01 /user/hadoop/book-output/part-00000
-rw-r--r-- 2 hadoop supergroup 60 2013-03-12 16:01 /user/hadoop/book-output/part-00001
-rw-r--r-- 2 hadoop supergroup 54 2013-03-12 16:01 /user/hadoop/book-output/part-00002
-rw-r--r-- 2 hadoop supergroup 47 2013-03-12 16:01 /user/hadoop/book-output/part-00003
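The four part files correspond to the four reduce tasks requested with mapred.reduce.tasks=4. For the per-word totals to be correct, every occurrence of a given word has to reach the same reducer, and Hadoop's default partitioner achieves this by hashing the key. The sketch below illustrates the idea only: partition_for, split_by_reducer and the use of zlib.crc32 are assumptions chosen for a reproducible example, not Hadoop's actual hash function.

```python
import zlib


def partition_for(key, num_reducers):
    """Pick a reducer index for a key with a deterministic hash (illustrative only)."""
    return zlib.crc32(key.encode('utf-8')) % num_reducers


def split_by_reducer(pairs, num_reducers):
    """Distribute (word, count) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for word, count in pairs:
        buckets[partition_for(word, num_reducers)].append((word, count))
    return buckets


if __name__ == '__main__':
    pairs = [(w, 1) for w in 'foo foo quux labs foo bar quux'.split()]
    for index, bucket in enumerate(split_by_reducer(pairs, 4)):
        print('reducer %d: %r' % (index, bucket))
```

Since the bucket depends only on the key, each word ends up in exactly one part file, and the files can be concatenated without merging counts.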

You can inspect the contents of a result file with the dfs -cat command:

hadoop@derekubun:/usr/local/hadoop$ hadoop dfs -cat book-output/part-00000

about 1

ebooks. 1

the 1

to 2

hadoop@derekubun:/usr/local/hadoop$
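Once the result lines have been printed or fetched, they are easy to post-process in Python. The sketch below assumes the lines have already been read into a list; the helper names parse_counts and top_words are hypothetical, and the sample data simply mirrors the part-00000 listing above.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into (word, count) tuples, skipping bad lines."""
    parsed = []
    for line in lines:
        parts = line.rstrip('\n').split('\t', 1)
        if len(parts) != 2:
            continue
        try:
            parsed.append((parts[0], int(parts[1])))
        except ValueError:
            continue
    return parsed


def top_words(lines, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(parse_counts(lines), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == '__main__':
    sample = ['about\t1', 'ebooks.\t1', 'the\t1', 'to\t2']
    print(top_words(sample, n=2))
```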

Below are two improved versions of mapper.py and reducer.py by the author of the original English article:

mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


def main(separator='\t'):
    # input comes from stdin (standard input)
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            print('%s%s%d' % (word, separator, 1))


if __name__ == "__main__":
    main()

reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys


def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)


def main(separator='\t'):
    # input comes from stdin (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass


if __name__ == "__main__":
    main()
