
Hadoop MapReduce wordcount (word-frequency counting)

1. Create test.log


[root@sht-sgmhadoopnn-01 mapreduce]# more /tmp/test.log

1

2

3

a

b

v

a a a

abc

我是誰

%……

%
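For reference, the input file above can be recreated locally with a heredoc (a sketch; the contents are taken from the `more /tmp/test.log` listing, so adjust if your file differs):

```shell
# Recreate the sample input shown in the listing above.
cat > /tmp/test.log <<'EOF'
1
2
3
a
b
v
a a a
abc
我是誰
%……
%
EOF
# Quick sanity check on the line count.
wc -l < /tmp/test.log
# → 11
```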

2. Create an HDFS directory and upload the file

[root@sht-sgmhadoopnn-01 ~]# hadoop fs -mkdir /testdir

16/02/28 19:40:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[root@sht-sgmhadoopnn-01 ~]# hadoop fs -put /tmp/test.log /testdir/

16/02/28 19:40:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

3. Browse the example programs bundled with Hadoop; we pick wordcount

[root@sht-sgmhadoopnn-01 ~]# cd /hadoop/hadoop-2.7.2/share/hadoop/mapreduce

[root@sht-sgmhadoopnn-01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.2.jar

An example program must be given as the first argument.

Valid program names are:

  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.

  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.

  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.

  dbcount: An example job that count the pageview counts from a database.

  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.

  grep: A map/reduce program that counts the matches of a regex in the input.

  join: A job that effects a join over sorted, equally partitioned datasets

  multifilewc: A job that counts words from several files.

  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.

  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.

  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.

  randomwriter: A map/reduce program that writes 10GB of random data per node.

  secondarysort: An example defining a secondary sort to the reduce.

  sort: A map/reduce program that sorts the data written by the random writer.

  sudoku: A sudoku solver.

  teragen: Generate data for the terasort

  terasort: Run the terasort

  teravalidate: Checking results of terasort

  wordcount: A map/reduce program that counts the words in the input files.

  wordmean: A map/reduce program that counts the average length of the words in the input files.

  wordmedian: A map/reduce program that counts the median length of the words in the input files.

  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

4. Run wordcount

# hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /testdir /out1

#                       example jar              program   input dir  output dir (must not already exist)

[root@sht-sgmhadoopnn-01 mapreduce]# hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /testdir /out1

16/02/28 19:40:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/02/28 19:40:53 INFO input.FileInputFormat: Total input paths to process : 1

16/02/28 19:40:53 INFO mapreduce.JobSubmitter: number of splits:1

16/02/28 19:40:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456590271264_0002

16/02/28 19:40:54 INFO impl.YarnClientImpl: Submitted application application_1456590271264_0002

16/02/28 19:40:54 INFO mapreduce.Job: The url to track the job: http://sht-sgmhadoopnn-01:8088/proxy/application_1456590271264_0002/

16/02/28 19:40:54 INFO mapreduce.Job: Running job: job_1456590271264_0002

16/02/28 19:41:04 INFO mapreduce.Job: Job job_1456590271264_0002 running in uber mode : false

16/02/28 19:41:04 INFO mapreduce.Job: map 0% reduce 0%

16/02/28 19:41:12 INFO mapreduce.Job: map 100% reduce 0%

16/02/28 19:41:21 INFO mapreduce.Job: map 100% reduce 100%

16/02/28 19:41:22 INFO mapreduce.Job: Job job_1456590271264_0002 completed successfully

16/02/28 19:41:22 INFO mapreduce.Job: Counters: 49

        File System Counters

                FILE: Number of bytes read=102

                FILE: Number of bytes written=244621

                FILE: Number of read operations=0

                FILE: Number of large read operations=0

                FILE: Number of write operations=0

                HDFS: Number of bytes read=142

                HDFS: Number of bytes written=56

                HDFS: Number of read operations=6

                HDFS: Number of large read operations=0

                HDFS: Number of write operations=2

        Job Counters

                Launched map tasks=1

                Launched reduce tasks=1

                Data-local map tasks=1

                Total time spent by all maps in occupied slots (ms)=5537

                Total time spent by all reduces in occupied slots (ms)=6555

                Total time spent by all map tasks (ms)=5537

                Total time spent by all reduce tasks (ms)=6555

                Total vcore-milliseconds taken by all map tasks=5537

                Total vcore-milliseconds taken by all reduce tasks=6555

                Total megabyte-milliseconds taken by all map tasks=5669888

                Total megabyte-milliseconds taken by all reduce tasks=6712320

        Map-Reduce Framework

                Map input records=12

                Map output records=14

                Map output bytes=100

                Map output materialized bytes=102

                Input split bytes=98

                Combine input records=14

                Combine output records=10

                Reduce input groups=10

                Reduce shuffle bytes=102

                Reduce input records=10

                Reduce output records=10

                Spilled Records=20

                Shuffled Maps =1

                Failed Shuffles=0

                Merged Map outputs=1

                GC time elapsed (ms)=79

                CPU time spent (ms)=2560

                Physical memory (bytes) snapshot=445992960

                Virtual memory (bytes) snapshot=1775263744

                Total committed heap usage (bytes)=306184192

        Shuffle Errors

                BAD_ID=0

                CONNECTION=0

                IO_ERROR=0

                WRONG_LENGTH=0

                WRONG_MAP=0

                WRONG_REDUCE=0

        File Input Format Counters

                Bytes Read=44

        File Output Format Counters

                Bytes Written=56

You have mail in /var/spool/mail/root

[root@sht-sgmhadoopnn-01 mapreduce]#

5. Verify the wordcount result (word frequencies)

[root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -ls /out1

16/02/28 19:43:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 2 items

-rw-r--r-- 3 root supergroup 0 2016-02-28 19:41 /out1/_SUCCESS

-rw-r--r-- 3 root supergroup 56 2016-02-28 19:41 /out1/part-r-00000

[root@sht-sgmhadoopnn-01 mapreduce]# hadoop fs -text /out1/part-r-00000

16/02/28 19:43:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

% 1

%…… 1

1 1

2 1

3 1

a 5

abc 1

b 1

v 1

我是誰 1
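The result can be cross-checked locally with standard Unix tools, which mirror the three phases of the job: map (split lines into tokens), shuffle (sort), and reduce (count runs of equal tokens). A sketch, using a tiny stand-in file `/tmp/wc_sample.txt` (hypothetical; you could instead fetch the real input with `hadoop fs -get /testdir/test.log`):

```shell
# Stand-in input for the demonstration.
printf '%s\n' 'a' 'b' 'a a a' > /tmp/wc_sample.txt

# map: one token per line; shuffle: sort; reduce: count equal tokens.
# The awk step reorders to "word<TAB>count", matching part-r-00000.
tr -s ' ' '\n' < /tmp/wc_sample.txt \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}'
# → a	4
#   b	1
```

Note that wordcount's output is tab-separated, which is why `hadoop fs -text /out1/part-r-00000` above shows each word followed by its count.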