
Hadoop Basics: A WordCount Example

1. Create the project

In Eclipse: File -> New -> Other -> Map/Reduce -> Map/Reduce Project -> Next -> enter the project name -> Finish

2. Create the project directory structure

(Screenshot of the package layout omitted.)

3. Write the Java source files

3.1 WCMapper.java

package hadoop.example.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable,Text,Text,LongWritable>{

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Read one line of input
        String line = value.toString();
        // Split the line into words on spaces
        String[] words = line.split(" ");
        // Emit (word, 1) for every word; new Text(w) and new LongWritable(1)
        // wrap the data in Hadoop's writable types
        for (String w : words) {
            context.write(new Text(w), new LongWritable(1));
        }
    }

}
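The mapper's core logic (split a line, emit a (word, 1) pair per token) can be sanity-checked without a cluster. The sketch below is plain Java, not Hadoop code; `MapSketch` is a hypothetical helper class that mimics what `map()` does for a single input line:

```java
import java.util.ArrayList;
import java.util.List;

public class MapSketch {
    // Simulate WCMapper.map() for one line: emit a (word, "1") pair per token
    static List<String[]> map(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : line.split(" ")) {
            pairs.add(new String[]{w, "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] p : map("hello tom")) {
            System.out.println(p[0] + "\t" + p[1]);
        }
        // prints:
        // hello	1
        // tom	1
    }
}
```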
           

3.2 WCReducer.java

package hadoop.example.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text,LongWritable,Text,LongWritable >{

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
        // Sum the counts that arrived for this word
        long counter = 0;
        for (LongWritable l : values) {
            counter += l.get();
        }
        // Emit (word, total count)
        context.write(key, new LongWritable(counter));
    }

}
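The reduce side can be checked locally the same way. Another plain-Java sketch (again, not Hadoop code) that sums the per-key counts, as `reduce()` above does:

```java
public class ReduceSketch {
    // Simulate WCReducer.reduce(): sum all counts delivered for one key
    static long reduce(Iterable<Long> values) {
        long counter = 0;
        for (long v : values) {
            counter += v;
        }
        return counter;
    }

    public static void main(String[] args) {
        // For the key "hello", the shuffle delivers five 1s in the sample input
        System.out.println("hello\t" + reduce(java.util.Arrays.asList(1L, 1L, 1L, 1L, 1L)));
        // prints: hello	5
    }
}
```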
           

3.3 WordCount.java

package hadoop.example.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class WordCount {
    public static void main(String args[]) throws IOException, ClassNotFoundException, InterruptedException{
        // Build a Job object
        Job job = Job.getInstance(new Configuration());

        // The class containing main(); Hadoop uses it to locate the jar
        job.setJarByClass(WordCount.class);

        // Configure the mapper
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/root/workplace/hdfs/wdcount/1.txt"));

        // Configure the reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Note: the output directory must not already exist, or the job will fail
        FileOutputFormat.setOutputPath(job, new Path("/root/workplace/hdfs/wdcount/output"));

        // Submit the job; true means print progress details to the console
        job.waitForCompletion(true);
    }
}
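Putting the two phases together, the whole pipeline can be simulated in plain Java on the five-line sample input from section 5. This is a local sketch under the assumption of single-space-delimited words, not the MapReduce job itself; it reproduces the counts that appear in part-r-00000 in section 6:

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map + shuffle + reduce in one pass: tally words across all lines
    static Map<String, Long> wordCount(String[] lines) {
        Map<String, Long> counts = new TreeMap<>(); // TreeMap sorts keys, like MapReduce output
        for (String line : lines) {
            for (String w : line.split(" ")) {
                if (w.isEmpty()) continue; // skip empty tokens from repeated spaces
                counts.merge(w, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {
            "hello tom ", "hello jerry", "hello kitty", "hello world", "hello tom"
        };
        wordCount(input).forEach((k, v) -> System.out.println(k + "\t" + v));
        // prints: hello 5, jerry 1, kitty 1, tom 2, world 1
    }
}
```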
           

4. Package the classes into a jar

In Eclipse: right-click the project -> Export -> JAR file -> Next -> choose the destination directory for the jar -> Next -> Next -> for Main class, select the class whose main() should run (here, WordCount) -> Finish

5. Upload the input file 1.txt

hello tom 
hello jerry
hello kitty
hello world
hello tom


Upload the file:
[[email protected] ~]# hadoop dfs -put /root/workplace/wdcount  /root/workplace/hdfs
Warning: $HADOOP_HOME is deprecated.

List the file:
[[email protected] ~]# hadoop dfs -ls  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r--   1 root supergroup         57 2016-08-20 09:06 /root/workplace/hdfs/wdcount/1.txt
View the file contents:
[[email protected] ~]# hadoop dfs -cat  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.

hello tom 
hello jerry
hello kitty
hello world
hello tom
           

6. Two ways to run the jar

6.1 Running the jar on the cluster (the usual way in production)

[[email protected] wdcount]# hadoop jar /root/workplace/wdcount/wc.jar 
Warning: $HADOOP_HOME is deprecated.

16/08/20 09:20:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/20 09:20:35 INFO input.FileInputFormat: Total input paths to process : 1
16/08/20 09:20:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/08/20 09:20:35 WARN snappy.LoadSnappy: Snappy native library not loaded
16/08/20 09:20:35 INFO mapred.JobClient: Running job: job_201608192017_0008
16/08/20 09:20:36 INFO mapred.JobClient:  map 0% reduce 0%
16/08/20 09:20:43 INFO mapred.JobClient:  map 100% reduce 0%
16/08/20 09:20:52 INFO mapred.JobClient:  map 100% reduce 33%
16/08/20 09:20:54 INFO mapred.JobClient:  map 100% reduce 100%
16/08/20 09:20:56 INFO mapred.JobClient: Job complete: job_201608192017_0008
16/08/20 09:20:56 INFO mapred.JobClient: Counters: 29
16/08/20 09:20:56 INFO mapred.JobClient:   Job Counters 
16/08/20 09:20:56 INFO mapred.JobClient:     Launched reduce tasks=1
16/08/20 09:20:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8719
16/08/20 09:20:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/20 09:20:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/08/20 09:20:56 INFO mapred.JobClient:     Launched map tasks=1
16/08/20 09:20:56 INFO mapred.JobClient:     Data-local map tasks=1
16/08/20 09:20:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10405
16/08/20 09:20:56 INFO mapred.JobClient:   File Output Format Counters 
16/08/20 09:20:56 INFO mapred.JobClient:     Bytes Written=38
16/08/20 09:20:56 INFO mapred.JobClient:   FileSystemCounters
16/08/20 09:20:56 INFO mapred.JobClient:     FILE_BYTES_READ=162
16/08/20 09:20:56 INFO mapred.JobClient:     HDFS_BYTES_READ=177
16/08/20 09:20:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=110365
16/08/20 09:20:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=38
16/08/20 09:20:56 INFO mapred.JobClient:   File Input Format Counters 
16/08/20 09:20:56 INFO mapred.JobClient:     Bytes Read=57
16/08/20 09:20:56 INFO mapred.JobClient:   Map-Reduce Framework
16/08/20 09:20:56 INFO mapred.JobClient:     Map output materialized bytes=162
16/08/20 09:20:56 INFO mapred.JobClient:     Map input records=5
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce shuffle bytes=162
16/08/20 09:20:56 INFO mapred.JobClient:     Spilled Records=20
16/08/20 09:20:56 INFO mapred.JobClient:     Map output bytes=136
16/08/20 09:20:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=158797824
16/08/20 09:20:56 INFO mapred.JobClient:     CPU time spent (ms)=2290
16/08/20 09:20:56 INFO mapred.JobClient:     Combine input records=0
16/08/20 09:20:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=120
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce input records=10
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce input groups=5
16/08/20 09:20:56 INFO mapred.JobClient:     Combine output records=0
16/08/20 09:20:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=263684096
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce output records=5
16/08/20 09:20:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3726540800
16/08/20 09:20:56 INFO mapred.JobClient:     Map output records=10
----------------------
After the job completes, inspect the result:
[[email protected] wdcount]# hadoop dfs -ls /root/workplace/hdfs/wdcount/output
Warning: $HADOOP_HOME is deprecated.

Found 3 items
-rw-r--r--   1 root supergroup          0 2016-08-20 09:20 /root/workplace/hdfs/wdcount/output/_SUCCESS
drwxr-xr-x   - root supergroup          0 2016-08-20 09:20 /root/workplace/hdfs/wdcount/output/_logs
-rw-r--r--   1 root supergroup         38 2016-08-20 09:20 /root/workplace/hdfs/wdcount/output/part-r-00000
[[email protected] wdcount]# hadoop dfs -cat  /root/workplace/hdfs/wdcount/output/part-r-00000
Warning: $HADOOP_HOME is deprecated.

hello   5
jerry   1
kitty   1
tom 2
world   1
[[email protected] wdcount]# 
           

6.2 Running the program directly from Eclipse

Run the WordCount class from inside Eclipse. As the log below shows, the job then executes in the LocalJobRunner rather than on the cluster, and no job jar is set. (Screenshots of the Eclipse run configuration omitted.)

Console output:

16/08/20 10:32:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/20 10:32:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/20 10:32:12 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
16/08/20 10:32:12 INFO input.FileInputFormat: Total input paths to process : 1
16/08/20 10:32:13 WARN snappy.LoadSnappy: Snappy native library not loaded
16/08/20 10:32:13 INFO mapred.JobClient: Running job: job_local288352385_0001
16/08/20 10:32:13 INFO mapred.LocalJobRunner: Waiting for map tasks
16/08/20 10:32:13 INFO mapred.LocalJobRunner: Starting task: attempt_local288352385_0001_m_000000_0
16/08/20 10:32:13 INFO util.ProcessTree: setsid exited with exit code 0
16/08/20 10:32:13 INFO mapred.Task:  Using ResourceCalculatorPlugin : [email protected]
16/08/20 10:32:13 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/root/workplace/hdfs/wdcount/1.txt:0+57
16/08/20 10:32:14 INFO mapred.MapTask: io.sort.mb = 100
16/08/20 10:32:14 INFO mapred.MapTask: data buffer = 79691776/99614720
16/08/20 10:32:14 INFO mapred.MapTask: record buffer = 262144/327680
16/08/20 10:32:14 INFO mapred.MapTask: Starting flush of map output
16/08/20 10:32:14 INFO mapred.MapTask: Finished spill 0
16/08/20 10:32:14 INFO mapred.Task: Task:attempt_local288352385_0001_m_000000_0 is done. And is in the process of commiting
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Task: Task 'attempt_local288352385_0001_m_000000_0' done.
16/08/20 10:32:14 INFO mapred.LocalJobRunner: Finishing task: attempt_local288352385_0001_m_000000_0
16/08/20 10:32:14 INFO mapred.LocalJobRunner: Map task executor complete.
16/08/20 10:32:14 INFO mapred.Task:  Using ResourceCalculatorPlugin : [email protected]
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Merger: Merging 1 sorted segments
16/08/20 10:32:14 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 158 bytes
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Task: Task:attempt_local288352385_0001_r_000000_0 is done. And is in the process of commiting
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Task: Task attempt_local288352385_0001_r_000000_0 is allowed to commit now
16/08/20 10:32:14 INFO output.FileOutputCommitter: Saved output of task 'attempt_local288352385_0001_r_000000_0' to hdfs://localhost:9000/root/workplace/hdfs/wdcount/output
16/08/20 10:32:14 INFO mapred.LocalJobRunner: reduce > reduce
16/08/20 10:32:14 INFO mapred.Task: Task 'attempt_local288352385_0001_r_000000_0' done.
16/08/20 10:32:14 INFO mapred.JobClient:  map 100% reduce 100%
16/08/20 10:32:14 INFO mapred.JobClient: Job complete: job_local288352385_0001
16/08/20 10:32:14 INFO mapred.JobClient: Counters: 22
16/08/20 10:32:14 INFO mapred.JobClient:   File Output Format Counters 
16/08/20 10:32:14 INFO mapred.JobClient:     Bytes Written=38
16/08/20 10:32:14 INFO mapred.JobClient:   File Input Format Counters 
16/08/20 10:32:14 INFO mapred.JobClient:     Bytes Read=57
16/08/20 10:32:14 INFO mapred.JobClient:   FileSystemCounters
16/08/20 10:32:14 INFO mapred.JobClient:     FILE_BYTES_READ=510
16/08/20 10:32:14 INFO mapred.JobClient:     HDFS_BYTES_READ=114
16/08/20 10:32:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=136042
16/08/20 10:32:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=38
16/08/20 10:32:14 INFO mapred.JobClient:   Map-Reduce Framework
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce input groups=5
16/08/20 10:32:14 INFO mapred.JobClient:     Map output materialized bytes=162
16/08/20 10:32:14 INFO mapred.JobClient:     Combine output records=0
16/08/20 10:32:14 INFO mapred.JobClient:     Map input records=5
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
16/08/20 10:32:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce output records=5
16/08/20 10:32:14 INFO mapred.JobClient:     Spilled Records=20
16/08/20 10:32:14 INFO mapred.JobClient:     Map output bytes=136
16/08/20 10:32:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=258482176
16/08/20 10:32:14 INFO mapred.JobClient:     CPU time spent (ms)=0
16/08/20 10:32:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
16/08/20 10:32:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=120
16/08/20 10:32:14 INFO mapred.JobClient:     Map output records=10
16/08/20 10:32:14 INFO mapred.JobClient:     Combine input records=0
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce input records=10
           

7. The WordCount example is complete

8. Some basic HDFS operations

Delete a directory:

[[email protected] ~]# hadoop dfs -rmr /root/workplace/wdcount
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://localhost:9000/root/workplace/wdcount
           

List a file:

[[email protected] ~]# hadoop dfs -ls  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r--   1 root supergroup         57 2016-08-20 09:06 /root/workplace/hdfs/wdcount/1.txt
           

View a file's contents:

[[email protected] ~]# hadoop dfs -cat  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.

hello tom 
hello jerry
hello kitty
hello world
hello tom
           

Delete everything inside a directory:

[[email protected] ~]# hadoop dfs -rm /root/workplace/wdcount/*
Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://localhost:9000/root/workplace/wdcount/1.txt
Deleted hdfs://localhost:9000/root/workplace/wdcount/1.txt~
Deleted hdfs://localhost:9000/root/workplace/wdcount/wc.jar
           

Upload a local file to the HDFS filesystem:

[[email protected] ~]# hadoop dfs -put /root/workplace/wdcount/1.txt  /root/workplace/hdfs/wdcount
Warning: $HADOOP_HOME is deprecated.

[[email protected] ~]# hadoop dfs -ls  /root/workplace/hdfs/wdcount
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r--   1 root supergroup         57 2016-08-20 09:02 /root/workplace/hdfs/wdcount
           
