Word Frequency Counting with MapReduce

WordCount is often called the Hello World of big data.

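Today we set up a Hadoop development environment and build a WordCount job that is launched from a local machine for testing. We will move on to real projects later; before that, we need to understand what MapReduce can do.

First, the business requirement. Suppose we have a file like this: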

hadoop hello world

hello hadoop

hbase zookeeper

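We want to count how many times each word occurs.

OK, let's start building.

First get Eclipse ready, then create a new project; a plain Java project is enough.

Create the mapper: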

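import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Assumption: the original does not show which StringUtils is imported; Hadoop's own is used here.
import org.apache.hadoop.util.StringUtils;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Called repeatedly by the framework, once per line read from the file split.
    // key is the byte offset of the line in the file, value is the content of the line.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on spaces and emit (word, 1) for every word.
        String[] words = StringUtils.split(value.toString(), ' ');
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}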

Create the reducer:

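import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Called once per group. Within a group the key is the same
    // and there may be several values.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable i : values) {
            sum = sum + i.get();
        }
        // Emit (word, total count).
        context.write(key, new IntWritable(sum));
    }
}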
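To see what the mapper and reducer do together without a cluster, here is a minimal plain-Java sketch of the same map, group-by-key, reduce flow. It is only an imitation for illustration: the class name LocalWordCount and its methods are made up for this example, no Hadoop classes are involved, and it splits on any run of whitespace rather than on a single space.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: imitates map -> shuffle -> reduce in memory, without Hadoop.
class LocalWordCount {

    // "Map" step: emit one (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" step: collect the values of each key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // "Reduce" step: called once per key, sums the values of that key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps over all input lines and returns word -> count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop hello world", "hello hadoop", "hbase zookeeper");
        // Print each word and its count separated by a tab.
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

On a real cluster the grouping step is done by the framework between the map and reduce phases; here it is just a TreeMap.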

Finally, create the driver class with the main method that runs the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunJob {

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.set("fs.defaultFS", "hdfs://192.168.181.100:8020");
        config.set("yarn.resourcemanager.hostname", "192.168.181.100");
        //config.set("mapred.jar", "C:\\Users\\Administrator\\Desktop\\wc.jar");
        try {
            FileSystem fs = FileSystem.get(config);
            Job job = Job.getInstance(config);
            job.setJarByClass(RunJob.class);
            job.setJobName("wc");
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);

            //FileInputFormat.addInputPath(job, new Path("/data/input/"));
            FileInputFormat.addInputPath(job, new Path("/data/wc.txt"));

            //Path outpath = new Path("/usr/output/wc");
            Path outpath = new Path("/data/output/wc");
            // The job fails if the output directory already exists, so delete it first.
            if (fs.exists(outpath)) {
                fs.delete(outpath, true);
            }
            FileOutputFormat.setOutputPath(job, outpath);

            // Submit the job and wait for it to finish, printing progress.
            boolean f = job.waitForCompletion(true);
            if (f) {
                System.out.println("job finished!!!");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

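A word about these two lines:

config.set("fs.defaultFS", "hdfs://192.168.181.100:8020");

config.set("yarn.resourcemanager.hostname", "192.168.181.100");

They tell the client where the cluster nodes are: fs.defaultFS is the address of the HDFS NameNode, and yarn.resourcemanager.hostname is the host running the YARN ResourceManager.

Next, remember to set up Hadoop locally. Yes, on your own machine: download Hadoop from the web and unpack it. If unpacking reports errors about some .so files, ignore them. The message may complain about missing administrator rights, but even running as administrator does not help, and those .so files do not affect using Hadoop here anyway.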

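Then take the winutils.exe client tool, which we found online, and put it into Hadoop's bin directory. Without it, the job fails with an error that mentions null/hadoop/bin.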

Next, configure the environment variables. Set HADOOP_HOME to the directory where Hadoop was unpacked, in my case:

F:\MapReduce\mr\hadoop-2.5.2\hadoop-2.5.2

Then append this to the end of Path:

;%HADOOP_HOME%\bin
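A quick way to confirm the setup before launching the job is a small preflight check. The sketch below is plain Java and only illustrative: the class WinutilsCheck and its methods are hypothetical names, not part of Hadoop. It rests on the assumption, consistent with the error above, that Hadoop on Windows looks for bin\winutils.exe under HADOOP_HOME, and that the "null" in the error path appears when that variable is not set.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical preflight helper: checks that winutils.exe sits under HADOOP_HOME\bin.
class WinutilsCheck {

    // Builds the expected location: <hadoopHome>/bin/winutils.exe.
    static Path winutilsPath(String hadoopHome) {
        return Paths.get(hadoopHome, "bin", "winutils.exe");
    }

    // Returns "ok" when the file is present, otherwise a short description of what is missing.
    static String check(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.isEmpty()) {
            return "HADOOP_HOME is not set";
        }
        Path exe = winutilsPath(hadoopHome);
        return Files.isRegularFile(exe) ? "ok" : "missing " + exe;
    }

    public static void main(String[] args) {
        // Reads the environment variable configured in the step above.
        System.out.println(check(System.getenv("HADOOP_HOME")));
    }
}
```

Remember that Eclipse only sees a changed environment variable after it has been restarted.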

Then prepare the data: copy the text file to the server, and from there upload it to HDFS (for example with hdfs dfs -put).

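Honestly, I recommend writing things down as you go. I wrote this post up after the fact, and having lost the screenshots of several important problems is rather embarrassing.

I kept getting a permission error, which was annoying, so I looked for a fix.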

Open up all permissions on the input file:

./hdfs dfs -chmod 777 /data/wc.txt

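But the error was still there. It turns out that on /data/output, users other than the owner and group have only read and execute permission (r-x), while the job still needs to write there. The earlier chmod only covered the input file, so the directory permissions have to be changed as well.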

org.apache.hadoop.security.AccessControlException: Permission denied: user=lishouzhuang, access=WRITE, inode="/data/output":beifeng:supergroup:drwxr-xr-x
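The mode string at the end of that message already explains the failure. As a small illustration, the plain-Java sketch below uses the JDK's PosixFilePermissions parser to check whether "other" users may write. PermissionCheck and othersCanWrite are names made up for this example, and it assumes that lishouzhuang is neither the owner (beifeng) nor a member of supergroup, so that the last three permission bits are the ones that apply.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper: inspects a mode string such as "drwxr-xr-x".
class PermissionCheck {

    // Returns true if users outside the owner and the group are allowed to write.
    static boolean othersCanWrite(String mode) {
        // Drop the leading file-type character ('d' or '-') when present,
        // leaving the nine permission characters.
        String bits = mode.length() == 10 ? mode.substring(1) : mode;
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(bits);
        return perms.contains(PosixFilePermission.OTHERS_WRITE);
    }

    public static void main(String[] args) {
        System.out.println(othersCanWrite("drwxr-xr-x")); // mode from the error message: false
        System.out.println(othersCanWrite("drwxrwxrwx")); // mode after chmod 777: true
    }
}
```

Mode 777 is convenient on a test cluster like this one; it gives every user full access, so it is not something to carry over to a shared cluster.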

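The original permissions on the output directory were very tight (drwxr-xr-x, as the error message shows), so I opened them up. Note that this command targets /data/output2 while RunJob above writes to /data/output/wc; presumably the output path in RunJob was switched to /data/output2/wc at this point.

./hdfs dfs -chmod 777 /data/output2

Now the permissions are wide open.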

Now run the job again. This time it should not report an error, and sure enough, it worked!

Let's look at the result. It sits in the job's output directory on HDFS, so we need to fetch it from there:

./hdfs dfs -get /data/output2/wc /file/output/

Then look at the data:

hadoop	2

hbase	1

hello	2

world	1

zookeeper	1

That's it, WordCount is done. Simple, right?
