Table of Contents

- 1. Requirements Analysis
- 2. Environment Setup
- 3. Writing the Program
- 4. Local Testing
- 5. Cluster Testing
1. Requirements Analysis

Requirement: count and output the total number of occurrences of each word in a given text file. The work splits into the three standard MapReduce pieces below; a small worked example of the <k,v> flow follows the list.
- Following the MapReduce programming conventions, write the Mapper:
  (1) Convert the text content that the MapTask hands us into a String.
  (2) Split the line into words on spaces.
  (3) Output each word as a <k,v> pair.
- Following the MapReduce programming conventions, write the Reducer:
  (1) Sum the counts for each key.
  (2) Output the total count for each key.
- Following the MapReduce programming conventions, write the Driver:
  (1) Obtain the configuration information and get a job object instance.
  (2) Specify the local path where this program's jar package lives.
  (3) Associate the Mapper and Reducer business classes.
  (4) Specify the <k,v> types of the Mapper's output data.
  (5) Specify the <k,v> types of the final output data.
  (6) Specify the directory containing the job's raw input files.
  (7) Specify the directory for the job's output results.
  (8) Submit the job.
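To make the three stages concrete, here is how a hypothetical two-line input file (the lines are invented for illustration; any whitespace-separated text behaves the same) moves through them:

```
Input lines           Map output               Shuffle (grouped)     Reduce output
hello easysir   -->   <hello,1> <easysir,1>    <easysir,[1]>    -->  easysir  1
hello haha      -->   <hello,1> <haha,1>       <haha,[1]>       -->  haha     1
                                               <hello,[1,1]>    -->  hello    2
```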
2. Environment Setup

- Create a Maven project named mrWordCount
- Add the following dependencies to the pom.xml file:
```xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.8.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>
```
- In the project's src/main/resources directory, create a new file named log4j.properties and fill it with:
```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
```
3. Writing the Program

This article uses IDEA for the steps below:
- Create the package com.easysir.wordcount
- Create the WordcountMapper class:
```java
package com.easysir.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// The LongWritable input key is the byte offset of the current line in the input file
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused output objects, so we do not allocate new ones for every record
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split on spaces
        String[] words = line.split(" ");
        // 3. Emit <word, 1> for each word
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
```
- Create the WordcountReducer class:
```java
package com.easysir.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. Sum all counts for this key
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Emit <word, total>
        v.set(sum);
        context.write(key, v);
    }
}
```
- Create the WordcountDriver class:
```java
package com.easysir.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get configuration info and create the job instance
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar load path
        job.setJarByClass(WordcountDriver.class);
        // 3. Set the map and reduce classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        // 4. Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output kv types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```
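One optional addition, not part of the original driver: because word counting is plain addition (associative and commutative), the reducer class can double as a combiner, pre-aggregating <word,1> pairs on the map side to cut shuffle traffic. A minimal sketch, placed alongside the other job.set* calls:

```java
// Optional: pre-aggregate counts on the map side before the shuffle.
// Safe for WordCount because summation is associative and commutative.
job.setCombinerClass(WordcountReducer.class);
```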
4. Local Testing

- Fill in the path arguments: the program expects the input path as the first argument and the output path as the second, as wired up in the driver.
  Note: the output directory must not already exist, otherwise the job fails with an error.
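If you would rather not delete the output directory by hand between runs, a common workaround (not part of the original driver; the variable names follow the driver code above) is to remove it programmatically before setting the output path:

```java
import org.apache.hadoop.fs.FileSystem;

// Place before FileOutputFormat.setOutputPath(...) in the driver
FileSystem fs = FileSystem.get(configuration);
Path outputPath = new Path(args[1]);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = delete recursively
}
```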
- Run the program
- Check the results:

```
easysir  2
haha     2
heihei   1
hello    2
nihao    1
wanghu   1
```
5. Cluster Testing

- Add the Maven packaging plugins to pom.xml; note that the mainClass entry must match the actual path of your WordcountDriver:
```xml
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.easysir.wordcount.WordcountDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
- Package the program into a jar
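If you package from the terminal rather than through the IDEA Maven panel, the standard invocation is:

```bash
# Produces both the plain jar and the jar-with-dependencies under target/
mvn clean package
```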
- Copy the jar to the Hadoop cluster; pick the jar without bundled dependencies, since the cluster already provides the Hadoop libraries
- Start the Hadoop cluster
- Run the WordCount program:
```bash
# hadoop jar <jar package> <main class> <input path> <output path>
hadoop jar ./mrWordCount-1.0-SNAPSHOT.jar com.easysir.wordcount.WordcountDriver /2020 /output
```
- Check the results:
```bash
hadoop fs -cat /output/part-r-00000
```