MapReduce in Practice (1)

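(1) A user program has three parts: a Mapper, a Reducer, and a Driver (the client that submits and runs the MR program).

(2) The Mapper's input is key-value (KV) pairs; the key and value types can be customized.

(3) The Mapper's output is also KV pairs; the key and value types can be customized.

(4) The Mapper's business logic goes in the map() method.

(5) The map() method (in the MapTask process) is called once for each <K,V> pair.

(6) The Reducer's input types match the Mapper's output types, and are also KV pairs.

(7) The Reducer's business logic goes in the reduce() method.

(8) The ReduceTask process calls reduce() once for each group of <k,v> pairs that share the same k.

(9) User-defined Mapper and Reducer classes must extend their respective parent classes.

(10) The whole program needs a Driver to submit it; what gets submitted is a Job object that describes all the necessary information.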
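To make the calling contract in points (5) and (8) concrete, here is a minimal in-memory sketch. It deliberately uses no Hadoop classes: `MiniMapReduce`, `map`, `reduce`, and `run` are hypothetical names for illustration, and the grouping step only imitates what the framework does between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of the MapReduce contract; no Hadoop classes are used.
class MiniMapReduce {

    // Stand-in for Mapper.map(): called once per input record (here, one line).
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Stand-in for Reducer.reduce(): called once per key with all of that key's values.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: feed records to map(), group by key, call reduce() per group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline, assuming single-byte characters
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

In a real job the grouping happens across machines during the shuffle, but the shape is the same: one map() call per record, one reduce() call per distinct key.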

Requirement: given a set of text files, count and output the total number of occurrences of each word.

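(1) Define a mapper class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// First, fix the four generic type parameters:
// KEYIN: LongWritable, VALUEIN: Text
// KEYOUT: Text, VALUEOUT: IntWritable
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Lifecycle of map(): the framework calls it once for each line it passes in
    // key: byte offset of the start of this line within the file
    // value: the content of this line
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Convert the line to a String
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        // Iterate over the array and emit <word, 1>
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}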
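One detail of the mapper worth checking is `line.split(" ")`: in Java this yields an empty string between two consecutive spaces and does not split on tabs, so the job would count "" as a word. The sketch below shows one possible alternative; `Tokenizer` and `tokenize` are hypothetical names, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

class Tokenizer {

    // Splits on runs of whitespace (spaces, tabs, ...) and drops empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "hello  world\thadoop";
        // Single-space split: "hello", "", "world\thadoop"
        System.out.println(line.split(" ").length); // prints 3
        System.out.println(tokenize(line));         // prints [hello, world, hadoop]
    }
}
```

Inside map(), the loop could iterate over `tokenize(value.toString())` instead of the raw split result if the input is not guaranteed to be single-space separated.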

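(2) Define a reducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Lifecycle: the framework calls reduce() once for each KV group it passes in
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Define a counter
        int count = 0;
        // Add every value in this group to count, then emit the total once
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}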

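(3) Define a main class that describes the job and submits it

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountRunner {

    // Describe the business logic (which class is the mapper, which is the reducer,
    // where the input data is, where the output goes, ...) as a Job object,
    // then submit that job to the cluster to run
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf);

        // Specify the jar that contains this job
        // wcjob.setJar("/home/hadoop/wordcount.jar");
        wcjob.setJarByClass(WordCountRunner.class);

        wcjob.setMapperClass(WordCountMapper.class);
        wcjob.setReducerClass(WordCountReducer.class);

        // Output key and value types of our Mapper
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(IntWritable.class);

        // Output key and value types of our Reducer
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(IntWritable.class);

        // Location of the data to process
        FileInputFormat.setInputPaths(wcjob, "hdfs://hdp-server01:9000/wordcount/data/big.txt");

        // Location where the results are saved once processing finishes
        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://hdp-server01:9000/wordcount/output/"));

        // Submit the job to the YARN cluster and wait for it to finish
        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}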

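Local run mode:

(1) The MapReduce program is submitted to LocalJobRunner and runs locally as a single process.

(2) The input data and the output can live on the local file system or on HDFS.

(3) How do you run locally? Write the program without the cluster's configuration files on the classpath. What actually decides the mode is whether the job's conf has mapreduce.framework.name=local (the default value) and whether yarn.resourcemanager.hostname is set.

(4) Local mode is very convenient for debugging business logic: just set breakpoints in Eclipse.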
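Point (3) can be pictured with a toy model. This is not Hadoop's own code: a plain `Map` stands in for `Configuration`, and `RunModeSketch` and `runnerFor` are hypothetical names. The only facts it relies on are the property name `mapreduce.framework.name` and its shipped default of `local`.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a plain Map stands in for Hadoop's Configuration.
class RunModeSketch {

    // Returns which runner this toy model would pick for the given settings.
    static String runnerFor(Map<String, String> conf) {
        // When no cluster config files are loaded, the property is unset and the default applies.
        String framework = conf.getOrDefault("mapreduce.framework.name", "local");
        switch (framework) {
            case "local":
                return "LocalJobRunner";
            case "yarn":
                return "YARN";
            default:
                throw new IllegalArgumentException("unsupported in this sketch: " + framework);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(runnerFor(conf));           // prints LocalJobRunner
        conf.put("mapreduce.framework.name", "yarn");
        System.out.println(runnerFor(conf));           // prints YARN
    }
}
```

This is why a driver run from the IDE with no cluster XML files on the classpath ends up in local mode without any extra code.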

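To run local mode on Windows for testing program logic, set these environment variables on Windows:

%HADOOP_HOME% = d:/hadoop-2.6.1

%PATH% = %HADOOP_HOME%\bin

In addition, replace the lib and bin directories under d:/hadoop-2.6.1 with versions compiled for the Windows platform.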

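Cluster run mode:

(1) The MapReduce program is submitted to the YARN cluster's ResourceManager and distributed to many nodes for concurrent execution.

(2) The input data and the output should be on the HDFS file system.

(3) Ways to submit to the cluster:

A. Package the program as a JAR, then launch it with the hadoop command on any node of the cluster:

$ hadoop jar wordcount.jar cn.itcast.bigdata.mrsimple.WordCountDriver inputpath outputpath

B. Run the main method directly from Eclipse on Linux

(the project must carry the setting mapreduce.framework.name=yarn plus the two basic YARN settings).

C. To submit the job to the cluster from Eclipse on Windows, the YarnRunner class has to be modified.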
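The command in step A passes inputpath and outputpath on the command line, whereas the WordCountRunner shown earlier hard-codes its HDFS paths. One way a driver might accept both styles is sketched below; `DriverPaths` and `resolve` are hypothetical helper names, and the defaults are simply the paths copied from that driver.

```java
class DriverPaths {

    // Defaults copied from the hard-coded paths in the WordCountRunner example.
    static final String DEFAULT_IN = "hdfs://hdp-server01:9000/wordcount/data/big.txt";
    static final String DEFAULT_OUT = "hdfs://hdp-server01:9000/wordcount/output/";

    // Returns {input, output}: taken from args when both are given, otherwise the defaults.
    static String[] resolve(String[] args) {
        if (args.length >= 2) {
            return new String[] { args[0], args[1] };
        }
        return new String[] { DEFAULT_IN, DEFAULT_OUT };
    }

    public static void main(String[] args) {
        String[] paths = resolve(args);
        System.out.println("input=" + paths[0] + " output=" + paths[1]);
    }
}
```

In the driver, `paths[0]` would go to `FileInputFormat.setInputPaths` and `new Path(paths[1])` to `FileOutputFormat.setOutputPath`.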

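[Figure: overall flow of a MapReduce program running on the cluster]

[Figure: appendix, a second method for changing your own user identity when accessing Hadoop from the Windows platform]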

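(1) The combiner is a component of an MR program in addition to the Mapper and the Reducer.

(2) The combiner's parent class is Reducer.

(3) A combiner differs from a reducer in where it runs:

The combiner runs on the node of each MapTask.

The Reducer receives the output of all Mappers globally.

(4) The purpose of a combiner is to aggregate each MapTask's output locally and so reduce the amount of data sent over the network.

Implementation steps:

1. Define a custom combiner that extends Reducer and override the reduce method.

2. Set it on the job: job.setCombinerClass(CustomCombiner.class)

(5) A combiner can only be used if it does not change the final business result.

Also, the combiner's output KV types must match the reducer's input KV types.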
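Point (5) is easiest to see with numbers. The plain-Java sketch below (no Hadoop classes; `CombinerSketch` and the sample values are made up for illustration) compares reducing all values at once with pre-aggregating per map task first. Summing survives the extra step; averaging does not.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSketch {

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Values emitted for one key by two hypothetical map tasks.
        List<Integer> task1 = Arrays.asList(1, 1, 1);
        List<Integer> task2 = Arrays.asList(1);

        // Sum: combining per task first gives the same answer as reducing everything at once.
        int direct = sum(Arrays.asList(1, 1, 1, 1));
        int combined = sum(Arrays.asList(sum(task1), sum(task2)));
        System.out.println(direct + " " + combined); // prints 4 4

        // Average: the average of per-task averages is not the global average.
        double directAvg = average(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        double combinedAvg = average(Arrays.asList(
                average(Arrays.asList(1.0, 2.0, 3.0)),
                average(Arrays.asList(10.0))));
        System.out.println(directAvg + " " + combinedAvg); // prints 4.0 6.0
    }
}
```

For word count the reduce logic is a sum, so under this reasoning the reducer class itself could plausibly serve as the combiner; an averaging job would need to carry (sum, count) pairs through the combiner instead.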

This article is reposted from yushiwh's 51CTO blog. Original link: http://blog.51cto.com/yushiwh/1913043. To republish it, please contact the original author.
