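Custom Input Formats in MapReduce
Concept:
- The default input format is not always enough. It feeds each line of a file to the map phase as one value, but some jobs need a different unit of input, for example rows read from a database, or an entire file treated as a single record.
Example:
- Suppose we need to read all the small files in a directory and pass the entire content of each file to the mapper as a single value. The map output should use the file name as the key and the whole file content as the value.
Source code walkthrough:
- Source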
public abstract class InputFormat<K, V> {

    /**
     * Logically split the set of input files for the job.
     *
     * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
     * for processing.</p>
     *
     * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
     * input files are not physically split into chunks. For e.g. a split could
     * be <i><input-file-path, start, offset></i> tuple. The InputFormat
     * also creates the {@link RecordReader} to read the {@link InputSplit}.
     *
     * @param context job configuration.
     * @return an array of {@link InputSplit}s for the job.
     */
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    /**
     * Create a record reader for a given split. The framework will call
     * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
     * the split is used.
     * @param split the split to be read
     * @param context the information about the task
     * @return a new record reader
     * @throws IOException
     * @throws InterruptedException
     */
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context)
            throws IOException, InterruptedException;
}
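- Explanation:
- This is the base InputFormat class. It declares two abstract methods: getSplits() and createRecordReader().
- getSplits(): computes the logical splits of the input files; it must be implemented.
- createRecordReader(): creates the RecordReader that defines how the records of a split are actually read; it must be implemented as well.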
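To see how a record reader is used at run time, here is a small Hadoop-free sketch of the loop a map task conceptually runs over one split. `MiniRecordReader`, `LineReader`, and `MapTaskSketch` are hypothetical names made up for illustration, not Hadoop classes. The reader imitates the default line-oriented behavior, with the byte offset of a line as the key and the line itself as the value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free stand-in for the record reader contract.
interface MiniRecordReader<K, V> {
    boolean nextKeyValue(); // advance to the next record; false when the split is exhausted
    K getCurrentKey();
    V getCurrentValue();
}

// Imitates line-oriented reading: key = byte offset of the line, value = the line.
class LineReader implements MiniRecordReader<Long, String> {
    private final String[] lines;
    private int index = -1;
    private long currentOffset = 0; // offset of the current line
    private long nextOffset = 0;    // offset of the line after it

    LineReader(String splitContent) {
        this.lines = splitContent.isEmpty() ? new String[0] : splitContent.split("\n");
    }

    @Override
    public boolean nextKeyValue() {
        if (index + 1 >= lines.length) {
            return false;
        }
        index++;
        currentOffset = nextOffset;
        // +1 accounts for the '\n' that terminated the line
        nextOffset += lines[index].getBytes(StandardCharsets.UTF_8).length + 1;
        return true;
    }

    @Override
    public Long getCurrentKey() {
        return currentOffset;
    }

    @Override
    public String getCurrentValue() {
        return lines[index];
    }
}

class MapTaskSketch {
    // Pulls records until the reader reports that the split is exhausted.
    static List<String> run(MiniRecordReader<Long, String> reader) {
        List<String> seen = new ArrayList<>();
        while (reader.nextKeyValue()) {
            seen.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(new LineReader("one\ntwo\nthree\n")));
    }
}
```

Running `main` prints `[0=one, 4=two, 8=three]`: one record per line. A custom input format changes what counts as a record by supplying a different reader.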
Hands-on steps
Create the sample files:
...........
1.txt
one
...........
2.txt
two
...........
3.txt
three
...........
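Before writing any Hadoop code, it can help to pin down the result we are after. The following plain-Java sketch creates three small sample files in a temporary directory and maps each file name to its entire content. `WholeFileTarget` and the temporary directory are assumptions made for illustration; the real job reads from the input path passed to the driver.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

class WholeFileTarget {
    // Creates the sample files in a temporary directory (hypothetical location).
    static Path createSamples() {
        try {
            Path dir = Files.createTempDirectory("small-files");
            Files.write(dir.resolve("1.txt"), "one".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("2.txt"), "two".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("3.txt"), "three".getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps file name -> entire file content for every regular file in dir.
    static Map<String, String> readWholeFiles(Path dir) {
        Map<String, String> result = new TreeMap<>(); // sorted by file name
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(p)) {
                    String content = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                    result.put(p.getFileName().toString(), content);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(readWholeFiles(createSamples()));
    }
}
```

Running `main` prints `{1.txt=one, 2.txt=two, 3.txt=three}`, which is the key/value shape the MapReduce job should produce: file name as key, whole file as value.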
Write the custom file input format
- Create the new input format class CusFileInputFormat:
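import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * @description
 * @author: LuoDeSong
 * @create: 2019-06-19 11:06:09
 **/
public class CusFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Whether an input file may be split. Each file must reach the reader as a
    // single record, so this returns false. The sample files are also far
    // smaller than 128 MB, so splitting would bring no benefit anyway.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    // The framework calls initialize() on the returned reader before using it,
    // so there is no need to call it here.
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new CusRecordReader();
    }
}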
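* Note: There is no need to change the existing split logic, so we simply extend FileInputFormat, the InputFormat implementation that already knows how to split files. What remains is to define how a file is read, that is, to write our own RecordReader class and return it from createRecordReader().
* Type parameters: each record is an entire file, so no byte offset is needed and the key type is NullWritable; the file is read as raw bytes, so the value type is BytesWritable.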
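The effect of isSplitable() can be illustrated with a rough, simplified model of how many splits a single file yields. The sketch below is not Hadoop's code: `SplitCountSketch` is a hypothetical class, the 128 MB split size is an assumption matching a common block size, and the 1.1 tolerance factor is a simplification of the slack that file-based input formats allow before cutting another split.

```java
class SplitCountSketch {
    static final long SPLIT_SIZE = 128L * 1024 * 1024; // assumed split size: 128 MB
    static final double SLOP = 1.1; // assumed tolerance before another split is cut

    // Simplified model: number of splits produced for one file.
    static int countSplits(long fileLength, boolean splitable) {
        if (fileLength == 0 || !splitable) {
            return 1; // the whole file goes into a single split
        }
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / SPLIT_SIZE > SLOP) {
            splits++;
            remaining -= SPLIT_SIZE;
        }
        if (remaining > 0) {
            splits++; // the tail of the file
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(countSplits(300 * mb, true));  // large file, splittable
        System.out.println(countSplits(300 * mb, false)); // same file, not splittable
        System.out.println(countSplits(3, true));         // tiny file
    }
}
```

Under these assumptions the three calls print 3, 1, and 1. Returning false from isSplitable() guarantees one split per file regardless of its size, which is what a reader that treats the whole file as one record relies on.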
- Create the new RecordReader class CusRecordReader:
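import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * @description
 * @author: LuoDeSong
 * @create: 2019-06-19 11:10:19
 **/
public class CusRecordReader extends RecordReader<NullWritable, BytesWritable> {

    // Job configuration
    private Configuration conf;
    // The split to read; because isSplitable() returns false it covers one whole file
    private FileSplit split;
    // Whether the single record of this split has already been read
    private boolean processed = false;
    // The value handed to the mapper: the raw bytes of the file
    private final BytesWritable bytesWritable = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.split = (FileSplit) split;          // downcast to FileSplit
        this.conf = context.getConfiguration();  // keep the job configuration
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (processed) {
            return false; // the whole file was already returned as one record
        }
        // 1. Allocate a buffer as large as the split, i.e. the whole file
        byte[] data = new byte[(int) split.getLength()];
        // 2. Get the file system that holds the file behind the split
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream input = null;
        try {
            // 3. Read the entire file and expose it as the current value.
            //    Read errors propagate so that the task fails visibly.
            input = fs.open(path);
            IOUtils.readFully(input, data, 0, data.length);
            bytesWritable.set(data, 0, data.length);
        } finally {
            IOUtils.closeStream(input);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return bytesWritable;
    }

    // Progress through the split: 0 before the file is read, 1 afterwards
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // Nothing to release; the stream is closed in nextKeyValue()
    }
}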
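* Note: The class extends RecordReader, the base class for reading records, and overrides its required methods with our own logic; the comments in the code explain each step.
* Type parameters: the same as for the CusFileInputFormat class.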
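The heart of such a reader is a one-shot state machine: the first call to nextKeyValue() loads the whole file and returns true, and every later call returns false. Here is a minimal plain-Java sketch of that pattern, without Hadoop on the classpath. `OneShotFileReader` and its `demo` helper are hypothetical names, and the temporary file stands in for an input split.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OneShotFileReader {
    private final Path file;
    private boolean processed = false; // has the single record been returned?
    private byte[] value = new byte[0];

    OneShotFileReader(Path file) {
        this.file = file;
    }

    // Returns true exactly once, after loading the entire file as the value.
    boolean nextKeyValue() {
        if (processed) {
            return false;
        }
        try {
            value = Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        processed = true;
        return true;
    }

    byte[] getCurrentValue() {
        return value;
    }

    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    // Drives the reader the way a map task would and summarizes what it saw.
    static String demo() {
        try {
            Path file = Files.createTempFile("sample", ".txt");
            Files.write(file, "three".getBytes(StandardCharsets.UTF_8));
            OneShotFileReader reader = new OneShotFileReader(file);
            int records = 0;
            String last = "";
            while (reader.nextKeyValue()) {
                records++;
                last = new String(reader.getCurrentValue(), StandardCharsets.UTF_8);
            }
            Files.delete(file);
            return "records=" + records + ", value=" + last + ", progress=" + reader.getProgress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running `main` prints `records=1, value=three, progress=1.0`: the loop body runs exactly once per file, so the mapper's map() method is called once per file as well.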
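- Create the Mapper class FileMapper:
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * @description
 * @author: LuoDeSong
 * @create: 2019-06-19 11:36:06
 **/
public class FileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    // Output key: the name of the file this map task is reading
    private final Text newKey = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each split is one whole file, so the file name can be resolved once per task
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        String name = inputSplit.getPath().getName();
        newKey.set(name);
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Emit (file name, whole file content)
        context.write(newKey, value);
    }
}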
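- Create the Reducer class FileReducer:
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @description
 * @author: LuoDeSong
 * @create: 2019-06-19 11:42:22
 **/
public class FileReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // Write the first value received for each file name
        context.write(key, values.iterator().next());
    }
}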
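- Create the Driver class:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Driver.class);

        job.setMapperClass(FileMapper.class);
        job.setReducerClass(FileReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        // Use the custom input format instead of the default one
        job.setInputFormatClass(CusFileInputFormat.class);

        // args[0]: input directory, args[1]: output directory (must not exist yet)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}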
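Summary:
- Customizing the input format is a technique worth mastering. The whole process here came down to tracing the framework source code and modeling our own classes on it. Best of luck on your big data journey.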