
Hadoop-MapReduce-OutputFormat Data Output

OutputFormat Data Output

The implementing classes of OutputFormat are as follows:

(Figure: common OutputFormat implementation classes.)
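
For context (not shown in the original figure): TextOutputFormat is the default implementation, and SequenceFileOutputFormat is another built-in one. Switching a job to a different implementation is a single call on the Job object, as in this minimal sketch:

// Minimal sketch: selecting a built-in OutputFormat for the job.
// TextOutputFormat (the default) writes key/value pairs as text lines;
// SequenceFileOutputFormat writes them as a binary SequenceFile.
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);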

Use cases and steps for a custom OutputFormat

(Figure: custom OutputFormat use cases and steps.)

In short, a custom OutputFormat is used when the default output does not fit the requirements, for example when different records must go to different files. The steps are: (1) subclass FileOutputFormat, and (2) return a custom RecordWriter from getRecordWriter() and implement its write() and close() methods; this is exactly what the hands-on example below does.

Hands-on example:

Goal: filter the input file, writing every line that contains caocao to caocao.txt and all other lines to other.txt.

Input data:

caocao shi wei wu di
liubei shi shu zhao lie di
liubei shi shu guo de
caocao shi wei guo de 
caozhi shi caocao de er zi

The Mapper is as follows:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DefinedOutputFormatMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    Text k = new Text();
    String line = "\r\n";

    @Override
    protected void map (LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String keyStr = value.toString();
        // Without appending \r\n here, the lines would not be separated:
        // the custom RecordWriter below writes the raw bytes with no line break
        k.set(keyStr + line);

        context.write(k, NullWritable.get());
    }
}

The Reducer is as follows:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DefinedOutputFormatReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce (Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {

        // Iterate over the values so that duplicate lines are all emitted instead of being collapsed into one
        for (NullWritable value : values) {
            context.write(key, NullWritable.get());
        }
    }
}

The custom OutputFormat implementation is as follows:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DefinedOutputFormat extends FileOutputFormat<Text, NullWritable> {

    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter (TaskAttemptContext context) throws IOException, InterruptedException {

        return new DefinedOutputRecordWriter(context);
    }
}

The custom RecordWriter is as follows:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DefinedOutputRecordWriter extends RecordWriter<Text, NullWritable> {

    FSDataOutputStream fosCaocao;
    FSDataOutputStream fosOther;

    public DefinedOutputRecordWriter (TaskAttemptContext context) {

        try {
            // 1. Get the file system
            FileSystem fileSystem = FileSystem.get(context.getConfiguration());

            // 2. Create the output stream for caocao.txt
            fosCaocao = fileSystem.create(new Path("/home/lxj/hadoop-data/output/definedOutput/caocao.txt"));

            // 3. Create the output stream for other.txt
            fosOther = fileSystem.create(new Path("/home/lxj/hadoop-data/output/definedOutput/other.txt"));

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write (Text text, NullWritable nullWritable) throws IOException, InterruptedException {

        // If the key contains caocao, write it to caocao.txt; otherwise write it to other.txt
        if (text.toString().contains("caocao")) {
            fosCaocao.write(text.toString().getBytes());
        } else {
            fosOther.write(text.toString().getBytes());
        }
    }

    @Override
    public void close (TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {

        IOUtils.closeStream(fosCaocao);
        IOUtils.closeStream(fosOther);
    }
}
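
One design caveat: the two target files above are hard-coded absolute paths. A possible alternative (a hypothetical variant, not from the original article) is to derive them from the job's configured output directory, replacing steps 2 and 3 in the constructor:

// Hypothetical variant of constructor steps 2 and 3: build the file paths
// from the job's configured output directory instead of hard-coding them.
Path outputDir = org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputPath(context);
fosCaocao = fileSystem.create(new Path(outputDir, "caocao.txt"));
fosOther = fileSystem.create(new Path(outputDir, "other.txt"));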

Add the following to the Driver:

// Set the custom OutputFormat
job.setOutputFormatClass(DefinedOutputFormat.class);
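
For completeness, a full Driver might look like the sketch below (the class name and the input/output paths are assumptions for illustration, not from the original article). Note that FileOutputFormat.setOutputPath() is still required even though the RecordWriter writes its own files, because the output committer still needs a directory for the _SUCCESS marker:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DefinedOutputFormatDriver {

    public static void main (String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(DefinedOutputFormatDriver.class);
        job.setMapperClass(DefinedOutputFormatMapper.class);
        job.setReducerClass(DefinedOutputFormatReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the custom OutputFormat
        job.setOutputFormatClass(DefinedOutputFormat.class);

        // Input path (assumed for illustration)
        FileInputFormat.setInputPaths(job, new Path("/home/lxj/hadoop-data/input"));
        // An output directory is still required: the committer writes _SUCCESS here,
        // while the actual data goes to the files created by DefinedOutputRecordWriter
        FileOutputFormat.setOutputPath(job, new Path("/home/lxj/hadoop-data/output/definedOutput"));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}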

The run result is as follows: the output directory contains caocao.txt, holding the three lines that contain caocao, and other.txt, holding the remaining two lines.

(Figure: screenshot of the run result.)
