Hadoop PathFilter: filtering input files

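1. Specifying multiple inputs

Processing a batch of files in a single operation is a common requirement. For example, a MapReduce job that processes logs may need to analyze a month of log files spread across a large number of directories. Rather than listing every file and directory to specify the input, it is more convenient to match multiple files with wildcards in a single expression (globbing). Hadoop provides two FileSystem methods for globbing: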

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

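Notes:

globStatus() returns an array of FileStatus objects for all files whose paths match the given pattern, sorted by path. The wildcards Hadoop supports are the same as those of Unix bash.
The second method takes a PathFilter as an additional argument, which can restrict the matches further. PathFilter is an interface with a single method, accept(Path path).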
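To get a feel for the wildcards before touching a cluster, here is a minimal sketch that uses only the JDK. It relies on java.nio's glob matcher as a stand-in, on the assumption that it behaves like Hadoop's globbing for the simple patterns shown (`*`, `?`, `{a,b}`); the two syntaxes are similar but not identical, so this is an illustration rather than Hadoop's own implementation. The class name GlobSketch and the file names are made up for the example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlobSketch {
    // Returns the names that match the glob pattern, keeping the input order.
    // Stand-in only: java.nio globbing is assumed to agree with Hadoop for these simple patterns.
    static List<String> match(String glob, List<String> names) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.txt", "b.txt", "c.aaa", "c.txt", "cc.aaa");
        System.out.println(match("*.txt", names));        // [a.txt, b.txt, c.txt]
        System.out.println(match("c.*", names));          // [c.aaa, c.txt]
        System.out.println(match("?.{txt,aaa}", names));  // [a.txt, b.txt, c.aaa, c.txt]
    }
}
```

Note that `?` matches exactly one character, which is why cc.aaa is not picked up by the last pattern.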

PathFilter example

RegexExcludePathFilter.java

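import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Rejects every path whose full string form matches the given regular expression.
class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // String.matches() tests the whole path string, not a substring.
        return !path.toString().matches(regex);
    }
}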

Note: this class implements the PathFilter interface and overrides its accept method.
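One detail of the accept method is easy to miss: String.matches() only succeeds when the regular expression matches the entire string, and the string here is the full path including the hdfs:// prefix. The sketch below reproduces just that expression in plain Java so it can be run without Hadoop; the class name MatchesSketch, the helper keep() and the sample paths are made up for illustration.

```java
public class MatchesSketch {
    // Same expression as in accept(): true means the path is kept.
    static boolean keep(String path, String regex) {
        return !path.matches(regex);
    }

    public static void main(String[] args) {
        String base = "hdfs://master:9000/user/hadoop/test/";
        System.out.println(keep(base + "a.txt", ".*txt"));      // false: the whole path matches, so it is dropped
        System.out.println(keep(base + "c.aaa", ".*txt"));      // true: kept
        System.out.println(keep(base + "a.txt", "txt"));        // true: "txt" alone does not match the whole path
        System.out.println(keep(base + "a.txt.bak", ".*txt"));  // true: the path does not end in "txt"
        System.out.println(keep(base + "notxt", ".*txt"));      // false: ".*txt" does not require a dot
    }
}
```

As the last line shows, ".*txt" drops anything ending in "txt". If only the .txt extension should be excluded, a stricter pattern such as ".*\\.txt" is probably what is wanted.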

Using this filter:

// Using a glob pattern together with a PathFilter
public static void list() throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The PathFilter drops paths that do not satisfy its condition; here, paths ending in "txt" are filtered out.
    FileStatus[] status = fs.globStatus(new Path("hdfs://master:9000/user/hadoop/test/*"), new RegexExcludePathFilter(".*txt"));
    // Without the filter:
    // FileStatus[] status = fs.globStatus(new Path("hdfs://master:9000/user/hadoop/test/*"));
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
        System.out.println(p);
    }
}

Without the filter,

FileStatus[] status = fs.globStatus(new Path("hdfs://master:9000/user/hadoop/test/*"));

the output is:

hdfs://master:/user/hadoop/test/a.txt
hdfs://master:/user/hadoop/test/b.txt
hdfs://master:/user/hadoop/test/c.aaa
hdfs://master:/user/hadoop/test/c.txt
hdfs://master:/user/hadoop/test/cc.aaa

With the filter,

FileStatus[] status = fs.globStatus(new Path("hdfs://master:9000/user/hadoop/test/*"),new RegexExcludePathFilter(".*txt"));

the output is:

hdfs://master:/user/hadoop/test/c.aaa
hdfs://master:/user/hadoop/test/cc.aaa

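As this shows, a PathFilter adds a further restriction after the glob pattern has matched: any path that the filter rejects is removed from the result. With RegexExcludePathFilter, the rejected paths are the ones that match the regular expression.

This follows from the line in the accept method: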

return !path.toString().matches(regex);

which rejects every path that matches the regex. If it is changed to

return path.toString().matches(regex);

then the paths that match regex are kept and the ones that do not match are removed.
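The difference between the two variants can be tried out without a cluster. The following is a minimal sketch in plain Java: NameFilter is a hypothetical stand-in with the same shape as PathFilter.accept(Path), operating on strings, and the list of paths is assumed to be what the glob stage already returned (the five sample files from above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterSketch {
    // Hypothetical stand-in for PathFilter: one method that decides whether a path is kept.
    interface NameFilter {
        boolean accept(String path);
    }

    // Variant with the negation: paths matching the regex are removed.
    static NameFilter exclude(String regex) {
        return path -> !path.matches(regex);
    }

    // Variant without the negation: only paths matching the regex are kept.
    static NameFilter include(String regex) {
        return path -> path.matches(regex);
    }

    // Second stage: apply the filter to paths that already matched the glob.
    static List<String> apply(List<String> globbed, NameFilter filter) {
        List<String> kept = new ArrayList<>();
        for (String path : globbed) {
            if (filter.accept(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> globbed = Arrays.asList("a.txt", "b.txt", "c.aaa", "c.txt", "cc.aaa");
        System.out.println(apply(globbed, exclude(".*txt")));  // [c.aaa, cc.aaa]
        System.out.println(apply(globbed, include(".*txt")));  // [a.txt, b.txt, c.txt]
    }
}
```

Since the real PathFilter also has a single abstract method, the same lambda style should work for it too, although a named class like RegexExcludePathFilter is easier to reuse.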

PathFilter example 2

public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        Job job = Job.getInstance(conf);

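        // Use the filter to drop the files we do not want.
        // NOTE: the argument index was lost in the source; args[0] (the input glob) is an assumption.
        FileStatus[] status = fs.globStatus(new Path(args[0]), new RegexExcludePathFilter(".*txt"));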
        Path[] listedPaths = FileUtil.stat2Paths(status);

        job.setJarByClass(this.getClass());
        job.setJobName("SumStepByTool");
        job.setInputFormatClass(TextInputFormat.class); // this is the default input format

        job.setMapperClass(SumStepByToolMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(InfoBeanMy.class);

        job.setReducerClass(SumStepByToolReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBeanMy.class);
        //job.setNumReduceTasks();

        // Process different input files with different Mappers
//      MultipleInputs.addInputPath(job, new Path(args[]), TextInputFormat.class, SumStepByToolMapper.class);
//      MultipleInputs.addInputPath(job, new Path(args[]), TextInputFormat.class, SumStepByToolWithCommaMapper.class);
        FileInputFormat.setInputPaths(job, listedPaths);
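        // NOTE: the argument index was lost in the source; args[1] (the output directory) is an assumption.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));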


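        // NOTE: the exit-code literals were lost in the source; 0 on success and -1 on failure is an assumption.
        return job.waitForCompletion(true) ? 0 : -1;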
    }
