Look not at your reflection in water but in other people, and good and ill fortune can be discerned.
One trips not over a mountain but over an anthill; it is the small things that must be guarded against.
Related links
HDFS background
- Hadoop Distributed File System (HDFS): quick start
- Hadoop Distributed File System (HDFS): knowledge overview (detailed)
Connecting to a Hadoop cluster
- Connecting Eclipse to a Hadoop cluster
- Connecting IntelliJ IDEA to a Hadoop cluster
HDFS Java API
Hadoop Distributed File System (HDFS) Java interface (HDFS Java API), detailed edition
WordCount program analysis
Writing the WordCount program with the Java API
Running WordCount in Eclipse
File downloads
- WordCount.java (extraction code: 2kwo)
- log4j.properties (extraction code: tpz9)
- data.txt (extraction code: zefp)
Steps
Note: complete all of the steps in "Connecting Eclipse to a Hadoop cluster" before continuing.
- Open Eclipse, click "File" → "New" → "Map/Reduce Project", then click "Next".
- In the dialog that appears, enter a project name, choose the project location, and click "Finish".
- In the src directory of the MapReduce project, create the package cn.neu and click "Finish".
- Copy the downloaded WordCount.java into the cn.neu package (you can simply drag and drop it).
- Using Xftp or another file-transfer tool, copy core-site.xml and hdfs-site.xml from the hadoop/hadoop-2.6.0/etc/hadoop directory under the Hadoop installation on the remote cluster to your local machine.
Copy those two XML files, together with the downloaded log4j.properties, into src.
Note: if you are unsure how these XML files should be configured, see "Installing and Deploying a Hadoop Cluster on Multiple Linux Virtual Machines (detailed edition)".
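For reference, a minimal core-site.xml usually looks like the sketch below. The host name master and port 9000 are assumptions for illustration; use the NameNode address from your own cluster:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- replace with your own NameNode address -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```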
If the two XML files are missing, the run fails with errors like the following:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/G:/hadoop-2.6.0/share/hadoop/common/lib/hadoop-auth-2.6.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/test/input/data.txt
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
at cn.neu.WordCount.main(WordCount.java:60)
- Right-click the HDFS root directory, click "Create new directory", enter test, and click "OK".
Right-click inside the Project Explorer pane and click Refresh; the new directory then appears.
Right-click the test folder and create a directory named input under it; after refreshing it looks as follows.
- Right-click the input directory and choose "Upload files to DFS" (HDFS was formerly also called DFS). Select the downloaded data.txt, click "Open", then refresh the Project Explorer again, as shown below.
- WordCount.java reads two command-line parameters, so run arguments must be configured:
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
Right-click in the code editor and click "Run As" → "Run Configurations".
Click "Arguments", enter the data.txt path from the previous step and the program's final output path, click "Apply", then "Run" to start the program.
Note: do not create an output directory under test before running the program. The output directory must not already exist, or the job fails with a "directory already exists" error.
- The following error may occur (if it does not, skip this step):
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at cn.neu.WordCount.main(WordCount.java:45)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
at java.base/java.lang.String.substring(Unknown Source)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:49)
... 5 more
Click (Shell.java:49) in the stack trace to open the view shown below, then click "Attach Source".
In the next dialog, click "External location" → "External file", navigate along the path shown above to the sources folder, open it, select hadoop-common-2.6.0-sources.jar, click "Open", and finally click "OK".
Click (Shell.java:49) again to view its source and go to line 49, which reads:
private static boolean IS_JAVA7_OR_ABOVE =
    System.getProperty("java.version").substring(0, 3).compareTo("1.7") >= 0;
Taken together with these lines of the error:
at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
at java.base/java.lang.String.substring(Unknown Source)
it is clear that the version string is too short for the substring(0, 3) call: on Java 9 and later, java.version can be as short as "9". The fix is to add the following line at the beginning of the main method:
System.setProperty("java.version", "1.8");
Any value of "1.7" or higher works here.
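The failure is easy to reproduce in plain Java. The sketch below (the class name and the illustrative version string "9" are my own, not from the tutorial) shows why substring(0, 3) throws on a short version string and how the property override avoids it:

```java
// Why Hadoop 2.6.0's Shell class fails on Java 9+: it calls
// System.getProperty("java.version").substring(0, 3), but newer
// JVMs may report a version string shorter than 3 characters.
public class VersionStringDemo {
    public static void main(String[] args) {
        String shortVersion = "9"; // illustrative value for a Java 9+ JVM
        try {
            shortVersion.substring(0, 3); // throws: end index 3 > length 1
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring(0, 3) fails on \"" + shortVersion + "\"");
        }

        // The tutorial's workaround: override the property (to "1.7" or
        // higher) before Hadoop's Shell class is loaded.
        System.setProperty("java.version", "1.8");
        System.out.println(System.getProperty("java.version"));
    }
}
```

This only changes the value of the java.version system property as seen by the running process; it does not change the JVM itself, which is why it is enough to satisfy Shell's version check.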
- If the program runs normally, wait for it to finish, then right-click the test directory under Hadoop in the Project Explorer and click Refresh; an output directory appears inside it. Double-click the part-r-00000 file to view the program's result.
- To run the job again, either change the output directory in the run configuration or delete the files under the existing output path.
A once-and-for-all approach is a small change to the main method: before each run, check whether the output path exists and delete it if it does.
Before:
System.setProperty("HADOOP_USER_NAME", "root");
System.setProperty("java.version", "1.8");
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: WordCount <in> <out>");
    System.exit(2);
}
After:
System.setProperty("HADOOP_USER_NAME", "root");
System.setProperty("java.version", "1.8");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: WordCount <in> <out>");
    System.exit(2);
}
Path outPath = new Path(otherArgs[1]);
if (fs.exists(outPath)) {
    fs.delete(outPath, true); // recursively remove a stale output directory
}
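For intuition, the computation the job performs can be sketched in plain Java without any Hadoop machinery: the map phase tokenizes each line into words, and the reduce phase sums the counts per word. The class name and sample input below are illustrative stand-ins, not the contents of data.txt:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of what the WordCount MapReduce job computes.
public class WordCountCore {
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {                    // "map": tokenize each line
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // "reduce": sum 1s per word
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] data = {"hello world", "hello hadoop"}; // stand-in input
        // Print in the same tab-separated form as part-r-00000.
        count(data).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

In the real job, the tokenizing happens in the Mapper and the summing in the Reducer, with Hadoop shuffling equal keys to the same Reducer between the two phases.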