
Submitting Spark on YARN via Oozie: An Example

Contents: preparation; writing the Spark program; the Oozie workflow. Key point: the official documentation for Oozie's Spark action states that submitting Spark to YARN requires configuring a few specific properties.

Hi everyone, I'm crazy_老中醫 ("crazy old Chinese doctor"). I write programs the way an old-school doctor practices medicine: entirely by feel and experience, but it works!

Enough preamble; on to the main content. This article walks through submitting a Spark program to Hadoop YARN through Oozie.

Preparation

Cluster layout

         hdp-master                              hdp-slave1              hdp-slave2
hadoop   NameNode, DataNode,                     DataNode, NodeManager   DataNode, NodeManager
         SecondaryNameNode,
         ResourceManager, NodeManager
oozie    Y                                       N                       N
spark    Master, Worker                          Worker                  Worker

Hadoop 2.6.0

Install path: /opt/hadoop-2.6.0

Environment variables:

export HADOOP_HOME=/opt/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/opt/hadoop-2.6.0/etc/hadoop

Start Hadoop:

$:cd /opt/hadoop-2.6.0
$:./sbin/start-dfs.sh
$:./sbin/start-yarn.sh

Once started, jps should show the processes listed in the cluster layout above.

Oozie 4.2.0

Install path: /opt/oozie-4.2.0

Environment variables:

export OOZIE_HOME=/opt/oozie-4.2.0
export PATH=$PATH:$OOZIE_HOME/bin

Start Oozie:

$:cd /opt/oozie-4.2.0
$:./bin/oozied.sh start

Spark 1.4.1

Install path: /opt/spark-1.4.1-bin-hadoop2.6

Environment variables:

export SPARK_HOME=/opt/spark-1.4.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin

Start Spark:

$:cd /opt/spark-1.4.1-bin-hadoop2.6
$:./sbin/start-all.sh

Once started, jps should show the processes listed in the cluster layout above.

jps output on the master node (screenshot):


jps output on the slave nodes (screenshot):


Writing the Spark Program

Goal: the input is a Tomcat access log, with entries in the following format:

118.118.118.30 - - [15/Jun/2016:18:34:19 +0800] "GET /WebAnalytics/js/ds.js?v=3 HTTP/1.1" 304 -
118.118.118.30 - - [15/Jun/2016:18:34:19 +0800] "GET /WebAnalytics/maindomain? HTTP/1.1" 200 3198
118.118.118.30 - - [15/Jun/2016:18:34:19 +0800] "GET /WebAnalytics/tracker/1.0/tpv? HTTP/1.1" 200 631
118.118.118.30 - - [15/Jun/2016:18:34:22 +0800] "GET /WebAnalytics/tracker/1.0/bind? HTTP/1.1" 200 631
118.118.118.30 - - [15/Jun/2016:18:34:46 +0800] "GET /WebAnalytics/js/ds.js?v=3 HTTP/1.1" 304 -
           

The program counts the requests per IP address in the log; the result looks like this (screenshot):

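Before the Spark version, the core of the computation can be sketched in plain Java: the client IP is the first space-separated token of each line, and counting amounts to a map from IP to an integer. This is an illustrative sketch, not part of the submitted job; the sample lines mirror the log format above (the 10.0.0.1 line is a hypothetical extra IP).

```java
import java.util.Map;
import java.util.TreeMap;

public class IpCount {
    // The client IP is the first space-separated token of an access-log line.
    static String ip(String line) {
        return line.split(" ")[0];
    }

    public static void main(String[] args) {
        String[] lines = {
            "118.118.118.30 - - [15/Jun/2016:18:34:19 +0800] \"GET /WebAnalytics/js/ds.js?v=3 HTTP/1.1\" 304 -",
            "118.118.118.30 - - [15/Jun/2016:18:34:19 +0800] \"GET /WebAnalytics/maindomain? HTTP/1.1\" 200 3198",
            "10.0.0.1 - - [15/Jun/2016:18:34:22 +0800] \"GET / HTTP/1.1\" 200 631"
        };
        // TreeMap only for deterministic (sorted) output; the Spark job
        // achieves the same aggregation with (ip, 1) pairs and reduceByKey.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(ip(line), 1, Integer::sum);
        }
        System.out.println(counts); // {10.0.0.1=1, 118.118.118.30=2}
    }
}
```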

Full code

package com.simple.spark.oozie.action;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

/**
 * Created by Administrator on 2016/8/25.
 */
public class OozieAction {
    // Usage: OozieAction <input path> <output path> [master]
    public static void main(String[] args) {
        String input = args[0];
        String output = args[1];
        // Default to a local master so the job can also be run outside the cluster.
        String master = "local[*]";
        if (args.length >= 3) {
            master = args[2];
        }

        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("OozieAction - " + System.currentTimeMillis());
        // Passing "none" as the third argument leaves the master to whatever
        // spark-submit (or the Oozie spark action) has already configured.
        if (!"none".equals(master))
            sparkConf.setMaster(master);
        JavaSparkContext context = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = context.textFile(input);
        lines.mapToPair(new PairFunction<String, String, Integer>() {
            // Map each log line to (ip, 1); the IP is the first space-separated token.
            public Tuple2<String, Integer> call(String s) throws Exception {
                String key = s.split(" ")[0];
                return new Tuple2<String, Integer>(key, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            // Sum the per-IP ones into a total count.
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        }).saveAsTextFile(output);
        context.close();
    }
}
           

The Oozie Workflow

The example workflow contains a single spark action node, configured according to the documentation: http://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html#Spark_on_YARN. workflow.xml:

<workflow-app name="Spark_Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-SparkOozieAction"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-SparkOozieAction">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>${jobmaster}</master>
            <mode>${jobmode}</mode>
            <name>${jobname}</name>
            <class>${jarclass}</class>
            <jar>${jarpath}</jar>
            <spark-opts>${sparkopts}</spark-opts>
            <arg>${jararg1}</arg>
            <arg>${jararg2}</arg>
            <arg>${jararg3}</arg>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
           

job.properties

oozie.use.system.libpath=True
oozie.wf.application.path=/user/root/oozie/t1/workflow.xml
security_enabled=False
dryrun=False
jobTracker=hdp-master:8032
nameNode=hdfs://hdp-master:8020
jobmaster=yarn-cluster
jobmode=cluster
jobname=SparkOozieAction
jarclass=com.simple.spark.oozie.action.OozieAction
jarpath=hdfs://hdp-master:8020/user/root/oozie/t1/SparkOozieAction-jar-with-dependencies.jar
sparkopts=--executor-memory 128M --total-executor-cores 2 --driver-memory 256M --conf spark.yarn.jar=hdfs://hdp-master:8020/system/spark/lib/spark-assembly-1.4.1-hadoop2.6.0.jar --conf spark.yarn.historyServer.address=http://hdp-master:18088 --conf spark.eventLog.dir=hdfs://hdp-master:8020/user/spark/applicationHistory --conf spark.eventLog.enabled=true
jararg1=/data/access_log.txt
jararg2=/out/oozie/t1
jararg3=cluster
           
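To see how job.properties and workflow.xml fit together: at submit time Oozie resolves each ${name} placeholder in the workflow against the submitted properties (its real EL engine also supports functions such as wf:errorMessage, which this sketch ignores). A minimal, hypothetical model of that substitution:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParamFill {
    // Replace every ${key} in the template with its value from props;
    // unknown keys are left untouched (real Oozie raises an error instead).
    static String fill(String template, Map<String, String> props) {
        Matcher m = Pattern.compile("\\$\\{([^}]+)\\}").matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String v = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(v));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("jobmaster", "yarn-cluster");
        props.put("jobmode", "cluster");
        props.put("jobname", "SparkOozieAction");
        String xml = "<master>${jobmaster}</master><mode>${jobmode}</mode><name>${jobname}</name>";
        System.out.println(fill(xml, props));
        // <master>yarn-cluster</master><mode>cluster</mode><name>SparkOozieAction</name>
    }
}
```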

The key point of this article: the official documentation for the Oozie Spark action states that, to submit Spark jobs to YARN, the following must be configured.

1. Make sure spark-assembly-1.4.1-hadoop2.6.0.jar is available to Oozie. This puzzled me for a long time; only after reading other articles did I understand it means telling the job which assembly jar to use at submit time. Here that is done through the spark-opts parameter: --conf spark.yarn.jar=hdfs://hdp-master:8020/system/spark/lib/spark-assembly-1.4.1-hadoop2.6.0.jar

2. master may only be yarn-client or yarn-cluster; this article uses yarn-cluster.

3. spark.yarn.historyServer.address=http://SPH-HOST:18088

4. spark.eventLog.dir=hdfs://NN:8020/user/spark/applicationHistory — this directory must be created in advance, or Spark will fail with a directory-does-not-exist error.

5. spark.eventLog.enabled=true
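One practical consequence of passing all Spark options through the single sparkopts property: the string is broken into individual spark-submit arguments essentially by splitting on whitespace, so option values containing spaces will not survive. This is a simplified model of the splitting, not Oozie's exact parsing code; the sparkopts value is shortened from job.properties above.

```java
public class SparkOptsSplit {
    public static void main(String[] args) {
        // Shortened sparkopts value from job.properties.
        String sparkopts = "--executor-memory 128M --total-executor-cores 2 "
                + "--conf spark.eventLog.enabled=true";
        // Simplified model: split on whitespace into separate spark-submit arguments.
        String[] argv = sparkopts.split("\\s+");
        for (String a : argv) {
            System.out.println(a);
        }
    }
}
```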

Upload workflow.xml to HDFS

$:hadoop fs -put workflow.xml /user/root/oozie/t1/workflow.xml
           

Upload the application jar (with dependencies) to HDFS

$:hadoop fs -put SparkOozieAction-jar-with-dependencies.jar /user/root/oozie/t1/
           

Create the spark.eventLog.dir directory

$:hadoop fs -mkdir -p /user/spark/applicationHistory
           


Submit the job to Oozie

$:cd /opt/oozie-4.2.0
$:./bin/oozie job -oozie http://hdp-master:11000/oozie -config /opt/myapps/spark/t1/job.properties -run
           

Check the job's status in the Oozie web UI at http://hdp-master:11000/oozie/ (screenshot):


Check how the Spark application runs on YARN in the YARN web UI (screenshot):


That's it: the Spark program has been submitted to Hadoop YARN through Oozie. Thanks for reading!
