
Running Spark, Part 2: spark-submit

spark-submit, illustrated with KMeans.

Local mode:

Use setMaster("local") and run directly in IDEA (right-click, then Run).

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering._
import org.apache.spark.mllib.linalg.Vectors

object test {

  def main(args: Array[String]): Unit = {
    // 1. Build the Spark context; "local" matches the local mode described above
    val conf = new SparkConf().setMaster("local").setAppName("KMeans").set("spark.driver.memory", "512m").set("spark.executor.memory", "512m")
    val sc = new SparkContext(conf)
    println("mode:" + sc.master)
    // Silence the verbose INFO logging, keeping warnings and errors
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    // 2. Read the sample data: one space-separated dense vector per line
    val data = sc.textFile("file:///test/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    // 3. Train a KMeans model: k = 2 clusters, at most 20 iterations
    val model = KMeans.train(parsedData, 2, 20)

    // Print the cluster centers
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
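Once trained, the model can also be scored and used for prediction. Below is a minimal sketch using two standard MLlib KMeansModel methods, computeCost and predict; these lines are an addition to the original example and would go just before sc.stop():

// Within-set sum of squared errors: lower means tighter clusters
val wssse = model.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $wssse")

// Assign a new point to its nearest cluster center
val cluster = model.predict(Vectors.dense(0.1, 0.1, 0.1))
println(s"(0.1, 0.1, 0.1) belongs to cluster $cluster")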
           

The data can also be read from HDFS (using the namenode address that appears later in this post):

val data = sc.textFile("hdfs://192.168.12.129:9000/test/kmeans_data.txt")
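Either way, the input format is the same: kmeans_data.txt holds one space-separated vector per line. A hypothetical sample with two obvious clusters:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

A path with no scheme, e.g. sc.textFile("/test/kmeans_data.txt"), resolves against fs.defaultFS from the Hadoop configuration.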

YARN mode:

First build a jar; either IDEA or sbt works (an sbt sketch follows the IDEA steps below).


In IDEA, choose Build -> Build Artifacts..., then test -> Build.

The corresponding jar file is then generated in a subdirectory of the project directory:

Find test.jar under /out/artifacts, next to the src directory.

Copy it to an easy-to-find location, e.g. /export/spark_jar/.
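For the sbt route mentioned above, here is a minimal build.sbt sketch; the Spark 2.3.1 / Scala 2.11 versions are inferred from paths elsewhere in this post, so adjust them to your environment:

name := "test"

version := "0.1"

scalaVersion := "2.11.12"

// "provided" keeps the Spark jars out of the artifact, since the cluster supplies them
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.1" % "provided"
)

Running sbt package then drops the jar under target/scala-2.11/; either jar can be copied to /export/spark_jar/ and submitted as below.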

spark-submit --class test --master yarn file:///export/spark_jar/test.jar

Results:

(screenshots of the job output, showing the printed cluster centers)

spark-submit command parameters:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --queue thequeue \
  examples/target/scala-2.11/jars/spark-examples*.jar 10

Errors encountered:

1. Out of memory:

diagnostics: Application application_1555606869295_0001 failed 2 times due to AM Container for appattempt_1555606869295_0001_000002 exited with exitCode: -103

Failing this attempt.Diagnostics: [2019-04-19 01:02:32.548]Container [pid=18294,containerID=container_1555606869295_0001_02_000001] is running 125774336B beyond the 'VIRTUAL' memory limit. Current usage: 75.3 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.

Dump of the process-tree for container_1555606869295_0001_02_000001 :

The container wanted 2.2 GB of memory, but the virtual memory limit was only 2.1 GB, so YARN killed the container.

My spark.executor.memory was set to 1 GB, i.e. 1 GB of physical memory. YARN's default virtual-to-physical memory ratio is 2.1, which puts the virtual memory limit at 2.1 GB, less than the 2.2 GB needed. The fix is to raise the virtual-to-physical ratio.

Fix:

Add a setting to yarn-site.xml; with the ratio at 2.5, the 1 GB container gets a 2.5 GB virtual limit, comfortably above the 2.2 GB it wanted:

<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.5</value>
</property>

Alternatively, modify Hadoop's yarn-site.xml to skip the memory checks entirely:

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

2. Warning: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

Cause:

When Spark submits a job to YARN, it has to upload the Spark jars under SPARK_HOME to the cluster each time. This is only a warning and can be ignored.

Fix:

Upload the Spark jars to HDFS once and point the configuration at that path.

(1) Upload:

hadoop fs -mkdir -p /spark/jars

hadoop fs -ls /

hadoop fs -put /export/servers/spark-2.3.1-bin-hadoop2.7/jars/* /spark/jars/

(2) In spark-defaults.conf under Spark's conf directory, add:

spark.yarn.jars hdfs://192.168.12.129:9000/spark/jars/*

(3) Rerun, and the warning is gone:

spark-shell --master yarn --deploy-mode client
