spark-submit, using KMeans as an example
Local mode:
Use setMaster("local") and run directly in IDEA (right-click → Run).
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering._
import org.apache.spark.mllib.linalg.Vectors

object test {
  def main(args: Array[String]): Unit = {
    // 1. Build the Spark context ("local" here; the YARN section below submits with --master yarn)
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("KMeans")
      .set("spark.driver.memory", "512m")
      .set("spark.executor.memory", "512m")
    val sc = new SparkContext(conf)
    println("mode:" + sc.master)
    // Suppress the noisy INFO/WARN log output
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    // 2. Read the sample data: one point per line, space-separated values
    val data = sc.textFile("file:///test/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
    // 3. Build and train the KMeans model (k = 2, up to 20 iterations)
    val model = KMeans.train(parsedData, 2, 20)
    // Print the cluster centers
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
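MLlib's KMeans.train does the heavy lifting above; its core loop alternates between assigning each point to its nearest center and recomputing the centers. The assignment step can be sketched without Spark at all (a minimal illustration; the names below are ours, not MLlib API):

```scala
// Illustrative sketch of the KMeans assignment step (not MLlib code).
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to point p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  def main(args: Array[String]): Unit = {
    val centers = Seq(Array(0.0, 0.0), Array(9.0, 9.0))
    val points  = Seq(Array(0.1, 0.2), Array(8.8, 9.1))
    // Each point is assigned to its nearest center's index.
    points.foreach(p => println(closest(centers, p)))
  }
}
```

In the full algorithm this assignment runs distributed over the RDD, followed by averaging each cluster's points to move the centers, repeated up to the iteration limit (20 above).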
You can also read the file from HDFS instead (note the hdfs:// scheme and NameNode address):
val data = sc.textFile("hdfs://192.168.12.129:9000/test/kmeans_data.txt")
YARN mode:
First package the project into a jar, using either IDEA or sbt.
In IDEA, choose Build → Build Artifacts… → test → Build;
the jar is then generated under the project's output directory:
find test.jar in the /out/artifacts folder next to src,
and copy it somewhere easy to reference, e.g. /export/spark_jar/
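If you package with sbt instead, a minimal build.sbt might look like the following (the Scala and Spark versions are assumptions matching the spark-2.3.1-bin-hadoop2.7 install used later in this post; adjust to your cluster):

```scala
// build.sbt — minimal packaging sketch; versions are assumptions, adjust to your cluster
name := "test"
version := "0.1"
scalaVersion := "2.11.12"

// "provided": compile against Spark, but keep it out of the jar —
// spark-submit supplies the Spark classes at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.3.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.3.1" % "provided"
)
```

Running `sbt package` then writes the jar under target/scala-2.11/, which you can copy out the same way.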
[[email protected] ~]# spark-submit --class test --master yarn file:///export/spark_jar/test.jar
Result: the application runs on YARN and prints the cluster centers, as in local mode.
spark-submit command parameters, e.g.:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--queue thequeue \
examples/target/scala-2.11/jars/spark-examples*.jar 10
Errors encountered:
1. Out of memory:
diagnostics: Application application_1555606869295_0001 failed 2 times due to AM Container for appattempt_1555606869295_0001_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2019-04-19 01:02:32.548]Container [pid=18294,containerID=container_1555606869295_0001_02_000001] is running 125774336B beyond the 'VIRTUAL' memory limit. Current usage: 75.3 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1555606869295_0001_02_000001 :
The container needed 2.2 GB of virtual memory but the limit was 2.1 GB, so YARN killed the container.
spark.executor.memory was set to 1 GB, i.e. 1 GB of physical memory. YARN's default virtual-to-physical memory ratio is 2.1, so the virtual memory limit is 1 GB × 2.1 = 2.1 GB, which is less than the 2.2 GB required.
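The check YARN applies here is simple arithmetic, which can be sketched as follows (an illustration only; the names below are ours, not YARN's actual code):

```scala
// Sketch of YARN's virtual-memory check: a container is killed when its
// virtual memory usage exceeds physical memory * vmem-pmem ratio.
object VmemCheck {
  def vmemLimitGb(physicalGb: Double, ratio: Double): Double = physicalGb * ratio

  def main(args: Array[String]): Unit = {
    val used = 2.2 // GB of virtual memory the container actually used
    val defaultLimit = vmemLimitGb(1.0, 2.1) // default ratio 2.1 -> 2.1 GB limit
    println(s"limit=$defaultLimit GB, used=$used GB, killed=${used > defaultLimit}")
    val raisedLimit = vmemLimitGb(1.0, 2.5)  // raised ratio 2.5 -> 2.5 GB limit
    println(s"limit=$raisedLimit GB, used=$used GB, killed=${used > raisedLimit}")
  }
}
```

With the default ratio the 2.2 GB usage exceeds the 2.1 GB limit and the container is killed; raising the ratio to 2.5 (as below) gives it headroom.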
Solution:
Raise the virtual-to-physical memory ratio by adding a setting to yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.5</value>
</property>
Alternatively, modify Hadoop's yarn-site.xml to disable the memory checks altogether:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
2. Warning: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Cause:
When Spark submits a job to the YARN cluster it must ship the Spark jars to the cluster each time. This is only a warning and can be ignored.
Solution:
Upload the Spark jars to HDFS once and point the configuration at that path.
(1) Upload the jars:
[[email protected] ~]# hadoop fs -mkdir -p /spark/jars
[[email protected] ~]# hadoop fs -ls /
[[email protected] ~]# hadoop fs -put /export/servers/spark-2.3.1-bin-hadoop2.7/jars/* /spark/jars/
(2) In Spark's conf/spark-defaults.conf, add: spark.yarn.jars hdfs://192.168.12.129:9000/spark/jars/*
(3) Re-run; the warning is gone:
[[email protected] ~]# spark-shell --master yarn --deploy-mode client