Preface
Spark has supported running on Kubernetes since version 2.3. This article describes how to run Spark on Alibaba Cloud Container Service for Kubernetes.
Prerequisites
1. An Alibaba Cloud Container Service for Kubernetes cluster has been purchased. Purchase link:
Kubernetes console. The cluster type in this example is Managed Kubernetes.
2. The Spark image has been built. The Dockerfile used to build the Spark image in this example is:
# Base image
FROM registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0
# Maintainer
LABEL maintainer "[email protected]"
# Copy the jar into the target directory
COPY ./spark-examples-0.0.1-SNAPSHOT.jar /opt/spark/examples/jars/
After the build, the image's registry address is: registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0
3. In this example the Spark job is started by submitting a YAML file with the Kubernetes client.
Each step is described in detail below.
Building the Spark image
Building the Spark image requires Docker installed locally; this example shows the installation on macOS. On the Mac, run:
brew cask install docker
After installation, running the docker command should print the following, which indicates a successful install:
Usage: docker [OPTIONS] COMMAND
A self-sufficient runtime for containers
Options:
--config string Location of client config files (default "/Users/bill.zhou/.docker")
-D, --debug Enable debug mode
-H, --host list Daemon socket(s) to connect to
-l, --log-level string Set the logging level ("debug"|"info"|"warn"|"error"|"fatal") (default "info")
--tls Use TLS; implied by --tlsverify
--tlscacert string Trust certs signed only by this CA (default "/Users/bill.zhou/.docker/ca.pem")
--tlscert string Path to TLS certificate file (default "/Users/bill.zhou/.docker/cert.pem")
--tlskey string Path to TLS key file (default "/Users/bill.zhou/.docker/key.pem")
--tlsverify Use TLS and verify the remote
-v, --version Print version information and quit
Building a Docker image requires writing a Dockerfile; the Dockerfile for this example is created as follows.
# Enter the working directory:
cd /Users/bill.zhou/dockertest
# Copy the test jar into this directory:
cp /Users/jars/spark-examples-0.0.1-SNAPSHOT.jar ./
# Create the Dockerfile
vi Dockerfile
Put the following content into the Dockerfile:
# Base image
FROM registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0
# Maintainer
LABEL maintainer "[email protected]"
# Copy the jar into the target directory
COPY ./spark-examples-0.0.1-SNAPSHOT.jar /opt/spark/examples/jars/
This example reuses an existing base image, registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0, and adds its own test jar: spark-examples-0.0.1-SNAPSHOT.jar.
Once the Dockerfile is written, build the image with the following command:
docker build /Users/bill.zhou/dockertest/ -t registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0
After the build, push the image to the registry with the following commands:
# Log in with your Alibaba Cloud account first
docker login --username=[email protected] registry.cn-beijing.aliyuncs.com
# Push the image
docker push registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0
Once the image is ready, you can start using it.
Submitting the job to the Kubernetes cluster
This example submits the YAML to Kubernetes with the kubectl client.
First purchase an ECS instance (in the same VPC as the Kubernetes cluster) and install kubectl on it. For installation instructions, see:
kubectl installation guide.
Once kubectl is installed and the cluster credentials are configured, you can access the cluster. To get the credentials, open the cluster's "Basic Information" page in the console, as shown below:

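Concretely, the credential setup on the ECS instance can be sketched as follows (the paths are kubectl's defaults; the placeholder marks where the KubeConfig text copied from the console goes):

```shell
# Write the cluster credentials to kubectl's default config path.
mkdir -p "$HOME/.kube"
cat > "$HOME/.kube/config" <<'EOF'
# ...paste the KubeConfig content from the cluster "Basic Information" page here...
EOF
# Then verify connectivity to the cluster with:
# kubectl get nodes
```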
Then submit the Spark job with the following steps:
## Install the CRDs
kubectl apply -f manifest/spark-operator-crds.yaml
## Install the operator's service account and RBAC policy
kubectl apply -f manifest/spark-operator-rbac.yaml
## Install the Spark job's service account and RBAC policy
kubectl apply -f manifest/spark-rbac.yaml
## Install spark-on-k8s-operator
kubectl apply -f manifest/spark-operator.yaml
## Submit the spark-pi job
kubectl apply -f examples/spark-pi.yaml
The latest versions of the corresponding files can be downloaded from the
open-source community.
After the job runs, check the logs with the following commands:
# List pods; -n specifies the namespace
kubectl get pod -n spark-operator
# View the pod logs
kubectl logs spark-pi-driver -n spark-operator
Output like the following indicates a successful run:
2019-07-23 11:55:54 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
2019-07-23 11:55:54 INFO DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:17) (first 15 tasks are for partitions Vector(0, 1))
2019-07-23 11:55:54 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2019-07-23 11:55:55 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 172.20.1.9, executor 1, partition 0, PROCESS_LOCAL, 7878 bytes)
2019-07-23 11:55:55 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, 172.20.1.9, executor 1, partition 1, PROCESS_LOCAL, 7878 bytes)
2019-07-23 11:55:57 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 172.20.1.9:36662 (size: 1256.0 B, free: 117.0 MB)
2019-07-23 11:55:57 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 2493 ms on 172.20.1.9 (executor 1) (1/2)
2019-07-23 11:55:57 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 2789 ms on 172.20.1.9 (executor 1) (2/2)
2019-07-23 11:55:57 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-07-23 11:55:58 INFO DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:21) finished in 5.393 s
2019-07-23 11:55:58 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:21, took 6.501405 s
**Pi is roughly 3.142955714778574**
2019-07-23 11:55:58 INFO AbstractConnector:318 - Stopped Spark@49096b06{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-07-23 11:55:58 INFO SparkUI:54 - Stopped Spark web UI at http://spark-test-1563882878789-driver-svc.spark-operator-t01.svc:4040
2019-07-23 11:55:58 INFO KubernetesClusterSchedulerBackend:54 - Shutting down all executors
2019-07-23 11:55:58 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
2019-07-23 11:55:58 WARN ExecutorPodsWatchSnapshotSource:87 - Kubernetes client has been closed (this is expected if the application is shutting down.)
2019-07-23 11:55:59 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-07-23 11:55:59 INFO MemoryStore:54 - MemoryStore cleared
2019-07-23 11:55:59 INFO BlockManager:54 - BlockManager stopped
2019-07-23 11:55:59 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-07-23 11:55:59 INFO OutputCommitCoordinator$
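To check the result without scanning the whole log, you can grep the driver log for the result line. A minimal sketch, where driver.log stands in for the kubectl logs output shown above:

```shell
# Stand-in for: kubectl logs spark-pi-driver -n spark-operator > driver.log
printf '%s\n' 'Pi is roughly 3.142955714778574' > driver.log
# Extract just the result line from the log.
grep -o 'Pi is roughly [0-9.]*' driver.log   # -> Pi is roughly 3.142955714778574
```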
Finally, here is the key content of the spark-pi.yaml file.
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-operator
spec:
  type: Scala
  mode: cluster
  image: "registry.cn-beijing.aliyuncs.com/acs/spark:v2.4.0"  # registry path of the image to run
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi  # entry class to run
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"  # jar containing the entry class; this path is inside the image
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  # Resources for the Spark driver
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  # Resources for the executors
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
YAML is the standard format for submissions to Kubernetes; this file defines the image to run and the resources for the Spark driver and executors.
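To run the custom image built earlier instead of the stock example, only a few spec fields change. A sketch of the delta, assuming a hypothetical entry class com.example.MyApp (substitute the main class of your own jar):

```yaml
spec:
  # Use the image pushed to your own registry earlier.
  image: "registry.cn-beijing.aliyuncs.com/bill_test/image_test:v2.4.0"
  # Hypothetical entry class -- replace with the main class of your jar.
  mainClass: com.example.MyApp
  # Path of the jar inside the image (where the Dockerfile's COPY put it).
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples-0.0.1-SNAPSHOT.jar"
```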
Summary
For an introduction to Kubernetes containers, see:
Container Service for Kubernetes. For more on Spark on Kubernetes, see:
Spark in action on Kubernetes - Playground setup and architecture overview, and Spark in action on Kubernetes - Understanding the Spark Operator.