Zero to JupyterHub with Kubernetes @aliyun

Preface

JupyterHub is a platform for managing Jupyter notebook servers that supports multiple users online at the same time.

Goals of JupyterHub

• A cloud provider such as Google Cloud, Microsoft Azure, Amazon EC2, IBM Cloud, Alibaba Cloud…

• Kubernetes to manage resources on the cloud

• Helm to configure and control the packaged JupyterHub installation

• JupyterHub to give users access to a Jupyter computing environment

• A terminal interface on some operating system

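Introduction

The JupyterHub documentation offers no guidance for Alibaba Cloud, so this article fills the gap with from-scratch steps for deploying JupyterHub on Aliyun Container Service.

In addition, it uses CGPU, a GPU-sharing solution, to raise GPU utilization.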

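Procedure

Create the Kubernetes cluster

Create a Container Service cluster
Add GPU nodes
Set the GPU nodes to shared mode; see the article "Trying Out Alibaba Cloud Container Service Kubernetes 1.16: A New Approach to GPUs" (嘗鮮阿裡雲容器服務Kubernetes 1.16，擁抱GPU新姿勢)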

Install JupyterHub

Install Helm

Alibaba Cloud Container Service currently supports Helm v3 by default.

Find the latest release on the

https://github.com/helm/helm/releases/

page, download the helm binary, and add it to a directory on your PATH.
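Because the chart is installed with Helm v3 here, it can help to confirm which client is on the PATH before continuing. The following is a minimal sketch and not part of the original procedure: the function name `helm_is_v3` is hypothetical, and it assumes that `helm version --short` prints a string starting with `v3.` for a v3 client.

```shell
# Hypothetical helper: succeed only if the helm client on PATH reports a v3 version.
helm_is_v3() {
  case "$(helm version --short 2>/dev/null)" in
    v3.*) return 0 ;;   # e.g. v3.x.y+g<commit>
    *)    return 1 ;;   # helm missing, or a different major version
  esac
}

if helm_is_v3; then
  echo "Helm v3 client found"
else
  echo "Helm v3 client not found on PATH" >&2
fi
```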

Generate a random hex string representing 32 bytes to use as a security token. Run this command in a terminal and copy the output:

openssl rand -hex 32
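Since the token is later pasted into config.yaml by hand, a quick format check can catch a truncated copy. The sketch below is illustrative only: the function names `generate_secret_token` and `is_valid_secret_token` are hypothetical, and the check simply assumes that 32 bytes rendered as hex give exactly 64 lowercase hex characters.

```shell
# Hypothetical helper: produce 32 random bytes as a 64-character hex string.
generate_secret_token() {
  openssl rand -hex 32
}

# Hypothetical helper: succeed only if the argument is exactly 64 lowercase hex characters.
is_valid_secret_token() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{64}$'
}

TOKEN="$(generate_secret_token)"
if is_valid_secret_token "$TOKEN"; then
  # Print the line in the form expected under "proxy:" in config.yaml.
  echo "secretToken: \"$TOKEN\""
else
  echo "unexpected token format" >&2
fi
```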

Create the PVC and StorageClass

Create a NAS file system and a mount target for it. Reference:
https://help.aliyun.com/document_detail/144398.html?spm=a2c4g.11186623.6.757.71aa2b8djPh6fE

Save the following as storage.yaml and apply it with `kubectl apply -f storage.yaml`:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-nas-subpath
mountOptions:
- nolock,tcp,noresvport
- vers=3
parameters:
  volumeAs: subpath
  server: "0994fd65-66f5.cn-zhangjiakou.extreme.nas.aliyuncs.com:/share"   # replace with your own NAS mount target; see https://help.aliyun.com/document_detail/144398.html?spm=a2c4g.11186623.6.757.71aa2b8djPh6fE
provisioner: nasplugin.csi.alibabacloud.com
reclaimPolicy: Retain
#---
#kind: PersistentVolumeClaim
#apiVersion: v1
#metadata:
#  name: hub-db-dir
#  namespace: jhub
#spec:
#  accessModes:
#    - ReadWriteMany
#  storageClassName: alicloud-nas-subpath
#  resources:
#    requests:
#      storage: 1Gi

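Modify the original config file
Change the path of each image in config.yaml so that nothing is pulled from Google registries, which would fail.
The chart version used here, 0.8.2, does not work correctly with Kubernetes 1.16, so the install script applies a patch. Patch reference: https://github.com/jupyterhub/kubespawner/issues/354
Change the pod's GPU resource request to use CGPU mode, for example aliyun.com/gpu-mem: 4 (the config.yaml below uses 2). See https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-resources.html

All of these changes have already been made in the config.yaml below. Note that proxy.secretToken must be replaced with the token generated in the first step.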

proxy:
  secretToken: "d8d198d787f22869e67df3ad3ac5f4d99a843c20243f9ed785f77822fb4ce517" ## replace with the token you generated in step 1
prePuller:
  continuous:
    enabled: false
  extraImages: {}
  hook:
    enabled: true
    image:
      name: jupyterhub/k8s-image-awaiter
      tag: 0.8.2
  pause:
    image:
      # replace the default Google image
      name: registry.cn-zhangjiakou.aliyuncs.com/kubernetesmirror/pause 
      tag: "3.1"
hub:
  #image:
  #  name: jupyterhub/k8s-hub
  #  tag: 0.9.0-beta.3
  #  0.8.2  # trying a newer version for compatibility with Kubernetes 1.16; not yet verified, see https://github.com/jupyterhub/kubespawner/issues/354
  db:
    password: null
    pvc:
      accessModes:
      - ReadWriteOnce
      annotations: {}
      selector: {}
      storage: 1Gi
      storageClassName: alicloud-nas-subpath
singleuser:
  storage:
    capacity: 2Gi
    dynamic:
      pvcNameTemplate: claim-{username}{servername}
      storageAccessModes:
      - ReadWriteOnce
      storageClass: alicloud-nas-subpath
      volumeNameTemplate: volume-{username}{servername}
  profileList:
    - display_name: "CGPU Server" ## shared GPU mode
      description: "Spawns a notebook server with access to a CGPU"
      kubespawner_override:
        extra_resource_limits:
          aliyun.com/gpu-mem: 2
    - display_name: "GPU Server"  ## regular (exclusive) GPU mode
      description: "Spawns a notebook server with access to a GPU"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: 1
  image:
    #name: jupyterhub/k8s-singleuser-sample
    #name: tensorflow/tensorflow
    pullPolicy: IfNotPresent
    # Replace with whatever image you need; this example uses a TensorFlow-enabled image, and suitable images can be found on the official sites
    name: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tensorflow-notebook
    tag: 1.15.2

Run the install script

Save the following as install.sh, put the config.yaml saved in step 3 in the same directory, and run the install command:

bash ./install.sh

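# Suggested values: advanced users of Kubernetes and Helm should feel
# free to use different values.
RELEASE=jhub
NAMESPACE=jhub
# Register the JupyterHub chart repository so that jupyterhub/jupyterhub resolves.
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
# Install (or upgrade) chart version 0.8.2 with the values from config.yaml.
helm upgrade --install "$RELEASE" jupyterhub/jupyterhub \
  --namespace "$NAMESPACE" \
  --version=0.8.2 \
  --values config.yaml -v 6
# Pause before patching the hub deployment.
sleep 30
export NAMESPACE=jhub
# Hotfix for kubespawner issue 354: override the hub container command to patch spawner.py.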
kubectl patch deploy -n $NAMESPACE hub --type json --patch '[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["bash", "-c", "\nmkdir -p ~/hotfix\ncp -r /usr/local/lib/python3.6/dist-packages/kubespawner ~/hotfix\nls -R ~/hotfix\npatch ~/hotfix/kubespawner/spawner.py << EOT\n72c72\n<             key=lambda x: x.last_timestamp,\n---\n>             key=lambda x: x.last_timestamp and x.last_timestamp.timestamp() or 0.,\nEOT\n\nPYTHONPATH=$HOME/hotfix jupyterhub --config /srv/jupyterhub_config.py --upgrade-db\n"]}]'

Note: the "kubectl patch deploy..." command runs 30 seconds after the Helm install. Be patient and do not interrupt the script early.
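Patching the deployment triggers a new rollout of the hub pod, so it can be useful to block until that rollout has finished before testing the site. A minimal sketch, not part of the original script: the function name `wait_for_hub` and the 300-second default timeout are assumptions, while `kubectl rollout status` is the standard command for waiting on a Deployment.

```shell
# Hypothetical helper: wait until the hub Deployment has finished rolling out.
#   $1 = namespace (defaults to jhub), $2 = timeout (defaults to 300s, an arbitrary choice)
wait_for_hub() {
  kubectl rollout status deployment/hub -n "${1:-jhub}" --timeout="${2:-300s}"
}
```

Calling `wait_for_hub jhub` after install.sh returns a non-zero status if the patched hub pod does not become ready within the timeout.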

Verification and usage

Verify that the pods and services are all in a healthy state

jumper(⎈ |zjk-gpu:jhub)➜  ~ k get pod -n jhub
NAME                     READY   STATUS    RESTARTS   AGE
hub-86d7754c55-jnsd8     1/1     Running   0          10h
proxy-657b654c85-htl62   1/1     Running   0          10h
jumper(⎈ |zjk-gpu:jhub)➜  ~ k get svc -n jhub
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
hub            ClusterIP      172.21.10.208   <none>        8081/TCP                     10h
proxy-api      ClusterIP      172.21.13.171   <none>        8001/TCP                     10h
proxy-public   LoadBalancer   172.21.15.40    47.92.24.78   80:30889/TCP,443:31823/TCP   10h

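Open the site using the public IP of the proxy-public service. Because no user passwords have been configured, you can log in with any username and password.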
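The public IP can also be read directly from the service instead of being copied from the table. This is a sketch under the assumption that the load balancer reports an `ip` field in the service status; the function name `proxy_public_ip` is hypothetical.

```shell
# Hypothetical helper: print the external IP of the proxy-public LoadBalancer service.
#   $1 = namespace (defaults to jhub)
proxy_public_ip() {
  kubectl get svc proxy-public -n "${1:-jhub}" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}
```

For the service listing above, `echo "http://$(proxy_public_ip jhub)"` should print the address to open in a browser.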

After login, the spawn page lists the profiles configured in config.yaml: one requests a regular GPU and the other requests a shared GPU.

Once a profile is selected, the user's pod is created.

Listing the pods in the jhub namespace shows that a jupyter-${username} pod has been created:

jumper(⎈ |zjk-gpu:jhub)➜  54_cgpu_demo git:(master) ✗ k get pod
NAME                     READY   STATUS    RESTARTS   AGE
hub-5ff8cff85f-nmhfl     1/1     Running   0          46h
jupyter-lilong           1/1     Running   0          17s
proxy-657b654c85-8mn6t   1/1     Running   0          46h

At this point the environment is up and running.

Troubleshooting notes

Problems encountered during setup

Problem 1: NoneType

Error message

[E 2020-05-30 00:24:14.373 JupyterHub base:1011] Preventing implicit spawn for a because last spawn failed: '<' not supported between instances of 'NoneType' and 'NoneType'

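Explanation: this is a known issue; see https://github.com/jupyterhub/kubespawner/issues/354

The "kubectl patch deploy..." command that runs after the Helm chart is installed (see above) is the fix for this problem.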

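Problem 2: Storage

Both the hub itself and the pod created for each user's environment need storage. If a pod is stuck in Pending and kubectl describe pod * reports insufficient storage resources, check the NAS storage settings.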
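When a pod is stuck in Pending for storage reasons, listing the claims that never bound narrows down which one to inspect. The sketch below is illustrative: the function name `unbound_pvcs` is hypothetical, and it assumes the default `kubectl get pvc` column layout, in which the second column is STATUS.

```shell
# Hypothetical helper: print the names of PVCs in a namespace whose status is not "Bound".
#   $1 = namespace (defaults to jhub)
unbound_pvcs() {
  kubectl get pvc -n "${1:-jhub}" --no-headers | awk '$2 != "Bound" {print $1}'
}
```

Each name it prints can then be passed to `kubectl describe pvc <name> -n jhub` to see the provisioning events.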

References

NAS dynamic volumes https://help.aliyun.com/document_detail/144398.html
Installing JupyterHub on Kubernetes https://zero-to-jupyterhub.readthedocs.io/en/latest/
Customizing the user environment https://zero-to-jupyterhub.readthedocs.io/en/latest/customizing/user-environment.html
