Reference documentation:
Install Kubeflow v1.3
Note: to install locally, you only need to install MicroK8s and enable the Kubeflow add-on.
This guide lists the steps required to install Kubeflow on any conformant Kubernetes cluster (including AKS, EKS, GKE, OpenShift, and any cluster deployed with kubeadm), provided you can access it with kubectl.
1 Install the juju client
On Linux, install juju via snap with the following command:
snap install juju --classic
Alternatively, on macOS:
brew install juju
Or download the Windows installer.
2 Connect Juju to your Kubernetes cluster
To operate workloads in your Kubernetes cluster with Juju, you must add the cluster to juju's list of known clouds with the add-k8s command.
If your Kubernetes config file is in the standard location (~/.kube/config on Linux) and you have only one cluster, simply run:
juju add-k8s myk8s
Note: obtain the config file and install OpenEBS by following the Portainer installation steps described earlier.
If your kubectl config file contains multiple clusters, you can specify the appropriate one by name:
juju add-k8s myk8s --cluster-name=foo
Finally, to use a different config file, set the KUBECONFIG environment variable to point to the relevant file. For example:
KUBECONFIG=path/to/file juju add-k8s myk8s
See the Juju documentation for more details.
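If you are unsure which names `--cluster-name` accepts, the clusters section of the kubeconfig can be listed with a small shell sketch. The sample file below is hypothetical, purely for illustration; against a live config, `kubectl config get-clusters` gives the same answer.

```shell
# Build a tiny sample kubeconfig (illustrative only).
cat > /tmp/sample-kubeconfig <<'EOF'
apiVersion: v1
kind: Config
clusters:
- name: foo
  cluster:
    server: https://10.0.0.1:6443
- name: bar
  cluster:
    server: https://10.0.0.2:6443
EOF

# Print every "- name:" entry inside the clusters: section.
awk '/^clusters:/{in_c=1; next} /^[a-z]/{in_c=0} in_c && /- name:/{print $3}' /tmp/sample-kubeconfig
```

Each printed name is a valid value for `juju add-k8s myk8s --cluster-name=<name>`.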
3 Create a controller
Juju uses a controller to run workloads on a Kubernetes cluster. You can create one with the bootstrap command:
juju bootstrap myk8s my-controller
此指令将在 my-controller 命名空間下建立幾個 pod。您可以使用
juju controllers
command.
You can read more about controllers in the Juju documentation.
4 Create a model
A model in Juju is a blank canvas into which your operators are deployed, and it maps 1:1 to a Kubernetes namespace.
You can create a model and give it a name, e.g. kubeflow, with the add-model command; this also creates a Kubernetes namespace of the same name:
juju add-model kubeflow
You can list your models with the
juju models
command.
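Juju marks the currently selected model with a trailing asterisk in the `juju models` listing; it can be picked out with awk. The captured output below is an illustrative sample, not live cluster output:

```shell
# Sample `juju models` output (illustrative).
models_output='
Model       Cloud/Region  Type        Status     Machines
controller  myk8s         kubernetes  available  1
kubeflow*   myk8s         kubernetes  available  0
'
# Print the model whose name ends with "*", stripping the marker.
printf '%s\n' "$models_output" | awk '$1 ~ /\*$/ {sub(/\*$/, "", $1); print $1}'
```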
5 Deploy Kubeflow
Requirements:
The minimum resources needed to deploy kubeflow are 50 GB of disk space, 14 GB of RAM, and 2 CPUs available to a Linux machine or VM.
If you have fewer resources, deploy kubeflow-lite or kubeflow-edge instead.
Once you have a model, you can simply juju deploy any of the provided Kubeflow bundles into your cluster, prefixing the bundle name with cs:.
For example, for the Kubeflow lite bundle, run:
juju deploy cs:kubeflow-lite
Congratulations, Kubeflow is installing!
You can watch your Kubeflow deployment with the following command:
watch -c juju status --color
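Rather than eyeballing the watch output, units still in error or waiting can be filtered out of `juju status` output. The sketch below runs against a captured sample rather than a live cluster; on a real cluster you would pipe `juju status` into the same awk filter:

```shell
# Sample unit section of `juju status` (illustrative).
status_output='
Unit                Workload  Agent  Address     Ports
argo-controller/0*  error     idle   10.1.20.90
minio/0*            active    idle   10.1.20.77  9000/TCP
'
# Print every unit whose workload status is error or waiting.
printf '%s\n' "$status_output" | awk '$2=="error" || $2=="waiting" {print $1}'
```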
6 Set the URL in the authentication methods
The final step in enabling access to the Kubeflow dashboard is to give the dashboard's public URL to dex-auth and oidc-gatekeeper with the following commands:
juju config dex-auth public-url=http://<URL>
juju config oidc-gatekeeper public-url=http://<URL>
where <URL> is the hostname at which the Kubeflow dashboard responds. For example, on a typical MicroK8s install this URL is http://10.64.140.43.nip.io. Note that when you set up DNS, you should use the resolvable address used by istio-ingressgateway.
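As a sketch of how such a URL is typically formed from the ingress IP (the IP value below and the kubectl query in the comment are assumptions for illustration, not taken from this cluster):

```shell
# Placeholder ingress IP; on a live cluster it could be obtained with e.g.
#   kubectl get svc -n kubeflow istio-ingressgateway \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
INGRESS_IP=10.64.140.43
PUBLIC_URL="http://${INGRESS_IP}.nip.io"
echo "$PUBLIC_URL"
# Then:
#   juju config dex-auth public-url=$PUBLIC_URL
#   juju config oidc-gatekeeper public-url=$PUBLIC_URL
```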
7 Add RBAC roles
Currently, to set up Kubeflow and Istio correctly with RBAC enabled, you need to grant the istio-ingressgateway operator access to Kubernetes resources. The following command creates the appropriate role:
kubectl patch role -n kubeflow istio-ingressgateway-operator -p '{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"name":"istio-ingressgateway-operator"},"rules":[{"apiGroups":["*"],"resources":["*"],"verbs":["*"]}]}'
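Since this patch grants the role wildcard access to all resources, it may be worth pretty-printing the JSON before applying it; a small sketch, assuming python3 is available:

```shell
# The same patch payload as in the kubectl command above.
patch='{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"name":"istio-ingressgateway-operator"},"rules":[{"apiGroups":["*"],"resources":["*"],"verbs":["*"]}]}'
# Pretty-print it for review before running kubectl patch.
printf '%s' "$patch" | python3 -m json.tool
```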
Having problems?
If you run into any difficulty following these instructions, please create an issue
[here](https://github.com/juju-solutions/bundle-kubeflow/issues).
Here is how it actually went:
1 Since the juju client was already installed, the first step was skipped.
2 Add the cluster
Since the Kubernetes config file was in the standard location (~/.kube/config on Linux) and there was only one cluster, the cluster was added with the following command:
juju add-k8s myk8s
3 Create the controller:
juju bootstrap myk8s my-controller --debug
Unfortunately, the following error occurred:
ERROR juju.cmd.juju.commands bootstrap.go:883 failed to bootstrap model: creating controller stack: creating statefulset for controller: timed out waiting for controller pod: pending: -
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:884 (error details: [{/build/snapcraft-juju-35d6cf/parts/juju/src/cmd/juju/commands/bootstrap.go:983: failed to bootstrap model} {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:667: } {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:298: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/k8s.go:493: creating controller stack} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:502: creating statefulset for controller} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:917: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1051: timed out waiting for controller pod} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1008: pending: - }])
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:1634 cleaning up after failed bootstrap
The suspected cause: Kubernetes had been deployed with the standard Charmed Kubernetes #679 bundle, whose kubernetes-worker machines are configured with cores=4 mem=4G root-disk=16G, and the 16 GB disk was too small. The three worker nodes were therefore redeployed with 100 GB disks.
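When bootstrap hangs on "pending", describing the controller pod (kubectl describe pod -n controller-my-controller) usually reveals the scheduling reason. The event text below is a hypothetical sample of what an undersized node can look like, filtered down to the reason:

```shell
# Hypothetical `kubectl describe pod` event text (illustrative only).
events='
Events:
  Type     Reason            Message
  Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient ephemeral-storage.
'
# Pull out the resource the scheduler says is lacking.
printf '%s\n' "$events" | grep -o 'Insufficient [a-z-]*'
```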
Fix:
1 First deploy three VMs on MAAS with "cores=4 mem=4G root-disk=100G" # 8 GB of RAM is recommended, since the Portainer setup from the earlier article is fairly memory-hungry.
2 Following the earlier article "Deploying k8s with juju+maas on Ubuntu 20.04, part 9: scaling nodes":
2.1 Pause the node with the 16 GB disk, kubernetes-worker/0:
juju run-action kubernetes-worker/0 pause --wait
2.2 Remove the node:
juju remove-unit kubernetes-worker/0
2.3 Add a 100 GB unit:
juju add-unit kubernetes-worker
2.4 Repeat the steps above twice more.
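The replacement cycle in steps 2.1-2.4 can be sketched as a loop. This is a dry-run sketch: the juju commands are echoed rather than executed, and the actual unit numbers should be taken from `juju status` (they are not guaranteed to be 0, 1, 2):

```shell
# Echo the pause/remove/add cycle for each old 16 GB worker.
for unit in kubernetes-worker/0 kubernetes-worker/1 kubernetes-worker/2; do
  echo juju run-action "$unit" pause --wait
  echo juju remove-unit "$unit"
  echo juju add-unit kubernetes-worker
done
```

Dropping the `echo`s would run the cycle for real; doing it one unit at a time, waiting for the new worker to come up before removing the next, is the safer order.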
3 Redeploy the controller:
juju bootstrap myk8s my-controller --debug
Success.
4 Create the model
juju add-model kubeflow
5 Deploy Kubeflow
juju deploy kubeflow --debug
After the deploy has been running for a while, checking the status will show many errors. Don't worry: they are caused by images failing to download over the busy international link. The link is usually quiet between about 2 and 7 AM, so it is advisable to schedule the installation with the at command; the install then goes much more smoothly and you avoid running
juju deploy kubeflow --debug
over and over.
at 3:00 # schedule for 3 AM
>juju deploy kubeflow --debug
ctrl+d # save the job
at -c <job number> # inspect the job
Checking in the afternoon, the status looked roughly like this:
juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow my-controller myk8s 2.9.14 unsupported 14:00:40+08:00
App Version Status Scale Charm Store Channel Rev OS Address Message
admission-webhook res:[email protected] active 1 admission-webhook charmstore stable 10 kubernetes 10.152.183.253
argo-controller res:[email protected] waiting 1 argo-controller charmstore stable 51 kubernetes
dex-auth res:[email protected] active 1 dex-auth charmstore stable 60 kubernetes 10.152.183.133
istio-ingressgateway waiting 1 istio-ingressgateway charmstore stable 20 kubernetes Waiting for istio-pilot relation data
istio-pilot res:[email protected] waiting 1 istio-pilot charmstore stable 20 kubernetes 10.152.183.223
jupyter-controller res:[email protected] active 1 jupyter-controller charmstore stable 56 kubernetes
jupyter-ui res:[email protected] active 1 jupyter-ui charmstore stable 10 kubernetes 10.152.183.134
kfp-api res:[email protected] waiting 1 kfp-api charmstore stable 12 kubernetes 10.152.183.121
kfp-db mariadb/server:10.3 active 1 mariadb-k8s charmstore stable 35 kubernetes 10.152.183.137
kfp-persistence res:[email protected] waiting 1 kfp-persistence charmstore stable 9 kubernetes
kfp-schedwf res:[email protected] waiting 1 kfp-schedwf charmstore stable 9 kubernetes
kfp-ui res:[email protected] waiting 1 kfp-ui charmstore stable 12 kubernetes 10.152.183.153
kfp-viewer res:[email protected] active 1 kfp-viewer charmstore stable 9 kubernetes
kfp-viz res:[email protected] waiting 1 kfp-viz charmstore stable 8 kubernetes 10.152.183.233
kubeflow-dashboard res:[email protected] waiting 1 kubeflow-dashboard charmstore stable 56 kubernetes 10.152.183.32
kubeflow-profiles res:[email protected] active 1 kubeflow-profiles charmstore stable 52 kubernetes 10.152.183.182
kubeflow-volumes res:[email protected] active 1 kubeflow-volumes charmstore stable 0 kubernetes 10.152.183.164
minio res:[email protected] waiting 1 minio charmstore stable 55 kubernetes 10.152.183.215
mlmd res:[email protected] active 1 mlmd charmstore stable 5 kubernetes 10.152.183.46
oidc-gatekeeper res:[email protected] active 1 oidc-gatekeeper charmstore stable 54 kubernetes 10.152.183.183
pytorch-operator res:[email protected] waiting 1 pytorch-operator charmstore stable 53 kubernetes
seldon-controller-manager res:[email protected] active 1 seldon-core charmstore stable 50 kubernetes 10.152.183.113
tfjob-operator res:[email protected] active 1 tfjob-operator charmstore stable 1 kubernetes
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.44.32 443/TCP
argo-controller/0* error idle 10.1.20.90 OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/argo-charmers/argo-controller/[email protected]:c1746aec607fac57e7e5006329b58c7a566f042c5bf0cf3cbae192adc5b06bb5": failed commit on ref "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64": "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64" failed size validation: 3920502 != 24609777: failed precondition
dex-auth/0* active idle 10.1.20.39 5556/TCP
istio-ingressgateway/0* waiting idle Waiting for istio-pilot relation data
istio-pilot/0* error idle 10.1.20.62 8080/TCP,15010/TCP,15012/TCP,15017/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/istio-charmers/istio-pilot/[email protected]:e3e03b31cebfc4c73d4788b83af3339685673970a5c3bf3167db399d39696ed8": failed commit on ref "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e": "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e" failed size validation: 3806023 != 29905420: failed precondition
jupyter-controller/0* active idle 10.1.20.48
jupyter-ui/0* active idle 10.1.20.50 5000/TCP
kfp-api/0* error idle 10.1.20.93 8888/TCP,8887/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-api/[email protected]:8e608409f50a332e787923dda2ea4eb5c9f0839a4c9ff3f77d535efa03eac9e9": failed commit on ref "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446": "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446" failed size validation: 5028131 != 45380216: failed precondition
kfp-db/0* active idle 10.1.20.63 3306/TCP ready
kfp-persistence/0* error idle 10.1.20.91 crash loop backoff: back-off 5m0s restarting failed container=ml-pipeline-persistenceagent pod=kfp-persistence-864dc895d5-xshwz_kubeflow(384f9b66-dcc5-45c4-9f23-83592dfbc228)
kfp-schedwf/0* error idle 10.1.20.57 OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-schedwf/[email protected]:4ab648890dad76ea51fdfb432d95992136127340b832e3d345207b839c6db23e": failed commit on ref "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c": "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c" failed size validation: 3804543 != 21611777: failed precondition
kfp-ui/0* error idle 10.1.20.92 3000/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-ui/[email protected]:04a4348d6b2ec8142cc0a1dd45f738b719fef7cca5c2585ec5b935d43eab1aa8": failed commit on ref "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f": "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f" failed size validation: 4635725 != 28057227: failed precondition
kfp-viewer/0* active idle 10.1.20.65
kfp-viz/0* error idle 10.1.20.68 8888/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-viz/[email protected]:c90a5818043da47448c4230953b265a66877bd143e4bdd991f762cf47e2a16d6": failed commit on ref "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf": "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf" failed size validation: 3811546 != 3978030: failed precondition
kubeflow-dashboard/0* error idle 10.1.20.95 8082/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kubeflow-dashboard/[email protected]:126c9a9f0b56c9eaa614cc24f1989f9aa2d47e9cfdce70373f5ce0937a7820e2": failed commit on ref "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231": "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231" failed size validation: 4150633 != 29259154: failed precondition
kubeflow-profiles/0* active idle 10.1.20.71 8080/TCP,8081/TCP
kubeflow-volumes/0* active idle 10.1.20.72 5000/TCP
minio/0* error idle 10.1.20.77 9000/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/minio-charmers/minio/[email protected]:4707912566436c2c1faeedb8c085a8d40b99cdf4bb0e2414295a8936e573866e": failed commit on ref "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561": "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561" failed size validation: 3872649 != 28593534: failed precondition
mlmd/0* active idle 10.1.20.84 8080/TCP
oidc-gatekeeper/0* active idle 10.1.20.94 8080/TCP
pytorch-operator/0* error idle 10.1.20.87 8443/TCP OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/pytorch-charmers/pytorch-operator/[email protected]:08c3373247c853e804d74041366a3b161334d25b953e233776884ffab9012fc4": failed commit on ref "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6": "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6" failed size validation: 3775605 != 17524427: failed precondition
seldon-controller-manager/0* active idle 10.1.20.88 8080/TCP,4443/TCP
tfjob-operator/0* active idle 10.1.20.89 8443/TCP
Checking again after 8 AM gave roughly this result:
juju status
Model Controller Cloud/Region Version SLA Timestamp
kubeflow my-controller myk8s 2.9.14 unsupported 09:48:06+08:00
App Version Status Scale Charm Store Channel Rev OS Address Message
admission-webhook res:[email protected] active 1 admission-webhook charmstore stable 10 kubernetes 10.152.183.157
argo-controller res:[email protected] active 1 argo-controller charmstore stable 51 kubernetes
dex-auth res:[email protected] active 1 dex-auth charmstore stable 60 kubernetes 10.152.183.6
istio-ingressgateway waiting 1 istio-ingressgateway charmstore stable 20 kubernetes Waiting for Istio Pilot information
istio-pilot res:[email protected] active 1 istio-pilot charmstore stable 20 kubernetes 10.152.183.223
jupyter-controller res:[email protected] active 1 jupyter-controller charmstore stable 56 kubernetes
jupyter-ui res:[email protected] active 1 jupyter-ui charmstore stable 10 kubernetes 10.152.183.214
kfp-api res:[email protected] active 1 kfp-api charmstore stable 12 kubernetes 10.152.183.174
kfp-db mariadb/server:10.3 active 1 mariadb-k8s charmstore stable 35 kubernetes 10.152.183.129
kfp-persistence res:[email protected] active 1 kfp-persistence charmstore stable 9 kubernetes
kfp-schedwf res:[email protected] active 1 kfp-schedwf charmstore stable 9 kubernetes
kfp-ui res:[email protected] active 1 kfp-ui charmstore stable 12 kubernetes 10.152.183.30
kfp-viewer res:[email protected] active 1 kfp-viewer charmstore stable 9 kubernetes
kfp-viz res:[email protected] active 1 kfp-viz charmstore stable 8 kubernetes 10.152.183.34
kubeflow-dashboard res:[email protected] active 1 kubeflow-dashboard charmstore stable 56 kubernetes 10.152.183.59
kubeflow-profiles res:[email protected] active 1 kubeflow-profiles charmstore stable 52 kubernetes 10.152.183.48
kubeflow-volumes res:[email protected] active 1 kubeflow-volumes charmstore stable 0 kubernetes 10.152.183.209
minio res:[email protected] active 1 minio charmstore stable 55 kubernetes 10.152.183.247
mlmd res:[email protected] active 1 mlmd charmstore stable 5 kubernetes 10.152.183.167
oidc-gatekeeper res:[email protected] active 1 oidc-gatekeeper charmstore stable 54 kubernetes 10.152.183.4
pytorch-operator res:[email protected] active 1 pytorch-operator charmstore stable 53 kubernetes
seldon-controller-manager res:[email protected] active 1 seldon-core charmstore stable 50 kubernetes 10.152.183.215
tfjob-operator res:[email protected] active 1 tfjob-operator charmstore stable 1 kubernetes
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.1.29.18 443/TCP
argo-controller/0* active idle 10.1.29.67
dex-auth/0* active idle 10.1.29.45 5556/TCP
istio-ingressgateway/0* waiting idle Waiting for Istio Pilot information
istio-pilot/0* active idle 10.1.29.68 8080/TCP,15010/TCP,15012/TCP,15017/TCP
jupyter-controller/0* active idle 10.1.73.23
jupyter-ui/0* active idle 10.1.29.50 5000/TCP
kfp-api/0* active idle 10.1.29.71 8888/TCP,8887/TCP
kfp-db/0* active idle 10.1.29.66 3306/TCP ready
kfp-persistence/0* active idle 10.1.29.70
kfp-schedwf/0* active idle 10.1.29.47
kfp-ui/0* active idle 10.1.29.69 3000/TCP
kfp-viewer/0* active idle 10.1.29.24
kfp-viz/0* active idle 10.1.29.51 8888/TCP
kubeflow-dashboard/0* active idle 10.1.29.72 8082/TCP
kubeflow-profiles/0* active idle 10.1.29.56 8080/TCP,8081/TCP
kubeflow-volumes/0* active idle 10.1.29.44 5000/TCP
minio/0* active idle 10.1.29.55 9000/TCP
mlmd/0* active idle 10.1.29.41 8080/TCP
oidc-gatekeeper/0* active idle 10.1.29.73 8080/TCP
pytorch-operator/0* active idle 10.1.29.46 8443/TCP
seldon-controller-manager/0* active idle 10.1.29.48 8080/TCP,4443/TCP
tfjob-operator/0* active idle 10.1.29.49 8443/TCP