
Deploying k8s with Juju + MAAS on Ubuntu 20.04, part 13: Charmed Kubeflow (2), installing Kubeflow v1.3

Reference:

Install Kubeflow v1.3

Note: to install locally, you only need to install MicroK8s and enable the Kubeflow add-on.

This guide lists the steps needed to install Kubeflow on any conformant Kubernetes (including AKS, EKS, GKE, OpenShift, and any cluster deployed with kubeadm), provided you can access it with kubectl.


1 Install the Juju client

On Linux, install juju through snap with the following command:

snap install juju --classic
           

Alternatively, on macOS:

brew install juju

Or download the Windows installer.

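2 Connect Juju to your Kubernetes cluster

To operate workloads in a Kubernetes cluster with Juju, you must add the cluster to Juju's list of clouds with the add-k8s command.

If your Kubernetes config file is in the standard location (~/.kube/config on Linux) and you have only one cluster, just run: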

juju add-k8s myk8s
           
Note: first follow the method from the earlier Portainer installation post to obtain the config file and install OpenEBS.

If your kubectl config file contains several clusters, you can select the right one by name:

juju add-k8s myk8s --cluster-name=foo
           

Finally, to use a different config file, set the KUBECONFIG environment variable to point at it. For example:

KUBECONFIG=path/to/file juju add-k8s myk8s
           

For more details, see the Juju documentation.
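The three cases above (default config, a named cluster, a different config file) can be folded into one helper. This is only a sketch: add_cluster is a hypothetical function name, not part of Juju, and it simply builds the same add-k8s invocations shown above.

```shell
#!/bin/sh
# add_cluster is a hypothetical helper, not a Juju command.
#   $1 = cloud name to register in Juju (for example myk8s)
#   $2 = optional cluster name from the kubeconfig (for example foo)
#   $3 = optional path to a non-default kubeconfig file
add_cluster() {
    cloud=$1
    cluster=${2:-}
    kubeconfig=${3:-}

    # Start with the basic command and append --cluster-name only when given.
    set -- add-k8s "$cloud"
    if [ -n "$cluster" ]; then
        set -- "$@" "--cluster-name=$cluster"
    fi

    # Point juju at another kubeconfig for this one call only.
    if [ -n "$kubeconfig" ]; then
        KUBECONFIG=$kubeconfig juju "$@"
    else
        juju "$@"
    fi
}
```

For example, `add_cluster myk8s foo` would run `juju add-k8s myk8s --cluster-name=foo`, and `add_cluster myk8s "" path/to/file` would run the KUBECONFIG variant.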

3 Create a controller

To run workloads on a Kubernetes cluster, Juju uses a controller. You can create one with the bootstrap command:

juju bootstrap myk8s my-controller
           

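This command creates several pods under the my-controller namespace. You can list your controllers with the juju controllers command.

You can read more about controllers in the Juju documentation.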

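4 Create a model

A model in Juju is a blank canvas into which your operators are deployed; it maps 1:1 to a Kubernetes namespace.

You can create a model and give it a name, for example kubeflow, with the add-model command. This also creates a Kubernetes namespace of the same name: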

juju add-model kubeflow
           

You can list your models with the juju models command.

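5 Deploy Kubeflow

Requirements:
The minimum resources for deploying kubeflow are 50 GB of disk space, 14 GB of RAM, and 2 CPUs available on your Linux machine or VM.
If you have fewer resources, deploy kubeflow-lite or kubeflow-edge instead.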
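The requirement above amounts to a simple threshold check. The sketch below encodes it; choose_bundle is a hypothetical helper, and because the guide gives no thresholds for choosing between kubeflow-lite and kubeflow-edge, the sketch assumes kubeflow-lite as the only fallback.

```shell
#!/bin/sh
# choose_bundle DISK_GB RAM_GB CPUS
# Prints "kubeflow" when all three minimums from the guide are met
# (50 GB disk, 14 GB RAM, 2 CPUs); otherwise prints "kubeflow-lite".
# Falling back to kubeflow-lite rather than kubeflow-edge is an assumption.
choose_bundle() {
    disk_gb=$1
    ram_gb=$2
    cpus=$3
    if [ "$disk_gb" -ge 50 ] && [ "$ram_gb" -ge 14 ] && [ "$cpus" -ge 2 ]; then
        echo kubeflow
    else
        echo kubeflow-lite
    fi
}
```

With whole-number inputs that you have measured yourself, `choose_bundle 100 8 4` prints kubeflow-lite because 8 GB of RAM is below the 14 GB minimum.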
           

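Once you have a model, you can simply juju deploy any of the provided Kubeflow bundles into your cluster, prefixing the bundle name with cs:.

For example, for the Kubeflow lite bundle, run: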

juju deploy cs:kubeflow-lite
           

Congratulations, Kubeflow is now installing!

You can watch your Kubeflow deployment with the following command:

watch -c juju status --color
           

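6 Set the URL in the authentication methods

The last step in enabling access to the Kubeflow dashboard is to give the dashboard's public URL to dex-auth and oidc-gatekeeper with the following commands: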

juju config dex-auth public-url=http://<URL>
juju config oidc-gatekeeper public-url=http://<URL>
           

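where <URL> is the hostname on which the Kubeflow dashboard responds. For example, in a typical MicroK8s installation this URL is http://10.64.140.43.nip.io. Note that when you set up DNS, you should use the resolvable address used by istio-ingressgateway.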
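Because both charms must receive the same value, it can help to build the URL once and apply it in one place. The sketch below does that; nip_url and set_public_url are hypothetical helper names, and the IP address is the MicroK8s example from the text, not a value to copy blindly.

```shell
#!/bin/sh
# nip_url IP: print the nip.io URL for an IP address,
# following the http://<IP>.nip.io pattern from the example above.
nip_url() {
    echo "http://$1.nip.io"
}

# set_public_url URL: give the same public URL to both authentication charms.
set_public_url() {
    juju config dex-auth "public-url=$1"
    juju config oidc-gatekeeper "public-url=$1"
}
```

For the MicroK8s example this would be `set_public_url "$(nip_url 10.64.140.43)"`; on another cluster, substitute the address that istio-ingressgateway actually uses.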

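7 Add an RBAC role

At present, for Kubeflow and Istio to be set up correctly when RBAC is enabled, you need to give the istio-ingressgateway operator access to Kubernetes resources. The following command creates the appropriate role: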

kubectl patch role -n kubeflow istio-ingressgateway-operator -p '{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"name":"istio-ingressgateway-operator"},"rules":[{"apiGroups":["*"],"resources":["*"],"verbs":["*"]}]}'
           

Having problems?

If you run into any difficulty following these instructions, please open an issue [here](https://github.com/juju-solutions/bundle-kubeflow/issues).

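Here is what actually happened:

1 The juju client was already installed, so the first step was skipped.

2 Add the cluster

Because the Kubernetes config file is in the standard location (~/.kube/config on Linux) and there is only one cluster, the cluster was added with the following command: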

juju add-k8s myk8s
           

3 Create the controller:

juju bootstrap myk8s my-controller --debug
           

Unfortunately, the following error appeared:

ERROR juju.cmd.juju.commands bootstrap.go:883 failed to bootstrap model: creating controller stack: creating statefulset for controller: timed out waiting for controller pod: pending:  -
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:884 (error details: [{/build/snapcraft-juju-35d6cf/parts/juju/src/cmd/juju/commands/bootstrap.go:983: failed to bootstrap model} {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:667: } {/build/snapcraft-juju-35d6cf/parts/juju/src/environs/bootstrap/bootstrap.go:298: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/k8s.go:493: creating controller stack} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:502: creating statefulset for controller} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:917: } {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1051: timed out waiting for controller pod} {/build/snapcraft-juju-35d6cf/parts/juju/src/caas/kubernetes/provider/bootstrap.go:1008: pending:  - }])
13:53:22 DEBUG juju.cmd.juju.commands bootstrap.go:1634 cleaning up after failed bootstrap
           

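I suspect the cause is that Kubernetes was deployed with the standard Charmed Kubernetes bundle (#679), in which the Kubernetes nodes are configured with cores=4 mem=4G root-disk=16G; the disk is probably too small, hence the error. I therefore redeployed three worker nodes with 100 GB disks.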
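The bootstrap error only says that the controller pod stayed pending, so the disk-size explanation is a guess. A quick look at the cluster can help narrow it down before rebuilding nodes. The following is a sketch using standard kubectl queries; show_pending is a hypothetical helper name, and the checks are general diagnostics rather than anything specific to this failure.

```shell
#!/bin/sh
# show_pending: hypothetical helper that gathers clues about a Pending pod.
show_pending() {
    # Pods that have not been scheduled or started yet.
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending
    # Recent events in time order; scheduling and volume failures appear here.
    kubectl get events --all-namespaces --sort-by=.lastTimestamp
    # Storage classes and volume claims: a pod also stays Pending
    # while a volume claim it needs cannot be bound.
    kubectl get storageclass
    kubectl get pvc --all-namespaces
}
```

Running `show_pending` while the bootstrap is still waiting gives the best chance of seeing the controller pod and the events attached to it.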

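Fix:

1 First deploy three VMs in MAAS with "cores=4 mem=4G root-disk=100G". (I recommend raising the memory to 8G, because Portainer from the earlier post places fairly high memory demands on the cluster.)

2 Following the earlier post in this series (part 9, scaling nodes):

2.1 Pause the node with the 16G disk, kubernetes-worker/0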

juju run-action kubernetes-worker/0 pause --wait
           

2.2 Remove this node.

juju remove-unit  kubernetes-worker/0
           

2.3 Add a unit with a 100G disk

juju add-unit kubernetes-worker
           

2.4 Repeat the steps above two more times for the remaining workers.
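Steps 2.1 to 2.3 are the same three commands for every worker, so the repetition can be scripted. The sketch below assumes the old workers are numbered 0, 1 and 2, which may not match your model; replace_worker and replace_workers are hypothetical helper names.

```shell
#!/bin/sh
# replace_worker N: pause and remove kubernetes-worker/N, then add a new unit.
# Each step runs only if the previous one succeeded.
replace_worker() {
    unit="kubernetes-worker/$1"
    juju run-action "$unit" pause --wait &&
    juju remove-unit "$unit" &&
    juju add-unit kubernetes-worker
}

# replace_workers N...: replace several workers one at a time,
# stopping at the first failure.
replace_workers() {
    for n in "$@"; do
        replace_worker "$n" || return 1
    done
}
```

Under that numbering assumption, `replace_workers 0 1 2` performs the whole of step 2. Check `juju status` for the real unit numbers first.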

3 Bootstrap the controller again:

juju bootstrap myk8s my-controller --debug
           

This time it succeeded.

4 Create the model

juju add-model kubeflow
           

5 Deploy Kubeflow

juju deploy kubeflow --debug
           

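After the deployment has been running for a while, checking the status shows many errors. Don't worry: they occur because the international links are congested and the images cannot be downloaded. The links are usually quiet between 2 and 7 a.m., so I recommend scheduling the installation with the at command. It then goes through fairly smoothly, without having to rerun this command by hand several times: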

juju deploy kubeflow --debug

at 3:00                              # schedule a job for 3:00
at> juju deploy kubeflow --debug
at> <Ctrl+D>                         # save the job

at -c <job number>                   # show the contents of the job
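The same job can be queued without the interactive prompt, because at reads the commands to run from standard input. A minimal sketch, with schedule_deploy as a hypothetical helper name and 3:00 as the default taken from the example above:

```shell
#!/bin/sh
# schedule_deploy [TIME]: queue the Kubeflow deployment with at.
# TIME defaults to 3:00, inside the quiet window mentioned above.
schedule_deploy() {
    when=${1:-3:00}
    # at reads the job's commands from standard input.
    echo "juju deploy kubeflow --debug" | at "$when"
}
```

After `schedule_deploy 3:00`, `atq` lists the queued jobs and `atrm <job number>` removes one.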
           

Checked in the afternoon, the status looked roughly like this:

juju status
Model     Controller     Cloud/Region  Version  SLA          Timestamp
kubeflow  my-controller  myk8s         2.9.14   unsupported  14:00:40+08:00

App                        Version                    Status   Scale  Charm                 Store       Channel  Rev  OS          Address         Message
admission-webhook          res:[email protected]      active       1  admission-webhook     charmstore  stable    10  kubernetes  10.152.183.253
argo-controller            res:[email protected]      waiting      1  argo-controller       charmstore  stable    51  kubernetes
dex-auth                   res:[email protected]      active       1  dex-auth              charmstore  stable    60  kubernetes  10.152.183.133
istio-ingressgateway                                  waiting      1  istio-ingressgateway  charmstore  stable    20  kubernetes                  Waiting for istio-pilot relation data
istio-pilot                res:[email protected]      waiting      1  istio-pilot           charmstore  stable    20  kubernetes  10.152.183.223
jupyter-controller         res:[email protected]      active       1  jupyter-controller    charmstore  stable    56  kubernetes
jupyter-ui                 res:[email protected]      active       1  jupyter-ui            charmstore  stable    10  kubernetes  10.152.183.134
kfp-api                    res:[email protected]      waiting      1  kfp-api               charmstore  stable    12  kubernetes  10.152.183.121
kfp-db                     mariadb/server:10.3        active       1  mariadb-k8s           charmstore  stable    35  kubernetes  10.152.183.137
kfp-persistence            res:[email protected]      waiting      1  kfp-persistence       charmstore  stable     9  kubernetes
kfp-schedwf                res:[email protected]      waiting      1  kfp-schedwf           charmstore  stable     9  kubernetes
kfp-ui                     res:[email protected]      waiting      1  kfp-ui                charmstore  stable    12  kubernetes  10.152.183.153
kfp-viewer                 res:[email protected]      active       1  kfp-viewer            charmstore  stable     9  kubernetes
kfp-viz                    res:[email protected]      waiting      1  kfp-viz               charmstore  stable     8  kubernetes  10.152.183.233
kubeflow-dashboard         res:[email protected]      waiting      1  kubeflow-dashboard    charmstore  stable    56  kubernetes  10.152.183.32
kubeflow-profiles          res:[email protected]  active       1  kubeflow-profiles     charmstore  stable    52  kubernetes  10.152.183.182
kubeflow-volumes           res:[email protected]      active       1  kubeflow-volumes      charmstore  stable     0  kubernetes  10.152.183.164
minio                      res:[email protected]      waiting      1  minio                 charmstore  stable    55  kubernetes  10.152.183.215
mlmd                       res:[email protected]      active       1  mlmd                  charmstore  stable     5  kubernetes  10.152.183.46
oidc-gatekeeper            res:[email protected]      active       1  oidc-gatekeeper       charmstore  stable    54  kubernetes  10.152.183.183
pytorch-operator           res:[email protected]      waiting      1  pytorch-operator      charmstore  stable    53  kubernetes
seldon-controller-manager  res:[email protected]      active       1  seldon-core           charmstore  stable    50  kubernetes  10.152.183.113
tfjob-operator             res:[email protected]      active       1  tfjob-operator        charmstore  stable     1  kubernetes

Unit                          Workload  Agent  Address     Ports                                   Message
admission-webhook/0*          active    idle   10.1.44.32  443/TCP
argo-controller/0*            error     idle   10.1.20.90                                          OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/argo-charmers/argo-controller/[email protected]:c1746aec607fac57e7e5006329b58c7a566f042c5bf0cf3cbae192adc5b06bb5": failed commit on ref "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64": "layer-sha256:2e2462c07d2af70a0af7ef14ba643c28c1d854336996c534e193e69dcd32df64" failed size validation: 3920502 != 24609777: failed precondition
dex-auth/0*                   active    idle   10.1.20.39  5556/TCP
istio-ingressgateway/0*       waiting   idle                                                       Waiting for istio-pilot relation data
istio-pilot/0*                error     idle   10.1.20.62  8080/TCP,15010/TCP,15012/TCP,15017/TCP  OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/istio-charmers/istio-pilot/[email protected]:e3e03b31cebfc4c73d4788b83af3339685673970a5c3bf3167db399d39696ed8": failed commit on ref "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e": "layer-sha256:64d67ae6b2e3b0799483b95c62b8594afffe04a615e5420a552c3ab25766c17e" failed size validation: 3806023 != 29905420: failed precondition
jupyter-controller/0*         active    idle   10.1.20.48
jupyter-ui/0*                 active    idle   10.1.20.50  5000/TCP
kfp-api/0*                    error     idle   10.1.20.93  8888/TCP,8887/TCP                       OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-api/[email protected]:8e608409f50a332e787923dda2ea4eb5c9f0839a4c9ff3f77d535efa03eac9e9": failed commit on ref "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446": "layer-sha256:16cf3fa6cb1190b4dfd82a5319faa13e2eb6e69b7b4828d4d98ba1c0b216e446" failed size validation: 5028131 != 45380216: failed precondition
kfp-db/0*                     active    idle   10.1.20.63  3306/TCP                                ready
kfp-persistence/0*            error     idle   10.1.20.91                                          crash loop backoff: back-off 5m0s restarting failed container=ml-pipeline-persistenceagent pod=kfp-persistence-864dc895d5-xshwz_kubeflow(384f9b66-dcc5-45c4-9f23-83592dfbc228)
kfp-schedwf/0*                error     idle   10.1.20.57                                          OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-schedwf/[email protected]:4ab648890dad76ea51fdfb432d95992136127340b832e3d345207b839c6db23e": failed commit on ref "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c": "layer-sha256:1c3b653ff1c285f8579579c2729c7b84b3e8a14153ed7bc076316f90dda1e41c" failed size validation: 3804543 != 21611777: failed precondition
kfp-ui/0*                     error     idle   10.1.20.92  3000/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-ui/[email protected]:04a4348d6b2ec8142cc0a1dd45f738b719fef7cca5c2585ec5b935d43eab1aa8": failed commit on ref "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f": "layer-sha256:f28e01f8f11f1d6aa71000847f46725c1ad868057963d5c72b6fffedbbdec85f" failed size validation: 4635725 != 28057227: failed precondition
kfp-viewer/0*                 active    idle   10.1.20.65
kfp-viz/0*                    error     idle   10.1.20.68  8888/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kfp-viz/[email protected]:c90a5818043da47448c4230953b265a66877bd143e4bdd991f762cf47e2a16d6": failed commit on ref "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf": "layer-sha256:08d3fb8816994acdeef83d6a1181b92e447d6d3bbcb737c93b16cdd0f28a6fbf" failed size validation: 3811546 != 3978030: failed precondition
kubeflow-dashboard/0*         error     idle   10.1.20.95  8082/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/kubeflow-charmers/kubeflow-dashboard/[email protected]:126c9a9f0b56c9eaa614cc24f1989f9aa2d47e9cfdce70373f5ce0937a7820e2": failed commit on ref "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231": "layer-sha256:ce95b9be2a82bcdc673694e30eaecff34d6144bf4c0ca3116d949ccd6b33e231" failed size validation: 4150633 != 29259154: failed precondition
kubeflow-profiles/0*          active    idle   10.1.20.71  8080/TCP,8081/TCP
kubeflow-volumes/0*           active    idle   10.1.20.72  5000/TCP
minio/0*                      error     idle   10.1.20.77  9000/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/minio-charmers/minio/[email protected]:4707912566436c2c1faeedb8c085a8d40b99cdf4bb0e2414295a8936e573866e": failed commit on ref "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561": "layer-sha256:a9386ba5687108909fb6a6d0155ba5bb2eea96a6d2672a372ee9e743d685d561" failed size validation: 3872649 != 28593534: failed precondition
mlmd/0*                       active    idle   10.1.20.84  8080/TCP
oidc-gatekeeper/0*            active    idle   10.1.20.94  8080/TCP
pytorch-operator/0*           error     idle   10.1.20.87  8443/TCP                                OCI image pull error: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "registry.jujucharms.com/pytorch-charmers/pytorch-operator/[email protected]:08c3373247c853e804d74041366a3b161334d25b953e233776884ffab9012fc4": failed commit on ref "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6": "layer-sha256:d1c6fde2f5dd9deb582e4ed7df95242dae916742dc6b51772ecfb51fa4b7aaa6" failed size validation: 3775605 != 17524427: failed precondition
seldon-controller-manager/0*  active    idle   10.1.20.88  8080/TCP,4443/TCP
tfjob-operator/0*             active    idle   10.1.20.89  8443/TCP
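With this many units, it is easier to extract the failing ones than to read the whole table. The sketch below filters the plain-text output of juju status with awk; error_units is a hypothetical helper name, and it relies only on the column layout visible above (unit name first, workload state second).

```shell
#!/bin/sh
# error_units: read `juju status` text on standard input and print the name
# of every unit whose workload state is "error".
error_units() {
    awk '
        $1 == "Unit" { in_units = 1; next }   # header row of the Unit table
        NF == 0      { in_units = 0 }         # an empty line ends the table
        in_units && $2 == "error" {
            sub(/\*$/, "", $1)                # drop the leader marker
            print $1
        }
    '
}
```

Typical use is `juju status | error_units`, which should print nothing once all image pulls have succeeded.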
           

Queried again after 8 a.m., the result looked like this:

juju status
Model     Controller     Cloud/Region  Version  SLA          Timestamp
kubeflow  my-controller  myk8s         2.9.14   unsupported  09:48:06+08:00

App                        Version                    Status   Scale  Charm                 Store       Channel  Rev  OS          Address         Message
admission-webhook          res:[email protected]      active       1  admission-webhook     charmstore  stable    10  kubernetes  10.152.183.157
argo-controller            res:[email protected]      active       1  argo-controller       charmstore  stable    51  kubernetes
dex-auth                   res:[email protected]      active       1  dex-auth              charmstore  stable    60  kubernetes  10.152.183.6
istio-ingressgateway                                  waiting      1  istio-ingressgateway  charmstore  stable    20  kubernetes                  Waiting for Istio Pilot information
istio-pilot                res:[email protected]      active       1  istio-pilot           charmstore  stable    20  kubernetes  10.152.183.223
jupyter-controller         res:[email protected]      active       1  jupyter-controller    charmstore  stable    56  kubernetes
jupyter-ui                 res:[email protected]      active       1  jupyter-ui            charmstore  stable    10  kubernetes  10.152.183.214
kfp-api                    res:[email protected]      active       1  kfp-api               charmstore  stable    12  kubernetes  10.152.183.174
kfp-db                     mariadb/server:10.3        active       1  mariadb-k8s           charmstore  stable    35  kubernetes  10.152.183.129
kfp-persistence            res:[email protected]      active       1  kfp-persistence       charmstore  stable     9  kubernetes
kfp-schedwf                res:[email protected]      active       1  kfp-schedwf           charmstore  stable     9  kubernetes
kfp-ui                     res:[email protected]      active       1  kfp-ui                charmstore  stable    12  kubernetes  10.152.183.30
kfp-viewer                 res:[email protected]      active       1  kfp-viewer            charmstore  stable     9  kubernetes
kfp-viz                    res:[email protected]      active       1  kfp-viz               charmstore  stable     8  kubernetes  10.152.183.34
kubeflow-dashboard         res:[email protected]      active       1  kubeflow-dashboard    charmstore  stable    56  kubernetes  10.152.183.59
kubeflow-profiles          res:[email protected]  active       1  kubeflow-profiles     charmstore  stable    52  kubernetes  10.152.183.48
kubeflow-volumes           res:[email protected]      active       1  kubeflow-volumes      charmstore  stable     0  kubernetes  10.152.183.209
minio                      res:[email protected]      active       1  minio                 charmstore  stable    55  kubernetes  10.152.183.247
mlmd                       res:[email protected]      active       1  mlmd                  charmstore  stable     5  kubernetes  10.152.183.167
oidc-gatekeeper            res:[email protected]      active       1  oidc-gatekeeper       charmstore  stable    54  kubernetes  10.152.183.4
pytorch-operator           res:[email protected]      active       1  pytorch-operator      charmstore  stable    53  kubernetes
seldon-controller-manager  res:[email protected]      active       1  seldon-core           charmstore  stable    50  kubernetes  10.152.183.215
tfjob-operator             res:[email protected]      active       1  tfjob-operator        charmstore  stable     1  kubernetes

Unit                          Workload  Agent  Address     Ports                                   Message
admission-webhook/0*          active    idle   10.1.29.18  443/TCP
argo-controller/0*            active    idle   10.1.29.67
dex-auth/0*                   active    idle   10.1.29.45  5556/TCP
istio-ingressgateway/0*       waiting   idle                                                       Waiting for Istio Pilot information
istio-pilot/0*                active    idle   10.1.29.68  8080/TCP,15010/TCP,15012/TCP,15017/TCP
jupyter-controller/0*         active    idle   10.1.73.23
jupyter-ui/0*                 active    idle   10.1.29.50  5000/TCP
kfp-api/0*                    active    idle   10.1.29.71  8888/TCP,8887/TCP
kfp-db/0*                     active    idle   10.1.29.66  3306/TCP                                ready
kfp-persistence/0*            active    idle   10.1.29.70
kfp-schedwf/0*                active    idle   10.1.29.47
kfp-ui/0*                     active    idle   10.1.29.69  3000/TCP
kfp-viewer/0*                 active    idle   10.1.29.24
kfp-viz/0*                    active    idle   10.1.29.51  8888/TCP
kubeflow-dashboard/0*         active    idle   10.1.29.72  8082/TCP
kubeflow-profiles/0*          active    idle   10.1.29.56  8080/TCP,8081/TCP
kubeflow-volumes/0*           active    idle   10.1.29.44  5000/TCP
minio/0*                      active    idle   10.1.29.55  9000/TCP
mlmd/0*                       active    idle   10.1.29.41  8080/TCP
oidc-gatekeeper/0*            active    idle   10.1.29.73  8080/TCP
pytorch-operator/0*           active    idle   10.1.29.46  8443/TCP
seldon-controller-manager/0*  active    idle   10.1.29.48  8080/TCP,4443/TCP
tfjob-operator/0*             active    idle   10.1.29.49  8443/TCP
           
