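As cloud-native technology matures and sees wider adoption, more and more cloud-native applications are being deployed to production. These applications are usually deployed in a distributed fashion on the cloud, and each one may consist of multiple components that call one another to provide a complete service, with every component following its own iteration cycle and release plan. The more components there are, the more complex release management becomes; without a good solution or system for managing the release of complex applications, the business is exposed to considerable risk. The open-source community has begun to invest in release management for complex applications. This article compares and analyzes two solutions for application-level release management: Intuit's ArgoCD combined with ArgoRollouts, and Weaveworks' Flux combined with Flagger.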

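Both ArgoCD and Flux (or Flux CD) are primarily responsible for watching a Git repository for changes to application manifests, comparing them with the running state of the applications in the current environment, and automatically pulling the changes and deploying them to the cluster.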

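Both ArgoRollouts and Flagger are primarily responsible for executing more sophisticated release strategies, such as blue-green releases, canary releases, and A/B testing.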

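1. ArgoCD and Flux CD

The main job of both ArgoCD and Flux CD is to watch Git repositories for changes, compare the current running state of an application with its desired state, and then automatically pull the changes and sync them to the cluster. However, the two differ considerably in architecture and feature support. This article compares them in the following respects.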

1.1 ArgoCD

1.1.1 Architecture

ArgoCD consists of three main components:

API Server

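The ArgoCD API server is a gRPC/REST server that exposes the API consumed by the Web UI, the CLI, and other CI/CD systems for invocation or integration. Its responsibilities include:

Application management and status reporting
Invoking application operations such as sync and rollback
Managing Git repository and cluster credentials (stored as k8s Secrets)
Authenticating against external identity providers (for example, for SSO login)
RBAC enforcement
Listening for and forwarding Git webhook events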

Repository Server

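The repository server is an internal service that maintains a local cache of the Git repo holding the application manifests. It accepts the following inputs:

repository URL
revision (commit, tag, branch)
application path, a subpath within the Git repo
template-specific settings, such as parameters, ksonnet environments, and helm values.yaml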

Application Controller

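The Application Controller is a Kubernetes controller that continuously compares the current running state of each application with its desired state (the state described in the Git repo). It automatically detects the OutOfSync state and takes the next action according to the sync policy.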

[Image: Comparison and Analysis of Argo and Flux Capabilities in Cloud-Native GitOps Practice]

1.1.2 Web UI

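Argo UI used to be a standalone project under the Argo organization and has since been merged into the Argo CD project. The Argo CD UI now covers most of what the argocd CLI can do. Through the UI, users can:

Configure and connect remote Git repositories
Configure credentials for connecting to private Git repositories
Add and manage different k8s clusters
Configure projects
Create applications
View the current running state of applications in real time
Release applications
Roll back applications to a previous version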

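1.1.3 Multi-cluster management

ArgoCD supports managing multiple clusters. The credentials for each external cluster are stored as a Secret, which contains a k8s API bearer token associated with the ServiceAccount named argocd-manager in the managed cluster's kube-system namespace, together with the connection details for the managed cluster's API server. Access can be revoked.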
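
As a rough illustration of what such a stored credential looks like, the sketch below follows the declarative cluster Secret format described in the ArgoCD documentation. The Secret name, the argocd namespace, the cluster name and address, and the token and CA placeholders are all assumptions made for illustration.

```yaml
# Sketch of an ArgoCD cluster Secret (declarative setup format).
# All names and placeholder values below are illustrative assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: managed-cluster-secret       # hypothetical Secret name
  namespace: argocd                  # assumes ArgoCD is installed in "argocd"
  labels:
    # This label tells ArgoCD that the Secret describes a cluster.
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: managed-cluster              # hypothetical display name of the cluster
  server: https://managed-cluster.example.com:6443
  config: |
    {
      "bearerToken": "<token of the argocd-manager ServiceAccount>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded CA certificate>"
      }
    }
```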

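1.1.4 Application management

ArgoCD has an explicit concept of an application and covers essentially every operation needed over an application's lifecycle. In addition, ArgoCD can reference a single Git repository to create different applications in different clusters; combined with its application customization capabilities, the same mechanism can also be used to create the same application in different clusters.

Application lifecycle management: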

$ argocd app -h
Available Commands:
  actions        Manage Resource actions
  create         Create an application
  delete         Delete an application
  diff           Perform a diff against the target and live state.
  edit           Edit application
  get            Get application details
  history        Show application deployment history
  list           List applications
  manifests      Print manifests of an application
  patch          Patch application
  patch-resource Patch resource in an application
  rollback       Rollback application to a previous deployed version by History ID
  set            Set application parameters
  sync           Sync an application to its target state
  terminate-op   Terminate running operation of an application
  unset          Unset application parameters
  wait           Wait for an application to reach a synced and healthy state

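Creating applications in different clusters from a single Git repository: the following example references https://github.com/haoshuwei/argocd-samples.git to create applications in three k8s clusters, ack-pre, ack-pro, and gke-pro: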

$ argocd cluster list
SERVER                          NAME     VERSION  STATUS      MESSAGE
https://xxx.xx.xxx.xxx:6443     ack-pro  1.14+    Successful
https://xx.xx.xxx.xxx           gke-pro  1.14+    Successful
https://xxx.xx.xxx.xxx:6443     ack-pre  1.14+    Successful
https://kubernetes.default.svc           1.14+    Successful

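ack-pre: deploys the manifests in the overlays/pre subdirectory of the argocd-samples project, on the latest branch, to the cluster whose API server is https://xx.xx.xxx.xxx:6443, in the argocd-samples namespace, with the sync policy set to automated: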

$ argocd app create --project default --name ack-pre --repo https://github.com/haoshuwei/argocd-samples.git --path overlays/pre --dest-server https://xx.xx.xxx.xxx:6443 --dest-namespace  argocd-samples --revision latest --sync-policy automated

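ack-pro: deploys the manifests in the overlays/pro subdirectory of the argocd-samples project, on the master branch, to the cluster whose API server is https://xx.xx.xxx.xxx:6443, in the argocd-samples namespace, with the sync policy set to automated: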

$ argocd app create --project default --name ack-pro --repo https://github.com/haoshuwei/argocd-samples.git --path overlays/pro --dest-server https://xx.xx.xxx.xxx:6443 --dest-namespace  argocd-samples --revision master --sync-policy automated

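gke-pro: deploys the manifests in the overlays/gke subdirectory of the argocd-samples project, on the master branch, to the cluster whose API server is https://xx.xx.xxx.xxx, in the argocd-samples namespace, with the sync policy set to automated: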

$ argocd app create --project default --name gke-pro --repo https://github.com/haoshuwei/argocd-samples.git --path overlays/gke --dest-server https://xx.xx.xxx.xxx --dest-namespace  argocd-samples --revision master --sync-policy automated
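
The same applications can also be described declaratively as ArgoCD Application resources instead of being created through the CLI. The sketch below is an approximate equivalent of the ack-pre command above; the argocd namespace is an assumption about where ArgoCD is installed, and the masked server address is copied as-is from the command.

```yaml
# Declarative sketch of the "ack-pre" application created with the CLI above.
# Assumes ArgoCD runs in the "argocd" namespace.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ack-pre
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/haoshuwei/argocd-samples.git
    targetRevision: latest          # corresponds to --revision latest
    path: overlays/pre              # corresponds to --path overlays/pre
  destination:
    server: https://xx.xx.xxx.xxx:6443
    namespace: argocd-samples
  syncPolicy:
    automated: {}                   # corresponds to --sync-policy automated
```

Keeping this manifest in Git alongside the application manifests means the application definition itself is versioned and auditable in the same way.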

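1.1.5 Kubernetes manifest tooling support

ArgoCD supports Kustomize applications, Helm Charts, Ksonnet applications, and Jsonnet files, and can support any other tooling you want to use through configurable plugins.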

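1.1.6 Security

ArgoCD ships with a single built-in admin user, and only logged-in users can perform further operations. ArgoCD's position is that the built-in admin user is mainly for managing and configuring cluster-level resources that sit below the App resource, while the change history of App resources should be audited on the Git provider side. That said, ArgoCD also lets other users log in through SSO, for example via OAuth2 against Alibaba Cloud RAM or GitHub.

In addition, ArgoCD introduces the concept of a Project as an abstraction above the App resource. A Project is a logical grouping of applications, designed for multi-team collaboration; every application must belong to exactly one Project. Its main functions are: restricting which Git repositories may be deployed; restricting applications to specific namespaces of specific clusters; restricting which resource kinds may or may not be deployed; and defining project roles to provide role-based access control (RBAC) for applications.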
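
To make the four restrictions concrete, here is a minimal AppProject sketch. The project name team-a, the role name, the argocd namespace, and the specific allowed and denied resource kinds are hypothetical choices for illustration; the repository and destination reuse values from the examples earlier in this article.

```yaml
# Sketch of an ArgoCD AppProject covering the four restrictions listed above.
# "team-a", "read-only" and the chosen resource kinds are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-a
  namespace: argocd                 # assumes ArgoCD is installed in "argocd"
spec:
  # 1. Only this Git repository may be deployed from.
  sourceRepos:
  - https://github.com/haoshuwei/argocd-samples.git
  # 2. Applications may only target this cluster and namespace.
  destinations:
  - server: https://xx.xx.xxx.xxx:6443
    namespace: argocd-samples
  # 3. Resource kinds that may (cluster scope) or may not (namespace scope) be deployed.
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  namespaceResourceBlacklist:
  - group: ''
    kind: ResourceQuota
  # 4. A project role granting read-only access to the project's applications.
  roles:
  - name: read-only
    description: Read-only access to applications in team-a
    policies:
    - p, proj:team-a:read-only, applications, get, team-a/*, allow
```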

1.1.7 Automated application sync

ArgoCD only watches the Git repository for changes; it does not track image tag changes in container registries. There are two sync policies:

automated

none

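1.1.8 Git repository support

Supports the ssh, git, and http/https protocols;

Supports adding self-signed certificates and ssh known hosts

Private repositories can be configured with a private token or an ssh private key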

1.2 Flux CD

1.2.1 Architecture

Flux is an open-source project started by Weaveworks in 2016. It joined the CNCF as a sandbox project in August 2019.

Fluxd

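The Flux daemon. Its main job is to continuously watch the Kubernetes manifests in the user-configured Git repo for changes and sync them to the cluster. It can also detect image updates in container registries, commit the update to the user's Git repo, and then sync it to the cluster. All of these sync actions follow policies set by the user.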

1.2.2 Web UI

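Flux CD currently does not provide a UI.

1.2.3 Multi-cluster management

Flux CD does not support multi-cluster management; the Flux CD components must be deployed in every cluster.

1.2.4 Application management

With Flux CD, you configure flux to connect to your target Git repository and sync applications to the k8s cluster. To deploy applications to different clusters from a single Git repository, you need to deploy flux in each cluster and set a unique git tag for each cluster.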

$ fluxctl -h
Available Commands:
  automate       Turn on automatic deployment for a workload.
  deautomate     Turn off automatic deployment for a workload.
  help           Help about any command
  identity       Display SSH public key
  install        Print and tweak Kubernetes manifests needed to install Flux in a Cluster
  list-images    Show deployed and available images.
  list-workloads List workloads currently running in the cluster.
  lock           Lock a workload, so it cannot be deployed.
  policy         Manage policies for a workload.
  release        Release a new version of a workload.
  save           save workload definitions to local files in cluster-native format
  sync           synchronize the cluster with the git repository, now
  unlock         Unlock a workload, so it can be deployed.
  version        Output the version of fluxctl

1.2.5 Kubernetes manifest tooling support

Flux CD supports deploying Helm Charts through the Helm Operator, and supports Kustomize applications.
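
With the Helm Operator, a chart release is declared as a HelmRelease resource that Flux syncs from Git like any other manifest. The sketch below assumes the helm.fluxcd.io/v1 API of Helm Operator 1.x; the chart repository URL, chart name, version, and values are hypothetical placeholders.

```yaml
# Sketch of a HelmRelease handled by the Flux Helm Operator (assumes the v1 API).
# The chart repository, chart name, version and values are hypothetical.
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: demo-app
  namespace: default
spec:
  releaseName: demo-app
  chart:
    repository: https://charts.example.com/
    name: demo-app
    version: 1.0.0
  # Overrides for the chart's default values.yaml.
  values:
    replicaCount: 2
```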

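1.2.6 Security

Flux CD builds on the multi-tenancy features of the user's k8s cluster and distinguishes two roles: cluster administrators and regular development and test teams. Team members can only modify applications in their own namespaces and have no permissions over cluster-level applications or applications in other namespaces. In addition, each team or namespace can be assigned only one Git repository, and a corresponding Flux instance must be started for it.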

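1.2.7 Automated application sync

In addition to watching the Git repository for changes, FluxCD can watch the docker registry for tag changes on the images used by the running applications. Depending on the configured policy, it decides whether to automatically commit the image tag change to the Git repository and then sync it to the cluster.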
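
In Flux v1 these per-workload policies are typically expressed as annotations on the workload manifest stored in Git. The sketch below assumes the fluxcd.io annotation prefix used by recent Flux v1 releases; the Deployment name, container name, image, and semver range are hypothetical.

```yaml
# Sketch of a workload opted in to automated image updates (Flux v1 annotations).
# The workload name, container name, image and semver range are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: default
  annotations:
    # Let Flux update this workload when a matching new image tag appears.
    fluxcd.io/automated: "true"
    # Only consider tags of container "app" that satisfy this semver range.
    fluxcd.io/tag.app: semver:~1.0
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: app
        image: registry.example.com/team/demo-app:1.0.0
```

When a newer matching tag is pushed, Flux rewrites the image field in Git and commits it, so the Git history remains the record of what was deployed.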

1.2.8 Git repository support

Private repositories can only be configured with an ssh private key

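2. ArgoRollouts and Flagger

Both ArgoRollouts and Flagger are primarily responsible for executing more sophisticated release strategies, such as blue-green releases, canary releases, and A/B testing. At present, deploying ArgoRollouts does not depend on an istio environment, whereas Flagger relies on the traffic management of a service mesh or ingress controller such as istio to carry out canary releases.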

2.1 ArgoRollouts

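2.1.1 Istio, Service Mesh, and App Mesh support

For traffic management, ArgoRollouts only supports istio and ingress; it does not yet support other service mesh options or App Mesh, although adding this support is the community's main focus for the next phase.

An example manifest in which ArgoRollouts manages traffic together with istio: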

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-example
spec:
  ...
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause:
          duration: 600
      canaryService: canary-svc # required
      stableService: stable-svc  # required
      trafficRouting:
        istio:
          virtualService:
            name: rollout-vsvc  # required
            routes:
            - primary # At least one route is required

The istio custom resource VirtualService named rollout-vsvc is defined as follows:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rollout-vsvc
spec:
  gateways:
    - istio-rollout-gateway
  hosts:
    - istio-rollout.dev.argoproj.io
  http:
    - name: primary
      route:
        - destination:
            host: stable-svc
          weight: 100
        - destination:
            host: canary-svc
          weight: 0

During a release, ArgoRollouts dynamically adjusts the weights of stable-svc and canary-svc in rollout-vsvc according to the steps configured in spec.strategy.canary.steps, until the release is complete.
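
For illustration, assuming the rollout has reached the setWeight: 5 step of the example above, the VirtualService rollout-vsvc would be expected to look roughly as follows. Only the two weight values differ from the manifest shown earlier.

```yaml
# Expected shape of rollout-vsvc while the rollout sits at "setWeight: 5".
# Only the weights differ from the VirtualService shown above.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rollout-vsvc
spec:
  gateways:
    - istio-rollout-gateway
  hosts:
    - istio-rollout.dev.argoproj.io
  http:
    - name: primary
      route:
        - destination:
            host: stable-svc
          weight: 95   # stable version keeps the remaining traffic
        - destination:
            host: canary-svc
          weight: 5    # canary receives the weight set by the current step
```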

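2.1.2 Fully automated progressive delivery

Fully automated progressive delivery means that the release process for applications running on k8s is fully automated, with no human involvement. It reduces the time people have to spend watching a release, and it can automatically detect risks during the release (for example RT, success rate, or custom metrics) and roll back.

ArgoRollouts uses three CRDs, AnalysisTemplate, AnalysisRun, and Experiment, to analyze metrics queried from Prometheus and then decide whether the release should proceed.

Example: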

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary: 
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2 # delay starting analysis run
                        # until setWeight: 40%
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local
      steps:
      - setWeight: 20
      - pause: {duration: 600}
      - setWeight: 40
      - pause: {duration: 600}
      - setWeight: 60
      - pause: {duration: 600}
      - setWeight: 80
      - pause: {duration: 600}

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) / 
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

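In the example above, the release strategy shifts an additional 20% of traffic to the new version every 600s. Whether the release proceeds is decided by an AnalysisRun, which evaluates the success-rate metric defined in the AnalysisTemplate. success-rate is computed from data queried from Prometheus. If success-rate falls below 95% on more than three measurements (failureLimit: 3), ArgoRollouts rolls back; otherwise the weight is increased step by step until it reaches 100% and the release is complete.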

2.2 Flagger

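2.2.1 Istio, Service Mesh, and App Mesh support

Flagger relies on the traffic management capabilities of Istio, Linkerd, App Mesh, NGINX, Contour, or Gloo, together with Prometheus metrics collection and analysis, to carry out fully automated, progressive canary releases.

Flagger uses spec.provider to select which traffic management solution is used for the release: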

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # service mesh provider (optional)
  # can be: kubernetes, istio, linkerd, appmesh, nginx, contour, gloo, supergloo
  provider: istio
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo

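Flagger is already well integrated with GKE Istio and EKS App Mesh, and these integrations are available to users.

2.2.2 Fully automated progressive delivery

The canary analysis section of a Flagger resource defines the release strategy, the metrics used to validate the new version, webhooks that extend testing and validation, and alerting settings: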

analysis:
    # schedule interval (default 60s)
    interval:
    # max number of failed metric checks before rollback
    threshold:
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight:
    # canary increment step
    # percentage (0-100)
    stepWeight:
    # total number of iterations
    # used for A/B Testing and Blue/Green
    iterations:
    # canary match conditions
    # used for A/B Testing
    match:
      - # HTTP header
    # key performance indicators
    metrics:
      - # metric check
    # alerting
    alerts:
      - # alert provider
    # external checks
    webhooks:
      - # hook

An example:

analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # validation (optional)
    metrics:
    - name: request-success-rate
      # builtin Prometheus check
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # builtin Prometheus check
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    - name: "database connections"
      # custom Prometheus check
      templateRef:
        name: db-connections
      thresholdRange:
        min: 2
        max: 100
      interval: 1m
    # testing (optional)
    webhooks:
      - name: "conformance test"
        type: pre-rollout
        url: http://flagger-helmtester.test/
        timeout: 5m
        metadata:
          type: "helmv3"
          cmd: "test run podinfo -n test"
      - name: "load test"
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"
    # alerting (optional)
    alerts:
      - name: "dev team Slack"
        severity: error
        providerRef:
          name: dev-slack
          namespace: flagger
      - name: "qa team Discord"
        severity: warn
        providerRef:
          name: qa-discord
      - name: "on-call MS Teams"
        severity: info
        providerRef:
          name: on-call-msteams

The example is broken down piece by piece below.

analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5

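The fields above specify that during the release at most 50% of traffic is shifted to the canary (maxWeight: 50), in 10 steps (maxWeight/stepWeight), with each step adding 5% (stepWeight: 5) at an interval of 1m (interval: 1m). Once the number of failed metric checks reaches 10 (threshold: 10), the release is rolled back.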
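
As a worked example of the timing these settings imply, the Flagger documentation gives the minimum time to validate and promote a canary as interval × (maxWeight / stepWeight), and the time to roll back under continuously failing checks as interval × threshold. Plugging in the values above (the symbols on the left-hand side are notation introduced here, not Flagger field names):

```latex
\[
N_{\text{steps}} = \frac{\texttt{maxWeight}}{\texttt{stepWeight}} = \frac{50}{5} = 10
\]
\[
T_{\text{promote}}^{\min} = \texttt{interval} \times N_{\text{steps}}
  = 1\,\text{min} \times 10 = 10\,\text{min}
\]
\[
T_{\text{rollback}} = \texttt{interval} \times \texttt{threshold}
  = 1\,\text{min} \times 10 = 10\,\text{min}
\]
```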

metrics:
    - name: request-success-rate
      # builtin Prometheus check
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      # builtin Prometheus check
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
      interval: 30s
    - name: "database connections"
      # custom Prometheus check
      templateRef:
        name: db-connections
      thresholdRange:
        min: 2
        max: 100
      interval: 1m

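The fields above define three metric checks: request-success-rate (the request success rate) must not fall below 99%; request-duration (P99 request latency) must not exceed 500ms; and a custom metric check requires database connections to be no fewer than 2 and no more than 100.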
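
The custom check refers to a MetricTemplate named db-connections through templateRef, which is not shown in the example. A minimal sketch of what such a template might look like follows; the Prometheus address and the db_connections metric in the query are hypothetical, and the template is assumed to live in the same namespace as the Canary (test).

```yaml
# Sketch of the MetricTemplate referenced by templateRef "db-connections".
# The Prometheus address and the "db_connections" metric are hypothetical.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: db-connections
  namespace: test
spec:
  provider:
    type: prometheus
    address: http://prometheus.example.com:9090
  # The query must return a single numeric value, which Flagger compares
  # against thresholdRange (min: 2, max: 100 in the example above).
  # {{ namespace }} is filled in by Flagger with the canary's namespace.
  query: |
    sum(db_connections{namespace="{{ namespace }}"})
```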

webhooks:
      - name: "conformance test"
        type: pre-rollout
        url: http://flagger-helmtester.test/
        timeout: 5m
        metadata:
          type: "helmv3"
          cmd: "test run podinfo -n test"
      - name: "load test"
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"

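The fields above define two types of webhook: the webhook named "conformance test", of type pre-rollout, runs before the traffic weight is increased, and the webhook named "load test", of type rollout, runs while the metric checks are being performed.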

alerts:
      - name: "dev team Slack"
        severity: error
        providerRef:
          name: dev-slack
          namespace: flagger
      - name: "qa team Discord"
        severity: warn
        providerRef:
          name: qa-discord
      - name: "on-call MS Teams"
        severity: info
        providerRef:
          name: on-call-msteams

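The fields above define the notification policy for the release. severity is the level of the notification, such as error, warn, or info, and providerRef points to the details of a particular AlertProvider, such as slack, msteams, or dingding.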
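
The providerRef entries point at AlertProvider resources that are defined separately. As a sketch, the dev-slack provider referenced above might be declared as follows; the channel, username, and webhook address are placeholders rather than values from the original example.

```yaml
# Sketch of the AlertProvider referenced as "dev-slack" in namespace "flagger".
# Channel, username and webhook address are placeholder assumptions.
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: dev-slack
  namespace: flagger
spec:
  type: slack
  channel: dev-alerts
  username: flagger
  # Incoming webhook URL of the Slack workspace (placeholder).
  address: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
```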

3. GitOps Engine

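In November 2019, Weaveworks announced a collaboration with Intuit to build Argo Flux, a GitOps solution focused on Kubernetes application delivery. AWS also joined as an active developer, and BlackRock is expected to be the first enterprise to use the solution. This project is GitOps Engine.

The Argo organization is in the process of joining the CNCF as a whole, and Flux and Flagger are already part of the CNCF. From the foundation's perspective, the hope is that the projects can find ways to cooperate. The two solutions are also genuinely similar in many respects, and both companies hope to combine the strength of the two open-source communities to build a better GitOps solution. So far, however, GitOps Engine has not seen much activity.

The first step for GitOps Engine is to consolidate the core capabilities that ArgoCD and Flux already have; in the future the community will also consider combining ArgoRollouts and Flagger. We will keep following its progress.

This article will be updated over time; corrections are welcome.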

References

https://argoproj.github.io/argo-cd/
https://argoproj.github.io/argo-rollouts/
https://github.com/fluxcd/flux
https://docs.fluxcd.io/en/1.18.0/introduction.html
https://github.com/weaveworks/flagger
https://www.weave.works/blog/flux-joins-the-cncf-sandbox
https://docs.flagger.app/
https://www.weave.works/blog/argo-flux-join-forces
