kubernetes/k8s CRI analysis: kubelet deleting a pod through CRI. Kubernetes defines three plugin interfaces: the Container Network Interface (CNI), the Container Runtime Interface (CRI), and the Container Storage Interface (CSI). This post analyzes how kubelet calls the CRI to delete a pod.
Related posts:
kubernetes/k8s CRI analysis - Container Runtime Interface analysis
kubernetes/k8s CRI analysis - kubelet creating a pod
kubernetes/k8s CSI analysis - Container Storage Interface analysis
kubernetes/k8s CNI analysis - Container Network Interface analysis
The previous posts first introduced CRI, then analyzed the kubelet CRI-related source code in four parts: kubelet CRI-related startup parameters, CRI-related interfaces/structs, CRI initialization, and kubelet calling CRI to create a pod. If you have not read them, follow the links above.
Here is the CRI architecture diagram from the earlier posts again.
This post analyzes how kubelet calls the CRI to delete a pod.
CRI-related source code in kubelet
The kubelet CRI source code analysis covers the following parts:
(1) kubelet CRI-related startup parameters;
(2) kubelet CRI-related interfaces/structs;
(3) kubelet CRI initialization;
(4) kubelet calling CRI to create a pod;
(5) kubelet calling CRI to delete a pod.
The previous two posts covered the first four parts; this post analyzes part five, kubelet calling CRI to delete a pod.
Based on tag v1.17.4
https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4
5. kubelet calling CRI to delete a pod
The kubelet CRI pod-deletion call flow
The analysis below uses the kubelet dockershim pod-deletion call flow as the example.
kubelet calls dockershim to stop the containers; dockershim in turn calls docker to stop them, and calls CNI to tear down the pod network.
Figure 1: kubelet dockershim pod-deletion call flow
dockershim is the CRI shim built into kubelet. The pod-deletion call flow of remote CRI shims is essentially the same as dockershim's; they just drive different container engines, but the CRI shim likewise calls CNI to tear down the pod network.
Now for the detailed source code analysis.
Start with the KillPod method of kubeGenericRuntimeManager: the CRI calls that delete a pod are triggered from there.
As the code shows, kubelet's logic for deleting a pod is:
(1) first stop all containers belonging to the pod;
(2) then stop the pod sandbox container.
Note: the containers are only stopped here; actually removing them is handled by kubelet's garbage collector.
// pkg/kubelet/kuberuntime/kuberuntime_manager.go
// KillPod kills all the containers of a pod. Pod may be nil, running pod must not be.
// gracePeriodOverride if specified allows the caller to override the pod default grace period.
// only hard kill paths are allowed to specify a gracePeriodOverride in the kubelet in order to not corrupt user data.
// it is useful when doing SIGKILL for hard eviction scenarios, or max grace period during soft eviction scenarios.
func (m *kubeGenericRuntimeManager) KillPod(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) error {
	err := m.killPodWithSyncResult(pod, runningPod, gracePeriodOverride)
	return err.Error()
}

// killPodWithSyncResult kills a runningPod and returns SyncResult.
// Note: The pod passed in could be *nil* when kubelet restarted.
func (m *kubeGenericRuntimeManager) killPodWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
	killContainerResults := m.killContainersWithSyncResult(pod, runningPod, gracePeriodOverride)
	for _, containerResult := range killContainerResults {
		result.AddSyncResult(containerResult)
	}

	// stop sandbox, the sandbox will be removed in GarbageCollect
	killSandboxResult := kubecontainer.NewSyncResult(kubecontainer.KillPodSandbox, runningPod.ID)
	result.AddSyncResult(killSandboxResult)
	// Stop all sandboxes belongs to same pod
	for _, podSandbox := range runningPod.Sandboxes {
		if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
			killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
			klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
		}
	}

	return
}
5.1 m.killContainersWithSyncResult
m.killContainersWithSyncResult stops all containers belonging to the pod.
Main logic: start one goroutine per container, and in each goroutine call m.killContainer to stop that container.
// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainersWithSyncResult kills all pod's containers with sync results.
func (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {
	containerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))
	wg := sync.WaitGroup{}

	wg.Add(len(runningPod.Containers))
	for _, container := range runningPod.Containers {
		go func(container *kubecontainer.Container) {
			defer utilruntime.HandleCrash()
			defer wg.Done()

			killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
			if err := m.killContainer(pod, container.ID, container.Name, "", gracePeriodOverride); err != nil {
				killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
			}
			containerResults <- killContainerResult
		}(container)
	}
	wg.Wait()
	close(containerResults)

	for containerResult := range containerResults {
		syncResults = append(syncResults, containerResult)
	}
	return
}
5.1.1 m.killContainer
The m.killContainer method mainly calls m.runtimeService.StopContainer.
Here runtimeService is a RemoteRuntimeService, which implements the container-runtime interface RuntimeService on the CRI shim client side and holds the client that talks to the CRI shim's runtime server. Calling m.runtimeService.StopContainer is therefore effectively calling the CRI shim server's StopContainer method to stop the container.
// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
	...
	klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

	err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
	if err != nil {
		klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
	} else {
		klog.V(3).Infof("Container %q exited normally", containerID.String())
	}

	m.containerRefManager.ClearRef(containerID)
	return err
}
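The elided part of killContainer above computes gracePeriod before StopContainer is called. A simplified sketch of that resolution, under the assumption that the pod spec's TerminationGracePeriodSeconds is the base value and any caller-supplied gracePeriodOverride wins (the real code also honors DeletionGracePeriodSeconds, pre-stop hook runtime, and a minimum grace period):

```go
package main

import "fmt"

// resolveGracePeriod is a simplified, illustrative sketch of how
// killContainer picks the stop timeout: start from the pod spec's
// TerminationGracePeriodSeconds (assumed default 30s when unset),
// then let a caller-supplied gracePeriodOverride (hard eviction, etc.)
// win over everything else.
func resolveGracePeriod(terminationGracePeriodSeconds, gracePeriodOverride *int64) int64 {
	gracePeriod := int64(30) // assumed default for illustration
	if terminationGracePeriodSeconds != nil {
		gracePeriod = *terminationGracePeriodSeconds
	}
	if gracePeriodOverride != nil {
		gracePeriod = *gracePeriodOverride
	}
	return gracePeriod
}

func main() {
	spec := int64(60)
	override := int64(1)
	fmt.Println(resolveGracePeriod(nil, nil))         // 30
	fmt.Println(resolveGracePeriod(&spec, nil))       // 60
	fmt.Println(resolveGracePeriod(&spec, &override)) // 1
}
```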
m.runtimeService.StopContainer
The m.runtimeService.StopContainer method calls r.runtimeClient.StopContainer, i.e. it uses the CRI shim client to ask the CRI shim server to stop the container.
At this point the kubelet side of the CRI call chain is done; next we follow the stop-container operation into the CRI shim, using kubelet's built-in CRI shim, dockershim, as the example.
// pkg/kubelet/remote/remote_runtime.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
	// Use timeout + default timeout (2 minutes) as timeout to leave extra time
	// for SIGKILL container and request latency.
	t := r.timeout + time.Duration(timeout)*time.Second
	ctx, cancel := getContextWithTimeout(t)
	defer cancel()

	r.logReduction.ClearID(containerID)
	_, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
		ContainerId: containerID,
		Timeout:     timeout,
	})
	if err != nil {
		klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
		return err
	}

	return nil
}
5.1.2 r.runtimeClient.StopContainer
Next, using dockershim as the example, we step into the CRI shim to analyze the stop-container operation.
The earlier kubelet call to r.runtimeClient.StopContainer lands in dockershim's StopContainer method.
// pkg/kubelet/dockershim/docker_container.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {
	err := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)
	if err != nil {
		return nil, err
	}
	return &runtimeapi.StopContainerResponse{}, nil
}
ds.client.StopContainer
ds.client.StopContainer mainly calls d.client.ContainerStop.
// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}
d.client.ContainerStop
d.client.ContainerStop builds the request parameters and sends an HTTP request to docker's stop endpoint to stop the container.
// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
query := url.Values{}
if timeout != nil {
query.Set("t", timetypes.DurationToSecondsString(*timeout))
}
resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
ensureReaderClosed(resp)
return err
}
5.2 m.runtimeService.StopPodSandbox
In m.runtimeService.StopPodSandbox, runtimeService is again the RemoteRuntimeService that implements the CRI shim client side of the RuntimeService interface. Calling m.runtimeService.StopPodSandbox is therefore effectively calling the CRI shim server's StopPodSandbox method to stop the pod sandbox.
At this point the kubelet side of the CRI call chain is done; next we follow the stop-sandbox operation into the CRI shim (again using kubelet's built-in CRI shim, dockershim, as the example).
// pkg/kubelet/remote/remote_runtime.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be forced to termination.
func (r *RemoteRuntimeService) StopPodSandbox(podSandBoxID string) error {
	ctx, cancel := getContextWithTimeout(r.timeout)
	defer cancel()

	_, err := r.runtimeClient.StopPodSandbox(ctx, &runtimeapi.StopPodSandboxRequest{
		PodSandboxId: podSandBoxID,
	})
	if err != nil {
		klog.Errorf("StopPodSandbox %q from runtime service failed: %v", podSandBoxID, err)
		return err
	}

	return nil
}
5.2.1 r.runtimeClient.StopPodSandbox
Next, using dockershim as the example, we step into the CRI shim to analyze stopping the pod sandbox. The kubelet call to r.runtimeClient.StopPodSandbox lands in dockershim's StopPodSandbox method.
Stopping a pod sandbox has two main steps:
(1) call ds.network.TearDownPod to tear down the pod network;
(2) call ds.client.StopContainer to stop the pod sandbox container.
Note that stopping the pod sandbox only counts as successful when both steps succeed, and there is no required order between them.
// pkg/kubelet/dockershim/docker_sandbox.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
	var namespace, name string
	var hostNetwork bool

	podSandboxID := r.PodSandboxId
	resp := &runtimeapi.StopPodSandboxResponse{}

	// Try to retrieve minimal sandbox information from docker daemon or sandbox checkpoint.
	inspectResult, metadata, statusErr := ds.getPodSandboxDetails(podSandboxID)
	if statusErr == nil {
		namespace = metadata.Namespace
		name = metadata.Name
		hostNetwork = (networkNamespaceMode(inspectResult) == runtimeapi.NamespaceMode_NODE)
	} else {
		checkpoint := NewPodSandboxCheckpoint("", "", &CheckpointData{})
		checkpointErr := ds.checkpointManager.GetCheckpoint(podSandboxID, checkpoint)

		// Proceed if both sandbox container and checkpoint could not be found. This means that following
		// actions will only have sandbox ID and not have pod namespace and name information.
		// Return error if encounter any unexpected error.
		if checkpointErr != nil {
			if checkpointErr != errors.ErrCheckpointNotFound {
				err := ds.checkpointManager.RemoveCheckpoint(podSandboxID)
				if err != nil {
					klog.Errorf("Failed to delete corrupt checkpoint for sandbox %q: %v", podSandboxID, err)
				}
			}
			if libdocker.IsContainerNotFoundError(statusErr) {
				klog.Warningf("Both sandbox container and checkpoint for id %q could not be found. "+
					"Proceed without further sandbox information.", podSandboxID)
			} else {
				return nil, utilerrors.NewAggregate([]error{
					fmt.Errorf("failed to get checkpoint for sandbox %q: %v", podSandboxID, checkpointErr),
					fmt.Errorf("failed to get sandbox status: %v", statusErr)})
			}
		} else {
			_, name, namespace, _, hostNetwork = checkpoint.GetData()
		}
	}

	// WARNING: The following operations made the following assumption:
	// 1. kubelet will retry on any error returned by StopPodSandbox.
	// 2. tearing down network and stopping sandbox container can succeed in any sequence.
	// This depends on the implementation detail of network plugin and proper error handling.
	// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
	// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
	// since it is stopped. With empty network namespcae, CNI bridge plugin will conduct best
	// effort clean up and will not return error.
	errList := []error{}
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}

	if len(errList) == 0 {
		return resp, nil
	}

	// TODO: Stop all running containers in the sandbox.
	return nil, utilerrors.NewAggregate(errList)
}
ds.client.StopContainer
Here ds.client.StopContainer is called to stop the pod sandbox container; as before, it mainly calls d.client.ContainerStop.
// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}
d.client.ContainerStop builds the request parameters and sends an HTTP request to docker's stop endpoint to stop the pod sandbox container.
// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
	query := url.Values{}
	if timeout != nil {
		query.Set("t", timetypes.DurationToSecondsString(*timeout))
	}
	resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
	ensureReaderClosed(resp)
	return err
}
Summary
CRI architecture diagram
Below the CRI there are two kinds of container runtime implementations:
(1) the dockershim built into kubelet, which supports the Docker container engine and CNI network plugins (including kubenet). The dockershim code is compiled into kubelet and is called by kubelet; dockershim starts its own server to act as the CRI shim, exposing a gRPC server to kubelet;
(2) external container runtimes, which support engines such as rkt and containerd.
kubelet CRI pod-deletion flow
kubelet's logic for deleting a pod is:
(1) first stop all containers belonging to the pod;
(2) then stop the pod sandbox container (including tearing down the pod network).