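Kubernetes/k8s CRI Analysis: How kubelet Deletes a Pod

Related posts: "Kubernetes/k8s CRI Analysis: The Container Runtime Interface"

"Kubernetes/k8s CRI Analysis: How kubelet Creates a Pod"

The earlier posts introduced CRI and then analyzed the CRI-related kubelet source code in four parts: the CRI-related startup parameters of the kubelet component, the CRI-related interfaces and structs, CRI initialization, and how kubelet calls CRI to create a pod. If you have not read them yet, see the posts listed above.

The CRI architecture diagram from the earlier posts is shown here again.

This post analyzes how kubelet calls CRI to delete a pod.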

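Analysis of the CRI-related source code in kubelet

The analysis of kubelet's CRI source code consists of the following parts:

(1) kubelet CRI-related startup parameters;

(2) kubelet CRI-related interfaces and structs;

(3) kubelet CRI initialization;

(4) how kubelet calls CRI to create a pod;

(5) how kubelet calls CRI to delete a pod.

The previous two posts covered the first four parts; this post covers part (5).

Based on tag v1.17.4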

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

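5. How kubelet calls CRI to delete a pod

Call flow for deleting a pod through kubelet CRI

The call flow for deleting a pod through kubelet and dockershim is used as the example below.

kubelet stops containers by calling dockershim; dockershim in turn calls Docker to stop the containers and calls CNI to tear down the pod network.

Figure 1: call diagram for deleting a pod through kubelet and dockershim

dockershim is the CRI shim built into kubelet. The pod deletion flow for any other remote CRI shim is essentially the same as for dockershim: a different container engine is used to operate on the containers, but it is still the CRI shim that calls CNI to tear down the pod network.

The detailed source code analysis follows.

Start with the KillPod method of kubeGenericRuntimeManager; the logic that calls CRI to delete a pod is triggered from this method.

The method's code shows that kubelet deletes a pod as follows:

(1) first stop all containers that belong to the pod;

(2) then stop the pod sandbox container.

Note: this path only stops the containers; removing them is left to kubelet's garbage collection (gc).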

// pkg/kubelet/kuberuntime/kuberuntime_manager.go
// KillPod kills all the containers of a pod. Pod may be nil, running pod must not be.
// gracePeriodOverride if specified allows the caller to override the pod default grace period.
// only hard kill paths are allowed to specify a gracePeriodOverride in the kubelet in order to not corrupt user data.
// it is useful when doing SIGKILL for hard eviction scenarios, or max grace period during soft eviction scenarios.
func (m *kubeGenericRuntimeManager) KillPod(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) error {
	err := m.killPodWithSyncResult(pod, runningPod, gracePeriodOverride)
	return err.Error()
}

// killPodWithSyncResult kills a runningPod and returns SyncResult.
// Note: The pod passed in could be *nil* when kubelet restarted.
func (m *kubeGenericRuntimeManager) killPodWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
	killContainerResults := m.killContainersWithSyncResult(pod, runningPod, gracePeriodOverride)
	for _, containerResult := range killContainerResults {
		result.AddSyncResult(containerResult)
	}

	// stop sandbox, the sandbox will be removed in GarbageCollect
	killSandboxResult := kubecontainer.NewSyncResult(kubecontainer.KillPodSandbox, runningPod.ID)
	result.AddSyncResult(killSandboxResult)
	// Stop all sandboxes belongs to same pod
	for _, podSandbox := range runningPod.Sandboxes {
		if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
			killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
			klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
		}
	}

	return
}
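
The ordering in KillPod (containers first, then sandboxes, with no removal) can be illustrated with a small self-contained sketch. Everything in it is hypothetical: fakeRuntime and killPod are stand-ins invented for illustration, not kubelet types, and the containers are stopped sequentially here for simplicity, whereas the real code stops them concurrently.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRuntime is a hypothetical stand-in for the CRI runtime service.
// It only records the order of the calls it receives.
type fakeRuntime struct {
	calls       []string
	failSandbox bool
}

func (f *fakeRuntime) StopContainer(id string) error {
	f.calls = append(f.calls, "StopContainer:"+id)
	return nil
}

func (f *fakeRuntime) StopPodSandbox(id string) error {
	f.calls = append(f.calls, "StopPodSandbox:"+id)
	if f.failSandbox {
		return errors.New("sandbox stop failed")
	}
	return nil
}

// killPod follows the order described above: stop every container,
// then stop every sandbox. Nothing is removed; removal is left to gc.
func killPod(rt *fakeRuntime, containers, sandboxes []string) []error {
	var errs []error
	for _, c := range containers {
		if err := rt.StopContainer(c); err != nil {
			errs = append(errs, err)
		}
	}
	for _, s := range sandboxes {
		if err := rt.StopPodSandbox(s); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	rt := &fakeRuntime{}
	errs := killPod(rt, []string{"app", "sidecar"}, []string{"sandbox-0"})
	fmt.Println(rt.calls, len(errs))
}
```

As in killPodWithSyncResult, a failure to stop one sandbox is recorded and the loop moves on to the next one instead of returning early.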

5.1 m.killContainersWithSyncResult

m.killContainersWithSyncResult stops all containers that belong to the pod.

Main logic: start one goroutine per container, each of which calls m.killContainer to stop its container.

// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainersWithSyncResult kills all pod's containers with sync results.
func (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {
	containerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))
	wg := sync.WaitGroup{}

	wg.Add(len(runningPod.Containers))
	for _, container := range runningPod.Containers {
		go func(container *kubecontainer.Container) {
			defer utilruntime.HandleCrash()
			defer wg.Done()

			killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
			if err := m.killContainer(pod, container.ID, container.Name, "", gracePeriodOverride); err != nil {
				killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
			}
			containerResults <- killContainerResult
		}(container)
	}
	wg.Wait()
	close(containerResults)

	for containerResult := range containerResults {
		syncResults = append(syncResults, containerResult)
	}
	return
}
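
The fan-out pattern used here (one goroutine per container, a sync.WaitGroup, and a channel buffered to the number of containers) can be tried in isolation. The sketch below is a minimal illustration, not kubelet code: the result type, the stopAll function, and the stop callback are hypothetical names, with stop standing in for m.killContainer. The final sort exists only to make the output order stable for the example.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// result is a hypothetical, simplified counterpart of a sync result.
type result struct {
	name string
	err  error
}

// stopAll starts one goroutine per name and collects one result from each.
func stopAll(names []string, stop func(string) error) []result {
	// Buffered to len(names), so no sender ever blocks.
	results := make(chan result, len(names))
	var wg sync.WaitGroup
	wg.Add(len(names))
	for _, name := range names {
		go func(name string) {
			defer wg.Done()
			results <- result{name: name, err: stop(name)}
		}(name)
	}
	wg.Wait()
	close(results) // safe: every sender has finished

	var out []result
	for r := range results {
		out = append(out, r)
	}
	// Goroutines finish in arbitrary order; sort for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].name < out[j].name })
	return out
}

func main() {
	out := stopAll([]string{"app", "sidecar"}, func(name string) error {
		if name == "sidecar" {
			return fmt.Errorf("stop %s: timed out", name)
		}
		return nil
	})
	for _, r := range out {
		fmt.Println(r.name, r.err)
	}
}
```

Because the channel capacity equals the number of goroutines, every send completes without a receiver, which is why waiting on the WaitGroup before draining the channel does not deadlock.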

5.1.1 m.killContainer

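The m.killContainer method mainly calls m.runtimeService.StopContainer.

runtimeService is a RemoteRuntimeService, which implements the RuntimeService interface (the container runtime interface on the CRI shim client side) and holds the client used to communicate with the CRI shim's container runtime server. Calling m.runtimeService.StopContainer therefore amounts to calling the StopContainer method on the CRI shim server to stop the container.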

// pkg/kubelet/kuberuntime/kuberuntime_container.go
// killContainer kills a container through the following steps:
// * Run the pre-stop lifecycle hooks (if applicable).
// * Stop the container.
func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {
	...

	klog.V(2).Infof("Killing container %q with %d second grace period", containerID.String(), gracePeriod)

	err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
	if err != nil {
		klog.Errorf("Container %q termination failed with gracePeriod %d: %v", containerID.String(), gracePeriod, err)
	} else {
		klog.V(3).Infof("Container %q exited normally", containerID.String())
	}

	m.containerRefManager.ClearRef(containerID)

	return err
}

m.runtimeService.StopContainer

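The m.runtimeService.StopContainer method calls r.runtimeClient.StopContainer; that is, it uses the CRI shim client to ask the CRI shim server to stop the container.

This completes the analysis of the CRI-related calls inside kubelet. Next we move into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to analyze how the container is stopped.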

// pkg/kubelet/remote/remote_runtime.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {
	// Use timeout + default timeout (2 minutes) as timeout to leave extra time
	// for SIGKILL container and request latency.
	t := r.timeout + time.Duration(timeout)*time.Second
	ctx, cancel := getContextWithTimeout(t)
	defer cancel()

	r.logReduction.ClearID(containerID)
	_, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{
		ContainerId: containerID,
		Timeout:     timeout,
	})
	if err != nil {
		klog.Errorf("StopContainer %q from runtime service failed: %v", containerID, err)
		return err
	}

	return nil
}

5.1.2 r.runtimeClient.StopContainer

Using dockershim as the example, we now move into the CRI shim to analyze how the container is stopped.

The earlier kubelet call to r.runtimeClient.StopContainer arrives at the StopContainer method of dockershim.

// pkg/kubelet/dockershim/docker_container.go
// StopContainer stops a running container with a grace period (i.e., timeout).
func (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {
	err := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)
	if err != nil {
		return nil, err
	}
	return &runtimeapi.StopContainerResponse{}, nil
}

ds.client.StopContainer

This method mainly calls d.client.ContainerStop.

// pkg/kubelet/dockershim/libdocker/kube_docker_client.go
// Stopping an already stopped container will not cause an error in dockerapi.
func (d *kubeDockerClient) StopContainer(id string, timeout time.Duration) error {
	ctx, cancel := d.getCustomTimeoutContext(timeout)
	defer cancel()
	err := d.client.ContainerStop(ctx, id, &timeout)
	if ctxErr := contextError(ctx); ctxErr != nil {
		return ctxErr
	}
	return err
}

d.client.ContainerStop

It builds the request parameters and sends an HTTP request to the corresponding Docker URL to stop the container.

// vendor/github.com/docker/docker/client/container_stop.go
// ContainerStop stops a container. In case the container fails to stop
// gracefully within a time frame specified by the timeout argument,
// it is forcefully terminated (killed).
//
// If the timeout is nil, the container's StopTimeout value is used, if set,
// otherwise the engine default. A negative timeout value can be specified,
// meaning no timeout, i.e. no forceful termination is performed.
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
	query := url.Values{}
	if timeout != nil {
		query.Set("t", timetypes.DurationToSecondsString(*timeout))
	}
	resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
	ensureReaderClosed(resp)
	return err
}

5.2 m.runtimeService.StopPodSandbox

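In m.runtimeService.StopPodSandbox, runtimeService is again a RemoteRuntimeService, which implements the RuntimeService interface (the container runtime interface on the CRI shim client side) and holds the client used to communicate with the CRI shim's container runtime server. Calling m.runtimeService.StopPodSandbox therefore amounts to calling the StopPodSandbox method on the CRI shim server to stop the pod sandbox.

This completes the analysis of the CRI-related calls inside kubelet. Next we move into the CRI shim (using dockershim, the CRI shim built into kubelet, as the example) to analyze how the pod sandbox is stopped.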

// pkg/kubelet/remote/remote_runtime.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be forced to termination.
func (r *RemoteRuntimeService) StopPodSandbox(podSandBoxID string) error {
	ctx, cancel := getContextWithTimeout(r.timeout)
	defer cancel()

	_, err := r.runtimeClient.StopPodSandbox(ctx, &runtimeapi.StopPodSandboxRequest{
		PodSandboxId: podSandBoxID,
	})
	if err != nil {
		klog.Errorf("StopPodSandbox %q from runtime service failed: %v", podSandBoxID, err)
		return err
	}

	return nil
}

5.2.1 r.runtimeClient.StopPodSandbox

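Using dockershim as the example, we now move into the CRI shim to analyze how the pod sandbox is stopped.

The earlier kubelet call to r.runtimeClient.StopPodSandbox arrives at the StopPodSandbox method of dockershim.

Stopping the pod sandbox has two main steps:

(1) call ds.network.TearDownPod to tear down the pod network;

(2) call ds.client.StopContainer to stop the pod sandbox container.

Note that stopping the pod sandbox counts as successful only when both steps succeed, and the two steps may succeed in either order. The network teardown is skipped for host-network pods and for sandboxes whose network has already been torn down.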

// pkg/kubelet/dockershim/docker_sandbox.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
	var namespace, name string
	var hostNetwork bool

	podSandboxID := r.PodSandboxId
	resp := &runtimeapi.StopPodSandboxResponse{}

	// Try to retrieve minimal sandbox information from docker daemon or sandbox checkpoint.
	inspectResult, metadata, statusErr := ds.getPodSandboxDetails(podSandboxID)
	if statusErr == nil {
		namespace = metadata.Namespace
		name = metadata.Name
		hostNetwork = (networkNamespaceMode(inspectResult) == runtimeapi.NamespaceMode_NODE)
	} else {
		checkpoint := NewPodSandboxCheckpoint("", "", &CheckpointData{})
		checkpointErr := ds.checkpointManager.GetCheckpoint(podSandboxID, checkpoint)

		// Proceed if both sandbox container and checkpoint could not be found. This means that following
		// actions will only have sandbox ID and not have pod namespace and name information.
		// Return error if encounter any unexpected error.
		if checkpointErr != nil {
			if checkpointErr != errors.ErrCheckpointNotFound {
				err := ds.checkpointManager.RemoveCheckpoint(podSandboxID)
				if err != nil {
					klog.Errorf("Failed to delete corrupt checkpoint for sandbox %q: %v", podSandboxID, err)
				}
			}
			if libdocker.IsContainerNotFoundError(statusErr) {
				klog.Warningf("Both sandbox container and checkpoint for id %q could not be found. "+
					"Proceed without further sandbox information.", podSandboxID)
			} else {
				return nil, utilerrors.NewAggregate([]error{
					fmt.Errorf("failed to get checkpoint for sandbox %q: %v", podSandboxID, checkpointErr),
					fmt.Errorf("failed to get sandbox status: %v", statusErr)})
			}
		} else {
			_, name, namespace, _, hostNetwork = checkpoint.GetData()
		}
	}

	// WARNING: The following operations made the following assumption:
	// 1. kubelet will retry on any error returned by StopPodSandbox.
	// 2. tearing down network and stopping sandbox container can succeed in any sequence.
	// This depends on the implementation detail of network plugin and proper error handling.
	// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
	// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
	// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
	// effort clean up and will not return error.
	errList := []error{}
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}

	if len(errList) == 0 {
		return resp, nil
	}

	// TODO: Stop all running containers in the sandbox.
	return nil, utilerrors.NewAggregate(errList)
}

ds.client.StopContainer

This is the same kubeDockerClient.StopContainer method shown in section 5.1.2 (pkg/kubelet/dockershim/libdocker/kube_docker_client.go); it mainly calls d.client.ContainerStop.

d.client.ContainerStop

As in section 5.1.2 (vendor/github.com/docker/docker/client/container_stop.go), this method builds the request parameters and sends an HTTP request to the corresponding Docker URL, this time to stop the pod sandbox container.

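Summary

CRI architecture diagram

Beneath CRI there are two kinds of container runtime implementations:

(1) dockershim, which is built into kubelet and provides support for the Docker container engine and for CNI network plugins (including kubenet). The dockershim code lives inside kubelet and is invoked by kubelet, which has dockershim start a standalone server that acts as the CRI shim and exposes a gRPC server to kubelet;

(2) external container runtimes, which provide support for container engines such as rkt and containerd.

Flow of kubelet calling CRI to delete a pod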