On the Design and Practice of Kubelet, Docker Daemon, and Container Runtimes

1. Revisit kubelet, Docker Daemon, and Container Runtime • Harry Zhang
2. Background
3. “the original world” • diagram labels: Docker Image; Docker Daemon; Compose; Swarm; Image; ACI; Runtime; rkt (AppC); Kubernetes; Control Panel; Mesos; Others*
4. “OCI & CNCF” • diagram labels: Docker Image; RunC; RunV; Image; ACI; Runtime; rkt (AppC); containerd; Docker Daemon; Compose; Swarm; Kubernetes; Control Panel; Mesos; Others*
5. “the new daemon” • diagram labels: Docker Image; RunC; RunV; Image; ACI; Runtime; rkt (AppC); containerd; Docker Daemon; Kubernetes; Control Panel; Mesos; Legacy*
6. “…” • diagram labels: Docker Image; RunC; containerd; Image; RunV; ocid; Runtime; ACI; rkt (AppC); hyperd; Mesos Containerizer; Containerizer; Container Runtime Interface; Docker Daemon; Kubernetes; Control Panel; Mesos; Legacy*
7. Interesting Fact • kubelet is becoming more like a traditional Docker Daemon
8. We Will Answer • What is kubelet? • kubelet handles pods instead of containers (why?) • kubelet manages containers like a Docker daemon (how?) • kubelet does not rely on any particular container runtime (Container Runtime Interface)
9. kubelet • diagram: scheduler, api-server, etcd, and two nodes, each running proxy and kubelet (SyncLoop) • Step 1: Pod created
10. kubelet • same diagram • Step 2: Pod object added
11. kubelet • same diagram • Step 3.1 (scheduler): New pod object detected • Step 3.2 (scheduler): Bind pod with node
12. kubelet • same diagram • Step 4.1 (kubelet): Detected pod bound to me • Step 4.2 (kubelet): Start containers in pod
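The four steps on slides 9 to 12 can be sketched as a small in-memory simulation. This is only an illustration, assuming a toy `ApiServer` class and round-robin binding; the class, the function names, and the binding policy are hypothetical and are not the real Kubernetes API or scheduler logic.

```python
from typing import Dict, List, Optional, Set


class ApiServer:
    """Toy stand-in for api-server + etcd: maps pod name -> bound node (or None)."""

    def __init__(self) -> None:
        self.pods: Dict[str, Optional[str]] = {}

    def create_pod(self, name: str) -> None:
        # Steps 1 and 2: the pod is created and its object is stored, still unbound.
        self.pods[name] = None

    def bind(self, name: str, node: str) -> None:
        self.pods[name] = node

    def unscheduled(self) -> List[str]:
        return [p for p, node in self.pods.items() if node is None]

    def pods_on(self, node: str) -> List[str]:
        return [p for p, n in self.pods.items() if n == node]


def scheduler_pass(api: ApiServer, nodes: List[str]) -> None:
    # Steps 3.1 and 3.2: detect new pod objects and bind each one to a node.
    # Round-robin is an assumption made only to keep the sketch short.
    for i, pod in enumerate(api.unscheduled()):
        api.bind(pod, nodes[i % len(nodes)])


def kubelet_pass(api: ApiServer, node: str, running: Set[str]) -> List[str]:
    # Steps 4.1 and 4.2: find pods bound to this node and start the new ones.
    started = [p for p in api.pods_on(node) if p not in running]
    running.update(started)
    return started
```

Note that the scheduler and the kubelets never talk to each other directly in this sketch; both only read and write pod objects through the shared store, which mirrors the flow drawn on the slides.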
13. Kubelet Uses the Pod Model • The atomic scheduling unit • A process group in the container cloud
14. “atomic scheduling unit” • InitContainers: one or more containers started in sequence before the pod's normal containers are started. • request = max_resource(sum_pod, any_init_container) • Kubernetes = Spring Framework • Pod = IoC
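The request formula on slide 14 can be worked through in a few lines. This is a minimal sketch, assuming resources are plain integers (for example millicores or bytes); the function name `effective_pod_request` is hypothetical. Init containers run one at a time, so only the largest of them counts, while regular containers run together, so their requests are summed.

```python
from typing import Dict, List

Resources = Dict[str, int]  # resource name -> requested quantity


def effective_pod_request(
    app_containers: List[Resources], init_containers: List[Resources]
) -> Resources:
    """Per resource: max(sum over app containers, max over init containers)."""
    names = {name for c in app_containers + init_containers for name in c}
    result: Resources = {}
    for name in names:
        sum_app = sum(c.get(name, 0) for c in app_containers)
        max_init = max((c.get(name, 0) for c in init_containers), default=0)
        result[name] = max(sum_app, max_init)
    return result
```

For example, two app containers requesting 200 and 300 millicores plus one init container requesting 1000 millicores give an effective CPU request of 1000, because the init container dominates the sum of 500.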
15. DIY pod? Not so easy • diagram labels: swarm; peer; node; steps 1, 2, 3 • Omega “gang scheduling problem” • two affiliated containers • one is scheduled successfully, but there is no place left for the other
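The problem described on slide 15 can be reproduced with a toy first-fit scheduler. This is a sketch under stated assumptions: capacity is a single integer per node, the two containers must share a node, and both function names are hypothetical. It compares placing containers one by one with placing the whole group atomically, as a pod.

```python
from typing import Dict, Tuple

Placement = Dict[str, str]  # container name -> node name


def schedule_individually(
    nodes: Dict[str, int], containers: Dict[str, int]
) -> Tuple[Placement, bool]:
    """First fit for the first container; the rest must follow it to the same node."""
    free = dict(nodes)
    placed: Placement = {}
    target = None
    for name, need in containers.items():
        candidates = [target] if target else list(free)
        node = next((n for n in candidates if free[n] >= need), None)
        if node is None:
            return placed, False  # earlier containers are stranded
        free[node] -= need
        placed[name] = node
        target = node
    return placed, True


def schedule_as_pod(
    nodes: Dict[str, int], containers: Dict[str, int]
) -> Tuple[Placement, bool]:
    """Treat the group as one unit: pick a node that fits the total demand."""
    total = sum(containers.values())
    node = next((n for n, cap in nodes.items() if cap >= total), None)
    if node is None:
        return {}, False
    return {name: node for name in containers}, True
```

With nodes of capacity 3 and 4 and two containers needing 2 each, the per-container approach places the first container on the smaller node and then has no room for the second, while the pod-level approach sees the combined demand of 4 up front and picks the larger node.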
16. kubelet • Pod sources (*4 sources): api-server (primary, watch); http endpoint (pull); http server (push); file (pull) • kubelet setup: chooseRuntime (docker, rkt, hyper/remote); Eviction; AddPodAdmitHandler; NodeStatus; NewContainerGC; Network Status; register listers; status Manager; diskSpaceManager; oomWatcher; NewGenericPLEG; InitNetworkPlugin • SyncLoop inputs: <-chan kubetypes.PodUpdate*; <-chan *pleg.PodLifecycleEvent (from PLEG); periodic sync events; housekeeping events • PodUpdate: HandlePods {Add, Update, Remove, Delete, …} • Pod Update Worker (e.g. ADD): generate pod status; check volume status; call runtime to start containers • volume Manager • image Manager
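The SyncLoop on slide 16 multiplexes several event sources into one loop. The sketch below shows a single iteration of that idea in Python, assuming two `queue.Queue` objects in place of the Go channels and a fixed polling order in place of Go's `select`. The function name, the handler keys "SYNC" and "HOUSEKEEPING", and the event shapes are all hypothetical simplifications, not the kubelet's actual code.

```python
import queue
from typing import Callable, Dict, List, Optional

Handler = Callable[[List[str]], None]


def sync_loop_iteration(
    config_updates: "queue.Queue",
    pleg_events: "queue.Queue",
    handlers: Dict[str, Handler],
    pods_due_for_sync: List[str],
    housekeeping_due: bool,
) -> Optional[str]:
    """Handle at most one event and report which source it came from."""
    # Source 1: pod updates, as (operation, pod names), e.g. ("ADD", ["web"]).
    try:
        op, pods = config_updates.get_nowait()
    except queue.Empty:
        pass
    else:
        handlers[op](pods)
        return "config"
    # Source 2: pod lifecycle events; here each event is just a pod name.
    try:
        pod = pleg_events.get_nowait()
    except queue.Empty:
        pass
    else:
        handlers["SYNC"]([pod])
        return "pleg"
    # Source 3: periodic sync of pods whose sync interval has elapsed.
    if pods_due_for_sync:
        handlers["SYNC"](pods_due_for_sync)
        return "sync"
    # Source 4: housekeeping, such as cleaning up after removed pods.
    if housekeeping_due:
        handlers["HOUSEKEEPING"]([])
        return "housekeeping"
    return None
```

A caller would build `handlers` with one entry per operation (ADD, UPDATE, REMOVE, DELETE, plus the two hypothetical keys) and invoke this function repeatedly. Polling in a fixed order gives pod updates priority, which is a choice made for this sketch only.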
17. Volume • Volume Manager • desired World: find new pods; createVolumeSpec(newPod); cache volumes[volName].pods[podName] = pod • reconcile: get mountedVolume from actualStateOfWorld; unmount volumes that are in mountedVolume but not in desiredStateOfWorld; AttachVolume() if vol is in desiredStateOfWorld and not attached; MountVolume() if vol is in desiredStateOfWorld and not in mountedVolume; verify that devices that should be detached/unmounted are detached/unmounted • Tips: 1. -v host:path 2. attach vs. mount 3. Totally independent of container management
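The reconcile step on slide 17 compares a desired state with an actual state and derives the actions needed to close the gap. The sketch below assumes each state is a plain set of volume names and returns a plan instead of performing any operation. The function name is hypothetical, and it simplifies by treating every volume as attachable and by leaving out the final verification step.

```python
from typing import List, Set, Tuple

Action = Tuple[str, str]  # (operation, volume name)


def plan_volume_reconcile(
    desired: Set[str], attached: Set[str], mounted: Set[str]
) -> List[Action]:
    """Return the unmount/attach/mount actions that move actual toward desired."""
    actions: List[Action] = []
    # Unmount volumes that are mounted but no longer desired.
    for vol in sorted(mounted - desired):
        actions.append(("unmount", vol))
    # Attach volumes that are desired but not yet attached to the node.
    for vol in sorted(desired - attached):
        actions.append(("attach", vol))
    # Mount volumes that are desired but not yet mounted.
    for vol in sorted(desired - mounted):
        actions.append(("mount", vol))
    return actions
```

Because the plan is derived only from the two states, running it again after the actions succeed yields an empty list, which is what makes this kind of loop safe to repeat.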
18. Containers • Pod: infra container; init container; container A; container B • Compute container changes: InfraContainer changed; ContainersToKeep; ContainersToStart • Start containers (for example): start infra container; networkPlugin.SetUpPod(); run & check init containers in order; start regular containers • volume • call runtime api
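The "compute container changes" step on slide 18 can be illustrated with a deliberately small function. This sketch assumes the only inputs are the container names in the pod spec, a map saying which of them are currently running, and a flag for whether the infra container changed. The function name and the result keys are hypothetical, and a real implementation would look at more signals than this.

```python
from typing import Dict, List


def compute_container_changes(
    spec_containers: List[str], running: Dict[str, bool], infra_changed: bool
) -> Dict[str, object]:
    """Decide which containers to keep and which to (re)start."""
    if infra_changed:
        # A changed infra container means the whole pod is rebuilt.
        return {"start_infra": True, "keep": [], "start": list(spec_containers)}
    keep = [c for c in spec_containers if running.get(c, False)]
    start = [c for c in spec_containers if not running.get(c, False)]
    return {"start_infra": False, "keep": keep, "start": start}
```

The result feeds the start sequence listed on the slide: the infra container first if needed, then network setup, then init containers in order, then the regular containers in the "start" list.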
19. Eviction • Guaranteed: limits are set for all resources in all containers; limits == requests (if set); not killed unless they exceed their limits, or the system is under memory pressure and there are no lower-priority containers that can be killed • Burstable: requests are set for one or more resources in one or more containers; limits (if set) != requests; killed once they exceed their requests and no Best-Effort pods exist when the system is under memory pressure • Best-Effort: requests and limits are not set for any resource in any container; first to get killed if the system runs out of memory
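The three classes on slide 19 follow mechanically from how requests and limits are set, so the rules can be sketched as a classifier. This is a simplified illustration, assuming only `cpu` and `memory` are considered and that each container is a dict with optional `requests` and `limits` maps; the function name `qos_class` is hypothetical.

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}
RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in RESOURCES:
            if res not in limits:
                # Guaranteed needs a limit for every resource in every container.
                guaranteed = False
            elif requests.get(res, limits[res]) != limits[res]:
                # A request, if set, must equal the limit.
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Under memory pressure the classes are reclaimed in the order the slide gives: BestEffort first, then Burstable pods that exceed their requests, and Guaranteed pods last.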
20. Container Runtime Interface
21. Runtime Design • Kubernetes was designed years ago to support multiple container runtimes • Keep Docker as the only default container runtime • The Remote Container Runtime Kit: https://github.com/kubernetes/frakti • Demo • Try-outs, stars, and forks are welcome :)
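The idea behind supporting multiple runtimes is that the kubelet codes against an interface rather than a specific runtime. The sketch below illustrates that shape only. The class `ContainerRuntime`, its two methods, and `FakeRuntime` are hypothetical names loosely modeled on the sandbox-then-containers flow; they are not the actual Container Runtime Interface definitions.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class ContainerRuntime(ABC):
    """Hypothetical runtime-neutral interface the pod logic depends on."""

    @abstractmethod
    def run_pod_sandbox(self, pod_name: str) -> str:
        """Create the pod-level environment and return its id."""

    @abstractmethod
    def start_container(self, sandbox_id: str, name: str, image: str) -> str:
        """Start one container inside the sandbox and return its id."""


class FakeRuntime(ContainerRuntime):
    """In-memory implementation; a real one would call a runtime's API."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def run_pod_sandbox(self, pod_name: str) -> str:
        return f"{self.prefix}-sandbox-{pod_name}"

    def start_container(self, sandbox_id: str, name: str, image: str) -> str:
        return f"{sandbox_id}/{name}"


def start_pod(
    runtime: ContainerRuntime, pod_name: str, containers: List[Tuple[str, str]]
) -> List[str]:
    # The pod logic never checks which runtime it was given.
    sandbox = runtime.run_pod_sandbox(pod_name)
    return [runtime.start_container(sandbox, n, img) for n, img in containers]
```

Swapping `FakeRuntime("docker")` for `FakeRuntime("hyper")` changes the returned ids but not a single line of `start_pod`, which is the property the runtime interface is meant to provide.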
22. Everything looks good, unless …
23. “when image is not standard” • diagram label: Docker Image • If the Docker image becomes Docker-specific someday (I’m not kidding) … • At least Kubernetes is prepared for that: • kubelet.imageManager becomes runtime-specific • but the default container runtime has to be changed • RunC + OCI Image, rkt (AppC) + ACI, or NO DEFAULT
24. Kubelet Summary • Kubelet implements the Pod model • Kubelet maintains a group of management loops named xxxManager: volume, status, lifecycle, network, garbage collection, image, log … • extensibility • performance • “disagreements”: Volume: essentially the same (-v hostPath:containerPath); Network: most network plugins are maintained by third-party providers; Resource model: k8s uses a Borg-inspired QoS model • Multiple runtimes, even multiple images (docker image, zip, package …)
25. END • Lei (Harry) Zhang • @reouser • https://hyper.sh • Kubernetes Project Member • CNCF Member • Docker …