Tricky Problems I Ran Into While Integrating Docker with Apache Mesos

1. The Docker Story in Apache Mesos, Timothy Chen
2. About me: - Previously Lead Engineer at Mesosphere (ex-Microsoft/VMware) - Apache Mesos and Apache Drill PMC member - Help maintain Apache Spark on Mesos
3. What I hope you learn from this talk… - More about Mesos - More about Docker - To think twice before building something that integrates with Docker…
6. Back in 2014…..
9. What is a containerizer? [Diagram: ZooKeeper and three Mesos Masters; Marathon and Cassandra frameworks; three Mesos Agents, each with a Containerizer managing Containers that hold an Executor running tasks T1 and T2.] The containerizer sits between agents and containers: - Launch/update/destroy containers - Provide isolation between containers - Report container stats and status
11. Mesos Containerizer ● Isolation using standard OS features (e.g., cgroups, namespaces) ● Pluggable architecture allowing customization and extension with isolators
12. How Does Mesos Launch a Task?
14. [Diagram: Mesos Master → Mesos Agent (Agent Process, Containerizer) → Container running the Command Executor] (1) Launch task; (2) Set up container; (3) Launch executor; (4) Register executor with IP:PORT; (5) Send task to executor; (6) Launch task; (7) Report status to framework
15. [Diagram: the same flow with a custom executor: Mesos Master → Mesos Agent (Agent Process, Containerizer) → Container running a Custom Executor with multiple Tasks] (1) Launch task
16. Docker integration in early 2014… Some considerations: - Is Docker going to gain a lot of adoption (obvious now)? - Are there more container integrations we need to do? - Docker is moving fast; how do we keep up? - Existing production users of Mesos aren't going to be using Docker; how do we change as little as possible?
17. Here comes the External Containerizer + Deimos… External Containerizer: - Every call to the containerizer invokes a predefined script, e.g. Launch -> /bin/sh -c launch… (http://mesos.apache.org/documentation/latest/external-containerizer) - We built a Docker plugin for the External Containerizer in Python: https://github.com/mesosphere/deimos
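A minimal Python sketch of the dispatch idea: each containerizer call becomes a shell invocation of one external script. The operation names, the script path, and the `build_external_call` helper are hypothetical. The real External Containerizer protocol also exchanges data with the script, and this sketch leaves that out.

```python
import shlex

# Hypothetical set of containerizer operations routed to one external script.
OPERATIONS = {"launch", "update", "usage", "wait", "destroy", "containers", "recover"}


def build_external_call(script, operation, container_id=None):
    """Return the argv for one containerizer call: /bin/sh -c "<script> <op> [id]"."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown containerizer operation: {operation}")
    parts = [script, operation]
    if container_id is not None:
        parts.append(container_id)
    # shlex.join quotes each word so the shell sees them as separate arguments.
    return ["/bin/sh", "-c", shlex.join(parts)]


if __name__ == "__main__":
    print(build_external_call("/usr/local/bin/deimos", "launch", "abc-123"))
```

Because every operation is a separate process, Mesos only sees exit codes and output, which is exactly the visibility problem described next.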
18. And? - Docker became very popular - We didn't see any other container integrations coming - External Containerizer + Deimos was buggy and hard to use: - All Docker/Deimos activity happens outside Mesos, so we have no visibility into it - Too hard to set up, since Deimos (and Python) has to be installed on every box
19. The External Containerizer was eventually deprecated: https://issues.apache.org/jira/browse/MESOS-3370
21. What we want is… - Native integration with Docker - No extra setup required, just the Docker daemon and client - A clean and simple way to use it from the framework - Recover from a slave (agent) crash while containers keep running (we do better than the Docker daemon here :)
22. How Does Mesos Launch a Task?
24. Docker Containerizer - Communicate with the Docker daemon through the local Docker client (docker run, pull, inspect, etc.), because the HTTP API is not stable! - Add Docker-specific protobuf options for schedulers - Stream all Docker logs to the stdout/stderr files
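A sketch of what "talk to the daemon through the CLI" looks like in practice: shell out to `docker` and parse its output. The helper names are hypothetical; `docker inspect` really does print a JSON array with a `State` object containing `Pid` and `Running`, but the `docker` helper needs a Docker client on the host, so only the parser runs standalone.

```python
import json
import subprocess


def docker(*args):
    """Run the local Docker client and return stdout (raises on non-zero exit)."""
    result = subprocess.run(["docker", *args], capture_output=True, text=True, check=True)
    return result.stdout


def parse_inspect(output):
    """Extract (pid, running) from `docker inspect <id>` output (a JSON array)."""
    state = json.loads(output)[0]["State"]
    return state["Pid"], state["Running"]


def inspect_container(container):
    """Hypothetical convenience wrapper combining the two helpers above."""
    return parse_inspect(docker("inspect", container))
```

The trade-off is that CLI output becomes the contract: any change in flags or output format between Docker releases has to be handled by this parsing code.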
27. [Diagram: launching a Docker task through Marathon] - Marathon receives POST { "docker": { "image": "busybox" }, "cmd": "sleep 100" } - The Mesos agent process receives CommandInfo { DockerInfo { "image": "busybox" } "value": "sleep 100" } - The agent drives the Docker daemon: docker pull image; docker run image command; docker logs container id to stdout/stderr - The Command Executor runs docker wait containerId
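The flow above can be sketched as a function that turns the simplified request into the sequence of Docker CLI calls. The function name is hypothetical, and running the container detached (`-d`) and wrapping the command in `/bin/sh -c` are assumptions made for illustration, not a claim about the exact flags Mesos passes.

```python
def docker_commands(request, container_name):
    """Translate a simplified Marathon-style request into the Docker CLI steps
    from the diagram: pull, run, stream logs, and wait."""
    image = request["docker"]["image"]
    cmd = request["cmd"]
    return [
        ["docker", "pull", image],
        # Assumption: run detached so logs and wait can be handled separately.
        ["docker", "run", "-d", "--name", container_name, image, "/bin/sh", "-c", cmd],
        # Follow the container's output into the task's stdout/stderr files.
        ["docker", "logs", "--follow", container_name],
        # Blocks until the container exits; the executor reports the result.
        ["docker", "wait", container_name],
    ]
```

For example, `docker_commands({"docker": {"image": "busybox"}, "cmd": "sleep 100"}, "mesos-c1")` yields four argv lists, starting with `["docker", "pull", "busybox"]`.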
28. And of course, we started hitting bugs…
29. How about recovery? Docker itself couldn't keep containers running across a daemon crash and restart (only supported as of a few months ago?). Assume only the Mesos agent restarted: how do we recover all the Docker containers Mesos launched? The Mesos agent supports checkpointing, which writes the executor pid to the filesystem.
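A minimal sketch of pid checkpointing and recovery, assuming a simple one-file-per-executor layout (the path and helper names are hypothetical; Mesos's real checkpoint layout differs). The atomic rename keeps a crash mid-write from leaving a corrupt checkpoint, and signal 0 checks whether the process still exists without sending anything.

```python
import os


def checkpoint_pid(path, pid):
    """Write an executor pid to disk so a restarted agent can find it again."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(pid))
    # Atomic rename: readers see either the old file or the complete new one.
    os.replace(tmp, path)


def recover_pid(path):
    """Return the checkpointed pid if that process is still alive, else None."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return None
    except PermissionError:
        pass  # the process exists but belongs to another user
    return pid
```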
30. Tag containers. Docker labels weren't available yet, so let's name them! - docker run --name mesos-<mesos_container_id> … On Docker Containerizer restart: - Run docker ps -a, read the output, and find all containers whose names start with mesos- - Wait again on the checkpointed command executor pids that have a matching container
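The recovery step can be sketched as matching container names against checkpointed pids. This assumes the names come from `docker ps -a --format '{{.Names}}'` (one name per line); the function names and the dictionary shapes are hypothetical.

```python
PREFIX = "mesos-"


def mesos_containers(names_output):
    """Map Mesos container IDs to Docker names, given one container name per line."""
    found = {}
    for line in names_output.splitlines():
        name = line.strip()
        if name.startswith(PREFIX):
            found[name[len(PREFIX):]] = name
    return found


def containers_to_rewait(names_output, checkpointed_pids):
    """Pair each checkpointed executor pid with its surviving Docker container.

    checkpointed_pids maps Mesos container ID -> executor pid. Entries whose
    container is gone (and containers without a checkpoint) are skipped.
    """
    running = mesos_containers(names_output)
    return {
        cid: (running[cid], pid)
        for cid, pid in checkpointed_pids.items()
        if cid in running
    }
```

Note that a plain `mesos-` prefix cannot tell apart containers from different agents on the same host.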
31. How about custom executors? - We can launch the custom executor in a Docker container by letting the user specify a Docker image in the ExecutorInfo; it will connect back to the agent and everything should work?
32. How about custom executors? - But how do we resize the executor container while it's running? (Docker still doesn't support updates in its API) - We have no choice but to go behind Docker's back to update it: - docker inspect the container to get its pid - Modify /sys/fs/cgroup/cpu, /sys/fs/cgroup/memory/.....
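A sketch of "going behind Docker": given the container's pid (for example from `docker inspect --format '{{.State.Pid}}' <container>`), read `/proc/<pid>/cgroup` to locate its cgroup v1 directories, then write the new limits. The helper names are hypothetical. The 1024-shares-per-CPU conversion and the `/sys/fs/cgroup/<controller>` mount layout are assumptions about a typical cgroup v1 host.

```python
CPU_SHARES_PER_CPU = 1024  # assumption: 1 CPU == 1024 cpu.shares


def cgroup_paths(proc_cgroup_text, root="/sys/fs/cgroup"):
    """Map controller -> cgroup directory from /proc/<pid>/cgroup content.

    Each cgroup v1 line looks like 'hierarchy-id:controller-list:path'.
    """
    paths = {}
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        _, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            if controller:
                paths[controller] = f"{root}/{controller}{path}"
    return paths


def resize_writes(proc_cgroup_text, cpus, mem_bytes):
    """Return the (file, value) writes needed to resize a running container."""
    paths = cgroup_paths(proc_cgroup_text)
    return [
        (f"{paths['cpu']}/cpu.shares", str(int(cpus * CPU_SHARES_PER_CPU))),
        (f"{paths['memory']}/memory.limit_in_bytes", str(mem_bytes)),
    ]


def apply_writes(writes):
    """Perform the writes; needs root on a cgroup v1 host."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)
```

Splitting "compute" from "apply" keeps the fragile part (where the cgroup lives) testable, which matters because, as the next slides show, that location can change underneath you.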
33. Done!
34. Not so fast… Users started to report problems: - Docker containers could not be recovered after a restart - Unable to update container sizes
35. Running Mesos in Docker to run Docker. We realized users were running the Mesos agent itself in Docker containers: - Executors were gone after the agent restarted, because all processes forked inside the agent's container are killed when that container exits - Mesos was unable to modify cgroups because it was running chrooted
36. [Diagram: the Mesos Agent process asks the Docker Daemon to pull and run the image, and launches the Docker executor in its own Docker container; the Docker executor runs docker wait containerId on the task container.]
37. Recovery again - Find all executors running in Docker containers and wait again on those containers' pids - (What about multiple agents running on the same box?)
38. Is everything good now…? - Docker moves fast: new command-line options and new features keep appearing over time
39. Cgroups disappeared? - Users started reporting that the Mesos agent crashed after failing to recover and update containers. - Looking at the logs, all we could see was that it could no longer find the cgroup root at the same location. - How can a cgroup root disappear while the container is running?!
40. Systemd + Docker + Mesos love story - After days of investigation, we reproduced the problem by running a systemd config update, and noticed that the Docker containers' cgroups simply went missing (https://issues.apache.org/jira/browse/MESOS-3009) - We started reading the systemd source code, looking for clues - We realized that, under certain conditions, systemd migrates pids and recreates cgroups when processes finish
41. Systemd + Docker + Mesos love story continues - Mesos now launches containers in a systemd-specific way - It registers a new systemd slice and uses the Delegate flag to ask systemd not to migrate its processes (Delegate is not available in older systemd versions…)
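For reference, `Delegate=` is a systemd resource-control setting that hands the cgroup subtree below a unit to the processes of that unit. The sketch below just renders a minimal, hypothetical service unit for an agent with delegation turned on; the unit contents and the `ExecStart` path are assumptions, and it does not show how Mesos sets up its own slice.

```python
def render_agent_unit(exec_start, delegate=True):
    """Render a minimal systemd service unit (hypothetical) for a Mesos agent.

    Delegate=yes asks systemd to leave the cgroups below this unit to the
    agent instead of migrating its processes or recreating those cgroups.
    """
    lines = [
        "[Unit]",
        "Description=Mesos agent (hypothetical unit)",
        "",
        "[Service]",
        f"ExecStart={exec_start}",
    ]
    if delegate:
        lines.append("Delegate=yes")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_agent_unit("/usr/sbin/mesos-agent"), end="")
```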
42. We have a bigger problem… - Docker works great, until you need to work around it - It also works great as long as you don't have a lot of containers to run (stability problems) - A lot has changed around Docker since we started two years ago (rkt, runc, containerd, etc.) - But it's still not enough to let us add features around Docker containers (GPUs, isolators, etc.)
43. What we really want from Docker - We realized the biggest feature we want from Docker is not the daemon, just the image format
44. Unified containerizer ● Pluggable architecture ● Container image ● Container network ● Container storage ● Container security ● Customization and extensions
45. Unified containerizer: pluggable architecture ● Launcher: process management ● Isolators: container lifecycle hooks ● Provisioner: container image support
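A Python analogue of the pluggable split between a launcher and isolators. This is a sketch only: the real Mesos interfaces are C++ and have more hooks (for example, update and usage), the provisioner is omitted, and every class name here is hypothetical. It shows the key idea that isolators hook into each phase of a container's lifecycle around the launcher's fork.

```python
from abc import ABC, abstractmethod


class Isolator(ABC):
    """Hooks called over a container's lifecycle (simplified)."""

    @abstractmethod
    def prepare(self, container_id): ...

    @abstractmethod
    def isolate(self, container_id, pid): ...

    @abstractmethod
    def cleanup(self, container_id): ...


class RecordingIsolator(Isolator):
    """Example isolator that only records which hook ran."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def prepare(self, container_id):
        self.log.append((self.name, "prepare", container_id))

    def isolate(self, container_id, pid):
        self.log.append((self.name, "isolate", container_id))

    def cleanup(self, container_id):
        self.log.append((self.name, "cleanup", container_id))


class Launcher:
    """Hypothetical launcher: returns a fake pid instead of forking."""

    def fork(self, container_id):
        return 12345


class Containerizer:
    def __init__(self, launcher, isolators):
        self.launcher, self.isolators = launcher, isolators

    def launch(self, container_id):
        for iso in self.isolators:
            iso.prepare(container_id)  # before the process exists
        pid = self.launcher.fork(container_id)
        for iso in self.isolators:
            iso.isolate(container_id, pid)  # e.g. move pid into cgroups
        return pid

    def destroy(self, container_id):
        # Tear down in the reverse order of setup.
        for iso in reversed(self.isolators):
            iso.cleanup(container_id)
```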
46. Unified containerizer: container image support. Starting from 0.28, you can run your Docker containers on Mesos without a Docker daemon installed! ● One less dependency in your stack ● Agent restarts handled gracefully; tasks not affected ● Composes well with all existing isolators ● Easier to add extensions
47. Unified containerizer: Provisioner ● Manages container images ● Store: fetch and cache image layers ● Backend: assemble the rootfs from image layers (e.g., copy, overlayfs, bind, aufs) ● Stores can be extended: currently supported: Docker, Appc; planned: CVMFS (join the MesosCon talk!)
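The simplest backend listed, copy, can be sketched as copying each layer directory bottom-to-top into one rootfs so upper layers overwrite lower ones. This is an illustration with a hypothetical function name, not Mesos's implementation; in particular, Docker whiteout files (which mark deletions) are not handled here.

```python
import os
import shutil


def copy_backend(layer_dirs, rootfs):
    """Assemble a rootfs by copying image layers, lowest first, into one directory.

    Files from later (upper) layers overwrite files from earlier ones.
    Whiteout files, which mark deletions in Docker layers, are NOT handled.
    """
    os.makedirs(rootfs, exist_ok=True)
    for layer in layer_dirs:
        # dirs_exist_ok requires Python 3.8+.
        shutil.copytree(layer, rootfs, dirs_exist_ok=True)
    return rootfs
```

Copying is slow and uses disk proportional to the image size, which is why the overlayfs, bind, and aufs backends exist; copy has the advantage of working on any filesystem.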
48. Unified containerizer: container image framework API

    message Image {
      enum Type {
        DOCKER = 1;
        APPC = 2;
      }

      required Type type;
      optional Appc appc;
      optional Docker docker;
      optional bool cached;

      message Docker { ... }
      message Appc { ... }
    }

    message TaskInfo {
      message ContainerInfo {
        enum Type {
          DOCKER = 1;
          MESOS = 2;
        }

        message MesosInfo {
          optional Image image;
        }

        required Type type;
        optional MesosInfo mesos;
      }

      optional ContainerInfo container;
    }
49. Unified containerizer example: launch a Docker container with the unified containerizer

    TaskInfo {
      ...
      "container" : {
        "type" : "MESOS",    // instead of "DOCKER", which uses the Docker containerizer
        "mesos" : {
          "image" : {
            "type" : "DOCKER",
            "docker" : { "name" : "busybox" }
          }
        }
      }
    }

More details can be found at: https://github.com/apache/mesos/blob/master/docs/container-image.md
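The same `container` field, built as a JSON-style Python dict. The helper name is hypothetical and the other required TaskInfo fields are omitted; the point is the switch: `type` picks the containerizer (MESOS), while `mesos.image.type` picks the image format (DOCKER).

```python
import json


def mesos_docker_image_container(image_name):
    """Build the 'container' field: unified (MESOS) containerizer + Docker image."""
    return {
        "type": "MESOS",  # not "DOCKER", which would select the Docker containerizer
        "mesos": {
            "image": {
                "type": "DOCKER",  # image format, fetched without a Docker daemon
                "docker": {"name": image_name},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps({"container": mesos_docker_image_container("busybox")}, indent=2))
```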
50. Future of containerization in Mesos: the unified containerizer! Make it awesome: ● Nested containers ● VM support ● Unified fetching and caching ● Better abstraction for isolators
51. Future work: nested containers ● A custom executor wants to create sub-containers ● Isolation between sub-containers ● Sub-containers have container images (e.g., Docker) ● When the executor dies, its sub-containers are destroyed ● Use cases: K8s on Mesos, Jenkins on Mesos, native pod support
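A toy model of the nesting rule above (when the executor dies, its sub-containers are destroyed). The `Container` class and the `parent.child` ID scheme are hypothetical, chosen only to make the destroy order visible: children are torn down depth-first before their parent.

```python
class Container:
    """Hypothetical container tree: an executor's container owns sub-containers."""

    def __init__(self, container_id, image=None):
        self.id = container_id
        self.image = image  # e.g. a Docker image name for a sub-container
        self.children = []
        self.alive = True

    def launch_child(self, child_id, image=None):
        if not self.alive:
            raise RuntimeError(f"{self.id} is destroyed")
        child = Container(f"{self.id}.{child_id}", image)
        self.children.append(child)
        return child

    def destroy(self):
        """Destroy sub-containers first (depth-first), then this container.

        Returns the IDs in the order they were destroyed.
        """
        destroyed = []
        for child in self.children:
            destroyed.extend(child.destroy())
        self.alive = False
        destroyed.append(self.id)
        return destroyed
```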
52. Future work: VM support ● Motivation and use cases: more secure containers; VM workloads; OpenStack integration ● Goal: launching Mesos tasks/executors in VMs ● Possible implementations: a new containerizer? A plugin for the unified containerizer?
53. Future work: unified fetching and caching ● Problems: different ways to fetch URIs and container images; cached in different places ● Pluggable fetcher: a fetcher for each URI scheme; allow URIs with custom schemes; the fetcher is modularized and can be extended (e.g., p2p) ● Unified caching: all artifacts are cached the same way; content-addressable storage; garbage collection; pre-fetching support
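A sketch combining the two ideas: one fetcher plugin per URI scheme, and a content-addressable cache keyed by SHA-256 digest so identical artifacts are stored once regardless of where they came from. The `Fetcher` class, its methods, and the `gc` policy are hypothetical illustrations, not a Mesos API.

```python
import hashlib
from urllib.parse import urlparse


class Fetcher:
    """Hypothetical pluggable fetcher with a content-addressed cache."""

    def __init__(self):
        self.plugins = {}  # URI scheme -> callable(uri) -> bytes
        self.store = {}    # "sha256:<hex>" -> bytes
        self.index = {}    # uri -> digest

    def register(self, scheme, plugin):
        self.plugins[scheme] = plugin

    def fetch(self, uri):
        """Return the digest for uri, downloading only on a cache miss."""
        if uri in self.index:
            return self.index[uri]
        scheme = urlparse(uri).scheme
        if scheme not in self.plugins:
            raise ValueError(f"no fetcher plugin for scheme {scheme!r}")
        data = self.plugins[scheme](uri)
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Identical content from different URIs is stored only once.
        self.store.setdefault(digest, data)
        self.index[uri] = digest
        return digest

    def gc(self, keep_uris):
        """Forget URIs not in keep_uris, then drop blobs nothing refers to."""
        self.index = {u: d for u, d in self.index.items() if u in keep_uris}
        live = set(self.index.values())
        self.store = {d: b for d, b in self.store.items() if d in live}
```

A custom scheme such as `p2p://` only needs a new `register` call; caching and garbage collection stay the same for every artifact type.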
54. Future work: better abstraction for isolators ● Problems with the existing abstraction: cannot specify dependencies between isolators; sharing information between isolators is not possible; upgrading isolators in a backward-compatible way is hard ● Potential solutions: explicit isolator dependencies, both data and control; isolator versioning, and version checkpointing; an isolator registry?
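If isolators declared explicit dependencies, the containerizer could derive a safe execution order with a topological sort and reject cycles up front. A minimal sketch using Python's standard `graphlib`; the isolator names are Mesos-style, but the dependency edges in the usage example are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def isolator_order(dependencies):
    """Order isolators so each runs after the isolators it depends on.

    dependencies maps isolator name -> set of isolator names it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical dependency edges, for illustration only.
    deps = {
        "filesystem/linux": set(),
        "network/cni": {"filesystem/linux"},
        "docker/runtime": {"filesystem/linux"},
        "cgroups/cpu": set(),
    }
    print(isolator_order(deps))
```

Cleanup would typically run in the reverse of this order.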
55. Next big question to ask… ● How do we solve utilization, fairness and performance when running all of these services at scale?
56. Thanks! tnachen@gmail.com