Docker Applications: How to Design a Very-Large-Scale Container Scheduling System

微风

Published 2019/03/24 in the Technology category

Text content
1. Building a Kubernetes-Based Container Cloud System Caicloud (才云科技) CTO / Deyuan Deng (邓德源)
2. 2016-4-21
3. About me And when I was young in the good old days…
4. Case Study: Containers in Google • Using containers for a decade • Running 2 billion containers a week • Solves application migration nightmare • Saves billion dollars a year • Docker: 5x yearly growth rate
5. Solved Problems of the World?
6. Case Study: Cluster Management in Google • 1 SRE handles ~10,000 machines with 99.999% reliability • Clustering is the hard part. In Google: • NO team dedicated for container study • HUNDREDS of engineers built THREE cluster manager systems • HUNDREDS of teams building ecosystems • Clustering is the real value in building serious production container systems
7. All the Fun Stuff Began Here
8. Kubernetes Design Principles • declarative > imperative • simple > complex • labels > hierarchy • legacy compatible • extensible and pluggable • application centric
9. Kubernetes Highlights How to group resources? • Pods Lessons learned from Borg • Jobs are usually grouped, e.g. log offloading • Allow teams to develop distinct parts of an application • Improve robustness, e.g. logs can be offloaded even if the container crashes • Atomic scheduling: C1: 1 core, C2: 2 cores; M1: 2 cores, M2: 8 cores; C1 -> M1? Diagram labels: App containers, Volumes, Config Containers
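The atomic-scheduling question on slide 9 (C1 -> M1?) can be worked through with a small sketch. It assumes a naive first-fit placement, which is an illustration only and not the actual Kubernetes scheduler; the function names are hypothetical, while the core counts come from the slide.

```python
# Hypothetical first-fit placement; NOT the real Kubernetes scheduler.
containers = {"C1": 1, "C2": 2}   # cores requested (from the slide)
machines = {"M1": 2, "M2": 8}     # cores available (from the slide)


def first_fit(cores_needed, free):
    """Return the first machine with enough free cores, or None."""
    for name, cores in free.items():
        if cores >= cores_needed:
            return name
    return None


def schedule_individually(containers, machines):
    """Place each container on its own, ignoring that they belong together."""
    free = dict(machines)
    placement = {}
    for name, cores in containers.items():
        target = first_fit(cores, free)
        placement[name] = target
        if target is not None:
            free[target] -= cores
    return placement


def schedule_as_pod(containers, machines):
    """Place the group atomically: one machine must fit the summed request."""
    target = first_fit(sum(containers.values()), machines)
    return {name: target for name in containers}


individual = schedule_individually(containers, machines)
pod = schedule_as_pod(containers, machines)
print("individually:", individual)
print("as a pod:    ", pod)
```

Under these assumptions, placing C1 on M1 first leaves only one free core there, so C2 lands on a different machine; treating both containers as one unit sends them to M2 together.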
10. Kubernetes Highlights How to manage massive resources in a flexible way? • app: booking, env: prod, team: travel, version: v1 • app: mongo, env: prod, team: storage • app: booking, env: prod, team: travel, version: v2 • app: booking, env: uat, team: travel, version: v2 Labels and their query API: Selectors
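The label sets on slide 10 can be queried with an equality-based selector. The sketch below is a plain-Python illustration of that idea, not the Kubernetes API; `matches` and `select` are hypothetical helper names.

```python
# The four label sets shown on the slide.
resources = [
    {"app": "booking", "env": "prod", "team": "travel", "version": "v1"},
    {"app": "mongo", "env": "prod", "team": "storage"},
    {"app": "booking", "env": "prod", "team": "travel", "version": "v2"},
    {"app": "booking", "env": "uat", "team": "travel", "version": "v2"},
]


def matches(labels, selector):
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(resources, selector):
    """Return the resources whose labels satisfy the selector."""
    return [labels for labels in resources if matches(labels, selector)]


prod = select(resources, {"env": "prod"})
booking_v2 = select(resources, {"app": "booking", "version": "v2"})
storage = select(resources, {"team": "storage"})
print(len(prod), len(booking_v2), len(storage))
```

Because selection is a flat match on key/value pairs, the same resources can be sliced by environment, team, or version without arranging them in a hierarchy.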
11. Kubernetes Highlights How to do service discovery for external services? • Use services and endpoints together • Prod Endpoint: Name: “Oracle”, IP: 10.254.1.1 • QA Endpoint: Name: “Oracle”, IP: 192.168.1.1
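One way to read slide 11 is as a Service without a selector, paired with a manually created Endpoints object of the same name. The sketch below builds both manifests as Python dictionaries. The lowercase name `oracle` (Kubernetes object names must be lowercase) and port 1521 are assumptions for illustration; the IPs come from the slide.

```python
import json


def external_service(name, ip, port):
    """Build a selector-less Service and a same-named Endpoints object."""
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"ports": [{"port": port}]},  # no selector: endpoints are manual
    }
    endpoints = {
        "apiVersion": "v1",
        "kind": "Endpoints",
        "metadata": {"name": name},  # must match the Service name
        "subsets": [{"addresses": [{"ip": ip}], "ports": [{"port": port}]}],
    }
    return service, endpoints


# Port 1521 is an assumed value; the slide only gives names and IPs.
prod_service, prod_endpoints = external_service("oracle", "10.254.1.1", 1521)
qa_service, qa_endpoints = external_service("oracle", "192.168.1.1", 1521)
print(json.dumps(prod_endpoints, indent=2))
```

Applications in both clusters look up the same service name, while each environment's Endpoints object points to a different address.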
12. Kubernetes Highlights How to deal with configurations varying in different environments? ConfigMap in Pod: ORACLE_PASSWD: “7h6#f)” JETTY_CONFIG_PATH: … APP in Pod: ref: ENV[ORACLE_PASSWD] Single image, decoupled from varying configurations ConfigMap in UAT: ORACLE_PASSWD: “123456” JETTY_CONFIG_PATH:
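The "single image, decoupled from varying configurations" idea on slide 12 can be sketched as application code that reads only its environment. The helper names below are hypothetical and the dictionaries stand in for ConfigMaps; the two password values are the ones shown on the slide.

```python
# Stand-ins for the two ConfigMaps on the slide (values copied from it).
config_first = {"ORACLE_PASSWD": "7h6#f)"}
config_uat = {"ORACLE_PASSWD": "123456"}


def build_container_env(config_map, referenced_keys):
    """Copy the referenced ConfigMap keys into the container environment."""
    return {key: config_map[key] for key in referenced_keys}


def app_oracle_password(env):
    """Application code: resolves ENV[ORACLE_PASSWD], never a literal value."""
    return env["ORACLE_PASSWD"]


# The same application function runs unchanged in both environments.
first_env = build_container_env(config_first, ["ORACLE_PASSWD"])
uat_env = build_container_env(config_uat, ["ORACLE_PASSWD"])
print(app_oracle_password(first_env) != app_oracle_password(uat_env))
```

Only the injected environment differs between the two runs, which is what lets one image be promoted across environments.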
13. Kubernetes Highlights How to NOT let docker persist my credentials? • Secrets Secret manifest: apiVersion: v1 kind: Secret metadata: name: aliyun-api-keys data: api-client1: api-client2: aliyun-api-keys: ReplicationController manifest: apiVersion: v1 kind: ReplicationController metadata: name: caicloud-cluster-manager spec: volumes: - name: aliyun-api-keys secret: secretName: aliyun-api-keys - name: caicloudapp-ssl-cert secret: secretName: caicloudapp-ssl-cert
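The Secret and volume fragments on slide 13 can be assembled as follows. This is a sketch under stated assumptions: the credential strings are placeholders (the slide does not show the real values), and the helper names are hypothetical. What it relies on is that values under a Secret's `data` field are base64-encoded.

```python
import base64


def make_secret(name, values):
    """Build a v1 Secret manifest; `data` values must be base64-encoded."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "data": data,
    }


def secret_volume(secret_name):
    """Build the volume entry that references a Secret by name."""
    return {"name": secret_name, "secret": {"secretName": secret_name}}


# Placeholder credentials, NOT values from the slides.
secret = make_secret("aliyun-api-keys", {"api-client1": "placeholder-key"})
volumes = [secret_volume("aliyun-api-keys"), secret_volume("caicloudapp-ssl-cert")]
print(secret["metadata"]["name"], [v["name"] for v in volumes])
```

The credentials are mounted into the container at run time, so they never need to be baked into an image layer.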
14. Kubernetes Highlights How to perform fine-grained resource control and access control? • Namespaces and service accounts How to handle services or applications that are stateful? • L7 load balancer • Ingress controller • node affinity • PetSet (coming) • Pod Lifecycle interfaces How to automatically create, delete, and allocate storage resources? • Persistent volumes and claims
15. Our Practice on Using Kubernetes to Build Cluster Management System
16. Architecture Problems: - Duplicate functionalities - System tends to be monolithic - Complex frontend logic due to varied APIs Diagram labels: User browser; Nginx (Rate Limiting, SSL Termination); Solution Manager Public API, Paging Public API, Cluster Manager Public API; Authentication, Transformation, Authentication, Logging, Authentication, Authorization; Solution Manager Private API, Paging Private API, Cluster Manager Private API, … … …; Paging Third Party API
17. Architecture Diagram labels: pull/push image; User browser; Cargo (registry); Caicloud admin browser; Circle (release mgmt); API Gateway; Solution Manager (composite apps, cloud native); Accounting; Paging; Admin Console; Cluster Manager (federation, isolation, logging, HA, lifecycle); ……; Cubernetes/Caicloud API calls; User Cubernetes Clusters (Monitoring, Logging); User Cubernetes Clusters (Monitoring, Logging); User Cubernetes Clusters (Monitoring, Logging); ……; User Cubernetes Clusters (Monitoring, Logging)
18. Design: CLaaS vs CaaS Cluster vs Container as the operation units • Additional higher-level management • E.g., clone entire clusters and ensure holistic consistency (e.g., config) beyond just image consistency Clustered applications vs container processes • CLaaS: ES cluster with data, client, and master nodes + offload data processing • CaaS: docker run elasticsearch:1.7.4 Host cluster exposure vs container black box • CLaaS: requires dedicated clusters; additional host information and access points • CaaS: hosts are abstracted away; obscures debugging, tooling, and customization
19. Example: The True Consistency and Portability Scenario: • Tomcat, Redis Cluster, Elasticsearch, MongoDB • Want to set up development, testing, and production environments and be 1) fast and 2) consistent CaaS: image-level consistency • Tomcat1…3, Redis1…3 and ES1…3 have consistent images • How to handle different IPs, references, config files, dependencies? The MOST troublesome part unsolved! CLaaS: system-level consistency • Tomcat1, Tomcat2 and Tomcat3 have consistent images • IPs use consistent names (even for external services) • Config uses consistent references, dependencies are respected
20. Design: Release Management • Where is my-awesome-app running? • What is the latest version of my-awesome-app? • What is the live version of my-awesome-app? • Has version Y been running long enough to roll out (and upgrade version X)? • Can I continuously deploy my-awesome-app to the test cluster? • How can I upgrade my-awesome-app together with his-xxx-app now that I have to depend on it? …… Diagram labels: Cluster, Cluster, Cluster
21. Design: Release Management • Static configuration • Easy but ‘static’, works well in most cases • Dynamic tracking • Record status while deploying • Use Kubernetes annotations for tracking • Dynamic dependency management remains unsolved Diagram labels: Hypervisor, Hypervisor, Hypervisor, CI Module (Hypervisor), Logging Module, Deployment Module, Solution Manager
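The "record status while deploying" and "use annotation for tracking" bullets on slide 21 might look like the sketch below. It is an illustration only: the annotation key `example.com/release-history` and the status strings are hypothetical, and the dictionary stands in for an object's metadata. Annotation values are strings, so the history is stored as JSON text.

```python
import json

ANNOTATION_KEY = "example.com/release-history"  # hypothetical annotation key


def record_status(metadata, version, status):
    """Append a (version, status) entry to the JSON history annotation."""
    annotations = metadata.setdefault("annotations", {})
    history = json.loads(annotations.get(ANNOTATION_KEY, "[]"))
    history.append({"version": version, "status": status})
    annotations[ANNOTATION_KEY] = json.dumps(history)  # values must be strings
    return metadata


def live_version(metadata):
    """Return the most recent version recorded with status 'live', if any."""
    raw = metadata.get("annotations", {}).get(ANNOTATION_KEY, "[]")
    for entry in reversed(json.loads(raw)):
        if entry["status"] == "live":
            return entry["version"]
    return None


metadata = {"name": "my-awesome-app"}
record_status(metadata, "v1", "live")
record_status(metadata, "v2", "deploying")
during_rollout = live_version(metadata)
record_status(metadata, "v2", "live")
after_rollout = live_version(metadata)
print(during_rollout, after_rollout)
```

Keeping the history on the object itself answers questions such as "what is the live version?" without a separate tracking database.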
22. Design: Managing Solutions Not Containers How Clouds Treat Their Customers Today • Users have to map logical solutions to the list of resources/containers • Containers DO NOT help! Diagram labels: backend, Mongo, storage, mid-tier, Codis, frontend, Elastic Search, Jetty, VPS, VPS, storage, VPS, container (×9)
23. Solution Manager: Solutions as 1st Class Citizen Key benefits • Includes both workloads and infrastructure and the topology • Additional higher level meta-data and management interfaces Diagram labels: organizations, backend, applications, Mongo, mid-tier, Elastic Search, Codis, container, container, vehicles, VPS, storage, frontend, VPS, storage, storage, storage, Jetty, container, container, container
24. Implementation: Stateful -> Stateless Rolling update is great, when: - you want to test multiple versions of code or configuration - you want to update application without service interruption Diagram labels: Cluster Manager Server (×4), Worker container, etcd cluster, Cluster Manager Validator, Cluster Manager Executor (×3), Worker container - Restart creating cluster - Easy, but bad user experience - Pick up where it left off - Error prone - Graceful termination - 1 min is too small and too large - A new docker container - Spawn a new docker container Chaos testing: https://github.com/gaia-adm/pumba
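The "pick up where it left" option on slide 24 can be sketched as a worker that checkpoints its progress in an external store, so a replacement worker resumes instead of restarting. This is an assumption-laden illustration: the in-memory `KeyValueStore` merely stands in for the etcd cluster, and the step names and key layout are hypothetical.

```python
class KeyValueStore:
    """In-memory stand-in for an external store such as etcd."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


# Hypothetical cluster-creation steps, for illustration only.
STEPS = ["allocate-machines", "install-kubelet", "start-master", "join-nodes"]


def run_worker(store, cluster_id, crash_after=None):
    """Run the remaining steps, checkpointing after each one.

    `crash_after` simulates the worker dying after that many steps.
    """
    key = "clusters/%s/completed-steps" % cluster_id
    completed = store.get(key, 0)
    executed = []
    for index in range(completed, len(STEPS)):
        if crash_after is not None and len(executed) == crash_after:
            return executed  # simulated crash: the worker itself kept no state
        executed.append(STEPS[index])
        store.put(key, index + 1)  # checkpoint outside the worker
    return executed


store = KeyValueStore()
first_run = run_worker(store, "c1", crash_after=2)
second_run = run_worker(store, "c1")  # a fresh worker resumes from the store
print(first_run, second_run)
```

The sketch glosses over the error-prone part the slide warns about: a real step can fail halfway, so each step would also need to be safe to repeat.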
25. Implementation: High Availability and Load balancing Kubernetes supports: - etcd cluster for storage - Multiple API servers - Master elected scheduler and controller manager - Kubelet babysitter
26. Implementation: High Availability and Load balancing High Availability: Diagram labels: Manager, HAProxy, Keepalived, Kubelet, VIP; Manager, HAProxy, Keepalived, Kubelet, VIP
27. About Us: Cloud Team from Google + Amazon + CMU • COO | Jiayao Han • Experienced serial entrepreneur in the US • Four degrees in Information Science, Law, Art, History from University of Pittsburgh • CEO | Xin Zhang • Ex-Googler specializing in Google private cloud, GAE and GCE, received multiple spot bonuses from several Google VPs • CS Ph.D. from CMU specializing in distributed systems and security • CTO | Deyuan Deng • Ex-Googler and top open-source Docker and container cluster contributor • 1st Prize in World Robotics Competition • CMU ECE • Chief Architect | Pengcheneg Tang • Ex-Amazon engineer and expert in Docker and Kubernetes • CMU ECE • Chief Data Scientist | Zeyu Zheng • Ex-Googler specializing in Big Data • ACM competition team lead • CMU Computer Science