ArchSummit (Global Architect Summit) 2018

Weiwei Yang: Apache Hadoop YARN State of the Union

1. Apache Hadoop YARN: State of the Union. Weiwei Yang (Hortonworks) / Chunde Ren (Alibaba)
3. Weiwei Yang
• Staff Software Engineer at Hortonworks, YARN dev team
• Apache Hadoop committer and PMC member
Chunde Ren
• Staff Software Engineer at Alibaba
• Leads the Hadoop team on the real-time computation platform
4. Agenda
• State of the Union: Service, ML, Cloud and beyond
  - Scale and performance
  - Unified platform
• Apache YARN 3.1 in Alibaba
  - Utilization+: balance & oversubscription
  - Hybrid clusters
• Q&A
5. State of the Union: Service, ML, Cloud… (Weiwei Yang, Hortonworks YARN team)
6. 1 Year Timeline: GA Releases
• 2.9.x (2.9.1, Nov '17): YARN Federation, Opportunistic Containers, new YARN UI, Timeline v2
• 3.0.x (3.0.3, Dec '17): Global Scheduling, multiple resource types, new YARN UI, Timeline v2
• 3.1.x (3.1.1, Aug '18): GPU/FPGA, YARN Native Service, Global Scheduling, Placement Constraints
• 3.2.x (3.2.0, Oct '18): Submarine, node attributes, service upgrade, containerization improvements
7. Unified Data Operating System
(Diagram: SLA-bound ML/deep learning, NoSQL, ad-hoc, streaming, service, and SQL compute workloads all running on Apache Hadoop YARN, which balances utilization and resources.)
8. Focus Areas
• Continue to evolve at large scale
  - Scale
  - Global Scheduling
• Unified platform
  - Container runtime and services
  - Placement constraints
• Beyond: Submarine / CSI
9. Scale Today
• Many sites run clusters with very large numbers of nodes
• Oath (Yahoo!), Twitter, LinkedIn, Microsoft, Alibaba, etc.
• 50K nodes in a single cluster at Microsoft [1]
• Roadmap: to 100K nodes and beyond
[1] https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/
10. Global Scheduling
(Diagram: multiple placement threads, Thread 1..3, read the scheduler state and each produces a placement proposal, Proposal 1..3; a single committer accepts or rejects the proposals.)
11. Global Scheduling: Takeaways
• Addresses hotspot issues
• Allows plugging in customized node-scoring policies (customized node-selection schemes)
• Scoring can run in the background or in-place
• Not a fit for clusters that only run small batch jobs
(A sketch of the proposal/committer flow shown above follows.)
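To make the diagram on slide 10 concrete, here is a minimal, purely illustrative Java sketch of the proposal/committer pattern; every type in it (Proposal, SchedulerStateView, SchedulerState) is a hypothetical placeholder, not a real YARN class.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class GlobalSchedulingSketch {
  // Hypothetical placeholder types, not real YARN classes.
  interface Proposal {}
  interface SchedulerStateView { Proposal scoreNodesAndPropose(); }
  interface SchedulerState { boolean isStillValid(Proposal p); void commit(Proposal p); }

  private final BlockingQueue<Proposal> proposals = new LinkedBlockingQueue<>();

  // Several of these run concurrently ("Thread 1..3" in the diagram): each
  // scores candidate nodes against a read-only snapshot and emits a proposal.
  void placementLoop(SchedulerStateView snapshot) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      proposals.put(snapshot.scoreNodesAndPropose()); // pluggable scoring policy
    }
  }

  // A single committer is the only writer: it re-validates each proposal
  // against the live state (the snapshot may be stale) before committing.
  void committerLoop(SchedulerState liveState) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Proposal p = proposals.take();
      if (liveState.isStillValid(p)) {
        liveState.commit(p);
      } // Rejected proposals are dropped; the placement threads keep proposing.
    }
  }
}

Decoupling proposal generation from the single commit point is what lets node scoring run in the background without blocking allocations.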
12. Docker Containers
• Better packaging model
• Lightweight mechanism for packaging and resource isolation
• Popularized and made accessible by Docker
• Natively integrated in YARN
  - Docker container runtime
  - Many security/usability improvements added in 3.x
13. Container Runtime: run both Docker and non-Docker containers on the same cluster (see the sketch below)
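As a concrete sketch, the runtime is chosen per container through the documented YARN_CONTAINER_RUNTIME_TYPE / YARN_CONTAINER_RUNTIME_DOCKER_IMAGE environment variables (assuming the admin has allowed the docker runtime in yarn-site.xml); the image and command below are arbitrary examples.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

class DockerRuntimeExample {
  // Build a launch context that opts this one container into the Docker
  // runtime; containers that omit these env vars keep the default runtime,
  // so Docker and non-Docker containers coexist on the same cluster.
  static ContainerLaunchContext dockerContext() {
    Map<String, String> env = new HashMap<>();
    env.put("YARN_CONTAINER_RUNTIME_TYPE", "docker");
    env.put("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "centos:7"); // example image
    return ContainerLaunchContext.newInstance(
        Collections.emptyMap(),                 // no local resources
        env,
        Collections.singletonList("sleep 60"),  // container command
        null, null, null);                      // serviceData, tokens, ACLs
  }
}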
14. YARN Service: a simplified service framework
• Service discovery: DNS registration
• Service timeout
• Upgrade
• Integrated REST API / CLI / UI (example below)
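A minimal sketch of launching a service through the integrated REST API, assuming the YARN service API is enabled on the RM web endpoint and using the documented /app/v1/services path; the host and the Yarnfile-style JSON values are placeholders.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class LaunchServiceExample {
  public static void main(String[] args) throws Exception {
    // Minimal Yarnfile-style spec: one component, two long-running containers.
    String spec = "{"
        + "\"name\": \"sleeper-service\", \"version\": \"1.0\","
        + "\"components\": [{"
        + "  \"name\": \"sleeper\", \"number_of_containers\": 2,"
        + "  \"launch_command\": \"sleep 900000\","
        + "  \"resource\": {\"cpus\": 1, \"memory\": \"256\"}"
        + "}]}";

    URL url = new URL("http://rm-host:8088/app/v1/services"); // placeholder host
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(spec.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}

The same spec file can also be launched from the CLI with: yarn app -launch <service-name> <spec.json>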
15. Service - UI
16. Placement Constraints
• Anti-affinity: don't place containers together
• Affinity: collocate containers
• Cardinality: control the number of containers per node/rack
• Expressed via constraint expressions, namespaces, service specs, and more (see the sketch below)
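These three constraint kinds map directly onto the PlacementConstraints builder API that ships with this feature; a short sketch, where the allocation tags "hbase-master", "zk", and "spark" are arbitrary examples:

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.*;

class ConstraintExamples {
  // Anti-affinity: never place two "hbase-master" containers on the same node.
  static final PlacementConstraint ANTI_AFFINITY =
      targetNotIn(NODE, PlacementTargets.allocationTag("hbase-master")).build();

  // Affinity: place containers on racks that already run "zk" containers.
  static final PlacementConstraint AFFINITY =
      targetIn(RACK, PlacementTargets.allocationTag("zk")).build();

  // Cardinality: keep between 1 and 3 "spark" containers per node.
  static final PlacementConstraint CARDINALITY =
      cardinality(NODE, 1, 3, "spark").build();
}

Constraints like these are attached to scheduling requests whose containers carry the matching allocation tags.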
17. Node Attributes
1. Centralized/distributed node attributes
2. Distributed node attributes support config/script providers
3. Admin tools and RESTful APIs
(Diagram: requests with the expressions hostName=host1 and javaVersion>1.6 are matched against attributes held in RM state, e.g. rm.yarn.io/hostName="host1", nm.yarn.io/javaVersion="1.8", rm.yarn.io/hostType="new" on Host1 and rm.yarn.io/hostName="host2", nm.yarn.io/javaVersion="1.9", rm.yarn.io/hostType="old" on Host2; the hostName expression allocates on host1, while the javaVersion expression can allocate on host1 or host2. A code sketch follows.)
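Node attributes plug into the same constraint API; a minimal sketch, assuming the targetNodeAttribute / PlacementTargets.nodeAttribute methods and the NodeAttributeOpCode enum introduced with this feature in 3.2:

import org.apache.hadoop.yarn.api.records.NodeAttributeOpCode;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.*;

class NodeAttributeExample {
  // Ask for nodes whose distributed attribute javaVersion equals 1.8,
  // matching the nm.yarn.io/javaVersion="1.8" state shown on the slide.
  static final PlacementConstraint JAVA_18_NODES =
      targetNodeAttribute(NODE, NodeAttributeOpCode.EQ,
          PlacementTargets.nodeAttribute("javaVersion", "1.8")).build();
}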
18. Submarine: TF Hello World
Run distributed TensorFlow training with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker ..." \
  --num_ps 2 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps"
19. Submarine - Architecture
20. Container Storage Interface (CSI): The Adoption of CSI
• Container Storage Interface (CSI): a vendor-neutral interface that bridges container orchestrators and storage providers
(Diagram: YARN connected through CSI to storage providers such as Hadoop Ozone and Amazon EBS.)
21. Volume Resource: part of the container resource request
22. Deployment
(Diagram: on the master, the Resource Manager hosts a Volume Manager; on each slave, the Node Manager hosts a CSI driver adaptor that talks to a third-party CSI driver's identity, controller, and node plugins to mount volumes from the storage system into containers.)
23. Apache YARN 3.1 in Alibaba (Chunde Ren, Alibaba Real-time Computation)
24. The Ecosystem
(Diagram: BI, ads, recommendation, and search workloads run as streaming + batch jobs on Apache YARN over Apache HDFS, with security across the stack.)
25. The Challenges & Solutions
• YARN clusters
  - Tens of thousands of nodes; version 3.1.0 + patches
  - Long-running streaming jobs >> batch jobs, machine learning
  - SLA: latency, priority, failover
  - HA, performance, load balancing, placement constraints, isolation
• Resource utilization
  - Elastic queue capacity & preemption, oversubscription, hybrid clusters
26. Enhanced Capacity Scheduler
27. Load Balance – Node Scores
Dynamically optimize container distribution across the cluster
• Policy: per-app or per-queue
• Shuffle candidate nodes
• Auto-resize allocation threads
• Score cache
28. Load Balance – Reschedule
Eliminate hotspots and fragmentation
• NM/container utilization metrics
• Identify hotspot & idle nodes
• Preemption:
  - priority
  - candidate-app cache
  - preemption contract protocol
(Diagram: inside the Resource Manager, a Rescheduler's hotspot candidate selector reads scheduler state, marks containers of candidate apps via MARK_CONTAINER_FOR_PREEMPTION, and notifies the main scheduler. A hypothetical sketch follows.)
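Alibaba's rescheduler is internal, so the following Java sketch is only a hypothetical illustration of the loop described above (utilization metrics in, hotspot selection, mark-for-preemption out); every type and threshold here is a placeholder, not Alibaba's or upstream YARN's code.

import java.util.Comparator;
import java.util.List;

class ReschedulerSketch {
  // Hypothetical placeholder types.
  interface NodeMetrics { double utilization(); }
  interface Candidate { int appPriority(); }
  interface SchedulerHook { void markContainerForPreemption(Candidate c); }

  static final double HOTSPOT_THRESHOLD = 0.85; // assumed tunable

  // One pass: find hot nodes, pick the lowest-priority app from the cached
  // candidate list, and hand it to the scheduler via the preemption contract.
  static void rebalanceOnce(List<NodeMetrics> nodes,
                            List<Candidate> candidateCache,
                            SchedulerHook scheduler) {
    for (NodeMetrics node : nodes) {
      if (node.utilization() > HOTSPOT_THRESHOLD) {
        candidateCache.stream()
            .min(Comparator.comparingInt(Candidate::appPriority))
            .ifPresent(scheduler::markContainerForPreemption);
      }
    }
  }
}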
29. Resource Oversubscription
Container execution types: Guaranteed (G) / Opportunistic (O)
• Scheduling mode
  - Distributed scheduler
  - Centralized, via an extended CapacityScheduler
• Isolation
  - CPU: opportunistic containers get cpu shares = 2
  - Memory: OOM controller + killer vs. QoS monitor
  - Network: net_cls + tc + switch
  - Disk I/O: HDD blkio weight vs. SSD
• Preemption
  - High/low watermarks
  - Opportunistic app priority < guaranteed app priority
Result: automated operation, utilization improved by >10% (see the request sketch below)
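The Guaranteed/Opportunistic execution types build on upstream YARN's opportunistic-container support; a minimal sketch of requesting an opportunistic container via AMRMClient, as I understand the 3.x client builder API (resource size and priority are arbitrary examples):

import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;

class OpportunisticRequestExample {
  // Ask for a container the NM may queue and run opportunistically;
  // guaranteed requests simply omit the execution-type setting.
  static AMRMClient.ContainerRequest opportunisticRequest() {
    return AMRMClient.ContainerRequest.newBuilder()
        .capability(Resource.newInstance(1024, 1))  // 1 GB, 1 vcore
        .priority(Priority.newInstance(10))
        .executionTypeRequest(
            ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true))
        .build();
  }
}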
30. Hybrid Cluster
Improve resource utilization of online-service clusters
• Architecture: YARN on K8s
• Isolation: powered by aliKernel
• Use case: machine learning & deep learning
