Da Ma: Volcano, Running High-Performance Jobs on Kubernetes

1. Volcano: Bring Batch Capability Into Kubernetes
Da Ma (@k82cn), Expert at Huawei
4. Da Ma
• Software Architect
• Kubernetes SIG-Scheduling co-leader
• Volcano & kube-batch creator
• Expert at Huawei (now)
• Ex-IBM Spectrum CE/L3 team/tech lead
• Jilin University master's degree, majoring in grid computing and distributed systems
6. Gaps
• Job/Queue Management
  - Queue status/configuration
  - Hierarchical queue
  - Job with multiple pod templates
  - Lifecycle management of Job, e.g. restart, suspend/resume
  - Error handling, e.g. restart job if a pod fails (MPI, TFJob)
  - Indexed Job
  - Task dependency, e.g. Spark (executor/driver)
  - Delayed pod creation
  - …
• Runtime
  - Singularity
  - …
• Scheduler
  - Coscheduling
  - Fair-share
  - Queue
  - Preemption/Reclaim
  - Reserve/Backfill
  - Topology (network, accelerator)
  - …
• Others
  - Throughput
  - Round-trip
  - Data locality (data-aware scheduling)
  - …
8. Overall Architecture
Domain frameworks:
• Deployment/installation of the framework on Kubernetes
• Map the framework's terms/concepts into common concepts, e.g. Job, Queue
• Enable related features for the framework, e.g. gang-scheduling for TensorFlow training
Common services for high-performance workloads:
• Batch scheduling, e.g. fair-share, gang-scheduling
• Enhanced job management, e.g. multiple pod templates, error handling
• Accelerators, e.g. GPU, FPGA
• kubectl plugins, e.g. show Job/Queue information
9. Overall Architecture
• kubectl creates a JobEx object in the apiserver if all admission checks pass
• JobExController creates Pods based on the job's replicas and templates
• vk-scheduler gets the notification of the Pod from the apiserver
• vk-scheduler chooses one host for the Pod of the JobEx based on its policy
• The policy in vk-scheduler is pluggable, e.g. DRF, Priority, Gang
• vk-controllers include JobExController and QueueController
• Volcano handles high-performance workloads
• kubelet gets the notification of the Pod from the apiserver and then starts the container
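To make this flow concrete, here is a minimal sketch of a job manifest. It assumes the upstream Volcano CRD names as published later (batch.volcano.sh/v1alpha1 Job, scheduler name volcano); the slide calls these JobEx and vk-scheduler, so exact names may differ for the version shown in the talk, and the job name, image, and command are placeholders.

```yaml
# Minimal sketch of a Volcano Job (the slide's "JobEx").
# Assumes the batch.volcano.sh/v1alpha1 API; names may differ per release.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: example-job          # hypothetical name
spec:
  schedulerName: volcano     # hand the pods to the Volcano scheduler (vk-scheduler in the slide)
  minAvailable: 3            # gang constraint: bind only when 3 pods can start together
  queue: default             # queue used for fair-share between jobs
  tasks:                     # multiple pod templates in one job
    - replicas: 3
      name: worker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: busybox                               # placeholder image
              command: ["sh", "-c", "echo running && sleep 60"]
```

Once kubectl creates this object, the job controller expands the tasks into Pods, and the scheduler binds them only when the minAvailable gang can be satisfied, matching the flow listed above.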
10. Scenarios: MPI
[Diagram: an mpirun launcher pod driving worker_1, worker_2, worker_3]
• Multiple pod templates
• Lifecycle policy
• Gang-scheduling
• ssh or kubectl
• Complete the job when mpirun completes
• Headless service
11. Scenarios: MPI
• The worker pods are deleted by the job controller, since they only run sshd and never exit on their own
• The mpiexec/mpirun pod is not deleted, so its output remains available
• The mpiexec/mpirun pod may restart a few times because of network setup delay
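A hedged sketch of how this MPI scenario maps onto a Volcano Job: two pod templates (launcher and workers), the ssh/svc plugins for passwordless ssh and a headless service, and lifecycle policies that complete the job when mpirun finishes. The image, commands, replica counts, and hostfile path follow the upstream Volcano MPI example and may differ by version.

```yaml
# Sketch of an MPI job with two pod templates and lifecycle policies.
# Based on the upstream Volcano MPI example; images/commands are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
spec:
  schedulerName: volcano
  minAvailable: 4                  # gang-schedule the launcher plus all workers
  plugins:
    ssh: []                        # inject ssh keys so mpirun can reach the workers
    svc: []                        # headless service for stable hostnames
  policies:
    - event: PodEvicted
      action: RestartJob           # restart the whole job if any pod is evicted
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob      # complete the job when mpirun finishes
      template:
        spec:
          containers:
            - name: mpimaster
              image: volcanosh/example-mpi:0.0.1   # placeholder image
              # hostfile path assumed from the svc plugin in the upstream example
              command: ["/bin/sh", "-c", "mpirun -np 3 --hostfile /etc/volcano/mpiworker.host hostname"]
    - replicas: 3
      name: mpiworker
      template:
        spec:
          containers:
            - name: mpiworker
              image: volcanosh/example-mpi:0.0.1   # placeholder image
              command: ["/usr/sbin/sshd", "-D"]    # workers only run sshd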
12. MPI Demo Video
13. Scenarios: Fair Share
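Fair share between tenants is expressed through weighted queues. Below is a minimal sketch of two Queue objects, assuming the scheduling.volcano.sh/v1beta1 API and illustrative queue names; with these weights the scheduler aims to split contended cluster resources between the two queues roughly 2:1.

```yaml
# Two weighted queues for fair-share; API group/version assumed from later Volcano releases.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a
spec:
  weight: 2            # gets roughly twice the share of team-b under contention
  reclaimable: true    # resources above its share can be reclaimed by other queues
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-b
spec:
  weight: 1
  reclaimable: true
```

A Volcano Job selects its queue through spec.queue, as in the earlier Job sketch.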
14. Job Fair-Share Demo Video
15. Gang-scheduling: Job Execution Time
• Case 1: 1 job with 2 PS + 4 workers
• Case 2: 2 jobs with 2 PS + 4 workers each
• Case 3: 5 jobs with 2 PS + 4 workers each
• Without gang-scheduling, there are not enough resources for 2 jobs to run concurrently, so one of them wastes resources
• Only 2 of the 5 jobs finished because of deadlock (20+ hours)
http://status.openlabtesting.org/builds?project=theopenlab%2Fvolcano
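The deadlock above is what gang-scheduling prevents: pods are bound only when the whole gang fits. For frameworks that create plain Pods rather than Volcano Jobs, the gang is expressed as a PodGroup; a minimal sketch follows, with the API version assumed from later Volcano releases and the name and member count purely illustrative.

```yaml
# A PodGroup declaring that 6 pods (2 PS + 4 workers) must start together.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tf-job-1          # hypothetical name
spec:
  minMember: 6            # scheduler binds nothing until all 6 pods can be placed
  queue: default
```

The framework's pods then set schedulerName: volcano and reference this group (via a group-name annotation) so the scheduler treats them as one gang.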
16. Task-Topology + Binpack
• Total execution time of 3 jobs, with 2 PS + 4 workers for each job
• The execution time is unstable when tested with the default scheduler
• The improvement depends on the amount of data exchanged between pods
• Task topology within a Job also improves the scheduler's performance
• Open source at volcano-sh/volcano#272
Reference: "Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters"
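Binpack and task-topology behaviours are scheduler plugins enabled in the scheduler configuration. A hedged sketch of a volcano-scheduler.conf, assuming plugin names and the actions/tiers format from later Volcano releases; the exact defaults may differ from the version demonstrated in the talk.

```yaml
# Sketch of a Volcano scheduler config enabling gang, binpack and task-topology plugins.
# Plugin names assume a later release; defaults and tiers may differ.
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
  - plugins:
      - name: drf            # dominant resource fairness across jobs/queues
      - name: predicates
      - name: nodeorder
      - name: binpack        # pack pods tightly to reduce fragmentation
      - name: task-topology  # co-locate tasks that exchange a lot of data
```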
17. Integrations
Framework               Status      API
MPI                     Done        Volcano Job
Horovod                 Done        Volcano Job
Kubeflow/tf-operator    Done        PodGroup
Kubeflow/arena          Done        Volcano Job
Spark-Operator          On-going    PodGroup
Cromwell                On-going    Volcano Job
PaddlePaddle            On-going    Volcano Job
…                       On-going    Volcano Job / PodGroup
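As an example of how a framework integration plugs in, the Spark operator path (listed as on-going above) lets a SparkApplication delegate driver/executor scheduling to Volcano. A hedged sketch, assuming the batchScheduler field introduced by the spark-operator's Volcano integration; versions, image, and resource figures are placeholders.

```yaml
# Sketch of a SparkApplication handing scheduling to Volcano via the operator's batchScheduler field.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  sparkVersion: "2.4.5"                         # placeholder version
  image: gcr.io/spark-operator/spark:v2.4.5     # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
  batchScheduler: volcano     # operator creates a PodGroup and sets schedulerName for driver/executors
  driver:
    cores: 1
    memory: 512m
  executor:
    cores: 1
    instances: 2
    memory: 512m
```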
18. Pipeline
• GPU share/topology
• Job management
• Queue management
• Hierarchical queue
• Preemption/Reclaim
• …
Call for Contribution