SenseTime, Sun Peng: Communication Optimization in Large-Scale Distributed Deep Neural Network Training Systems

CodeWarrior

Published 2019/07/08 in the Programming category

GIAC2019 

Text content
3. A single machine cannot handle big models and big data
• Big Model: A variety of deep learning (DL) models (e.g. RNN, CNN) have been proposed in recent years.
• Big Data: Big data is an essential factor for DL models to achieve higher precision and accuracy.
• Model size: AlexNet has 61M parameters.
• Data size: far larger than the memory capacity of a single machine or GPU.
• Model FLOP: ResNet-152 needs 22 GFLOP.
4. Leveraging large-scale GPU clusters and distributed DL to handle big data and big models
• Big Model: A variety of deep learning (DL) models (e.g. RNN, CNN) have been proposed in recent years.
• Big Data: Big data is an essential factor for DL models to achieve higher precision and accuracy.
• Big Compute + Distributed DL: Training a single DL model with a number of high-performance computing resources.
5. Large-scale distributed DL could efficiently improve productivity for research and industry
If the training time of a DL application is:
• Minutes/hours: Interactive research. Dynamically update models. Especially useful for AutoML.
• Days: Tolerable. Train multiple models.
• Weeks: Dangerous. Only train DL models in parallel for high-value tasks.
• Months: Do not do this.
Using distributed DL to speed up DL training.
6. Two popular system architectures for distributed DL
[Figure: (a) Master-Slave Architecture (Parameter Server Approach): GPU workers send gradients to a parameter server and receive params back. (b) P2P Architecture (AllReduce Approach): GPU workers exchange gradients with each other over the network.]
We focus on the network optimization for the AllReduce approach.
Two clusters:
• Cluster P: 128 PCIe Pascal GPUs connected by InfiniBand
• Cluster V: 512 PCIe Volta GPUs connected by InfiniBand
Two classic models: AlexNet and ResNet-50
Two systems: PyTorch and System-I (an internal system in SenseTime)
7. A baseline system architecture for distributed DL
[Figure: data-parallel training on GPU 0 to GPU 3. Each GPU holds a full copy of the params and runs a forward pass (Data Layer -> Layer 1 -> … -> Layer n -> Loss) and a backward pass (Layer n -> … -> Layer 1) that produces local gradients L1, L2, …, Ln. An AllReduce operation over the gradient tensors of Layer 1, Layer 2, …, Layer n aggregates them, and every GPU updates its parameters with the aggregated gradients.]
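The baseline loop on this slide (local forward and backward pass, AllReduce of the gradients, identical update on every GPU) can be sketched in plain Python. This is a minimal simulation under stated assumptions, not System-I or PyTorch code: the toy linear model, the four simulated GPUs, the learning rate, and all function names are hypothetical, and the AllReduce is replaced by an in-process average.

```python
# Minimal simulation of synchronous data-parallel SGD.
# All names and the toy model y = w * x + b are illustrative assumptions.

def allreduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (stand-in for AllReduce)."""
    n_workers = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    return [s / n_workers for s in summed]

def local_gradient(params, batch):
    """Gradient of the mean squared error on one worker's shard."""
    w, b = params
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
    return [gw, gb]

def train_step(replicas, shards, lr=0.05):
    """Local backward pass per worker, AllReduce, then the same update everywhere."""
    grads = [local_gradient(p, s) for p, s in zip(replicas, shards)]
    avg = allreduce_mean(grads)
    return [[p - lr * g for p, g in zip(params, avg)] for params in replicas]

# Toy data generated from y = 3x + 1, split into one shard per simulated GPU.
data = [(x, 3.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
shards = [data[i::4] for i in range(4)]
replicas = [[0.0, 0.0] for _ in range(4)]   # every GPU starts from the same params

for _ in range(500):
    replicas = train_step(replicas, shards)
```

Because every replica applies the same aggregated gradient, the parameter copies stay identical without ever being exchanged; only gradients cross the network.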
8. The performance of Gloo and MPI is poor on Cluster P
[Figure: two plots, AlexNet on Cluster P and ResNet-50 on Cluster P. Series: PyTorch (Gloo), PyTorch (Ideal), System-I (MPI), System-I (Ideal). Y axis: Images/s. X axis: GPU Number, 1 to 128.]
• RDMA is enabled on Cluster P in all tests.
• MPI, 128 GPUs: system utilization is reported for AlexNet and ResNet-50.
• Gloo, 128 GPUs: cannot start up the training task for either model (AlexNet and ResNet-50).
9. Network optimization tricks
• Trick 1: Using NCCL Ring AllReduce for distributed DL
• Trick 2: Mixed Precision Training
• Trick 3: Communication/Computation Overlap
• Trick 4: Fused Communication (Lazy AllReduce)
• Trick 5: Coarse-Grained Sparse Communication
Near-linear speedup ratio for DNN training on 512 GPUs.
10. Trick 1: Using NCCL with Ring AllReduce
[Figure: two diagrams. Left: data transfer to and from a single reducer GPU. Right: GPUs arranged in a logical ring.]
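The ring schedule can be illustrated with a small in-process simulation. This is a sketch of the generic ring AllReduce algorithm (a reduce-scatter phase followed by an allgather phase), not NCCL's implementation; the function name and the list-of-lists representation of per-GPU buffers are assumptions made for illustration.

```python
def ring_allreduce(buffers):
    """Sum equal-length lists across workers using the ring schedule.

    buffers[r] is worker r's local vector; its length must be divisible by
    the number of workers. Returns one vector per worker, each holding the sum.
    """
    n = len(buffers)
    size = len(buffers[0]) // n
    # chunks[r][c] is worker r's copy of chunk c.
    chunks = [[list(b[c * size:(c + 1) * size]) for c in range(n)] for b in buffers]

    # Phase 1, reduce-scatter: at each step worker r sends one chunk to its
    # right neighbour, which adds it to its own copy. After n - 1 steps,
    # worker r holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n][:]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, allgather: the reduced chunks travel once more around the ring,
    # overwriting the stale copies.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n][:]) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]

# Four simulated GPUs, each with an 8-element gradient vector.
buffers = [[r * 10 + i for i in range(8)] for r in range(4)]
result = ring_allreduce(buffers)
```

Each worker sends 2(n - 1) chunks of 1/n of the vector, so the traffic per worker stays close to twice the vector size regardless of the number of GPUs, whereas a single reducer has to receive the full vector from every other GPU.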
11. NCCL could improve the whole system performance, but not enough
[Figure: two plots, AlexNet on Cluster P and ResNet-50 on Cluster P. Series: PyTorch (Horovod), PyTorch (NCCL), PyTorch (Ideal), System-I (NCCL), System-I (Ideal). Y axis: Images/s. X axis: GPU Number, 1 to 128. Each plot is annotated with the approximate utilization reached with Trick 1.]
• NCCL, 128 GPUs: system utilization is reported for AlexNet and ResNet-50.
12. Optimization Trick 2: Mixed Precision Training
[Figure: on each of GPU 1 to GPU n, the FP32 params are converted FP32->FP16, the forward and backward passes produce FP16 gradients, AllReduce runs on the FP16 gradients, the result is converted FP16->FP32, and the SGD update is applied to the FP32 params.]
Reduce the generated network traffic to half (FP32 -> FP16).
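The traffic saving can be checked with a short sketch that packs gradients as IEEE 754 half precision before they are "sent" and keeps the master parameters in higher precision for the update. This is an illustration under assumptions, not the System-I implementation: the function names are hypothetical, Python floats (which are 64-bit) stand in for the FP32 master copy, and the AllReduce is simulated by decoding each payload and averaging in-process rather than summing in FP16 on the wire.

```python
import struct

def to_fp16_bytes(values):
    """Pack floats as IEEE 754 half precision (2 bytes per value)."""
    return struct.pack(f"<{len(values)}e", *values)

def from_fp16_bytes(payload):
    """Unpack a half-precision payload back into Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))

def to_fp32_bytes(values):
    """Pack floats as single precision (4 bytes per value), for comparison."""
    return struct.pack(f"<{len(values)}f", *values)

def mixed_precision_step(master_params, per_gpu_grads, lr=0.1):
    """FP16 gradients on the wire, higher-precision master params for the update."""
    wire = [to_fp16_bytes(g) for g in per_gpu_grads]             # what each GPU sends
    decoded = [from_fp16_bytes(w) for w in wire]                 # FP16 -> float
    avg = [sum(vals) / len(decoded) for vals in zip(*decoded)]   # AllReduce stand-in
    new_params = [p - lr * g for p, g in zip(master_params, avg)]
    return new_params, sum(len(w) for w in wire)

grads = [[0.5, -0.25, 0.125], [0.25, 0.25, 0.125]]   # two simulated GPUs
params, fp16_bytes = mixed_precision_step([1.0, 1.0, 1.0], grads)
fp32_bytes = sum(len(to_fp32_bytes(g)) for g in grads)
```

Here `fp16_bytes` is half of `fp32_bytes`, which is the halving of network traffic the slide refers to. Real mixed-precision training usually also needs care with small gradient values that underflow in FP16; that is outside this sketch.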
13. Trick 2: Mixed Precision Training
The computation phase of DNN training becomes faster, which is powered by Tensor Core.
14. The overall system utilization is still poor even with NCCL + Mixed Precision Training on Cluster P
[Figure: two plots, AlexNet on Cluster P and ResNet-50 on Cluster P. Series: System-I (Mixed-Precision), Ideal (Mixed-Precision), System-I (Single-Precision), Ideal (Single-Precision). Y axis: Images/s. X axis: GPU Number, 2 to 128. Each plot is annotated with the approximate utilization reached with Trick 2.]
The overall utilization of ResNet-50 training becomes lower due to the improved computation performance.
15. Evaluation results of System-I with NCCL and Mixed Precision Training on Cluster V
[Figure: two plots, AlexNet on Cluster V and ResNet-50 on Cluster V. Series: System-I (Mixed-Precision), Ideal (Mixed-Precision), System-I (Single-Precision), Ideal (Single-Precision). Y axis: Images/s. X axis: GPU Number, 2 to 512. Each plot is annotated with the approximate utilization reached with Trick 2.]
Note: Cluster V is powered by Tensor Core.
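16. Trick 3: Communication/Computation Overlap
[Figure: two timelines of the backward pass. Without overlap: Layer n, Layer n-1, …, Layer 1 compute their gradients, then a synchronization point, then AllReduce of gradients L1, L2, …, Ln. With overlap: the AllReduce of the Ln gradients starts as soon as Layer n has finished its backward pass and runs while Layer n-1, …, Layer 1 are still computing, ending with the L1 gradients.]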
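The benefit of overlapping can be estimated with a simple timeline model. The sketch below is an analytic illustration, not a measurement or the System-I scheduler: it assumes one compute stream and one communication link per GPU, and the per-layer timings in the example are made-up numbers.

```python
def step_time_without_overlap(backward_ms, allreduce_ms):
    """All backward passes finish first, then every gradient is AllReduced."""
    return sum(backward_ms) + sum(allreduce_ms)

def step_time_with_overlap(backward_ms, allreduce_ms):
    """AllReduce of a layer starts once its gradient is ready and the link is free.

    Both lists are ordered as the backward pass visits the layers:
    Layer n first, Layer 1 last.
    """
    compute_done = 0.0   # time at which the current layer's gradient is ready
    link_free = 0.0      # time at which the previous AllReduce has finished
    for bwd, comm in zip(backward_ms, allreduce_ms):
        compute_done += bwd
        start = max(compute_done, link_free)
        link_free = start + comm
    return max(compute_done, link_free)

# Hypothetical per-layer timings in milliseconds (Layer 4 down to Layer 1).
backward_ms = [4.0, 3.0, 2.0, 1.0]
allreduce_ms = [2.0, 2.0, 2.0, 2.0]

no_overlap = step_time_without_overlap(backward_ms, allreduce_ms)   # 18.0
overlap = step_time_with_overlap(backward_ms, allreduce_ms)         # 13.0
```

In this model only the communication that cannot be hidden behind the remaining backward computation (here the AllReduce of the last layers to finish) still adds to the step time.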
17. Communication/Computation Overlap works obviously on Cluster V
[Figure: two plots, AlexNet on Cluster V and ResNet-50 on Cluster V. Series: Disable-Overlap, Enable-Overlap, Ideal. Y axis: Images/s. X axis: GPU Number, 2 to 512. Each plot is annotated with the approximate utilization reached with Trick 3.]
Tests enable NCCL and Mixed Precision Training.
18. Trick 4: Fused Communication (Lazy AllReduce)
[Figure: during the backward pass on GPU A (Layer n, Layer n-1, Layer n-2, …, Layer 1), the gradient tensors (Tensor m, Tensor m-1, Tensor m-2, Tensor m-3, …, Tensor 1) are placed in a memory pool, and one AllReduce is issued for a group of tensors instead of one per tensor.]
Reduce multiple layers in a single operation.
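A minimal sketch of the fusing idea: gradient tensors are appended to a pool as the backward pass produces them, and a single AllReduce is issued for the whole pool once enough data has accumulated. The class name, the element-count threshold, and the injected `allreduce_fn` callback are all hypothetical; the slide does not say which flush policy System-I uses.

```python
class LazyAllReduce:
    """Buffer gradient tensors in a pool and reduce them in one fused call."""

    def __init__(self, allreduce_fn, threshold_elems):
        self.allreduce_fn = allreduce_fn    # called with one flat list per flush
        self.threshold = threshold_elems    # hypothetical tuning knob
        self.pool = []                      # pending (name, tensor) pairs
        self.pending = 0                    # number of buffered elements
        self.results = {}                   # name -> reduced tensor

    def add(self, name, tensor):
        """Queue a gradient; flush once the pool reaches the threshold."""
        self.pool.append((name, tensor))
        self.pending += len(tensor)
        if self.pending >= self.threshold:
            self.flush()

    def flush(self):
        """Issue one AllReduce for everything in the pool, then split the result."""
        if not self.pool:
            return
        flat = [x for _, tensor in self.pool for x in tensor]
        reduced = self.allreduce_fn(flat)
        offset = 0
        for name, tensor in self.pool:
            self.results[name] = reduced[offset:offset + len(tensor)]
            offset += len(tensor)
        self.pool, self.pending = [], 0

calls = []

def fake_allreduce(flat):
    """Stand-in for AllReduce over two identical workers: records the call, doubles values."""
    calls.append(len(flat))
    return [2 * x for x in flat]

lazy = LazyAllReduce(fake_allreduce, threshold_elems=5)
for name, grad in [("layer4", [1, 2, 3]), ("layer3", [4, 5]),
                   ("layer2", [6, 7, 8, 9]), ("layer1", [10])]:
    lazy.add(name, grad)
lazy.flush()   # reduce anything still pending at the end of the backward pass
```

With these sizes the four tensors are reduced in two fused calls instead of four, which trades a little latency for fewer, larger messages.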
19. Lazy AllReduce significantly improves cluster utilization
[Figure: two plots, AlexNet on Cluster V and ResNet-50 on Cluster V. Series: Disable-Overlap, Enable-Overlap, Ideal. Y axis: Images/s. X axis: GPU Number, 2 to 512. Each plot is annotated with the approximate utilization reached with this trick.]
Tests enable NCCL, Mixed Precision Training, and Communication/Computation Overlap.
20. Trick 5: Coarse-Grained Sparse Communication
[Figure: during the backward pass (Layer n, Layer n-1, Layer n-2, …, Layer 1), GPU A and GPU B each place their gradient tensors (Tensor m, Tensor m-1, Tensor m-2, Tensor m-3, …, Tensor 1) in a memory pool, and AllReduce runs only on selected tensors.]
Only transmit important gradients in the AllReduce operation.
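One way to make "only transmit important gradients" concrete at chunk granularity is sketched below. The selection rule is an assumption, since the slide does not define "important": here a chunk's importance is its L1 norm summed over workers, the top `keep_chunks` chunks are reduced densely, and everything else is kept as a local residual that is added to the next step's gradient. All names and parameters are hypothetical.

```python
def sparse_allreduce(per_worker_grads, residuals, chunk_size, keep_chunks):
    """Reduce only the most important chunks; keep the rest as local residuals.

    Returns (reduced vector, new per-worker residuals, indices of chosen chunks).
    """
    length = len(per_worker_grads[0])
    # Add what was left unsent in earlier steps to the fresh gradients.
    acc = [[g + r for g, r in zip(grad, res)]
           for grad, res in zip(per_worker_grads, residuals)]
    starts = list(range(0, length, chunk_size))
    # Chunk importance: L1 norm summed over workers, so all workers agree on
    # the selection (in a real system this would itself be a small AllReduce).
    score = [sum(abs(x) for w in acc for x in w[s:s + chunk_size]) for s in starts]
    ranked = sorted(range(len(starts)), key=lambda i: -score[i])
    chosen = sorted(ranked[:keep_chunks])

    reduced = [0.0] * length
    new_residuals = [list(w) for w in acc]
    for i in chosen:
        for j in range(starts[i], min(starts[i] + chunk_size, length)):
            reduced[j] = sum(w[j] for w in acc)   # dense AllReduce on this chunk
            for res in new_residuals:
                res[j] = 0.0                      # sent, so nothing is left over
    return reduced, new_residuals, chosen

grads = [[0.1, -0.1, 2.0, 1.0, 0.0, 0.2],
         [0.0, 0.1, 1.0, -3.0, 0.1, 0.0]]
zeros = [[0.0] * 6, [0.0] * 6]
reduced, residuals, chosen = sparse_allreduce(grads, zeros, chunk_size=2, keep_chunks=1)
```

In this example only the middle chunk is communicated, so one third of the gradient crosses the network; the small gradients in the other chunks are not lost but accumulate locally until their chunk is selected.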
21. Sparse communication further improves utilization
[Figure: two plots, AlexNet on Cluster V and ResNet-50 on Cluster V. Series: Disable-Overlap, Enable-Overlap, Ideal. Y axis: Images/s. X axis: GPU Number, 2 to 512. Each plot is annotated with the approximate utilization reached with this trick.]
Tests enable NCCL, Mixed Precision Training, Communication/Computation Overlap, and Lazy AllReduce.
22. Summary of all 5 network tricks for distributed DL
23. Final Trick: Advanced Hardware