TuSimple, Jin Jiangming: Exploring Key Technologies for Autonomous Driving

CodeWarrior

Published 2019/07/08 in the Programming category

GIAC2019 

Text content
3. Quantitative risk calculation
8. Milestones: 1980s – 2000s • Google, 2009 – 2014 • L2 – L4, 2015 –
12. Sensor data rates: Image 20 Hz • Lidar 20 Hz • GPS 50 Hz
21. CNN vs. BNN, key differences: 1. Inputs are binarized (−1 or +1); 2. Weights are binarized (−1 or +1); 3. Results are binarized. [Figure: a full-precision weights × inputs product next to its binary counterpart with binary outputs.]
23. BNN convolution with bit operations. Step 1: bit-packing of the input along the channel dimension. Step 2: bit-packing of the weights along the channel dimension. Step 3: perform convolution on the compressed input tensor and weights with bit operations. Optimization: utilize vector parallelism on the C dimension; utilize multi-core parallelism on the H and W dimensions.
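The XNOR-plus-popcount trick behind Step 3 fits in a few lines. Below is a minimal NumPy sketch (illustrative only: `bitpack` and `binary_dot` are hypothetical helper names, and BitFlow's real kernels use SIMD popcount instructions and multi-core loops over H and W, which NumPy does not model). Once +1 is encoded as bit 1 and −1 as bit 0, the dot product of two {−1, +1} vectors equals n − 2·popcount(a XOR b).

```python
import numpy as np

def bitpack(x):
    # Pack a {-1, +1} vector along its last (channel) axis into uint8 words:
    # +1 -> bit 1, -1 -> bit 0. np.packbits zero-pads the tail bits.
    return np.packbits((x > 0).astype(np.uint8), axis=-1)

def binary_dot(a_packed, b_packed, n_channels):
    # dot(a, b) over {-1, +1} equals n - 2 * popcount(a XOR b):
    # XOR marks the positions where the two signs disagree.
    mismatches = np.unpackbits(a_packed ^ b_packed, axis=-1)[..., :n_channels]
    return n_channels - 2 * int(mismatches.sum())

# Sanity check against the full-precision dot product.
C = 64
a = np.random.choice([-1, 1], size=C).astype(np.int32)
b = np.random.choice([-1, 1], size=C).astype(np.int32)
assert binary_dot(bitpack(a), bitpack(b), C) == int(a @ b)
```

A binarized convolution applies this dot product once per output pixel; vectorizing over the packed C words and parallelizing over H and W are exactly the two optimizations the slide names.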
24. BitFlow results.

| | Full-precision VGG | Binarized VGG |
|---|---|---|
| MNIST (%) | 99.4 | 98.2 |
| CIFAR10 (%) | 92.5 | 87.8 |
| ImageNet top-5 (%) | 88.4 | 76.8 |
| Model size (MB) | 534 | 17 |

On Xeon Phi, an 8.9% improvement over GTX 1080 for VGG16 and 9.1% for VGG19; Intel i7 achieves ~80% of GTX 1080 performance for VGG16 and VGG19. TuSimple HPC, Tsinghua U., U. Rochester, "BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU", in IEEE IPDPS 2018.
26. TVM: https://tvm.ai/
28. U. Washington, AWS, SJTU, UC Davis, TuSimple HPC, "TVM: End-to-End Optimization Stack for Deep Learning", https://arxiv.org/pdf/1802.04799v1.pdf & OSDI 2018.
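The core of TVM's stack is separating *what* an operator computes from *how* it is scheduled onto a target. A minimal sketch of that workflow (written against the current `tvm.te` API; older releases exposed the same calls directly under `tvm.*`), compiling a vector add for the LLVM CPU back end:

```python
import tvm
from tvm import te

# Declare the computation: C[i] = A[i] + B[i].
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule it: split the loop, run outer iterations on multiple
# cores and inner iterations as SIMD lanes.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=64)
s[C].parallel(outer)
s[C].vectorize(inner)

# Lower to machine code; swapping target="cuda" (and binding the axes
# to GPU threads) retargets the same computation to a GPU.
f = tvm.build(s, [A, B, C], target="llvm")
```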
29. PointCloud DSL (http://pointclouds.org/): Invoke → DSL → Clang/LLVM IR → X86 / ARM / CUDA GPU.
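The one-IR-many-back-ends idea on this slide can be demonstrated with llvmlite (a hedged stand-in: the talk's PointCloud DSL front end is not public, so this only shows LLVM IR being emitted and then lowered for the host target):

```python
from llvmlite import ir
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Front end: emit LLVM IR for `double scale(double x) { return x * 2.0; }`.
module = ir.Module(name="pointcloud_dsl_demo")
fnty = ir.FunctionType(ir.DoubleType(), [ir.DoubleType()])
fn = ir.Function(module, fnty, name="scale")
builder = ir.IRBuilder(fn.append_basic_block("entry"))
builder.ret(builder.fmul(fn.args[0], ir.Constant(ir.DoubleType(), 2.0)))

# Back end: lower the IR for the host (X86 here); an ARM or NVPTX
# target machine would lower the very same IR for those back ends.
tm = llvm.Target.from_default_triple().create_target_machine()
mod = llvm.parse_assembly(str(module))
print(tm.emit_assembly(mod))
```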
33. MXNet: https://mxnet.apache.org/
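A minimal sketch of data-parallel training in MXNet's Gluon API (assuming two local GPUs; all names here are generic, not from the talk): each device gets a slice of the batch, gradients are aggregated by the kvstore, and switching to `kvstore="dist_sync"` plus MXNet's launcher extends the same code across nodes.

```python
import mxnet as mx
from mxnet import autograd, gluon

ctxs = [mx.gpu(0), mx.gpu(1)]  # assumption: two local GPUs
net = gluon.nn.Dense(10)
net.initialize(ctx=ctxs)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
# 'device' aggregates gradients on GPU; 'dist_sync' would synchronize
# across nodes through parameter servers instead.
trainer = gluon.Trainer(net.collect_params(), "sgd",
                        {"learning_rate": 0.1}, kvstore="device")

data = mx.nd.random.uniform(shape=(64, 128))
label = mx.nd.random.randint(0, 10, shape=(64,)).astype("float32")

# Split the batch across devices; each runs its own forward/backward.
data_parts = gluon.utils.split_and_load(data, ctxs)
label_parts = gluon.utils.split_and_load(label, ctxs)
with autograd.record():
    losses = [loss_fn(net(x), y) for x, y in zip(data_parts, label_parts)]
for l in losses:
    l.backward()
trainer.step(batch_size=64)  # gradients summed across devices before the update
```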
34. Inter-node / intra-node communication: InfiniBand, NVLink.
35. Progressive Batch Normalization. TuSimple HPC, "Training Deep Nets with Progressive Batch Normalization on Multi-GPUs", International Journal of Parallel Programming, vol. 47, issue 3, 2019.
36. Training accuracy rises by up to […]% when the synchronization group size is […]. The overhead of data sync across all devices occupies […] of the total overhead, and it reduces as the group size decreases.
37. PBN achieves almost the same […] • PBN can obtain up to […] • PBN can obtain the […] score with BN. After the whole training process: […] accuracy improvement compared with the baseline performance, and […] improvement compared with BN.
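The knob these two results slides vary is the BN synchronization group size. Here is a toy NumPy sketch of that knob (`group_bn_stats` is a hypothetical helper; the paper's progressive schedule, i.e. how the group size changes as training proceeds, is not reproduced here): statistics are shared only inside a group, so group size 1 is plain per-device BN and group size = number of devices is fully synchronized BN.

```python
import numpy as np

def group_bn_stats(per_device_batches, group_size):
    # Share BN mean/variance only inside groups of `group_size` devices.
    # Larger groups: statistics closer to the global batch, but more
    # cross-device synchronization traffic (the overhead on slide 36).
    stats = []
    for g in range(0, len(per_device_batches), group_size):
        members = per_device_batches[g:g + group_size]
        merged = np.concatenate(members, axis=0)
        stats.extend([(merged.mean(axis=0), merged.var(axis=0))] * len(members))
    return stats

# 8 devices, each holding a per-device batch of 4 samples x 16 features.
devices = [np.random.randn(4, 16) for _ in range(8)]
per_device = group_bn_stats(devices, group_size=1)  # no sync at all
fully_sync = group_bn_stats(devices, group_size=8)  # sync across all devices
```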
39. Kubeflow: TensorFlow and PyTorch operators; TuSimple HPC contributed the MXNet operator (mxnet-operator in Kubeflow), running on Kubernetes. https://github.com/kubeflow/mxnet-operator
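Submitting a distributed MXNet job through the mxnet-operator amounts to creating an `MXJob` custom resource. A hedged sketch with the Kubernetes Python client (the field names follow the mxnet-operator examples as I recall them; verify the CRD version, v1beta1 vs v1, and the container image against your own cluster):

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Assumption: the spec mirrors the mxnet-operator examples
# (scheduler + servers + workers, MXNet's parameter-server mode).
container = {"name": "mxnet", "image": "mxjob/mxnet:gpu"}  # hypothetical image tag

def replica(n):
    # One MXJob replica spec running `n` copies of the same container.
    return {"replicas": n, "template": {"spec": {"containers": [container]}}}

mxjob = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "MXJob",
    "metadata": {"name": "mxnet-dist-train"},
    "spec": {
        "jobMode": "MXTrain",
        "mxReplicaSpecs": {
            "Scheduler": replica(1),
            "Server": replica(2),
            "Worker": replica(4),
        },
    },
}

api.create_namespaced_custom_object(group="kubeflow.org", version="v1beta1",
                                    namespace="default", plural="mxjobs",
                                    body=mxjob)
```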