iQIYI: CPU-Based Deep Learning Service Optimization Solutions and Practice

2. Agenda (four items; the item text was lost in extraction, and only "CPU" survives in the third item)
6.        GPU     
8. (Bullet text lost in extraction; surviving terms: CPU, GPU; Caffe/Tensorflow/Torch/MxNet)
11. Architecture diagram: Mesos Scheduler placing Docker containers on servers • Docker: Customized Serving application; Model Optimizer Tool, VINO model, OpenVINO Inference Engine • Docker: Tensorflow Serving; Model Transform Tool, TF model, Tensorflow with MKL-DNN
13.            I/O   
14. Service figures (row and column labels lost in extraction): 20 min, 500 ms, 200 ms; 200 and 100 (units lost); 50 QPS, 100 QPS; precision 99.3%, recall 97.8%, accuracy 95.1%
16. (Leading bullets lost in extraction) • Compiler Option • MKL-DNN • Intel OpenVINO • Tensorboard, Timeline • Visual DL • Vtune • Dtrace, Strace • Plockstat, Lockstat • Perf, Sar, Numactl • Iostat, Vmstat, Blktrace • Vtune
17. Compiler Option, Math Library, Intel Open VINO SDK on Intel CPUs • Compiler: sse, sse4, avx, avx2, avx512 • Math Library: MKL, MKL-DNN • Intel Open VINO SDK: OpenCV/VX
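Before choosing compiler options for the instruction sets listed above, it helps to know which of them the target CPU actually reports. The following is a minimal sketch, assuming a Linux host that exposes `/proc/cpuinfo`; the helper names are hypothetical, and mapping the slide's "sse4" to the `sse4_1`/`sse4_2` flags and "avx512" to `avx512f` is an assumption of this example.

```python
from pathlib import Path

# Linux cpuinfo flag names standing in for the slide's list (assumption:
# "sse4" -> sse4_1 / sse4_2, "avx512" -> avx512f, the AVX-512 foundation flag).
WANTED_FLAGS = ("sse", "sse4_1", "sse4_2", "avx", "avx2", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supported_extensions(cpuinfo_text, wanted=WANTED_FLAGS):
    """Map each wanted flag name to True/False depending on CPU support."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


def report():
    """Print one line per extension for the machine this runs on."""
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, present in supported_extensions(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

Calling `report()` on a serving host prints which extensions are available, which can then guide the compiler flags used when building the framework.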
18. MKL-DNN (four bullets, mostly lost in extraction) • OpenMP • DNN. Open VINO • frameworks (tf, caffe, mxnet) • devices (cpu, gpu, fpga, dsp) • CNN-based, layer fusion • MKL-DNN, compiler • C++, python. Chart: Resnet50, MobileNet, InceptionV4; "with vino" vs. "with mkl only" (axis 0 to 8). Test environment: Xeon E5-2650 v4 CPU, batchsize=1, CPUs=8, Mem=16G
19. OpenMP • OMP_NUM_THREADS = "number of cpu cores in container" • KMP_BLOCKTIME = 10 • KMP_AFFINITY=granularity=fine,verbose,compact,1,0. Chart: fps (axis 0 to 120); mobilenet and Resnet-50 at omp_num_threads=2, 4, 8, 16. Test environment: Xeon E5-2650 v4 CPU, batchsize=1, CPUs=8, Mem=16G
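The three OpenMP settings above are plain environment variables, so they can be applied from the serving process itself. The following is a minimal sketch, assuming Linux and assuming the variables are set before the deep learning framework is imported; the helper names are hypothetical. Note that `os.sched_getaffinity` reflects a cpuset restriction but not a CPU-time quota, so in a quota-limited container the core count may need to be passed explicitly.

```python
import os


def available_cores():
    """Cores this process may run on; falls back to the machine total."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def openmp_env(num_cores):
    """Build the settings from the slide for a given core count."""
    return {
        "OMP_NUM_THREADS": str(num_cores),
        "KMP_BLOCKTIME": "10",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    }


def apply_openmp_env(num_cores=None):
    """Write the settings into os.environ and return them.

    Call this before importing the framework, since OpenMP runtimes
    typically read these variables once at start-up.
    """
    cores = num_cores if num_cores is not None else available_cores()
    env = openmp_env(cores)
    os.environ.update(env)
    return env
```

For the 8-core test container on the slide, `apply_openmp_env(8)` would set `OMP_NUM_THREADS` to `"8"`.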
20. CPU cores and Batchsize (three bullets relating Batchsize to CPU cores; text lost in extraction). Chart: fps (axis 0 to 700); alexnet at bs=1 and bs=128 with cpus=2, 4, 8, 12, 24. Test environment: Xeon E5-2650 v4 CPU
21. CPU models: Broadwell Xeon E5-2650 v4 vs. Skylake Gold 6148. Chart: fps (axis 0 to 120); E5-2650 vs. Gold 6148 on InceptionV4, ResNet-50, MobileNet. Test environment: batchsize=1, CPUs=4, Mem=16G
22. NUMA. Chart: fps (axis 0 to 20); InceptionV4 and Resnet-50 with NUMA(0-3), NUMA(0-2,12), NUMA(0-1,12-13). Test environment: Xeon E5-2650 v4 CPU, batchsize=1. Note (partly lost in extraction): NUMA, node vs. node, 5% ~ 10%
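The core sets compared in the NUMA chart, such as `0-3` or `0-2,12`, can be applied to a serving process by restricting its CPU affinity. The following is a minimal sketch, assuming Linux; the function names are hypothetical, and which core ids belong to which NUMA node is machine-specific, so the example core lists are only illustrative. It controls CPU placement only, not memory placement.

```python
import os


def parse_cpu_list(spec):
    """Turn a core list such as '0-2,12' into a set of core ids."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cores.update(range(int(first), int(last) + 1))
        else:
            cores.add(int(part))
    return cores


def pin_to_cores(spec):
    """Restrict the current process to the requested cores.

    Returns the cores actually used. Cores outside the current affinity
    mask are ignored; nothing changes if none remain or the platform
    lacks sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    usable = parse_cpu_list(spec) & os.sched_getaffinity(0)
    if usable:
        os.sched_setaffinity(0, usable)
    return usable
```

Calling `pin_to_cores("0-3")` before loading the model keeps all inference threads on the listed cores, which makes it possible to compare same-node and cross-node placements.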
23. NHWC vs NCHW • NHWC: Tensorflow, CPU, NHWC • Mkl-dnn: Tensorflow, NCHW • NCHW (explanatory text lost in extraction)
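NHWC and NCHW differ only in the order in which the same tensor elements are laid out in memory. The following is a minimal sketch in plain Python that shows the two flat offsets and a conversion between them; the function names are hypothetical, and a real service would use the framework's own transpose operation rather than Python loops.

```python
def nhwc_offset(n, h, w, c, H, W, C):
    """Flat index of element (n, h, w, c) when channels vary fastest."""
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) when width varies fastest."""
    return ((n * C + c) * H + h) * W + w


def nhwc_to_nchw(flat, N, H, W, C):
    """Reorder a flat NHWC buffer into a flat NCHW buffer."""
    if len(flat) != N * H * W * C:
        raise ValueError("buffer length does not match the given shape")
    out = [None] * len(flat)
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    src = nhwc_offset(n, h, w, c, H, W, C)
                    dst = nchw_offset(n, c, h, w, C, H, W)
                    out[dst] = flat[src]
    return out
```

For a 1x1x2x3 image stored as `["r0", "g0", "b0", "r1", "g1", "b1"]`, the NCHW result groups each channel together: `["r0", "r1", "g0", "g1", "b0", "b1"]`.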
24.      HD SD LD 
25.  1.    2.  mkl      
26.  •  1.     2.   MKL   
27. Chart: time in seconds (axis 0 to 14); series labels lost in extraction; surviving label: 720
28. Choosing Batchsize • batchsize (bullet text lost in extraction) • latency-sensitive services: batchsize chosen for latency • throughput-oriented services: batchsize chosen for throughput. Charts for InceptionResnet at batchsize 1, 2, 4, 8, 16, 32: throughput in fps (axis 0 to 35) and latency in ms (axis 0 to 1600). Test environment: Xeon E5-2650 v4 CPU, CPUs=4, Mem=16G
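The latency and throughput trade-off across batch sizes can be measured with a small sweep. The following is a minimal sketch, assuming the model is wrapped in a callable `infer(batch)` and inputs come from a callable `make_batch(size)`; both are hypothetical placeholders for the real service, as is the selection rule in `pick_batch_size`.

```python
import time


def benchmark(infer, make_batch, batch_sizes=(1, 2, 4, 8, 16, 32), repeats=5):
    """Measure mean latency (ms) and throughput (fps) per batch size."""
    results = {}
    for bs in batch_sizes:
        batch = make_batch(bs)
        infer(batch)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(repeats):
            infer(batch)
        elapsed = time.perf_counter() - start
        latency_s = max(elapsed / repeats, 1e-9)  # guard against zero
        results[bs] = {
            "latency_ms": latency_s * 1000.0,
            "fps": bs / latency_s,
        }
    return results


def pick_batch_size(results, max_latency_ms):
    """Highest-throughput batch size within the latency budget, or None."""
    within = [bs for bs, r in results.items()
              if r["latency_ms"] <= max_latency_ms]
    if not within:
        return None
    return max(within, key=lambda bs: results[bs]["fps"])
```

A latency-sensitive service would call `pick_batch_size` with a tight budget, while an offline job can pass a large budget and simply take the batch size with the highest fps.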
29. (Leading bullets lost in extraction) • Post-training Quantization • Training-aware Quantization • TF-lite: 2-5 (unit lost) • Intel Caffe: 1-3 (unit lost)
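To illustrate what post-training quantization does to a trained weight tensor, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic textbook scheme chosen for illustration, not the specific method used by TF-lite or Intel Caffe, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the range [-127, 127].

    Returns the integer codes and the scale needed to recover the floats.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [code * scale for code in codes]


def max_error(values):
    """Largest absolute round-trip error for a list of floats."""
    codes, scale = quantize_int8(values)
    restored = dequantize(codes, scale)
    return max((abs(a - b) for a, b in zip(values, restored)), default=0.0)
```

The round-trip error is bounded by half the scale, which is why tensors with a few large outliers lose more precision than tensors with a narrow value range.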
30.   CPU    CPU    100+ GPU  