LinkedIn, Susie Xia (夏婧姝) - "Automated Capacity Measurement and Performance Bottleneck Analysis Using Live Production Traffic"

范姜敏丽

Published 2018/05/13 in the Technology category

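Susie Xia (夏婧姝) is a Silicon Valley software engineer with a dream of making music. She graduated from Beijing University of Posts and Telecommunications in 2010 and went on to earn a master's degree at Carnegie Mellon University. After graduating she worked at Salesforce and then LinkedIn, on mobile application development and on the design and development of performance optimization, capacity analysis, and measurement automation for platform and big data systems. During that time she published several papers at computer science conferences, was invited repeatedly to speak at industry technical conferences in North America, and in 2017 received the Best Paper Award at the 24th IEEE International Conference on Web Services (ICWS). Outside work she loves pop music and singing. In 2013 she joined Encore Music Club, a group with a modest reputation in Silicon Valley, and became an amateur pop singer, performing since then on stages large and small around the Bay Area. She and her fellow music lovers at Encore have held two ticketed concerts.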

Text content
1. Detecting Capacity Limits and Performance Bottlenecks Using Live Traffic Susie Xia Jeff Weiner Christopher Coleman Chief Executive Officer 2018 QCon Beijing
2. Agenda 1 Introduction 2 Meet Redliner 3 Use Cases 4 Future Plans
3. LinkedIn Engagement & Growth 546M Members • 20M Companies • 14M+ Open Jobs • 29K+ Schools • 11B+ Endorsements • 20% Sessions Growth (YoY) • 200+ Countries & Territories • 5th straight quarter of this growth • Record levels of engagement • 60% (YoY) growth in viral actions, such as likes, comments, shares, and messages sent • Available in 24 languages • 70% of members outside the US • 2+ new users join per second
4. Our Dilemma WHY IS SERVER GROWTH OUTPACING PAGE VIEW GROWTH?
5. Over Provisioning 31% Wasted in 2016 • Organic Growth • Unexpected Events • New Products & Features • Emergency Uplifts
6. Motivations • Resource Efficiently • Capacity Plan Effortlessly • Increase Throughput Reliably
7. Challenges • External Interferences • Evolving Product Landscapes • Complex Downstream Dynamics
8. Load Testing Journey Synthetic Load in Lab • Synthetic Load in Prod • Isolated Host Record & Replay • Anything Else? Learnings [pros (+) and cons (-) of each approach; the overlapping slide text is not recoverable]
9. Goals • Use Live Production Traffic • Minimize Impact to Users • Require Low Operational Overhead
10. Hello, Redliner
11. Workflow (diagram: Live Production Traffic, Load Balancer, four App Instances, Redliner, Traffic Shift Request, Health Check Request, PASS / FAIL, Service Health Evaluator, Metric Collection Framework) • Errors & Error Rates • Latency Percentiles • System Stats
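The workflow on this slide can be read as a feedback loop: shift a little more live traffic to the target instance, ask the health evaluator for PASS / FAIL, and stop at the first FAIL. The following is a minimal sketch of that loop, not Redliner's actual implementation. The function name `find_redline`, the `is_healthy` callback, and the fixed QPS step are all assumptions made for illustration.

```python
from typing import Callable


def find_redline(
    is_healthy: Callable[[float], bool],
    start_qps: float = 100.0,
    step_qps: float = 50.0,
    max_qps: float = 10_000.0,
) -> float:
    """Return the highest traffic level that still passed the health check.

    `is_healthy(qps)` is a hypothetical stand-in for "shift traffic to the
    target instance, collect metrics, and ask the health evaluator".
    """
    last_passing = 0.0
    qps = start_qps
    while qps <= max_qps:
        if not is_healthy(qps):
            break  # FAIL: stop ramping; a real system would shed the extra load here
        last_passing = qps
        qps += step_qps
    return last_passing


# Toy target that stays healthy up to 400 QPS (an invented number).
def toy_health_check(qps: float) -> bool:
    return qps <= 400


redline = find_redline(toy_health_check)
print(redline)
```

The value returned is the last passing load rather than the first failing one, so the result is a level the instance was observed to handle safely.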
12. Health Evaluations • A variety of health checks measured at a set interval • Evaluations at the host, cluster, and data center levels • Incorporates signals from the operational alerting system • Performance comparisons between the target and the cluster
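A host-level evaluation that compares the target against the rest of the cluster might look like the sketch below. The `HostStats` type, the check names, and the 1% error-rate limit are assumptions; the 20% latency tolerance mirrors the "exceeded 20% change in comparison to control target" wording on slide 20, but how Redliner actually computes it is not stated in the deck.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class HostStats:
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    latencies_ms: List[float]  # sampled request latencies for one interval


def evaluate(
    target: HostStats,
    cluster: List[HostStats],
    max_error_rate: float = 0.01,      # assumed limit
    max_latency_change: float = 0.20,  # assumed tolerance vs. the cluster
) -> List[str]:
    """Return the names of failed checks; an empty list means PASS."""
    failures = []
    if target.error_rate > max_error_rate:
        failures.append("error_rate")
    # Baseline: median of the per-host median latencies across the cluster.
    baseline = median(median(h.latencies_ms) for h in cluster)
    if median(target.latencies_ms) > baseline * (1 + max_latency_change):
        failures.append("median_latency")
    return failures


cluster = [HostStats(0.001, [90, 100, 110]) for _ in range(3)]
target = HostStats(0.002, [120, 130, 140])
failed_checks = evaluate(target, cluster)
print(failed_checks)
```

Comparing against the cluster instead of a fixed number lets the same check keep working as the normal latency of the service drifts over time.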
13. Health Checks
14. Dynamic Ramping Slow, Steady Ramp Fast, Aggressive Ramp
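One way to switch between the two ramp styles is to pick the step size from how much headroom the latest measurement shows. This is only an illustration of the idea: the slide does not say how Redliner chooses its ramp, and `next_qps`, the headroom rule, and both factors are assumptions.

```python
def next_qps(
    current_qps: float,
    latency_ms: float,
    latency_limit_ms: float,
    fast_factor: float = 0.25,  # assumed: +25% per step when far from the limit
    slow_factor: float = 0.05,  # assumed: +5% per step when close to the limit
) -> float:
    """Pick the next traffic level from the remaining latency headroom."""
    headroom = 1.0 - latency_ms / latency_limit_ms
    factor = fast_factor if headroom > 0.5 else slow_factor
    return current_qps * (1 + factor)


fast = next_qps(100, latency_ms=20, latency_limit_ms=100)  # plenty of headroom
slow = next_qps(100, latency_ms=80, latency_limit_ms=100)  # close to the limit
print(fast, slow)
```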
15. Complete Automation • Manipulation of traffic between nodes in the cluster • Determination of the node’s and service’s health • Identification of potential bottlenecks under stress • Remediation of any issues encountered during test
16. Use Cases
17. 1. Find Single Instance Max Throughput • Gradually stresses the service until it cannot safely handle any additional load • Simplifies resource provisioning • Provides starting point for tuning and optimizations
18. 2. Improve Service Throughput • Investigate health check failures from increased traffic • Discover APIs “A”, “B”, “C” error rates jumped • Caused API “D” latency to double • Resolve issues one by one • Repeat the Redliner test
19. Before Investigation After Investigation
20. 3. Detect and Diagnose Regressions
Test 1: Date 2017-11-19 09:01:11 • Version v1.0.0 • Redline 2536.33 • Health Check Failures in Latency: N/A
Test 2: Date 2017-11-19 23:58:09 • Version v1.0.1 • Redline 534.19 • Health Check Failures in Latency: Endpoint A: Median latency exceeded 20% change in comparison to control target. Endpoint B: Median latency exceeded 20% change in comparison to control target.
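The two redline values on this slide make the regression easy to quantify. The sketch below computes the relative change between consecutive test runs and flags it; the function names and the 10% alert threshold are assumptions, while the two redline numbers come from the slide.

```python
def redline_change(previous: float, current: float) -> float:
    """Relative change in redline between two runs (negative means a drop)."""
    return (current - previous) / previous


def is_regression(previous: float, current: float, max_drop: float = 0.10) -> bool:
    """Flag a regression when the redline drops by more than `max_drop` (assumed 10%)."""
    return redline_change(previous, current) < -max_drop


change = redline_change(2536.33, 534.19)  # v1.0.0 -> v1.0.1
print(f"{change:.1%}", is_regression(2536.33, 534.19))
```

For these two runs the redline falls by roughly 79%, far beyond any plausible noise threshold, which matches the latency failures reported for v1.0.1.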
21. The Smiley Curve
22. 4. A/B Load Testing (diagram: Live Requests from Service Clients, Proxy / Load Balancer, Production v1.0.0 with three Service Instances, Canary v1.0.1 with one Service Instance) • Run Redliner test side-by-side on canary and production versions • Code comparisons • Configuration comparisons • OS comparisons • Security updates
23. A/B Load Test Example • Same load on both canary and prod instances until one or both fail a health check • Prod instance hits a health check failure before the canary instance • v1.0.1 on canary has better throughput – deploying the new version is encouraged
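The A/B procedure described here can be sketched as a single ramp applied to both instances at once, stopping as soon as either fails. This is an illustrative sketch under assumed names (`ab_load_test`, the two health callbacks, the QPS step), not the tool's real interface.

```python
from typing import Callable, Optional, Tuple


def ab_load_test(
    prod_healthy: Callable[[float], bool],
    canary_healthy: Callable[[float], bool],
    start_qps: float = 100.0,
    step_qps: float = 50.0,
    max_qps: float = 5_000.0,
) -> Tuple[Optional[float], str]:
    """Apply the same load to both instances; report where and who failed first."""
    qps = start_qps
    while qps <= max_qps:
        prod_ok = prod_healthy(qps)
        canary_ok = canary_healthy(qps)
        if not prod_ok and not canary_ok:
            return qps, "both"
        if not prod_ok:
            return qps, "prod"
        if not canary_ok:
            return qps, "canary"
        qps += step_qps
    return None, "none"  # neither instance failed within the tested range


# Toy instances: prod degrades above 300 QPS, canary above 450 (invented numbers).
result = ab_load_test(lambda q: q <= 300, lambda q: q <= 450)
print(result)
```

When production fails first, as in the slide's example, the canary build tolerated at least as much load as the current release.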
24. 5. Identify Surplus Capacity When […] < […], […] # […] = […] + […]. When […] ≥ […], […] # […] = 1 + […]. If […] # […] > […] # […], the service is over-provisioned. ([…] marks formula terms lost in extraction.)
25. Server Cap Ex Trend for Service
26. Future Work
27. 1. Dynamic Provisioning • Auto Scaling – Scale predictably to handle natural changes in traffic throughout the day • Efficient Host Packing – Create models for throughput based on resource allocations and deploy most efficient container size
28. 2. Simulating Downstream Behavior • Latency – Test against response times during peak traffic hours at any time in the day • Errors & Failures – Test service behavior when downstream results are acting unreliably • Connectivity – Test resiliency and recovery when dependencies are unavailable
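Latency and error injection of the kind listed on this slide can be approximated by wrapping the downstream call. The wrapper below is a sketch with assumed names (`simulate_downstream`, `extra_latency_s`, `error_rate`); the deck describes this as future work and does not specify a mechanism.

```python
import random
import time
from typing import Any, Callable, Optional


def simulate_downstream(
    call: Callable[..., Any],
    extra_latency_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Callable[..., Any]:
    """Wrap a downstream call with added latency and random failures."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        time.sleep(extra_latency_s)  # emulate peak-hour response times
        if rng.random() < error_rate:
            raise ConnectionError("simulated downstream failure")
        return call(*args, **kwargs)

    return wrapped


def lookup(member_id: int) -> str:  # hypothetical downstream dependency
    return f"profile-{member_id}"


slow_lookup = simulate_downstream(lookup, extra_latency_s=0.01)
broken_lookup = simulate_downstream(lookup, error_rate=1.0)
print(slow_lookup(7))
```

Setting `error_rate=1.0` models the "dependencies are unavailable" case, since every call then raises.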
29. 3. Stateful Redlining • Source Node – Storage node to test • Dark Node – Exact replica of source node • Tee Traffic – Copy the live traffic arriving at the source node to the dark node • Multiply Traffic – Generate extra load on the dark node based on incoming traffic
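Teeing and multiplying traffic can be sketched in a few lines: every live request is served by the source node once and replayed against the dark node several times. The function and parameter names are assumptions, and the sketch also assumes that replaying a request more than once is safe for the dark node's state, which the slide does not address.

```python
from typing import Any, Callable, Iterable


def tee_and_multiply(
    requests: Iterable[Any],
    send_to_source: Callable[[Any], None],
    send_to_dark: Callable[[Any], None],
    multiplier: int = 2,
) -> None:
    """Serve each request on the source node and replay it on the dark node."""
    for request in requests:
        send_to_source(request)      # real traffic, real response to the client
        for _ in range(multiplier):  # copies whose responses are discarded
            send_to_dark(request)


source_log, dark_log = [], []
tee_and_multiply(["r1", "r2"], source_log.append, dark_log.append, multiplier=3)
print(len(source_log), len(dark_log))
```

Because only the dark node receives the multiplied load, it can be pushed to its limit without changing what real users see from the source node.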
30. Key Takeaways
31. Reflection • Don’t Be Afraid of Risk • Prepare for the Surprises • Build Performance Mindset
32. Don’t count servers. Make servers count.
33. Thank you https://engineering.linkedin.com/blog chinajobs@linkedin.com