Airbnb Real-Time Stream Computing: Architecture and Evolution (Du Yifan, Airbnb)

Published 2019/07/08

GIAC2019
1. Airbnb Real-Time Stream Computing: Architecture and Evolution. Du Yifan, Airbnb Software Engineer
2. Outline
• What is Airbnb
• Spark Use Cases at Airbnb
• Upgrade to Spark 2.3
• Near Real-Time Data Ingestion with Kafka & Spark Streaming
• Production Tool Built on Top of Spark Streaming - Airstream
3. What is Airbnb
4. What is Airbnb (2009 map legend: 50-100 listings, 101-300 listings, 301-1000 listings, 1001+ listings)
5. What is Airbnb: 6 million total homes on Airbnb
6. What is Airbnb: 100K cities; 191+ countries and regions
7. What is Airbnb
8. What is Airbnb
9. What is Airbnb
10. What is Airbnb: Airbnb Experiences
11. What is Airbnb: Sports, Classes & Workshops, Nature, History, Photography, Fashion
12. What is Airbnb: Sports, History, Health & Wellness, Concerts, Social Impact, Entertainment, Food & Drink, Photography, Arts
13. What is Airbnb
14. WHAT DOES IT MEAN FOR DATA INFRASTRUCTURE?
15. Data Warehouse Storage (2017-2019): 60-fold Growth!
16. Airbnb Spark Use Cases
• Search
• Pricing
• Machine Learning
• Data Ingestion
• Near Real-time Applications
• …
17. Search Ranking
• More powerful models (DNNs) require more training data
• To process large amounts of data, we need to leverage tools like Spark
• Allows for more complex error handling and unit tests
• Code can be re-used between the online system (Java) and the offline data pipeline (Scala)
• Broadcast joins optimize data pipelines so that raw logs are extracted only for searches of interest
arXiv paper: Applying Deep Learning To Airbnb Search
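The broadcast-join trick on this slide can be illustrated outside of Spark: the small set of search IDs of interest is shipped to every worker, so the large log scan never needs a shuffle. A minimal pure-Python sketch of the map-side join semantics (record shapes and names are illustrative, not Airbnb's actual schema):

```python
# Map-side ("broadcast") join sketch: the small side is materialized as an
# in-memory set on every worker, so filtering the large raw-log side needs
# no shuffle. All names and record shapes below are illustrative.

def broadcast_filter_join(raw_logs, searches_of_interest):
    """Keep only log records whose search_id is in the broadcast set."""
    broadcast = set(searches_of_interest)  # small side, fits in memory
    return [log for log in raw_logs if log["search_id"] in broadcast]

raw_logs = [
    {"search_id": 1, "query": "paris loft"},
    {"search_id": 2, "query": "tokyo studio"},
    {"search_id": 3, "query": "kyoto machiya"},
]
print(broadcast_filter_join(raw_logs, [1, 3]))  # keeps search_ids 1 and 3
```

In Spark the same idea is expressed with a broadcast variable or a broadcast-join hint; the payoff is that only the large side is scanned in parallel.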
18. Smart Pricing. Blog Post: Learning Market Dynamics for Optimal Pricing
19. Financial Intelligence
• Support Finance and Accounting functions
• Process all the data in the company
  - Yes, ALL of it
  - Varying quality and "cleanliness"
  - Immense scale: 10+ years of transactional data
• Output clean financial data to power various business functions
  - Treasury
  - Revenue Ops
  - Financial Systems and Technologies Group
20. Spark Upgrade to 2.3
21. Spark Upgrade from 1.x to 2.3
• Up to 2.5X performance improvement - 60% reduction of batch processing time in a production job
• Better SQL support: Spark 2.x can run all 99 TPC-DS queries, which require many SQL:2003 features
• Vectorized reader for ORC and Parquet
• Reduced cost due to better performance
• Better support and integration with the Spark and Hadoop ecosystem
• Numerous improvements and bug fixes have gone into Spark since the 1.6 release in 2016
22. Scaling Near Real-Time Data Ingestion with Kafka & Spark Streaming
23. Architecture of Logging Data Flow
• Bridge between online and offline data
• High throughput
• Mission critical
• SLA (Service-Level Agreement) & recovery
• Near real-time
• Efficiency & cost
24. Challenges and Pain Points
• Fast growth
  - Topics grew from dozens to thousands
  - Bytes grew 6x in 2018
• Bottlenecks (e.g., Spark parallelism determined by the number of Kafka partitions)
• Skew in event size and QPS
25. Logging Event Size Skew
26. Current 1-to-1 Kafka Reader
27. Spark Task Running Time Skew (image from https://silverpond.com.au/2016/10/06/balancing-spark/)
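Why partition skew translates directly into batch latency: with a 1-to-1 reader, one Spark task drains exactly one Kafka partition, so the batch finishes only when the largest partition is done. A toy calculation with made-up numbers (not Airbnb's actual figures):

```python
# With a 1-to-1 Kafka reader, batch latency is set by the slowest task,
# i.e. the largest partition. All numbers below are made up for illustration.

def batch_seconds(partition_bytes, bytes_per_sec_per_task):
    """Batch time when each partition is drained by exactly one task."""
    return max(partition_bytes) / bytes_per_sec_per_task

partitions = [50e6, 40e6, 45e6, 900e6]  # one hot partition (event-size skew)
rate = 10e6                             # 10 MB/s per task

print(batch_seconds(partitions, rate))  # 90.0 s, though the mean is only ~26 s
```

Adding more executors does not help here, because the parallelism ceiling is the number of Kafka partitions, which motivates the 1-to-N balanced reader on the next slides.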
28. More Challenges and Pain Points
• Stability & SLA depend on many systems
  - Near real-time ingestion
  - Headroom for catch-up
  - SLAs suffer
  - Operational nightmare
  - Oncall burnout
• Efficiency & cost
29. Solution from the Spark Community
• There is an outstanding issue and PR
• It does not handle data skew among topics
30. Balanced Kafka Reader for Spark
31. 1-to-N Kafka Reader
32. Balanced Partitioning Algorithm
• Pre-compute the average event size (bytes) per topic
• Compute the ideal bytes per split
• For new topics, use the average size of all topics
Blog Post: Scaling Spark Streaming for Logging Event Ingestion
33. Balanced Partitioning Algorithm
• Shuffle the list of offset ranges
• Starting from split 1, for each offset range:
  - Assign it to the current split if the total weight stays under the ideal bytes per split
  - If it doesn't fit, break it apart and assign the subset of the offset range that fits
  - Once the current split reaches the ideal bytes per split, move to the next split
Blog Post: Scaling Spark Streaming for Logging Event Ingestion
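The packing steps above can be sketched as follows. This is a simplified pure-Python model of the algorithm described in the blog post; the production implementation is in Spark and its names differ:

```python
import random

def balanced_splits(offset_ranges, avg_bytes, num_splits, default_avg, seed=0):
    """Pack Kafka offset ranges into num_splits roughly byte-balanced splits.

    offset_ranges: list of (topic, partition, start_offset, end_offset)
    avg_bytes:     pre-computed average event size (bytes) per topic
    default_avg:   fallback average size for topics not seen before
    """
    ranges = list(offset_ranges)
    random.Random(seed).shuffle(ranges)  # spread hot topics across splits

    def size_of(topic):
        return avg_bytes.get(topic, default_avg)

    total = sum((end - start) * size_of(t) for t, _, start, end in ranges)
    per_split = total / num_splits       # ideal bytes per split

    splits = [[] for _ in range(num_splits)]
    cur, cur_bytes = 0, 0.0
    for topic, part, start, end in ranges:
        size = size_of(topic)
        while start < end:
            if cur == num_splits - 1:
                take = end - start       # last split absorbs the remainder
            else:
                # how many whole events still fit under the byte budget
                take = min(end - start, int((per_split - cur_bytes) // size))
            if take > 0:
                splits[cur].append((topic, part, start, start + take))
                cur_bytes += take * size
                start += take
            # move on when this split is full (or has no room at all)
            if start < end and cur < num_splits - 1 and (take <= 0 or cur_bytes >= per_split):
                cur, cur_bytes = cur + 1, 0.0
    return splits

# One hot topic with large events plus two small-event partitions:
avg = {"hot_topic": 10_000, "small_topic": 100}
ranges = [("hot_topic", 0, 0, 1000),
          ("small_topic", 0, 0, 1000),
          ("small_topic", 1, 0, 1000)]
splits = balanced_splits(ranges, avg, num_splits=4, default_avg=500)
for i, sp in enumerate(splits):
    print(i, sum((e - s) * avg[t] for t, _, s, e in sp))
```

The hot partition is broken into several sub-ranges, so each split carries roughly the same number of bytes regardless of which topics it holds.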
34. Balanced Kafka Reader Performance
35. Results
• Support 20X higher throughput with imbalanced topics
• Better SLA (happy customers! happy engineers!)
• Faster recovery and less hand-holding
• Ever-increasing throughput
36. Open Source Will Be Available Soon!
• The Balanced Kafka Reader for Spark will be available on Airbnb.io and the Airbnb GitHub soon
37. Production Tool Built on Top of Spark Streaming: Airstream
38. What is Airstream
• A framework to define and execute data pipelines
• Pipelines created by stitching together building blocks
• Pipelines defined through configuration
• Philosophy: Make simple things easy, complex things possible
39. Components of a Pipeline
• Sources
• Processes
• Sinks
40. Sources
• Structured and unstructured - Jitney (Airbnb internal library), JSON, binary
• Static and dynamic data sources - Kafka, Kinesis, HBase, S3, Redis, etc.
41. Processes
• A unit of logic
• SQL on structured messages
• Custom UDFs (user-defined functions) on unstructured messages
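How a custom UDF over unstructured messages might look in miniature. The register/apply plumbing here is invented purely for illustration and is not Airstream's actual API:

```python
import json

# Sketch of a "process" step applying a user-defined function to raw
# (unstructured) messages. The registry and run_process helper below are
# illustrative stand-ins, not Airstream's real interfaces.

UDFS = {}

def udf(name):
    """Register a function under a name so a pipeline config can refer to it."""
    def register(fn):
        UDFS[name] = fn
        return fn
    return register

@udf("extract_listing_id")
def extract_listing_id(raw_payload):
    """Parse a raw JSON event and pull out a single field."""
    event = json.loads(raw_payload)
    return event.get("listing_id")

def run_process(udf_name, messages):
    """Apply the named UDF to every message in a micro-batch."""
    fn = UDFS[udf_name]
    return [fn(m) for m in messages]

msgs = ['{"listing_id": 42, "ts": 1}', '{"listing_id": 7, "ts": 2}']
print(run_process("extract_listing_id", msgs))  # [42, 7]
```

The point of the indirection is that a pipeline config can name the UDF as a string, keeping user logic pluggable without changing the framework.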
42. SQL Process on Structured Messages: write SQL against the converted Thrift schema
43. Sinks
• Persist pipeline output
• Variety of sinks
  - Kafka
  - Jitney
  - HBase
  - Hive
  - Metrics (Datadog)
  - …
44. Sample Configuration
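The original configuration slide is an image that did not survive transcription. Purely as a hypothetical illustration of "pipelines defined through configuration", a config stitching a Kafka source, a SQL process, and a Hive sink together might look like this (every field name below is invented, not Airstream's actual schema):

```yaml
# Hypothetical Airstream-style pipeline config; all names are illustrative.
name: example_event_pipeline
sources:
  - name: events_in
    type: kafka
    topic: jitney.events.example
processes:
  - name: filter_bookings
    type: sql
    sql: SELECT id, event_type, ts FROM events_in WHERE event_type = 'booking'
sinks:
  - name: events_out
    type: hive
    table: warehouse.booking_events
```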
45. Data Flow
46. Benefits
• Ease and speed of pipeline development
• Reuse of sources and sinks
• SQL lowers the barrier to entry
• Shields users from the underlying infrastructure and its changes
• Extensible
47. Recap
• What is Airbnb & the data behind our business
• Using Spark to handle data
• One particular case: a near real-time logging event data pipeline using Kafka & Spark
• Challenge & pain point: data skew due to fast growth
• Solution: the Balanced Kafka Reader (to be open-sourced on Airbnb.io)
• A derived production tool - Airstream
48. Thank you!