Cheng Feng: Thoughts on and Construction of a Self-Service Big Data Platform

Text content
1. Thoughts and Practices in Self Service Big Data Platform Cheng Feng Grab
4. Thoughts and Practices in Self Service Big Data Platform 13 July 2019 ArchSummit Shenzhen v1.0
5. Agenda ● Data in Grab ● Why Self Service ● How To Do Self Service ● Data Governance
6. Data in Grab Data Platforms Architecture Challenges on Storage Challenges on Computation
7. Data in Grab Reporting User Trust Machine Learning
8. The Evolution of Data Analytics Platforms
9. Decoupling Storage and Compute
10. Distributed File System vs Object Storage Client PUT/LIST/GET/DELETE
11. Challenges in S3 1. Eventual Consistency Time: 10:00:00, 10:00:01, 10:00:03, 10:00:05, 10:00:10 Writer: client1 ● List s3://TableA/ Return: A,B ● Delete s3://TableA/A,B ● Put s3://TableA/ C,D ● List s3://TableA/ Return: C,D ● List s3://TableA/ Return: C,D ● List s3://TableA/ Return: A,B ● List s3://TableA/ Return: C,D Reader: client2 ● List s3://TableA/ Return: A,B 2. 5XX errors ● 500: Internal Server Error (system problem, can be transient) ● 503: Service Not Available (often due to loading, internal timeout)
12. Solution: Introducing versioning NO S3 List Read Transformation YES Write Validation Metastore update S3:///Datalake/DB_Name/Table_Name/version-id=1/files S3:///Datalake/DB_Name/Table_Name/version-id=2/files
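The versioning flow on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used at Grab: the `Metastore` class, the `publish_version` and `read_table` helpers, and the `part-00000.csv` file name are all hypothetical, and a local temporary directory stands in for S3. The idea it shows is that each write goes to a fresh `version-id=N` prefix, and readers learn the file locations from the metastore rather than from a list call.

```python
import tempfile
from pathlib import Path


class Metastore:
    """Hypothetical in-memory stand-in for a table metastore."""

    def __init__(self):
        self._files = {}

    def get_files(self, table):
        return list(self._files.get(table, []))

    def set_files(self, table, files):
        # One pointer update makes the new version visible to readers.
        self._files[table] = list(files)


def publish_version(root, table, version_id, rows, metastore):
    """Write rows under a new version prefix, validate, then update the metastore."""
    version_dir = Path(root) / table / f"version-id={version_id}"
    # exist_ok=False: a version prefix is written once and never overwritten.
    version_dir.mkdir(parents=True, exist_ok=False)
    data_file = version_dir / "part-00000.csv"  # hypothetical file name
    data_file.write_text("\n".join(rows) + "\n")

    # Validation: re-read the written file and compare row counts.
    written = data_file.read_text().splitlines()
    if len(written) != len(rows):
        return False  # NO: the metastore keeps pointing at the old version

    # YES: record the exact files, so readers never need to list a prefix.
    metastore.set_files(table, [str(data_file)])
    return True


def read_table(table, metastore):
    """Read only the files the metastore currently points at."""
    rows = []
    for path in metastore.get_files(table):
        rows.extend(Path(path).read_text().splitlines())
    return sorted(rows)


root = tempfile.mkdtemp()
store = Metastore()
publish_version(root, "TableA", 1, ["A", "B"], store)
publish_version(root, "TableA", 2, ["C", "D"], store)
current_rows = read_table("TableA", store)
```

Because old version prefixes are left untouched, a reader that resolved version 1 before the update can still finish its scan, while new readers see only version 2.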
13. Other Solutions ● Netflix Iceberg ○ Tracking table snapshots and metadata ○ https://github.com/apache/incubator-iceberg ● Uber Hudi ○ Support Upsert and Incremental pull ○ https://github.com/apache/incubator-hudi ● Databricks Delta Lake ○ An open-source storage layer that brings ACID transactions to Apache Spark ○ https://github.com/delta-io/delta
14. Challenges in Elastic Computing Compute Engine a. Presto i. Limited fault tolerance ii. Memory Limit per node iii. Write Performance iv. RBO (Moving to CBO) b. Spark i. Spark in K8S ii. Access Control Infrastructure Cost a. HA i. Multiple Airflow workers for a Queue ii. Different Presto Clusters b. Auto-scaling i. CPU and Memory Utilization c. Spot instance i. Multiple Node Types ii. Spot + On demand
15. Why Self Service Organization Growth Data Ownership
16. Organization Chart Transport Tech Product /Business Analytics Food Tech Data Science Payments Tech Logistics Tech About 30 tech families; 1000+ Product Engineers Data Engineering Business Users Trust & Safety About 10+ Analytic Teams; 1000+ Data Users
17. Data Ownership Team growth ● Hyper growth in the business ● Joint ventures, "let thousands of flowers bloom" ● Decoupling components, breaking the connections between them Schema ● Data modeling ● Schema changes: new columns, data type changes, etc. ● Schema on Write to Schema on Read Infrastructure ● Database migration ● Data ownership changes ● Network changes
18. What do we want to be Platform abstracts the complexities!
19. The Journey to Self Service Hugo - Self Service Data Ingestion Slide + Tableau - Self Service BI
20. Data Sources in Grab ● Transaction ○ MySQL - RDS ○ MySQL - Aurora ○ Postgres ○ SQL Server ● Event Streams ○ Kafka ○ DynamoDB ● Service Logs ○ S3 Files ● Others ○ Elasticsearch ○ Google Cloud Storage ○ Different Vendors
21. RDBMS Loader - Parallel Processing Table: Order Indexes: Created_at Raw numbers: 10 million/hour Driver ● Executor 1: select * from order where id > 1 and id < 1,000,000 ● Executor 2: select * from order where id > 1,000,000 and id < 2,000,000 ● ... ● Executor 10: select * from order where id > 9,000,000 and id < 10,000,000
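The range split that the driver hands to the executors could look roughly like the sketch below. It is an illustration only: `split_id_range` and `build_queries` are hypothetical helper names, the table name `orders` is a placeholder, and the sketch uses half-open ranges (`>=` and `<`) so that rows sitting exactly on a boundary are not skipped, which is a deliberate change from the strict `>` and `<` comparisons shown on the slide.

```python
def split_id_range(min_id, max_id, num_partitions):
    """Split [min_id, max_id] into at most num_partitions half-open ranges [lo, hi)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    step = -(-total // num_partitions)  # ceiling division
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step, max_id + 1)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def build_queries(table, column, ranges):
    """Build one range query per executor."""
    return [
        f"select * from {table} where {column} >= {lo} and {column} < {hi}"
        for lo, hi in ranges
    ]


# Ten executors over ten million ids, as on the slide.
ranges = split_id_range(1, 10_000_000, 10)
queries = build_queries("orders", "id", ranges)
```

Each query can then be run by a separate executor. Splitting on an indexed, roughly uniform column keeps the partitions balanced; a skewed column would need a different split strategy.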
22. Understand the data source - DB Analyzer
23. Make it simple - GUI
24. Make it simple - Auto Cold Start
25. Usage stats
26. Data User Needs ● Data Catalog: Alation ● Job Scheduler: Airflow & Slide ● Data Quality: Dash ● SQL Reporting: Presto & Spark SQL & Redshift ● Data Visualization: Tableau & Holistic
27. Slide + Tableau - Self Service BI Hugo Raw Data Ingestion Prejoin Curated data sets Slide ● Presto SQL Job Scheduler ● Micro Services ● User-friendly Web UI (one stop) ● Support Multiple Clusters ● Internal & External Lineage Support Tableau ● Build Reports in Desktop ● Publish to Tableau Server ● Scheduled Critical Reports
28. Key Takeaways ● Product Mindset ○ Out-serving Customers ○ Make it simple ○ Automation ● Transparency of the data platform ○ Visibility - Everyone can see pipeline runs ○ SLAs - 45-minute delay compared with production ○ Data Governance - Next Chapter
29. Data Governance Data Quality Data Lineage Data Catalog
30. Data Quality Tool - Dash ● Data completeness/consistency ○ Multiple data sources (RDBMS, Presto, Spark) ○ Customized logic ○ Drill-down analysis ● Outlier detection ○ Unexpected values, rule-based detection ○ Unexpected trends ● Alerting ○ Slack, Email, PagerDuty
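The three kinds of checks listed on this slide can be illustrated with a small sketch. None of this is Dash's actual API: the function names, the tolerance parameter, and the three-sigma threshold are assumptions chosen for the example. It shows a row-count completeness check between a source and a target, a rule-based check for unexpected values, and a simple deviation test for unexpected trends.

```python
from statistics import mean, pstdev


def completeness_check(source_count, target_count, tolerance=0.0):
    """True when the target row count is within `tolerance` of the source count."""
    if source_count == 0:
        return target_count == 0
    drift = abs(source_count - target_count) / source_count
    return drift <= tolerance


def rule_violations(rows, rules):
    """Return (row index, column) pairs whose value fails the column's rule.

    rows is a list of dicts; rules maps a column name to a predicate.
    """
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column, is_valid in rules.items()
        if not is_valid(row.get(column))
    ]


def is_trend_outlier(history, today, max_sigma=3.0):
    """True when today's metric deviates from history by more than max_sigma."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > max_sigma


# Hypothetical sample data: a fare column that must be present and non-negative.
rows = [{"fare": 12.5}, {"fare": -3.0}, {"fare": None}]
rules = {"fare": lambda value: value is not None and value >= 0}
bad_cells = rule_violations(rows, rules)
```

A failed check would then feed the alerting channels named on the slide; that wiring is left out here.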
31. Dash - Use Case
32. Data Lineage Tool - Lighthouse ● Dependency Data Collection ○ Data Pipeline Status API - Airflow Operator ○ Customized Dependency Registration ● Impact Analysis ○ Downstream ○ Upstream - Data Provenance ○ Table Popularity - SQL Parser ● Additional features ○ ETA - Daily ○ Alerts ○ Visualization
33. Use cases - Impact Analysis ● Upstreams ○ match q=shortestPath((n)-[:Lineage*1..20]->(m)) where m.uuid='datamart_table' and n.uuid <> m.uuid return q ● Downstreams ○ match q=shortestPath((n)-[:Lineage*1..20]->(m)) where n.uuid='ods_table' and n.uuid <> m.uuid return q
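To make the impact-analysis queries above concrete, the same upstream and downstream traversal can be sketched in plain Python. This is an illustration, not Lighthouse code: the `reachable` function and the intermediate table names `dwd_table` and `ods_other` are hypothetical. A breadth-first search yields the shortest hop count to each reachable table, bounded at 20 hops like the `*1..20` pattern in the queries.

```python
from collections import deque


def reachable(edges, start, direction="downstream", max_hops=20):
    """Return {table: hop count} for tables reachable from `start`.

    edges is a list of (upstream, downstream) pairs.
    """
    if direction not in ("downstream", "upstream"):
        raise ValueError("direction must be 'downstream' or 'upstream'")

    # Build the adjacency map in the direction we want to walk.
    adjacency = {}
    for upstream, downstream in edges:
        if direction == "downstream":
            adjacency.setdefault(upstream, []).append(downstream)
        else:
            adjacency.setdefault(downstream, []).append(upstream)

    hops = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                hops[neighbour] = depth + 1  # BFS gives the shortest hop count
                queue.append((neighbour, depth + 1))
    return hops


# Hypothetical lineage: two ODS tables feeding one datamart table.
edges = [
    ("ods_table", "dwd_table"),
    ("dwd_table", "datamart_table"),
    ("ods_other", "datamart_table"),
]
downstream_of_ods = reachable(edges, "ods_table", "downstream")
upstream_of_mart = reachable(edges, "datamart_table", "upstream")
```

The downstream result answers "what breaks if this table is late or changes", and the upstream result gives the provenance of a datamart table.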
34. Data Catalog Tool - Alation
35. Key takeaways ● Data Quality ○ Data completeness analysis ○ Outlier detection ● Data Lineage ○ Impact Analysis ○ Data Governance ● Data Dictionary ○ Search & Discovery ○ Collaborative Analytics
36. Questions please?