Delta Lake:Open Source Reliability for Data Lake with Apache Spark 李潇

Razor

2019/10/19 发布于 技术 分类

文字内容
1. The Delta Architect Delta Lake + Apache Spark Structured Streaming ( 0 (3 D G @A C @
3. • Tech Lead and Engineering Manager at Databricks • Apache Spark Committer and PMC Member • Previously, IBM Master Inventor • Spark, Database Replication, Information Integration • Ph.D. in University of Florida • Github: gatorsmile
4. Delta Lake Joins the Linux Foundation! +
6. Dominique Brezinski (Apple Inc.) Michael Armbrust (Databricks) Spark + AI Summit 2017/06 2017/10 2018/06 2019/04 2019/10
7. • U • Delta Lake • Delta • Delta • Delta • Delta Lake & Demo
8. k Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming
9. k Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming Kinesis CSV, JSON, TXT… Data Lake AI & Reporting
10. Events Stream Stream Table (Data gets written continuously) AI & Reporting p
11. Events Stream Batch Table (Data gets written continuously) Table Batch AI & Reporting
12. Events Stream 1p Batch Table (Data gets written continuously) Table Batch AI & Reporting
13. Events Stream Stream Unified View Lambda b Batch Table (Data gets written continuously) Batch
14. Events Stream Stream Unified View Batch Table (Data gets written continuously) Batch AI & Reporting
15. Events Stream Stream Unified View AI & Reporting Partition Batch Table (Data gets written continuously) Batch
16. Events Stream Stream Unified View Batch Table (Data gets written continuously) Batch AI & Reporting Update / Merge d Update / Merge
17. Events m a L a d b Stream Stream Unified View Batch Table (Data gets written continuously) Batch AI & Reporting Update / Merge d Update / Merge
18. Diff ere nt f Con cat en ield ate typ sm e g a n i d s a l o l l e c m a f a r Events f a i t u a l d e w s o l s s e y l e s m e r t c x E Stream o s n n o i t a r f e p l O i a t c a d a t t e i M n n o d g e k c o l s B s c d n a h E m m em ven Co a H t u o Stream a w Unified View AI & Reporting l d n u C t o F o t o o N e l i c F n g n i o t s t e g n i p e s e t K te n rol num c y Update / s ! t l u ! s e r ! b b o j t e n e t Merge s r i s n o o c n i : L f A C I padr Batch Batch CRIT q u ? Table ? e ? s e u t s s I e l f b a i (Data gets written T h l s e e r f e Update / Merge R s? continuously)
19. m Lambda Events Stream Stream Unified View Batch Table (Data gets written continuously) Batch AI & Reporting Update / Merge d Update / Merge
20. m Lambda ! Events Stream L Stream Unified View Batch Table (Data gets written continuously) Batch r AI & Reporting Update / Merge d Update / Merge
21. k Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming ? Kinesis AI & Reporting CSV, JSON, TXT… Data Lake d[ keng ] [ die ]
22. ? Kinesis CSV, JSON, TXT… P Data Lake 2M g f g 3M 4M 5M d t 1M TS AI & Reporting g S P
23. Structured Streaming + = • h • • x Delta
24. Delta Lake a S
25. Delta On Disk (Optional) Partition Directories my_table/ _delta_log/ 00000.json 00001.json date=2019-01-01/ Data Files file-1.parquet Transaction Log Table Versions
26. Table = result of a set of actions Action Types • Change Metadata – name, schema, partitioning, etc. • Add File – adds a file (with optional statistics) • Remove File – removes a file Result: Current Metadata, List of Files, List of Txns, Version
27. Atomicity C C@ G G D AA D 9A @G 000000.json GD GD D @ Add 1.parquet Add 2.parquet 000001.json Remove 1.parquet Remove 2.parquet Add 3.parquet
28. ] 1. Record start version 2. Record reads/writes 3. Attempt commit, check for conflicts among transactions 4. If someone else wins, check if anything you read has changed. 5. Try again. Read:'>Read: Schema Read:'>Read: Schema Write:'>Write: Append Write:'>Write: Append User 1 000000.json 000001.json 000002.json User 2
29. z R commit log files Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet – Use Spark!!! z Checkpoint P
30. Delta
31. ? Kinesis CSV, JSON, TXT… AI & Reporting Data Lake 1M t (Full ACID Transaction) Snapshot isolation between writers and readers. Focus on your data flow, instead of worrying about failures. Delta Table
32. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake t 1M (Full ACID Transaction) Snapshot isolation between writers and readers. Focus on your data flow, instead of worrying about failures. Delta Table Delta Table Delta Table Delta Table
33. ? Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 2M g f 1: partition values? - f Hive metastore 2: c - partition (Scalable metadata handling) partition location path? filesP location list hP
34. ? Kinesis CSV, JSON, TXT… AI & Reporting Data Lake 2M g 1: 2: c f (Scalable metadata handling) partition values? filesP T P
35. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake g 2M f 1: (Scalable metadata handling) partition values? 2: c filesP T • p Parquet • p Spark’s Distributed Vectorized Parquet Reader file paths P
36. ? Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 3M g (rollback) ! g (update/delete/merge)
37. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 3M g (rollback) ! Time Travel – e e [ PPP g (update/delete/merge)
38. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 3M g (rollback) ! Time Travel temporal query Debug Table g (update/delete/merge)
39. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 3M g (rollback) ! g (update/delete/merge)
40. Kinesis CSV, JSON, TXT… Data Lake AI & Reporting
41. Kinesis CSV, JSON, TXT… Data Lake AI & Reporting merge
42. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake g 3M update/delete/merge (rollback) r SQL g (update/delete/merge) P Spark 3.0 is coming Spark 2.4 Delta SQL parser
43. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 4M TS g (replay historical data) Stream the backfilled historical data through the same pipeline ACID support
44. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake 5M (late arriving data) S Stream any late arriving data added to the table as they get added ACID support MERGE/UPSERT
45. Kinesis AI & Reporting CSV, JSON, TXT… Data Lake t 1M 2M g f 4M 5M (Scalable metadata handling) g 3M TS (read consistent data) (rollback) g (update/delete/merge) g (late arriving data) (replay historical data) S
46. k Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming - Delta Kinesis CSV, JSON, TXT… Data Lake AI & Reporting
47. Delta A continuous data flow model to unify batch & streaming Kinesis Streaming Analytics CSV, JSON, TXT… Data Lake AI & Reporting
48. Delta A continuous data flow model to unify batch & streaming OVERWRITE INSERT Kinesis CSV, JSON, TXT… Data Lake UPDATE Bronze Raw Ingestion DELETE Silver Filtered, Cleaned Augmented UPSERT Gold Streaming Analytics Business-level Aggregates AI & Reporting Quality
49. Delta
50. #1. Delta Lake table. . Same engine. Same APIs. Same user code. L Lambda p . . L h [e.g., Structured Streaming’s Trigger.Once] . [Trigger.ProcessingTime, Continuous] L L u
51. #2. Dataframes; (materialization) : L s L Source Table (transformations). L T1 T2 T3 T4 T6 T1 T2 T3 T5 T7 a Dest Tables
52. #2. Dataframes; (materialization) : L s L T1 T2 (transformations). T3 L T4 T6 T5 T7
53. #2. DataframesP Reliability T1 end-2-end latency T2 T3 T4 T6 T5 T7
54. U #3. *Data Quality Levels * Bronze Kinesis Raw Ingestion CSV, JSON, TXT… Data Lake Silver Gold Streaming Analytics Business-level Aggregates Filtered, Cleaned Augmented AI & Reporting 1. ; . [Trigger.ProcessingTime, Continuous] o
55. U #3. 1. 2. ; : ; clusters ( a warm pool of machines. Spark streaming 3. Trigger.Once , ): . p [ incremental processing] . ; : . p [ incremental processing] . Spark streaming Trigger.Once
56. x #4. predicate [ xO L Partitioning on low cardinality ○ L ( t partitionBy(date, eventType). // e partition T 100 a 1GB). event types Z-Ordering on high cardinality ○ optimize table ZORDER BY userId. // 1 c T user ids
57. #5. t (raw data) + i stream = • . . . • e b elasticity) (initial backfill) (cloud
58. #6. schemas L events T n schema L : g t . data expectations: *Data Quality Levels * Kinesis CSV, JSON, TXT… Data Lake Bronze Raw Ingestion Silver Filtered, Cleaned Augmented Quality Gold Streaming Analytics Business-level Aggregates AI & Reporting
59. #6. DML: L • Retention • Corrections INSERT Kinesis CSV, JSON, TXT… Data Lake UPDATE Bronze • • DELETE Raw Ingestion GDPR UPSERTS Silver MERGE Filtered, Cleaned Augmented OVERWRITE Gold Streaming Analytics Business-level Aggregates AI & Reporting Quality
60. . • s • • 9 a p U x • 9A • G • C C C 1 )4,! ,.,4,!/,2 , g a E @DCG
61. Delta E@E A@C 3.) p l E@E A@C 3.) f E@E A@C [ A C E x p l 12 3 GG@DC@ m @C G @C m 9 @DC G
62. Delta
63. Improved reliability: Petabyte-scale jobs 10x lower compute: → 64! 640 Simpler, faster ETL: 84 jobs → 3 jobs halved data latency
64. Easier transactional updates: No downtime or consistency issues! Simple CDC: Easy with MERGE Improved performance: Queries run faster p →
65. Data consistency and integrity: not available before Increased data quality: name match accuracy up 80% → 95% Faster data loads: Databricks Cluster → 64
66. Instead of parquet … … simply say delta dataframe .write .format("parquet") .save("/data") dataframe .write .format("delta") .save("/data")
67. Add Spark Package pyspark --packages io.delta:delta-core_2.12:0.1.0'>io.delta:delta-core_2.12:0.1.0 bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0'>io.delta:delta-core_2.12:0.1.0 Maven io.delta delta-core_2.12 0.4.0
68. Demo Delta Lake Primer https://dbricks.co/dlw-01
69. Delta Lake
70. Delta Lake Roadmap Releases Features 0.2.0 • Cloud storage support • Improved concurrency 0.3.0 • Scala/Java APIs for DML commands • Scala/Java APIs for query commit history • Scala/Java APIs for vacuuming old files 0.4.0 • Python APIs for DML and utility operations • In-place Conversion of Parquet to Delta Lake table Q4 • Enable Hive support reading Delta tables • SQL DML support with Spark 3.0 • And more
71. Delta Lake Community 20,000 3700+ 15,000 Orgs using Delta 10,000 2+ 5,000 r m be Se pt e Au gu st Ju ly Ju ne M ay Ap ril M ar ch 0 Exabytes of Delta Read/Writes per month
72. https://delta.io/ Delta Lake! Delta Lake Slack Channel: delta-users.slack.com Mailing List: groups.google.com/forum/#!forum/delta-users
73. koalas 3.0
75. (lixiao AT databricks.com)