TiDB with TiFlash Extension: Towards a True HTAP Platform (Wei Wan)


Published 2019/06/25 under the Technology category

Tags: QCon, QCon 2019

1. TiDB with TiFlash Extension A Venture Towards the True HTAP Platform weiwan@pingcap.com
3. About me ● Wei Wan (韦万) ● R&D @ PingCAP ● Former game / Android / big-data developer, now a database core developer ● Focused on storage engine & performance optimization
4. About this talk ● The TP & AP challenge ● What is TiFlash? ● How is TiFlash built? ● TiDB data platform
5. Data Platform - What You Think It Is BI Reporting Ad hoc App Databases Console
6. Data Platform - What It Really Is BI ETL Reporting Analytical DBs Ad hoc App OLTP DBs Console Data Warehouse / Data Lake
7. Why VS
8. Fundamental Conflicts ● Large / batch process vs point / short access ○ Row format for OLTP ○ Columnar format for OLAP ● Workload Interference ○ A single large analytical query might cause disaster for your OLTP workload
9. A Popular Solution ● Use different types of databases ○ For live and fast data, use an OLTP specialized database or NoSQL ○ For historical data, use Hadoop / analytical database ● Offload data via the ETL process into your Hadoop cluster or analytical database ○ Per hour or even per day ○ Complex offload procedures
10. Good enough, really?
11. Complexity or
12. Freshness or
13. Consistency or
14. TiFlash Extension
15. What Is TiFlash? ● An extended analytical engine for TiDB ○ Columnar storage and vectorized processing ○ Based on ClickHouse with tons of proprietary modifications ● Data sync via extended Raft consensus algorithm ○ Strong consistency ○ Trivial overhead ● Clear workload isolation for not impacting OLTP ● Tight integration with TiDB
16. What Is TiFlash? Distributed database with MySQL protocol + TiFlash AP extension = Real HTAP database!
17. What Is TiFlash? [Architecture diagram: TiDB instances and a Spark cluster with TiSpark workers sit on top of both a TiKV cluster (TiKV Nodes 1-3, Stores 1-3, Regions 1-4 replicated across nodes) and a TiFlash extension cluster (TiFlash Nodes 1-2 holding the same Regions in columnar form).]
18. Columnstore vs Rowstore ● Columnar storage stores data in columns instead of rows ○ Suitable for analytical workloads ■ Enables column pruning ○ Makes compression possible, further reducing IO ■ Far lower storage requirement ○ Poor at small random reads and writes ■ Which is the typical OLTP workload ● Rowstore is the classic format for databases ○ Researched and optimized for OLTP scenarios for decades ○ Cumbersome in analytical use cases
19. Columnstore vs Rowstore SELECT avg(age) FROM emp; [Diagram: rowstore lays out whole rows (id, name, age) one after another: (0962, Jane, 30), (7658, John, 45), (3589, Jim, 20), (5523, Susan, 52); columnstore stores each column contiguously, so this query reads only the age column.]
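To make the layout difference concrete, here is a toy Python sketch (not TiFlash code) of the avg(age) example above: the rowstore version must pull every full row through memory, while the columnar version prunes down to the single "age" array.

```python
# Toy illustration of rowstore vs columnstore for SELECT avg(age) FROM emp.

rows = [  # rowstore: one tuple per record
    ("0962", "Jane", 30),
    ("7658", "John", 45),
    ("3589", "Jim", 20),
    ("5523", "Susan", 52),
]

columns = {  # columnstore: one contiguous array per column
    "id":   ["0962", "7658", "3589", "5523"],
    "name": ["Jane", "John", "Jim", "Susan"],
    "age":  [30, 45, 20, 52],
}

def avg_age_rowstore(rows):
    # Must scan every row, touching all columns along the way.
    return sum(r[2] for r in rows) / len(rows)

def avg_age_columnstore(columns):
    # Column pruning: read only the "age" array.
    ages = columns["age"]
    return sum(ages) / len(ages)

print(avg_age_rowstore(rows))        # 36.75
print(avg_age_columnstore(columns))  # 36.75
```

Both paths compute the same answer; the columnar path simply reads far less data, which is the whole advantage for wide tables.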
20. Columnstore vs Rowstore “If your mother and your wife fell into a river at the same time, who would you save?” “Why not both?”
21. Low-cost Data Replication ● TiDB replicates logs via the Raft consensus protocol ● TiFlash replicates data into columnstore via Raft Learner ● Learner is a special read-only role in Raft ● Data is replicated to the learner asynchronously ○ Write operations do not wait for the learner to finish replicating ● Introduces almost zero extra latency for the OLTP workload
22. Low-cost Data Replication [Diagram: Region A has a Raft leader and two followers on TiKV nodes, plus a learner replica on a TiFlash node.]
23. Strong Consistency ● Although data replication is asynchronous ● Read operations still guarantee strong consistency ● The Raft learner read protocol + MVCC do the trick ○ Check the readIndex on read and wait for the necessary log to be applied ○ Then read at the query timestamp via MVCC
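The learner read steps above can be sketched in a few lines of Python. This is a hedged toy model (class and method names are illustrative, not the real TiKV/TiFlash API): before serving a read, the learner asks the leader for the current commit index and waits until it has applied at least that much log, so its snapshot is no staler than the moment the read started.

```python
# Toy model of the Raft learner read protocol: readIndex + wait + MVCC read.

class Leader:
    def __init__(self, commit_index):
        self.commit_index = commit_index

    def read_index(self):
        # The leader reports its current commit index.
        return self.commit_index

class Learner:
    def __init__(self, applied_index):
        self.applied_index = applied_index

    def apply_until(self, index):
        # Stand-in for "wait for the necessary log to be applied".
        self.applied_index = max(self.applied_index, index)

    def learner_read(self, leader, read_ts):
        required = leader.read_index()   # 1. ask the leader for readIndex
        self.apply_until(required)       # 2. wait until caught up
        return f"snapshot@ts={read_ts}"  # 3. MVCC read at the query timestamp

leader = Leader(commit_index=4)
learner = Learner(applied_index=3)       # learner is one entry behind
print(learner.learner_read(leader, read_ts=17))
print(learner.applied_index)             # 4: caught up before serving the read
```

The key point is that replication stays asynchronous (writes never wait for the learner), yet reads remain strongly consistent because the learner refuses to answer until it has caught up to the leader's commit point.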
24-25. Learner Read [Diagram: a read at timestamp 17 arrives while the leader is at log index 4 and the learner at index 3; the learner waits until it has applied index 4, then serves the read.]
26. Update Support ● Updates are hard on a columnar storage engine compared with a row-based engine ○ Block structure ○ Rough index maintenance ○ Scan speed ● It is even harder to support ONLINE, TRANSACTIONAL updates
27. Update Support ● Versioned rows (MVCC): key a has versions (ts 102, bob), (ts 104, alice), (ts 108, delete marker); key b has versions (ts 105, kevin), (ts 107, joe) ● MutableMergeTree storage engine (based on ClickHouse's MergeTree, an LSM-Tree-like design) ○ In memory: row-based (Raft, transactions, cache) ○ On disk: columnar, organized in levels L0 / L1 / L2 (MVCC, AP performance)
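The versioned-row table on this slide can be read with a simple rule: a query at timestamp read_ts sees the newest version of a key with ts <= read_ts, and a delete marker (del = 1) hides the key. A toy sketch of that visibility rule (illustrative only, not the MutableMergeTree implementation):

```python
# MVCC visibility over (key, ts, del, value) versioned rows, matching
# the example data on the slide.

versions = [
    ("a", 102, 0, "bob"),
    ("a", 104, 0, "alice"),
    ("a", 108, 1, "alice"),   # delete marker for key "a" at ts=108
    ("b", 105, 0, "kevin"),
    ("b", 107, 0, "joe"),
]

def mvcc_get(versions, key, read_ts):
    # Newest version of `key` visible at `read_ts`, honoring delete marks.
    visible = [v for v in versions if v[0] == key and v[1] <= read_ts]
    if not visible:
        return None
    _, _, deleted, value = max(visible, key=lambda v: v[1])
    return None if deleted else value

print(mvcc_get(versions, "a", 105))  # alice  (the ts=104 version)
print(mvcc_get(versions, "a", 110))  # None   (deleted at ts=108)
print(mvcc_get(versions, "b", 106))  # kevin  (ts=107 is not visible yet)
```

Because updates and deletes are appended as new versions rather than rewritten in place, the columnar blocks on disk stay immutable, which is what makes online transactional updates workable on a columnstore.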
28. TiFlash is beyond columnar format
29. Scalability ● An HTAP database needs to store a huge amount of data ● Scalability is therefore very important ● TiDB relies on Multi-Raft for scalability ○ One command to add / remove a node ○ Scaling is fully automatic ○ Smooth and painless data rebalancing ● TiFlash adopts the same design
30. Isolation ● Perfect resource isolation ● Data rebalancing based on the "label" mechanism ○ Dedicated nodes for TiFlash / columnstore ○ Nodes are differentiated by "label" ● Computation isolation follows naturally ○ Use a different set of compute nodes ○ AP queries read only from nodes with the AP label
31. Isolation [Diagram: each Region has Raft peers on TiKV Nodes 1-3 plus a learner peer on a TiFlash node; TiDB and TiSpark route analytical reads only to the TiFlash-labeled nodes, keeping AP load off the TiKV cluster.]
32. Integration ● Tightly integrated interaction ○ TiDB / TiSpark may choose to read from either side ■ Based on cost ■ The columnstore is treated as a special kind of index ○ Upon TiFlash replica failure, reads fall back to the TiKV replica transparently ○ A single query can join data from both sides
33. Integration SELECT AVG(s.price) FROM prod p, sales s WHERE p.pid = s.pid AND p.batch_id = 'B1328'; [Diagram: TiDB / TiSpark runs IndexScan(batch_id = 'B1328') against the TiKV cluster and TableScan(price, pid) against the TiFlash extension cluster, joining the two sides in one query.]
34. MPP Support ● TiFlash nodes form an MPP cluster by themselves ● Full computation support on the MPP layer ○ Speeds up TiDB, since TiDB itself is not an MPP design ○ Speeds up TiSpark by avoiding disk writes during shuffle
35. MPP Support TiFlash nodes exchange data directly, enabling complex operators such as distributed joins. [Diagram: TiDB / TiSpark acts as coordinator, sending plan segments to MPP workers on TiFlash Nodes 1-3.]
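The data exchange behind a distributed join can be sketched as a hash shuffle: each worker partitions its rows by the join key so that equal keys land on the same worker, after which every worker joins only its own partition. This is a toy illustration of the general technique, not TiFlash's implementation (which keeps the exchange in memory and over the network instead of spilling to disk):

```python
# Toy hash shuffle for a distributed join across MPP workers.

def shuffle(rows, key_index, num_workers):
    """Partition rows across workers by hashing the join key."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        worker = hash(row[key_index]) % num_workers
        partitions[worker].append(row)
    return partitions

# Two tables to be joined on their first column (the join key).
sales = [("p1", 10.0), ("p2", 7.5), ("p1", 3.0)]
prod = [("p1", "widget"), ("p2", "gadget")]

sales_parts = shuffle(sales, key_index=0, num_workers=2)
prod_parts = shuffle(prod, key_index=0, num_workers=2)

# After the shuffle, each worker joins only its local partitions;
# matching keys are guaranteed to be co-located.
for w in range(2):
    lookup = dict(prod_parts[w])
    joined = [(k, price, lookup[k]) for k, price in sales_parts[w] if k in lookup]
    print(f"worker {w}: {joined}")
```

Running the join per-partition is what lets the MPP layer scale the work across TiFlash nodes rather than funneling everything through a single coordinator.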
36. Performance ● The underlying storage engine supports Multi-Raft + MVCC ● Performance is still comparable to Parquet ● Benchmarked against Apache Spark 2.3 on Parquet ○ Using a pre-POC version of TiFlash + Spark
37. Performance
38. Performance A new storage engine is on the way, delivering at least a 3x performance boost.
39. TiDB Data Platform
40. Traditional Data Platform [Diagram: BI, Reporting, Ad hoc, App, and Console tools sit on top of OLTP DBs, ETL pipelines, analytical DBs, and a data warehouse / data lake.] The traditional data platform relies on a complex architecture that moves data around via ETL, which introduces maintenance cost and delays data arrival in the data warehouse.
41. TiDB Data Platform BI Reporting Ad hoc App TiDB with TiFlash Console
42. Fundamental Change ● "What happened yesterday" vs "What's going on right now" ○ Real-time reports for sales campaigns, adjusting prices in no time ○ Risk management with always up-to-date information ○ Fast-paced replenishment based on live data and prediction
43. Project Status ● Beta / user POC in May 2019 ● GA by the end of 2019
45. Photos and icons are made by or referred from the following sources: Becris@Flaticon, smashicons@Flaticon, Gregor Cresnar@Flaticon, www.miifotos.com, 三人环游记@知乎, www.formula1.com, www.aceros-de-hispania.com, www.othaimmarkets.com, Freepik@Flaticon