TiDB Design and Architecture (Shen Li / 申砾)

承如凡

Published 2017/11/26 in the Technology category

Online transaction processing (OLTP) and online analytical processing (OLAP) place very different demands on a database, and the two kinds of systems usually differ greatly in architecture. In most cases, a business that needs both maintains two heterogeneous systems connected by ETL tools: one carries the OLTP workload and focuses on data consistency, high throughput, and low latency; the other carries the OLAP workload and focuses on complex analytical queries and massive data volumes. This approach cannot guarantee data freshness, and maintaining two systems plus the data synchronization between them raises both the cost and the complexity of the business. TiDB is a distributed NewSQL database developed by PingCAP that aims to give users with mixed OLTP/OLAP workloads a one-stop solution: you can treat TiDB as a MySQL that scales out without limit, which greatly simplifies the application architecture and lowers operating cost. This post shares the design principles and implementation of TiDB.

Slide text
2. TiDB Design And Architecture, Shen Li, PingCAP
3. Shen Li (申砾) • Tech Lead of TiDB, VP of Engineering • Netease / 360 / PingCAP • Infrastructure software engineer
4. • Why we need a new database • The goal of TiDB • Design && Architecture • Storage Layer • Scheduler • SQL Layer • Spark integration • TiDB on Kubernetes
5. • From scratch • What's wrong with the existing DBs? • RDBMS • NoSQL & Middleware • NewSQL: F1 & Spanner (timeline diagram: RDBMS since the 1970s: MySQL, PostgreSQL, Oracle, DB2 ...; NoSQL since 2010: Redis, HBase, Cassandra, MongoDB ...; NewSQL from 2015 to present: Google Spanner, Google F1, TiDB)
6. • Scalability • High Availability • ACID Transaction • SQL. A Distributed, Consistent, Scalable SQL Database that supports the best features of both traditional RDBMS and NoSQL. Open source, of course.
7. • Data storage • Data distribution • Data replication • Auto balance • ACID Transaction • SQL at scale
8. (architecture diagram: Applications → MySQL Drivers (e.g. JDBC) → MySQL Protocol → TiDB (SQL Layer) → RPC → TiKV (Storage Layer))
10. • Good start! RocksDB is fast and stable. • Atomic batch write • Snapshot • However… it's a locally embedded KV store: • Can't tolerate machine failures • Scalability depends on the capacity of the disk
11. Fault Tolerance • Use Raft to replicate data • Key features of Raft: strong leader (the leader does most of the work and issues all log updates), leader election, membership changes • Implementation: ported from etcd • Replicas are distributed across machines/racks/data-centers
12. Fault Tolerance (diagram: three machines, each running RocksDB, replicating data between them through Raft)
13. Scalability • What if we SPLIT data into many regions? • We got many Raft groups. • Region = contiguous keys • Hash partitioning or range partitioning? • Redis: hash partitioning • HBase: range partitioning • Range scan: Select * from t where c > 10 and c < 100;
14. • Key: byte array • Logical key space: a globally ordered map over (-∞, +∞), i.e. a sorted map • Can't use hash partitioning; use range partitioning • Region 1 -> [a - d] • Region 2 -> [e - h] … • Region n -> [w - z] • Data is stored/replicated/scheduled in regions • Region meta: [start_key, end_key)
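To make the range-partitioning idea on slide 14 concrete, here is a minimal Go sketch (not TiDB/TiKV code; the Region struct and findRegion helper are invented for illustration) that keeps region metadata as a sorted list of [start_key, end_key) ranges and finds the region owning a given key with a binary search:

    package main

    import (
        "fmt"
        "sort"
    )

    // Region holds the metadata of one region: a half-open key range
    // [StartKey, EndKey). An empty EndKey means "up to +inf".
    type Region struct {
        ID       int
        StartKey string
        EndKey   string
    }

    // findRegion returns the region whose range contains key.
    // regions must be sorted by StartKey and cover the whole key space.
    func findRegion(regions []Region, key string) Region {
        // First region whose EndKey is greater than key (or unbounded).
        i := sort.Search(len(regions), func(i int) bool {
            return regions[i].EndKey == "" || key < regions[i].EndKey
        })
        return regions[i]
    }

    func main() {
        regions := []Region{
            {ID: 1, StartKey: "", EndKey: "d"},
            {ID: 2, StartKey: "d", EndKey: "h"},
            {ID: 3, StartKey: "h", EndKey: ""},
        }
        fmt.Println(findRegion(regions, "apple").ID) // 1
        fmt.Println(findRegion(regions, "hbase").ID) // 3
    }

With this layout, a range scan such as Select * from t where c > 10 and c < 100 only has to visit the regions whose ranges overlap the scanned interval, which is exactly what hash partitioning cannot offer.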
15. • That's simple • Logical split • Just Split && Move • Split safely using Raft (diagram: Region 1 splits into Region 1 and Region 2)
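A logical split, as slide 15 describes, only rewrites region metadata: the parent region's range is cut at a split key and the right half is handed to a new region, while the replicas agree on the split through Raft. A small illustrative sketch, using the same hypothetical Region type as above (not real TiKV code):

    package main

    import "fmt"

    // Region describes a half-open key range [StartKey, EndKey);
    // "" as EndKey means +inf. Illustrative only.
    type Region struct {
        ID       int
        StartKey string
        EndKey   string
    }

    // split cuts r at splitKey, shrinking r to [StartKey, splitKey) and
    // returning a new region covering [splitKey, old EndKey). Only the
    // metadata changes; the underlying data stays where it is.
    func split(r *Region, splitKey string, newID int) Region {
        right := Region{ID: newID, StartKey: splitKey, EndKey: r.EndKey}
        r.EndKey = splitKey
        return right
    }

    func main() {
        r1 := Region{ID: 1, StartKey: "a", EndKey: ""}
        r2 := split(&r1, "m", 2)
        fmt.Printf("region 1: [%q, %q)\n", r1.StartKey, r1.EndKey) // ["a", "m")
        fmt.Printf("region 2: [%q, %q)\n", r2.StartKey, r2.EndKey) // ["m", "")
    }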
16-19. (diagram sequence: Regions 1-3 are replicated across Nodes A-D, with Region 1's leader, marked *, on Node A; a new Node E joins the cluster; a replica of Region 1 is added on Node E and the surplus replica on Node A is removed, with the leader transferred as needed, leaving the regions balanced across Nodes A-E)
20. (diagram: a client issues RPCs to TiKV nodes 1-4; each node runs a Store holding several Region replicas, and the replicas of the same Region on different stores form a Raft group; a Placement Driver cluster of PD 1-3 oversees placement)
21. • MVCC • Data layout: key1_version2 -> value, key1_version1 -> value, key2_version3 -> value • Lock-free snapshot reads • Transaction: inspired by Google Percolator, an 'almost' decentralized 2-phase commit
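The MVCC layout on slide 21 can be sketched in a few lines of Go. This is only a toy model of the idea (the helper names and the byte layout are assumptions, not TiKV's real encoding): the version is appended to the user key, bit-flipped so newer versions sort first, and a snapshot read simply seeks to the newest version not newer than its read timestamp.

    package main

    import (
        "encoding/binary"
        "fmt"
        "sort"
        "strings"
    )

    // mvccKey appends the version to the user key, bit-flipped and big-endian,
    // so that for the same user key, higher versions sort *before* lower ones.
    func mvccKey(key string, version uint64) string {
        var buf [8]byte
        binary.BigEndian.PutUint64(buf[:], ^version)
        return key + "\x00" + string(buf[:])
    }

    // snapshotGet returns the newest value of key whose version is <= readTS.
    // The store is emulated with a sorted slice of encoded keys plus a map.
    func snapshotGet(keys []string, values map[string]string, key string, readTS uint64) (string, bool) {
        prefix := key + "\x00"
        // Seek to the first entry >= the encoded (key, readTS): thanks to the
        // descending version order, that is the newest version visible at readTS.
        seek := mvccKey(key, readTS)
        i := sort.SearchStrings(keys, seek)
        if i < len(keys) && strings.HasPrefix(keys[i], prefix) {
            return values[keys[i]], true
        }
        return "", false
    }

    func main() {
        values := map[string]string{
            mvccKey("k1", 2): "v2",
            mvccKey("k1", 1): "v1",
            mvccKey("k2", 3): "w3",
        }
        var keys []string
        for k := range values {
            keys = append(keys, k)
        }
        sort.Strings(keys)

        v, _ := snapshotGet(keys, values, "k1", 3) // read at ts=3 sees version 2
        fmt.Println(v)                             // v2
        v, _ = snapshotGet(keys, values, "k1", 1) // read at ts=1 sees version 1
        fmt.Println(v)                            // v1
    }

Because a read only seeks in an ordered keyspace and never takes locks, snapshot reads stay lock-free; in real TiKV the user key first goes through the memory-comparable encoding discussed later, so appending the version cannot break the ordering.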
22. • Highly layered • Raft for consistency and scalability • No distributed file system, for better performance and lower latency (layer stack: Transaction → MVCC → RaftKV → Local KV Storage (RocksDB))
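The layering on slide 22 can be summarized as a set of narrow interfaces, each wrapping the layer below. The interface and method names in this Go sketch are invented for illustration; they are not TiKV's actual APIs:

    package main

    // LocalKV is the bottom layer: an embedded key-value engine such as RocksDB.
    type LocalKV interface {
        Put(key, value []byte) error
        Get(key []byte) ([]byte, error)
    }

    // RaftKV replicates every write through a Raft group before it reaches
    // LocalKV, so data survives machine failures without a distributed file system.
    type RaftKV interface {
        Propose(op WriteOp) error // committed once a quorum of replicas has it
        Read(key []byte) ([]byte, error)
    }

    // MVCC stores each value under (key, version) on top of RaftKV and serves
    // lock-free snapshot reads at a given timestamp.
    type MVCC interface {
        Write(key, value []byte, version uint64) error
        SnapshotGet(key []byte, readTS uint64) ([]byte, error)
    }

    // Txn implements ACID transactions (Percolator-style 2PC) on top of MVCC.
    type Txn interface {
        Prewrite(mutations []WriteOp, startTS uint64) error
        Commit(keys [][]byte, startTS, commitTS uint64) error
    }

    // WriteOp is a single mutation.
    type WriteOp struct {
        Key, Value []byte
    }

    func main() {}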
24. Placement Driver • Provide the God's view of the entire cluster • Store the metadata; clients have a cache of placement information • Maintain the replication constraint: 3 replicas by default • Data movement for balancing the workload • It's a cluster too, of course, thanks to Raft (diagram: three Placement Driver instances kept consistent through Raft)
25. (diagram: TiKV nodes send heartbeats with region info to the PD cluster; PD combines cluster info, scheduling strategy, and admin config into scheduling commands that move regions between nodes)
26. • Replica number in a Raft group • Replica geo distribution • Read/write workload • Leaders and followers • Tables and TiKV instances • Other customized scheduling strategies
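To get a feel for what the scheduler does with the signals listed on slide 26, here is a deliberately simplified balance pass in Go (illustrative only, nothing like PD's real algorithm, which also weighs replica constraints, geo distribution, leader counts, and so on): move one region replica from the store holding the most regions to the store holding the fewest when the gap is large enough.

    package main

    import "fmt"

    // Store is a toy model of one TiKV store and the region replicas it holds.
    type Store struct {
        ID      int
        Regions []int // region IDs whose replicas live on this store
    }

    // balanceOnce picks the most- and least-loaded stores and, if the gap is
    // at least 2, moves one region replica between them.
    func balanceOnce(stores []*Store) (regionID, from, to int, moved bool) {
        src, dst := stores[0], stores[0]
        for _, s := range stores {
            if len(s.Regions) > len(src.Regions) {
                src = s
            }
            if len(s.Regions) < len(dst.Regions) {
                dst = s
            }
        }
        if len(src.Regions)-len(dst.Regions) < 2 {
            return 0, 0, 0, false // already balanced enough
        }
        regionID = src.Regions[len(src.Regions)-1]
        src.Regions = src.Regions[:len(src.Regions)-1]
        dst.Regions = append(dst.Regions, regionID)
        return regionID, src.ID, dst.ID, true
    }

    func main() {
        stores := []*Store{
            {ID: 1, Regions: []int{1, 2, 3, 4}},
            {ID: 2, Regions: []int{2, 3}},
            {ID: 3, Regions: []int{}},
        }
        if r, from, to, ok := balanceOnce(stores); ok {
            fmt.Printf("move region %d from store %d to store %d\n", r, from, to)
        }
    }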
28. ● SQL is simple and very productive ● We want to write code like this: SELECT COUNT(*) FROM user WHERE age > 20 and age < 30;
29. • Mapping the relational model to the Key-Value model • Full-featured SQL layer • Cost-based optimizer (CBO) • Distributed execution engine
30. • Row: Key = TableID + RowID, Value = row value • Index: Key = TableID + IndexID + IndexColumnValues, Value = RowID. Example: CREATE TABLE `t` (`id` int, `age` int, key `age_idx` (`age`)); INSERT INTO `t` VALUES (100, 35); Encoded keys: K1 = tid + rowid -> row value (100, 35); K2 = tid + idxid + 35 -> K1 (the RowID)
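Here is a hedged Go sketch of that mapping for the example on slide 30. The prefixes and the row handle below are assumptions for illustration; TiDB's real codec uses a different byte layout, but the shape is the same: the row key is built from the table ID and row ID, and the index key from the table ID, index ID, and indexed value, with the row ID as its value.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // Illustrative encoding only: the prefixes "t", "_r", "_i" and the int64
    // row handle are assumptions, not TiDB's exact codec.
    func encodeRowKey(tableID, rowID int64) []byte {
        key := append([]byte("t"), u64(tableID)...)
        key = append(key, []byte("_r")...)
        return append(key, u64(rowID)...)
    }

    func encodeIndexKey(tableID, indexID, indexValue int64) []byte {
        key := append([]byte("t"), u64(tableID)...)
        key = append(key, []byte("_i")...)
        key = append(key, u64(indexID)...)
        return append(key, u64(indexValue)...)
    }

    func u64(v int64) []byte {
        var b [8]byte
        binary.BigEndian.PutUint64(b[:], uint64(v))
        return b[:]
    }

    func main() {
        // INSERT INTO `t` VALUES (100, 35), assuming table ID 11, row handle 1,
        // index ID 1 for `age_idx`:
        k1 := encodeRowKey(11, 1)       // K1 -> row value (100, 35)
        k2 := encodeIndexKey(11, 1, 35) // K2 -> row handle 1, i.e. points back to K1
        fmt.Printf("K1 = %x\nK2 = %x\n", k1, k2)
    }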
31. • Key and Value are byte arrays • Row data and index data are converted into Key-Value pairs • Keys should be encoded using a memory-comparable encoding algorithm: compare(a, b) == compare(encode(a), encode(b)) • Example: Select * from t where age > 10
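For integers, a memory-comparable encoding can be as simple as flipping the sign bit and writing the value big-endian, so that byte-wise comparison of the encodings agrees with numeric comparison of the values. A minimal Go sketch of this idea (the same basic trick TiDB's integer codec relies on, as far as I know; strings and other types need their own schemes):

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
    )

    // encodeInt produces a memory-comparable encoding of a signed integer:
    // flip the sign bit and write big-endian, so byte-wise comparison of the
    // encodings matches numeric comparison of the values.
    func encodeInt(v int64) []byte {
        var b [8]byte
        binary.BigEndian.PutUint64(b[:], uint64(v)^(1<<63))
        return b[:]
    }

    func main() {
        // compare(a, b) == compare(encode(a), encode(b))
        pairs := [][2]int64{{-5, 3}, {10, 11}, {100, 10}}
        for _, p := range pairs {
            c := bytes.Compare(encodeInt(p[0]), encodeInt(p[1]))
            fmt.Printf("%d vs %d -> byte compare %d\n", p[0], p[1], c)
        }
        // With this property, "age > 10" becomes a plain range scan over the
        // encoded keys, starting right after encode(10).
    }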
32. • Can we push down filters? • select count(*) from person where age > 20 and age < 30 • It should be much faster, maybe 100x: • Fewer RPC round trips • Less data transferred
33. (diagram: the TiDB server knows that Regions 1, 2, and 5 store the data of the person table, so it sends the predicate age > 20 and age < 30 down to Region 1 on TiKV Node 1, Region 2 on TiKV Node 2, and Region 5 on TiKV Node 3)
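Slide 33's pushdown can be modelled with a few goroutines: each region evaluates the filter locally and returns only a partial count, so the network carries three integers instead of every matching row. A toy Go sketch (not the real coprocessor protocol; the region type and data are made up):

    package main

    import (
        "fmt"
        "sync"
    )

    // region is a toy stand-in for one data region of the person table.
    type region struct {
        id   int
        ages []int // the column values stored in this region
    }

    // countWhere runs "count(*) where 20 < age < 30" locally inside the region.
    func (r region) countWhere() int {
        n := 0
        for _, age := range r.ages {
            if age > 20 && age < 30 {
                n++
            }
        }
        return n
    }

    func main() {
        regions := []region{
            {id: 1, ages: []int{18, 22, 25, 41}},
            {id: 2, ages: []int{29, 30, 27}},
            {id: 5, ages: []int{19, 21}},
        }

        var wg sync.WaitGroup
        partial := make([]int, len(regions))
        for i, r := range regions {
            wg.Add(1)
            go func(i int, r region) { // one RPC per region, executed in parallel
                defer wg.Done()
                partial[i] = r.countWhere()
            }(i, r)
        }
        wg.Wait()

        total := 0
        for _, c := range partial {
            total += c
        }
        fmt.Println("count(*) =", total) // 5 (22, 25, 29, 27, 21)
    }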
34. • We just build a protocol layer that is compatible with MySQL. Then we have all the MySQL drivers • All the tools • All the ORMs • All the applications • That's what TiDB does.
35. (architecture diagram: MySQL clients → optional load balancer → stateless TiDB servers (MySQL Protocol, TiDB SQL Layer, KV API and DistSQL API) → pluggable storage engine, e.g. TiKV, whose stack exposes the KV API, RawKV, and Coprocessor on top of Transaction, MVCC, Raft KV, and RocksDB; the Placement Driver sits alongside)
36. • TiSpark = Spark SQL on TiKV • SparkSQL directly on top of a distributed database storage • Hybrid Transactional/Analytical Processing (HTAP) rocks • Provide strong OLAP capacity together with TiDB • Spark ecosystem
37. (deployment diagram: the application and Syncer connect to the TiDB cluster; the TiDB servers and the Spark cluster running TiSpark (Spark Driver plus Workers) both obtain TSO / data locations and metadata from the PD cluster and access the TiKV cluster (storage) through the DistSQL API)
38. • TiKV Connector is better than JDBC connector • Index support • Complex calculation pushdown • CBO: pick up the right access path, join reorder • Priority & isolation level
41. (diagram: Cloud TiDB on Kubernetes; users and admins interact with the TiDB Operator, which runs a TiDB Cluster Controller plus PD, TiKV, TiDB, and GC controllers, a Volume Manager DaemonSet, and a TiDB Scheduler built from the Kube Scheduler plus a Scheduler Extender; a TiDB Cloud Manager exposes a RESTful interface with external service and load balancer managers; everything sits on the Kubernetes core: API Server, Scheduler, Controller Manager)
45. • Multi-tenant • Better Optimizer and Runtime • Performance Improvement • Document Store • Backup & Reload & Migration Tools
46. Q&A https://github.com/pingcap/tidb https://github.com/pingcap/tikv https://github.com/pingcap/pd https://github.com/pingcap/tispark https://github.com/pingcap/docs https://github.com/pingcap/docs-cn Contact Me: shenli@pingcap.com