NetEase, Sun Jianliang: NetEase's Next-Generation Object Storage Engine

carefulpeacock

Published 2017/11/14 in the Technology category

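NetEase's first-generation distributed object storage system dates from 2006, when it mainly supported NetEase's email and Internet services. As data volume kept growing, the original architecture and overall design could no longer support the expanding business and data, with shortcomings in data reliability, storage cost, operational difficulty, and performance. In 2014 the team began planning NetEase's next-generation object storage engine. After a survey, the team concluded that no open-source object storage system was a good enough fit for the next 5 to 10 years of growth, so, building on its existing supporting systems and drawing on the strengths of others, it built from scratch an object storage system suited to its own needs. This talk is the first public presentation of the key design points of NetEase's next-generation object storage engine, including: * Design goals * System architecture * Put, Get, Delete * Consistency protocol * Garbage collection * Data placement, data recovery, data durability * Physical deployment of large-scale storage * EC (erasure coding) * and more.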

Text content
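1. NetEase's Next-Generation Object Storage Engine • Sun Jianliang • SACC2017
2. About me • Sun Jianliang • NetEase • Image processing system • Small-file cache system • WAN upload acceleration system • Next-generation object storage engine • blog: work-jlsun.github.io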
3. Object Storage vs HDFS
4. HDFS • Summary ✓ Unstructured data in arbitrary formats ✓ Block, usually 64MB. ✓ Blocks are replicated. ✓ Write once (append allowed) ✓ (Often) collocated with compute capacity.
5. Object Store • What is an object store? ✓ Key ✓ Value ✓ Attribute ✓ Bucket ✓ RESTful HTTP: https://bucket.nos.netease.com/doc.txt, SDK
6. Object Store • Good things about object stores ✓ (Effectively) infinitely scalable: EB and beyond. ✓ Various security models: data is safe. ✓ Low cost, long term storage solution.
7. Object Store • Object Storage is Not a File System ✓ Write once: no append in place ✓ Usually eventually consistent ✓ No real directories
8. Outline • Background • Basic Arch • NEFS
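9. Object storage basic architecture [Diagram: bucket.nos.netease.com resolves through the DNS Service to a load-balance layer (Nginx, Nginx, Nginx) in front of a stateless Proxy Cluster offering an HTTP RESTful service (✓ PUT ✓ GET ✓ DELETE). The proxies reach K: MetaData through DBI in the stateful DDB Cluster and V: Data through FSI in the stateful DFS Cluster.]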
10. Background: DFS • Distributed framework ✓ Replica organization ✓ Data writes [Diagram: SN, MDS, and ZooKeepers nodes; the remaining labels are garbled in the source] Reference: "The Development Path of NetEase Cloud Storage Services"
11. Background • "Everything should be made as simple as possible, but not simpler" - Albert Einstein • Pros ✓ Simple, simple, simple ✓ Replication groups, consistency, 引 • Cons • it is simpler ✓ Performance ✓ Reliability ✓ Cost
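12. Design Goals ✓ Capacity: 100PB+ ✓ Workload: suited to both large and small files ✓ Durability: 8 nines, 11 nines ✓ Availability: rack awareness, highly available components, fewer dependencies ✓ Scale Easy: flexible, no performance impact, supports Rebalance ✓ MultiTenant: multi-tenancy ✓ Simple: Keep it Simple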
13. NEFS
14. Overview • Netease File System (NEFS) ✓ Key-Value Blob Storage ✓ Key: FID (16 Byte) (8+8) ✓ Value: Blob (an arbitrary-sized byte chunk) • Interface ✓ PutFile: User, Blob -> FID ✓ GetFile: FID -> Blob ✓ DeleteFile: FID -> bool ✓ GetFileInfo: FID -> FileStatus
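A minimal in-memory sketch of the four calls on the Overview slide. The slide gives only the signatures and the 16-byte (8+8) FID. The class name `InMemoryNefs`, the `FileStatus` fields, and the reading of 8+8 as a partition ID plus a per-partition sequence number are assumptions made for illustration, not the actual NEFS layout.

```python
import struct
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FileStatus:
    # Hypothetical fields; the slide does not say what FileStatus holds.
    size: int
    owner: str


class InMemoryNefs:
    """Toy stand-in for the PutFile/GetFile/DeleteFile/GetFileInfo interface."""

    def __init__(self, partition_id: int = 1) -> None:
        self._partition_id = partition_id
        self._next_seq = 0
        self._blobs: Dict[bytes, bytes] = {}
        self._status: Dict[bytes, FileStatus] = {}

    def put_file(self, user: str, blob: bytes) -> bytes:
        # Assumed FID layout: 8-byte partition ID + 8-byte sequence number.
        fid = struct.pack(">QQ", self._partition_id, self._next_seq)
        self._next_seq += 1
        self._blobs[fid] = blob
        self._status[fid] = FileStatus(size=len(blob), owner=user)
        return fid

    def get_file(self, fid: bytes) -> Optional[bytes]:
        return self._blobs.get(fid)

    def delete_file(self, fid: bytes) -> bool:
        # Returns True only if the FID existed, matching FID -> bool.
        self._status.pop(fid, None)
        return self._blobs.pop(fid, None) is not None

    def get_file_info(self, fid: bytes) -> Optional[FileStatus]:
        return self._status.get(fid)
```

The caller never chooses the key: `put_file` hands back the FID, which is what lets a store of this shape embed routing information in the key itself.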
15. Topology ✓ User ✓ Pool ✓ Zone ✓ Server ✓ Disk [Diagram: a User maps to Pools; each Pool contains zones; each zone contains servers; each server contains disks]
16. Architecture ✓ PS: Partition Server ✓ MDS, MySQL ✓ ZooKeeper ✓ FSI [Diagram: FSI, ZooKeeper, MDS backed by MySQL, and many PS instances grouped onto Nodes; legend: Control Flow, Data Flow]
17. [Diagram: FSI, MDS, MySQL (MetaData), and Partition Servers. PS A holds Partition X, made up of log files 00001-00.log, 01000-00.log, 03000-00.log, 04000-00.log that store f1 data, f2 data, f3 data, ……; PS B holds a Replica of Partition X]
18. MDS • Data location: Topology, (PartitionID -> "ps1-ps2-ps3") • Data distribution, placement, balancing
19. VS • Decentralized metadata ✓ Consistent hash & CRUSH ✓ Little metadata ✓ Not flexible enough: capacity expansion, data migration
20. Choose • Reality ✓ Metadata is small to begin with: 100PB needs only tens of MB ✓ Expand on demand; forced rebalance is not wanted
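The "tens of MB" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a partition size of 100 GB and about 40 bytes of routing metadata per partition. Neither number appears in the slides, so both are placeholders, and the `routing` table simply mirrors the PartitionID -> "ps1-ps2-ps3" form with made-up entries.

```python
PB = 10**15
GB = 10**9


def mds_metadata_bytes(capacity_bytes: int, partition_bytes: int, bytes_per_entry: int) -> int:
    """Rough size of a PartitionID -> replica-list table."""
    partitions = -(-capacity_bytes // partition_bytes)  # ceiling division
    return partitions * bytes_per_entry


# Hypothetical routing table in the slide's PartitionID -> "ps1-ps2-ps3" form.
routing = {1: "ps1-ps2-ps3", 2: "ps2-ps4-ps5"}


def locate(partition_id: int) -> list:
    """Return the partition servers that hold a partition."""
    return routing[partition_id].split("-")


# 100 PB / 100 GB = 1,000,000 partitions; at 40 bytes each that is 40 MB.
estimate = mds_metadata_bytes(100 * PB, 100 * GB, 40)
```

Under these assumptions the whole table fits comfortably in memory on a single MDS, which is the practical argument for keeping explicit metadata rather than computing placement with a hash.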
21. NEFS • Storage unit (Bitcask storage model) [Diagram: a Partition holds an index, several older data files each paired with a hint file, and one active data file with its hint file; a data file consists of a BlockFileHeader followed by LogEntry, LogEntry, ……, LogEntry]
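A minimal sketch of the Bitcask idea named on this slide: every write is appended to the active data file, and an in-memory index maps each key to the position of its latest value. The entry layout (two 4-byte lengths, then key, then value), the tombstone marker, and the class name `LogPartition` are assumptions for illustration; a `bytearray` stands in for the active data file.

```python
import struct
from typing import Dict, Optional, Tuple

HEADER = struct.Struct(">II")  # key length, value length (assumed entry layout)
TOMBSTONE = 0xFFFFFFFF         # value-length marker meaning "deleted" (assumption)


class LogPartition:
    def __init__(self) -> None:
        self.active = bytearray()                      # stands in for the active data file
        self.index: Dict[bytes, Tuple[int, int]] = {}  # key -> (value offset, value length)

    def put(self, key: bytes, value: bytes) -> None:
        self.active += HEADER.pack(len(key), len(value)) + key
        self.index[key] = (len(self.active), len(value))  # value starts here
        self.active += value

    def get(self, key: bytes) -> Optional[bytes]:
        loc = self.index.get(key)
        if loc is None:
            return None
        offset, length = loc
        return bytes(self.active[offset:offset + length])

    def delete(self, key: bytes) -> None:
        # Deletes are appended too; space is reclaimed later by compaction.
        self.active += HEADER.pack(len(key), TOMBSTONE) + key
        self.index.pop(key, None)

    def rebuild_index(self) -> Dict[bytes, Tuple[int, int]]:
        """Replay the log from the start, as recovery without a hint file would."""
        index: Dict[bytes, Tuple[int, int]] = {}
        pos = 0
        while pos < len(self.active):
            key_len, val_len = HEADER.unpack_from(self.active, pos)
            pos += HEADER.size
            key = bytes(self.active[pos:pos + key_len])
            pos += key_len
            if val_len == TOMBSTONE:
                index.pop(key, None)
            else:
                index[key] = (pos, val_len)
                pos += val_len
        return index
```

In the Bitcask model a hint file is a compact copy of this index for a closed data file, so that startup does not have to replay every entry as `rebuild_index` does here.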
22. Data replication • Consistency Algorithm ✓ Paxos 1990 ✓ PacificA 2008 ✓ Raft 2013 [Figure: Replicated State Machine Architecture]
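23. Basic PacificA [Diagram: a client sends a Write to the Primary; the Primary forwards the Write to each Backup, which records it in its prepare list and returns an Ack; the Commit point piggybacks on later messages]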
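The primary/backup write path in the diagram can be sketched as follows. This is a simplified, single-process model under stated assumptions: message passing is a direct method call, failures and reconfiguration are not modeled, and the class and method names are invented for illustration.

```python
from typing import Dict, List


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.prepared: Dict[int, bytes] = {}  # serial number -> request (prepare list)
        self.committed_sn = 0                 # highest serial number known committed

    def on_prepare(self, sn: int, request: bytes, primary_committed_sn: int) -> bool:
        self.prepared[sn] = request
        # The commit point rides along on the prepare message (piggyback).
        self.committed_sn = max(self.committed_sn, primary_committed_sn)
        return True  # ack


class Primary(Replica):
    def __init__(self, name: str, backups: List[Replica]) -> None:
        super().__init__(name)
        self.backups = backups
        self.next_sn = 1

    def write(self, request: bytes) -> int:
        sn = self.next_sn
        self.next_sn += 1
        self.prepared[sn] = request
        acks = [b.on_prepare(sn, request, self.committed_sn) for b in self.backups]
        if not all(acks):
            raise RuntimeError("a backup did not ack; reconfiguration would be needed")
        self.committed_sn = sn  # commit only after every replica has prepared
        return sn
```

Because the commit point is piggybacked, a backup learns that write 1 is committed only when the prepare for write 2 arrives, which saves a separate commit round trip.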
24. MemberShip Change [Diagram: the MDS drives Change Leader, Add Replica, and Remove Replica; Server m, Server p, and Server q each host a mix of Leader and Follower replicas of Partition1 and Partition2]
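25. PacificA vs Raft • Basic: PacificA Write-ALL, Raft 2/F +1 • MemberShip: PacificA relies on an external component, Raft relies on itself • Performance: PacificA Low, Raft High • Availability: PacificA Low, Raft High • Durability: PacificA High, Raft Low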
26. Choose • Reality • Write Any Replica (Partition) Group • MDS in System • Durability is more important than Performance • Easy Implementation
27. NEFS • Performance • Durability • Cost
28. NEFS • Performance • Durability • Cost
29. Performance ✓ Just One IO Per Write ✓ Big files split into 1MB slices ✓ IO optimization ✓ Limit Concurrent IO ✓ GroupCommit ✓ Delete Not Force Flush [Diagram: a Partition with its index, older data files and hint files, and the active data file and hint file; the write path is Append Write, then Lazy Update Index and Lazy Write Hint]
30. Performance ✓ Just One IO Per Write ✓ Big files split into 1MB slices ✓ IO optimization ✓ Limit Concurrent IO ✓ GroupCommit ✓ Delete Not Force Flush [Chart: quick hard-disk performance test]
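Two of the items above, 1MB slicing and GroupCommit, can be illustrated with a short sketch. It reads "1MB" as 2**20 bytes and models the log as an in-memory buffer with a flush counter; both choices, and the names `split_into_slices` and `GroupCommitLog`, are assumptions for illustration.

```python
from typing import List

SLICE_SIZE = 1 << 20  # 1 MiB, reading the slide's "1MB" as 2**20 bytes (assumption)


def split_into_slices(blob: bytes, slice_size: int = SLICE_SIZE) -> List[bytes]:
    """Cut a large blob into fixed-size slices; the last slice may be shorter."""
    return [blob[i:i + slice_size] for i in range(0, len(blob), slice_size)]


class GroupCommitLog:
    """Buffers pending records and makes them durable with one flush per batch."""

    def __init__(self) -> None:
        self.pending: List[bytes] = []
        self.durable = bytearray()  # stands in for the on-disk active data file
        self.flushes = 0            # each flush models one write + sync

    def submit(self, record: bytes) -> None:
        self.pending.append(record)

    def flush(self) -> int:
        """Persist every pending record in a single IO; return how many were written."""
        if not self.pending:
            return 0
        self.durable += b"".join(self.pending)
        self.flushes += 1
        count = len(self.pending)
        self.pending.clear()
        return count
```

The point of batching is that three concurrent writers pay for one disk sync instead of three, at the cost of a little added latency while the batch fills.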
31. NEFS • Performance • Durability • Cost
32. NEFS • Durability • AWS product line and SLA: S3 Standard, 11 nines; S3 Standard-IA, 11 nines; Glacier, 11 nines • With 10 billion files, at most 1 file may be lost per year
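33. Durability: influencing factors ✓ AFR: annual disk failure rate ✓ RepNum: storage replication factor ✓ T: recovery time for a failed disk ✓ S: number of CopySets in the system ✓ N: number of disks in the system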
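34. Durability: influencing factors ✓ AFR: annual disk failure rate ✓ RepNum: storage replication factor ✓ T: recovery time for a failed disk ✓ S: number of CopySets in the system ✓ N: number of disks in the system • "In a 3-replica storage system with 999 disks, what is the probability of data loss when three disks fail at the same time?" • Design 1: group the 999 disks into 333 replica groups of three disks each (disk1, disk1-copy2, disk1-copy3; disk2, disk2-copy2, disk2-copy3; ……; disk333, disk333-copy2, disk333-copy3). 333/C(999,3) ≈ 2.01e-06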
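35. Durability: influencing factors ✓ AFR: annual disk failure rate ✓ RepNum: storage replication factor ✓ T: recovery time for a failed disk ✓ S: number of CopySets in the system ✓ N: number of disks in the system • "In a 3-replica storage system with 999 disks, what is the probability of data loss when three disks fail at the same time?" • Design 2: scatter the data randomly across all 999 disks. C(999,3)/C(999,3) = 1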
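Both designs are instances of the same ratio: the number of CopySets S divided by the number of ways to pick RepNum disks out of N. The sketch below evaluates it for the two cases. Evaluating 333/C(999,3) directly gives roughly 2.0e-06. The function name is an assumption, and Design 2 is modeled as the limiting case in which every possible triple of disks holds some data.

```python
import math


def loss_probability(copysets: int, disks: int, replicas: int) -> float:
    """P(a uniformly random set of `replicas` failed disks is exactly one copyset)."""
    return min(1.0, copysets / math.comb(disks, replicas))


# Design 1: 333 fixed groups of three disks.
design_one = loss_probability(333, 999, 3)

# Design 2: fully random placement, so every triple is (eventually) a copyset.
design_two = loss_probability(math.comb(999, 3), 999, 3)
```

The trade-off is that fewer CopySets make a loss event rarer, while more CopySets let more disks take part in rebuilding a failed disk, which shortens the recovery time T.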
36. How to measure • How to quantify: discretize [Diagram: a 1-year timeline divided into consecutive intervals of length T, T, T, ……]
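37. How to measure • Pc = 1 - (1 - \sum_{k \in [r, n]} (probability that k disks fail within time T) \times (probability that k failed disks cause data loss))^{360*24/T} • The per-term expressions, which involve C(n,k), are not legible in the source • Reference: "Distributed storage reliability"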
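One way to turn the factors AFR, RepNum, T, S, and N into a number is sketched below. It is not a reconstruction of the slide's formula. It assumes that the number of disk failures within one interval T follows a Poisson distribution with mean AFR * N * T / (hours per year), and it bounds the chance that k failed disks cover at least one copyset by S * C(k, RepNum) / C(N, RepNum). The 360*24 hours per year follows the slide.

```python
import math


def annual_loss_probability(afr: float, n: int, r: int, s: int, t_hours: float,
                            hours_per_year: float = 360 * 24) -> float:
    """Estimated probability of losing data within a year (sketch, see assumptions)."""
    lam = afr * n * t_hours / hours_per_year  # expected failures per interval T
    if lam <= 0:
        return 0.0
    total = math.comb(n, r)
    p_interval = 0.0
    for k in range(r, n + 1):
        # Poisson P(k failures in T), computed in log space to avoid overflow.
        p_fail = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        # Union bound on "k failed disks contain at least one copyset".
        p_loss = min(1.0, s * math.comb(k, r) / total)
        p_interval += p_fail * p_loss
    if p_interval >= 1.0:
        return 1.0
    intervals = hours_per_year / t_hours
    # 1 - (1 - p)^intervals, written to stay accurate for very small p.
    return -math.expm1(intervals * math.log1p(-p_interval))
```

For example, `annual_loss_probability(0.04, 999, 3, 333, 24)` models a 4% AFR, 999 disks, 3 replicas, 333 CopySets, and a 24-hour recovery window; shrinking `t_hours` or `s` lowers the result, which matches the design levers listed on the slides.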
38. NEFS • Durability design considerations ✓ Number of replicas ✓ Recovery time ✓ Number of CopySets
39. NEFS • Recovery time ✓ Layout ✓ Replication unit placement & size ✓ Network IO rate limiting [Diagram: Pool 1 spans zone1, zone2, zone3 across Rack1, Rack2, Rack3; annotation: 40*2=80; the remaining labels are garbled in the source]
40. NEFS • Recovery time ✓ Layout ✓ Replication unit placement & size ✓ Network IO rate limiting
41. NEFS • Performance • Durability • Cost
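42. NEFS • Cost • Raise the replication factor • Replica technique • EC [Diagram: with replication, Files are stored as replica-1, replica-2, replica-3 on Storage Nodes; with EC, Files are stored as two kinds of blocks on Storage Nodes (block labels garbled in the source)]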
43. NEFS [Diagram: a Partition with older data files and an active data file; the older data files map to EC Blocks of two kinds, whose labels are garbled in the source]
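The cost argument for EC comes down to raw bytes stored per logical byte. The sketch below compares that overhead for replication and for a k data + m parity layout, and uses single XOR parity to show how a lost block is rebuilt. XOR parity is only the simplest erasure code and is chosen here for brevity; the slides do not state which code or which k and m NEFS uses, so the values below are illustrative.

```python
from functools import reduce
from typing import List


def storage_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for a (data + parity) layout."""
    return (data_blocks + parity_blocks) / data_blocks


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"abcd", b"efgh", b"ijkl"]   # three data blocks (illustrative)
parity = xor_blocks(data)            # one parity block

# Suppose data[1] is lost: XOR of the survivors and the parity restores it.
recovered = xor_blocks([data[0], data[2], parity])
```

Three-way replication corresponds to `storage_overhead(1, 2)`, which is 3.0, while the 3+1 layout above costs about 1.33, at the price of a more expensive read path during recovery.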
44. Summary • A scalable, highly available, log-based distributed Key-Value Blob Storage system. ✓ Key-Value: Put, Get, Delete ✓ Storage Engine: Log-Based (Bitcask) Storage Engine ✓ Strongly consistent (PacificA) ✓ Durability: 3-copy replication & Erasure Code ✓ It is Simple
45. Future Works • Load Balance • EC Enhance • Performance • Metric, Ops • Remove ZooKeeper & MySQL • ……
46. Hiring
47. SACC2017