Ozone – Next-Gen Storage for Data Lake
Junping Du
Head of Massive-Scale Storage and Computing, Tencent Big Data
ASF Member
About Me
Junping Du
● ASF Member, Apache Hadoop PMC Member & Committer
● Chair of Tencent Open Source Alliance
● Director @ Tencent Big Data
● Years of open source experience, focused on big data
Agenda
● ABC about Data Lake
● Overview of Apache Ozone
● Ozone Architecture, Design and Details
● Current Status, Work In Progress and Release Plan
● Ozone in Tencent Scenario
What is Data Lake? A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics - from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
Why Data Lake? BI Machine Learning Data IDE Data Management Data Lake Database Data Warehouse Object Store NoSQL
Core Features of Data Lake Solution
A Data Lake Solution includes Data Lake Storage, Data Lake Management and Data Lake Analytics.
• Data Lake Analytics: different patterns of analytic workloads to process different data
• Data Lake Management: metadata governance to manage the lifecycle of heterogeneous data
• Data Lake Storage: storage system to store different patterns of data, structured and unstructured
Difference compared to data warehouse
• Follows the natural structure of data
• Mainly used for ad-hoc queries and heterogeneous analytics
• Not good at routine data modeling, data mart and data governance
Scenario of Data Lake
• Ad-hoc queries to investigate the value of data
• Data exploration
• Interactive queries on heterogeneous data
[Figure labels: Data Scientist, Data Analyst, BI, Data Lake Analytics, Object Store, NoSQL Storage]
Target Persona
Data Analyst
• Create different analysis models
• Data modeling
• Data visualization
Data Engineer
• Data collection and injection
• ETL task scheduling
• Create layered data mart
• Data preprocessing, data governance
Data Scientist
• Interactive data exploration
• ML and DL
• Hyper-parameter tuning
Project Manager
• Project management
• User management
• Coordination
Theme of Data Lake Storage side Scalability Cloud Machine Learning
Retrospect for HDFS Architecture
● NN stores all metadata in memory
● Low-latency metadata operations
● Easy scaling: IO + PBs + clients
● Metadata in memory is both the strength and the weakness of HDFS
Why Ozone?
● HDFS has scaling problems
○ Some users have a "make your HDFS healthy" day.
○ 200 million files for regular users
○ 400-600 million files for companies with committers/core devs
● New opportunities and challenges
○ Cloud
○ Streaming
○ Small files are the norm
Scaling Challenges
When scaling a cluster up to 4K+ nodes with about 500M files:
○ Namespace metadata in NN
○ Block management in NN
○ File operation concurrency
○ Block report handling
○ Client/RPC 150K++
○ Slow NN startup
Small files in HDFS make things worse!
What is Apache Ozone?
● Object store for big data
● Scales in terms of both objects and IOPS.
● The NameNode is not a bottleneck anymore.
● A set of micro-services, each capable of doing its own stuff.
● Leverages learnings from supporting HDFS across a large set of use cases.
● Apache YARN, MapReduce, Spark and Hive are all tested and certified to work with Apache Ozone. No application changes are required to work with Ozone.
● Supports K8s and CSI, and can run on K8s natively.
● A spiritual successor to HDFS.
Ozone Architecture Overview
[Figure: three Ozone Managers (labeled NameNode’), the Storage Container Manager, and three Datanodes, each with Apache Ratis]
Ozone Manager
• Namespace layer for Ozone
• Manages objects in a flat namespace: Volume/Bucket/Key
• LSM-based K-V store for metadata: LevelDB/RocksDB/…
Benefits compared with the HDFS NameNode
• Easy to manage and scale: 1B keys tested in a single OM
• Scales independently of the block layer
• Easily sharded based on bucket
• No GC pressure: not all metadata is in memory
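A minimal sketch of what a flat Volume/Bucket/Key namespace over an ordered K-V store can look like. The class `FlatNamespace`, its methods and the `/volume/bucket/key` encoding are hypothetical illustrations, not the Ozone Manager API; a sorted in-memory list stands in for an LSM-based store such as RocksDB.

```python
import bisect


class FlatNamespace:
    """Toy ordered K-V metadata store (hypothetical, not the OM API)."""

    def __init__(self):
        self._keys = []    # sorted list of full key paths
        self._values = {}  # full key path -> metadata dict

    @staticmethod
    def full_key(volume, bucket, key):
        # One flat string per object; there is no directory tree to walk.
        return f"/{volume}/{bucket}/{key}"

    def put(self, volume, bucket, key, metadata):
        path = self.full_key(volume, bucket, key)
        if path not in self._values:
            bisect.insort(self._keys, path)
        self._values[path] = metadata

    def get(self, volume, bucket, key):
        return self._values.get(self.full_key(volume, bucket, key))

    def list_bucket(self, volume, bucket):
        # Listing a bucket is a range scan over the shared key prefix.
        prefix = f"/{volume}/{bucket}/"
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for path in self._keys[start:]:
            if not path.startswith(prefix):
                break
            result.append(path)
        return result


ns = FlatNamespace()
ns.put("vol1", "logs", "2020/01/app.log", {"size": 1024})
ns.put("vol1", "logs", "2020/02/app.log", {"size": 2048})
ns.put("vol1", "images", "cat.png", {"size": 512})
print(ns.list_bucket("vol1", "logs"))
```

Because every key carries its bucket as a prefix, all keys of one bucket sit next to each other in the sorted store, which is one way to picture why sharding by bucket is straightforward.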
HDDS Storage Containers • HDDS datanode is a plug-in service running in Datanodes • Container is basic unit of replication (2-16GB) • Fully distributed block metadata in LSM-based K-V store • No centralized block map in memory like HDFS Key Value store
Open/Close Containers
• Open Container: keys and their blocks are mutable
  • Writes go via Apache Ratis
• Closed Container: keys and their blocks are immutable
  • Writes are not allowed.
  • A container is closed when it is close to full or on failure.
[Figure: container state transition OPEN → CLOSED]
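The open/closed lifecycle can be sketched as a tiny state machine. Everything here is an illustrative assumption: the class `Container`, the 90% `close_threshold` and the method names are hypothetical and not taken from the Ozone code base.

```python
class Container:
    """Hypothetical container with an OPEN -> CLOSED lifecycle."""

    def __init__(self, container_id, capacity_bytes, close_threshold=0.9):
        self.container_id = container_id
        self.capacity_bytes = capacity_bytes
        self.close_threshold = close_threshold  # assumed value, not from the slides
        self.used_bytes = 0
        self.state = "OPEN"

    def write(self, num_bytes):
        # Only open containers accept writes; closed ones are immutable.
        if self.state != "OPEN":
            raise RuntimeError(
                f"container {self.container_id} is CLOSED; writes are not allowed"
            )
        self.used_bytes += num_bytes
        # Close the container once it is close to full.
        if self.used_bytes >= self.close_threshold * self.capacity_bytes:
            self.close()

    def close(self):
        # Also called on failure; the transition is one-way.
        self.state = "CLOSED"


c = Container(container_id=1, capacity_bytes=100)
c.write(50)
print(c.state)
c.write(45)
print(c.state)
```

After the second write the container has crossed the assumed threshold, so any further `write` call raises an error.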
HDDS Pipelines
• HDDS writes to containers using pipelines.
• HDDS uses an Apache Ratis based pipeline, which still appears as a replicated stream.
• Apache Ratis is a highly customizable Raft consensus protocol library.
• It supports high-throughput data replication use cases like Ozone.
[Figure: Leader DN replicating to DN 1 and DN 2]
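In Raft, an entry counts as committed once a majority of the group has stored it. A minimal sketch of that rule for a three-node pipeline follows; the function `is_committed` and the datanode names are hypothetical and this is not the Apache Ratis API.

```python
def is_committed(acks, replication_factor=3):
    """Return True once a strict majority of pipeline members acknowledged."""
    majority = replication_factor // 2 + 1
    # Duplicate acknowledgements from the same datanode count only once.
    return len(set(acks)) >= majority


print(is_committed(["dn1"]))
print(is_committed(["dn1", "dn2"]))
```

With three replicas the leader plus one follower is enough, so a single slow datanode does not block the write.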
Ozone - Write Path Create a file ● Blocks are allocated by OM/SCM. ● Blocks are written directly to data nodes ● Very similar to HDFS ● When a file is closed, it is visible for others to use.
Ozone - Read Path ● The client reads the block locations from OM. ● The client reads data directly from Datanodes. ● AKA, the same old HDFS protocol. ● Ozone relies on all things good in HDFS, including source code.
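The write and read paths can be sketched together as a toy simulation. All names (`MetadataService`, `Datanode`, `write_key`, `read_key`) are hypothetical, OM and SCM are folded into one object, and the client writes to every replica itself instead of going through a Ratis pipeline; it only illustrates that metadata and data travel on separate paths and that a key becomes visible on close.

```python
import itertools


class Datanode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class MetadataService:
    """Stands in for OM/SCM: allocates blocks and tracks visible keys."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self._block_ids = itertools.count(1)
        self.open_keys = {}       # key -> block ids, not yet visible
        self.committed_keys = {}  # key -> block ids, visible to readers

    def allocate_block(self, key):
        block_id = next(self._block_ids)
        self.open_keys.setdefault(key, []).append(block_id)
        return block_id, self.datanodes

    def commit_key(self, key):
        # Closing the file makes it visible for others to use.
        self.committed_keys[key] = self.open_keys.pop(key, [])

    def lookup(self, key):
        if key not in self.committed_keys:
            raise KeyError(key)
        return [(b, self.datanodes) for b in self.committed_keys[key]]


def write_key(meta, key, chunks):
    for chunk in chunks:
        block_id, nodes = meta.allocate_block(key)
        for dn in nodes:  # data goes straight to the datanodes
            dn.blocks[block_id] = chunk
    meta.commit_key(key)


def read_key(meta, key):
    data = b""
    for block_id, nodes in meta.lookup(key):
        data += nodes[0].blocks[block_id]  # read directly from a datanode
    return data


meta = MetadataService([Datanode("dn1"), Datanode("dn2"), Datanode("dn3")])
write_key(meta, "/vol1/bucket1/hello.txt", [b"hello ", b"ozone"])
print(read_key(meta, "/vol1/bucket1/hello.txt"))
```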
SCM – Storage Container Manager
• Node Management
  • Handle Node Report
• Container Management
  • Handle Container Report
• Replication Management
  • Replication of closed containers instead of blocks
  • Under/over-replicated containers
• Pipeline Management
  • Create/remove pipelines for open containers
• Security
  • Approve & issue X.509 certificates for OM/DNs
Heartbeat from Datanodes • A heartbeat from a Datanode contains many reports. • SCM decodes these into different reports. • Different handlers handle these reports inside SCM.
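One way to picture the decode-and-dispatch step is a small registry that routes each report in a heartbeat to the handlers for its type. The class `ReportDispatcher`, the report type strings and the dictionary heartbeat format are hypothetical, not the SCM implementation.

```python
class ReportDispatcher:
    """Routes each report inside a heartbeat to its registered handlers."""

    def __init__(self):
        self._handlers = {}  # report type -> list of callables

    def register(self, report_type, handler):
        self._handlers.setdefault(report_type, []).append(handler)

    def on_heartbeat(self, datanode_id, heartbeat):
        # A heartbeat is modeled as {report_type: payload}.
        handled = 0
        for report_type, payload in heartbeat.items():
            for handler in self._handlers.get(report_type, []):
                handler(datanode_id, payload)
                handled += 1
        return handled


node_reports = []
container_reports = []

dispatcher = ReportDispatcher()
dispatcher.register("node_report", lambda dn, r: node_reports.append((dn, r)))
dispatcher.register("container_report", lambda dn, r: container_reports.append((dn, r)))

dispatcher.on_heartbeat("dn1", {
    "node_report": {"capacity_gb": 4000, "used_gb": 1200},
    "container_report": {"containers": [1, 2, 3]},
})
print(node_reports, container_reports)
```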
Container State Manager
• Container State Manager learns about the state of containers via Container Reports.
• Container Manager maintains all state in two classes:
  • Node-to-Container Map: keeps track of which containers are on each node.
  • Container State Map: keeps all container states.
• If we lose a container, Container State Manager will detect that and queue replication requests to Replica Manager.
Node State Manager
• Node State Manager learns about the state of nodes via Node Reports.
• Node Manager keeps track of things like node liveness, for example whether a node is live or dead.
• If the node status changes, Node Manager fires off events like DEAD_NODE, STALE_NODE, etc.
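A minimal sketch of liveness tracking driven by heartbeat age. The thresholds, the class `NodeStateTracker` and the HEALTHY state name are assumptions for illustration; only the event names DEAD_NODE and STALE_NODE come from the slide.

```python
STALE_AFTER = 90.0   # seconds without heartbeat; assumed value
DEAD_AFTER = 600.0   # seconds without heartbeat; assumed value


class NodeStateTracker:
    """Derives node state from the time since the last heartbeat."""

    def __init__(self):
        self.last_seen = {}
        self.states = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now
        self.states[node] = "HEALTHY"

    def check(self, now):
        # Returns the events fired for nodes whose state changed.
        events = []
        for node, seen in self.last_seen.items():
            age = now - seen
            if age >= DEAD_AFTER:
                new_state = "DEAD"
            elif age >= STALE_AFTER:
                new_state = "STALE"
            else:
                new_state = "HEALTHY"
            if new_state != self.states[node]:
                self.states[node] = new_state
                if new_state != "HEALTHY":
                    events.append((f"{new_state}_NODE", node))
        return events


tracker = NodeStateTracker()
tracker.heartbeat("dn1", now=0.0)
print(tracker.check(now=100.0))
print(tracker.check(now=700.0))
```

Events are emitted only on a state change, so repeated checks of an already stale node do not flood downstream handlers.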
Replication Manager
• Container State Manager detects when a container is under- or over-replicated.
• Container State Manager posts events to Replica Manager asking for replication.
• Replica Manager maintains this state in replication pending queues as well as in-flight state.
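The under/over-replication check can be sketched by inverting a node-to-container map and comparing replica counts with the target. The function names and the replication factor of 3 are assumptions for illustration, not the SCM implementation.

```python
def replicas_by_container(node_to_containers):
    """Invert {node: {container ids}} into {container id: {nodes}}."""
    result = {}
    for node, container_ids in node_to_containers.items():
        for cid in container_ids:
            result.setdefault(cid, set()).add(node)
    return result


def plan_replication(container_replicas, closed_containers, replication_factor=3):
    """Return {container id: delta}; positive means copies to add,
    negative means excess copies to delete."""
    plan = {}
    for cid in sorted(closed_containers):
        delta = replication_factor - len(container_replicas.get(cid, set()))
        if delta != 0:
            plan[cid] = delta
    return plan


node_to_containers = {
    "dn1": {1, 2},
    "dn2": {1, 2},
    "dn3": {1},
    "dn4": {1},
}
replicas = replicas_by_container(node_to_containers)
print(plan_replication(replicas, closed_containers={1, 2, 3}))
```

In this example container 1 has one replica too many, container 2 is missing one, and container 3 has no reported replica at all.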
Pipeline Manager Pipeline Manager keeps track of open pipelines in the Cluster. Pipeline Manager keeps track of the health of the Pipeline. When and if a node goes offline, Pipeline manager closes Containers and Pipelines on that Node.
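A minimal sketch of the offline-node reaction described above: every open pipeline that includes the node is closed, together with its containers. The class `PipelineManager`, its methods and the data layout are hypothetical, not the SCM implementation.

```python
class PipelineManager:
    """Tracks open pipelines and closes those that lose a member."""

    def __init__(self):
        self.pipelines = {}  # pipeline id -> dict with nodes, containers, state

    def create(self, pipeline_id, nodes, containers):
        self.pipelines[pipeline_id] = {
            "nodes": set(nodes),
            "containers": set(containers),
            "state": "OPEN",
        }

    def on_node_offline(self, node):
        # Close each open pipeline on that node and report its containers,
        # which then need to be closed as well.
        containers_to_close = set()
        for pipeline in self.pipelines.values():
            if pipeline["state"] == "OPEN" and node in pipeline["nodes"]:
                pipeline["state"] = "CLOSED"
                containers_to_close |= pipeline["containers"]
        return containers_to_close


pm = PipelineManager()
pm.create("p1", ["dn1", "dn2", "dn3"], [10, 11])
pm.create("p2", ["dn4", "dn5", "dn6"], [12])
print(pm.on_node_offline("dn2"))
```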
Ozone Deployment Options
Multiple Ozone Managers + Multiple SCMs
When should we use Ozone?
● If you have a scale issue: files or throughput.
● If you need an archival store for HDFS or a large data store.
● If you need S3 or a cloud-like presence on-prem.
● If you want to set up dedicated storage clusters.
● If you have lots of small files.
● If you are moving to K8s and need a big data capable file system.
● If you are planning new HDFS deployments.
What are Ozone’s Microservices? ● Namenode or Ozone Manager which deals with file names. ● Block Server or SCM which deals with block allocation and Physical Servers. ● Recon Server - Control Plane. ● S3 Gateway ● Datanodes
Ozone’s scalability
● Ozone is designed for scale. The first release of Ozone will officially support 10 billion keys.
● Ozone achieves this by a combination of factors.
○ Partial namespace in memory: file system metadata is loaded on demand.
○ Off-heap memory usage: to avoid too much GC, we rely on off-heap native memory. This allows us to get away from GC issues.
○ Multiple Ozone Managers and Block Services: users can scale OM or SCM independently. The end-users will not even know, since the Ozone protocol does this scaling automatically.
○ Creating large aggregations of metadata called Storage Containers.
○ Distributing metadata more evenly across the cluster, including Datanodes.
○ Multiple OMs, which will also have the ability to read from the secondaries.
Ozone’s Data Locality Topology-Aware Read Ozone Hierarchical Topology Topology-Aware Container Replication
Ozone’s Security
● HDFS security is based on Kerberos.
● Kerberos cannot sustain the scale of applications running in a Hadoop cluster.
● So HDFS relies on delegation tokens and block tokens.
● Ozone uses the same, so applications need no change.
● SCM comes with its own Certificate Authority.
● End users do NOT need to know about it.
● Security is on by default, not an afterthought.
● HDDS-4 is merged into trunk; the next release will have security.
● First-class integration with Apache Ranger.
● TDE: Transparent Data Encryption support.
Ranger Service Definitions for Ozone Security
Ozone’s HA
● Like HDFS, Ozone will have HA.
● Unlike HDFS, HA is a built-in feature of Ozone.
● Users need to deploy three instances of OM/SCM. That is it.
● HA is automatic. Even when you run a single node, OM assumes it is in a single-node HA configuration.
Ozone’s Testing
● Ozone uses K8s based clusters for testing.
● Both long running and ephemeral clusters are regularly tested.
● Uses a load generator called Freon (earlier called Corona, after the chemical process that creates ozone).
● Apache Spark, YARN and Hive are used to run workloads against Ozone.
● S3AFileSystem and other open-source test suites are used to test S3 Gateway support.
● Blockade based tests make sure that error handling works and cluster level failures are tolerated.
Current State of Ozone - Road Map
● We are maintaining a two-month release cadence.
● Three alpha releases - done
○ Arches, Release 0.2.1 - Basic functionality
○ Acadia, Release 0.3.0 - S3 support
○ Release 0.4.0 - Security support, stability improvements
● Follow up with three betas
○ Beta 1, Release 0.5.0 (Crater Lake) - High Availability, first-class K8s support, topology awareness
○ Beta 2 - In-place upgrades, stability improvements
○ Beta 3 - Erasure Coding support
● GA
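Ozone in Tencent – TI-One (One-stop ML platform)
[Figure: TI-One platform stack]
• Layers: Scenarios, Models/Algos, ML Engines, Resource Mgmt, Storage
• Scenario and model labels: speech recognition, image recognition, machine learning algorithms, deep learning algorithms, precise recommendation, graph algorithms, real-time risk control, image models, speech models, time-series models, other application scenarios, video models, recommendation models
• Infrastructure labels: YARN, K8S, Ceph, HDFS, Ozone, Parquet, ORC, Delta Lake, Iceberg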
Questions? We are hiring! email@example.com