HAWQ (王伟珣)

素人渔夫

Published 2018/05/13 in the Technology category

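This year marks the sixth year of the Database Technology Conference China. The conference continues its mission of sharing best practices in IT applications. Following two main technical tracks, traditional databases and big data, and against the backdrop of rapid change in IT technology and management, it takes a deeper look at the current state and future direction of database technology, along with the practical experience and lessons we have gathered during this transition.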

Text content
2. HAWQ: MPP SQL for HDFS of Hadoop (massively parallel SQL on native Hadoop HDFS)
3. HAWQ Is The… Enterprise platform that provides the fewest barriers, lowest risk, and the most cost-effective and fastest way to enter into big data analytics on HDFS of Hadoop
4. HAWQ overview: Greenplum database re-platformed on Hadoop/HDFS. Multi-User Platform: Resource Queues, Concurrency, Data Encryption, Role-Based Security. SQL Engine: ANSI SQL 2003/2011 Support, Cost-Based Query Optimization, Robust Query Optimizer. Complex Data Management: Partitioning, Sub-Partitioning, Distributions. Accessibility: ODBC/JDBC Driver, Parallel Loading/Unloading, HDFS Native Formats, Extendable… Storage Options: Polymorphic Storage, Row/Columnar Storage, Built-in Compression, HDFS Native Formats, MapReduce Integration. Diagram labels: CPU, Mem, Disk, Users; L3,4; txt, Avro, Seq, HBase, Hive.
5. HAWQ advantages… • A massively parallel SQL engine (MPP SQL) that supports the native HDFS of Apache Hadoop • GPXF External Tables interface: transparent SQL access to all kinds of data on Hadoop (HDFS, HBase, Hive, Parquet format, and so on) • Also supports transparent SQL access to data in other formats over NFS and HTTP (customizable) • Performance and Scalability: Parallel Everything; Dynamic Pipelining; High Speed Interconnect (UDP-based); HDFS access with C++ libhdfs3; Co-Located Joins & Data Locality; Partition Elimination (supports static and dynamic table partitioning); Higher Cluster Utilization; Concurrency Control (priority-based scheduling of resources and jobs)
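Partition Elimination, listed among the performance features on slide 5, can be illustrated with a minimal sketch. It is not HAWQ's implementation: the `Partition` type, the quarterly layout, and the predicate bounds are all hypothetical, and the code only shows the general idea of skipping partitions whose key range cannot satisfy a range predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    lo: int  # inclusive lower bound of the partition key
    hi: int  # exclusive upper bound of the partition key


def eliminate_partitions(partitions, pred_lo, pred_hi):
    """Keep only partitions whose key range overlaps [pred_lo, pred_hi)."""
    return [p for p in partitions if p.lo < pred_hi and pred_lo < p.hi]


# Hypothetical table partitioned by month number, one partition per quarter.
PARTITIONS = [
    Partition("q1", 1, 4),
    Partition("q2", 4, 7),
    Partition("q3", 7, 10),
    Partition("q4", 10, 13),
]

if __name__ == "__main__":
    # Predicate "month >= 5 AND month < 8" only needs q2 and q3 to be scanned.
    selected = eliminate_partitions(PARTITIONS, 5, 8)
    print([p.name for p in selected])
```

With static elimination the bounds are known when the plan is built; with dynamic elimination the same overlap test would run once the bounds become known at execution time.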
6. HAWQ and the Hadoop software stack. Resource Management & Workflow: Yarn, Zookeeper. HAWQ (MPP SQL): ANSI SQL + Analytics, Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining. HDFS. Data Loader, Flume. Legend: Apache, HAWQ Added Value.
7. HAWQ and Hadoop HDFS (diagram labels: HAWQ Master, Master; Segment nodes, each paired with a Data node; Name, Name)
8. HAWQ and Hadoop HDFS data access flow (diagram labels: Master host; HAWQ Interconnect; Segment hosts, each running several Segments; Read/Write between Segments and Datanodes; replication between Datanodes; Rack1, Rack2; Meta Ops to the Namenode; HDFS)
9. HAWQ vs. Greenplum DB: basic architecture. SQL MPP (Massively Parallel Processing); Shared-Nothing Architecture; SQL, MapReduce. Master nodes: generate and dispatch query plans, and aggregate execution results. Network Interconnect. Segment nodes: execute query plans and manage data storage. External data sources: parallel loading or export. Database storage layer: Sharding + Replica.
10. Running SQL, with support for SQL2008 and OLAP options
11. HAWQ (SQL MPP) mechanism, part 1. Clients: JDBC/ODBC, SQL Console. Query: SELECT beer, price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = 'San Francisco'. HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode. HAWQ Segment Hosts (three shown, ...): each with Query Executor and HDFS Datanode.
12. HAWQ (SQL MPP) mechanism, part 2. Clients: JDBC/ODBC, SQL Console. HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode. Optimizer labels: Optimization Context, Parse Tree, Metadata, Cost Model, Resources. HAWQ Segment Hosts (three shown, ...): each with Query Executor and HDFS Datanode.
13. HAWQ (SQL MPP) mechanism, part 3. Clients: JDBC/ODBC, SQL Console. HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode. Execution Plan (operators in the order shown): Motion Gather; Project s.beer, s.price; HashJoin b.name = s.bar; Motion Redist(b.name); Scan Sells s; Filter b.city = 'San Francisco'; Scan Bars b. MapReduce analogy labels: Shuffle, Reduce, Map. HAWQ Segment Hosts (three shown, ...): each with Query Executor and HDFS Datanode.
14. HAWQ (SQL MPP) mechanism, part 4. Clients: JDBC/ODBC, SQL Console. HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode. Each of the three HAWQ Segment Hosts shown (...) now holds the same plan (Motion Gather; Project s.beer, s.price; HashJoin b.name = s.bar; Motion Redist(b.name); Scan Sells s; Filter b.city = 'San Francisco'; Scan Bars b) alongside its Query Executor and HDFS Datanode.
15. HAWQ (SQL MPP) mechanism, part 5. Clients: JDBC/ODBC, SQL Console. HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode. HAWQ Segment Hosts (three shown, ...): each with Query Executor and HDFS Datanode, with Dynamic Pipelining™ between the Query Executors.
16. HAWQ (SQL MPP) mechanism, part 6. Clients: JDBC/ODBC, SQL Console. HAWQ Master Host: Query Parser, Query Optimizer; HDFS Namenode. HAWQ Segment Hosts (three shown, ...): each with Query Executor and HDFS Datanode.
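The plan walked through on slides 11 to 16 (scan and filter Bars, redistribute on b.name, hash join against Sells on each segment, gather on the master) can be mimicked with a small single-process simulation. This is only a sketch under stated assumptions: the sample rows, the three-segment layout, the toy `segment_for` hash, and the premise that Sells is already distributed by `bar` are all hypothetical and do not come from the slides.

```python
from collections import defaultdict

NUM_SEGMENTS = 3  # assumption: three segments, as drawn on the slides


def segment_for(key):
    # Toy deterministic hash; a real system would use its own hash function.
    return sum(key.encode("utf-8")) % NUM_SEGMENTS


# Hypothetical sample data.
BARS = [  # (name, city)
    ("Toronado", "San Francisco"),
    ("Zeitgeist", "San Francisco"),
    ("McSorley's", "New York"),
]
SELLS = [  # (bar, beer, price)
    ("Toronado", "Pliny", 7.0),
    ("Zeitgeist", "Anchor", 5.0),
    ("McSorley's", "Dark Ale", 6.0),
    ("Toronado", "Anchor", 6.0),
]


def run_query():
    # Assumption: Sells is stored distributed by bar, Bars is spread round-robin.
    sells_on = defaultdict(list)
    for row in SELLS:
        sells_on[segment_for(row[0])].append(row)
    bars_on = defaultdict(list)
    for i, row in enumerate(BARS):
        bars_on[i % NUM_SEGMENTS].append(row)

    # Scan Bars, Filter b.city, then Motion Redist(b.name) to the owning segment.
    redistributed = defaultdict(list)
    for seg in range(NUM_SEGMENTS):
        for name, city in bars_on[seg]:
            if city == "San Francisco":
                redistributed[segment_for(name)].append(name)

    # On each segment: Scan Sells, HashJoin b.name = s.bar, Project beer, price.
    gathered = []
    for seg in range(NUM_SEGMENTS):
        build_side = set(redistributed[seg])
        for bar, beer, price in sells_on[seg]:
            if bar in build_side:
                gathered.append((beer, price))  # Motion Gather to the master
    return sorted(gathered)


if __name__ == "__main__":
    print(run_query())
```

Because both join inputs end up hashed on the same key, each segment can join only its own rows, which is what lets the segments run the same plan in parallel.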
17. Data Distribution • Data can be distributed based on a column or a composite of columns • Tables distributed similarly are co-located • Distribution scheme can be modified through ALTER TABLE. Advantages: • Co-located joins • No data movement on joins or aggregates • Improved performance on complex queries • Query engine optimization. Diagram: Table A (X=1, X=2, X=3, X=4, X=5) and Table B (Y=1, Y=2, Y=3) spread across DN1, DN2, DN3. Example queries: SELECT X FROM A,B WHERE A.X = B.Y; SELECT SUM(X) FROM A GROUP BY A.X
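The co-location claim on slide 17 can be checked with a short sketch. It assumes a toy placement rule (key modulo the number of segments) and invented rows for Table B's second column; it is meant to show why tables distributed on their join keys need no data movement, not how HAWQ places data.

```python
NUM_SEGMENTS = 3  # DN1, DN2, DN3 on the slide


def segment_for(value):
    # Toy placement rule for integer keys (an assumption, not HAWQ's hash).
    return value % NUM_SEGMENTS


def distribute(rows, key_index):
    """Place each row on the segment chosen by its distribution column."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


def rows_to_move_for_join(segments, join_key_index):
    """Count rows not already on the segment their join key maps to."""
    return sum(
        1
        for seg_id, rows in enumerate(segments)
        for row in rows
        if segment_for(row[join_key_index]) != seg_id
    )


def local_join(a_segments, b_segments):
    """Join A.X = B.Y segment by segment, with no cross-segment traffic."""
    result = []
    for a_rows, b_rows in zip(a_segments, b_segments):
        b_keys = {row[0] for row in b_rows}
        result.extend(row[0] for row in a_rows if row[0] in b_keys)
    return sorted(result)


TABLE_A = [(x,) for x in range(1, 6)]  # X = 1..5, as on the slide
TABLE_B = [(1, 7), (2, 9), (3, 4)]     # (Y, Z); the Z values are hypothetical

if __name__ == "__main__":
    a_by_x = distribute(TABLE_A, 0)
    b_by_y = distribute(TABLE_B, 0)
    print(local_join(a_by_x, b_by_y))               # SELECT X FROM A,B WHERE A.X = B.Y
    print(rows_to_move_for_join(b_by_y, 0))         # B distributed on Y
    print(rows_to_move_for_join(distribute(TABLE_B, 1), 0))  # B distributed on Z
```

When B is distributed on a column other than the join key, some rows have to be redistributed before the join, which is the data movement that co-location avoids.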
18. Data interconnection framework (Xtension Framework). Xtension Framework: HDFS, HBase, Hive • An advanced version of Greenplum DB external tables • Enables combining HAWQ data and Hadoop data in a single query • Supports connectors for HDFS, HBase and Hive • Provides an extensible framework API to enable custom connector development for other data sources
19. Data loading and unloading (Loading/Unloading Data). Data loading and unloading in HAWQ remain fully parallel. gpload, gpfdist, External Tables: Flat Files, CSV, Delimited, …; Existing RDBMS Systems; Web Tables, JSON, XML, HTML, …; Executing Scripts, … DataLoader: File Farms; Streaming; Batch Mode; Flume, … integration; Throttling, Compression, … features. PXF {Native Hadoop Files}: HDFS Flat Files, CSV, Delimited, …; Hive; HBase {w. predicate push-down}; Avro, RCFile, SeqFile; Open extendable API; Available on Github: Accumulo, JSON, … Spring XD: Java Development Framework. Traditional Tools: Postgres insert, copy, …; ODBC + JDBC drivers; Pivotal Data Dispatch {PDD}; Integration with ETL tools…
20. HAWQ External Tables. gpload, gpfdist, External Tables: Flat Files, CSV, Delimited, …; Existing RDBMS Systems; Web Tables, JSON, XML, HTML, …; Executing Scripts, …
21. HAWQ and Hadoop Native File Formats. PXF {Pivotal eXtension Framework}: HDFS Flat Files, CSV, Delimited, …; Hive; HBase {predicate push-down}; Avro, RCFile, SeqFile; Open extendable API; Available on Github: Accumulo, JSON, …; Read/Write
22. A more powerful resource manager, compatible with YARN. Workloads: HAWQ Queries; MapReduce, Pig, Hive. Resource shares: MAPRED (%), HAWQ (%). HAWQ: Resource Queue 1, Resource Queue 2, Resource Queue … YARN - MAPREDUCE: M M M R R. Stack: YARN NODEMANAGER, HDFS DATANODE, OPERATING SYSTEM. Resource queue controls: Memory Consumption % (divide system memory for resource queue), CPU Utilization (HIGH, MED, LOW), # of Disk Operations.
23. Controllable runtime resources (Dynamic Resource Allocation)
24. SQL for Hadoop feature comparison
Feature | Hive | Impala | HAWQ
Work with HDFS native file formats | ✓ | ✓ | ✓
Polymorphic Storage | ✓ | ✓ | ✓
Advanced SQL (ANSI SQL2008 & OLAP support) | ✖ | ✖ | ✓
Partitions and compression | ✓ | ✓ | ✓
Data Locality | ✓ | ✓ | ✓
Distributions, Join, Aggregate Locality | ✖ | ✖ | ✓
Join Optimization | ✖ | ✖ | ✓
Spill to disk (query must fit in memory) | ✓ | ✖ | ✓
Fault tolerance during large query execution | ✓ | ✖ | ✖
Granular Security and authentication | ✖ | ✓ | ✓
Extendable (Serdes) | ✓ | ✖ | ✓
Resource Management | ✖ | ✖ | ✓
Open-source code | ✓ | ✓ | ✓
25. HAWQ performance comparison, part 1. User intelligence: 4.2 vs 37 (9X). Sales analysis: 8.7 vs 596 (69X). Click analysis: 2.0 vs 50 (25X). Data exploration: 2.7 vs 55 (20X). BI drill down: 2.8 vs 59 (21X).
26. HAWQ performance comparison, part 2. User intelligence: 4.2 vs 198 (47X). Sales analysis: 8.7 vs 161 (19X). Click analysis: 2.0 vs 415 (208X). Data exploration: 2.7 vs 1,285 (476X). BI drill down: 2.8 vs 1,815 (648X).
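The multiples on slides 25 and 26 appear to be the ratio of each larger figure to the smaller one for the same workload. The sketch below checks that reading; the pairing of figures by their listing order is an assumption, and the slides do not state the units or which systems the larger figures belong to.

```python
import math

# (workload, smaller figure, slide 25 figure, slide 25 multiple,
#  slide 26 figure, slide 26 multiple), paired by listing order (assumption).
ROWS = [
    ("User intelligence", 4.2, 37, 9, 198, 47),
    ("Sales analysis", 8.7, 596, 69, 161, 19),
    ("Click analysis", 2.0, 50, 25, 415, 208),
    ("Data exploration", 2.7, 55, 20, 1285, 476),
    ("BI drill down", 2.8, 59, 21, 1815, 648),
]


def multiple(base, other):
    # Ratio rounded half up to the nearest whole multiple.
    return math.floor(other / base + 0.5)


def check_rows(rows):
    """Return the workloads whose listed multiples disagree with the ratios."""
    mismatches = []
    for name, base, fig25, mult25, fig26, mult26 in rows:
        if multiple(base, fig25) != mult25 or multiple(base, fig26) != mult26:
            mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    print(check_rows(ROWS))
```

Under that pairing every listed multiple matches the rounded ratio, which supports reading the first column of figures as the common baseline for both slides.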
27. Selected application cases
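28. An enterprise solution (architecture diagram). Users: operations staff, control staff, business staff, managers, decision makers, data scientists. User access layer. Data application layer: historical query application, big data analysis application, real-time analysis application, management analysis application, sandbox simulation application. Data storage and computation layer: historical archive data area; big data area with a big data storage area (social media, user reviews, access logs, mobile internet, audio and video, ……); structured data area with a summary area (business units, product categories, users, ……), a subject data area (user subject, marketing subject, product subject, ……) and a detail area; real-time data area (real-time inventory change data, user orders, real-time data); management analysis application data area (customer management, financial management, performance management, supply chain management, ……); sandbox simulation data area. Data exchange layer: unstructured data exchange, structured data exchange, exchange between data areas. Data generation layer: semi-structured and unstructured data from inside and outside the enterprise; xx Mall, E-Hub system, E-Store system, SCRM system, …… system. Side label: process scheduling and control layer.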
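29. Big data analysis application: User Radar. Product analysis (ODBC), business entity analysis (ODBC), competitor analysis (ODBC). Hadoop, HAWQ: User Radar data mart. User Radar analyzer. Sources: JD.com, Taobao, Suning, Gome, eTao, online mall, official website. Product master data.
30. Solution architecture diagram. spring xd admin, spring xd container, spring xd container; ZooKeeper; Gemfire: gemfire xd locator, gemfire xd server, gemfire xd server; HDFS; HAWQ master, HAWQ segment, HAWQ segment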