National University of Singapore Michael Franklin - Big Data Software: What's Next

綦元枫

Published 2017/12/18 in the Technology category

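The theme of the 2017 China Big Data Technology Conference (BDTC) is “Big Data and Intelligence.” The three-day conference will be held December 7-9 at the Crowne Plaza hotel in Beijing (北京新云南皇冠假日酒店). It will feature in-depth discussion of how industries across society are becoming more intelligent in the big data era, along with their practical experience. In addition to the keynotes, the organizers have planned dozens of technical and industry forums covering big data analytics and ecosystems, databases, big data cloud services, machine learning and deep learning, knowledge graphs, blockchain, recommender systems, financial big data, transportation and tourism big data, industrial and manufacturing big data, precision medicine big data, and big data security, policy, and regulation. Notably, the authors of the Top 10 best big data application case studies will also present their work on site.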

Text content
1. Big Data Software: What’s Next? Michael Franklin BDTC Beijing December 2017
2. Big Data = Nearly every field of endeavor is transitioning from “data poor” to “data rich”
• Astronomy: LSST
• Physics: LHC
• Oceanography
• Neuroscience: EEG, fMRI
• Sociology: The Web
• Biology: Sequencing
• Economics: mobile, POS terminals
• Data-Driven Medicine
• Sports
3. Open Source Ecosystem & Context
4. Open Source Ecosystem & Context 2006-2010 Autonomic Computing & Cloud Usenix HotCloud Workshop 2010
5. Open Source Ecosystem & Context 2006-2010 Autonomic Computing & Cloud Usenix HotCloud Workshop 2010 UC BERKELEY 2011-2016 Big Data Analytics
6. Spark’s Philosophy
• Specializing MapReduce leads to incompatible, stovepiped systems
• Instead, generalize MapReduce
[Diagram: SparkSQL, Streaming, GraphX, MLbase, … on Spark]
7. Spark’s Philosophy
• Specializing MapReduce leads to incompatible, stovepiped systems
• Instead, generalize MapReduce:
1. Richer Programming Model → More operators than map and reduce
[Diagram: SparkSQL, Streaming, GraphX, MLbase, … on Spark]
8. Spark’s Philosophy
• Specializing MapReduce leads to incompatible, stovepiped systems
• Instead, generalize MapReduce:
1. Richer Programming Model → More operators than map and reduce
2. Memory Management → Less data movement leads to better performance for complex analytics
[Diagram: SparkSQL, Streaming, GraphX, MLbase, … on Spark]
9. Berkeley Data Analytics Stack
[Diagram layers: In House Applications (Genomics, IoT, Energy, Cosmology); Access and Interfaces; Processing Engines; Storage; Resource Virtualization]
10. Apache Spark Meetups (Dec 2017) 635 groups with 452,749 members spark.meetup.com
11. Memory Mgmt in Hadoop MR
[Diagram: Training Data (HDFS) feeding repeated Map and Reduce stages]
12. Memory Management in Spark
[Diagram: Training Data (HDFS) loaded and cached, then reused by Map and Reduce stages]
13. Memory Management in Spark
[Diagram: Training Data (HDFS) feeding Map and Reduce stages]
Efficiently move data between Map stages
14. Memory Management in Spark
[Diagram: Training Data (HDFS) feeding Map and Reduce stages]
Efficiently move data between Map stages
10-100× speed up vs. Hadoop MapReduce with no HDFS data migration needed
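The difference between the two memory-management slides can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: `CountingSource` is a hypothetical stand-in for an expensive HDFS read, and the training loop is a toy gradient descent that estimates a mean. It does not reproduce the 10-100× figure; it only shows why an iterative job that keeps its input in memory touches storage once instead of once per iteration.

```python
from typing import List


class CountingSource:
    """Hypothetical stand-in for an expensive read from distributed storage."""

    def __init__(self, rows: List[float]) -> None:
        self._rows = rows
        self.loads = 0  # how many times the data was read from "storage"

    def load(self) -> List[float]:
        self.loads += 1
        return list(self._rows)


def fit_mean_reloading(source: CountingSource, steps: int, lr: float) -> float:
    """MapReduce-style loop: every iteration re-reads the training data."""
    w = 0.0
    for _ in range(steps):
        data = source.load()
        grad = sum(w - x for x in data) / len(data)
        w -= lr * grad
    return w


def fit_mean_cached(source: CountingSource, steps: int, lr: float) -> float:
    """Spark-style loop: load once, keep the data in memory, iterate over it."""
    data = source.load()
    w = 0.0
    for _ in range(steps):
        grad = sum(w - x for x in data) / len(data)
        w -= lr * grad
    return w
```

Both functions perform the same arithmetic and return the same value; only the number of `load()` calls differs (one per step versus one in total).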
15. Lineage for Fault Tolerance
RDDs: Immutable collections of objects that can be stored in memory or disk across a cluster
– Built via parallel transformations (map, filter, …)
– Automatically rebuilt on (partial) failure
M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI 2012.
16. Lineage for Fault Tolerance
RDDs: Immutable collections of objects that can be stored in memory or disk across a cluster
– Built via parallel transformations (map, filter, …)
– Automatically rebuilt on (partial) failure
messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
[Diagram: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI 2012.
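The lineage idea on these slides can be sketched with a toy, single-machine class. This is not Spark's RDD implementation: `ToyRDD`, `from_list`, and `lose_cache` are hypothetical names, and the per-node cache merely stands in for materialized partitions. The point it illustrates is that each dataset remembers its parent and the function that produced it, so lost results can be recomputed rather than restored from replicas.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Toy immutable dataset that records how it was derived (its lineage)."""

    def __init__(self, compute: Callable[[], List[Any]],
                 parent: Optional["ToyRDD"] = None, op: str = "source") -> None:
        self._compute = compute
        self.parent = parent
        self.op = op
        self._cache: Optional[List[Any]] = None  # stands in for stored partitions

    @staticmethod
    def from_list(rows: List[Any]) -> "ToyRDD":
        return ToyRDD(lambda: list(rows), op="source")

    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(lambda: [f(x) for x in self.collect()], self, "map")

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)], self, "filter")

    def collect(self) -> List[Any]:
        # If the materialized data is missing, rebuild it from the lineage.
        if self._cache is None:
            self._cache = self._compute()
        return list(self._cache)

    def lose_cache(self) -> None:
        """Simulate a failure that drops the in-memory data for this node."""
        self._cache = None

    def lineage(self) -> List[str]:
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


# Mirrors the slide's example: keep error lines, extract the third tab-separated field.
lines = ToyRDD.from_list([
    "info\tweb\tstarted",
    "error\tdb\tdisk full",
    "error\tweb\ttimeout",
])
errors = lines.filter(lambda line: "error" in line)
messages = errors.map(lambda line: line.split("\t")[2])
```

Calling `messages.collect()`, then `lose_cache()` on `messages` and `errors`, then `collect()` again yields the same rows, because the chain source → filter → map is replayed.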
17. SQL and DataFrame Support
SQL increasingly supported by Big Data platforms: Apache Drill, Flink, Hive, Kafka, Spark, Cloudera Impala, HAWQ, IBM Big SQL, Presto, …
Spark supports SQL and also “DataFrames”:
people.filter("age > 30")
  .join(dept, people("deptId") === dept("id"))
  .groupBy(dept("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))
M. Armbrust, et al, Spark SQL: Relational Data Processing in Spark, SIGMOD 2015.
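To make the semantics of the DataFrame expression on this slide concrete, here is a plain-Python sketch of what it computes: filter, inner join, group, then aggregate. It does not use Spark; the function name and the toy rows are assumptions made for illustration, and the field names (`age`, `deptId`, `gender`, `salary`, `id`, `name`) are taken from the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def avg_salary_and_max_age(people: List[dict],
                           dept: List[dict]) -> Dict[Tuple[str, str], Tuple[float, int]]:
    """Return {(dept name, gender): (avg salary, max age)} for people older than 30."""
    dept_name = {d["id"]: d["name"] for d in dept}
    groups = defaultdict(list)
    for p in people:
        # filter("age > 30") followed by an inner join on deptId == id
        if p["age"] > 30 and p["deptId"] in dept_name:
            groups[(dept_name[p["deptId"]], p["gender"])].append(p)
    # groupBy(dept name, gender).agg(avg(salary), max(age))
    return {
        key: (sum(r["salary"] for r in rows) / len(rows), max(r["age"] for r in rows))
        for key, rows in groups.items()
    }


# Hypothetical toy data.
people = [
    {"name": "Ann", "age": 35, "gender": "F", "deptId": 1, "salary": 100},
    {"name": "Bob", "age": 45, "gender": "M", "deptId": 1, "salary": 80},
    {"name": "Cat", "age": 28, "gender": "F", "deptId": 1, "salary": 90},
    {"name": "Dan", "age": 50, "gender": "M", "deptId": 2, "salary": 120},
    {"name": "Eve", "age": 41, "gender": "F", "deptId": 1, "salary": 120},
]
dept = [{"id": 1, "name": "Eng"}, {"id": 2, "name": "Ops"}]
result = avg_salary_and_max_age(people, dept)
```

In Spark the same logical plan is handed to the optimizer rather than executed eagerly in this order; the sketch only shows the result the query is expected to produce.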
18. SparkSQL/Catalyst Optimizer
• Typical DB optimizations across SQL and Dataframes
• Extensibility via Optimization Rules written in Scala
• Open Source optimizer evolution!
• Code Generation (inner loops and iterator removal)
• Cost-based (as of V2.2)
[Figure: Simple Aggregation Query]
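The "optimization rules" bullet can be illustrated with a small rule-based rewriter. This is a sketch of the general technique only, not Catalyst's code: in Catalyst the rules are Scala functions over plan and expression trees, whereas here expressions are hypothetical Python tuples `(op, left, right)` and the two rules (constant folding and identity removal) are chosen purely as examples. Rules are applied bottom-up and repeated until the tree stops changing.

```python
from typing import Callable, List, Union

# An expression is an int literal, a column name, or a tuple (op, left, right).
Expr = Union[int, str, tuple]


def fold_constants(e: Expr) -> Expr:
    """Replace an operation on two literals with its value."""
    if isinstance(e, tuple):
        op, left, right = e
        if isinstance(left, int) and isinstance(right, int):
            return left + right if op == "+" else left * right
    return e


def drop_identity(e: Expr) -> Expr:
    """Rewrite x + 0 and x * 1 (in either order) to x."""
    if isinstance(e, tuple):
        op, left, right = e
        if op == "+" and right == 0:
            return left
        if op == "+" and left == 0:
            return right
        if op == "*" and right == 1:
            return left
        if op == "*" and left == 1:
            return right
    return e


def transform_up(e: Expr, rule: Callable[[Expr], Expr]) -> Expr:
    """Apply a rule to the children first, then to the node itself."""
    if isinstance(e, tuple):
        op, left, right = e
        e = (op, transform_up(left, rule), transform_up(right, rule))
    return rule(e)


def optimize(e: Expr, rules: List[Callable[[Expr], Expr]], max_passes: int = 10) -> Expr:
    """Run all rules repeatedly until a fixed point (or the pass limit) is reached."""
    for _ in range(max_passes):
        before = e
        for rule in rules:
            e = transform_up(e, rule)
        if e == before:
            break
    return e


# (x * (1 + 0)) + (2 * 3) simplifies to x + 6.
plan = ("+", ("*", "x", ("+", 1, 0)), ("*", 2, 3))
optimized = optimize(plan, [fold_constants, drop_identity])
```

Adding a new optimization means appending another small function to the rule list, which is the kind of extensibility the slide attributes to rules written in Scala.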
20. Putting it all Together: Multimodal Analytics (SQL, Machine Learning, Streaming)
23. Multimodal Advanced Analytics From: Spark User Survey 2016, 1615 respondents from 900 organizations http://go.databricks.com/2016-spark-survey
25. What Do Users Want? From: Spark User Survey 2016, 1615 respondents from 900 organizations http://go.databricks.com/2016-spark-survey
26. Maslow’s Hierarchy of Analytics?
[Pyramid labels: Safety, Ease of Deployment, Ease of Development, Advanced Analytics, Performance]
27. What’s Next?
• Rapidly changing hardware means that there is still a lot of research to be done in performance, scalability and fault tolerance
• Likewise, new analytics approaches and AI techniques (e.g., Deep Learning) are becoming increasingly mainstream
• Lots of work to be done in these areas, but…
28. as we Move Up the Hierarchy… a new set of concerns moves to the fore:
1) Reducing Friction: Ease of Development and Deployment
2) Data Science/Analytics Full Lifecycle Concerns
3) “Safe” Data Science and Human Factors
29. Database Systems: One way in/out SQL Compiler Relational Dataflow Row/Col Store Adapted from Mike Carey, UCI
32. Database Systems: One way in/out SELECT FROM WHERE SQL Compiler Relational Dataflow Row/Col Store Adapted from Mike Carey, UCI
34. Reducing Data Friction
[Diagram: SQL, Dataframes, R, Graph, MLlib, Streams on a Big Data Platform (e.g., Spark) over HDFS, S3, MongoDB]
Adapted from Mike Carey, UCI
35. Reducing Friction – Schema
• Traditional approach – Schema First
• Alternative – “Schema on Read”, aka Data Lake or Dataspace
• Data Integration remains a “Wicked Problem”
36. Data Lakes (a.k.a. “Dataspaces”)
[Figure: Functionality vs. Time (and cost); curve shown: Unstructured (schema-less)]
Structure enables computers to improve performance and help users access and maintain data.
Franklin, Halevy, Maier, “From Databases to Dataspaces: A New Paradigm for Information Management”, SIGMOD Record 2005.
37. Data Lakes (a.k.a. “Dataspaces”)
[Figure: Functionality vs. Time (and cost); curves shown: Structured (schema-first), Unstructured (schema-less)]
Structure enables computers to improve performance and help users access and maintain data.
Franklin, Halevy, Maier, “From Databases to Dataspaces: A New Paradigm for Information Management”, SIGMOD Record 2005.
38. Data Lakes (a.k.a. “Dataspaces”)
[Figure: Functionality vs. Time (and cost); curves shown: Data Lakes/Spaces (Flexible Schema), Structured (schema-first), Unstructured (schema-less)]
Structure enables computers to improve performance and help users access and maintain data.
Franklin, Halevy, Maier, “From Databases to Dataspaces: A New Paradigm for Information Management”, SIGMOD Record 2005.
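The contrast between "schema first" and "schema on read" can be sketched as follows. The records, field names, and both function names are hypothetical; the sketch only shows where the schema is enforced: at load time, rejecting anything that does not fit, or at query time, projecting and casting whatever raw data was stored.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical raw records as they might land in a data lake.
RAW = [
    '{"id": 1, "name": "sensor-a", "temp_c": 21.5}',
    '{"id": 2, "name": "sensor-b"}',
    '{"id": 3, "name": "sensor-c", "temp_c": "19", "site": "lab"}',
]


def schema_first_load(lines: List[str],
                      required: Dict[str, type]) -> Tuple[List[dict], List[dict]]:
    """Schema first: accept only records whose fields and types match exactly."""
    accepted, rejected = [], []
    for line in lines:
        rec = json.loads(line)
        ok = set(rec) == set(required) and all(
            isinstance(rec[field], typ) for field, typ in required.items()
        )
        (accepted if ok else rejected).append(rec)
    return accepted, rejected


def read_with_schema(lines: List[str],
                     schema: Dict[str, Callable]) -> List[dict]:
    """Schema on read: keep raw data as-is, project and cast when querying."""
    rows = []
    for line in lines:
        rec = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = rec.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


accepted, rejected = schema_first_load(RAW, {"id": int, "name": str, "temp_c": float})
temps = read_with_schema(RAW, {"id": int, "temp_c": float})
```

With these toy records, the schema-first load keeps one record and rejects two, while the schema-on-read query returns a row for every record and leaves the missing temperature as `None`. The flexibility comes at the price of pushing interpretation, and any data-quality surprises, to query time.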
39. Machine Learning Pipelines
• Data Analytics is a complex process
• Rare to simply run a single algorithm on an existing data set
• Model training is only part of the process
• Emerging systems support more complex workflows:
  • Spark MLPipelines
  • Google TensorFlow
  • KeystoneML and Clipper Model Serving (BDAS)
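The point that model training is only part of the process can be illustrated with a minimal pipeline abstraction. This is a generic sketch, not the API of Spark MLPipelines, TensorFlow, or KeystoneML: `Pipeline`, `Standardizer`, and `NearestCentroid` are hypothetical classes operating on one-dimensional toy data. Preprocessing stages are fitted on the training data and then reapplied, unchanged, at prediction time.

```python
from typing import List


class Standardizer:
    """Preprocessing stage: rescale values to zero mean and unit variance."""

    def fit(self, xs: List[float]) -> "Standardizer":
        self.mean = sum(xs) / len(xs)
        variance = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = variance ** 0.5 or 1.0  # avoid dividing by zero
        return self

    def transform(self, xs: List[float]) -> List[float]:
        return [(x - self.mean) / self.std for x in xs]


class NearestCentroid:
    """Final stage: assign each point to the class with the closest mean."""

    def fit(self, xs: List[float], ys: List[int]) -> "NearestCentroid":
        self.centroids = {}
        for label in set(ys):
            members = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = sum(members) / len(members)
        return self

    def predict(self, xs: List[float]) -> List[int]:
        return [
            min(self.centroids, key=lambda c: abs(x - self.centroids[c]))
            for x in xs
        ]


class Pipeline:
    """Chain of transformers followed by one estimator."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, xs: List[float], ys: List[int]) -> "Pipeline":
        for stage in self.stages[:-1]:
            xs = stage.fit(xs).transform(xs)
        self.stages[-1].fit(xs, ys)
        return self

    def predict(self, xs: List[float]) -> List[int]:
        for stage in self.stages[:-1]:
            xs = stage.transform(xs)  # reuse statistics learned during fit
        return self.stages[-1].predict(xs)


pipe = Pipeline([Standardizer(), NearestCentroid()])
pipe.fit([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
predictions = pipe.predict([2.5, 9.0])
```

Treating the whole chain as one object is what lets a system reason about, optimize, and deploy the workflow as a unit instead of as loose scripts.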
40. KeystoneML
Declarative API → Automation & Optimization
E. Sparks et al., “MLI: An API for Distributed Machine Learning”, ICDM 2013
E. Sparks et al., “Automating Model Search for Large-Scale Machine Learning”, SOCC 2015
S. Venkataraman et al., “Ernest: Efficient Performance Prediction for Large-Scale Analytics”, NSDI 2016
E. Sparks et al., “KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics”, ICDE 2017
41. KeystoneML
Declarative API → Automation & Optimization
• ML operator selection/hyperparameter tuning
42. KeystoneML
Declarative API → Automation & Optimization
• ML operator selection/hyperparameter tuning
• Auto-provisioning cloud resources
43. KeystoneML
Declarative API → Automation & Optimization
• ML operator selection/hyperparameter tuning
• Auto-provisioning cloud resources
• Pipeline optimization
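As a rough illustration of automated hyperparameter tuning, the sketch below runs an exhaustive grid search over a regularization strength for a one-dimensional ridge regression without intercept. This is a deliberately simple stand-in, not KeystoneML's optimizer or the model-search method from the cited papers; all function names and the toy data are assumptions. The user declares the candidates, and the system picks the one with the lowest validation error.

```python
from typing import Dict, List, Sequence, Tuple

Dataset = Tuple[Sequence[float], Sequence[float]]


def fit_ridge_1d(xs: Sequence[float], ys: Sequence[float], lam: float) -> float:
    """Closed-form weight for y ≈ w * x with L2 penalty lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train: Dataset, val: Dataset, lams: List[float]) -> Dict[str, float]:
    """Try every candidate and keep the one with the lowest validation error."""
    best = None
    for lam in lams:
        w = fit_ridge_1d(train[0], train[1], lam)
        err = mse(w, val[0], val[1])
        if best is None or err < best["val_mse"]:
            best = {"lam": lam, "w": w, "val_mse": err}
    return best


# Hypothetical data that roughly follows y = 2x.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
val = ([5.0, 6.0], [10.1, 11.9])
best = grid_search(train, val, [0.0, 1.0, 10.0])
```

A production system would prune this search and account for cluster cost; the sketch only shows the declarative shape of the problem, where the user states the candidates and the objective and leaves the choice to the system.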
44. Deployment: Model Serving
Clipper: A prediction serving system that spans multiple ML frameworks
– Simplifies model serving
– Bounds latency and increases prediction throughput
– Enables real-time learning and personalization across machine learning frameworks
– Can be extended to support edge processing
D. Crankshaw et al., “Clipper: A Low-Latency Online Prediction Serving System”, NSDI Conf., March 2017 https://github.com/ucbrise/clipper
45. as we Move Up the Hierarchy… a new set of concerns moves to the fore:
1) Reducing Friction: Ease of Development and Deployment
2) Data Science/Analytics Full Lifecycle Concerns
3) “Safe” Data Science and Human Factors
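Two of the ideas behind a serving layer of this kind, a uniform prediction interface over models from different frameworks and a cache that avoids recomputing repeated queries, can be sketched as below. This is not Clipper's API: `ModelServer`, `register`, and `predict` are hypothetical names, the "models" are plain Python callables, and latency bounding and batching are left out entirely.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class ModelServer:
    """Toy serving layer: one predict() entry point for any registered model."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[Hashable], Any]] = {}
        self._cache: Dict[Tuple[str, Hashable], Any] = {}
        self.cache_hits = 0

    def register(self, name: str, predict_fn: Callable[[Hashable], Any]) -> None:
        """Wrap a model from any framework behind a plain callable."""
        self._models[name] = predict_fn

    def predict(self, name: str, x: Hashable) -> Any:
        key = (name, x)
        if key in self._cache:
            # Repeated query: answer from the cache without invoking the model.
            self.cache_hits += 1
            return self._cache[key]
        y = self._models[name](x)
        self._cache[key] = y
        return y
```

Because callers only see `predict(name, x)`, a model can be swapped for one trained in a different framework without changing the application code.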
46. The Data Science Lifecycle December 2016
47. Data Cleaning: SampleClean Key Systems Issues – how to deal with latency and cost of the crowd? J. Wang, S. Krishnan, et al., A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data, SIGMOD 2014
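The sample-and-clean idea can be sketched as follows: cleaning is expensive (for example when it involves a crowd), so only a random sample is cleaned and the aggregate is estimated from it. This is a simplified illustration rather than the estimators from the cited paper; the `clean` rule, the synthetic data, and the function names are all assumptions, and no confidence interval is computed.

```python
import random
from typing import List


def clean(value: float) -> float:
    """Hypothetical cleaning rule: readings above 19 were logged in the wrong unit."""
    return value / 100 if value > 19 else value


def sample_clean_avg(dirty: List[float], sample_size: int, seed: int = 0) -> float:
    """Estimate AVG over the clean data by cleaning only a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(dirty, sample_size)
    cleaned = [clean(v) for v in sample]  # the only rows that incur cleaning cost
    return sum(cleaned) / len(cleaned)


# Synthetic data: true values cycle through 0..19; every 10th row is corrupted (x100).
true_values = [float(i % 20) for i in range(1000)]
dirty_values = [v * 100 if i % 10 == 0 else v for i, v in enumerate(true_values)]

true_avg = sum(true_values) / len(true_values)
dirty_avg = sum(dirty_values) / len(dirty_values)
estimate = sample_clean_avg(dirty_values, sample_size=100)
```

Here the average over the dirty data is far from the true average, while the estimate from 100 cleaned rows is much closer, at one tenth of the cleaning effort.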
48. Curation and Reproducibility Data outlives any particular application: “[database systems] let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written.” (Curt Monash, analyst/blogger)
49. Curation and Reproducibility
Data outlives any particular application: “[database systems] let you use one set of data in multiple ways, including ways that are unforeseen at the time the database is built and the 1st applications are written.” (Curt Monash, analyst/blogger)
Z. Zhang et al., Hippo, HPDC 17:
– Efficient fine-grained lineage for machine learning and advanced analytics pipelines
– Supports code debugging, result analysis, data anomaly removal and computation replay
– Provides interactive answers to queries over lineage
50. Bias, Privacy and Ethical Issues “With Big Data comes Big Responsibility”
51. Humans in the loop
[Diagram labels: Data Consumers, Data Generators, Predictions, Decisions, Data Scientists, Data, Citizen Science, Data Processors, KeystoneML]
People icons created by Clara Joy from Noun Project
52. The AMPCrowd System
amplab.github.io/ampcrowd
Leveraging systems and database techniques for hybrid human-in-the-loop analytics (e.g. Straggler Mitigation, Active Learning)
[Diagram labels: User, Labeling tasks, Labels & predictions, Crowd Platform, Retainer Pool, Slots (S1 T0’, S2 T0, S3 T1, ✗ S4), Labels, LifeGuard, Pool Manager, Scheduler, Mitigator, Maintainer, Batcher, Task batch, Task Selector (Uniform, Hybrid, Active), Model Trainer]
D. Haas, et al., Clamshell: Scaling Up Crowds for Low Latency Data Labeling, PVLDB 9(4)
Haas & Franklin, Cioppino: Multi-tenant Crowdsourcing, HCOMP 2017
53. New Challenges Summary
Performance, Scalability, and Functionality remain important, but we face new challenges, including:
Ease of Development and Deployment
• Leverage database-style abstractions (e.g., declarative query optimization)
• Make ML and AI pipelines easier to build
• New components for “model serving” and “model management”
Data Science Lifecycle
• Data Acquisition, Cleaning (i.e., wrangling)
• Data Integration remains a “wicked problem”
• Communicating results, Curation
• “Translational Data Science”
“Safe” Data Science
• End-to-end Bias Mitigation
• Security, Ethics and Data Privacy
• Explaining and influencing decisions
• Human-in-the-loop
54. Acknowledgements The work described here is due to an amazing group of AMPLab students, staff, faculty and sponsors and to the open source community.
55. Thanks and for More Info Mike Franklin mjfranklin@uchicago.edu