6 WhyAndHowFlinkSQL v2

1. WHY AND HOW TO LEVERAGE THE POWER AND SIMPLICITY OF SQL ON APACHE FLINK® - FABIAN HUESKE, SOFTWARE ENGINEER
2. ABOUT ME • Apache Flink PMC member & ASF member ‒ Contributing since day 1 at TU Berlin ‒ Focusing on Flink’s relational APIs since ~2 years • Co-author of “Stream Processing with Apache Flink” ‒ Work in progress… • Co-founder & Software Engineer at data Artisans 2 © 2018 data Artisans
3. ABOUT DATA ARTISANS Original creators of Apache Flink® 3 © 2018 data Artisans Open Source Apache Flink + dA Application Manager
4. DA PLATFORM data-artisans.com/download 4 © 2018 data Artisans
5. WHAT IS APACHE FLINK? Data Stream Processing Batch Processing process static and historic data realtime results from data streams Event-driven Applications data-driven actions and services Stateful Computations Over Data Streams 5 © 2018 data Artisans
6. WHAT IS APACHE FLINK? Stateful computations over streams real-time and historic fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once Queries Applications Application Streams Database Devices etc. Historic Data Stream File / Object Storage 6 © 2018 data Artisans
7. POWERED BY APACHE FLINK 8 © 2018 data Artisans
8. FLINK’S POWERFUL ABSTRACTIONS Layered abstractions to navigate simple to complex use cases High-level Analytics API SQL / Table API (dynamic tables) Stream- & Batch Data Processing DataStream API (streams, windows) Stateful EventDriven Applications Process Function (events, state, time) val stats = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum((a, b) -> a.add(b)) def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { … } out.collect(…) // emit events state.update(…) // modify state 9 // schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) © 2018 data Artisans }
9. APACHE FLINK’S RELATIONAL APIS ANSI SQL LINQ-style Table API SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user tableEnvironment .scan("clicks") .groupBy('user) .select('user, 'url.count as 'cnt) Unified APIs for batch & streaming data A query specifies exactly the same result regardless whether its input is static batch data or streaming data. 10 © 2018 data Artisans
10. QUERY TRANSLATION tableEnvironment .scan("clicks") .groupBy('user) .select('user, 'url.count as 'cnt) Input data is bounded (batch) 11 © 2018 data Artisans SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Input data is unbounded (streaming)
11. WHAT IF “CLICKS” IS A FILE? Input data is read at once Clicks user cTime url Mary 12:00:00 https://… Bob Mary Liz 12 12:00:00 12:00:02 12:00:03 © 2018 data Artisans https://… https://… https://… Result is produced at once SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user user cnt Mary 2 Bob 1 Liz 1
12. WHAT IF “CLICKS” IS A STREAM? Input data is continuously read Clicks user cTime url Mary 12:00:00 https://… Bob Mary Liz 13 12:00:00 12:00:02 12:00:03 © 2018 data Artisans https://… https://… https://… Result is continuously updated SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user The result is the same! user cnt Mary 2 1 Bob 1 Liz 1
13. WHY IS STREAM-BATCH UNIFICATION IMPORTANT? • Usability ‒ ANSI SQL syntax: No custom “StreamSQL” syntax. ‒ ANSI SQL semantics: No stream-specific results. • Portability ‒ Run the same query on bounded and unbounded data ‒ Run the same query on recorded and real-time data bounded query start of the stream unbounded query bounded query past now future unbounded query • How can we achieve SQL semantics on streams? 14 © 2018 data Artisans
14. DATABASE SYSTEMS RUN QUERIES ON STREAMS • Materialized views (MV) are similar to regular views, but persisted to disk or memory ‒ Used to speed-up analytical queries ‒ MVs need to be updated when the base tables change • MV maintenance is very similar to SQL on streams ‒ Base table updates are a stream of DML statements ‒ MV definition query is evaluated on that stream ‒ MV is query result and continuously updated 15 © 2018 data Artisans
15. CONTINUOUS QUERIES IN FLINK • Core concept is a “Dynamic Table” ‒ Dynamic tables are changing over time • Queries on dynamic tables ‒ produce new dynamic tables (which are updated based on input) ‒ do not terminate • Stream ↔ Dynamic table conversions 16 © 2018 data Artisans 16
16. STREAM ↔ DYNAMIC TABLE CONVERSIONS • Append Conversions ‒Records are only inserted/appended • Upsert Conversions ‒Records are inserted/updated/deleted ‒Records have a (composite) unique key • Changelog Conversions ‒Records are inserted/updated/deleted 17 © 2018 data Artisans
17. SQL FEATURE SET IN FLINK 1.5.0 • SELECT FROM WHERE • GROUP BY / HAVING ‒ Non-windowed, TUMBLE, HOP, SESSION windows • JOIN ‒ Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN ‒ Non-windowed INNER JOIN • Scalar, aggregation, table-valued UDFs • SQL CLI Client (beta) • [streaming only] OVER / WINDOW ‒ UNBOUNDED / BOUNDED PRECEDING • [batch only] UNION / INTERSECT / EXCEPT / IN / ORDER BY 18 © 2018 data Artisans
18. WHAT CAN I BUILD WITH THIS? • Data Pipelines ‒ Transform, aggregate, and move events in real-time • Low-latency ETL ‒ Convert and write streams to file systems, DBMS, K-V stores, indexes, … ‒ Ingest appearing files to produce streams • Stream & Batch Analytics ‒ Run analytical queries over bounded and unbounded data ‒ Query and compare historic and real-time data • Power Live Dashboards ‒ Compute and update data to visualize in real-time 19 © 2018 data Artisans
19. THE NEW YORK TAXI RIDES DATA SET • The New York City Taxi & Limousine Commission provides a public data set about past taxi rides in New York City • We can derive a streaming table from the data • Table: TaxiRides rideId: isStart: lon: lat: rowtime: 20 © 2018 data Artisans BIGINT BOOLEAN DOUBLE DOUBLE TIMESTAMP // // // // // ID of the taxi ride flag for pick-up (true) or drop-off (false) event longitude of pick-up or drop-off location latitude of pick-up or drop-off location time of pick-up or drop-off event
20. IDENTIFY POPULAR PICK-UP / DROP-OFF LOCATIONS ▪ Compute every 5 minutes for each location the number of departing and arriving taxis of the last 15 minutes. SELECT cell, isStart, HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd, COUNT(*) AS cnt FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell FROM TaxiRides) GROUP BY cell, isStart, HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) 21 © 2018 data Artisans
21. AVERAGE RIDE DURATION PER PICK-UP LOCATION ▪ Join start ride and end ride events on rideId and compute average ride duration per pick-up location. SELECT pickUpCell, AVG(TIMESTAMPDIFF(MINUTE, e.rowtime, s.rowtime) AS avgDuration FROM (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell FROM TaxiRides WHERE isStart) s JOIN (SELECT rideId, rowtime FROM TaxiRides WHERE NOT isStart) e ON s.rideId = e.rideId AND e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR GROUP BY pickUpCell 22 © 2018 data Artisans
22. BUILDING A DASHBOARD SELECT cell, isStart, HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd, COUNT(*) AS cnt FROM (SELECT rowtime, isStart, toCellId(lon, lat) AS cell FROM TaxiRides) GROUP BY cell, isStart, HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) Elastic Search Kafka 23 © 2018 data Artisans
23. SOUNDS GREAT! HOW CAN I USE IT? • ATM, SQL queries must be embedded in Java/Scala code  ‒ Tight integration with DataStream and DataSet APIs • Community focused on internals (until Flink 1.4.0) ‒ Operators, types, built-in functions, extensibility (UDFs, extern. catalog) ‒ Proven at scale by Alibaba, Huawei, and Uber ‒ All built their own submission system & connectors library • Community neglected user interfaces ‒ No query submission client, no CLI ‒ No integration with common catalog services ‒ Limited set of TableSources and TableSinks 24 © 2018 data Artisans
24. COMING IN FLINK 1.5.0 - SQL CLI Demo Time! That’s a nice toy, but … ... can I use it for anything serious? 25 © 2018 data Artisans
25. FLIP-24 – A SQL QUERY SERVICE • REST service to submit & manage SQL queries ‒ SELECT … ‒ INSERT INTO SELECT … ‒ CREATE MATERIALIZE VIEW … • Serve results of “SELECT …” queries • Provide a table catalog (integrated with external catalogs) • Use cases ‒ Data exploration with notebooks like Apache Zeppelin ‒ Access to real-time data from applications ‒ Easy data routing / ETL from management consoles 26 © 2018 data Artisans
26. CHALLENGE: SERVE DYNAMIC TABLES Unbounded input yields unbounded results (Serving bounded results is easy) SELECT user, url FROM clicks WHERE url LIKE '%xyz.com' SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Append-only Table Continuously updating Table • • • Result rows are never changed Consume, buffer, or drop rows • • 27 © 2018 data Artisans Result rows can be updated or deleted Consume changelog or periodically query result table Result table must be maintained somewhere
27. FLIP-24 – A SQL QUERY SERVICE External Catalog Application (Schema Registry, HCatalog, …) SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Database / HDFS Catalog Submit Query REST REST Results Event Log Optimizer Submit Query Job Result Server State Query Service Results are served by Query Service via REST + + − 28 Application does not need a special client Works well in many network configurations Query service can become bottleneck © 2018 data Artisans Query Event Log Database / HDFS
28. FLIP-24 – A SQL QUERY SERVICE External Catalog Application Database / HDFS Catalog REST REST Submit Query Result Handle Serving Library (Schema Registry, HCatalog, …) SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user Event Log Optimizer Submit Query Job Result Server State Query Service Event Log Database / HDFS 29 © 2018 data Artisans Query
29. WE WANT YOUR FEEDBACK! • The design of SQL Query Service is not final yet. • Check out FLIP-24 and FLINK-7594 • Share your ideas and feedback and discuss on JIRA or dev@flink.apache.org. 30 © 2018 data Artisans
30. SUMMARY • Unification of stream and batch is important. • Flink’s SQL solves many streaming and batch use cases. • Runs in production at Alibaba, Uber, and others. • The community is working on improving user interfaces. • Get involved, discuss, and contribute! 31 © 2018 data Artisans
31. THANK YOU! Available on O’Reilly Early Release!
32. THANK YOU! @fhueske @dataArtisans @ApacheFlink WE ARE HIRING data-artisans.com/careers

相关幻灯片