蔡東邦 (DB Tsai): How to Bridge the Performance Gap Between Spark Datasets and DataFrames

Slide text
1. Bridging the Gap Between Datasets and DataFrames DB Tsai ArchSummit 2019 © 2019 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
4. Agenda
• Spark at Apple
• Spark SQL
• Datasets vs DataFrames
• Optimizing Datasets with Bytecode Analysis
5. Spark at Apple
6. Spark at Apple
• Scalable, elastic, on-demand Spark
• Disaggregated architecture
• Over a million executors per day
7. Spark SQL
8. Catalyst
9. Datasets vs DataFrames
10. Review of APIs of Spark
• DataFrame: relational, untyped APIs introduced in Spark 1.3. From Spark 2.0, type DataFrame = Dataset[Row]
• Dataset: supports all the untyped DataFrame APIs plus typed functional APIs
11. Review of DataFrame
Real estate information can be naturally modeled by

    root
     |-- address: struct (nullable = true)
     |    |-- houseNumber: integer (nullable = true)
     |    |-- streetAddress: string (nullable = true)
     |    |-- city: string (nullable = true)
     |    |-- state: string (nullable = true)
     |    |-- zipCode: string (nullable = true)
     |-- facts: struct (nullable = true)
     |    |-- price: integer (nullable = true)
     |    |-- size: integer (nullable = true)
     |    |-- yearBuilt: integer (nullable = true)
     |-- schools: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- name: string (nullable = true)
12. Review of DataFrame

    val newAddress = struct(
      $"address.houseNumber",
      new Column(Uuid()) as "streetAddress",
      $"address.city",
      $"address.state",
      $"address.zipCode"
    )
    ds.withColumn("address", newAddress)
      .where($"facts.price" > 2000000).explain(true)
13. Execution Plan - DataFrame Untyped APIs

    == Physical Plan ==
    *(1) Filter (isnotnull(facts#1) && (facts#1.price > 2000000))
    +- *(1) Project [named_struct(houseNumber, address#0.houseNumber, streetAddress, uuid(Some(0)), city, address#0.city, state, address#0.state, zipCode, address#0.zipCode) AS address#7, facts#1, schools#2]
       +- *(1) FileScan parquet [address#0,facts#1,schools#2] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)],
14. Review of Dataset

    ds.map { home =>
      home.copy(home.address.copy(streetAddress = UUID.randomUUID().toString))
    }.filter(_.facts.price > 2000000).explain(true)
15. Execution Plan - Dataset Typed APIs

    == Physical Plan ==
    *(1) SerializeFromObject [if (isnull(assertnotnull(input[0, Home, true]).address)) null else named_struct(houseNumber, assertnotnull(assertnotnull(input[0, Home, true]).address).houseNumber, streetAddress, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, Home, true]).address).streetAddress, true, false), city,……….
    +- *(1) Filter <function1>.apply
       +- *(1) MapElements <function1>, obj#27: Home
          +- *(1) DeserializeToObject newInstance(class Home), obj#26: Home
             +- *(1) FileScan parquet [address#0,facts#1,schools#2] DataFilters: [], Format: Parquet, PushedFilters: [],
16. Strongly Typed Pipeline
• Typed Datasets guarantee schema consistency
• Enables Java/Scala interoperability between systems
• Compile-time errors
• Increases data scientist productivity
17. Drawbacks of Strongly Typed Pipeline
• Datasets are slower than DataFrames: https://tinyurl.com/dataset-vs-dataframe
• Serialization and deserialization cost
• Many POJOs are created for each row, resulting in high GC pressure
• Catalyst optimizations cannot be applied
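The last drawback can be illustrated without Spark. The sketch below is a minimal model, not Catalyst itself: `Expr`, `Field`, `IntLit`, `GreaterThan`, `pushableFromExpr`, and `pushableFromClosure` are hypothetical names invented for illustration, and `Home`/`Facts` are cut-down stand-ins for the case classes implied by the schema on slide 11. It shows why an optimizer can derive a pushed filter from an expression tree but has nothing to inspect in a compiled closure.

```scala
object OpaqueClosureSketch {
  // Hypothetical mini expression tree, standing in for Catalyst expressions.
  sealed trait Expr
  final case class Field(path: String) extends Expr
  final case class IntLit(value: Int) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr

  // Cut-down stand-ins for the typed records used in the slides.
  final case class Facts(price: Int)
  final case class Home(facts: Facts)

  // An expression tree can be pattern-matched, so a data-source filter can be derived.
  def pushableFromExpr(predicate: Expr): Option[String] = predicate match {
    case GreaterThan(Field(path), IntLit(v)) => Some(s"GreaterThan($path,$v)")
    case _                                   => None
  }

  // A compiled closure only exposes `apply`; without bytecode analysis
  // there is no structure to match on, so nothing can be pushed down.
  def pushableFromClosure(predicate: Home => Boolean): Option[String] = None

  val untyped: Expr = GreaterThan(Field("facts.price"), IntLit(2000000))
  val typed: Home => Boolean = _.facts.price > 2000000

  def main(args: Array[String]): Unit = {
    println(pushableFromExpr(untyped))
    println(pushableFromClosure(typed))
  }
}
```

This mirrors the two physical plans on slides 13 and 15: the untyped query lists `PushedFilters`, while the typed one shows an empty list.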
18. Optimizing Datasets with Bytecode Analysis
19. SPARK-14083
• Use bytecode analysis to convert closures/lambdas into Catalyst expressions in order to speed up Datasets
• Reported by Reynold Xin in March 2016
• No progress since March 2017
20. JVM Bytecode
• A platform-independent instruction set of the JVM
• Java/Scala code is compiled into bytecode before it can be executed
21-24. JVM Architecture
25-31. Bytecode Example
32. Deriving Catalyst Expressions
• Translate stack-based bytecode directly
    • Iterate over the bytecode instruction by instruction, simulating what happens to the operand stack
• Rely on an intermediate representation of bytecode
    • Translate bytecode into a format that is easier to analyze (e.g., Jimple) before converting it into Catalyst expressions
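To make "simulating what happens to the operand stack" concrete, here is a small sketch of how a stack-based instruction sequence is evaluated. The instruction types (`ILoad`, `IConst`, `IAdd`, `IMul`) are simplified stand-ins modeled on the JVM's `iload`, `iconst`, `iadd`, and `imul`; the instruction list is written by hand for `(x + 1) * 2` and is an assumption, not real compiler output.

```scala
object OperandStackSketch {
  // Simplified stand-ins for a few JVM integer instructions.
  sealed trait Instr
  final case class ILoad(index: Int) extends Instr  // push a local variable
  final case class IConst(value: Int) extends Instr // push a constant
  case object IAdd extends Instr                    // pop two values, push their sum
  case object IMul extends Instr                    // pop two values, push their product

  // Walk the instructions in order, keeping the operand stack as a List (head = top).
  def run(code: List[Instr], locals: Vector[Int]): Int = {
    val finalStack = code.foldLeft(List.empty[Int]) {
      case (stack, ILoad(i))      => locals(i) :: stack
      case (stack, IConst(v))     => v :: stack
      case (b :: a :: rest, IAdd) => (a + b) :: rest
      case (b :: a :: rest, IMul) => (a * b) :: rest
      case (stack, instr)         => sys.error(s"stack underflow at $instr: $stack")
    }
    finalStack.head
  }

  // Hand-written instructions for (x + 1) * 2, with x in local slot 0.
  val program: List[Instr] = List(ILoad(0), IConst(1), IAdd, IConst(2), IMul)

  def main(args: Array[String]): Unit =
    println(run(program, Vector(5)))
}
```

A direct translator follows the same walk but pushes expressions instead of concrete integers, so the value left on the stack is an expression tree rather than a number.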
33-35. Logical Plans for Datasets
36. Algorithm
• Get closure method & its bytecode
• Build a correct local variable array with Catalyst expressions
• Create an operand stack to hold partial results
• Follow instructions and perform operations on the operand stack
• Assign the final result to the expected attributes
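The steps above can be sketched symbolically in plain Scala. This is a toy model under stated assumptions, not the implementation from the talk: the expression types (`Attr`, `Lit`, `Add`, `Mul`, `Alias`) are hypothetical stand-ins for Catalyst expressions, the instruction set covers only a few integer operations, and the first step (obtaining the closure's bytecode) is replaced by a hand-written instruction list for `(x: Int) => (x + 1) * 2`.

```scala
object ClosureToExprSketch {
  // Hypothetical expression tree standing in for Catalyst expressions.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  // Simplified stand-ins for a few JVM integer instructions.
  sealed trait Instr
  final case class ILoad(index: Int) extends Instr
  final case class IConst(value: Int) extends Instr
  case object IAdd extends Instr
  case object IMul extends Instr
  case object IReturn extends Instr

  // `locals` is the local variable array, pre-filled with expressions for the
  // closure's inputs; `output` names the attribute the result is assigned to.
  def translate(code: List[Instr], locals: Vector[Expr], output: String): Expr = {
    @annotation.tailrec
    def loop(rest: List[Instr], stack: List[Expr]): Expr = (rest, stack) match {
      case (ILoad(i) :: tail, _)       => loop(tail, locals(i) :: stack)  // push local
      case (IConst(v) :: tail, _)      => loop(tail, Lit(v) :: stack)     // push literal
      case (IAdd :: tail, b :: a :: s) => loop(tail, Add(a, b) :: s)      // partial result
      case (IMul :: tail, b :: a :: s) => loop(tail, Mul(a, b) :: s)
      case (IReturn :: _, result :: _) => Alias(result, output)           // final result
      case _ => sys.error(s"unsupported instruction sequence: $rest with stack $stack")
    }
    loop(code, Nil) // start with an empty operand stack
  }

  // Hand-written instructions for the closure (x: Int) => (x + 1) * 2.
  val closureCode: List[Instr] =
    List(ILoad(0), IConst(1), IAdd, IConst(2), IMul, IReturn)

  // The input column "value" occupies local slot 0 in this simplified model.
  val plan: Expr = translate(closureCode, Vector(Attr("value")), "result")

  def main(args: Array[String]): Unit = println(plan)
}
```

Because the operand stack holds expressions rather than values, the closure is never executed; what remains at the return instruction is a tree that an optimizer could inspect.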
37-39. Sample Use Case (w/o Bytecode Analysis)
40-42. Sample Use Case (with Bytecode Analysis)
43. Sample Use Case (Results)
44. Challenges
• Preserving JVM semantics
• Null handling
• Exceptions (e.g., NullPointerException)
• Edge cases (e.g., division by 0 for integers/doubles)
• Result expressions can be too complicated
• Method calls
• Creation of new objects
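The JVM behaviors that a converted expression would have to reproduce can be observed directly. The sketch below is plain Scala with no Spark dependency; `Home` and `Facts` are cut-down stand-ins for the case classes implied by the schema on slide 11, and the helper names are invented for illustration. For contrast, Spark SQL in non-ANSI mode evaluates division by zero to NULL rather than throwing, which is one reason a naive translation of a closure could silently change results.

```scala
import scala.util.Try

object JvmSemanticsSketch {
  // Cut-down stand-ins for the typed records used in the slides.
  final case class Facts(price: Int)
  final case class Home(facts: Facts)

  // Integer division by zero throws ArithmeticException on the JVM.
  def intDiv(a: Int, b: Int): Try[Int] = Try(a / b)

  // Floating-point division by zero does not throw; it yields Infinity or NaN.
  def doubleDiv(a: Double, b: Double): Double = a / b

  // Dereferencing a null field throws NullPointerException.
  def price(home: Home): Try[Int] = Try(home.facts.price)

  def main(args: Array[String]): Unit = {
    println(intDiv(1, 0))
    println(doubleDiv(1.0, 0.0))
    println(doubleDiv(0.0, 0.0))
    println(price(Home(null)))
  }
}
```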
45. Challenges - Example (Code)
46. Challenges - Example (Bytecode)
47. Challenges - Example (Implicit Operations)
48. Challenges - Example (Creation of Objects)
49. Challenges - Example (Method Calls)
50-53. Challenges - Example (Verbose Bytecode)
54. Summary
• Datasets provide type safety and the ability to apply user-defined closures at the cost of performance
• Bytecode analysis can be used to bridge the gap in performance
• The conversion of user closures into Catalyst expressions is challenging