Huijun Wu - Anomaly Detection and Recovery in the Heron Real-Time Streaming System

都星晴

Published 2017/12/18 in the Technology category

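In response to the growing demand for large-scale real-time analytics in recent years, many stream processing systems have been developed. The open-source Twitter Heron system is one representative project. Systems of this kind are expected to maintain a good service level even in extreme situations where software or hardware fails. To meet this requirement, Twitter Heron added the Dhalion anomaly detection and recovery framework. Dhalion uses a policy to combine detector and resolver modules, which makes the whole system very flexible: by replacing the policy, the detector, or the resolver, it can carry out a variety of detection and recovery tasks, such as detecting backpressure metrics and scaling up, or detecting load metrics and rescheduling containers. Applying the Dhalion framework gives Heron an initial self-regulating mechanism.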

Text content
1. Self Regulating Stream Processing in Heron Huijun Wu 2017.12
2. Huijun Wu Twitter, Inc. Infrastructure, Data Platform, Real-Time Compute
3. Heron Overview Recent Improvements Self Regulating Challenges Dhalion Framework Case Study
4. Heron Overview Data model: user/developer perspective What is Heron? A real-time, distributed, fault-tolerant stream processing engine from Twitter Topology(DAG) • Vertex Spout Bolt • Edge: Stream Tuple Compatible with Apache Storm data model
5. Heron Overview Runtime architecture: data center with multiple topologies Topology shared services: • Scheduler • Uploader • State manager: zookeeper Tools: • Tracker • UI
6. Heron Overview Runtime architecture: one particular topology Shared between containers: • State manager: zookeeper Type 1: container 0 • Topology master • Metrics cache Type 2: container x (x>0) • Stream manager • Heron instance • Metrics manager
7. Heron Overview Runtime rate control: backpressure Health metrics: ● Metrics/counters ○ Backpressure ● Exceptions Backpressure example: B3 in container A triggers backpressure, which is broadcast to all Stream Managers so that they stop their local spouts.
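The backpressure behaviour described on this slide can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not Heron's implementation: the class `StreamManagerQueue`, the `simulate` helper, and the high/low watermark thresholds (100 and 50 pending tuples) are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamManagerQueue:
    """Toy model of one Stream Manager buffer feeding a slow bolt instance."""
    high_watermark: int  # assumed threshold: start backpressure at or above this
    low_watermark: int   # assumed threshold: stop backpressure at or below this
    pending: int = 0
    backpressure: bool = False

    def step(self, spout_rate: int, bolt_rate: int) -> None:
        # While backpressure is active the local spouts are stopped,
        # so no new tuples arrive in this toy model.
        arrivals = 0 if self.backpressure else spout_rate
        self.pending = max(0, self.pending + arrivals - bolt_rate)
        if not self.backpressure and self.pending >= self.high_watermark:
            self.backpressure = True   # would be broadcast to all Stream Managers
        elif self.backpressure and self.pending <= self.low_watermark:
            self.backpressure = False  # spouts may resume emitting


def simulate(steps: int, spout_rate: int, bolt_rate: int) -> List[bool]:
    """Return the backpressure flag observed after each time step."""
    queue = StreamManagerQueue(high_watermark=100, low_watermark=50)
    history = []
    for _ in range(steps):
        queue.step(spout_rate, bolt_rate)
        history.append(queue.backpressure)
    return history
```

With a spout that emits faster than the bolt can process, the flag switches on once the buffer fills and off again after the bolt drains it, which is the start/stop pattern the slide describes.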
8. Heron Overview Recent Major Improvements (2016-2017) Self Regulating Challenges Dhalion Framework Case Study
10. Recent Improvements Resource managers Service Provider Interface (SPI) • Modular plugins • https://github.com/twitter/heron/tree/master/heron/spi • Scheduler implementation vs. delegation Supported resource pools • Mesos/Aurora/Marathon • Yarn • Kubernetes • Slurm • Local http://2015.qconshanghai.com/presentation/2792
11. Recent Improvements Elastic runtime scaling ● Update parallelism at runtime ● Adapt to stream traffic load ● `heron update` command ● Minimize impact to running topology ● Intelligent packing algorithm
12. Recent Improvements Stateful processing: effectively once Delivery semantics: ● At most once ● At least once ● Effectively once ○ Distributed snapshot/state checkpointing ○ At-least-once event delivery plus roll back ● Exactly once
13. Recent Improvements High level DSL: functional API
Domain | Original topology API | Heron Functional API
Programming style | Procedural, processing component based | Functional
Abstraction level | Low level. Developers must think in terms of "physical" spout and bolt implementation logic. | High level. Developers can write processing logic in an idiomatic fashion in the language of their choice, without needing to write and connect spouts and bolts.
Processing model | Spout and bolt logic must be created explicitly, and connecting spouts and bolts is the responsibility of the developer | Spouts and bolts are created for you automatically on the basis of the processing graph that you build
15. Recent Improvements Self regulating: health mgr/dhalion Motivation: ➢ the manual, time-consuming and error-prone tasks of tuning various configuration knobs to achieve service level objectives (SLO) as well as the maintenance of SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation What is Dhalion: ➢ a system that provides self-regulation capabilities to underlying streaming systems Floratou, Avrilia, et al. "Dhalion: self-regulating stream processing in heron." Proceedings of the VLDB Endowment 10.12 (2017): 1825-1836.
16. Heron Overview Recent Improvements Self Regulating Challenges Dhalion Framework Case Study
17. Self-Regulating Streaming Systems Manual, time-consuming and error-prone task of tuning various system knobs to achieve SLOs Maintenance of SLOs in the face of unpredictable load variation and hardware or software performance degradation Self-Regulating Streaming Systems Self-tuning Self-stabilizing Self-healing
18. Self Regulating Challenges Self-tuning ● Various tuning knobs ● Time consuming tuning phase ● The system should take an SLO as input and automatically configure the knobs.
19. Self Regulating Challenges Self-stabilizing ● Streaming applications are long running ● Load variations are observed ● The system should react to external shocks and automatically reconfigure itself.
20. Self Regulating Challenges Self-healing ● System performance can be affected by hardware or software delivering degraded quality of service ● The system should identify internal faults and attempt to recover from them.
21. Heron Overview Recent Improvements Self Regulating Challenges Dhalion Framework Case Study
22. Feedback Cycle ● Passive cycle ○ start backpressure(stream manager) -> cease spout(heron instance) -> stop backpressure(stream manager) ● Proactive cycle ○ [metrics -> metrics manager] -> metrics cache -> health manager -> [container/stream manager/spout/bolt -> metrics]
23. Dhalion terminology ● Policy: Dhalion periodically invokes a policy which evaluates the status of the topology, identifies potential problems and takes appropriate actions to resolve them. ● Detection: Dhalion observes the system state by collecting various metrics from the underlying streaming system. ○ Symptom: Based on the metrics collected, Dhalion attempts to identify symptoms that can potentially denote that the health of the streaming application has been compromised. ● Diagnosis: After collecting various symptoms, Dhalion attempts to find one or more diagnoses that explain them. ● Resolution: Once a set of diagnoses has been found, the system evaluates them and explores the possible actions that can be taken to resolve the problem.
24. Dhalion policy phases
25. HealthMgr workflow ● Data flow ○ metrics -> component metrics ○ component metrics -> symptom ○ symptom -> diagnosis ○ diagnosis -> action ● Control flow ○ policy -> detector/diagnoser/resolver ○ detector -> sensor/metrics provider
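The data flow and control flow on this slide can be sketched as a policy that calls detectors, diagnosers, and resolvers in turn. The code below is a minimal illustration in Python, not the Dhalion Java API: `run_policy`, the three example functions, the metric key `backpressure_ms`, and the symptom, diagnosis, and action labels are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Metrics = Dict[str, float]
Detector = Callable[[Metrics], Optional[str]]     # metrics -> symptom or None
Diagnoser = Callable[[List[str]], Optional[str]]  # symptoms -> diagnosis or None
Resolver = Callable[[str], Optional[str]]         # diagnosis -> action or None


def run_policy(metrics: Metrics,
               detectors: List[Detector],
               diagnosers: List[Diagnoser],
               resolvers: List[Resolver]) -> List[str]:
    """One policy invocation: metrics -> symptoms -> diagnoses -> actions."""
    symptoms = []
    for detect in detectors:
        symptom = detect(metrics)
        if symptom is not None:
            symptoms.append(symptom)

    diagnoses = []
    for diagnose in diagnosers:
        diagnosis = diagnose(symptoms)
        if diagnosis is not None:
            diagnoses.append(diagnosis)

    actions = []
    for diagnosis in diagnoses:
        # Use the first resolver that knows how to handle this diagnosis.
        for resolve in resolvers:
            action = resolve(diagnosis)
            if action is not None:
                actions.append(action)
                break
    return actions


# Hypothetical components, for illustration only.
def backpressure_detector(metrics: Metrics) -> Optional[str]:
    return "BACKPRESSURE" if metrics.get("backpressure_ms", 0.0) > 0 else None


def underprovisioning_diagnoser(symptoms: List[str]) -> Optional[str]:
    return "UNDERPROVISIONED" if "BACKPRESSURE" in symptoms else None


def scale_up_resolver(diagnosis: str) -> Optional[str]:
    return "SCALE_UP_BOLT" if diagnosis == "UNDERPROVISIONED" else None
```

Because the policy only depends on the three callable types, swapping in a different detector or resolver changes the behaviour without touching `run_policy`, which mirrors the flexibility the abstract attributes to Dhalion.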
26. Dhalion in HealthMgr Dhalion from Microsoft https://github.com/Microsoft/Dhalion
27. Action log and blacklist ● It is possible that a diagnosis produced by Dhalion is erroneous and thus, an incorrect action is performed that will not eventually resolve the problem. ● For this reason, after every action is performed, Dhalion evaluates whether the action was able to resolve the problem or brought the system to a healthier state. ● If an action does not produce the expected outcome then it is blacklisted and it is not repeated again.
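The action log and blacklist can be sketched as follows. This is an assumption-laden illustration rather than Dhalion's actual data structure: the class `ActionLog`, its methods, the choice to key the blacklist on a (diagnosis, action) pair, and blacklisting after a single unsuccessful attempt are all hypothetical simplifications.

```python
from typing import List, Optional, Set, Tuple


class ActionLog:
    """Records every action taken and blacklists the ones that did not help."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str, bool]] = []  # (diagnosis, action, helped)
        self.blacklist: Set[Tuple[str, str]] = set()    # (diagnosis, action)

    def record(self, diagnosis: str, action: str, helped: bool) -> None:
        # Called after the policy has evaluated the outcome of an action.
        self.entries.append((diagnosis, action, helped))
        if not helped:
            self.blacklist.add((diagnosis, action))

    def pick_action(self, diagnosis: str, candidates: List[str]) -> Optional[str]:
        # Return the first candidate that is not blacklisted for this diagnosis.
        for action in candidates:
            if (diagnosis, action) not in self.blacklist:
                return action
        return None
```

A policy would call `pick_action` before acting and `record` after evaluating the result, so an action that failed for a given diagnosis is not tried again for it.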
28. Heron Overview Recent Improvements Self Regulating Challenges Dhalion Framework Case Study
29. Dynamic Resource Provisioning The major goal is to scale up and down topology resources as needed while still keeping the topology in a steady state where backpressure is not observed.
30. Dynamic Resource Provisioning Symptom detection phase ● The Pending Packets Detector focuses on the Stream Manager queue corresponding to each Heron Instance. Each Stream Manager queue temporarily stores packets that are pending for processing by the corresponding Heron Instance. This Symptom Detector examines the number of pending packets in the queues of the Heron Instances that belong to the same bolt, and denotes whether these Heron Instances have similar queue sizes or whether outliers are observed. ● The Backpressure Detector examines whether the topology experiences backpressure by evaluating the appropriate Stream Manager metrics. The existence of backpressure shows that the system is not able to achieve maximum throughput. ● The Processing Rate Skew Detector examines the number of tuples processed by each Heron Instance during the measurement period (processing rate). It then identifies whether skew in the processing rates is observed at each topology stage.
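The Pending Packets Detector and the Processing Rate Skew Detector both compare Heron Instances of the same bolt and look for outliers. The slide does not say how an outlier is defined, so the sketch below assumes one simple rule: an instance is an outlier when its value exceeds a fixed multiple of the median. The function name `find_outliers`, the `ratio` parameter, and the sample numbers are hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_outliers(values_by_instance: Dict[str, float], ratio: float = 2.0) -> List[str]:
    """Return instances whose value exceeds `ratio` times the median (assumed rule)."""
    typical = median(values_by_instance.values())
    return sorted(name for name, value in values_by_instance.items()
                  if value > ratio * typical)


# Pending packets per Heron Instance of one bolt (made-up numbers).
pending_packets = {"bolt_1": 10, "bolt_2": 12, "bolt_3": 95}
# Tuples processed per Heron Instance in one measurement period (made-up numbers).
processing_rates = {"bolt_1": 100, "bolt_2": 105, "bolt_3": 98}
```

Under this rule, `find_outliers(pending_packets)` flags only `bolt_3`, while `find_outliers(processing_rates)` flags nothing, meaning the queue sizes are skewed even though the processing rates are similar.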
31. Dynamic Resource Provisioning Diagnosis generation phase h: heron instance r: processing rate p: pending packets B: subset of H
32. Dynamic Resource Provisioning Resolution phase ● Restart Instances Resolver: moves the slow Heron Instances to new containers ● Data Skew Resolver: adjusts the hash function used to distribute the data to the bolts ● Bolt Scale Up Resolver: ○ To determine the scale up factor, the Resolver computes the percentage of the total amount of time that the Heron Instances spent suspending the input data over the amount of time where backpressure was not observed. This percentage essentially denotes the portion of the input load that the Heron Instances could not handle.
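The scale up factor described for the Bolt Scale Up Resolver is a ratio of two durations, which a short worked example makes concrete. The sketch below follows the slide's definition (time spent suspending input divided by time without backpressure); the function names, the millisecond units, and the choice to round the new parallelism up with `math.ceil` are assumptions made for illustration.

```python
import math


def scale_up_factor(backpressure_ms: float, window_ms: float) -> float:
    """Time with input suspended divided by time without backpressure."""
    normal_ms = window_ms - backpressure_ms
    if normal_ms <= 0:
        raise ValueError("backpressure covered the whole window; ratio is undefined")
    return backpressure_ms / normal_ms


def new_parallelism(current: int, backpressure_ms: float, window_ms: float) -> int:
    """Grow the bolt parallelism by the scale up factor, rounding up (assumed)."""
    factor = scale_up_factor(backpressure_ms, window_ms)
    return math.ceil(current * (1 + factor))
```

For example, if input was suspended for 60 seconds of a 300 second window, the factor is 60 / 240 = 0.25, so a bolt with 4 instances would be scaled to 5.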
33. Dynamic Resource Provisioning Evaluation
34. Satisfying Throughput SLOs ● Emit Count Detector: computes the total rate at which spouts emit data ● Throughput SLO Violation Diagnoser ● Spout Scale Up Resolver: increases the number of Heron Instances of the spout ○ In case the policy increases the spout parallelism, the topology might experience backpressure due to the increase of the input load. In this case, the Throughput SLO Policy employs the components used by the Dynamic Resource Provisioning Policy to automatically adjust the resources assigned to the bolts so that the topology is brought back to a healthy state.
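The three components of the Throughput SLO policy can be sketched as three small functions. The slide does not state how much the spout parallelism is increased, so the proportional rule used here (scale by the ratio of the SLO to the observed rate, rounded up) is an assumption, as are all function names and the sample rates.

```python
import math
from typing import Dict


def total_emit_rate(emit_rate_by_spout: Dict[str, float]) -> float:
    """Emit Count Detector: total rate at which the spout instances emit data."""
    return sum(emit_rate_by_spout.values())


def violates_slo(observed_rate: float, slo_rate: float) -> bool:
    """Throughput SLO Violation Diagnoser: is throughput below the target?"""
    return observed_rate < slo_rate


def spout_parallelism_for_slo(current: int, observed_rate: float, slo_rate: float) -> int:
    """Spout Scale Up Resolver, using an assumed proportional scaling rule."""
    if observed_rate <= 0:
        raise ValueError("observed rate must be positive")
    if not violates_slo(observed_rate, slo_rate):
        return current
    return math.ceil(current * slo_rate / observed_rate)


# Made-up emit rates, in million tuples per minute.
emit_rates = {"spout_1": 2.0, "spout_2": 2.0}
```

With these numbers and an SLO of 6.0, the observed total is 4.0, so the sketch raises the spout parallelism from 2 to 3. As the slide notes, the extra input load may then cause backpressure, which the Dynamic Resource Provisioning components would handle by scaling the bolts.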
35. Satisfying Throughput SLOs Evaluation
36. Curious to know more ● Floratou, Avrilia, et al. "Dhalion: self-regulating stream processing in heron." Proceedings of the VLDB Endowment 10.12 (2017): 1825-1836. ● Kulkarni, Sanjeev, et al. "Twitter heron: Stream processing at scale." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015. ● Fu, Maosong, et al. "Twitter Heron: Towards Extensible Streaming Engines." Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 2017. ● Fu, Maosong, et al. "Streaming@ Twitter." IEEE Data Eng. Bull. 38.4 (2015): 15-27.