Scientific and Safe Chaos Engineering
Brian Wilcox


Published 2019/10/19 in the Technology category

3. Steady State
4. Stressed Event Deviation
5. SEO “Reset” Requests Cache Purge Hardware Failure
6. Resilience (Amplitude), Reliability (Frequency)
7. Assertion Change Validation Operation
8. Resilience Engineering: Attempts to improve how a system reacts to a stressed state.
Chaos Engineering: Attempts to prove how a system reacts to a stressed state.
Photo by Michael Fenton on Unsplash
9. What are you trying to s/(im)?prove/
Steady state only matters if you can define what is good vs. bad.
1. Operability 2. User Interactions 3. Durability
Releases, Dependency Tree
• Standard deployments
• Compatibility problems
• Unrealized dependencies
• Slow pipelines == bad app
Regional Error Rates
• Noisy Dependencies
• Noisy Operation
• Invalid Input
Supportability/Deprecation, Availability
• Unrealized dependencies
• Graceful Degradation
Consistency
• Consensus isn’t free
• Natural Disasters
• Governmental Instability
• Hardware Failure
Tooling
• Metrics and Alerting Pipeline
• AutoRemediation Tools
• Fault Detection
10. The Process
• Establish Steady State
• Observe bad results
• Observe good results
• Consider the control planes
• Plan for safety
• Execute the test
• Record, correct, repeat
Photo by SpaceX on Unsplash
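The process steps can be sketched as a small experiment harness. This is a minimal illustration, not the speaker's tooling: `run_experiment`, `ExperimentResult`, and the `measure`, `inject_fault`, and `rollback` callables are hypothetical names, and a single scalar metric compared against a fixed tolerance is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float         # steady-state metric before the fault
    stressed: float         # metric observed while the fault was active
    recovered: float        # metric observed after rollback
    within_tolerance: bool  # True if the stressed value stayed near baseline


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    """Hypothetical harness: measure, stress, always roll back, then compare."""
    baseline = measure()          # establish steady state
    try:
        inject_fault()            # execute the test
        stressed = measure()      # observe the stressed event
    finally:
        rollback()                # plan for safety: always undo the change
    recovered = measure()         # record the result for the next iteration
    ok = abs(stressed - baseline) <= tolerance
    return ExperimentResult(baseline, stressed, recovered, ok)
```

Running the rollback in a `finally` block means the fault is undone even if the measurement itself fails, which is one way to honor "plan for safety" before "execute the test".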
11. Steady State
• Availability
• Number of Units Shipped
• Rate of Failure
• Meters under Water
Photo by Miguel A. Amutio on Unsplash
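Whichever metric is chosen, steady state needs a numeric band so that a deviation can be recognized. The sketch below assumes the band is the baseline mean plus or minus `k` sample standard deviations; that rule, and the names `steady_state_band` and `is_deviation`, are illustrative assumptions rather than anything prescribed in the talk.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def steady_state_band(samples: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) as the baseline mean +/- k sample standard deviations.

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma


def is_deviation(value: float, band: Tuple[float, float]) -> bool:
    """True when an observed value falls outside the steady-state band."""
    low, high = band
    return not (low <= value <= high)


# Example: availability percentages sampled during normal operation.
band = steady_state_band([99.9, 99.8, 100.0, 99.9, 99.9])
print(is_deviation(97.0, band))
```

The same two functions apply to any of the listed metrics, as long as the samples come from a period considered normal.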
12. The Bad
• What do failures look like?
• What are the common categories?
• Do you have a definition of bad?
• Site Issues? Helpdesk tickets?
Photo by Hayden Walker on Unsplash
13. The Good
• Definition of Good (SLA)
• IR&M Processes
• How often does the system take care of itself?
• Lineage-driven fault injection
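Two of these questions reduce to simple ratios: whether observed availability meets the SLA, and how often incidents resolve without human help. A minimal sketch follows, assuming a hypothetical `Incident` record with an `auto_remediated` flag; the field names and the example SLA target are illustrative, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Incident:
    category: str          # e.g. "cache", "disk", "deploy" (illustrative labels)
    auto_remediated: bool  # True if no human intervention was needed


def self_healing_rate(incidents: Iterable[Incident]) -> Optional[float]:
    """Fraction of incidents the system resolved itself; None if there were none."""
    items = list(incidents)
    if not items:
        return None
    healed = sum(1 for incident in items if incident.auto_remediated)
    return healed / len(items)


def meets_sla(successes: int, total: int, target: float) -> bool:
    """True when observed availability (successes / total) reaches the SLA target."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total >= target
```

Tracking the self-healing rate per category can show which failure classes still depend on manual response.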
14. Control Plane
• When can your applications make routing decisions?
• How long can you hold on to a request?
• How do applications report stress?
• Where can you effect change?
• Development lifecycle
• Operational burden
• Control of resources
Photo by Crew on Unsplash
15. Safety Third
• Minimize the blast radius
• Have a backup/rollback plan
• Assume missing information
• What’s the least you have to do to complete the experiment?
• Build confidence
Photo by Pop & Zebra on Unsplash
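One way to minimize the blast radius is to cap how many hosts an experiment may touch and to define an abort condition before starting. The sketch below is an assumption-laden illustration: `pick_targets`, `should_abort`, the host names, and the thresholds are all hypothetical.

```python
import random
from typing import List, Sequence


def pick_targets(hosts: Sequence[str], max_fraction: float, seed: int = 0) -> List[str]:
    """Choose a small random subset of hosts, never the whole fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    n = len(hosts)
    # At least one target when possible, but always leave one host untouched.
    limit = min(max(1, int(n * max_fraction)), max(n - 1, 0))
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    return rng.sample(list(hosts), limit)


def should_abort(error_rate: float, abort_threshold: float) -> bool:
    """True when the observed error rate exceeds the pre-agreed limit."""
    return error_rate > abort_threshold


hosts = [f"host-{i}" for i in range(10)]
targets = pick_targets(hosts, max_fraction=0.2)
print(len(targets))  # 2
```

Starting with a small fraction and widening it only after clean runs is one way to build confidence incrementally.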
16. Assertion Change Validation Operation
18. Questions?