Chaos Engineering Open Science for Software Engineering KubeCon North America 2018

1. Chaos Engineering Open Science for Software Engineering Sylvain Hellegouarch Chaos CTO https://chaosiq.io @lawouach
2. A talk in three acts
3. Act I The one with History
4. A look at the past?
5. A worthwhile detour Watch Adrian Cockcroft’s awesome talk at ChaosConf https://www.youtube.com/watch?v=cefJd2v037U
6. Let’s illustrate the challenge with a case-study
7. The Near-Loss and Recovery of America's First Space Station
8. The Near-Loss and Recovery of America's First Space Station •Skylab: first US space station launched in 1973 •Years of design •Relied on the previous Apollo program
9. The Near-Loss and Recovery of America's First Space Station •Suffered loss of sun-radiation shield during launch •Temperature in the lab rose sharply (up to 200°) •Engineers worked out ways to reduce the temperature (Recovery first!) ○ Changed angle of space station slightly ○ Brought up a new thermal insulation to the lab •Next launch was postponed by 10 days Copyright NASA
10. The Near-Loss and Recovery of America's First Space Station The overarching — and was fully operational for Skylab. . What may have affected the oversight of the aerodynamic loads was involving several distinct technical disciplines. In our industry: It worked in the past and it’s a small change. Ring a bell?
11. The Near-Loss and Recovery of America's First Space Station , review and testing, the project team failed to recognize the shield’s design deficiency because they and structurally integrated as set forth in the design criteria. Smart and sharp engineers and scientists, but the previous project may have given them a confidence that wasn’t backed by enough experiments and data.
12. Concurrently, the investigation board emphasized that management must always be alert to the potential hazards of its systems and , documentation and visibility. According to the board, or analysts. Achieving a in analysis, design, test or operations
13. It's just one of these cases where Mars is going to give us a new deal, and we're going to have to play the cards we get, not the ones we want
14. Be ready not to be ready Copyright The Walt Disney Company
15. Fast forward to 2018
16. We learnt, adapted and improved... Humans are amazing (sometimes) Copyright NASA - Mission InSight
17. We have learnt indeed. But as systems reliability goes, we could still improve...
18. A regular certificate warning but in French French public service for driving license Certificate had been invalid for about 9 days
19. Twitter seems to be your best alerting platform sometimes Sent that message at 12:25pm (not just me but a few others too)
20. Updated at 1:41 pm that same day
21. Mild impacts but sometimes...
22. Certificate expiring can cause bigger troubles O2 mobile network outage on December 6th 2018 Earlier Ericsson president Börje Ekholm said "an initial root cause analysis" had indicated that the "main issue was an expired certificate in the software versions installed with these customers". Copyright Down Detector
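An expired certificate is one of the easier failures to detect ahead of time. The following is a minimal sketch, not something shown in the talk: the function names, the 30-day warning threshold and the sample dates are assumptions chosen for illustration. It classifies a certificate from its `notAfter` field, and includes an optional helper that reads that field from a live server using only the Python standard library.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter date.

    `not_after` uses the format returned by ssl.getpeercert(),
    for example "Dec  6 00:00:00 2019 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate; warn_days is an arbitrary, illustrative threshold."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    if remaining < warn_days:
        return "expiring-soon"
    return "ok"


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from a live server (hypothetical helper, needs network).

    With the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Made-up dates, evaluated against a fixed "now" so the result is repeatable.
    now = datetime(2018, 12, 6, tzinfo=timezone.utc)
    for sample in ("Dec  5 00:00:00 2018 GMT", "Dec 20 00:00:00 2018 GMT"):
        print(sample, "->", expiry_status(sample, now))
```

Run on a schedule against each public endpoint, a check like this would raise the alarm days before users do.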
23. Everyone needs more reliable systems
24. End of Act I
25. Act II The one with a community
26. You are not alone
27. CNCF Working Group
28. Strong signal that reliability matters to the Cloud Native ecosystem
29. Deliverables and challenges?
30. Deliverable 1: Whitepaper
31. CNCF WG Whitepaper •Not a specification/standard •Not dogmatic •Not a HOWTO
32. CNCF WG Whitepaper •Shared understanding •Product/Solution Agnostic •A starting line for users’ journey into Chaos Engineering •An industry effort to refine the practice •It’s not about giving solutions but expressing how Chaos Engineering is one tool for tackling reliability problems!
34. CNCF WG Whitepaper • Harness and Improve System Reliability • Direct Benefits for Cloud Native Systems • Software and Operational Practices In Production
35. CNCF WG Whitepaper •Service release impact on system •Third-party dependency out of reach •Network/CPU/Disk failure •Lack of team/org communication during degraded conditions •Multi-cloud migration
36. Deliverable 2: Landscape
37. CNCF WG Landscape
38. CNCF Landscape Some awesome tools But segmented and sparse
39. CNCF WG Landscape Many dimensions! Need community feedback to find the right approach for users to sense which tools to try and how they can complement each other
40. CNCF WG •A new practice so where to draw a line? •How to better engage with the community? •Everyone has failures and recovery stories to share! We should aggregate them!
41. Short-term Milestone
42. The community needs to take a stand on reliability!
43. End of Act II
44. Act III The one with a plan
45. Chaos Engineering must not be reduced to its tooling or definition
46. is a deliberate practice to explore the unknown to
47. But why Chaos Engineering?
48. Because Reliability - in all its facets - is strategic to everyone
49. To Collaborate, on that Crucial Requirement for Reliability, we need a Platform to Share our Knowledge
50. A short detour...
51. Google Cloud Recommendations for your Black Friday •Awesome read •Full of tips (planning, playbooks, postmortems...) •Mentions Disaster Recovery and Chaos Monkey BUT wouldn’t it be better if it offered runnable experiments? https://cloud.google.com/solutions/black-friday-production-readiness
52. Runnable experiments?
53. Yes, to share our engineering knowledge with our peers!
54. The have given us the to do just that
55. Hypothesis
56. Experiment
57. Observation
58. Finding
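The four steps above (hypothesis, experiment, observation, finding) can be made concrete as a runnable experiment. This is a minimal sketch under stated assumptions: the `Experiment` and `Finding` classes, the `run` function and the toy replica system are all hypothetical names invented for illustration, and they are not the Chaos Toolkit API or the Open API format referenced at the end of the deck.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Experiment:
    title: str
    hypothesis: Callable[[], bool]  # steady-state check: True when the system is healthy
    method: Callable[[], None]      # the condition we deliberately introduce
    rollback: Callable[[], None]    # restores the system afterwards


@dataclass
class Finding:
    title: str
    steady_before: bool
    steady_after: Optional[bool]  # None when the experiment was not run
    summary: str

    @property
    def deviated(self) -> bool:
        # A deviation means: healthy before, not healthy after the method ran.
        return self.steady_before and self.steady_after is False


def run(exp: Experiment) -> Finding:
    """Hypothesis -> experiment -> observation -> finding."""
    before = exp.hypothesis()
    if not before:
        # Do not experiment on a system that is already unhealthy.
        return Finding(exp.title, False, None, "aborted: not in steady state")
    try:
        exp.method()
        after = exp.hypothesis()  # the observation
    finally:
        exp.rollback()
    summary = "hypothesis held" if after else "weakness found: hypothesis deviated"
    return Finding(exp.title, before, after, summary)


def make_experiment(replicas: List[str]) -> Experiment:
    """Toy in-memory system: a service is up while at least one replica is healthy."""
    healthy = set(replicas)
    removed: List[str] = []

    def kill_one_replica() -> None:
        victim = sorted(healthy)[0]
        healthy.discard(victim)
        removed.append(victim)

    def restore() -> None:
        healthy.update(removed)
        removed.clear()

    return Experiment(
        title="Service survives the loss of one replica",
        hypothesis=lambda: len(healthy) >= 1,
        method=kill_one_replica,
        rollback=restore,
    )


if __name__ == "__main__":
    for replicas in (["a", "b"], ["a"]):
        finding = run(make_experiment(replicas))
        print(replicas, "->", finding.summary)
```

Because the experiment and its finding are plain data plus a few functions, they can be shared, rerun and peer reviewed by another team against its own system.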
59. is and brings you a for exploring your system’s reliability
60. Chaos Engineering is For Software/System Engineering
61. We Must Strive to and to Building More Reliable Systems
62. To Unlock that Potential, the Industry must work towards
63. Kubernetes has paved the way
64. Serverless WG is a good example
65. Open Chaos Initiative as articles of interest across teams, across organisations and even between organisations. such that others can peer review and even suggest improvements and comparisons with their own findings based on similar experiments. and technical robustness of systems. on how to improve the resilience
66. Let’s recall what NASA discovered…
67. End of Act III
68. Thank You Sylvain Hellegouarch Chaos CTO https://chaosiq.io @lawouach
69. Explore further... •Principles of Chaos Engineering http://principlesofchaos.org/ •Open Chaos Initiative https://openchaos.io/ •CNCF Chaos Engineering WG Whitepaper https://github.com/chaoseng/wg-chaoseng/blob/master/WHITEPAPER.md •Experiment/Journal Open API https://docs.chaostoolkit.org/reference/concepts/ •How complex systems fail https://www.researchgate.net/publication/228797158_How_complex_systems_fail •NASA Failures Case Studies https://nsc.nasa.gov/resources/case-studies