Walmart Labs | Vilas Veeraraghavan | Chaos Engineering: Past, Present and Future

CodeWarrior

Published 2019/07/08 in the Programming category

GIAC2019 

Text content
1. CHAOS ENGINEERING – PAST, PRESENT AND FUTURE • Vilas Veeraraghavan, Director of Engineering, Walmart
3. THE JOURNEY • Past (2010 - 2015) • Origins • Lessons learned • Tools • Present (2016-2019) • Increasing adoption • Continued improvements and innovation • Mainstream tools • Future (2019 - ) • Automation • Organizational status
4. PART ONE - PAST
5. PIVOT TO CLOUD • Datacenter dependencies • Emergence of cloud services • Breaking up monoliths into microservices
6. PIVOT TO CLOUD • Unreliability – Issues with the cloud providers • Multiple incidents with AWS • How can we protect applications against outages?
7. REFERENCE https://www.infoq.com/presentations/chaos-architecture-mindset
8. TOOLS https://github.com/Netflix/chaosmonkey
9. HOWEVER http://www.quickmeme.com/meme/3oh4hu
10. DETERMINISTIC FAILURES
11. OBSERVABILITY • Logging alone is not enough • Ability to differentiate between healthy and unhealthy behavior • Alerts set up • Make sure your on-call knows what to do for specific problems
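A minimal sketch of what "differentiate between healthy and unhealthy behavior" can look like in practice, assuming a hypothetical metrics backend and paging hook (get_error_rate and page_on_call are stand-ins, not tooling from the talk):

```python
import time

ERROR_RATE_THRESHOLD = 0.01     # assumed bar: >1% failed requests counts as unhealthy
CHECK_INTERVAL_SECONDS = 30

def get_error_rate() -> float:
    """Placeholder: query your metrics backend for the current error rate."""
    return 0.0

def page_on_call(message: str) -> None:
    """Placeholder: route an alert to the on-call engineer, ideally with a runbook link."""
    print(f"ALERT: {message}")

def watch_steady_state(duration_seconds: int = 300) -> bool:
    """Return True if the system stayed healthy for the whole observation window."""
    deadline = time.time() + duration_seconds
    while time.time() < deadline:
        rate = get_error_rate()
        if rate > ERROR_RATE_THRESHOLD:
            page_on_call(f"error rate {rate:.2%} exceeded {ERROR_RATE_THRESHOLD:.2%}")
            return False
        time.sleep(CHECK_INTERVAL_SECONDS)
    return True
```

The same check can run continuously in normal operation and double as the "after" verification step during a chaos exercise.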
12. LESSONS LEARNED • Resiliency cannot be achieved by injecting chaos without control • Running a chaos exercise is NOT an individual effort • Observability of metrics is key to getting value out of exercises
13. CLOUD ARCHITECTURE [diagram: users/frontend connect over the Internet through a firewall to a load balancer, which routes to application servers in Region A and Region B, each region backed by a database]
14. PART TWO - PRESENT
17. MOTIVATIONS • Reliability is no longer a function of redundancy and over-scaled hardware. • Specifically, to exist in a hybrid cloud environment, we have to acknowledge that cloud providers are external dependencies that carry reliability risk. • Customers expect more, and ‘scheduled downtime’ is no longer an acceptable term. • A user performing transactions (search, add to cart, payment) should not perceive a loss of functionality due to systemic failure.
18. MOTIVATIONS • Users can lose trust in the brand due to a single bad experience • The loss could be temporary OR last a lifetime • Resiliency is not a local goal – it is a global goal
19. GOAL To maintain an application ecosystem where failures in infrastructure and dependencies cause minimal disruption to the end user experience
20. WHAT IS CHAOS ENGINEERING? • It is NOT to be confused with integration or performance testing • Integration testing ensures that functional requirements and contracts with teams have been met • Performance vs. resiliency – what is the difference? • You can be performant but not resilient • And vice versa https://principlesofchaos.org/
21. PRESENT • Chaos engineering has gone mainstream • Big companies use it now – Walmart, Nike, Target • You can use it too!
22. CREATING RESILIENT SYSTEMS USING CHAOS ENGINEERING
23. TOOLS https://github.com/dastergon/awesome-chaos-engineering
24. TECHNIQUES • Engineering organization needs to be motivated • Incentivize teams to conduct scientific experiments and FAIL! • SRE team charter needs to include chaos • Establish a core practitioners group inside the company • These are the key tech leads from all the technology groups
25. TEAM • Everyone has a hand in the success • Can't hire everyone – make the right choices – hire senior engineers • A little bit of everything – different people have different skills • Counter-intuitive – incentivize preventing outages by causing them
26. CONVINCING MANAGEMENT • “Resiliency” is the end goal NOT chaos • Use outages as the way to motivate management • Do your homework!! – calculate support costs
27. KEY TESTS • Infrastructure issues – failures, glitches, faulty maintenance policies • Dependency failures – changing versions of APIs, changing SLAs • Deployment issues – is the app even deployed right?
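As an illustration of the "dependency failures" test (not tooling from the talk): a small fault-injection wrapper that degrades a downstream call with extra latency and intermittent errors, so the caller's fallback path can be exercised. fetch_prices is a hypothetical downstream call and the numbers are assumptions.

```python
import random
import time

FAILURE_RATE = 0.2            # assumed: fraction of calls that fail outright
ADDED_LATENCY_SECONDS = 2.0   # assumed: simulate a dependency missing its SLA

def inject_faults(call):
    """Wrap a dependency call with injected latency and intermittent errors."""
    def wrapper(*args, **kwargs):
        time.sleep(ADDED_LATENCY_SECONDS)            # degraded SLA
        if random.random() < FAILURE_RATE:
            raise ConnectionError("injected dependency failure")
        return call(*args, **kwargs)
    return wrapper

@inject_faults
def fetch_prices(item_id: str) -> dict:
    # Placeholder for the real downstream service call.
    return {"item": item_id, "price": 9.99}

def get_prices_with_fallback(item_id: str) -> dict:
    """The behavior under test: the user should see cached data, not an error."""
    try:
        return fetch_prices(item_id)
    except ConnectionError:
        return {"item": item_id, "price": None, "source": "cache"}

if __name__ == "__main__":
    print(get_prices_with_fallback("sku-123"))
```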
28.–30. INFRASTRUCTURE FAILURES [three diagram slides reusing the cloud architecture from slide 13 (users/frontend, Internet, firewall, load balancer, application servers in Regions A and B, databases) to illustrate successive infrastructure failures]
31.–33. DEPENDENCY FAILURES [three diagram slides: users call a front end that depends on Applications 1–3; one slide marks a failed application dependency with an X, another adds a DB behind Application 2, and the third marks that DB as failed]
34. PREREQUISITES • Create Disaster Recovery (DR) failover playbook • Define critical dependencies • Compose playbook for critical dependency failures • Define non-critical dependencies • Define thresholds at which non-critical dependency failures will impact system
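One way to make these prerequisites concrete is to keep the dependency inventory as data. A sketch under assumed names (the services, playbook URLs, and thresholds are illustrative only):

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    name: str
    critical: bool
    playbook_url: str               # where the failure playbook lives
    error_budget_pct: float = 0.0   # non-critical: tolerated failure rate before user impact

# Illustrative entries; real inventories come from the teams that own the services.
DEPENDENCIES = [
    Dependency("payments-api", critical=True,
               playbook_url="https://wiki.example.com/playbooks/payments"),
    Dependency("recommendations", critical=False,
               playbook_url="https://wiki.example.com/playbooks/recs",
               error_budget_pct=5.0),
]

def critical_dependencies() -> list:
    """Critical dependencies are the first targets for playbooks and failure tests."""
    return [d for d in DEPENDENCIES if d.critical]
```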
35. EXPERIMENT 1. Identify what you intend to fail 2. Before: write down the hypothesis of system behavior based on known assumptions 3. Inform key stakeholders 4. Run the test 5. After: verify whether the hypothesis holds 6. If the hypothesis does not hold, analyze and fix. If it holds, the hypothesis is validated
36. EXPERIMENT 1. Identify what you intend to fail 2. Before: write down the hypothesis of system behavior based on known assumptions 3. Inform key stakeholders 4. Run the test 5. After: verify whether the hypothesis holds 6. If the hypothesis does not hold, analyze and fix. If it holds, the hypothesis is validated Repeat!!
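A compact sketch of that six-step loop, with hypothetical helpers standing in for real failure injection, steady-state verification, and stakeholder notification:

```python
def run_chaos_experiment(target: str, hypothesis: str,
                         inject_failure, verify_hypothesis, notify) -> bool:
    """One pass through the experiment loop; call again to repeat."""
    notify(f"Starting chaos experiment on {target}: {hypothesis}")   # step 3
    inject_failure(target)                                           # step 4
    held = verify_hypothesis(target)                                 # step 5
    if held:
        notify(f"Hypothesis validated for {target}")                 # step 6
    else:
        notify(f"Hypothesis failed for {target}: analyze, fix, and repeat")
    return held

if __name__ == "__main__":
    run_chaos_experiment(
        target="region-a-app-servers",
        hypothesis="traffic fails over to Region B with no user-visible errors",
        inject_failure=lambda t: print(f"(injecting failure into {t})"),
        verify_hypothesis=lambda t: True,   # replace with a real steady-state check
        notify=print,
    )
```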
37. LEVELS
38. LEVEL 1 • All of the prerequisites stored in a single, well-defined place • Agreement on playbooks to be used by Devs, Testers, Operations, Stakeholders • Manual exercise that validates the DR failover playbook
39. LEVEL 2 • All of level 1 requirements, plus • Run a failure test for critical dependencies in a non-prod environment • Publish test results to team, stakeholders • Manual tests are acceptable
40. LEVEL 3 • All of level 2 requirements, plus • Run tests regularly on a cadence (at least once every 4–5 weeks) • Publish results to dashboards to track resiliency over time • Run at least one resiliency exercise (failure injection) in production environment
41. LEVEL 4 • All of level 3 requirements, plus • Automated resiliency testing in non-prod environment • Semi-automated DR failover scripts (minimal human supervision required)
42. LEVEL 5 • All of level 4 requirements, plus • Automated resiliency testing fully integrated into CI/CD environment • Resiliency failure results in build failure • Automated resiliency testing and DR failover testing enabled in production environment
43. SUPPORT COSTS
44. PART THREE - FUTURE
45. CI/CD [diagram: a continuous integration/continuous deployment pipeline (Plan, Code, Build, Test, Deploy) feeding into the cloud]
46. CI/CD • Make resiliency testing a part of the continuous integration cycle • Continuous delivery pipelines can have checks on resiliency metrics https://concord.walmartlabs.com/
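The Concord + Gremlin plugin linked below is one real integration; as a tool-agnostic sketch of the same idea, a pipeline step can run the resiliency suite and fail the build when the score drops below the agreed bar (run_resiliency_suite is a hypothetical hook into whatever chaos tooling the pipeline uses):

```python
import sys

RESILIENCY_BAR = 0.99   # assumed bar: fraction of injected failures handled gracefully

def run_resiliency_suite() -> float:
    """Placeholder: trigger the chaos experiments and return a pass ratio."""
    return 1.0

def main() -> None:
    score = run_resiliency_suite()
    print(f"resiliency score: {score:.2%} (bar: {RESILIENCY_BAR:.2%})")
    if score < RESILIENCY_BAR:
        sys.exit(1)   # a non-zero exit fails the CI/CD stage and blocks the deploy

if __name__ == "__main__":
    main()
```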
47. EXAMPLE https://concord.walmartlabs.com/docs/plugins/gremlin.html
48. CI/CD – THE REAL PICTURE
49. TOOLS • Gremlin, Chaos Toolkit, ChaosBlade • Kubernetes-based deployments also need chaos – use PowerfulSeal
50. COMPLETE AUTOMATION
51. REFERENCES • https://medium.com/netflix-techblog/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4 • https://medium.com/walmartlabs/charting-a-path-to-software-resiliency-38148d956f4a • https://www.youtube.com/watch?v=4Gy_5EQMrB4
52. THANK YOU
53. QUESTIONS??