Cameron-SRE at Airbnb: On-call and Incident Response

计一南

Published 2017/11/14 in the Technology category

As teams embrace a DevOps approach and become more distributed, how do you ensure a consistent, efficacious, efficient approach during your most severe incidents? Can you do that with a 100% volunteer on-call rotation? We will discuss how Airbnb's SRE team builds tools to empower on-call engineers to respond to and resolve incidents quickly, how we think about user impact during an incident, and how we ensure the engineering team learns from incidents.

Slide text
1. Cameron Tuckerman-Lee / DevOpsDays Shanghai / 2017-08-18 SRE at Airbnb
2. Cameron Tuckerman-Lee Airbnb Site Reliability Engineer
3. SRE at Airbnb (agenda: DevOps & SRE, SRE Organization, Future of Ops). DevOps & SRE: How do you combine the culture and spirit of DevOps with an operations team?
4. SRE at Airbnb (agenda). SRE Organization: How is SRE at Airbnb organized? Cloud Infra and Reliability deep-dive.
5. SRE at Airbnb (agenda). Future of Ops: Operators should grow, learn, and be recognized for on-call work, while maintaining pager-life balance.
6. DevOps & SRE
7. Centralized Operations Organization. Positives: reliability can be easily prioritized; roles can specialize. Negatives: operators are unfamiliar with the code base; tension arises between operations and development.
8. Distributed Operations. Positives: agility can be easily prioritized; developers are incentivized to build systems that are easy to operate (since they are the operators!). Negatives: lack of specialization, so developers are forced to relearn difficult lessons over and over; teams speak different uptime/reliability languages to each other.
9. Hybrid Approach: Two Pizza Teams + SRE Team. Able to 'tune' a balance between reliability and agility. Developers are still expected to run normal operations for their services, which means they build operable services. A centralized operations organization can build reusable tools to make operations and incident response easier. Specialization of roles without tension between operations and development teams. An organization that understands and recognizes the value in automating away its own job.
10. "Fundamentally, it's what happens when you ask a software engineer to design an operations function..." Ben Treynor, VP Engineering, Google
11. SRE Organization
12. What makes up SRE at Airbnb? Site Reliability Engineering is made up of three components. Cloud Infrastructure: manages our touch points with AWS and other cloud partners. Core Reliability: develops tools and processes to improve operations, reliability, and incident response for all teams. Embedded Reliability: temporarily embeds SREs in product teams to work on specific reliability- or availability-focused projects.
13. Cloud Infrastructure
21. Requirements for Each Integration: Monitoring, Alerting, Security Approval, Auditing, Version Upgrades, Access Control, ...
23. Reliability
24. Three Pillars of Reliability. Uptime Measurement: every team at any time should be able to confidently say whether their service is working properly or not. Alerting & Detection: defense in depth, meaning our users are protected from bugs and regressions by multiple layers of opinionated alerts. Incident Response: engineers can coordinate across teams, investigate problems in systems they don't fully understand, and keep stakeholders up to date.
25. Pillar 1, Uptime. Identify quantifiable metrics related to the health of your services (Service Level Indicators, or SLIs). Make public and easily discoverable promises about the behavior of your service using your SLIs (Service Level Objectives, or SLOs). Teams review their service's current SLIs and compare them to their published SLOs to make tradeoffs between reliability improvements and new features; SLOs encode the tradeoff between moving fast and breaking things (error budgets).
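The error-budget idea on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, not Airbnb's tooling: the `SLO` class, the `error_budget_remaining` function, the service name, and the traffic numbers are all hypothetical, and it assumes the SLI is a simple ratio of good events to total events.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A published promise about an SLI (hypothetical structure)."""
    name: str
    target: float  # promised fraction of good events, between 0 and 1


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo.target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures actually observed
    if allowed_bad == 0:
        # A 100% target tolerates no failures at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: a 99.9% target over one million requests
# tolerates about 1,000 failures; 500 observed failures spend about half.
slo = SLO(name="search-availability", target=0.999)
remaining = error_budget_remaining(slo, good_events=999_500, total_events=1_000_000)
```

A team with most of its budget left can lean toward shipping features, while a team that has overspent would lean toward reliability work.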
26. Pillar 2, Alerting. Alerting philosophy should be opinionated: engineers know what kind of alerts to write and when to write them. Alerts (like configuration) should be code. Practice defense in depth: protect your users from bugs and regressions with layers of alerts, just as a security team protects employees from being compromised with layers of defenses.
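One way to picture "alerts as code" with layered coverage is a plain, version-controlled list of alert definitions that can be reviewed and tested like any other code. This is only an illustrative sketch: the `Alert` class, the layer labels, the metric names, and the thresholds are assumptions made for the example and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    name: str
    layer: str                      # hypothetical layer label
    metric: str                     # hypothetical metric name
    fires: Callable[[float], bool]  # condition on the latest metric value
    severity: str                   # "page" or "ticket"


# Each layer watches the system from a different distance (illustrative values).
ALERTS: List[Alert] = [
    Alert("high_cpu", "host", "cpu_utilization", lambda v: v > 0.90, "ticket"),
    Alert("error_rate", "service", "http_5xx_ratio", lambda v: v > 0.01, "page"),
    Alert("bookings_drop", "business", "bookings_per_minute", lambda v: v < 10, "page"),
]


def evaluate(alerts: List[Alert], metrics: Dict[str, float]) -> List[str]:
    """Return names of firing alerts; metrics absent from the sample are skipped."""
    return [a.name for a in alerts if a.metric in metrics and a.fires(metrics[a.metric])]


# A regression that looks healthy at the host and service layers
# is still caught by the business-level alert.
firing = evaluate(
    ALERTS,
    {"cpu_utilization": 0.50, "http_5xx_ratio": 0.002, "bookings_per_minute": 3},
)
```

Because the definitions are ordinary data, a unit test can feed in sample metrics and check which alerts fire before a change is merged.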
35. Pillar 3, Response: Incident Reporter Tool. Mid-incident: engineers can effectively coordinate, even across teams; stakeholders (upstream clients, management, employees) are kept aware of updates; a Slack integration is in progress so responders can stay in chat but keep the company up to date. Post-incident: blameless postmortem process; consistent impact measurement (management sees that better incident response and corrective actions matter to the bottom line); easily searchable past incidents and postmortems.
37. Future of Ops
38. Future of Ops: People-First On-call. Pager-Life Balance: ensure that more involved, tenured engineers aren't always the ones waking up at 3 AM to put out fires. Learning/Growth Focused: continuing education and learning opportunities for on-call engineers. Evaluation Metrics: engineers should know where they can improve and should be recognized for excellent work. Intelligent Scheduling: in DevOps, when every team has at least two on-call rotations, how can we schedule around lives outside of work (and responsibilities inside of work)?
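The scheduling question on this slide is left open in the talk. As one possible starting point, and purely as an assumption-laden sketch, a greedy scheduler can spread shifts evenly while honoring declared unavailability. The `schedule` function, the engineer names, and the shift labels below are hypothetical.

```python
from typing import Dict, List, Set


def schedule(
    engineers: List[str],
    shifts: List[str],
    unavailable: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Greedily give each shift to the available engineer with the fewest shifts so far."""
    load = {e: 0 for e in engineers}
    assignment: Dict[str, str] = {}
    for shift in shifts:
        candidates = [e for e in engineers if shift not in unavailable.get(e, set())]
        if not candidates:
            raise ValueError(f"nobody is available for {shift}")
        # min() returns the first minimal element, so ties follow list order.
        chosen = min(candidates, key=lambda e: load[e])
        assignment[shift] = chosen
        load[chosen] += 1
    return assignment


rota = schedule(
    engineers=["ana", "bo", "chen"],
    shifts=["mon", "tue", "wed", "thu"],
    unavailable={"ana": {"mon"}, "bo": {"wed"}},
)
```

A greedy pass like this balances load but does not weigh tenure, time zones, or recent pages; those would need extra terms in the selection key.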