kubecon 2018 observability thommccann amreth

1. Observability at Scale: A key tenet of multi-tenancy
Thom McCann & Amreth Chandrasehar, T-Mobile Cloud Team
2. About T-Mobile
The Un-carrier:
• Netflix on Us
• Real humans for Customer Service
• Taxes included in bill
• Unlimited Voice, Data and Text
• T-Mobile Tuesdays
77 million customers
Operating two flagship brands: T-Mobile and Metro by T-Mobile
50,000+ employees
3. Running at Scale
Customer Experience driving Un-carrier moves
• 10,000 APIs across 100s of applications
• Massive adoption of cloud native technologies in data center and public cloud over the last 3 years
Public Cloud
• 6 years, 70+ applications, AWS and Azure with data center network integration
• 40 container-based applications using shared “multi-tenant” orchestration
• Netflix integration, Retail agent application, T-Mobile for Business, Commerce, Biometric login, Team of Experts, On Device upgrade, Coverage Maps, Social messaging
4. Multi-tenancy drives quick adoption
Shared clusters for multiple app teams
• Enable consistency across applications
By the numbers
• 7,200 Containers
• 31 Billion Requests in 2018 (Dec)
• 0 Minutes Downtime (Prod)
• 29+ Months
• 2K – 14K RPS
• 0 P1s
Integrated Services
• Integrated CI/CD
• Logging
• Telemetry
• Load Balancing / Networking
• DNS
• Security / Scanning
• Service Discovery
• Secrets Management
• Certs
5. Observability promised land
Build the culture, drive towards intelligent ops
• Encourage application teams, proactive engagement
• Begin at the beginning, don’t skip steps
• Be where developers are: CLI, chat, pipelines
Data Collection: Logging, Metrics, Request tracing, Cost Information
Telemetry: Dashboards, Reports, Service Health
Proactive Culture: Default path, Chat Ops / Slack, Integration into tools, Alerting (noise to signal)
Analytics and Intelligent Ops: Trend Analysis, Optimization (performance and cost), Machine Learning, Embedded Analytics / decisions
6. The toolbox
Prometheus infra supports
• 28 Clusters, 1.5M metrics, 4K RPS
• “Infinite” storage using Thanos and S3
Grafana
• 150 Orgs, 1,000+ dashboards, 820 alerts
• Drive a culture of DevOps success
High availability, multi-region, isolated failure domains
• Prometheus, Grafana and Service Health
7. Telemetry everywhere
• Internal “public” alert stream of clusters in Slack
• Application-specific channels provide proactive alerting and support
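One way to picture the two kinds of channels is a small router that sends every alert to the shared cluster stream and also to the owning application's channel. This is only a sketch under assumptions: the input shape follows the Alertmanager webhook payload (`alerts`, `labels`, `annotations`), while the `app` label, the channel names, and the `route_alerts` helper are hypothetical and not taken from the talk.

```python
def route_alerts(payload, channel_by_app, default_channel="#cluster-alerts"):
    """Return (channel, text) pairs for each alert in a webhook payload.

    Every alert goes to the shared stream (default_channel). If the alert
    carries an "app" label (hypothetical) that maps to a channel, the same
    text is also routed to that application-specific channel.
    """
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unknown")
        status = alert.get("status", "firing").upper()
        summary = annotations.get("summary", "")
        text = f"[{status}] {name}" + (f": {summary}" if summary else "")

        messages.append((default_channel, text))
        app_channel = channel_by_app.get(labels.get("app"))
        if app_channel:
            messages.append((app_channel, text))
    return messages


# Hypothetical payload and channel mapping, for illustration only.
example_payload = {
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighLatency", "app": "checkout"},
            "annotations": {"summary": "p99 above 2s"},
        },
        {"status": "resolved", "labels": {"alertname": "NodeDown"}},
    ]
}
for channel, text in route_alerts(example_payload, {"checkout": "#checkout-alerts"}):
    print(channel, text)
```

Posting the resulting text to Slack is left out on purpose; the sketch only covers the routing decision.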
8. Ingress Dashboard
9. Operational Dashboard
10. API Dashboard
11. Service Dashboard
12. Call Flow Dashboard
13. Regional Service availability
14. Tabular example
15. Background FTW
16. Observe this: Cost
17. Observe this: Cost
• Showback
• Proactive recommendations
• Create a culture of cost awareness
• Give execs and developers decision tools
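As a rough illustration of showback on a shared cluster, the sketch below splits a total bill across teams in proportion to a usage figure. The allocation rule (proportional to memory GiB-hours), the team names, and the `showback` helper are assumptions made for the example; the slides do not describe the actual model.

```python
def showback(total_cost, usage_by_team):
    """Split total_cost across teams in proportion to their usage.

    usage_by_team maps a team name to a usage figure, assumed here to be
    memory GiB-hours. Returns a dict of team name to allocated cost.
    """
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical monthly figures, for illustration only.
allocation = showback(1000.0, {"team-a": 300.0, "team-b": 100.0})
for team, cost in allocation.items():
    print(f"{team}: {cost:.2f}")
```

A proportional split keeps the numbers easy to explain to both executives and developers, which suits a showback report better than an exact chargeback.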
18. What’s your σmemUtil?
Data Lake
• Prometheus data and cost data
• Utilization analysis in Spark
Short-running containers
• 600k unique containers run in a month, more than 90% of them for less than a day
• CPU not a factor, but memory…
Avg. Utilization: 51.49%
Std. Deviation: 31.35%
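The average and standard deviation on this slide can be pictured with a few lines of Python. This is a sketch, not the Spark job from the talk: it uses the standard library instead of Spark, assumes utilization means memory used divided by memory requested, and runs on made-up sample values.

```python
from statistics import mean, pstdev


def mem_utilization(used_bytes, requested_bytes):
    """Memory utilization as a percentage of the requested amount (assumed definition)."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return 100.0 * used_bytes / requested_bytes


def summarize(samples):
    """Return (average, population standard deviation) of utilization.

    samples is an iterable of (used_bytes, requested_bytes) pairs,
    one pair per container.
    """
    utils = [mem_utilization(used, requested) for used, requested in samples]
    return mean(utils), pstdev(utils)


# Hypothetical per-container samples, for illustration only.
samples = [(256, 1024), (512, 1024), (768, 1024), (1024, 1024)]
avg, sigma = summarize(samples)
print(f"avg={avg:.2f}% sigma={sigma:.2f}%")
```

A large σ relative to the average, as on the slide, suggests memory requests are sized inconsistently across containers, which is where cost recommendations would start.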
20. Reference of images
21. Health check map of K8s cluster