Postmortems Help Hardening Kubernetes KubeCon NA 2018 Seattle Puja

1. Hardening Kubernetes Setups: War Stories from the Trenches of Production - Puja Abbassi (@puja108) - 11.12.2018
2. Puja (@puja108) - Developer Advocate / Product Owner @ Giant Swarm - #CKA #Security #Operators - Data & Network Science “Almost-PhD” (diagram labels: customer, product, community)
3. Agenda: 1. On running 100+ clusters 2. Postmortems - Lots of them! 3. Hardening and Best-Practices
4. On running 100+ clusters
5. 100+ clusters - Different Clouds - Different Regions - On-Premise - China
6. Diversity - Companies - Industries - Users - Use Cases
7. Freedom vs. Control - Opt for Freedom - Educate Users - Harden up
8. Postmortems - Lots of them!
9. Postmortem Philosophy - “The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.” - Google SRE book
10. Single Product: 1. Gather Issues 2. Fix in Code 3. Roll out continuously 4. Profit 😉
11. Postmortem Practice - Issue Template - High Priority - Assigned to cross-functional team
12. Load Balancing Postmortems (diagram labels: PM, Team 1, Team 2, Team 3)
13. 500+ Postmortems
14. War Stories
15. Kubernetes upstream issue: #57992 (fixed in 1.11.4 and 1.12.0)
16. Ingress Controller Misconfiguration - Faulty ingress objects can break controller - Lots of teams + lots of freedom = lots of issues
17. Ever built a full-mesh IPIP tunnels ICMP pinger?
18. Customer Load Test goes bad? You take the blame! - “Must be Calico, kube-proxy, IC!” - Turns out EC2 network saturation was the bottleneck - Solution: More workers!
19. Hardening and Best-Practices
20. Postmortem Hotspots - Old versions - Ingress (~15%) - Networking & DNS - Resource Pressure - Multi-tenancy
21. Old versions - Issues might have been solved already - CVEs - Test Upgrades extensively - Automate Upgrades (or have a process)
22. Ingress - NGINX IC: Newer versions are less prone to misconfiguration - Separate controllers - Load- and failover-testing - Last resort: SVC of type LB
23. Networking & DNS - Monitor network health - Monitor DNS latency - Check for known issues - Apply best practices
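One way to approach the "Monitor DNS latency" point is a small probe that times repeated lookups and flags the slow ones. The following is a minimal sketch, not the speaker's tooling: the function names, the sample count, and the 1-second threshold are assumptions made for illustration, and a production setup would more likely export these numbers to a monitoring system.

```python
import socket
import time
from typing import Callable, List


def measure_dns_latency(
    hostname: str,
    samples: int = 5,
    resolver: Callable[[str], object] = lambda h: socket.getaddrinfo(h, None),
) -> List[float]:
    """Time `samples` resolutions of `hostname`; return latencies in seconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            resolver(hostname)
        except OSError:
            # socket.gaierror is an OSError; a failed or timed-out lookup
            # still gets recorded, since slow failures are worth alerting on.
            pass
        latencies.append(time.perf_counter() - start)
    return latencies


def slow_lookups(latencies: List[float], threshold_s: float = 1.0) -> List[float]:
    """Return the latencies at or above the (assumed) alerting threshold."""
    return [value for value in latencies if value >= threshold_s]
```

Run from inside a pod, something like `slow_lookups(measure_dns_latency("kubernetes.default.svc.cluster.local"))` would surface multi-second lookups. The `resolver` parameter exists only so the probe can be exercised without real network access.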
24. Resource Pressure - Resource Management! - Include Buffers (lots of them) - Protect K8s and critical addons (priority)
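The "Include Buffers" advice can be made concrete with a little arithmetic: subtract what is reserved for Kubernetes and critical addons from node capacity, then hold back a further percentage as headroom. This is a rough sketch under assumed numbers; `NodeBudget`, its fields, and the 20% buffer are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NodeBudget:
    capacity_millicores: int   # total CPU on the node
    reserved_millicores: int   # set aside for kubelet, OS, critical addons
    buffer_percent: int        # extra headroom kept free, e.g. 20 for 20%

    def schedulable_millicores(self) -> int:
        """CPU left for workloads after reservation and buffer."""
        usable = self.capacity_millicores - self.reserved_millicores
        # Integer arithmetic avoids floating-point rounding surprises.
        return usable * (100 - self.buffer_percent) // 100


def fits(budget: NodeBudget, requests_millicores: Iterable[int]) -> bool:
    """True if the summed CPU requests stay within the buffered budget."""
    return sum(requests_millicores) <= budget.schedulable_millicores()
```

With a 4000m node, 500m reserved, and a 20% buffer, 2800m remains for workloads, so three pods requesting 1000m each would not fit even though raw capacity suggests they would.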
25. Multi-tenancy - Separate and isolate namespaces with RBAC - No cluster-admins! - Separate clusters if possible - Automate with CI/CD - Minimize manual ops
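To illustrate namespace isolation with RBAC and "Automate with CI/CD", the sketch below generates a namespace-scoped Role and RoleBinding as plain dictionaries that a pipeline could serialize and apply. The helper name, the group name, and the chosen resources and verbs are assumptions for illustration; only the `rbac.authorization.k8s.io/v1` object shapes come from the Kubernetes API.

```python
import json
from typing import Dict, List

RBAC_GROUP = "rbac.authorization.k8s.io"


def namespace_rbac(namespace: str, group: str) -> List[Dict]:
    """Build a Role and RoleBinding that confine `group` to `namespace`."""
    role_name = f"{namespace}-developer"  # hypothetical naming scheme
    role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",  # namespaced, deliberately not a ClusterRole
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["", "apps"],
                "resources": ["pods", "services", "deployments"],
                "verbs": ["get", "list", "watch", "create", "update", "delete"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group, "apiGroup": RBAC_GROUP}],
        "roleRef": {"apiGroup": RBAC_GROUP, "kind": "Role", "name": role_name},
    }
    return [role, binding]


if __name__ == "__main__":
    for manifest in namespace_rbac("team-a", "team-a-devs"):
        print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, each printed object could be applied from a pipeline, which keeps tenant permissions reproducible and avoids handing out cluster-admin by hand.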
26. Best Practices - Preemptive Monitoring & Alerting are key! - Logging (and Tracing) help debugging - Fix issues fast - Educate users - Have a postmortem process - Train Recovery
27. Stand on the Shoulders of Giants! - Kubernetes the very hard way (Datadog) - Scaling Kubernetes to 2,500 Nodes (OpenAI) - “5 - 15s DNS lookups on Kubernetes?” (BitMEX) - Scaling CoreDNS in Kubernetes Clusters - Inside Kubernetes Resource Management (QoS) (Michael Gasch) - List of Kubernetes Best Practice talks/blogs - Kubernetes Office Hours
28. Thank you! Questions? Stay in touch - Twitter: @puja108 - GitHub: puja108 - Slack/Discuss: puja