PID Autoscaling Strava's Linkerd Service Mesh Using Prometheus Data (J Evans, Strava)

1. PID Auto-scaling a Linkerd Service Mesh (J Evans, Strava)
2. What is a PID Controller?
3. Diagram blocks: Proportional, Integral, Derivative, Process.
4. Diagram blocks: Proportional, Integral, Derivative, Process. Setpoint: maintain 80% cpu_util.
5. Diagram blocks: Proportional, Integral, Derivative, Process. Setpoint: maintain 80% cpu_util. Process variable: cpu_util.
6. Diagram blocks: Proportional, Integral, Derivative, Process. Setpoint: maintain 80% cpu_util. Process variable: cpu_util. Error: difference between what we want and what we observe.
7. Diagram blocks: Proportional, Integral, Derivative, Process. Setpoint: maintain 80% cpu_util. Process variable: cpu_util. Error: difference between what we want and what we observe. Control variable: number of instances.
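The loop built up on slides 3 to 7 matches the textbook PID control law, written here in standard notation. The symbols $r$, $y$, $e$, $u$ and the gains $K_p$, $K_i$, $K_d$ are conventional names and do not appear on the slides.

```latex
e(t) = r - y(t), \qquad
u(t) = K_p\, e(t) \;+\; K_i \int_0^t e(\tau)\, d\tau \;+\; K_d\, \frac{d e(t)}{dt}
```

Here $r$ is the setpoint (80% cpu_util), $y(t)$ the process variable (observed cpu_util), $e(t)$ the error, and $u(t)$ the control variable (number of instances). Because adding instances lowers the process variable, an autoscaling implementation would presumably flip the sign of the error or use negative gains.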
8. P controller: oscillation, no convergence. [Plot: PV vs. Time; curve: P]
9. PI controller: oscillation, convergence. [Plot: PV vs. Time; curves: P, PI]
10. PID controller: minimal oscillation, convergence. [Plot: PV vs. Time; curves: P, PI, PID]
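The three controllers compared on slides 8 to 10 differ only in which terms are enabled. Below is a minimal sketch of a discrete-time PID step, assuming a fixed sampling interval `dt`; the class name, the gain values, and the sample readings are illustrative and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete-time PID controller (illustrative sketch only)."""
    kp: float
    ki: float = 0.0          # ki = 0 disables the integral term
    kd: float = 0.0          # kd = 0 disables the derivative term
    setpoint: float = 0.0
    _integral: float = 0.0
    _prev_error: Optional[float] = None

    def step(self, measurement: float, dt: float = 1.0) -> float:
        """Return the controller output for one new measurement."""
        error = self.setpoint - measurement
        # Integral term: accumulated error over time.
        self._integral += error * dt
        # Derivative term: rate of change of the error (zero on the first sample).
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


# Hypothetical gains; P, PI, and PID differ only in which gains are non-zero.
controllers = {
    "P": PID(kp=2.0, setpoint=80.0),
    "PI": PID(kp=2.0, ki=0.5, setpoint=80.0),
    "PID": PID(kp=2.0, ki=0.5, kd=1.0, setpoint=80.0),
}

readings = [70.0, 75.0]  # two made-up process-variable samples
outputs = {name: [c.step(m) for m in readings] for name, c in controllers.items()}
```

Whether a given loop oscillates or converges depends on the process being controlled and on the tuning of the gains, so this sketch only shows how the three terms combine, not the curves on the slides.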
11. How Strava uses PID for autoscaling
12. Diagram labels: Application container, Mesos task.
13. Deployment
14. Diagram: Deployment, PID Controller service. Setpoint: latency = 500 ms.
15. Diagram: traffic, Deployment, PID Controller service. Setpoint: latency = 500 ms.
16. Diagram: traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking. Setpoint: latency = 500 ms.
17. Diagram: traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking; Prometheus integration. Process variable: observed latency = 500 ms. Setpoint: latency = 500 ms.
19. Diagram: increased traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking; Prometheus integration. Process variable: observed latency = 500 ms. Setpoint: latency = 500 ms.
20. Diagram: increased traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking; Prometheus integration. Process variable: observed latency = 600 ms. Setpoint: latency = 500 ms.
21. Diagram: increased traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking; Prometheus integration. Error: process variable > setpoint (600 > 500). Setpoint: latency = 500 ms.
22. Diagram: increased traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking; Prometheus integration. Control variable: increase instance count 3 => 5. Setpoint: latency = 500 ms.
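The step from slides 20 to 22 (latency 600 ms against a 500 ms setpoint, instance count 3 => 5) can be sketched with a proportional-only update. The function name and the gain `kp = 0.02` instances per millisecond are assumptions chosen only so the arithmetic reproduces the numbers on the slide; the talk does not state the actual gains or whether integral and derivative terms contributed.

```python
def desired_instances(current: int, observed_ms: float, setpoint_ms: float,
                      kp: float, minimum: int = 1) -> int:
    """Proportional-only scaling step (illustrative, hypothetical gain).

    The error is taken as observed - setpoint so that latency above the
    setpoint produces a positive adjustment, i.e. more instances.
    """
    error_ms = observed_ms - setpoint_ms
    adjustment = round(kp * error_ms)
    # Never scale below the minimum instance count.
    return max(minimum, current + adjustment)


# Numbers from the slides: 3 instances, 600 ms observed, 500 ms setpoint.
scaled_up = desired_instances(current=3, observed_ms=600, setpoint_ms=500, kp=0.02)
# At the setpoint the error is zero, so the instance count is unchanged.
steady = desired_instances(current=3, observed_ms=500, setpoint_ms=500, kp=0.02)
```

With these assumed values the error is 100 ms, the adjustment is 2, and the result is 5 instances.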
23. Diagram: increased traffic, Deployment, PID Controller service. Bullets: Load balancing; Automatic retry; Circuit breaking; Prometheus integration. Setpoint: latency = 500 ms.
25. Diagram: Mesos cluster, Prometheus, request traffic, PID controller.
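In this loop the controller needs the current process variable from Prometheus. A minimal sketch of reading one value through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) might look as follows. The PromQL expression and the metric name `request_latency_ms_bucket` are hypothetical placeholders, and the helper names are not from the talk.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical PromQL expression; the real metric name depends on what the mesh exports.
LATENCY_QUERY = (
    "histogram_quantile(0.95, sum(rate(request_latency_ms_bucket[1m])) by (le))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_value(payload: dict) -> float:
    """Extract the first sample value from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each vector sample carries "value": [unix_timestamp, "value-as-string"].
    _timestamp, value = result[0]["value"]
    return float(value)


def fetch_latency_ms(base_url: str, promql: str = LATENCY_QUERY,
                     timeout: float = 5.0) -> float:
    """Query Prometheus and return the observed latency (the process variable)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_value(json.load(resp))
```

The value returned by `fetch_latency_ms` would be fed to the controller as the process variable on each iteration of the loop.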
27. How you can autoscale in Kubernetes
28. Diagram: traffic, Deployment, PID Controller service.
31. Diagram: traffic, Deployment, HPA (Horizontal Pod Autoscaler), Custom Metrics Server, custom.metrics.k8s.io. Implementation of the custom.metrics.k8s.io API using Prometheus: https://github.com/DirectXMan12/k8s-prometheus-adapter.git
32. Diagram: increased traffic, Deployment, HPA, Custom Metrics Server, custom.metrics.k8s.io.
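For comparison with the PID loop, the Kubernetes documentation describes the HPA scaling rule as a ratio: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that rule to the latency numbers used earlier in the deck; the function name is illustrative, and reusing latency as the custom metric is an assumption, not something the slides state.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, tolerance: float = 0.1) -> int:
    """Ratio-based replica calculation as described in the Kubernetes HPA docs."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 3 replicas, custom metric at 600 against a target of 500: ratio 1.2.
scale_up = hpa_desired_replicas(3, current_value=600, target_value=500)
# Metric at 520 against 500: ratio 1.04 is inside the tolerance, so no change.
no_change = hpa_desired_replicas(3, current_value=520, target_value=500)
```

Unlike a PID controller, this rule has no integral or derivative term, so it reacts only to the current ratio between the observed and target metric values.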