1. Predicting Scale with Prometheus and ML Chris Dutra, Schireson
2. Chris Dutra @dutronlabs Director, Site Reliability Engineering chris@schireson.com @chrisdutra
3. SRE @SCHIRESON
4. Moving Data Science Processes to Production
- Code migration from “Jupyter to Container”
- Achieve similar performance as laptop-driven simulations
- Provide “data science as a service” to clients
7. Moving Data Science Processes to Production
- Code migration from “Jupyter to Container”
- Achieve similar performance as laptop-driven simulations
- Provide “data science as a service” to clients
- But...
  - CPU/memory utilization is non-deterministic!
  - DS processes are long-running and resource-hungry!
  - Can we follow the same SDLC as other engineering work?
  - How does this work in a live environment?
8. Moving Data Science to Production
Solution: daemonsets!
- DS process gets node resources with light scheduling overhead
- Isolate resources via nodeSelector, node labels
- Scaling currency is now infrastructure based
- Retain orchestration capabilities!
[Diagram: Message Queue; three ds/ds-process pods on three ec2.node instances]
10. Schireson Today
[Architecture diagram: API Gateway, vue.js/nginx, Prometheus, Grafana, ELK, Python Data Science Processes (daemonsets), Go Middleware (redis, rabbitmq, etc.), AWS Redshift, AWS RDS, S3, Airflow]
11. How do we scale?
- … while minimizing SRE load? (important to me)
- … while keeping cloud spend in check? (important to boss)
- … AND while providing a great experience? (important to client)
12. We can probably scale down our daemonset nodes here...
13. Solution: Time Series Forecasting
Anticipating changes to key metrics can help an SRE team:
- Right size infrastructure
- Provide more redundancy during peak times
- Reduce cost during non-busy periods
There are many ways to forecast, but we chose LSTM.
14. LSTM Model
- Stands for Long Short-Term Memory
- Recurrent Neural Network (RNN) designed to recognize patterns in sequences of data
15. Can you predict the next word in this paragraph? Hi, my name is Chris Dutra, and thank you for joining me at this conference for software engineers and Kubernetes experts. Today, I would like to talk to you about containers!
16. Additional LSTM Examples
- What would the next musical note be if this piece was written by Mozart?
- What would the next word be if this paragraph was written by Shakespeare?
- Time Series Forecasting
17. End to End Flow
[Flow diagram: Collect Metrics -> (metrics) -> Build and Deploy Model -> (model) -> Make Predictions -> (predictions) -> Make Decisions -> Infrastructure, Alerts. Label: Work in Progress]
18. Train and Deploy the LSTM Model
ACQUIRE
- Collect data range from Prometheus
  - Large enough, but not too big
- Minor formatting of data into dataset
Example metrics:
- Replica Counts
- Requests per Second
- Upstream Latency (Envoy)
- Infrastructure
- …
19. Collecting Metrics in Prometheus
- Batch your queries when pulling data from Prometheus. Otherwise, you might run into errors like:
  Error executing query: exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution (?step=XX)
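One way to stay under that limit is to split the requested range into windows before querying. The following is a minimal sketch, not the code used in the talk: `batch_windows` is a hypothetical helper, and the 15-second step and the date are assumptions chosen for illustration. Each resulting pair would be passed as the `start` and `end` parameters of a range query.

```python
from datetime import datetime, timedelta, timezone

MAX_POINTS = 11_000  # per-series cap quoted in the Prometheus error message


def batch_windows(start, end, window, step):
    """Yield (window_start, window_end) pairs covering [start, end].

    Raises ValueError if one window would hold more than MAX_POINTS
    samples at the given step.
    """
    points = window // step + 1
    if points > MAX_POINTS:
        raise ValueError(f"{points} points per window exceeds {MAX_POINTS}")
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        yield cursor, window_end
        # Adjacent windows share a boundary timestamp, so drop duplicate
        # samples by timestamp after merging the results.
        cursor = window_end


# Illustrative values (assumptions): one day, 60-minute windows, 15s step.
day_start = datetime(2019, 1, 1, tzinfo=timezone.utc)
windows = list(
    batch_windows(
        day_start,
        day_start + timedelta(days=1),
        window=timedelta(minutes=60),
        step=timedelta(seconds=15),
    )
)
print(len(windows))  # 24
```

With a 60-minute window and a 15-second step, each query returns 241 points per series, well under the cap.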
20. Demo
- Collect 1 day of metrics, 60 minutes at a time.
- Format output and write to CSV.
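The demo code itself is not part of these slides. As a rough sketch of the formatting step, the snippet below flattens a range-query response (the `matrix` result type of the Prometheus HTTP API) into CSV rows. The helper names, the CSV column layout, and the sample payload are assumptions made for illustration; the sample is hand-written, not real data.

```python
import csv
import io


def matrix_to_rows(response):
    """Flatten a range-query response into (timestamp, series, value) rows."""
    rows = []
    for series in response["data"]["result"]:
        # Build a stable label string such as 'job=api' to identify the series.
        name = ",".join(f"{k}={v}" for k, v in sorted(series["metric"].items()))
        for timestamp, value in series["values"]:
            rows.append((timestamp, name, float(value)))  # values arrive as strings
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line to any file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["timestamp", "series", "value"])
    writer.writerows(rows)


# Hand-written sample in the shape of a range-query response (not real data).
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"job": "api"},
                "values": [[1546300800, "12.5"], [1546300815, "10"]],
            }
        ],
    },
}

buffer = io.StringIO()
rows = matrix_to_rows(sample)
write_csv(rows, buffer)
```

In practice the same two functions would be called once per 60-minute window, appending to a single CSV file.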
21. Train and Deploy the LSTM Model
TRAIN & TEST
- Split data into test and training sets
- Transform data (supervised data, scaling)
- Fit model against training set
- Evaluate model against test set, benchmarks
- Deploy model
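The split and transform steps can be sketched without any ML library. This is an illustration under stated assumptions, not the talk's implementation: the helper names, the 80/20 split, the lag of 3, min-max scaling as the scaling method, and the sample values are all made up here. Fitting and evaluating the LSTM itself is omitted.

```python
def train_test_split(series, train_fraction=0.8):
    """Split chronologically; shuffling would leak future values into training."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def fit_min_max(train):
    """Return (scale, inverse) functions mapping the training range onto [0, 1]."""
    low, high = min(train), max(train)
    span = (high - low) or 1.0  # avoid division by zero on a flat series
    return (lambda x: (x - low) / span), (lambda y: y * span + low)


def to_supervised(series, lag=3):
    """Turn a series into (inputs, target) pairs: `lag` past values -> next value."""
    return [(series[i:i + lag], series[i + lag]) for i in range(len(series) - lag)]


# Made-up requests-per-second values, for illustration only.
rps = [10.0, 12.0, 15.0, 20.0, 30.0, 45.0, 60.0, 80.0, 90.0, 100.0]

train, test = train_test_split(rps)
scale, inverse = fit_min_max(train)  # fit the scaler on training data only
train_pairs = to_supervised([scale(v) for v in train], lag=3)
```

Fitting the scaler on the training set alone keeps information from the test period out of the model, which makes the later evaluation more honest.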
22. Generate Predictions
- Choose a smaller dataset that represents recent history of your metric
  - for example, 1 hour of data to predict the next 10 minutes
- Split dataset into training/test datasets
- Forecast data on training set to gather state
- Generate predictions, using test set
[Diagram: timeline t-n ... t-2 (Training Set), t-1 ... t2 (Test Set), t3 ... t7 (Predictions)]
Important - the further out predictions are generated, the higher the error rate that must be accounted for. Generally speaking, predictions 15 minutes into the future have a lower margin of error than predictions 60 minutes into the future.
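The growth in error with horizon is easy to see when multi-step forecasts are produced recursively, because each prediction becomes an input to the next one. The sketch below shows that general technique; it may differ from how the talk's model generates predictions. `recursive_forecast` and `mean_of_window` are hypothetical names, and `mean_of_window` is only a stand-in for a trained model's one-step prediction.

```python
def recursive_forecast(history, predict_next, steps, window):
    """Forecast `steps` values by feeding each prediction back in as input."""
    context = list(history[-window:])
    predictions = []
    for _ in range(steps):
        value = predict_next(context)
        predictions.append(value)
        context = context[1:] + [value]  # slide the window forward by one step
    return predictions


def mean_of_window(context):
    """Stand-in for a trained model's one-step prediction (hypothetical)."""
    return sum(context) / len(context)


recent = [10.0, 20.0, 30.0]  # made-up recent history
forecast = recursive_forecast(recent, mean_of_window, steps=2, window=3)
```

After the first step the input window already contains a predicted value rather than an observed one, so any error in that prediction carries into every later step.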
23. Actionable Data! Predicted RPS
time          prediction  actual
t0 (present)  0.0         0.0
t1 (+1min)    0.0
t2 (+2min)    0.0
t3            0.0
t4            0.0
t5            0.0
t6            0.0
…
t15 (+15min)  0.0
What can we conclude about this data?
- Anticipating little to no traffic
What’s actionable about this data?
- Scale down replicas
- Scale down infrastructure
- …
24. Actionable Data! Predicted RPS
time          prediction  actual
t0 (present)  12.5        11.7
t1 (+1min)    10.0
t2 (+2min)    85
t3            150
t4            170
t5            215
t6            250
…
t15 (+15min)  5000
What can we conclude about this data?
- Lots more traffic incoming!
What’s actionable about this data?
- Scale up replicas
- Scale up infrastructure
- …
25. Making Decisions
Example: Stateless Resources (Deployment)
- Define thresholds for your resources
  - RPS below 50, above 2500
- Define minimum and maximum gates for your resources
  - No less than 3, no more than 15 replicas
- Determine how long to wait (sleep) before analyzing the next set of predictions.
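The thresholds and gates above can be expressed as a small decision function. This is a sketch under assumptions, not the talk's implementation: `desired_replicas` is a hypothetical name, acting on the peak of the predicted values and moving one replica at a time are choices made here for illustration, and the call that would apply the result to the cluster is left out.

```python
def desired_replicas(current, predicted_rps, low=50.0, high=2500.0,
                     min_replicas=3, max_replicas=15):
    """Pick a replica count from a list of predicted RPS values.

    Scales up when the predicted peak exceeds `high`, scales down when it
    stays below `low`, and always clamps to the min/max gates.
    """
    peak = max(predicted_rps)
    if peak > high:
        target = current + 1  # assumption: adjust one replica per decision
    elif peak < low:
        target = current - 1
    else:
        target = current
    return max(min_replicas, min(max_replicas, target))


# Predicted RPS values taken from the two example slides.
quiet = [0.0] * 16
busy = [12.5, 10.0, 85, 150, 170, 215, 250, 5000]

print(desired_replicas(5, quiet))  # 4
print(desired_replicas(5, busy))   # 6
print(desired_replicas(3, quiet))  # 3 (held at the minimum gate)
```

A control loop would call this function with each new set of predictions, apply the result, then sleep for the chosen interval before analyzing the next set.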
26. Future Work
- Can we do better?
  - Tuning
  - Multivariate LSTM
  - Explore Reinforcement Learning (Q-learning)
- Forecast other key areas
  - Saturation
  - Cluster Management
How low-touch can we get our Kubernetes clusters?
27. Future Work
CRD - Predictive Auto-Scaler
- Similar to HPA, but with the workflow outlined above
- Ability to interact with k8s resources, infrastructure
kubectl create pas ...
28. Acknowledgements
- Eben Esterhuizen and Schireson Data Science Team!
- Kubernetes, Prometheus, & Envoy Communities
29. THANK YOU