Distributed Tracing in Serverless Systems Nitzan Shapira

1. KubeCon + CloudNativeCon Seattle Distributed Tracing in Serverless Systems Nitzan Shapira, Epsagon
2. > whoami Nitzan Shapira (@nitzanshapira) Software engineer > 12 years Co-Founder, CEO at Epsagon Tel Aviv
3. Things to discuss What is serverless? How is it different? What is observability for serverless? How can distributed tracing help? How will it help my job?
4. What is serverless? [Compute-as-a-Service] FaaS: Function-as-a-Service CaaS: Container-as-a-Service + Managed services (APIs) = Don’t manage infrastructure Focus on business logic
5. Why serverless? Pay-per-use: reduces cloud compute cost by 90% Out-of-the-box auto-scaling DevOps → LowOps ++Developer velocity Server Utilization Focus on business logic – iterate faster
6. The limitations of FaaS Limited memory Limited running time Stateless Cold starts + concurrency limit + some others…
7. The properties of serverless applications Serverless is micro-services Serverless applications are - Highly distributed - Highly event-driven Utilizing managed services via APIs is key
8. A real example – HSBC Source: re:Invent 2018
9. The challenge in serverless Yan Cui SIMPLE COMPLEX
10. What the community thinks 2017 results 2018 Serverless Community Survey, serverless.com, July 2018
11. Observability – why do we need it? Track system health Troubleshoot and fix Optimize performance and cost
12. Observability in serverless Let’s go one by one
13. Track system health System == Functions ?
14. Functions are important - Errors - Timeout - Out-of-memory - Cold start
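A minimal sketch of how two of these function-level signals (errors and cold starts) could be captured from inside a function. The handler signature follows the common FaaS `handler(event, context)` shape; `do_work` and the returned report fields are hypothetical names used only for illustration. Timeouts and out-of-memory kills typically end the process, so they usually have to be detected from outside the function (for example from platform logs) and are not shown here.

```python
import time

# Module-level state survives between invocations of a warm container,
# so this flag is True only for the first invocation after a cold start.
_cold_start = True


def do_work(event):
    # Hypothetical business logic, used only for this sketch.
    if event.get("fail"):
        raise ValueError("bad input")
    return event.get("value", 0) * 2


def handler(event, context):
    """Run the business logic and report function-level signals."""
    global _cold_start
    was_cold = _cold_start
    _cold_start = False

    started = time.monotonic()
    error = None
    result = None
    try:
        result = do_work(event)
    except Exception as exc:
        # A real handler would normally re-raise after recording the error;
        # the sketch swallows it so the report can be inspected.
        error = type(exc).__name__
    duration_ms = (time.monotonic() - started) * 1000.0

    return {
        "result": result,
        "cold_start": was_cold,
        "error": error,
        "duration_ms": duration_ms,
    }
```

Calling `handler` twice in the same process reports `cold_start` as true only on the first call, which mirrors how a warm container reuses module state.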
15. Track system health System > Functions ! Serverless != Functions
16. Track system health System > Functions ! Functions APIs Transactions
17. Troubleshoot and fix Functions are not enough Need: track asynchronous events
18. Transactions
19. Tracing asynchronous invocations
20. Tracing asynchronous invocations
21. Tracing asynchronous invocations
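One common way to trace an asynchronous invocation is to carry a trace identifier inside the message itself, so the consumer can link its work to the producer's. The sketch below assumes a plain Python list as a stand-in for a queue or event bus; the envelope layout and the names `new_trace_context`, `publish`, and `consume` are hypothetical and not taken from the talk.

```python
import json
import uuid


def new_trace_context():
    """Start a new trace at the entry point of a transaction."""
    return {"trace_id": uuid.uuid4().hex}


def publish(queue, payload, ctx, span_id):
    """Wrap the payload in an envelope that carries the trace context."""
    message = {
        "trace": {"trace_id": ctx["trace_id"], "parent_span_id": span_id},
        "payload": payload,
    }
    queue.append(json.dumps(message))


def consume(queue):
    """Read the next message and recover the trace context with the payload."""
    message = json.loads(queue.pop(0))
    return message["trace"], message["payload"]
```

A producer would call `publish(queue, {"order": 1}, ctx, "span-a")`, and the consumer's `consume(queue)` would then return the same `trace_id`, which lets both sides be stitched into one transaction.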
22. Distributed tracing …a trace tells the story of a transaction or workflow as it propagates through a (potentially distributed) system. Distributed tracing is a method used to profile and monitor applications.
23. Distributed tracing Jaeger
24. Implementing distributed tracing Manual tracing/instrumentation Before/after calls At the end of each micro-service High maintenance High potential of errors
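To make the "before/after calls" pattern concrete, here is a minimal sketch of manual instrumentation. It uses an in-memory list as a stand-in for a tracing backend; `start_span`, `finish_span`, `charge_customer`, and `handle_payment` are hypothetical names, not the API of any real tracing library.

```python
import time
import uuid

SPANS = []  # in-memory stand-in for a tracing backend


def start_span(name, trace_id):
    """Record the start of an operation (the 'before' half)."""
    return {
        "span_id": uuid.uuid4().hex,
        "trace_id": trace_id,
        "name": name,
        "start": time.monotonic(),
    }


def finish_span(span, error=None):
    """Record the end of an operation (the 'after' half)."""
    span["duration_ms"] = (time.monotonic() - span["start"]) * 1000.0
    span["error"] = error
    SPANS.append(span)


def charge_customer(amount):
    # Hypothetical downstream call.
    return {"charged": amount}


def handle_payment(trace_id, amount):
    span = start_span("charge_customer", trace_id)  # before the call
    try:
        result = charge_customer(amount)
    except Exception as exc:
        finish_span(span, error=repr(exc))  # after the call, failure path
        raise
    finish_span(span)  # after the call, success path
    return result
```

Every instrumented call needs this bracketing repeated by hand, including the failure path, which is where the maintenance cost and the potential for errors come from.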
25. Serverless apps are very distributed Complex systems have thousands of functions What about the developer velocity?
26. Can it be done differently in serverless?
27. Automation can help to keep up with the development speed of serverless
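One possible form of automation is to wrap client methods once, at startup, so that application code stays free of tracing calls. The sketch below is an assumption about how that could look, not a description of any specific product: `traced`, `instrument`, and `FakeStorageClient` are hypothetical, and spans go to an in-memory list.

```python
import functools
import time

SPANS = []  # in-memory stand-in for a tracing backend


def traced(func):
    """Wrap a callable so every call records a span automatically."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            SPANS.append({
                "name": func.__name__,
                "duration_ms": (time.monotonic() - start) * 1000.0,
                "error": error,
            })
    return wrapper


def instrument(obj, method_names):
    """Patch the named methods of a client object in place."""
    for name in method_names:
        setattr(obj, name, traced(getattr(obj, name)))


class FakeStorageClient:
    """Hypothetical managed-service client used only for this sketch."""

    def put(self, key, value):
        return f"stored {key}"

    def get(self, key):
        return f"value of {key}"
```

After `instrument(client, ["put", "get"])`, existing calls such as `client.put("k", 1)` produce spans without any change at the call sites.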
28. Example
29. Example
30. Monitoring serverless Limited memory Limited running time Stateless Cold starts
31. Time is $$$
32. Where do we spend the most time? Our own code API calls
33. Serverless cost crisis A real-life example $$$$$$$$$$$$
34. Scanning functions Scanning CloudWatch using AWS Lambda Every 5 minutes, save to RDS CloudWatch Spawn (async) Poll A new Lambda is spawned for every customer’s function
35. As time flies… !!!! CloudWatch became highly throttled Requests took too much time 5K concurrent Lambdas, for 5 minutes, timing out, every 5 minutes
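A rough back-of-the-envelope sketch of why this pattern is expensive. The concurrency (5K), run time (5 minutes), and schedule (every 5 minutes) come from the slide; the memory size and the per-GB-second price are assumptions chosen for illustration and should be replaced with the real configuration and current pricing.

```python
def daily_lambda_cost(concurrency, seconds_per_run, runs_per_day,
                      memory_gb, price_per_gb_second):
    """Return (function-seconds, GB-seconds, compute cost) for one day."""
    function_seconds = concurrency * seconds_per_run * runs_per_day
    gb_seconds = function_seconds * memory_gb
    return function_seconds, gb_seconds, gb_seconds * price_per_gb_second


# From the slide: 5K Lambdas, each running 5 minutes, started every 5 minutes.
CONCURRENCY = 5000
SECONDS_PER_RUN = 5 * 60
RUNS_PER_DAY = 24 * 60 // 5  # one batch every 5 minutes

# Assumptions, not from the talk: 128 MB functions and an illustrative price.
ASSUMED_MEMORY_GB = 0.125
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667

function_seconds, gb_seconds, cost = daily_lambda_cost(
    CONCURRENCY, SECONDS_PER_RUN, RUNS_PER_DAY,
    ASSUMED_MEMORY_GB, ASSUMED_PRICE_PER_GB_SECOND,
)
print(f"{function_seconds:,} function-seconds/day, "
      f"{gb_seconds:,.0f} GB-seconds/day, about ${cost:,.2f}/day")
```

Because each batch runs for as long as the interval between batches, the 5K functions are effectively running all day, and almost all of that billed time is spent waiting on a throttled API rather than doing work.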
36. Why you should care about external APIs 702ms
37. Track service health
38. Business flows Subscribe Transfer Payment
39. What should I optimize first?
40. Remember… Serverless + Distributed Tracing = Perfect marriage (but only if you automate)
41. Thank you! nitzan@epsagon.com @nitzanshapira www.epsagon.com