LinkedIn Engineering Manager Siddharth Singh - Storage Infrastructure behind LinkedIn's Recommendations

The data-derived recommendations that power LinkedIn's site consist of a continuous chain of three phases: data collection, processing, and serving. Due to the dynamic nature of the professional graph, both the data collection and processing methods have evolved to quickly incorporate new signals into the derived features. This evolution has created a need for specialized storage infrastructure that can support this dynamic cycle. This presentation will uncover the technology behind the systems that specialize in serving recommendation datasets at LinkedIn, and its evolution from the early days of serving only batch-derived data (Project Voldemort) to its current novel way of serving derived data computed from any source, batch or nearline (Project Venice). The presentation will discuss architectural choices and tradeoffs in building these systems and share some thoughts on the future direction of this space.

1. Storage Infrastructure behind LinkedIn’s Recommendations. Monday, April 17th, 2017. By Siddharth Singh, Engineering Manager, Storage Infrastructure.
3. Agenda
▪ What is LinkedIn? – High Level Web Architecture
▪ What is Primary and Derived Data? – Recommendation Data Lifecycle
▪ Derived Data Serving
  – Voldemort Read Only (RO): Architecture and Key Details
  – Lambda Architecture at LinkedIn
  – Beyond Lambda
  – Venice: Architecture and Key Details
▪ Challenges & how we solved them
▪ Early Wins and Future Prospects
▪ Q&A
4. LinkedIn - World’s Largest Professional Network
▪ 484M members
▪ >2 new members per second
▪ 107M monthly unique visitors
5. High Level Web Architecture
[Diagram: web traffic flows through Front Ends to Mid Tiers, which read from Primary Databases and Derived Databases.]
8. Recommendation Data Lifecycle
[Diagram: Front Ends and Mid Tiers emit logs that are ETL'd into Hadoop; Primary Databases feed Hadoop via database snapshots and a change capture queue; a Job Scheduler launches data crunching jobs and triggers derived data refreshes; results are bulk loaded into the Derived Databases that serve the web tier.]
9. Derived Data Serving
10. Voldemort
▪ Distributed key-value store
  – Consistent hashing
  – Partitions
▪ Shared nothing
▪ Pluggable architecture
  – Storage engine: Read-Only or Read-Write
  – Serialization (Avro, JSON, etc.)
  – Local or global
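As a rough illustration of the consistent-hashing idea on this slide, here is a minimal ring in Java. This is a generic sketch, not Voldemort's actual implementation (Voldemort hashes keys to a fixed set of partitions that are then assigned to nodes); class and method names are illustrative.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring: each node claims several virtual
// points on the ring, and a key is served by the first node at or
// after its hash position, wrapping around at the end of the ring.
public class ConsistentHashRing {
    private static final int VIRTUAL_POINTS = 64; // illustrative value
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // Virtual points smooth out load when nodes join or leave.
        for (int i = 0; i < VIRTUAL_POINTS; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public String nodeFor(String key) {
        int h = hash(key);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        // Wrap to the first point on the ring if we ran off the end.
        int point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(point);
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // keep the position non-negative
    }
}
```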
11. Voldemort Read Only (RO) – Build and Push
▪ Scalable offline index construction and data partitioning using MapReduce on Hadoop (Build phase)
▪ The complete immutable data set is fetched, bulk loaded, and swapped in for online serving from Voldemort RO (Push phase)
▪ The data set is versioned; one older version is kept for quick rollback
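To make the "swap with quick rollback" step concrete, here is a hypothetical sketch of how a versioned swap could work: each push lands in its own directory, a "current" symlink is atomically repointed, and only the previous version is retained. The directory layout and class are assumptions for illustration, not Voldemort's actual code.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical swap step: new immutable data lives in version-N, the
// "current" symlink flips atomically, and versions older than N-1
// are garbage-collected so one rollback target always remains.
public class VersionSwapper {
    private final Path storeDir; // e.g. /data/stores/my-store (illustrative)

    public VersionSwapper(Path storeDir) {
        this.storeDir = storeDir;
    }

    public void swapTo(long newVersion) throws IOException {
        Path target = storeDir.resolve("version-" + newVersion);
        Path current = storeDir.resolve("current");
        Path tmp = storeDir.resolve("current.tmp");
        // Build the new link beside the old one, then atomically move
        // it into place so readers never observe a missing link.
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, target);
        Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE);
        // Remove everything older than version N-1 (the rollback copy).
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(storeDir, "version-*")) {
            for (Path p : ds) {
                long v = Long.parseLong(p.getFileName().toString()
                        .substring("version-".length()));
                if (v < newVersion - 1) {
                    deleteRecursively(p);
                }
            }
        }
    }

    private static void deleteRecursively(Path p) throws IOException {
        try (Stream<Path> walk = Files.walk(p)) {
            // Delete children before parents.
            walk.sorted(Comparator.reverseOrder())
                .forEach(q -> q.toFile().delete());
        }
    }
}
```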
12. Voldemort RO – Key Details
▪ General methodology
  – Index lookup in memory
  – Data lookup as a single SSD seek
  – RO client-side latency: p99 < 1.5ms
▪ Read-Only custom storage engine
  – Pairs of index/data files
  – Index mmapped and mlocked
  – Checksum of checksums for data integrity
  – Index files fetched after data files to take advantage of the OS page cache
▪ 650 stores, 100TB+ of data moved between Hadoop and Voldemort daily
▪ Architectural limitations:
  – Tightly coupled with Hadoop
  – No support for incremental pushes
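The following Java sketch illustrates the read path described above: a sorted, fixed-width index of (key hash, data offset) entries is mmapped and binary-searched in memory, and the value costs one positional read from the data file. The 16-byte entry layout and length-prefixed values are assumptions for illustration; mlocking the index requires native calls and is omitted here.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative read-only store reader: in-memory index lookup, then a
// single positional read (ideally one SSD seek) for the value.
public class ReadOnlyStoreReader {
    private static final int ENTRY_SIZE = 16; // 8-byte hash + 8-byte offset
    private final MappedByteBuffer index;
    private final FileChannel data;
    private final int entryCount;

    public ReadOnlyStoreReader(String indexPath, String dataPath)
            throws IOException {
        FileChannel idx = new RandomAccessFile(indexPath, "r").getChannel();
        this.index = idx.map(FileChannel.MapMode.READ_ONLY, 0, idx.size());
        this.entryCount = (int) (idx.size() / ENTRY_SIZE);
        this.data = new RandomAccessFile(dataPath, "r").getChannel();
    }

    public byte[] get(long keyHash) throws IOException {
        int lo = 0, hi = entryCount - 1;
        while (lo <= hi) {                       // in-memory binary search
            int mid = (lo + hi) >>> 1;
            long h = index.getLong(mid * ENTRY_SIZE);
            if (h < keyHash)      lo = mid + 1;
            else if (h > keyHash) hi = mid - 1;
            else {
                long offset = index.getLong(mid * ENTRY_SIZE + 8);
                ByteBuffer len = ByteBuffer.allocate(4);
                data.read(len, offset);          // value is length-prefixed
                ByteBuffer value = ByteBuffer.allocate(len.getInt(0));
                data.read(value, offset + 4);    // the single data seek
                return value.array();
            }
        }
        return null; // key not present
    }
}
```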
13. Recommendation Data Lifecycle (revisited)
[Same diagram as slide 8, shown again to place derived data serving within the lifecycle.]
14. Lambda Architecture @ LinkedIn
[Diagram: a batch layer precomputes views with MapReduce and loads them into a Read Only store via Voldemort Build and Push; a speed layer computes incremental views with Samza into a Read Write store; apps create a merged view across the two serving-layer stores.]
● Two different paths for (a) data ingestion and (b) reads from apps.
● Difficult to operate/scale two data pipelines.
● Apps need to deal with two disparate systems.
15. Beyond Lambda
[Diagram: both the batch layer (MapReduce, via the Hadoop to Venice Bridge) and the speed layer (incremental views in Samza) feed a Kafka ingestion layer, which populates the Venice serving layer; apps get access to a pre-merged view from Venice.]
● One path for both data ingestion and reads.
● One system to maintain, scale, and interface with.
16. Venice
● An asynchronous derived data serving platform which provides:
  ○ High throughput ingestion from processing platforms like Hadoop, Samza, etc.
  ○ Low latency key/value lookups
● A unified solution for serving derived data:
  ○ Handles both batch and stream processing cases
  ○ Easy to operate
  ○ Supports both Lambda and Kappa architectures
17. Venice – High Level Architecture
18. Store-version
[Diagram: a Venice cluster hosting Store A (versions 1-2), Store B (versions 7-8), and Store C (versions 24-26), with partitions P1-P4 and their replicas P1-R to P4-R spread across nodes.]
▪ A cluster hosts multiple stores, and a store can have many versions.
▪ Each store-version is partitioned and is populated by its own Kafka topic.
▪ Partitions have replicas; partitions and replicas are distributed across the nodes of a cluster.
▪ Versioning provides rollback/forward capability. Typically, only one previous version is kept.
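A small sketch of the store-version idea: tying each version to its own Kafka topic means a failed push can be discarded wholesale. The `storeName_v<version>` naming here is an assumption for illustration; Venice's actual naming may differ.

```java
// Hypothetical mapping from a store-version to its dedicated Kafka
// topic; the naming scheme is illustrative, not Venice's real one.
public final class StoreVersion {
    final String storeName;
    final int version;

    StoreVersion(String storeName, int version) {
        this.storeName = storeName;
        this.version = version;
    }

    // Because each store-version has its own topic, a bad push is
    // abandoned by deleting the topic and store-version together,
    // while the previous version keeps serving reads untouched.
    String kafkaTopic() {
        return storeName + "_v" + version;
    }
}
```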
19. Venice – Three-Trick Pony
20. Batch Support
1. The Hadoop to Venice Bridge asks the Venice Controller: push to store X with schema Y?
2. The controller provisions store-version V+1.
3. The controller replies: yes, push to Kafka topic K with P partitions.
4. The bridge pushes data to Kafka topic K.
5. Venice Storage Nodes consume from the topic.
6. Storage nodes keep sending ingestion status.
7. The bridge polls for ingestion completion.
8. The controller activates version V+1 at the Venice Router.
9. Ingest complete; the job is marked a success.
10. The controller deletes store-version V-1.
Read requests hit the active store-version.
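Below is a hypothetical sketch of the bridge's side of this handshake (steps 1-4 and 7). `ControllerClient`, `PushTarget`, and the status strings are illustrative stand-ins, not Venice's real API.

```java
// Hypothetical push-job skeleton for the batch workflow above.
public class BatchPushJob {
    interface ControllerClient {
        /** Steps 1-3: request a new store-version; get topic + partition count. */
        PushTarget requestPush(String store, String valueSchema);
        /** Step 7: poll how far the storage nodes have ingested. */
        String pollStatus(String store, int version); // e.g. "STARTED", "COMPLETED"
    }

    record PushTarget(int version, String kafkaTopic, int partitions) {}

    void run(ControllerClient controller, String store, String schema)
            throws InterruptedException {
        PushTarget t = controller.requestPush(store, schema);
        writeToKafka(t.kafkaTopic(), t.partitions());  // step 4: job produces data
        while (!"COMPLETED".equals(controller.pollStatus(store, t.version()))) {
            Thread.sleep(10_000);                      // step 7: poll until done
        }
        // Steps 8-10 (activate V+1, retire V-1) happen controller-side.
    }

    void writeToKafka(String topic, int partitions) {
        // Elided: the Hadoop job produces partitioned key/value records here.
    }
}
```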
21. Nearline Support
1. Samza (algorithm version V) initializes the stream with the Venice Controller.
2. The controller provisions store-version V.
3. Samza pushes data to Kafka topic K with P partitions.
4. Venice Storage Nodes consume the topic into store-version V, served via the Venice Router.
▪ The workflow is similar to that of batch support; Venice views batch and streaming systems as the same.
▪ The Samza stream is consumed by Venice storage nodes and written to a versioned store, with quotas/throttling so ingestion does not affect live queries.
▪ A new algorithm is processed and stored in a new store.
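A minimal sketch of the nearline write path, assuming a plain Kafka producer stands in for the stream job's output stage (the topic name and configs are illustrative). Venice storage nodes then consume these records exactly as they would a batch push.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal nearline writer: emit keyed updates into the store-version's
// Kafka topic for Venice storage nodes to consume.
public class NearlineWriter implements AutoCloseable {
    private final KafkaProducer<byte[], byte[]> producer;
    private final String topic;

    public NearlineWriter(String brokers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    public void write(byte[] key, byte[] value) {
        // Keyed sends route all updates for a key to one partition,
        // preserving per-key ordering for the consuming storage nodes.
        producer.send(new ProducerRecord<>(topic, key, value));
    }

    @Override public void close() { producer.close(); }
}
```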
22. Hybrid Support (Batch + Nearline)
● Steady state – in between bulk loads:
[Diagram: the nearline job (Samza) writes to real-time Kafka topic K, which is mirrored into Venice Kafka topic K7; together with batch version V7 (Hadoop), it populates Venice Store V7, whose merged view is served through the Venice Router.]
23. Hybrid Support (Batch + Nearline)
1. Offline bulk load into a new store-version.
2. When the offline bulk load finishes, start the buffer replay.
3. Once the replay has caught up, the router switches to the new store-version.
[Diagram: while batch version V8 (Hadoop) loads, real-time topic K keeps mirroring into both Venice Kafka topics K7 and K8; Store V7 keeps serving its merged view until Store V8's merged view catches up and the Venice Router switches over.]
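One plausible way to express the switch condition in step 3: the router flips to the new store-version only once consumer offset lag on the new version's topic has drained to zero. `TopicLagProvider` and the method names are illustrative assumptions, not a real Venice interface.

```java
// Hypothetical gate for step 3 of the hybrid workflow above.
public class VersionSwitchGate {
    interface TopicLagProvider {
        /** Total consumer offset lag summed over a topic's partitions. */
        long totalLag(String topic);
    }

    boolean readyToSwitch(TopicLagProvider lags, String newVersionTopic) {
        // Until the mirrored real-time writes are fully replayed into
        // the new store-version, reads must stay on the old one.
        return lags.totalLag(newVersionTopic) == 0;
    }
}
```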
24. Challenges
▪ Challenges in building Venice:
  – High throughput ingest consumption
  – Data guarantees
  – Dynamic topic lifecycle management (creation/deletion)
  – Low read latency
25. Dataset Partitioning & Ingestion
▪ Each store-version is partitioned.
▪ 1-to-1 mapping between Venice and Kafka logical partitions.
▪ Each dataset version has its own Kafka topic.
▪ The controller decides partition assignment and tells the storage nodes.
▪ On a storage node, two separate thread pools: (a) to pull data out of Kafka, and (b) to process it and write to the storage engine (see the sketch below).
▪ The old store dataset and its corresponding topic are cleaned up after the new version is swapped in and being served.
[Diagram: a scenario with four machines, a dataset with four partitions, and a replication factor of 3.]
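A sketch of the two-pool design from the list above: one pool only pulls records out of Kafka, another deserializes and writes them to the local storage engine, decoupled by a bounded queue so a slow disk throttles consumption rather than dropping data. All names, pool sizes, and the queue bound are illustrative assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative storage-node ingestion pipeline with separate pools
// for Kafka polling and storage-engine writes.
public class IngestionPipeline {
    interface RecordSource { byte[] poll(); }          // stand-in for a Kafka consumer
    interface StorageEngine { void put(byte[] record); }

    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService pollers = Executors.newFixedThreadPool(2);
    private final ExecutorService writers = Executors.newFixedThreadPool(4);

    public void start(RecordSource source, StorageEngine engine) {
        pollers.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] r = source.poll();
                if (r != null) queue.put(r);           // blocks when writers lag
            }
            return null;
        });
        writers.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                engine.put(queue.take());              // process + persist
            }
            return null;
        });
    }
}
```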
26. Dataset Validation – Handling Missing and Duplicate Data
1. Before producing to any given partition, a producer sends a control message in order to uniquely identify itself before producing regular messages.
2. Then, on each produced message, the producer includes a sequence number in the message's metadata. There is a distinct sequence number for each partition, and it is incremented by one for each new message.
3. The consumer keeps track of the last sequence number seen for each unique producer/partition combination.
4. Gaps in the sequence signal missing data (see the sketch after this list).
5. Duplicates can be safely ignored.
6. Checksum computation signals corrupt data.
27. Dataset Validation (continued)
7. For the hybrid case, a configurable log compaction point ensures the most recent data is never compacted. The storage node is lenient when ingesting records beyond a certain threshold.
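The gap/duplicate detection from steps 3-5 can be sketched as follows: the consumer remembers the last sequence number per (producer, partition) pair and classifies each incoming message. Class and method names are illustrative, not Venice's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sequence-number validator for steps 3-5 above.
public class SequenceValidator {
    enum Verdict { OK, DUPLICATE, MISSING_DATA }

    // Last sequence number seen per producer GUID + partition.
    private final Map<String, Long> lastSeen = new HashMap<>();

    public Verdict check(String producerGuid, int partition, long sequence) {
        String key = producerGuid + "/" + partition;
        Long last = lastSeen.get(key);
        if (last == null || sequence == last + 1) {
            lastSeen.put(key, sequence);
            return Verdict.OK;             // first message, or exactly the next one
        }
        if (sequence <= last) {
            return Verdict.DUPLICATE;      // already processed; safe to ignore
        }
        return Verdict.MISSING_DATA;       // a gap: data was lost upstream
    }
}
```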
28. Early Wins and Future Prospects
▪ Early wins!
  – Venice data ingest pipeline is ~25% faster than Voldemort (further speedup expected through the year)
  – Read latency comparable to that of Voldemort (p99 ~4-5 ms)
  – Ease of operability: cluster maintenance, expansion, etc. are much easier
▪ Some thoughts on what might be next:
  – Priority topic ingestion
  – Self-throttling mechanism
  – Auto-rewind capabilities based on offset lag
  – Limited server-side transforms (maybe?)
29. Questions? (We’re hiring!)