Global Architect Summit (ArchSummit) 2018


1. Building a dynamic and responsive Pinterest Bo Liu (刘波) Pinterest Engineering Manager
3. Bo Liu Engineering Manager • An engineering manager with deep knowledge of large-scale online distributed systems • His team is responsible for building high-throughput and low-latency online backend systems for Pinterest • Authored and open sourced the Pinterest C++ core library Rocksplicator • Worked on Facebook's graph database TAO • Previously a lecturer at HUST
4. • Pinterest Products (5 mins) • The Evolution of Pinterest Architecture (10 mins) • Real-time Data Replication for RocksDB (10 mins) • Automated Cluster Management and Recovery for Stateful Services (10 mins) • Unified ML Serving Platform (10 mins)
5. Pinterest Mission Our mission at Pinterest is to help people discover and do what they love
6. Boards
7. Topics
8. Following Feed
9. Home Feed
10. Related Pins
11. Some Pinterest Numbers • > 1,500 employees • > 250M MAU • 50% of MAUs are from the USA • > 175 billion pins
12. Some Pinterest Numbers • > 3 billion boards • 81% of Pinterest users are female • 50% of Pinterest users earn > $50K yearly • 87% of Pinterest users have purchased a product because of Pinterest
13. • Pinterest Products (4 mins) • The Evolution of Pinterest Architecture (10 mins) • Real-time Data Replication for RocksDB (10 mins) • Automated Cluster Management and Recovery for Stateful Services (10 mins) • Unified ML Serving Platform (10 mins)
14. Pinterest in 2015 The majority of content on Pinterest was pre-generated for users before they logged in; it was stored statically in HBase and served directly when they entered the service.
15. Example architecture (Following Feed 2015)
16. Weak Points in the 2015 Architecture • Hard to tweak or experiment with new ideas/models on different components in the system • Features used to rank content could be weeks old (could not leverage the latest pin/user/board data, not to mention real-time user actions) • Unnecessarily large HBase storage consumption (users who never return, a large number of concurrent experiments)
17. We really wanted to go fully online!
18. Requirements to Go Fully Online 1. Relationship data (following graph, owner-to-board, board-to-pin and topic-to-pin mappings) needs to be stored in a way suitable for real-time updates and low-latency queries (multiple request round trips, big fanout) 2. Filtering and lightweight scoring need to happen close to storage at the retrieval stage 3. A low-latency ML serving platform
19. Technical Decisions to solve the problems 1. Adopted C++, FBThrift and Folly • Lower long-tail latency (big fanout requests) • Finer control of system performance (some services are CPU intensive) • Shared libraries (a single repo across the company)
20. Technical Decisions to solve the problems 2. Built distributed stateful services from scratch • Customized data models and indexes • Complicated filtering/lightweight scoring close to storage • Full control of every component in the system (operate it at scale with confidence, easy to adapt to new feature requests)
21. Technical Decisions to solve the problems 3. Adopted RocksDB as storage engine • Widely used • Optimized for server load
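To make the storage-engine choice concrete, below is a minimal sketch of using RocksDB as an embedded library from C++. The options and paths are placeholders; this only illustrates the open/put/get API, not Pinterest's actual configuration.

```cpp
// Minimal sketch of RocksDB as an embedded storage engine (illustration only;
// options and paths are placeholders, not Pinterest's production settings).
#include <cassert>
#include <string>

#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;   // create the DB on first open
  options.IncreaseParallelism();      // tune background threads for server load

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example_db", &db);
  assert(s.ok());

  // Write and read back a key-value pair.
  s = db->Put(rocksdb::WriteOptions(), "pin:123", "board:456");
  assert(s.ok());

  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "pin:123", &value);
  assert(s.ok() && value == "board:456");

  delete db;
  return 0;
}
```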
22. Rocksplicator • Open sourced in 2016; we keep improving it and adding new features https://github.com/pinterest/rocksplicator
23. Major Problems Solved by Rocksplicator • Real-time Data Replication for RocksDB • Automated Cluster Management and Recovery for Stateful Services • Resilient request router • Many other small libraries and tools for productionizing C++ services
24. Common Architecture of Rocksplicator-Powered Systems (diagram) Each node exposes an Application API and an Admin API. Application Logic calls GetDB() and reads/writes the local RocksDB instances; Admin Logic, driven by a cluster-management admin tool/system, creates/opens RocksDB instances and adds/removes them for replication. The RocksDB Replicator picks up local updates and exchanges remote updates with other nodes via data replication.
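As a rough illustration of the GetDB()/admin split in the diagram, here is a hypothetical C++ manager sketch. The class and method names (DBManager, GetDB, AddDB) are illustrative and are not Rocksplicator's actual interfaces; only the RocksDB calls are real.

```cpp
// Hypothetical sketch of the application/admin split shown in the diagram.
// DBManager, GetDB and AddDB are illustrative names, not Rocksplicator's API.
#include <map>
#include <memory>
#include <mutex>
#include <string>

#include "rocksdb/db.h"

class DBManager {
 public:
  // Application logic: look up an already-opened RocksDB instance by name.
  std::shared_ptr<rocksdb::DB> GetDB(const std::string& db_name) {
    std::lock_guard<std::mutex> g(mutex_);
    auto it = dbs_.find(db_name);
    if (it == dbs_.end()) {
      return nullptr;
    }
    return it->second;
  }

  // Admin logic: create/open a DB so it can then be registered for replication.
  bool AddDB(const std::string& db_name, const std::string& path) {
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::DB* raw = nullptr;
    if (!rocksdb::DB::Open(options, path, &raw).ok()) {
      return false;
    }
    std::lock_guard<std::mutex> g(mutex_);
    dbs_[db_name] = std::shared_ptr<rocksdb::DB>(raw);
    // A real implementation would also hand the DB to the replicator here.
    return true;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::shared_ptr<rocksdb::DB>> dbs_;
};
```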
25. Example architecture (Following Feed 2018)
26. • Pinterest Products (5 mins) • The Evolution of Pinterest Architecture (10 mins) • Real-time Data Replication for RocksDB (10 mins) • Automated Cluster Management and Recovery for Stateful Services (10 mins) • Unified ML Serving Platform (10 mins)
27. RocksDB • An embedded storage engine library • No built-in replication support
28. Replication Design Decisions • Master-Slave replication • Replicating multiple RocksDB instances in one process • Master/Slave role is assigned at RocksDB instance level • Low replication latency
29. Replication Implementation • Use RocksDB WAL sequence # as global replication sequence # • FBThrift for RPC
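The two RocksDB calls below are what make the WAL sequence number usable as the replication sequence number: GetLatestSequenceNumber() on the slave side and GetUpdatesSince() on the master side. Both are real RocksDB APIs; how they are wired into the Thrift service is sketched in the workflow sections that follow.

```cpp
// RocksDB exposes the WAL sequence number directly, which lets it double as
// the global replication sequence number (real RocksDB APIs; everything
// around them in the workflows that follow is service-specific).
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/transaction_log.h"

// Slave side: the latest sequence number applied locally, reported upstream.
rocksdb::SequenceNumber LatestLocalSeq(rocksdb::DB* db) {
  return db->GetLatestSequenceNumber();
}

// Master side: an iterator over WAL write batches at/after `seq`,
// used to ship updates to a slave.
rocksdb::Status WalUpdatesSince(
    rocksdb::DB* db, rocksdb::SequenceNumber seq,
    std::unique_ptr<rocksdb::TransactionLogIterator>* iter) {
  return db->GetUpdatesSince(seq, iter);
}
```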
30–33. Slave Side Workflow (diagram, built up across four slides) A single process hosts multiple RocksDB instances behind a Thrift server with worker threads; in the diagram DB1 plays the Master role and DB2 the Slave role, with its upstream ip_port recorded. For DB2, a worker thread reads the latest local SEQ #, sends a "get updates since SEQ# for DB2" request to the upstream, receives the updates since that SEQ#, and applies them to the local DB2.
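A hypothetical sketch of the slave-side loop described above: the client stub and reply struct are illustrative stand-ins for the FBThrift-generated code, while GetLatestSequenceNumber, WriteBatch and Write are real RocksDB APIs.

```cpp
// Hypothetical slave-side pull-and-apply loop. ReplicatorClient and
// GetUpdatesReply are stand-ins for the generated FBThrift code; the RocksDB
// calls are real APIs.
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

struct GetUpdatesReply {                    // stand-in for a Thrift struct
  std::vector<std::string> write_batches;   // serialized WriteBatch reps
};

struct ReplicatorClient {                   // stand-in for an FBThrift stub
  GetUpdatesReply GetUpdatesSince(const std::string& /*db_name*/,
                                  rocksdb::SequenceNumber /*seq*/) {
    // A real client would issue the RPC to the upstream master here.
    return {};
  }
};

void PullAndApplyLoop(rocksdb::DB* db, ReplicatorClient* client,
                      const std::string& db_name) {
  while (true) {
    // Latest sequence number applied locally; doubles as the replication
    // sequence number reported to the upstream master.
    rocksdb::SequenceNumber seq = db->GetLatestSequenceNumber();

    // Pull: ask the upstream master for updates after `seq` for this DB.
    GetUpdatesReply reply = client->GetUpdatesSince(db_name, seq);

    // Apply each serialized write batch locally; this advances the local
    // WAL sequence number used in the next round.
    for (const std::string& rep : reply.write_batches) {
      rocksdb::WriteBatch batch(rep);
      rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
      if (!s.ok()) {
        break;  // a real implementation would log and retry
      }
    }
  }
}
```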
34. Replication Implementation • A combination of pull & push based replication
35–45. Master Side Workflow (diagram, built up across the slides) The same process layout (DB1 Master, DB2 Slave with upstream ip_port, Thrift server, worker threads) seen from the master side. A downstream slave sends a "get updates since SEQ# for DB1" request; a worker thread asks DB1 whether it has updates since that SEQ#. If it does, DB1 hands back the data and the worker thread returns it in the response right away (the pull path). If it does not, the request waits for DB1's notification; when new writes land on DB1, the new updates are passed to the waiting worker thread and returned in the response (the push path).
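A hypothetical sketch of the master-side handler combining the pull and push paths described above. The waiting/notification mechanism shown here (a condition variable signaled by the local write path) is illustrative; GetUpdatesSince, TransactionLogIterator and BatchResult are real RocksDB APIs.

```cpp
// Hypothetical master-side handler for "get updates since SEQ#": serve
// updates immediately if the WAL has them (pull), otherwise wait until the
// local write path signals new data (push). Only the RocksDB calls are real.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/transaction_log.h"

std::mutex g_mutex;
std::condition_variable g_new_write_cv;  // the local write path calls notify_all()

std::vector<std::string> HandleGetUpdatesSince(rocksdb::DB* db,
                                               rocksdb::SequenceNumber seq) {
  std::vector<std::string> batches;

  // Push path: if nothing newer than `seq` exists yet, park the request until
  // a local write arrives instead of returning empty-handed.
  {
    std::unique_lock<std::mutex> lock(g_mutex);
    g_new_write_cv.wait(lock, [&] {
      return db->GetLatestSequenceNumber() > seq;
    });
  }

  // Pull path: read the write batches after `seq` from the WAL and return
  // their serialized form to the slave.
  std::unique_ptr<rocksdb::TransactionLogIterator> iter;
  if (db->GetUpdatesSince(seq, &iter).ok()) {
    for (; iter->Valid(); iter->Next()) {
      rocksdb::BatchResult result = iter->GetBatch();
      if (result.sequence > seq) {
        batches.push_back(result.writeBatchPtr->Data());
      }
    }
  }
  return batches;
}
```

A production implementation would avoid blocking a worker thread while waiting; this sketch only shows the pull-then-wait-for-push control flow from the diagram.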
46. Replication Performance in Production • > 50 MB/s per host (AWS i3.2xlarge instance) • < 20 ms average max replication delay
47. • Pinterest Products (5 mins) • The Evolution of Pinterest Architecture (10 mins) • Real-time Data Replication for RocksDB (10 mins) • Automated Cluster Management and Recovery for Stateful Services (10 mins) • Unified ML Serving Platform (10 mins)
48. Manual Cluster Management (2017) (diagram) The admin tool generates the cluster config and stores it in ZooKeeper; each node loads the config when it starts. Otherwise the node follows the common architecture: Application API / Application Logic calling GetDB() and reading/writing local RocksDB instances, Admin API / Admin Logic creating/opening DBs and adding/removing them for replication, with the RocksDB Replicator handling local and remote updates via data replication.
49. Problems with Manual Cluster Management • Error prone • It became a big burden for the team • Mean Time To Repair (MTTR) was not ideal
50. Apache Helix “A cluster management framework for partitioned and replicated distributed resources”
51. Apache Helix • Helix decides partition mapping and Master/Slave roles • Global constraints are enforced (no more than 1 Master per partition, minimum replica #, etc.)
52. Helix + Rocksplicator
53. Automated Cluster Management (2018) (diagram) The Helix Controller takes admin commands for cluster management and issues state transition commands through ZooKeeper; a Helix Participant embedded in each node receives the state transition commands and drives the Admin Logic, which creates/opens RocksDB instances and adds/removes them for replication. Application Logic still calls GetDB() and reads/writes the local RocksDB instances, and the RocksDB Replicator handles local and remote updates via data replication.
54. Implement State Transition Commands • Offline => Slave: how to differentiate between a rebalance and a service restart • How to reliably set up the replication graph: multiple replicas could perform state transitions concurrently • More details can be found in our blog post: http://t.cn/EyKMoTL
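To make the first bullet concrete, here is a purely hypothetical sketch of an Offline => Slave transition handler. Every name in it is illustrative (not the real Rocksplicator/Helix interfaces); it only encodes the restart-vs-rebalance distinction and the upstream hookup mentioned on this slide.

```cpp
// Purely hypothetical Offline => Slave transition handler; all names are
// illustrative, not the real Rocksplicator/Helix interfaces. It encodes the
// two concerns on this slide: restart vs. rebalance, and hooking the new
// Slave into the replication graph.
#include <iostream>
#include <string>

// Stand-ins for the real admin/replication calls (no-op bodies for the sketch).
bool LocalDBExists(const std::string&) { return false; }
void OpenLocalDB(const std::string& db) { std::cout << "open " << db << "\n"; }
void RestoreFromBackupOrPeer(const std::string& db) {
  std::cout << "restore " << db << "\n";
}
void StartReplicationFrom(const std::string& db, const std::string& upstream) {
  std::cout << "replicate " << db << " from " << upstream << "\n";
}

void OnOfflineToSlave(const std::string& db_name,
                      const std::string& upstream_host) {
  if (LocalDBExists(db_name)) {
    // Service restart: the shard's data is still on disk, so reopen it and
    // let replication catch up from the current WAL sequence number.
    OpenLocalDB(db_name);
  } else {
    // Rebalance: this host was newly assigned the shard, so bootstrap a full
    // copy before joining the replication graph.
    RestoreFromBackupOrPeer(db_name);
  }
  // Point this replica at its upstream. Other replicas may be transitioning
  // concurrently, which is why setting up the replication graph reliably is
  // the tricky part covered in the linked blog post.
  StartReplicationFrom(db_name, upstream_host);
}

int main() { OnOfflineToSlave("db2", "10.0.0.1:9090"); }
```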
55. • Pinterest Products (5 mins) • The Evolution of Pinterest Architecture (10 mins) • Real-time Data Replication for RocksDB (10 mins) • Automated Cluster Management and Recovery for Stateful Services (10 mins) • Unified ML Serving Platform (10 mins)
56. ML at Pinterest
57. Unified ML Platform Scorpion is the Pinterest ML Platform, a fully integrated solution for scoring pins
58. Common Scoring and Logging Flow
59. Scorpion Serving
60. Raw Data • Provided by client teams • Structured data • Encoded in Thrift or FlatBuffers
61. Featurization
62. Linchpin DSL Feature extraction and model scoring language Input: Raw Data Transforms: • Matching • Field extraction • Arbitrary computation (UDFs)
63. Linchpin DSL Outputs: • Features for model • Model score
64. Pynchpin A Python interface for model development Benefits: • Easy to write new models • Easy model debugging and iterative development • Allows code sharing between similar models Training workflows call the Pynchpin API to generate Linchpin DSL
65. Model Management Models are deployed via the general Pinterest deploy system (easy to track model versions)
66. Scorpion in Production • 4,000 serving hosts • Hundreds of models • Continuous model deployment • Scores tens of millions of items per second • < 20 ms P99 latency to score a batch of 1K items
