1. Distributed Coordination via ZooKeeper Flavio Junqueira Yahoo! Research, Barcelona Hadoop in China 2011
2. A bit of history: June 2007: early adopters: Message Broker, Crawler. Oct 2007: SourceForge. June 2008: move to Apache, subproject of Hadoop. Nov 2010: top-level project
3. Apache community growth (Chart: number of user-list subscribers over time, growing from near zero in late 2008 to roughly 600 by September 2011)
4-9. Widely used: Jun Rao, LinkedIn; Kannan Muthukkaruppan, Facebook; Arun Murty, NextGen Hadoop; Sebastian Stadil, Scalr; Dave Beckett, Digg.com
10. Where we are coming from...
11-12. Yahoo! Portal: Search, E-mail, Finance, Weather, News
13-19. Yahoo!: Workload generated. Yahoo! Homepage: 4.4 billion page views a month. Yahoo! Search: 2.7 billion queries a month. Yahoo! News: 88 million users in the US and 256 million users globally
20. Yahoo! Infrastructure • Lots of servers • Lots of processes • High volumes of data • Highly complex systems • ... and developers are mere mortals (Photo: Yahoo! Lockport Data Center)
21. Photo by amusingplanet via Flickr
22. Photo copyright by Peter E. Lee via Flickr
23-24. Photo copyright by Shamus O'Reilly via Flickr
25-26. ... and in computer systems? • Locks • Queues • Leader election • Group membership • Barriers • Configuration — all require knowledge of complex protocols
27. A simple model: work assignment ✓ Master assigns work ✓ Worker executes task assigned by master (Diagram: one master, four workers)
28. Master crashes ✓ Single point of failure ✓ No work assigned ✓ Need to select a new master
29. Worker crashes ✓ Not as bad... the overall system still works ✓ Some tasks won't be executed ✓ Need the ability to reassign tasks
30. Worker does not receive assignment ✓ Same problem, tasks don't get executed ✓ Need to guarantee that the worker receives its assignment
31. Fault-tolerant distributed system: what kinds of problems can we have here? (Diagram: primary master, backup master, four workers)
32-33. Fault-tolerant distributed system (Diagram: a coordination service added alongside the primary master, backup master, and workers)
34. Fully distributed (Diagram: four workers coordinating directly through the coordination service, with no master)
35. Fallacies of distributed computing 1. The network is reliable 2. Latency is zero 3. Bandwidth is infinite 4. The network is secure 5. Topology doesn't change 6. There is one administrator 7. Transport cost is zero 8. The network is homogeneous — Peter Deutsch, http://blogs.sun.com/jag/resource/Fallacies.html
36. One more fallacy 9. You know who is alive
37. Why is it difficult? • FLP impossibility result ✓ Asynchronous systems ✓ Impossible if a single process can crash (Fischer, Lynch, Paterson, ACM PODS, 1983) • According to Herlihy, we do need consensus ✓ Wait-free synchronization ✓ Wait-free: operations complete in a finite number of steps ✓ Equivalent to solving consensus for n processes (Herlihy, ACM TOPLAS, 1991)
38. Why is it difficult? • CAP principle ✓ Can't have all three of availability, consistency, and partition tolerance at once
39. The case for a coordination service • Many fallacies to stumble upon • Many impossibility results • Several common requirements across applications ✓ Duplicating is bad ✓ Duplicating poorly is even worse • Coordination service ✓ Implement it once and well
40. Current systems • Google Chubby ✓ Lock service (Burrows, USENIX OSDI, 2006) • Microsoft Centrifuge ✓ Lease service (Adya et al., USENIX NSDI, 2010) • Apache ZooKeeper ✓ Coordination kernel ✓ Initially contributed by Yahoo! (Hunt et al., USENIX ATC, 2010)
41. Example - Bigtable, HBase • Sparse column-oriented data store ✓ Tablet: range of rows ✓ Unit of distribution • Architecture ✓ Master ✓ Tablet servers (Diagram: rows grouped into tablets; column families f1 and f2 with columns f1:c1, f1:c2, f2:c1, f2:c2, f2:c3; timestamped cells)
42. Example - Bigtable, HBase • Master election ✓ Master crashes • Metadata management ✓ ACLs, tablet metadata • Rendezvous ✓ Find tablet server • Live tablet servers
43. Example - Web crawling • Fetching service ✓ Fetch web pages for search engines • Master election ✓ Assign work • Metadata management ✓ Politeness constraints ✓ Shards • Live workers (Diagram: one master, several fetchers)
44. Example - Kafka, Pub/Sub • Based on topics • Topics are partitioned across brokers • Consumer groups ✓ Multiple consumers for a topic • Coordination requirements ✓ Metadata ➡ Each message is consumed by a single consumer ➡ Each partition is owned by a consumer ✓ Crash detection
45. And more examples... • Google File System ✓ Master election ✓ File system metadata • Hedwig - Pub/Sub ✓ Topic metadata ✓ Topic assignment ✓ Elasticity
46. At Yahoo!... • Has been used for: ✓ Fetching service ✓ Managing workflows in Hadoop (e.g., feed ingestion) ✓ Content optimization (distributed services) ✓ ... • Largest cluster I'm aware of ✓ Around 5,000 - 10,000 clients
48. ZooKeeper introduction • Coordination kernel ✓ Does not export concrete primitives ✓ Recipes to implement primitives • File-system based API ✓ Manipulates small data nodes: znodes ✓ State is a hierarchy of znodes
49. More introduction • Stores its database in memory ✓ Handles high load ✓ Handy when communicating with a large number of processes • Originally single-data-center applications ✓ Some cases of cross-colo deployments
50. ZooKeeper: Design (Diagram: clients with a ZK library hold sessions to an ensemble of one leader and several followers; the leader propagates state updates; Zab guarantees replica consistency)
51. What's difficult here? • Electing a leader ✓ All live processes are potential leaders ✓ Communication pattern is arbitrary • Replicating the state ✓ Zab: a high-performance broadcast protocol ✓ Enables multiple outstanding operations
52. What do clients see? • Semantics of sessions • A prefix of the operations is executed • Upon a disconnection ✓ The client tries to contact another server ✓ Before the session expires: connect to a new server ✓ That server must have seen a transaction id at least as high as the client's
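The reconnection rule on this slide can be sketched as a single check. This is a hedged illustration, not the client library's real code; the function name and zxid bookkeeping are invented for the sketch:

```python
def can_reconnect(client_last_zxid: int, server_last_zxid: int) -> bool:
    # A client may only attach its session to a server whose state is at
    # least as recent as what the client has already observed; otherwise
    # the client could watch its own acknowledged updates "disappear".
    return server_last_zxid >= client_last_zxid

# A client that has seen transaction 100 must skip a lagging server...
assert not can_reconnect(100, 90)
# ...but may attach to any server at transaction 100 or beyond.
assert can_reconnect(100, 100)
```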
53. ZooKeeper API • Create znodes: create ✓ Persistent, ephemeral, sequential • Read and modify data: getData, setData • Read the children of a znode: getChildren • Check if a znode exists: exists • Delete a znode: delete
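To make the flavor of this API concrete, here is a deliberately tiny in-memory stand-in. It is not the real client (real code talks to an ensemble through the Java client or a binding such as kazoo), the `ToyZooKeeper` class is invented for illustration, and ephemeral nodes, watches, and versions are omitted:

```python
class ToyZooKeeper:
    """In-memory toy namespace mimicking the create/getData/setData/
    getChildren/exists/delete calls. No replication, sessions, or watches."""

    def __init__(self):
        self.znodes = {"/": b""}   # path -> data
        self.seq = {}              # sequential-name prefix -> next counter

    def create(self, path, data=b"", sequential=False):
        if sequential:
            n = self.seq.get(path, 0)
            self.seq[path] = n + 1
            path = f"{path}{n:010d}"   # zero-padded, like /el/c-0000000000
        self.znodes[path] = data
        return path

    def get_data(self, path):
        return self.znodes[path]

    def set_data(self, path, data):
        self.znodes[path] = data

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.znodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

    def exists(self, path):
        return path in self.znodes

    def delete(self, path):
        del self.znodes[path]


zk = ToyZooKeeper()
zk.create("/app")
first = zk.create("/app/task-", sequential=True)
assert first == "/app/task-0000000000"   # server assigns the sequence suffix
```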
54. ZooKeeper: Reads (Diagram: a client's getData for x is answered locally by the follower it is connected to, which returns 9; every replica holds x=9)
55. ZooKeeper: Writes (Diagram: a client's setData x, 10 is forwarded through the leader, which propagates the update and the client receives an ack; lagging followers may still briefly show x=9)
56. Example (Diagram: clients 1, 2, and 3 each run: 1- create "/el/c-", seq. + eph.; 2- getChildren "/el"; 3- pick smallest. In the ensemble, /el holds C1 at /el/c-1, C3 at /el/c-2, and C2 at /el/c-3)
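The three-step recipe in this example can be sketched as a self-contained simulation. Real code would issue the create/getChildren calls against a live ensemble; the function name below and the fixed creation order are invented for the sketch:

```python
def simulate_election(clients):
    """Each client creates a sequential + ephemeral znode under /el;
    the creator of the smallest sequence number is the leader."""
    children = {}
    # The ensemble hands out strictly increasing sequence numbers,
    # here simply in the order the create requests arrive.
    for seq, client in enumerate(clients):
        children[f"c-{seq:010d}"] = client   # create "/el/c-", seq. + eph.
    # Every client then runs getChildren("/el") and picks the smallest
    # name; zero-padding makes lexicographic order match numeric order.
    return children[min(children)]

# Whichever client created the lowest-numbered znode wins, and every
# client independently reaches the same answer.
assert simulate_election(["C1", "C2", "C3"]) == "C1"
```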
57. Znode changes • Znode changes ✓ Data is set ✓ Node is created or deleted ✓ Etc. • Learn of znode changes ✓ Set a watch ✓ Upon a change, the client receives a notification before any newer updates
58-60. Watches (Diagram sequence: a client calls getData "/foo" with watch=true and gets back 10; another client's setData "/foo", 11 is acknowledged; the watching client then receives a notification for /foo)
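Watches are one-shot: a notification fires for the next change only, and the client must re-register to hear about later ones. A hedged toy sketch of that behavior follows; the `WatchedNode` class is invented, and unlike the real protocol (where a notification carries an event type and path, not the new value) this toy passes the new data to the callback for brevity:

```python
class WatchedNode:
    """A single znode with ZooKeeper-style one-shot watches."""

    def __init__(self, data):
        self.data = data
        self.watchers = []   # callbacks armed for the *next* change only

    def get_data(self, watch=None):
        if watch is not None:
            self.watchers.append(watch)   # arm a watch, as in getData(path, True)
        return self.data

    def set_data(self, data):
        self.data = data
        # One-shot semantics: clear the watch list before firing, so a
        # callback only hears about the change it was armed for.
        fired, self.watchers = self.watchers, []
        for w in fired:
            w(self.data)


events = []
node = WatchedNode(10)
node.get_data(watch=events.append)   # read and set a watch
node.set_data(11)                    # fires the watch once
node.set_data(12)                    # no watch armed: silent
assert events == [11]
```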
61. Watches, locks, and the herd effect • Herd effect ✓ A large number of clients wake up simultaneously • Undesirable effect ✓ Load spikes
62-64. Watches, locks, and the herd effect (Diagram sequence: clients 1..n each hold a sequential znode under /el; when client 1's znode /el/c-1 goes away, every remaining client receives a notification at once)
65. Watches, locks, and the herd effect • A solution: use the order of clients ✓ Each client ➡ picks the znode z preceding its own znode in the sequential order ➡ watches z ✓ A single notification is generated upon a crash • Works for locks • Maybe not for leader election ✓ Only one client is notified of a leader change
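The predecessor-watch idea can be sketched as a pure function over the sorted sequential names. This is a simulation only; the function name is invented, and in real code each client would set the watch via an exists call on its predecessor's path:

```python
def predecessor_watches(children):
    """Map each contender to the znode it watches: the one immediately
    before it in sequence order. The smallest watches nothing — it
    already holds the lock."""
    ordered = sorted(children)
    return {c: (ordered[i - 1] if i > 0 else None)
            for i, c in enumerate(ordered)}


watches = predecessor_watches(["c-0000000002", "c-0000000000", "c-0000000001"])
# A delete now wakes exactly one client (the immediate successor)
# instead of the whole herd.
assert watches["c-0000000000"] is None
assert watches["c-0000000001"] == "c-0000000000"
assert watches["c-0000000002"] == "c-0000000001"
```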
66. Evaluation
67. Evaluation • Cluster of 50 servers • Xeon dual-core 2.1 GHz • 4 GB of RAM • Two SATA disks
68. Throughput (Plots: throughput of a saturated system, in operations per second, versus percentage of read requests, for ensembles of 3, 5, 7, 9, and 13 servers; left plot with requests spread across servers, right plot with all requests sent to the leader; throughput ranges up to roughly 90,000 ops/s)
69. Latency
70. Load in production • Fetching service • Load on a single ZK server • Read spikes of over 2,000 reads/s • Write spikes of less than 500 ops/s (Plot: read and write operations per second over a 66-hour window)
71. Contributing
72. Steps to follow • Watch the list for comments • Watch issues that interest you • Don't be shy to chip in and ask questions • Once you're ready to contribute... https://cwiki.apache.org/confluence/display/ZOOKEEPER/HowToContribute
73. Some important details • We like public communication ✓ User list for questions on how to use it ✓ Dev list for discussions about the code base • Issue tracker ✓ Jira: https://issues.apache.org/jira/browse/ZOOKEEPER ✓ Create a new Jira issue if you: ➡ find a problem ➡ intend to contribute a patch
74. On our roadmap • Dynamic configuration ✓ Currently static • Cross-colo deployments ✓ POLE: Performance-Oriented Leader Election ✓ Fault detection • ZooKeeper as a Service ✓ Multi-tenancy ✓ Write scalability
75. Final remarks • ZooKeeper ✓ Service for coordination ✓ Great experience internally and in Apache • One of a few building blocks ✓ BookKeeper ✓ Hedwig ✓ Omid