Theory of Locality Sensitive Hashing CS246 Stanford (Slides)
2. Task: Given a large number (N in the millions or billions) of documents, find "near duplicates"
Applications: Mirror websites, or approximate mirrors; we don't want to show both in a single set of search results
Problems:
- Many small pieces of one document can appear out of order in another
- Too many documents to compare all pairs
- Documents are so large or so many that they cannot fit in main memory
1/14/2015 Jure Leskovec, Stanford C246: Mining Massive Datasets
3. The big picture:
1. Shingling: Convert docs to sets of items. A document is a set of k-shingles.
2. MinHashing: Convert large sets into short signatures, while preserving similarity. Want a hash function such that Pr[h_π(C1) = h_π(C2)] = sim(C1, C2) for the Jaccard similarity; MinHash has this property!
3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents. Split signatures into bands and hash them; documents with similar signatures get hashed into the same buckets: candidate pairs.
4. Locality-sensitive hashing, the pipeline:
Document → Shingling → the set of strings of length k that appear in the document
→ MinHashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
→ LSH → Candidate pairs: those pairs of signatures that we need to test for similarity
5. A k-shingle (or k-gram) is a sequence of k tokens that appears in the document
Example: k = 2; D1 = abcab. Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}
Represent a doc by a set of hash values of its k-shingles
A natural document similarity measure is then the Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
The similarity of two documents is the Jaccard similarity of their shingle sets
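As a minimal sketch (not from the slides; function names are illustrative), shingling and Jaccard similarity can be written as:

```python
def shingles(doc, k):
    """Return the set of k-shingles (character k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1, c2):
    """sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|"""
    return len(c1 & c2) / len(c1 | c2)

C1 = shingles("abcab", 2)          # the slide's example D1
C2 = shingles("abcd", 2)           # a second, made-up document
print(C1)                          # {'ab', 'bc', 'ca'}
print(jaccard(C1, C2))             # 2 shared shingles of 4 total: 0.5
```

In practice one stores hash values of the shingles rather than the strings themselves, as the slide notes.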
6. The probability that h_π(C1) = h_π(C2) is the same as sim(D1, D2):
Pr[h_π(C1) = h_π(C2)] = sim(D1, D2)

Permutations π | Input matrix (shingles x documents):
2 4 3 | 1 0 1 0
3 2 4 | 1 0 0 1
7 1 7 | 0 1 0 1
6 3 2 | 0 1 0 1
1 6 6 | 0 1 0 1
5 7 1 | 1 0 1 0
4 5 5 | 1 0 1 0

Signature matrix M:
2 1 2 1
2 1 4 1
1 2 1 2

Similarities of columns and signatures (approximately) match!
          1-3   2-4   1-2   3-4
Col/Col   0.75  0.75  0     0
Sig/Sig   0.67  1.00  0     0
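The signature matrix above can be reproduced directly. A sketch (using the slide's permutations, written as the position of each row under each permutation):

```python
INPUT = [  # shingles x documents (7 rows, 4 columns), from the slide
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
PERMS = [  # position of row i under permutations π1, π2, π3
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]

def minhash_signature(matrix, perms):
    """For each permutation and column: the position of the first row
    (in permuted order) in which that column has a 1."""
    n_cols = len(matrix[0])
    return [[min(perm[i] for i in range(len(matrix)) if matrix[i][c])
             for c in range(n_cols)]
            for perm in perms]

M = minhash_signature(INPUT, PERMS)
for row in M:
    print(row)
# [2, 1, 2, 1]
# [2, 1, 4, 1]
# [1, 2, 1, 2]
```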
7. Hash the columns of the signature matrix M: similar columns likely hash to the same bucket
Divide matrix M into b bands of r rows each (so M has n = b·r rows)
Candidate column pairs are those that hash to the same bucket for ≥ 1 band
(Figure: matrix M split into b bands of r rows, each band hashed into buckets; probability of sharing ≥ 1 bucket plotted against similarity, with threshold s.)
1/14/2015
8. The S-curve is where the "magic" happens
Remember: probability of equal hash values = similarity
What 1 hash code gives you: Pr[h_π(C1) = h_π(C2)] = sim(D1, D2), a straight line in the similarity t of the two sets
What we want: a step function at threshold s, with no chance of becoming a candidate if t < s and certainty of sharing ≥ 1 bucket if t > s
How to get a step function? By choosing r and b!
9. Remember: b bands, r rows/band
Let sim(C1, C2) = t, and pick some band (r rows):
- Prob. that the elements in a single row of columns C1 and C2 are equal = t
- Prob. that all r rows in a band are equal = t^r
- Prob. that some row in a band is not equal = 1 − t^r
- Prob. that no band is equal = (1 − t^r)^b
- Prob. that at least 1 band is equal = 1 − (1 − t^r)^b
So P(C1, C2 is a candidate pair) = 1 − (1 − t^r)^b
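The derivation above is a one-liner to evaluate; a sketch:

```python
def candidate_prob(t, r, b):
    """Probability that two columns with similarity t become a candidate
    pair under banding: 1 - (1 - t**r)**b."""
    return 1 - (1 - t**r) ** b

# With r = 5 rows per band and b = 10 bands (50 hash functions total):
for t in (0.2, 0.5, 0.8):
    print(f"t = {t}: {candidate_prob(t, 5, 10):.4f}")
```

For r = 5, b = 10, a pair at similarity 0.8 becomes a candidate with probability about 0.98, while a pair at similarity 0.2 does so with probability only about 0.003: the S-curve at work.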
10. Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a "step" right around s.
(Four plots of Prob(candidate pair) vs. similarity, all following prob = 1 − (1 − t^r)^b: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.)
11. Locality-sensitive hashing, the pipeline again:
Document → Shingling → the set of strings of length k that appear in the document
→ MinHashing → Signatures: short vectors that represent the sets, and reflect their similarity
→ LSH → Candidate pairs: those pairs of signatures that we need to test for similarity
12. We have used LSH to find similar documents
More generally, we found similar columns in large sparse matrices with high Jaccard similarity
For example, customer/item purchase histories
Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance?
Let's generalize what we've learned!
13. For MinHash signatures, we got a MinHash function for each permutation of rows
A "hash function" here is any function that takes two elements and says whether or not they are "equal"
Shorthand: h(x) = h(y) means "h says x and y are equal"
A family of hash functions is any set of hash functions from which we can pick one at random efficiently
Example: the set of MinHash functions generated from permutations of rows
14. Suppose we have a space S of points with a distance measure d(x, y)
A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1
2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2
(Critical assumption: such a family exists for the distance measure at hand.)
With an LS family we can do LSH!
15. (Figure: Pr[h(x) = h(y)] plotted against d(x, y). Below d1 the probability is at least p1: small distance, high probability. Above d2 it is at most p2: large distance, low probability of hashing to the same value.)
16. Let S = the space of all sets, d = the Jaccard distance, and H = the family of MinHash functions for all permutations of rows
Then for any hash function h ∈ H:
Pr[h(x) = h(y)] = 1 − d(x, y)
This simply restates the theorem about MinHashing in terms of distances rather than similarities
17. Claim: MinHash H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d
If the distance is < 1/3 (so the similarity is > 2/3), then the probability that the MinHash values agree is > 2/3
For Jaccard similarity, MinHashing gives a (d1, d2, (1−d1), (1−d2))-sensitive family for any d1 < d2
Caveat: the definition says nothing about pairs whose distance is between d1 and d2
Consequence: no guarantees about the fraction of false positives in that range
18. Can we reproduce the "S-curve" effect we saw before for any LS family?
The "bands" technique we learned for signature matrices carries over to this more general setting
So we can do LSH with any (d1, d2, p1, p2)-sensitive family
Two constructions:
- AND construction, like "rows in a band"
- OR construction, like "many bands"
(Figure: probability of sharing a bucket vs. similarity t, the S-curve.)
20. AND construction: given family H, construct family H' consisting of r functions from H
For h = [h1, …, hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r
Note this corresponds to creating a band of size r
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive
Proof: Use the fact that the hi's are independent
21. Independence of hash functions (HFs) here really means that the probability of two HFs saying "yes" is the product of each saying "yes"
But two particular hash functions could be highly correlated; for example, in MinHash, if their permutations agree in the first one million entries
However, the probabilities in the definition of an LS family are over all possible members of H and H'
22. OR construction: given family H, construct family H' consisting of b functions from H
For h = [h1, …, hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1−(1−p1)^b, 1−(1−p2)^b)-sensitive
Proof: Use the fact that the hi's are independent
23. AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not
OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not
(Two plots of prob. of sharing a bucket vs. similarity of a pair of items: AND with r = 1..10, b = 1; OR with r = 1, b = 1..10.)
24. Composition: r-way AND followed by b-way OR construction
Exactly what we did with MinHashing:
- Bands that match in all r values hash to the same bucket
- Columns that are hashed into ≥ 1 common bucket become candidates
Take points x and y such that Pr[h(x) = h(y)] = p
H will make (x, y) a candidate pair with probability p
The construction makes (x, y) a candidate pair with probability 1 − (1 − p^r)^b: the S-curve!
Example: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4
25. r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
p    1−(1−p^4)^4
.2   .0064
.3   .0320
.4   .0985
.5   .2275
.6   .4260
.7   .6666
.8   .8785
.9   .9860
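The table above can be reproduced from the formula; a sketch:

```python
def and_or(p, r, b):
    """r-way AND then b-way OR: per-hash agreement probability p
    becomes 1 - (1 - p**r)**b."""
    return 1 - (1 - p**r) ** b

# With r = b = 4, reproducing the slide's table:
for p in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"{p}: {and_or(p, 4, 4):.4f}")
```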
27. Picking r and b to get the desired performance: 50 hash functions (r = 5, b = 10)
(Plot: Prob(candidate pair) vs. similarity, with threshold s marked.)
Blue area X, the false negative rate: these are pairs with sim > s, but an X fraction of them won't share a band and so will never become candidates. This means we will never consider these pairs for the (slow/exact) similarity calculation!
Green area Y, the false positive rate: these are pairs with sim < s that we will nonetheless consider as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and discard them.
28. Picking r and b to get the desired performance: 50 hash functions (r · b = 50)
(Plot comparing the S-curves for r = 2, b = 25; r = 5, b = 10; r = 10, b = 5, with threshold s marked.)
29. OR-AND composition: apply a b-way OR construction followed by an r-way AND construction
Transforms probability p into (1 − (1 − p)^b)^r
The same S-curve, mirrored horizontally and vertically
Example: take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4
30. The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family
p    (1−(1−p)^4)^4
.1   .0140
.2   .1215
.3   .3334
.4   .5740
.5   .7725
.6   .9015
.7   .9680
.8   .9936
31. Example: apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction
This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family
Note this family uses 256 (= 4·4·4·4) of the original hash functions
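The cascaded figures quoted above follow by composing the two transformations; a sketch:

```python
def and_or(p, r, b):
    """r-way AND, then b-way OR."""
    return 1 - (1 - p**r) ** b

def or_and(p, b, r):
    """b-way OR, then r-way AND."""
    return (1 - (1 - p) ** b) ** r

def cascade(p):
    """(4,4) OR-AND construction followed by (4,4) AND-OR construction."""
    return and_or(or_and(p, 4, 4), 4, 4)

print(f"{cascade(0.8):.7f}")   # ≈ .9999996
print(f"{cascade(0.2):.7f}")   # ≈ .0008715
```

Note the order: the inner construction is applied to p first, and its output probability is fed to the outer one.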
32. Pick any two distances d1 < d2
Start with a (d1, d2, (1−d1), (1−d2))-sensitive family
Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0
The closer to 0 and 1 we get, the more hash functions must be used!
34. LSH methods for other distance metrics:
- Cosine distance: random hyperplanes
- Euclidean distance: project on lines
The pipeline is the same: Points → Signatures (short integer signatures that reflect their similarity) → Locality-sensitive hashing → Candidate pairs (those pairs of signatures that we need to test for similarity)
Design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric, then amplify the family using AND and OR constructions
The design of the hash family depends on the distance function used
35. (Diagram: two instantiations of the pipeline. Documents → 0/1 shingle vectors → MinHash → short integer signatures that reflect similarity → "bands" technique → candidate pairs. Data points → Random Hyperplanes → +1/−1 signatures → "bands" technique → candidate pairs.)
36. Cosine distance = the angle between the vectors from the origin to the points in question:
d(A, B) = θ = arccos(A·B / (‖A‖·‖B‖))
Has range 0…π (equivalently 0…180°); we can divide θ by π to get a distance in the range 0…1
Cosine similarity = 1 − d(A, B), but it is often defined instead as cos(θ) = A·B / (‖A‖·‖B‖)
- Range −1…1 for general vectors
- Range 0…1 for non-negative vectors (angles up to 90°)
37. For cosine distance, there is a technique called Random Hyperplanes
The technique is similar to MinHashing
The Random Hyperplanes method is a (d1, d2, (1−d1/π), (1−d2/π))-sensitive family for any d1 and d2
Reminder, (d1, d2, p1, p2)-sensitive means:
1. If d(x, y) < d1, then the prob. that h(x) = h(y) is at least p1
2. If d(x, y) > d2, then the prob. that h(x) = h(y) is at most p2
38. Pick a random vector v, which determines a hash function hv with two buckets:
hv(x) = +1 if v·x ≥ 0; hv(x) = −1 if v·x < 0
LS family H = the set of all functions derived from any vector
Claim: for points x and y, Pr[h(x) = h(y)] = 1 − d(x, y)/π
39. (Figure: look in the plane of x and y, which subtend angle θ. A hyperplane normal to a vector v lying outside the angle between x and y gives h(x) = h(y); a hyperplane normal to a vector v' lying inside that angle gives h(x) ≠ h(y).)
40. So: Prob[the red case, h(x) ≠ h(y)] = θ/π
So: P[h(x) = h(y)] = 1 − θ/π = 1 − d(x, y)/π
41. Pick some number of random vectors, and hash your data for each vector
The result is a signature (sketch) of +1's and −1's for each data point
It can be used for LSH just like we used the MinHash signatures for Jaccard distance
Amplify using the AND/OR constructions
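A minimal sketch of such signatures (not from the slides; it uses Gaussian random vectors, while the next slide discusses a cheaper ±1 variant, and the example points are made up). The fraction of agreeing signature components should approach 1 − θ/π:

```python
import math
import random

random.seed(0)

def hyperplane_sketch(x, vectors):
    """+1/-1 signature: which side of each random hyperplane x falls on."""
    return [1 if sum(v_i * x_i for v_i, x_i in zip(v, x)) >= 0 else -1
            for v in vectors]

dim, n = 3, 5000
vectors = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

x, y = [1.0, 2.0, 0.0], [2.0, 1.0, 0.0]
sx, sy = hyperplane_sketch(x, vectors), hyperplane_sketch(y, vectors)

agree = sum(a == b for a, b in zip(sx, sy)) / n
theta = math.acos(sum(a * b for a, b in zip(x, y))
                  / (math.hypot(*x) * math.hypot(*y)))
print(f"fraction of agreeing components: {agree:.3f}")
print(f"predicted 1 - theta/pi:          {1 - theta/math.pi:.3f}")
```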
42. It is expensive to pick a random vector in M dimensions for large M: we would have to generate M random numbers
A more efficient approach: it suffices to consider only vectors v consisting of +1 and −1 components
Why is this more efficient? (The dot product v·x then needs only additions and subtractions, no multiplications.)
43. Simple idea for Euclidean distance: hash functions correspond to lines
Partition the line into buckets of size a
Hash each point to the bucket containing its projection onto the line
The projections of nearby points are always close; distant points are rarely in the same bucket
44. (Figure: points projected onto a line divided into buckets of size a. "Lucky" case: points that are close hash into the same bucket, while distant points end up in different buckets. Two "unlucky" cases: unlucky quantization, where close points straddle a bucket boundary, and unlucky projection, where distant points project close together.)
46. (Figure: two points at distance d projected onto a randomly chosen line with bucket width a.) If d << a, then the chance that the points fall in the same bucket is at least 1 − d/a.
47. (Figure: two points at distance d whose projection onto a randomly chosen line has length d·cos θ, with bucket width a.) If d >> a, then θ must be close to 90° for there to be any chance the points go to the same bucket.
48. If points are at distance d < a/2, the probability they are in the same bucket is ≥ 1 − d/a ≥ 1/2
If points are at distance d > 2a apart, then they can be in the same bucket only if d·cos θ ≤ a, i.e., cos θ ≤ 1/2, so 60° ≤ θ ≤ 90°, which happens with probability at most 1/3
This yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a
Amplify using AND-OR cascades
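A sketch of one projection-based hash (not from the slides; the random shift u of the bucket grid is a standard practical detail the slide does not state, and the sample points are made up). Close pairs (d < a/2) should collide often, far pairs (d > 2a) rarely:

```python
import math
import random

def line_hash(a, dim, rng):
    """One hash function: bucket index of the projection onto a random line."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]                # random unit direction
    u = rng.uniform(0.0, a)                  # random shift of the bucket grid
    def h(p):
        return math.floor((sum(c * q for c, q in zip(v, p)) + u) / a)
    return h

rng = random.Random(1)
a, trials = 1.0, 2000
close_same = far_same = 0
for _ in range(trials):
    h = line_hash(a, 2, rng)
    close_same += h([0.0, 0.0]) == h([0.3, 0.0])   # d = 0.3 < a/2
    far_same += h([0.0, 0.0]) == h([3.0, 0.0])     # d = 3.0 > 2a
print(close_same / trials, far_same / trials)      # high vs. low
```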
49. The projection method yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions
For the previous distance measures, we could start with a (d1, d2, p1, p2)-sensitive family for any d1 < d2, and drive p1 and p2 to 1 and 0 by AND/OR constructions
Note: here, we seem to need d2 ≥ 4·d1, since in the calculation on the previous slide we only considered the cases d < a/2 and d > 2a
50. But as long as d1 < d2, the probability that points at distance d1 fall in the same bucket is greater than the probability that points at distance d2 do so
Thus, the hash family formed by projecting onto lines is a (d1, d2, p1, p2)-sensitive family for some p1 > p2
Then, amplify by AND/OR constructions
51. (Summary diagram, as before: Documents → 0/1 shingle vectors → MinHash → integer signatures → "bands" technique → candidate pairs; Data points → Random Hyperplanes → +1/−1 signatures → "bands" technique → candidate pairs. Design a (d1, d2, p1, p2)-sensitive family of hash functions for the particular distance metric, then amplify the family using AND and OR constructions.)
52. The property Pr[h(C1) = h(C2)] = sim(C1, C2) of a hash function h is the essential part of LSH; without it we can't do anything
LS hash functions transform data to signatures so that the bands technique (the AND and OR constructions) can then be applied
1/14/2015