Theory of Locality-Sensitive Hashing, CS246 Stanford (Slides)

Taomcat

Published 2019/03/11 in the Technology category

Text content
1. CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu
2. Task: given a large number of documents (N in the millions or billions), find "near duplicates".
Applications:
- Mirror websites, or approximate mirrors: we don't want to show both in a single set of search results.
Problems:
- Many small pieces of one document can appear out of order in another.
- Too many documents to compare all pairs.
- Documents are so large, or so numerous, that they cannot all fit in main memory.
3. The three-step pipeline:
1. Shingling: convert documents to sets of items; a document becomes a set of k-shingles.
2. Min-Hashing: convert large sets into short signatures while preserving similarity. We want a hash function such that Pr[hπ(C1) = hπ(C2)] = sim(C1, C2); for the Jaccard similarity, Min-Hash has exactly this property!
3. Locality-sensitive hashing: focus on pairs of signatures likely to come from similar documents. Split signatures into bands and hash them; documents with similar signatures get hashed into the same buckets and become candidate pairs.
4. Locality-sensitive hashing pipeline: Document → the set of strings of length k that appear in the document → signatures (short integer vectors that represent the sets and reflect their similarity) → candidate pairs (those pairs of signatures that we need to test for similarity).
5. A k-shingle (or k-gram) is a sequence of k tokens that appears in the document.
- Example: k = 2, D1 = abcab. Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}.
- Represent a document by the set of hash values of its k-shingles.
- A natural document similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|.
- The similarity of two documents is the Jaccard similarity of their shingle sets (see the sketch below).
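The slide's example is easy to check in code. Below is a minimal sketch of character-level shingling and Jaccard similarity; the function names and the second document "abcd" are illustrative, not from the slides.

```python
def shingles(doc: str, k: int) -> set:
    """The set of k-shingles (length-k substrings) appearing in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity |C1 ∩ C2| / |C1 ∪ C2|."""
    return len(c1 & c2) / len(c1 | c2)

c1 = shingles("abcab", 2)
c2 = shingles("abcd", 2)
print(c1)               # {'ab', 'bc', 'ca'}, as on the slide
print(jaccard(c1, c2))  # 0.5: {'ab','bc'} shared, union {'ab','bc','ca','cd'}
```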
6. Key claim: the probability that a Min-Hash function gives two columns the same value equals the similarity of the underlying documents: Pr[hπ(C1) = hπ(C2)] = sim(D1, D2).
[Figure: permutations π of the rows of the input matrix (shingles × documents) produce the rows of the signature matrix M.]
The column similarities and the signature similarities approximately match:

Pair:     1-3   2-4   1-2   3-4
Col/Col:  0.75  0.75  0     0
Sig/Sig:  0.67  1.00  0     0
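In practice, random permutations are usually simulated with random hash functions: for each function, the signature entry is the minimum hash value over the set's elements. A minimal sketch under that assumption (the affine-hash scheme and the names are illustrative):

```python
import random

def make_hash_funcs(n, prime=2_147_483_647, seed=0):
    """n random affine functions x -> (a*x + b) mod prime, applied to
    Python's built-in hash of each shingle (a stand-in for a permutation)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]

def minhash_signature(shingle_set, funcs, prime=2_147_483_647):
    """One signature entry per hash function: the minimum value it takes."""
    return [min((a * hash(s) + b) % prime for s in shingle_set)
            for a, b in funcs]

funcs = make_hash_funcs(200)
sig1 = minhash_signature({"ab", "bc", "ca"}, funcs)
sig2 = minhash_signature({"ab", "bc", "cd"}, funcs)
# The fraction of agreeing positions estimates the Jaccard similarity (0.5):
print(sum(a == b for a, b in zip(sig1, sig2)) / len(sig1))
```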
7. Hash the columns of the signature matrix M: similar columns are likely to hash to the same bucket.
- Divide matrix M into b bands of r rows each (so M has b·r rows).
- Candidate column pairs are those that hash to the same bucket for at least one band.
[Figure: matrix M split into b bands of r rows; each band is hashed into buckets; the probability of sharing ≥ 1 bucket rises sharply around a similarity threshold s.]
A sketch of the bands technique follows.
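A minimal sketch of the bands technique. Using the band's tuple of values directly as the bucket key is a simplification of hashing each band:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """signatures: dict doc_id -> signature of length b*r.
    Returns pairs that share a bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's r rows
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates
```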
8. The S-curve is where the "magic" happens. Remember: for a single Min-Hash function, probability of equal hash values = similarity, i.e. Pr[hπ(C1) = hπ(C2)] = sim(D1, D2), a straight line as a function of the similarity t of the two sets. What we want instead is a step function at the threshold s: no chance of becoming a candidate if t < s, probability 1 if t > s. How to get (an approximation of) a step function? By choosing r and b!
9. Remember: b bands, r rows per band. Let sim(C1, C2) = t and pick some band (r rows):
- Prob. that the elements in a single row of columns C1 and C2 are equal: t
- Prob. that all r rows in the band are equal: t^r
- Prob. that some row in the band is not equal: 1 - t^r
- Prob. that no band is fully equal: (1 - t^r)^b
- Prob. that at least one band is fully equal: 1 - (1 - t^r)^b
So P(C1, C2 is a candidate pair) = 1 - (1 - t^r)^b. (A numeric sketch follows.)
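The formula is easy to tabulate. A small sketch; the rule of thumb that the step sits near (1/b)^(1/r) is standard for this S-curve:

```python
def p_candidate(t, r, b):
    """Probability that a pair with similarity t becomes a candidate."""
    return 1 - (1 - t ** r) ** b

# With r = 5, b = 10 the step sits near (1/10) ** (1/5) ~ 0.63:
for t in (0.2, 0.4, 0.6, 0.8):
    print(t, round(p_candidate(t, r=5, b=10), 4))
# 0.2 0.0032   0.4 0.0978   0.6 0.5549   0.8 0.9811
```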
10. Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a "step" right around s.
[Figure: four plots of prob = 1 - (1 - t^r)^b against similarity t: r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; r = 10 with b = 1..50.]
11. (Recap) Locality-sensitive hashing pipeline: Document → the set of strings of length k that appear in the document → signatures (short vectors that represent the sets and reflect their similarity) → candidate pairs (those pairs of signatures that we need to test for similarity).
12. We have used LSH to find similar documents. More generally, we found similar columns in large sparse matrices with high Jaccard similarity, for example customer/item purchase histories. Can we use LSH for other distance measures, e.g. Euclidean distance or cosine distance? Let's generalize what we've learned!
13. For Min-Hashing signatures, we got one Min-Hash function for each permutation of rows. Generalizing:
- A "hash function" here is any function that takes two elements and says whether they are "equal". Shorthand: h(x) = h(y) means "h says x and y are equal".
- A family of hash functions is any set of hash functions from which we can pick one at random efficiently.
- Example: the set of Min-Hash functions generated from permutations of rows.
14. Suppose we have a space S of points with a distance measure d(x, y). A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
1. If d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1.
2. If d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2.
This is the critical assumption: with an LS family we can do LSH!
15. [Figure: Pr[h(x) = h(y)] plotted against d(x, y): small distance means high probability of hashing to the same value (at least p1 for distances below d1); large distance means low probability (at most p2 for distances above d2).]
16. Let S = the space of all sets, d = the Jaccard distance, and H = the family of Min-Hash functions for all permutations of rows. Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y). This simply restates the theorem about Min-Hashing in terms of distances rather than similarities.
17. Claim: Min-Hash is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d: if distance < 1/3 (so similarity > 2/3), then the probability that the Min-Hash values agree is > 2/3.
- More generally, for Jaccard similarity, Min-Hashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2.
- The theory leaves unknown what happens to pairs at distances between d1 and d2. Consequence: no guarantees about the fraction of false positives in that range.
18. Can we reproduce the "S-curve" effect we saw before for any LS family?
- The "bands" technique we learned for signature matrices carries over to this more general setting, so we can do LSH with any (d1, d2, p1, p2)-sensitive family.
- Two constructions:
  - AND construction, like the "rows in a band"
  - OR construction, like the "many bands"
20. AND construction: given family H, construct family H' consisting of r-tuples of functions from H.
- For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r. Note this corresponds to creating a band of size r.
- Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive.
- Proof: use the fact that the hi's are independent. (A sketch follows.)
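A minimal sketch of the AND construction: a member of H' is just an r-tuple of members of H, and tuple equality is exactly "all r agree". (Here H is assumed to be a list of hash functions, i.e. callables, which is an assumption of this sketch.)

```python
import random

def and_construction(H, r, rng=random):
    """Draw r functions from H; the combined hash value is the tuple of
    their values, so h(x) = h(y) iff all r components agree (a 'band')."""
    hs = [rng.choice(H) for _ in range(r)]
    return lambda x: tuple(h(x) for h in hs)
```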
21. Independence of the hash functions really means that the probability of two of them saying "yes" is the product of each saying "yes". But couldn't two hash functions be highly correlated? For example, two Min-Hash permutations could agree in their first one million entries. That is fine: the probabilities in the definition of an LS family are taken over a random choice from all possible members of H (and hence of H'), not over any fixed pair of functions.
22. OR construction: given family H, construct family H' consisting of b-tuples of functions from H.
- For h = [h1, ..., hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i.
- Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1 - (1 - p1)^b, 1 - (1 - p2)^b)-sensitive.
- Proof: use the fact that the hi's are independent. (A sketch follows.)
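Unlike the AND construction, the OR construction does not produce a single hash value; it is the predicate "agree under at least one of the b functions". A minimal sketch under the same assumptions as above, together with the two probability transforms:

```python
import random

def or_construction(H, b, rng=random):
    """Draw b functions from H; x and y are 'equal' under the combined
    function iff at least one of the b functions maps them together."""
    hs = [rng.choice(H) for _ in range(b)]
    return lambda x, y: any(h(x) == h(y) for h in hs)

# Probability transforms (hi's drawn independently from H):
#   AND: p -> p ** r          OR: p -> 1 - (1 - p) ** b
```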
23. AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly we can make the upper probability approach 1 while the lower does not.
[Figure: two plots of the probability of sharing a bucket against the similarity of a pair of items: AND with r = 1..10, b = 1 (curves bending down); OR with r = 1, b = 1..10 (curves bending up).]
24. Composing: an r-way AND followed by a b-way OR construction. This is exactly what we did with Min-Hashing: if a band matches in all r values, the pair hashes to the same bucket; columns that are hashed into ≥ 1 common bucket become candidates.
- Take points x and y such that Pr[h(x) = h(y)] = p for h drawn from H. The composed construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b: the S-curve!
- Example: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4.
25. Effect of the 4-way AND followed by the 4-way OR on the agreement probability p:

p    1-(1-p^4)^4
.2   .0064
.3   .0320
.4   .0985
.5   .2275
.6   .4260
.7   .6666
.8   .8785
.9   .9860

So r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
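The table is easy to reproduce:

```python
# Tabulate the 4-way AND followed by 4-way OR: p -> 1 - (1 - p**4)**4
for p in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"{p:.1f}  {1 - (1 - p ** 4) ** 4:.4f}")
# e.g. 0.2 -> 0.0064 and 0.8 -> 0.8785, matching the slide
```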
27. Picking r and b to get the desired performance: 50 hash functions (r = 5, b = 10), with a threshold s.
[Figure: S-curve of P(candidate pair) against similarity, with the threshold s marked, a blue area X above the curve right of s, and a green area Y under the curve left of s.]
- Blue area X, the false negative rate: pairs with sim > s, an X fraction of which won't share a band and so will never become candidates. This means we never even consider these pairs for the (slow, exact) similarity calculation!
- Green area Y, the false positive rate: pairs with sim < s that we do consider as candidates. This is not too bad: we spend the (slow, exact) similarity computation on them and then discard them.
28. Picking r and b to get the desired performance: 50 hash functions (r · b = 50), with a given threshold s.
[Figure: three S-curves against similarity for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5; larger r gives a sharper step at a higher threshold.]
29. OR-AND composition: apply a b-way OR construction followed by an r-way AND construction. This transforms probability p into (1 - (1 - p)^b)^r: the same S-curve, mirrored horizontally and vertically.
- Example: take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4.
30. Effect of the 4-way OR followed by the 4-way AND:

p    (1-(1-p)^4)^4
.1   .0140
.2   .1215
.3   .3334
.4   .5740
.5   .7725
.6   .9015
.7   .9680
.8   .9936

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.
31. Cascading constructions. Example: apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction. This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family. Note this family uses 256 (= 4·4·4·4) of the original hash functions.
32. Summary of amplification:
- Pick any two distances d1 < d2.
- Start with a (d1, d2, (1 - d1), (1 - d2))-sensitive family.
- Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0.
- The closer to 1 and 0 we want to get, the more hash functions must be used!
34. LSH methods for other distance metrics:
- Cosine distance: random hyperplanes
- Euclidean distance: project on lines
The general pipeline: Points → design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric → signatures (short integer vectors that reflect similarity) → amplify the family using AND and OR constructions → candidate pairs (those pairs of signatures that we need to test for similarity).
35. [Figure: two parallel pipelines. Documents → 0/1 set matrix → Min-Hash → short integer signatures → "bands" technique → candidate pairs. Data points → random hyperplanes → ±1 signatures → "bands" technique → candidate pairs. In both cases the signatures reflect similarity, and the candidate pairs are the signature pairs we need to test for similarity.]
36. Cosine distance = the angle between the vectors from the origin to the two points in question: d(A, B) = θ = arccos( (A·B) / (‖A‖·‖B‖) ).
- It has range 0…π (equivalently 0…180°); we can divide θ by π to get a distance in the range 0…1.
- Cosine similarity can then be defined as 1 - d(A, B), but it is often defined instead as cos(θ) = (A·B) / (‖A‖·‖B‖), which has range -1…1 for general vectors, and 0…1 for non-negative vectors (angles up to 90°).
37. For cosine distance, there is a technique called Random Hyperplanes, similar in spirit to Min-Hashing. The Random Hyperplanes method is a (d1, d2, (1 - d1/π), (1 - d2/π))-sensitive family for any d1 and d2.
Reminder, (d1, d2, p1, p2)-sensitive means:
1. If d(x, y) < d1, then the prob. that h(x) = h(y) is at least p1.
2. If d(x, y) > d2, then the prob. that h(x) = h(y) is at most p2.
38. Pick a random vector v, which determines a hash function hv with two buckets: hv(x) = +1 if v·x ≥ 0, and -1 if v·x < 0. The LS family H is the set of all functions derived from any vector.
Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/π, where d(x, y) is the angle between x and y. (A simulation sketch follows.)
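A small simulation sketch of the claim. The 2-D test vectors, at an angle of π/4, are illustrative; Gaussian components give a uniformly random hyperplane direction:

```python
import math, random

def random_hyperplane_hash(dim, rng=random):
    """h_v(x) = +1 if v.x >= 0 else -1, for a random Gaussian vector v."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    return lambda x: 1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1

x, y = (1.0, 0.0), (1.0, 1.0)       # angle theta = pi/4 between them
hs = [random_hyperplane_hash(2) for _ in range(50_000)]
agree = sum(h(x) == h(y) for h in hs) / len(hs)
print(agree, 1 - (math.pi / 4) / math.pi)   # both close to 0.75
```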
39. [Figure: the plane of x and y, with angle θ between them; a hyperplane normal to v, for which h(x) = h(y); and a hyperplane normal to v', which separates x and y, so h(x) ≠ h(y).]
40. The separating ("red") case occurs exactly when the random hyperplane cuts through the angle θ between x and y, so Prob[red case] = θ/π. Therefore P[h(x) = h(y)] = 1 - θ/π = 1 - d(x, y)/π.
41. In practice: pick some number of random vectors and hash your data for each vector. The result is a signature (sketch) of +1's and -1's for each data point. It can be used for LSH just as we used the Min-Hash signatures for Jaccard distance, amplifying with the AND/OR constructions.
42. Picking a random vector in M dimensions is expensive for large M: we would have to generate M random numbers. A more efficient approach: it suffices to consider only vectors v consisting of +1 and -1 components. Why is this more efficient? Each component now needs only a single random bit, and the dot product v·x reduces to additions and subtractions. (A sketch follows.)
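A sketch of the ±1 variant. The uniform ±1 choice per component is an assumption consistent with the slide, not spelled out on it:

```python
import random

def sign_vector_hash(dim, rng=random):
    """Random hyperplane hash with v in {-1, +1}^dim: one random bit per
    component, and v.x is just a signed sum of x's components."""
    v = [rng.choice((-1, 1)) for _ in range(dim)]
    return lambda x: 1 if sum(s * xi for s, xi in zip(v, x)) >= 0 else -1
```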
43. For Euclidean distance, the simple idea: hash functions correspond to random lines. Partition each line into buckets of size a, and hash each point to the bucket containing its projection onto the line. Nearby points always project close together; distant points are rarely in the same bucket. (A sketch follows.)
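A minimal sketch of projection hashing. The random offset is not mentioned on the slide but is standard in practice; it keeps bucket boundaries from being aligned across hash functions:

```python
import math, random

def line_projection_hash(dim, a, rng=random):
    """Project onto a random unit-length line, then bucket with width a."""
    v = [rng.gauss(0, 1) for _ in range(dim)]          # random direction
    norm = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm for vi in v]                        # unit length
    offset = rng.uniform(0, a)                         # random bucket shift
    return lambda x: math.floor(
        (sum(vi * xi for vi, xi in zip(v, x)) + offset) / a)
```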
44. [Figure: points projected onto a line partitioned into buckets of size a.]
- "Lucky" case: points that are close hash into the same bucket, and distant points end up in different buckets.
- Two "unlucky" cases: unlucky quantization (close points straddle a bucket boundary) and unlucky projection (distant points project close together).
46. [Figure: two points at distance d projected onto a randomly chosen line with bucket width a.] If d << a, then the chance that the points land in the same bucket is at least 1 - d/a.
47. [Figure: two points at distance d, whose projections onto a randomly chosen line with bucket width a are d·cos θ apart.] If d >> a, then θ must be close to 90° for there to be any chance that the points go to the same bucket.
48. Putting it together:
- If the points are at distance d < a/2, the probability they are in the same bucket is ≥ 1 - d/a ≥ 1/2.
- If the points are at distance d > 2a, they can be in the same bucket only if d·cos θ ≤ a, i.e. cos θ ≤ 1/2, i.e. 60° ≤ θ ≤ 90°. Since θ is equally likely to be anywhere in 0°…90°, this happens with probability at most 1/3.
- This yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a. Amplify using AND-OR cascades.
49. The projection method yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions. For the previous distance measures, we could start with a (d1, d2, p1, p2)-sensitive family for any d1 < d2 and drive p1 and p2 to 1 and 0 by AND/OR constructions. Note: here we seem to need d2 ≥ 4·d1, because the calculation on the previous slide only considered the cases d < a/2 and d > 2a.
50. But as long as d1 < d2, the probability of points at distance d1 falling in the same bucket is greater than the probability of points at distance d2 doing so. Thus the hash family formed by projecting onto lines is a (d1, d2, p1, p2)-sensitive family for some p1 > p2, which can then be amplified by AND/OR constructions.
51. [Figure: recap of the full pipeline, as on slide 35. Documents or data points → a (d1, d2, p1, p2)-sensitive family designed for the distance metric (Min-Hash for Jaccard, random hyperplanes for cosine) → short integer or ±1 signatures that reflect similarity → amplification via the "bands" technique (AND and OR constructions) → candidate pairs, the signature pairs we need to test for similarity.]
52. Summary: the property P(h(C1) = h(C2)) = sim(C1, C2) of the hash function h is the essential ingredient of LSH; without it we can't do anything. LS hash functions transform the data into signatures so that the bands technique (the AND and OR constructions) can then be applied.