# Three Cool Algorithms You’ve Never Heard Of! Carey Nachenberg


Three Cool Algorithms You’ve Never Heard Of! Carey Nachenberg cnachenberg@symantec.com

Cool Data Structure: The Metric Tree

[Figure: a metric tree of cities. Each node holds a city and a distance threshold: LA 1500km, Las Vegas 1000km, SF 100km, Austin 250km, NYC 1100km, Boston 400km, Atlanta 600km, New Orleans 300km, San Jose 200km, Merced 70km, Providence 200km. One branch below each node leads to the cities within its threshold (e.g. <=1500km away from LA) and the other branch to the cities beyond it (>1500km away).]

Challenge: Building a Fuzzy Spell Checker

Imagine you’re building a word processor and you want to implement a spell checker that gives suggestions. Say the user types “lobeky”. Of course it’s easy to tell the user that their word is misspelled…

Question: What data structure could we use to determine if a word is in a dictionary or not?

Right: a hash table or binary search tree could tell you if a word is spelled correctly. But what if we want to efficiently provide the user with possible alternatives (suggestions such as lonely, lovely, locale, …)?

Providing Alternatives?

Before we can provide alternatives, we need a way to find close matches… One useful tool for this is the “edit distance” metric.

Edit Distance: How many letters must be added, deleted or replaced to get from word A to word B.

lobeky -> lovely has an edit distance of 2 (replace b with v, replace k with l).
lobeky -> lowly has an edit distance of 3 (replace b with w, delete e, replace k with l).

So given the user’s misspelled word, and this edit distance function… How can we use this to provide the user with spelling suggestions?

Providing Alternatives?

Well, we could take our misspelled word, lobeky, and compute its edit distance to every word in the dictionary (aardvark: 8, ark: 5, acorn: 6, … bone, bonfire, … lonely, lonesome, …), and then give the user all words with an edit distance of <=3…

But that’s really, really slow! There’s a better way! But before we talk about it, let’s talk about edit distance a bit more…

Edit Distance

As it turns out, the edit distance function, e(x,y), is what we call a “metric distance function.” What does that mean?

1. e(x,y) = e(y,x). The edit distance from “foo” to “food” is the same as from “food” to “foo”.
2. e(x,y) >= 0. You can never have a negative edit distance… Well, that makes sense…
3. e(x,z) <= e(x,y) + e(y,z), aka “the triangle inequality”. It’s never cheaper to do two conversions than a direct conversion. For example, e(“foo”,“feed”) = 3 and e(“feed”,“goon”) = 4, for a total cost of 7, which is greater than the direct cost e(“foo”,“goon”) = 2.

Metric Distance Functions

Given some word w (e.g., pier), let’s say I happen to know all words with an edit distance of 1 from that word: peer, tier, piper, pie, pies. Now, if my misspelled word m (e.g., zifs) has an edit distance of 3 from w, what does that guarantee about m to these other words?

Right: If e(“zifs”,“pier”) is 3, and all these other words are exactly 1 edit away from pier… then by the triangle inequality, “zifs” must be at most 4 edits away from any word in this cloud! Let’s see: e(“zifs”,“pier”) = 3 plus e(“pier”,“piper”) = 1 gives a total cost of 4. And directly: e(“zifs”,“piper”) = 4.

But by the same reasoning, none of these words can be less than 2 edits away from “zifs”… Why? Because we know that all of these words have at most one character difference from “pier”… So if “pier” is 3 away from “zifs”, then in the best case these other words would be one letter closer to “zifs” (e.g., if one of pier’s letters was replaced by one of zifs’ letters). For example, e(“zifs”,“pies”) = 2.

Imagine if we had thousands of different clouds like this. We could compare your misspelled word to the center word of each cloud. If e(m,w) is less than some threshold edit distance, then the cloud’s other words are good suggestions…

Metric Distance Functions

[Figure: the misspelled word zifs compared to the center word of each cloud. The clouds shown include pier (peer, tier, piper, pie, pies), gate (hate, rate, date, ate, gale), pencil, computer and table, with edit distances of 3, 5, 8, 5 and 4 marked between zifs and the centers.]

A Better Way?

That works well, but then again, we’d still have to do thousands of comparisons (one to each cloud)… Hmmm. Can we figure out a more efficient way to do this? Say with log₂(D) comparisons, where D is the number of words in your dictionary?

Duh… Well of course, we’ll need a tree!

The Metric Tree

The Metric Tree was invented in 1991 by Jeffrey Uhlmann of the Naval Research Labs. Each node in a Metric Tree holds a word, an edit distance threshold value and left and right next pointers.

    struct MetricTreeNode
    {
        string word;
        unsigned int editThreshold;
        MetricTreeNode *left, *right;
    };

Building one is really slow, but once we build it, searching it is really fast! Let’s see how to build a Metric Tree!

The Metric Tree

    main()
    {
        Let S = {every word in the dictionary};
        Node *root = buildMTree(S);
    }

Node *buildMTree(SetOfWords &S)
1. Pick a random word W from set S.
2. Compute the edit distance for all other words in set S to your random word W.
3. Sort all these words based on their edit distance d_i to your random word W.
4. Select the median value of d_i, let d_med be this median edit distance.
5. Now, create a root node N for our tree and put our word W in this node. Set its editThreshold value to d_med.
6. N->left = buildMTree(subset of S that is <= d_med)
7. N->right = buildMTree(subset of S that is > d_med)
8. return N

Example SetOfWords: goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster.

The Metric Tree

Let’s build the tree for SetOfWords = {goat, oyster, roster, hippo, toad, hamster, mouse, chicken, rooster}. Say the random word W is “rooster”. The edit distances from the other words to “rooster” are: goat 5, oyster 2, roster 1, hippo 7, toad 6, hamster 3, mouse 4, chicken 6.

Sorted by edit distance: roster 1, oyster 2, hamster 3, mouse 4, goat 5, toad 6, chicken 6, hippo 7. The median is d_med = 4, so the root node holds “rooster” with an editThreshold of 4.

The Metric Tree

Next, buildMTree recurses on the left subset, the words within 4 edits of “rooster”: roster, oyster, hamster and mouse. Here “mouse” is picked as W and d_med = 4, so the left child of “rooster” is the node “mouse” with an editThreshold of 4. Recursing further gives the nodes “oyster” 4, “roster” 0 and “hamster” 0.

The Metric Tree

The right subset, the words more than 4 edits from “rooster”, is goat, toad, hippo and chicken. Here “toad” is picked as W. The edit distances to “toad” are 2 (goat), 5 (hippo) and 7 (chicken), so d_med = 5 and the right child of “rooster” is the node “toad” with an editThreshold of 5. Recursing further gives the nodes “goat” 5, “hippo” 0 and “chicken” 0.


A Metric Tree

So now we have a metric tree! [Figure: the finished tree, with root “rooster” 4, left child “mouse” 4 above “oyster” 4, “roster” 0 and “hamster” 0, and right child “toad” 5 above “goat” 5, “hippo” 0 and “chicken” 0.]

How do we interpret it? Every word to the left of rooster is guaranteed to be within 4 edits of it… And every word to the right of rooster is guaranteed to be more than 4 edits away… And this same structure is repeated recursively!

Searching

When you search a metric tree, you specify the word you’re looking for and an edit-distance radius, e.g., I want to find words within 2 edits of “roaster”.

Starting at the root, there are three cases to consider:

1. Your word and its search radius are totally inside the edit threshold (e.g., “roaster” with a radius of 2). In this case, all of your matches are guaranteed to be in our left subtree…

Searching

2. Your word and its search radius are partially inside and partially outside the edit threshold (e.g., “goute” with a radius of 2). In this case, some matches will be in our left subtree and some in our right subtree…

Searching

3. Your word and its search radius are completely outside the edit threshold (e.g., “vhivken” with a radius of 2). In this case, all matches will be in our right subtree.

Metric Tree: Search Algorithm

    PrintMatches(Node *cur, string misspell, int rad)
    {
        if e(misspell,cur->word) <= rad then print the current word
        if e(misspell,cur->word) <= cur->editThreshold then PrintMatches(cur->left)
        if e(misspell,cur->word) > cur->editThreshold then PrintMatches(cur->right);
    }

*This is a slight simplification…

Let’s trace PrintMatches(root,“chomster”,2):

cur -> rooster: e(“chomster”,“rooster”) = 3. So rooster is outside of chomster’s radius of 2. It’s not a close enough match to print… Since 3 is less than our editThreshold of 4, let’s go left…

cur -> mouse: e(“chomster”,“mouse”) = 5. So mouse is outside of chomster’s radius of 2. It’s not a close enough match to print… Since 5 is greater than our editThreshold of 4, we won’t go left; we will go right.

cur -> hamster: e(“chomster”,“hamster”) = 2. So hamster is inside of chomster’s radius of 2. We’ve got a match! Print hamster!

Other Metric Tree Applications

In addition to spell checking, the Metric Tree can be used with virtually any application where the items obey metric rules! Pretty cool, huh?

Here’s the full search algorithm from the original paper (without my earlier simplifications):

    PrintMatches(Node *cur, string misspell, int rad)
    {
        if ( e(cur->word, misspell) <= rad )
            cout << cur->word;
        if ( e(cur->word, misspell) - rad <= cur->editThresh )
            PrintMatches(cur->left, misspell, rad);
        if ( e(cur->word, misspell) + rad >= cur->editThresh )
            PrintMatches(cur->right, misspell, rad);
    }

Challenge: Space-efficient Set Membership

There are many problems where we want to maintain a set S of items and then check if a new item X is in the set, e.g.:

“Is ‘carey nachenberg’ a student at UCLA?”
“Is the phone number ‘424-750-7519’ known to be used by a terrorist cell?”

So, what data structures could you use for this? Right! Both hash tables and binary search trees allow you to:
1. Hold a bunch of items.
2. Quickly search through them to see if they hold an item X.

So what’s the problem?

Well, binary search trees and hash tables are memory hogs! But if I JUST want to do two things:
1. Add new items to the set
2. Check if an item was previously added to a set

In other words, if I never need to:
1. Print the items of the set (after they’ve been added).
2. Enumerate each value in the set.
3. Erase items from the set.

Then we can do much better than our classic data structures! I can actually create a much more memory efficient data structure!

But first… A hash primer*

*Not that kind of hash.

A hash function is a function, y=f(x), that takes an input x (like a string) and returns an output number y for that input. The ideal hash function returns entirely different values for each different input, even if two inputs are almost identical:

    int y, z;
    y = idealHashFunction("carey");
    cout << y;
    z = idealHashFunction("cArey");
    cout << z;

So even though these two strings are almost identical, a good hash function might return y=92629 and z=152.

Hash Functions

    int hashFunc(const string &name) { int i, total=0; for (i=0;i …

[The rest of this slide’s code is cut off in the transcript.]

A Better Hash Function

The CRC or Cyclic Redundancy Check algorithm is an excellent hash function. This function was designed to check network packets for corruption. We won’t go into CRC’s details, but it’s a perfectly fine hashing algorithm…

Ok, so we have a good hash function, now what?

A Simple Set Membership Algorithm

Imagine that I know I want to store up to 1 million items in my set… I could create an array of say… 100 million bits. And then do the following…

    class SimpleSet
    {
    public:
        void insertItem(const string &name)
        {
            unsigned slot = CRC(SEED, name);
            slot = slot % 100000000;
            m_arr[slot] = 1;
        }
        bool isItemInSet(const string &name)
        {
            unsigned slot = CRC(SEED, name);
            slot = slot % 100000000;
            if (m_arr[slot] == 1) return(true);
            else return(false);
        }
        …
    private:
        BitArray m_arr[100000000];
    };

    main()
    {
        SimpleSet s;
        s.insertItem("Carey");
        s.insertItem("Flint");
        if (s.isItemInSet("Flint") == true)
            cout << "Flint's in my set!";
    }

For example, “Carey” hashes to 3000012131, so bit 12131 is set to 1. “Flint” lands in slot 9721, so bit 9721 is set to 1, and looking up “Flint” later finds that bit set.

Most hash functions require a seed (initialization) value to be passed in. Here’s how it might be used:

    unsigned CRC(unsigned seed, const string &s)
    {
        unsigned crc = seed;
        for (int i=0;i<s.length();i++)
            crc = ((crc >> 8) & CONST1) ^ crcTable[(crc ^ s[i]) & CONST2];
        return(crc);
    }

Typically you’d use a seed value of 0xFFFFFFFF with CRC. But you can change the seed if you like; this results in a (much) different hash value, even for the same input!

A Simple Set Membership Algorithm

Ok, so what’s the problem with our SimpleSet? Right! There’s a chance of collisions! What if two names happen to hash right to the same slot?

    main()
    {
        SimpleSet coolPeople;
        coolPeople.insertItem("Carey");
        if (coolPeople.isItemInSet("Paul"))
            cout << "Paul Agbabian is cool!";
    }

Here “Carey” hashes to 3000012131 and “Paul” hashes to 11000012131. Both land in slot 12131, so “Paul” is reported as being in the set even though he was never added.

A Simple Set Membership Algorithm

Ack! If we put 1 million items in our 100 million entry array… we’ll have a collision rate of about 1%! Actually, depending on your requirements, that might not be so bad…

A Simple Set Membership Algorithm

Our simple set can hold about 1M items in just 12.5MB of memory! While it does have some false positives, it’s much smaller than a hash table or binary search tree… But we’ve got to be able to do better… Right?

Right! That’s where the Bloom Filter comes in! The Bloom Filter was invented by Burton Bloom in 1970. Let’s take a look!

The Bloom Filter

In a Bloom Filter, we use an array of bits just like our original algorithm! But instead of just using 1 hash function and setting just one bit for each insertion… We use K hash functions, compute K hash values and set K bits!

    const int K = 4;

    class BloomFilter
    {
    public:
        void insertItem(const string &name)
        {
            for (int i=0;i< K ;i++)
            {
                unsigned slot = CRC(i, name);
                slot = slot % 100000000;
                m_arr[slot] = 1;
            }
        }
        …
    private:
        BitArray m_arr[100000000];
    };

    main()
    {
        BloomFilter coolPeople;
        coolPeople.insertItem("Preston");
    }

We’ll see how K is chosen in a bit. It’s a constant and its value is computed from:
1. The max # of items you want to add.
2. The size of the array.
3. Your desired false positive rate.

Notice that each time we call the CRC function, it starts with a different seed value. (Passing K different seed values is the same as using K different hash functions…)

[Figure: inserting “Preston” sets K = 4 bits in the array, e.g. the hash value 9000022531 maps to slot 22531.]

The Bloom Filter

Now to search, we do the same thing!

    bool isItemInSet(const string &name)
    {
        for (int i=0;i< K ;i++)
        {
            unsigned slot = CRC(i, name);
            slot = slot % 100000000;
            if (m_arr[slot] == 0) return(false);
        }
        return(true);
    }

Note: If any bit that we check is 0, then we have a miss… We only say an item is a member of the set if all K bits are set to 1.

    if (coolPeople.isItemInSet("Carey"))
        cout << "I figured…";

The Bloom Filter

Ok, so what’s the big deal? All we’re doing is checking K bits instead of 1?!!? Well, it turns out that this dramatically reduces the false positive rate!

Ok… So the only questions are, how do we choose:
1. The size of our bit-array?
2. The value of K?

Let’s see!

The Bloom Filter

If you want to store N items in your Bloom Filter… And you want a false positive rate of F (written as a fraction, so 0.1% is F = 0.001)… You’ll want to have M bits in your bit array:

$$M = N \cdot \frac{\log(F)}{\log(0.6185)}$$

And you’ll want to use K different hash functions:

$$K = 0.7 \cdot \frac{M}{N}$$

Let’s see some stats!

| To store N items | with this FP rate | use M bits (bytes) | and K hash fns |
|---|---|---|---|
| 1M | 0.1% | 14.4M bits (1.79MB) | 10 |
| 100M | 0.001% | 2.4B bits (299MB) | 17 |
| 100M | 0.00001% | 3.4B bits (419MB) | 23 |

Now you’ve got to admit, that’s pretty efficient! Of course, unlike a hash table, there is some chance of having a false positive… But for many projects, this is not an issue, especially if you can guarantee a certain maximum level of FPs!

Now that’s COOL! And you’ve (hopefully) never heard about it!

Challenge: Constant-time searching for similar items (in a high-dimensional space) Problem: I’ve got a large collection C of existing web-pages, and I want to determine if a new web-page P is a close match to any pages in my existing collection. Obvious approach: I could iterate through all C of my existing pages and do a pair-wise comparison of page P to each page. But that’s inefficient! So how can we do it faster?

Answer: Use Locality Sensitive Hashing!

LSH has two operations:

1. Inserting items into the hash table: We add a bunch of items (e.g., web pages) into a locality-sensitive hash table.
2. Given an item, find closely-related items in the hash table: Once we have a filled locality-sensitive hash table, we want to search it for a new item and see if it contains anything similar.

LSH, Operation #1: Insertion

Here’s the Insertion algorithm:

Step #1: Take each input item (e.g., a web-page) and convert it to a feature vector of size V.

What’s a feature vector? It’s a fixed-length array of floating point numbers that measure various attributes about each input item. For example, for an email:

    const int V = 6;
    float fv[V];
    fv[0] = # of times the word “free” was used in the email
    fv[1] = # of times the word “viagra” was used in the email
    fv[2] = # of exclamation marks used in the email
    fv[3] = The length of the email in words
    fv[4] = The average length of each word found in the email
    fv[5] = The ratio of punctuation marks to letters in the email

The items in the feature vector should be chosen to provide maximum differentiation between different categories of items (e.g., spam vs clean email)! The slide also shows an alternative last entry: fv[5] = # of times the word “the” was used in the email.

LSH, Operation #1: Insertion

Why compute a feature vector for each input item? The feature vector is a way of plotting each item into N-space.

Input #1: “Click here now for free viagra!!!!!” gives fv1 = {1, 1, 5, 6, 4.17, 0.2}
Input #2: “Please come to the meeting at 5pm.” gives fv2 = {0, 0, 1, 7, 3.71, 0.038}

[Figure: fv1 and fv2 plotted as points in N-space.]

In principle, items (e.g. emails) with similar content (i.e., similar feature vectors) should occupy similar regions of N-space.

LSH, Operation #1: Insertion

Step #2: Once you have a feature vector for each of your items, you determine the size of your hash table. “I’m going to need to hold 100 million email feature vectors, so I’ll want an open hash table of size N = 1 million.” Note: N must be a power of 2, e.g., 65536, or 1,048,576.

Wait! Why is our hash table smaller than the # of items we want to store? Because we want to put related items in the same bucket/slot of the table!

Step #3: Next compute the number of bits B required to represent N in binary. If N is 1 million (1,048,576), B will be log₂(1,048,576), or 20.

LSH, Operation #1: Insertion

Step #4: Now, create B (e.g., 20) RANDOM feature vectors that are the same dimension as your input feature vectors.

R1 = {.277, .891, 3, .32, 5.89, .136}
R2 = {2.143, .073, 0.3, 4.9, .58, .252}
…
R19 = {.8, .425, 6.43, 5.6, .197, 1.43}
R20 = {1.47, .256, 4.15, 5.6, .437, .075}

LSH, Operation #1: Insertion

What are these B random vectors for? Each of the B random vectors defines a hyper-plane in N-space! (Each hyper-plane is perpendicular to its random vector.) For example, in three dimensions: R1 = {1,0,1}, R2 = {0,0,3}, R3 = {0,2.5,0}.

If we have B such random vectors, we essentially chop up N-space with B possibly overlapping slices! So in our example, we’d have B=20 hyper-planes chopping up our V=6 dimensional space. (Chopping it up into up to 2^20 different regions!)

LSH, Operation #1: Insertion

Ok, let’s consider a single random vector, R1, and its hyper-plane for now. Now let’s consider a second vector, v1. If the tips of those two vectors are on the same side of R1’s hyper-plane, then the dot-product of the two vectors will be positive: R1 · v1 > 0.

On the other hand, if the tips of two vectors (say R1 and v2) are on opposite sides of R1’s hyper-plane, then the dot-product of the two vectors will be negative: R1 · v2 < 0.

So this is useful: if we compute the dot product of two vectors R and v, we can determine whether v is on the same side of R’s hyper-plane as R or on the opposite side.

LSH, Operation #1: Insertion

Step #5: Create an empty open hash table with 2^B buckets (e.g. 2^20 = 1M). Let’s label each bucket’s # using binary rather than decimal numbers: 000…0000, 000…0001, 000…0010, 000…0011, …, 111…1110, 111…1111. (You’ll see why soon.)

Step #6: For each item we want to add to our hash table… Take the feature vector for the item, e.g. {1, 1, 5, 6, 4.17, 0.2} for “Click here now for free viagra!!!!!”, and dot-product multiply it by every one of our B random-valued vectors…

R1 = {.277, .891, 3, .32, 5.89, .136}
R2 = {2.13, .07, 0.3, 4.9, .58, .252}
…
R19 = {.8, .45, 6.3, 5.6, .197, 1.43}
R20 = {1.7, .26, 4.15, 5.6, .47, .07}

This gives B dot products, e.g. -3.25, -1.73, …, .18, 5.24. Now convert every positive dot-product to a 1 and every negative dot-product into a 0: 0, 0, …, 1, 1.

This basically tells us whether our feature vector is on the same side or the opposite side of the hyper-plane of every one of our random vectors (here: opposite side of R1, opposite side of R2, …, same side as R19, same side as R20).

And if we concatenate the 1s and 0s, this gives us a B-digit (e.g., 20 digit) binary number, 00…11, which we can use to compute a bucket number in our hash table and store our item!

LSH, Operation #1: Insertion

Basically, every item in bucket 000…0000 will be on the opposite sides of the hyper-planes of all the random vectors. And every item in bucket 111…1111 will be on the same side of the hyper-planes of all the random vectors. And items in bucket 000…0001 will be on the same side as R20, but the opposite side of R1, R2 … R19.

So each bucket essentially represents one of the up to 2^20 different regions of N-space, as divided by the 20 random hyper-plane slices.

LSH, Operation #2: Searching

Searching for closely-related items is the same as inserting!

Step #1: Compute the feature vector for your item
Step #2: Dot-product multiply this vector by your B random vectors
Step #3: Convert all positive dot-products to 1, and all negative dot-products to 0
Step #4: Use the concatenated binary number to pick a bucket in your hash table

And voilà: you’ve located similar feature vectors/items!

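LSH, One Last Point…

Typically, we don’t just use one LSH hash table… But we use two or more, each with a different set of random vectors! Why? Two similar items can still land on opposite sides of one of a table’s hyper-planes and so end up in different buckets; with several independently chosen sets of random vectors, it is more likely that they share a bucket in at least one table.

Then, when searching for a new vector V, we take the union of all buckets that V hashes to, from all hash tables, to obtain a list of matches.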

Questions?