Approximate string matching Evlogi Hristov Telerik Corporation Student at Telerik Academy.


1 Approximate string matching Evlogi Hristov Telerik Corporation Student at Telerik Academy

2
1. Levenshtein distance
2. Bitap overview
3. Bitap exact search
4. Bitap fuzzy search
5. Additional information

3 Edit distance

4 Edit distance: the number of primitive operations needed to convert one string into an exact match of another.
• insertion: cot → coat
• deletion: coat → cot
• substitution: coat → cost

Example:
1. Set n to be the length of s = "GUMBO" and m to be the length of t = "GAMBOL". If n = 0, return m and exit. If m = 0, return n and exit.

5 Completed matrix for s = "GUMBO" (columns) and t = "GAMBOL" (rows):

            G  U  M  B  O
         0  1  2  3  4  5
      G  1  0  1  2  3  4
      A  2  1  1  2  3  4
      M  3  2  2  1  2  3
      B  4  3  3  2  1  2
      O  5  4  4  3  2  1
      L  6  5  5  4  3  2

2. Initialize matrix M [m + 1, n + 1]: the first row holds 0..n and the first column holds 0..m.
3. Examine each character of s (i from 1 to n).
4. Examine each character of t (j from 1 to m).
5. If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1.
6. Set cell M[j, i] equal to the minimum of:
   a. The cell immediately above plus 1: M [j-1, i] + 1
   b. The cell immediately to the left plus 1: M [j, i-1] + 1
   c. The cell diagonally above and to the left plus the cost: M [j-1, i-1] + cost
7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the bottom-right cell M [m, n]. For "GUMBO" and "GAMBOL" it is 2 (substitute U → A, insert L).

6
    private int Levenshtein(string source, string target)
    {
        if (string.IsNullOrEmpty(source))
        {
            if (!string.IsNullOrEmpty(target))
            {
                return target.Length;
            }
            return 0;
        }
        if (string.IsNullOrEmpty(target))
        {
            if (!string.IsNullOrEmpty(source))
            {
                return source.Length;
            }
            return 0;
        }

        int[,] dist = new int[source.Length + 1, target.Length + 1];
        int min1, min2, min3, cost;
        // ...continues on the next slide

7
        // First column: distance from each prefix of source to the empty string
        for (int i = 0; i < dist.GetLength(0); i += 1)
        {
            dist[i, 0] = i;
        }
        // First row: distance from the empty string to each prefix of target
        for (int i = 0; i < dist.GetLength(1); i += 1)
        {
            dist[0, i] = i;
        }
        for (int i = 1; i < dist.GetLength(0); i++)
        {
            for (int j = 1; j < dist.GetLength(1); j++)
            {
                cost = Convert.ToInt32(!(source[i - 1] == target[j - 1]));
                min1 = dist[i - 1, j] + 1;
                min2 = dist[i, j - 1] + 1;
                min3 = dist[i - 1, j - 1] + cost;
                dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }
        return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
    }
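Each cell of the matrix depends only on the row above it and on the cell to its left, so the full matrix is not strictly needed. The following is a minimal sketch of the same recurrence that keeps two rows; the class name EditDistance and the method name LevenshteinTwoRows are made up for this illustration and do not come from the slides.

```csharp
using System;

public static class EditDistance
{
    // Levenshtein distance with O(target.Length) memory:
    // only the previous and the current matrix rows are stored.
    public static int LevenshteinTwoRows(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: distance from the empty string to each prefix of target.
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // distance from source[0..i) to the empty string
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The current row becomes the previous row for the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[target.Length];
    }
}
```

With the strings from the example, EditDistance.LevenshteinTwoRows("GUMBO", "GAMBOL") returns 2.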

8 Shift-or / shift-and

9
• Also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.
• An approximate string matching algorithm.
• Approximate equality is defined in terms of Levenshtein distance.
• Often used for fuzzy search without indexing.
• Does most of the work with bitwise operations.
• Runs in O(mn) operations (m is the pattern length, n is the text length), no matter the structure of the text or the pattern.

10
    public static List<int> ExactMatch(string text, string pattern)
    {
        // One bit mask per character; works for patterns shorter than 63 characters
        long[] alphabet = new long[128]; // ASCII range (0 – 127)
        for (int i = 0; i < pattern.Length; ++i)
        {
            int letter = (int)pattern[i];
            alphabet[letter] = alphabet[letter] | (1L << i); // 1L keeps the shift in 64 bits
        }

        long result = 1; // 0000 0001
        List<int> indexes = new List<int>();
        for (int index = 0; index < text.Length; index++)
        {
            // Keep only the prefixes that the current character extends;
            // if none does, result becomes 0
            result &= alphabet[text[index]];
            result = (result << 1) + 1;
            // Bit pattern.Length is set when the whole pattern has matched
            if ((result & (1L << pattern.Length)) > 0)
            {
                indexes.Add(index - pattern.Length + 1);
            }
        }
        return indexes;
    }

11 Example: text = cbdabababc, pattern = ababc

Pattern positions 0 1 2 3 4 hold a b a b c. Bits are written from position 4 down to position 0 (43210), so the pattern reads c b a b a.

    alphabet[a] = 00101 = 5
    alphabet[b] = 01010 = 10
    alphabet[c] = 10000 = 16
    alphabet[d] = 00000 = 0

Value of res after res &= alphabet[text[i]] (start res: 00001):

    text read (last 5)   res
    c                    00000
    cb                   00000
    cbd                  00000
    cbda                 00001
    cbdab                00010
    bdaba                00101
    dabab                01010
    ababa                00101
    babab                01010
    ababc                10000   (bit 4 set: match)
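The trace can be reproduced mechanically. Below is a small sketch, assuming ASCII input and a pattern shorter than 63 characters; the class BitapTrace and its Trace method are hypothetical helpers written for this illustration. It returns, for every text character, the value of res right after the AND step, written as pattern.Length binary digits.

```csharp
using System;
using System.Collections.Generic;

public static class BitapTrace
{
    // One entry per text character: "<char> <res in binary>",
    // where res is taken right after res &= alphabet[char].
    public static List<string> Trace(string text, string pattern)
    {
        long[] alphabet = new long[128]; // assumption: ASCII characters only
        for (int i = 0; i < pattern.Length; i++)
        {
            alphabet[pattern[i]] |= 1L << i;
        }

        var lines = new List<string>();
        long res = 1;
        foreach (char c in text)
        {
            res &= alphabet[c];
            string bits = Convert.ToString(res, 2).PadLeft(pattern.Length, '0');
            lines.Add(c + " " + bits);
            res = (res << 1) | 1; // open position 0 again for the next character
        }
        return lines;
    }
}
```

For example, BitapTrace.Trace("cbdabababc", "ababc") yields "c 00000" as its first entry and "c 10000" as its last, in line with the example.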

12
    ...
    long[] result = new long[k + 1];
    for (int i = 0; i <= k; i++)
    {
        result[i] = 1;
    }
    ...
    for (int j = 1; j <= k; ++j)
    {
        // Three operations of the Levenshtein distance
        long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
        long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
        long substitution = (previous | (result[j] & patternMask[text[i]])) << 1;
        current = result[j];
        result[j] = substitution | insertion | deletion | 1;
        previous = result[j];
    }
    ...

• Instead of a single bit vector result that changes as the text is scanned, we now keep result[0] for the exact match plus k additional ones, result[1..k]; result[j] tracks matches with at most j errors.
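For a complete picture, here is a self-contained sketch of fuzzy search in the shift-and style. It follows the commonly described Wu–Manber formulation rather than the exact fragment on the slide, so the update rules differ in detail. The names BitapFuzzy and FuzzyMatchEnds are invented for this example, and it assumes ASCII text, a pattern of 1 to 63 characters, and 0 <= k < pattern.Length. Bit i of result[j] is set when pattern[0..i] matches some substring ending at the current text position with at most j errors.

```csharp
using System;
using System.Collections.Generic;

public static class BitapFuzzy
{
    // Returns the text indexes at which a match with at most k errors ends.
    public static List<int> FuzzyMatchEnds(string text, string pattern, int k)
    {
        if (pattern.Length == 0 || pattern.Length > 63)
            throw new ArgumentException("pattern must have 1 to 63 characters");
        if (k < 0 || k >= pattern.Length)
            throw new ArgumentException("k must satisfy 0 <= k < pattern.Length");

        long[] patternMask = new long[128]; // assumption: ASCII pattern
        for (int i = 0; i < pattern.Length; i++)
        {
            patternMask[pattern[i]] |= 1L << i;
        }

        // Before any text is read, a prefix of length j matches the empty
        // string with j deletions, so result[j] starts with its low j bits set.
        long[] result = new long[k + 1];
        for (int j = 0; j <= k; j++)
        {
            result[j] = (1L << j) - 1;
        }

        long matchBit = 1L << (pattern.Length - 1);
        var ends = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            long mask = text[i] < 128 ? patternMask[text[i]] : 0;

            long oldLower = result[0]; // result[j - 1] before this character
            result[0] = ((result[0] << 1) | 1) & mask;

            for (int j = 1; j <= k; j++)
            {
                long old = result[j];
                long match = ((old << 1) | 1) & mask;       // character matches
                long insertion = oldLower;                  // extra character in the text
                long substitution = (oldLower << 1) | 1;    // character replaced
                long deletion = (result[j - 1] << 1) | 1;   // pattern character missing
                result[j] = match | insertion | substitution | deletion;
                oldLower = old;
            }

            if ((result[k] & matchBit) != 0)
            {
                ends.Add(i);
            }
        }
        return ends;
    }
}
```

As a usage example, BitapFuzzy.FuzzyMatchEnds("xabdx", "abc", 1) reports the end positions 2 and 3: "ab" is one deletion away from "abc", and "abd" is one substitution away. With k = 0 the method behaves like an exact search that reports end positions.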

13
• Shift-and:
  • Uses bitwise & and 1s for matches
  • More intuitive and easier to understand
  • Needs the extra step result |= 1
• Shift-or:
  • Uses bitwise | and 0s for matches
  • A bit faster
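To make the difference concrete, here is a sketch of exact search in the shift-or style, assuming ASCII text and a pattern of 1 to 63 characters; the class name BitapShiftOr is made up for this illustration. The masks are inverted (a 0 bit marks a matching position), the state starts as all 1s, and the shift itself brings in the 0 that shift-and has to add with |= 1.

```csharp
using System;
using System.Collections.Generic;

public static class BitapShiftOr
{
    // Returns the start indexes of all exact occurrences of pattern in text.
    public static List<int> ExactMatch(string text, string pattern)
    {
        if (pattern.Length == 0 || pattern.Length > 63)
            throw new ArgumentException("pattern must have 1 to 63 characters");

        // Inverted masks: bit i is 0 where pattern[i] equals the character.
        long[] mask = new long[128]; // assumption: ASCII pattern
        for (int c = 0; c < mask.Length; c++)
        {
            mask[c] = ~0L;
        }
        for (int i = 0; i < pattern.Length; i++)
        {
            mask[pattern[i]] &= ~(1L << i);
        }

        long state = ~0L; // all 1s: nothing matched yet
        long matchBit = 1L << (pattern.Length - 1);
        var indexes = new List<int>();

        for (int index = 0; index < text.Length; index++)
        {
            long charMask = text[index] < 128 ? mask[text[index]] : ~0L;
            // The left shift inserts a 0 at bit 0, so no extra "| 1" style step is needed.
            state = (state << 1) | charMask;
            if ((state & matchBit) == 0)
            {
                indexes.Add(index - pattern.Length + 1);
            }
        }
        return indexes;
    }
}
```

For the earlier example, BitapShiftOr.ExactMatch("cbdabababc", "ababc") returns a single start index, 5.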

14 http://algoacademy.telerik.com

15
• Original paper of Baeza-Yates and Gonnet:
  http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f08/Artikler/09/Baeza92.pdf
• Google implementation using bitap:
  https://code.google.com/p/google-diff-match-patch
• Levenshtein algorithm:
  http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
  http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

16
• C# Programming @ Telerik Academy: csharpfundamentals.telerik.com
• Telerik Software Academy: academy.telerik.com
• Telerik Academy @ Facebook: facebook.com/TelerikAcademy
• Telerik Software Academy Forums: forums.academy.telerik.com

