# Kleene Would Be Shocked: Redrawing the Link Between Theory and Modern Regex Engines. A Presentation by Ian Graham, Carnegie Mellon University, August 2, 2002



The March of Progress

1. Literal string search (exact substring)
2. Extended string search (character classes)
3. Regular expression matching
4. Approximate matching
5. “Extended” regular expression matching

Begin at the Beginning  The simplest case of a regular expression is a literal string search  Literal—any symbol in the alphabet  Literal string—a concatenation of literals  Literal string search—the problem of finding all occurrences of one literal string within another literal string (find “cad” in “abracadabra”)
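
As a baseline for the algorithms that follow, the literal string search problem can be sketched with a naive scan. The function name `find_all` is an assumption made for illustration; this is the O(mn) approach that the classical algorithms improve on, not an algorithm from the talk.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return every start index at which pattern occurs in text (overlaps included)."""
    m, n = len(pattern), len(text)
    # Try every alignment of the pattern against the text and compare directly.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_all("cad", "abracadabra"))   # [4]
print(find_all("abra", "abracadabra"))  # [0, 7]
```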

Quick Review  Knuth-Morris-Pratt (KMP) and Boyer-Moore (BM): two classical literal string search algorithms, each about 25 years old, used to achieve O(m+n) search performance, where m is the length of the search pattern and n is the length of the text to be searched

Quick Review  KMP scans from left to right, shifting by aligning the longest prefix of the search pattern which matches a suffix of the text scanned  BM scans from right to left along a window that shifts from left to right by choosing the largest shift amount from multiple shift rules
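
The KMP shift rule described above can be sketched as follows. This is a minimal textbook-style version, not code from the talk; the names `failure_table` and `kmp_search` are assumptions chosen for illustration.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern: str, text: str) -> list[int]:
    """Scan text left to right once; return the start index of every occurrence."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back to the longest pattern prefix that is
        # still a suffix of the text scanned so far.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```

Because the text index never moves backwards, the scan is O(n) after an O(m) preprocessing pass over the pattern.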

Practical Developments  For any alphabet size, there is always an algorithm which achieves better experimental results than KMP or BM.  The Horspool algorithm (1980) simplifies BM, using only the bad character shift rule instead of calculating multiple shift amounts and using the best

Practical Developments  Horspool is O(m+n) in the average case (assuming equal probability of all alphabet characters), O(mn) in the worst case  BM is O(m+n) average, O(m+n) worst  Evaluating multiple shift rules for BM greatly increases its runtime constant  Horspool is much faster in practice, and is extremely hard for any algorithm to beat over large alphabets
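
A minimal sketch of Horspool's simplification, keeping only the bad-character shift. The name `horspool_search` is an assumption, and the window is compared with a plain slice comparison rather than an explicit right-to-left loop, which keeps the sketch short without changing the shift logic.

```python
def horspool_search(pattern: str, text: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for each character in pattern[:-1], the distance from
    # its rightmost occurrence to the last pattern position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the text character under the last
        # pattern position; characters absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

With a large alphabet most text characters are absent from the table, so the window usually jumps by the full pattern length.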

Bit-Parallelism  Recent algorithms (from 1992 onward) create nondeterministic automata (NFAs) to keep track of each possible match along the length of the pattern  States of these NFAs are mapped to bits in a machine word, and transitions are simulated using the parallelism of bitwise operations

Bit-Parallelism  Possible matches may be represented by “1”s, and proceed in parallel along the pattern until they reach the end, indicating a match

Bit-Parallelism  Savings due to parallelism depend on the word size  Bit-parallel algorithms often perform well only for patterns whose length is close to or less than the word size

Bit-Parallelism  Most analysis assumes constant word size, either 32 or 64 bits  Savings under this assumption are constant, but result in extremely good performance for practical applications
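
The bit-parallel idea can be illustrated with the Shift-And algorithm, one of the simplest members of this family. The sketch below is an illustration under assumptions: the name `shift_and_search` is invented here, and Python integers are unbounded, so the word-size limit discussed above does not show up in this code as it would in C.

```python
def shift_and_search(pattern: str, text: str) -> list[int]:
    """Bit-parallel Shift-And search; returns the start index of every occurrence."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set when pattern[i] == ch.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0  # bit i set means pattern[:i+1] matches a suffix of the text read so far
    hits = []
    for j, ch in enumerate(text):
        # Advance every partial match by one position, start a new one at bit 0,
        # then keep only those whose next pattern character equals ch.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

Each text character costs one shift, one OR, and one AND, regardless of how many partial matches are alive at once.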

A Wrench  Let a “character class” be an item which matches a single character from a range or explicit list.  Examples: [0-9] matches any digit; [Aa] matches A or a; [A-Za-z] matches any English letter

A Wrench  Let an “extended string” be a literal string with the additional property that it may contain character classes in place of literals.  Examples: “abc[de]f” matches “abcdf” or “abcef”; “[Aa][Nn][Ee][Uu][Rr][Ii][Ss][Mm]” matches “aneurism”, “ANEURISM”, “aNeUrIsM”…
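
One way to see why bit-parallel methods cope well with extended strings: a character class only changes which mask bits are set, not the scanning loop. The sketch below is a plain Shift-And search extended with classes. It is not the Navarro and Raffinot algorithm, and the helper names `parse_extended` and `extended_search` are assumptions. The parser handles only literals, explicit lists, and simple ranges.

```python
def parse_extended(pattern: str) -> list[set[str]]:
    """Split an extended string into one set of allowed characters per position."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            close = pattern.index("]", i)
            body = pattern[i + 1:close]
            chars: set[str] = set()
            k = 0
            while k < len(body):
                if k + 2 < len(body) and body[k + 1] == "-":
                    # A range such as 0-9 or A-Z.
                    chars.update(chr(c) for c in range(ord(body[k]), ord(body[k + 2]) + 1))
                    k += 3
                else:
                    chars.add(body[k])
                    k += 1
            positions.append(chars)
            i = close + 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def extended_search(pattern: str, text: str) -> list[int]:
    """Shift-And over an extended string; returns the start index of every match."""
    positions = parse_extended(pattern)
    m = len(positions)
    if m == 0:
        return []
    masks: dict[str, int] = {}
    for i, chars in enumerate(positions):
        for ch in chars:
            masks[ch] = masks.get(ch, 0) | (1 << i)  # ch is allowed at position i
    accept = 1 << (m - 1)
    state, hits = 0, []
    for j, ch in enumerate(text):
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits
```

For example, `extended_search("abc[de]f", "xabcdf abcef abcxf")` reports matches starting at indices 1 and 7.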

A Wrench  Moving from literal string searches to extended string searches confounds many algorithms  Horspool may be extended, but its performance suffers greatly  Boyer-Moore may also be extended, and performs better than other well-known extensions

Bit-Parallelism on Top?  A recent (Navarro and Raffinot, 1998) bit-parallel algorithm claims to be 10-40% faster than any known variant of BM  Appears to be the fastest algorithm given a moderate-sized alphabet (e.g. English) and moderate pattern sizes (5-110 characters)

What is a Regular Expression?  Says Stephen Kleene: “A notation to describe regular languages.” “A description of the behavior of a finite state machine.” “Regular.”

A Familiar Definition

1. a, for some a in the alphabet Σ
2. ε
3. the null language
4. R1 ∪ R2 (R1, R2 regular languages)
5. R1 ◦ R2 (R1, R2 regular languages)
6. R1* (R1 regular language)

Efficiently Matching Regular Expressions  Attempts to extend classical literal search algorithms to process regular expressions have largely been fruitless  Efficient algorithms involve clever ways of simulating an NFA equivalent of the regular expression

Efficiently Matching Regular Expressions  For small to moderate pattern sizes, optimizations using bit-parallelism appear to result in the fastest algorithms (Navarro, Raffinot)  For large pattern sizes (greater than about 4 times the word size), partial conversion from NFA to DFA results in good performance
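
To make "simulating an NFA equivalent of the regular expression" concrete, here is a small sketch that builds a Thompson-style NFA from a hand-written syntax tree and simulates it by tracking the set of active states. It tests whether the whole text belongs to the language rather than searching inside it, and it uses plain Python sets instead of the bit-parallel or partial-DFA optimizations mentioned above. The tuple encoding and the function names are assumptions made for illustration.

```python
from itertools import count


def build_nfa(node, trans, eps, fresh):
    """Thompson-style construction; returns (start, accept) for the given AST node.

    node is one of ("lit", ch), ("cat", a, b), ("alt", a, b), ("star", a).
    """
    start, accept = next(fresh), next(fresh)
    kind = node[0]
    if kind == "lit":
        trans.setdefault((start, node[1]), set()).add(accept)
    elif kind == "cat":
        s1, a1 = build_nfa(node[1], trans, eps, fresh)
        s2, a2 = build_nfa(node[2], trans, eps, fresh)
        eps.setdefault(start, set()).add(s1)
        eps.setdefault(a1, set()).add(s2)
        eps.setdefault(a2, set()).add(accept)
    elif kind == "alt":
        for child in node[1:]:
            s, a = build_nfa(child, trans, eps, fresh)
            eps.setdefault(start, set()).add(s)
            eps.setdefault(a, set()).add(accept)
    elif kind == "star":
        s, a = build_nfa(node[1], trans, eps, fresh)
        eps.setdefault(start, set()).update({s, accept})  # enter the loop or skip it
        eps.setdefault(a, set()).update({s, accept})      # repeat the loop or leave it
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return start, accept


def closure(states, eps):
    """All states reachable from the given states through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in eps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nfa_match(node, text):
    """True if the whole text is in the language described by the AST node."""
    trans, eps = {}, {}
    start, accept = build_nfa(node, trans, eps, count())
    current = closure({start}, eps)
    for ch in text:
        moved = set()
        for s in current:
            moved |= trans.get((s, ch), set())
        current = closure(moved, eps)
    return accept in current


# (a|b)*abb written as a syntax tree
ABB = ("cat", ("star", ("alt", ("lit", "a"), ("lit", "b"))),
       ("cat", ("lit", "a"), ("cat", ("lit", "b"), ("lit", "b"))))
print(nfa_match(ABB, "babb"))  # True
print(nfa_match(ABB, "abab"))  # False
```

The set of active states never exceeds the number of NFA states, so the running time stays polynomial in the pattern and text lengths, with no backtracking.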

Where can we go from here?  Approximate matching: match a literal string to within some “difference”. Edit distance is commonly used; the rules are much more complex for computational biology applications  Extensions to regular expressions, which are used by most languages and applications
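
For reference, the edit distance mentioned above can be computed with the standard dynamic-programming recurrence. This sketch assumes unit costs for insertion, deletion, and substitution (Levenshtein distance); the function name `edit_distance` is chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit-cost edits."""
    # prev[j] holds the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```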

Where can we go from here?  Efficiently handling regular expressions and approximate matching are problems in much of today’s research  Flexible Pattern Matching in Strings, by Navarro and Raffinot, referenced here, was published June 15, 2002

What is a Regular Expression?  Say modern developers: a pattern that can be matched against a string; not necessarily a model of any particular machine; not necessarily (and not usually) regular; a very powerful tool for solving text-based problems

Who uses regular expressions?  Where to find built-in “regular expression” support today? awk, grep, sed, vi, emacs, find, more, less, lex, Perl, Ruby, Tcl, MySQL, JavaScript, PHP, Python, Java, Microsoft .NET, and many, many more  Built-in support has become more frequent and more advanced in the past few years

Irregular Regular Expressions?  Matching against the patterns accepted by most popular “regular expression” engines is NP-hard  Constructing a “regular expression” in Perl which matches representations of 3-colorable graphs is fairly straightforward

Irregular Regular Expressions?  Perl “regular expression” which matches any 3-colorable graph, given a number of vertices $V and an edge list @E:

```perl
$string = (join "\n", (("rgb") x $V))
        . "\n:\n"
        . join "\n", (("rgbrbgr") x @E) ;
$regex = '^'
       . (join "\\n", (".*(.).*") x $V)
       . "\\n:\\n"
       . (join "\\n", map {".*\\$_->[0]\\$_->[1].*"} @E)
       . '$' ;
```

The graph is 3-colorable iff $regex matches $string (http://perl.plover.com/NPC/NPC-3COL.html)
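
The same reduction can be sketched with Python's `re` module, which also supports numbered backreferences. This is an illustrative translation rather than the code from the page cited above; the function name `three_colorable` is an assumption, vertices are assumed to be numbered from 1, and Python's `\number` syntax limits the sketch to at most 99 vertices.

```python
import re


def three_colorable(num_vertices: int, edges: list[tuple[int, int]]) -> bool:
    """Decide 3-colorability by matching a backreference regex (exponential time)."""
    # One "rgb" line per vertex, a separator, then one "rgbrbgr" line per edge.
    string = ("\n".join(["rgb"] * num_vertices)
              + "\n:\n"
              + "\n".join(["rgbrbgr"] * len(edges)))
    # Group i captures the colour chosen for vertex i. Each edge line demands that
    # the two endpoint colours appear side by side in "rgbrbgr", which contains
    # every ordered pair of distinct colours and no repeated pair.
    regex = ("^"
             + r"\n".join([".*(.).*"] * num_vertices)
             + r"\n:\n"
             + r"\n".join(".*\\{}\\{}.*".format(u, v) for u, v in edges)
             + "$")
    return re.match(regex, string) is not None


triangle = [(1, 2), (2, 3), (1, 3)]
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(three_colorable(3, triangle))  # True
print(three_colorable(4, k4))        # False
```

The engine finds a colouring by backtracking over the captured groups, which is why the running time can grow exponentially with the number of vertices.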

Irregular Regular Expressions?  Usage of the term “regular expression” in modern development conflicts with its theoretical definition  Many are unaware of or ignore this conflict, while others choose different terminology: “extended regular expression” or “regex”

Clear Definitions  Regular expression—a description of a regular language, as defined by Kleene  Regex—any pattern matched against a string, not necessarily regular

The Main Culprit  Backreferences: the ability to refer to text that has been matched in a previous part of the regex  Typically expressed as \n, where n is a number, referring to the text matched by the regex inside the nth set of parentheses  “(.*)\1\1” matches “abcabcabc”, “abaabaaba”...  “\b(\w+)\b\s+\b\1\b” matches “the the”, “a a”… (double words)
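
The two example patterns can be tried directly in Python's `re` module, which uses the same `\n` backreference syntax. The helper names `is_triple` and `find_double_words` are assumptions made for illustration.

```python
import re

TRIPLE = re.compile(r"(.*)\1\1")
DOUBLE_WORD = re.compile(r"\b(\w+)\b\s+\b\1\b")


def is_triple(text: str) -> bool:
    """True if the whole text is some string repeated exactly three times."""
    return TRIPLE.fullmatch(text) is not None


def find_double_words(text: str) -> list[str]:
    """Return each word that is immediately repeated, such as "the the"."""
    return [m.group(1) for m in DOUBLE_WORD.finditer(text)]


print(is_triple("abcabcabc"))                        # True
print(is_triple("abcabcab"))                         # False
print(find_double_words("Paris in the the spring"))  # ['the']
```

The language matched by `is_triple`, every string of the form www, is not regular, which is the gap between regexes and Kleene's regular expressions in miniature.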

Backreferences  Supported in limited number by vi, sed, grep, emacs, Ruby, Python, PHP  POSIX standard for Basic Regular Expressions includes capability to process nine backreferences  Bounding the number available places a bound on performance

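Backreferences  Supported without quantity bounds by Perl 5 and later, Tcl, Java 1.4, .NET  Number of backreferences limited only by physical memory restrictions

| Language | vi | sed | grep | emacs | Ruby | PHP | Perl | Tcl | Java | .NET |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Backreferences supported | 9 | 9 | 9 | 9 | 10 | 9 | ∞ | ∞ | ∞ | ∞ |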

Backreferences  Slow: processing a regex becomes NP-hard (for an unbounded number of backreferences)  Extremely useful: they add a great deal of expressive power to a regex  Largely untouched by theoretical analysis, with no real bounds on efficiency

Lookahead  Also known as “zero-width matching”  Ability to check text ahead without consuming it in a match  Typically expressed as (?=text)  Example: “abc(?=def)” will match “abc”, but only if followed by “def”

| Language | sed | grep | emacs | lex | Ruby | PHP | Perl | Tcl | Java | .NET |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lookahead support? | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
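
A short Python illustration of the zero-width behaviour, using the `(?=...)` lookahead syntax that Python shares with Perl. The helper name `match_span` is an assumption.

```python
import re

ABC_BEFORE_DEF = re.compile(r"abc(?=def)")


def match_span(text: str):
    """Return the (start, end) span of the first match, or None if there is none."""
    m = ABC_BEFORE_DEF.search(text)
    return m.span() if m else None


print(match_span("abcdef"))  # (0, 3): "def" is checked but not consumed
print(match_span("abcxyz"))  # None
```

The reported span covers only “abc”: the lookahead constrains where a match may occur without becoming part of the matched text.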

Thank Larry Wall  Perl 5 regexes offer the ability to embed code within a regex  Perl 6 will support recursive regexes

Why the divide?  Very little theory has touched on extended regular expressions.  Backreferences are indispensable for many programmers, and often even for non-development use of *NIX systems

Why the divide?  Developers implemented regular expression processors shortly after Kleene created regular expressions in the 1950s

Why the divide?  New and more powerful features were quickly added to practical “regular expressions” so that users and programmers could express more languages  Regexes soon left theory in the dust

Moral of the Story  It’s much easier to hack than to make a good proof

The Future  Unbounded backreferences are becoming a standard feature in regex libraries and languages  The idea of implementing regexes in a common module and sharing it among different languages and platforms is growing in popularity: PCRE (Perl Compatible Regular Expressions) is used by Python, PHP, Apache, KDE…

The Future  Regex implementations seem to be moving towards more standardization  Meanwhile, a solid theoretical foundation has been laid down for regular expressions and modest extensions  Practice may not come to theory, but theory may soon come to practice

