LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15
Administrivia reminder –optional homework exercises (from lecture 5) –due tomorrow (usual rules apply) –for those of you who missed one or more questions on homework 1
Administrivia homework 2 –out next week –requires access to Microsoft Word –or an alternative Open Office (free download, see openoffice.org)
Today’s Topic Regular Expressions (RE)
Regular Expressions
(formally) equivalent to
–finite state automata (FSA), and
–regular grammars
used in
–string pattern matching, typically for a single word form
–search text: Unix (e)grep, Perl, Microsoft Word
caution:
–differences in notation and implementation
[Diagram: Regular Grammars, FSA, Regular Expressions]
Regular Expressions
shorthand for describing sets of strings
String
–sequence of zero or more characters
–(typically, unbroken by spaces)
Examples
–aaa
–john
–mary45
–NT$
– (empty string)
Regular Expressions
–shorthand
string^n
–exactly n occurrences of string
–n = 0, 1, 2, 3, ...
examples
–a^4 b^3 = aaaabbb
–(uv)^2 = uvuv
–((ab)^2 (ba)^2)^2 = ababbabaababbaba
Note:
–parentheses are used to group sequences of characters (strings)
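The exponent shorthand corresponds to plain string repetition. A minimal sketch in Python, assuming a hypothetical helper named `power` (not part of the lecture), that reproduces the three examples above:

```python
def power(s: str, n: int) -> str:
    """Return s concatenated with itself exactly n times (n >= 0)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return s * n

# The three examples from the slide
print(power("a", 4) + power("b", 3))               # aaaabbb
print(power("uv", 2))                              # uvuv
print(power(power("ab", 2) + power("ba", 2), 2))   # ababbabaababbaba

# n = 0 gives the empty string
print(repr(power("uv", 0)))                        # ''
```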
Regular Expressions
shorthand for describing sets of strings
string^+
–set of one or more occurrences of string
–i.e. the set {string^1, string^2, string^3, ...}
–Note: set is infinite
examples
–a^+ = {a, aa, aaa, aaaa, aaaaa, …}
–(abc)^+ = {abc, abcabc, abcabcabc, …}
Regular Expressions
shorthand for describing sets of strings
string^*
–set of zero or more occurrences of string
–i.e. the set {string^0, string^1, string^2, string^3, ...}
–string^0 = the empty string
examples
–a^* = {(empty string), a, aa, aaa, aaaa, …}
–(abc)^* = {(empty string), abc, abcabc, …}
Note:
–a a^* = a^+
–a {(empty string), a, aa, aaa, aaaa, …} = {a, aa, aaa, aaaa, aaaaa, …}
Language = a set of strings
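Because these sets are infinite, a program can only enumerate a finite slice of them. The sketch below assumes two hypothetical helpers, `plus` and `star`, that cut the sets off at a chosen repetition count, and uses them to check the identity a a* = a+ on that slice:

```python
def plus(s: str, max_n: int) -> set:
    """Finite slice of s+ : s repeated 1..max_n times."""
    return {s * n for n in range(1, max_n + 1)}

def star(s: str, max_n: int) -> set:
    """Finite slice of s* : s repeated 0..max_n times (includes the empty string)."""
    return {s * n for n in range(0, max_n + 1)}

# a a* = a+ : putting one more 'a' in front of every member of a* gives a+
lhs = {"a" + w for w in star("a", 4)}
rhs = plus("a", 5)
print(lhs == rhs)                        # True

print(sorted(star("abc", 2), key=len))   # ['', 'abc', 'abcabc']
```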
Regular Expressions
Wildcard Characters
–match a range of characters
. (period)
–matches any single character
examples
–.+ed = set of all strings of length 3 or greater containing ed and having at least one character preceding it
–matches: worked, bed, pre-education
–does not match: ed, education
–.*fix = set of all strings of length 3 or greater containing fix
–matches: prefix, infix, infixed, suffix, fix
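The period examples can be tried out with Python's `re` module, whose notation for `.`, `+` and `*` agrees with the slide. This is a sketch only; the variable names are illustrative, and `search` is used so that a pattern may match anywhere inside the word:

```python
import re

ed = re.compile(r".+ed")    # at least one character, then "ed"
fix = re.compile(r".*fix")  # zero or more characters, then "fix"

for word in ["worked", "bed", "pre-education", "ed", "education"]:
    print(word, bool(ed.search(word)))

for word in ["prefix", "infix", "infixed", "suffix", "fix"]:
    print(word, bool(fix.search(word)))
```

The first loop reports a match for worked, bed and pre-education only, since ed and education have nothing in front of their ed. Every word in the second loop matches.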
Regular Expressions
Wildcard Characters
–match a range of characters
[characters] (list of matching characters)
–matches any single character in the list
examples
–[sz]ation matches inside organization and organisation
–Note: the list is written without separators; a comma inside the brackets would itself count as a matching character
–[a-z] any character in the range lowercase a to z (Note: not uppercase)
–[0-9] any digit
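A short sketch of bracketed character lists in Python's `re` module (test words other than those on the slide are made up for illustration):

```python
import re

# The alternatives are listed with no separator between them
suffix = re.compile(r"[sz]ation")
print(bool(suffix.search("organization")))   # True
print(bool(suffix.search("organisation")))   # True
print(bool(suffix.search("organication")))   # False

# Ranges: [a-z] is lowercase only, [0-9] is any digit
print(bool(re.fullmatch(r"[a-z]", "q")))     # True
print(bool(re.fullmatch(r"[a-z]", "Q")))     # False
print(re.findall(r"[0-9]", "mary45"))        # ['4', '5']

# A comma inside the brackets is just another listed character
print(bool(re.fullmatch(r"[s,z]", ",")))     # True
```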
Regular Expressions: grep
excerpts from the manpage
–The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.
–The symbol \b matches the empty string at the edge of a word.
–The symbols \< and \> respectively match the empty string at the beginning and end of a word.
terminology
–word: unbroken sequence of digits, underscores and letters
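The anchors can be illustrated in Python, with one caveat: Python's `re` module supports `^`, `$` and `\b`, but it has no counterpart to grep's separate word-start and word-end anchors `\<` and `\>`, so `\b` stands in for both edges here. The sample lines are invented for the demonstration:

```python
import re

lines = ["bed and breakfast", "a flower bed", "bedroom", "embedded"]

# ^ anchors at the beginning of the line, $ at the end
starts = [s for s in lines if re.search(r"^bed", s)]
ends = [s for s in lines if re.search(r"bed$", s)]

# \b on both sides: "bed" as a whole word only
whole = [s for s in lines if re.search(r"\bbed\b", s)]

print(starts)  # ['bed and breakfast', 'bedroom']
print(ends)    # ['a flower bed']
print(whole)   # ['bed and breakfast', 'a flower bed']
```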
Regular Expressions: grep
Excerpts from the manpage
–A regular expression may be followed by one of several repetition operators:
  ?      The preceding item is optional and matched at most once.
  *      The preceding item will be matched zero or more times.
  +      The preceding item will be matched one or more times.
  {n}    The preceding item is matched exactly n times.
  {n,}   The preceding item is matched n or more times.
  {n,m}  The preceding item is matched at least n times, but not more than m times.
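To see what each repetition operator accepts, the sketch below tests runs of the letter a of length 0 to 5 against each pattern. It uses Python's `re` module, which writes these operators the same way as the excerpt; the helper `matching_lengths` is a hypothetical name introduced for this illustration:

```python
import re

def matching_lengths(pattern: str, max_n: int = 5) -> list:
    """Lengths n in 0..max_n for which 'a' * n matches pattern in full."""
    return [n for n in range(max_n + 1) if re.fullmatch(pattern, "a" * n)]

for p in ["a?", "a*", "a+", "a{2}", "a{2,}", "a{2,4}"]:
    print(p, matching_lengths(p))

# a?      [0, 1]
# a*      [0, 1, 2, 3, 4, 5]
# a+      [1, 2, 3, 4, 5]
# a{2}    [2]
# a{2,}   [2, 3, 4, 5]
# a{2,4}  [2, 3, 4]
```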
Regular Expressions: GNU grep
Excerpts from the manpage
concatenation
–Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
disjunction
–Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression.
Regular Expressions: Examples
Regular Expression
–gupp(y|ies)
examples
–guppy
–guppies
Regular Expression
–beds?
examples
–bed
–beds
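Both patterns can be checked directly. A minimal sketch with Python's `re` module, using `fullmatch` so that the whole word has to fit the pattern (the non-matching test words are made up for contrast):

```python
import re

guppy = re.compile(r"gupp(y|ies)")  # concatenation of "gupp" with a disjunction
beds = re.compile(r"beds?")         # the final s is optional

for word in ["guppy", "guppies", "guppys", "gupp"]:
    print(word, bool(guppy.fullmatch(word)))

for word in ["bed", "beds", "bedss"]:
    print(word, bool(beds.fullmatch(word)))
```

Only guppy and guppies pass the first pattern, and only bed and beds pass the second.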
Regular Expressions: Examples
Example
–\b99 matches 99 in “there are 99 bottles …”
–but not 99 in “there are 299 bottles …”
–Note: in $99 the $ is not a word character, so there is a word edge between $ and 99, and \b99 will match 99 here
–word: unbroken sequence of digits, underscores and letters
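The three cases can be reproduced with Python's `re` module, which treats `\b` the same way for these inputs. The third sentence is an invented example containing $99:

```python
import re

pattern = re.compile(r"\b99")

texts = [
    "there are 99 bottles",    # 99 starts a word: match
    "there are 299 bottles",   # 99 is preceded by the digit 2: no match
    "it costs $99",            # $ is not a word character: match
]

results = [bool(pattern.search(t)) for t in texts]
print(results)  # [True, False, True]
```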
Regular Expressions: Examples
Example (sheeptalk)
–ba!
–baa!
–baaa!
–…
regular expression
–baa*!
–ba+!
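The two sheeptalk expressions describe the same language, and since regular expressions are equivalent to finite state automata, a small automaton can recognise it too. The sketch below is one possible hand-built automaton, not taken from the lecture: the function `accepts` and its state numbering are assumptions. It is compared with both expressions on every string over the three characters b, a and ! up to length 6:

```python
import re
from itertools import product

def accepts(s: str) -> bool:
    """Hand-coded deterministic FSA for sheeptalk: b, then one or more a, then !"""
    # Transition table: (state, symbol) -> next state; state 3 is the only final state.
    delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "!"): 3}
    state = 0
    for ch in s:
        if (state, ch) not in delta:
            return False  # no transition defined: reject
        state = delta[(state, ch)]
    return state == 3

r1 = re.compile(r"baa*!")
r2 = re.compile(r"ba+!")

# Exhaustive comparison on all strings of length 0..6 over {b, a, !}
agree = True
for n in range(7):
    for chars in product("ba!", repeat=n):
        s = "".join(chars)
        verdicts = {bool(r1.fullmatch(s)), bool(r2.fullmatch(s)), accepts(s)}
        if len(verdicts) != 1:
            agree = False

print(agree)  # True
print([s for s in ["ba!", "baa!", "baaa!", "b!", "ba", "!"] if accepts(s)])
# ['ba!', 'baa!', 'baaa!']
```

A bounded check like this is evidence of equivalence rather than a proof, but it covers all 1093 strings in that range.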
Regular Expressions: Microsoft Word terminology: –wildcard search