LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.

1 LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23

2 Administrivia Class:

3 Administrivia Class:

4 Administrivia Did you bring your laptop?

5 Administrivia Did you install Perl yet? –Active State Perl –http://www.activestate.com –install free version

6 Regular Expressions
regular expressions are used in string pattern-matching
–important tool in automated searching
–formally equivalent to finite-state automata (FSA) and regular grammars
popular implementations
–Unix grep command line program
  returns lines matching a regular expression
  standard part of all Unix-based systems, including MacOS X (command-line interface in Terminal)
  many shareware/freeware implementations available for Windows XP (just Google and see...)
–grep functionality is built into many programming languages, e.g. Perl
–wildcard search in Microsoft Word
  limited version of regular expressions (not full power) with differences in notation

7 Regular Expressions
Historical note
–grep: name comes from Unix ed command
–g/re/p
–“search globally for lines matching the regular expression, and print them”
–[Source:]
–ed is an obscure and difficult-to-use text edit program on Unix systems
–doesn’t need a screen display
–would work on an ancient teletype

8 Regular Expressions
Formally
–a regular expression (regexp) is formed from:
  an alphabet (= set of characters)
  operators
–a regexp is shorthand for a set of strings
  (possibly infinite set)
  (strings are of finite length)
Formally
–a set of strings is called a language
–a language that can be defined by a regular expression is called a regular language
  (not all languages are regular)

9 Regular Expressions
alphabet
–e.g. {a,b,c,...,z}, the set of lower case English letters
–Note: case is important
operators
–a: the single symbol a
–a^n: exactly n occurrences of a, where n is a positive integer
  e.g. a^3 = aaa
–a*: zero or more occurrences of a
–a+: one or more occurrences of a
–concatenation: two regexps may be concatenated, and the result is also a regexp
  e.g. abc
–disjunction: infix operator | (vertical bar)
  e.g. a|b
–parentheses may be used for disambiguation
  e.g. gupp(y|ies)
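The operators above can be tried out directly. The following is a minimal sketch using Python's `re` module as a stand-in for grep or Perl; the variable names and sample strings are illustrative assumptions, not part of the lecture. Regex engines write the slide's a^n as `a{n}`, and `re.fullmatch` is used so that the whole string must belong to the language the regexp defines.

```python
import re

# a^3 on the slide is written a{3} in regex engines: exactly three a's.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None

# a* accepts the empty string (zero occurrences); a+ does not.
star_accepts_empty = re.fullmatch(r"a*", "") is not None
plus_accepts_empty = re.fullmatch(r"a+", "") is not None

# Concatenation, disjunction and parentheses together: gupp(y|ies)
candidates = ["guppy", "guppies", "guppys"]
guppy_forms = [w for w in candidates if re.fullmatch(r"gupp(y|ies)", w)]

# Plain disjunction: a|b accepts exactly the strings "a" and "b".
a_or_b = [s for s in ["a", "b", "c", "ab"] if re.fullmatch(r"a|b", s)]

print(exactly_three, star_accepts_empty, plus_accepts_empty)
print(guppy_forms, a_or_b)
```

Without the parentheses, `guppy|ies` would mean "guppy or ies", which is why the slide lists them as a disambiguation device.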

10 Regular Expressions
Technically, a+ is not necessary
–aa* = a+
  “a concatenated with a* (zero or more occurrences of a)” = “one or more occurrences of a”
Disjunction
–[set of characters]
  set of characters enclosed in square brackets means match one of the characters
–e.g. [aeiou] matches any of the vowels a, e, i, o or u but not d
–dash (-) is shorthand for a range
–e.g. [a-e] matches a, b, c, d or e
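The equivalence aa* = a+ and the bracket notation can be checked on a handful of strings. This is a small sketch in Python's `re` module, used here as an assumption in place of grep or Perl; the sample strings are made up for illustration, and a finite sample only illustrates the equivalence rather than proving it.

```python
import re

samples = ["", "a", "aa", "aaaa", "b", "ab"]

# Both patterns should accept exactly the same strings from the sample.
accepted_by_plus = [s for s in samples if re.fullmatch(r"a+", s)]
accepted_by_concat = [s for s in samples if re.fullmatch(r"aa*", s)]

# [aeiou] matches one character that is any of the listed vowels.
vowels_found = re.findall(r"[aeiou]", "education")

# [a-e] is shorthand for [abcde].
range_found = re.findall(r"[a-e]", "abcdefg")

print(accepted_by_plus == accepted_by_concat)
print(vowels_found, range_found)
```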

11 Regular Expressions Range defined over a computer character set –typically ASCII –originally a 7 bit character set –2^7 = 128 (0-127) different characters –ASCII = American Standard Code for Information Interchange –e.g. [A-z] [0-9A-Za-z] rogramming/ascii_table/PROGRAM MING_ascii_table.shtml
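Because a range is defined over character codes, [A-z] covers more than letters: in ASCII the codes between Z (90) and a (97) belong to punctuation. The sketch below, written in Python as an assumed stand-in for grep or Perl, lists those extra characters; the variable names are illustrative.

```python
import re

ascii_chars = [chr(code) for code in range(128)]  # the 7-bit ASCII set, 0-127

# Everything [A-z] accepts, i.e. codes 65 ('A') through 122 ('z').
in_range = [c for c in ascii_chars if re.fullmatch(r"[A-z]", c)]

# The accepted characters that are not letters at all.
non_letters = [c for c in in_range if not c.isalpha()]

# [0-9A-Za-z] names three separate ranges and so avoids the punctuation.
alphanumerics = [c for c in ascii_chars if re.fullmatch(r"[0-9A-Za-z]", c)]

print(len(in_range), non_letters, len(alphanumerics))
```

Writing the letters as two ranges, [A-Za-z], is the usual way to avoid picking up the six punctuation characters.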

12 Regular Expressions: Microsoft Word terminology: –wildcard search

13 Regular Expressions: Microsoft Word Note: zero or more times is missing in Microsoft Word

14 Regular Expressions
Perl uses the same notation as grep (textbook also uses grep notation)
More shorthand
–question mark (?) means the previous regexp is optional
–e.g. colou?r
–matches color or colour
–metacharacters or operators like ? have a function
–to match a question mark, escape it using a backslash (\)
–e.g. why\?
–? in Microsoft Word means match any single character
More shorthand
–period (.) stands for any character (except newline)
–e.g. e.t matches eat as well as eet
–caret sign (^) as the first character of a range of characters
  [^set of characters] means don’t match any of the characters mentioned (after the caret)
–e.g. [^aeiou]
–any character except for one of the vowels listed
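The shorthand on this slide behaves the same way in Python's `re` module, which is used below as an assumed substitute for grep or Perl. The word lists are invented for illustration.

```python
import re

# ? makes the preceding regexp (here the single letter u) optional.
spellings = [w for w in ["color", "colour", "colouur"]
             if re.fullmatch(r"colou?r", w)]

# Escaped, \? matches a literal question mark.
literal_question = re.search(r"why\?", "but why?") is not None

# Unescaped, ? is an operator: "why?" means "wh" followed by an optional y.
y_is_optional = re.fullmatch(r"why?", "wh") is not None

# . matches any single character except newline.
dot_matches = [w for w in ["eat", "eet", "et", "e\nt"]
               if re.fullmatch(r"e.t", w)]

# [^aeiou] matches any single character that is not one of the listed vowels.
non_vowels = re.findall(r"[^aeiou]", "regexp")

print(spellings, literal_question, y_is_optional)
print(dot_matches, non_vowels)
```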

15 Regular Expressions
Text files in Unix consist of sequences of lines separated by a newline character (LF = line feed)
Typically, text files are read a line at a time by programs
Matching in Perl and grep is line-oriented (can be changed in Perl)
Differences in platforms for line breaking:
–Unix (including Mac OS X): LF
–Windows (DOS): CR LF
–classic Mac OS (before OS X): CR
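The practical consequence of the different line-break conventions is that a program splitting only on LF leaves a stray CR on each line of a CR LF file. The sketch below uses Python rather than Perl, as an assumption for illustration, and the sample strings are made up.

```python
# The same two-line text under the three line-break conventions.
lf_text = "one\ntwo\n"        # LF only
crlf_text = "one\r\ntwo\r\n"  # CR LF
cr_text = "one\rtwo\r"        # CR only

# str.splitlines() recognises all three conventions.
split_all = [text.splitlines() for text in (lf_text, crlf_text, cr_text)]

# Splitting on LF alone leaves a trailing CR on each CR LF line, which
# can make a pattern such as end\.$ fail unexpectedly.
naive_split = crlf_text.split("\n")

print(split_all)
print(naive_split)
```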

16 Regular Expressions
Line-oriented metacharacters:
–caret (^) at the beginning of a regexp string matches the “beginning of a line”
–e.g. ^The matches lines beginning with the sequence The
–Note: the caret is very overloaded...
  [^ab] a^b
–dollar sign ($) at the end of a regexp string matches the “end of the line”
–e.g. end\.$
–matches lines ending in the sequence end.
–e.g. ^$ matches blank lines only
–e.g. ^ $ matches lines containing exactly one space
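The line-oriented behaviour of grep can be imitated by testing each line separately. This is a minimal sketch in Python's `re` module, an assumed stand-in for grep or Perl; the list of lines is invented, and each entry represents one line with its line break already removed.

```python
import re

lines = ["The end.", "In The end", "", " ", "  ", "the end"]

# ^The: lines beginning with the sequence The (case matters).
starts_with_the = [l for l in lines if re.search(r"^The", l)]

# end\.$: lines ending in the sequence "end." (the period is escaped).
ends_with_end = [l for l in lines if re.search(r"end\.$", l)]

# ^$: nothing between the beginning and the end of the line.
blank_lines = [l for l in lines if re.search(r"^$", l)]

# ^ $: exactly one space between the beginning and the end of the line.
single_space = [l for l in lines if re.search(r"^ $", l)]

print(starts_with_the, ends_with_end, blank_lines, single_space)
```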

17 Regular Expressions
Word-oriented metacharacters:
–a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z]
–(historical reasons for this)
–\b matches a word boundary: the position between a word character and a non-word character (e.g. a space), or the beginning or end of a line
–e.g. the
–matches the, they, breathe and other
–but \bthe will only match the and they
–the\b will match the and breathe
–\bthe\b will only match the
–(in grep, \< and \> can also be used to match the beginning and end of a word)
–e.g. \b99
–matches 99 but not 299
–also matches $99
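The four variants of the pattern the can be compared side by side on the slide's own example words. The sketch uses Python's `re` module as an assumed substitute for grep or Perl; the variable names are illustrative.

```python
import re

words = ["the", "they", "breathe", "other"]

# No boundary: "the" may appear anywhere inside the word.
anywhere = [w for w in words if re.search(r"the", w)]

# \bthe: "the" must start at a word boundary.
at_start = [w for w in words if re.search(r"\bthe", w)]

# the\b: "the" must end at a word boundary.
at_end = [w for w in words if re.search(r"the\b", w)]

# \bthe\b: "the" must be a whole word.
whole_word = [w for w in words if re.search(r"\bthe\b", w)]

# $ is not a word character, so there is a boundary between $ and 9.
numbers = [s for s in ["99", "299", "$99"] if re.search(r"\b99", s)]

print(anywhere, at_start, at_end, whole_word, numbers)
```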

18 Regular Expressions
Range abbreviations:
–\d (digit) = [0-9]
–\s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)
–\w (word character) = [0-9a-zA-Z_]
uppercase versions denote negation
–e.g. \W means a non-word character
  \D means a non-digit
Repetition abbreviations:
–a? a optional
–a* zero or more a’s
–a+ one or more a’s
–a{n,m} between n and m a’s
–a{n,} at least n a’s
–a{n} exactly n a’s
–e.g. \d{7,} matches numbers with at least 7 digits
–e.g. \d{3}-\d{4}
–matches 7 digit telephone numbers with a separating dash
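The abbreviations can be exercised on a short piece of text. This sketch uses Python's `re` module as an assumed stand-in for grep or Perl, with the `re.ASCII` flag so that \d and \w mean exactly the ASCII sets listed above; the sample text and variable names are invented for illustration.

```python
import re

text = "Call 555-1234 or 5551234, ref 12345678, not 55-1234."

# \d{3}-\d{4}: three digits, a dash, four digits.
phone_numbers = re.findall(r"\d{3}-\d{4}", text, flags=re.ASCII)

# \d{7,}: runs of at least seven digits (a dash breaks the run).
long_numbers = re.findall(r"\d{7,}", text, flags=re.ASCII)

# a{2,3}: between two and three a's, taking as many as possible each time.
bounded = re.findall(r"a{2,3}", "a aa aaa aaaa")

# \W: characters that are not word characters (underscore is a word character).
non_word = re.findall(r"\W", "a_b c-d", flags=re.ASCII)

# \D: non-digits; removing them leaves only the digits.
digits_only = re.sub(r"\D", "", "555-1234", flags=re.ASCII)

print(phone_numbers, long_numbers, bounded, non_word, digits_only)
```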

19 Reading
Perl Quick Intro – Perl Regular Expressions (RE)
–perlrequick - Perl regular expressions quick start
–perlretut - Perl regular expressions tutorial
