Regular Expressions: grep. LING 5200 Computational Corpus Linguistics, Martha Palmer.


1 Regular Expressions: grep
LING 5200 Computational Corpus Linguistics
Martha Palmer

2 LING 5200, 2006 BASED on Kevin Cohen’s LING 5200
Homework 2
Bytes
Read path names
~ is not necessary in the home directory
Display the results of commands if they’re just a few lines.

3 Switches
-c  list a count of matching lines only (like adding | wc -l)
-i  ignore the case of the letters in the pattern
-n  include the line numbers
-v  show lines that do NOT match the pattern
grep -i lemma README.english
grep -ic lemma README.english
grep -in lemma README.english
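The four switches can be tried without the course files. The following is a minimal sketch, assuming a POSIX shell; the three sample lines are invented for illustration and merely stand in for README.english.

```shell
# Invented sample text (a stand-in for README.english).
sample='Lemma one
no match here
another lemma'

# -c: count of matching lines; -i makes "Lemma" and "lemma" both match.
count=$(printf '%s\n' "$sample" | grep -ic lemma)
# The same count obtained by piping into wc -l (tr strips padding some wc versions add).
piped=$(printf '%s\n' "$sample" | grep -i lemma | wc -l | tr -d ' ')
# -n: prefix each matching line with its line number.
numbered=$(printf '%s\n' "$sample" | grep -in lemma)
# -v: lines that do NOT match.
inverted=$(printf '%s\n' "$sample" | grep -iv lemma)

echo "$count"      # 2
echo "$numbered"   # 1:Lemma one / 3:another lemma
echo "$inverted"   # no match here
```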

4 The Chomsky Grammar Hierarchy
Regular grammars, e.g. aabbbb: S → aS | nil | bS
Context-free grammars, e.g. aaabbb: S → aSb | nil
Context-sensitive grammars, e.g. aaabbbccc: xSy → xby
Transformational grammars: Turing machines
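The lowest level of the hierarchy is the one grep works at. As a rough sketch (the helper function names are hypothetical, chosen for this example): the regular grammar S → aS | bS | nil corresponds to the regular expression (a|b)*, while the closest simple regular pattern to the context-free language of S → aSb | nil is a*b*, which cannot require the two counts to be equal.

```shell
# (a|b)* : the language of the regular grammar S -> aS | bS | nil.
is_regular_example() { printf '%s\n' "$1" | grep -Eqx '(a|b)*'; }
# a*b* : some a's followed by some b's, with no way to force equal counts.
looks_like_anbn() { printf '%s\n' "$1" | grep -Eqx 'a*b*'; }

is_regular_example aabbbb && echo "aabbbb: matches (a|b)*"
looks_like_anbn aaabbb && echo "aaabbb: matches a*b*"
looks_like_anbn aabbbb && echo "aabbbb: also matches a*b*, although the counts differ"
looks_like_anbn abab || echo "abab: rejected by a*b*"
```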

5 Movement
What did John give to Mary?
*Where did John give to Mary?
John gave cookies to Mary.
John gave to Mary.

6 Nested Dependencies and Crossing Dependencies
John, Mary and Bill ate peaches, pears and apples, respectively.
The dog chased the cat that bit the mouse that ran.
The mouse the cat the dog chased bit ran.
Nested dependencies: CF. Crossing dependencies: CS.

7 Most parsers are Turing Machines
To give a more natural and comprehensible treatment of movement
For a more efficient treatment of features
Not because of “respectively”: most parsers can’t handle it.

8 b*c matches the first character in the string cabbbcde; b*cd matches the third to seventh characters in the string cabbbcdebbbbbbcdbc.
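Both claims can be checked by asking grep to print only the matched text. This sketch assumes a grep that supports -o (GNU and BSD grep do; it is not part of POSIX).

```shell
# -o prints each matched substring on its own line; head keeps the leftmost one.
first=$(printf 'cabbbcde\n' | grep -o 'b*c' | head -n 1)
second=$(printf 'cabbbcdebbbbbbcdbc\n' | grep -o 'b*cd' | head -n 1)

echo "$first"    # c      (zero b's followed by c: the first character)
echo "$second"   # bbbcd  (characters three to seven)
```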

9 Character classes: ranges
[A-Z]     all upper-case letters
[a-z]     all lower-case letters
[A-Za-z]  all letters
[0-9]     any digit from zero to 9
Practice!
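For practice, the ranges can be run against a few invented lines. This is only a sketch: the sample words are made up, and LC_ALL=C is set as a precaution because in some locales a range such as [a-z] may cover more than the ASCII lower-case letters.

```shell
# Invented sample lines.
sample='Boulder
corpus
2006
LING'

# Count the lines that contain at least one character from each class.
upper=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[A-Z]')
lower=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[a-z]')
letters=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[A-Za-z]')
digits=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[0-9]')

echo "$upper $lower $letters $digits"   # 2 2 3 1
```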

10 Character classes: complements
Any character that's not a vowel: [^aeiouAEIOU]
In this context, ^ means "not".
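A short sketch of the complement class on invented words: the first command counts lines containing at least one non-vowel character, the second uses -x (match the whole line) to keep lines made up of non-vowels only.

```shell
# Invented sample words.
sample='rhythm
tree
aeiou
Psst'

# Lines with at least one character that is not a vowel.
has_nonvowel=$(printf '%s\n' "$sample" | grep -c '[^aeiouAEIOU]')
# Lines consisting entirely of non-vowel characters.
only_nonvowels=$(printf '%s\n' "$sample" | grep -x '[^aeiouAEIOU]*')

echo "$has_nonvowel"     # 3
echo "$only_nonvowels"   # rhythm / Psst
```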

11 Anchors
Any line that begins with…
Any line that ends with…
^T    line that begins with T
VBZ$  line that ends with VBZ
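The two anchors, sketched on invented lines (the text below is not taken from any corpus). Note that VBZ at the start of a line is not matched by VBZ$.

```shell
# Invented sample lines.
sample='The dog
barks VBZ
VBZ then more
Tall trees'

begins_with_T=$(printf '%s\n' "$sample" | grep '^T')
ends_with_VBZ=$(printf '%s\n' "$sample" | grep 'VBZ$')

echo "$begins_with_T"   # The dog / Tall trees
echo "$ends_with_VBZ"   # barks VBZ
```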

12 Quantifiers
One or more… Zero or more… One or zero…
a+  one or more “a's”
a*  zero or more “a's”
a?  one “a”, or nothing
And more…
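The three quantifiers can be compared side by side. A minimal sketch on invented strings, using grep -E (extended patterns, the same syntax egrep accepts) because + and ? are ordinary characters in basic grep patterns; -x requires the whole line to match.

```shell
# Invented sample strings.
sample='b
ab
aab'

plus=$(printf '%s\n' "$sample" | grep -Ex 'a+b')   # one or more a's, then b
star=$(printf '%s\n' "$sample" | grep -Ex 'a*b')   # zero or more a's, then b
opt=$(printf '%s\n' "$sample" | grep -Ex 'a?b')    # one a or nothing, then b

echo "$plus"   # ab / aab
echo "$star"   # b / ab / aab
echo "$opt"    # b / ab
```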

13 grep/egrep
x+ instead of xx*
(xxx|yyy)
? matches a single character (as a shell wildcard in file names; inside an egrep pattern, ? means "one or zero")

14 Searching the treebank
cat ??/* | egrep -i '(push|pull)[a-z]*'
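The same pattern can be tried without the treebank. In this sketch the three sentences are invented, grep -E is used as the POSIX spelling of egrep, and -o (a GNU/BSD extension) prints just the matched word forms.

```shell
# Invented sentences standing in for treebank text.
sample='He pushed the cart.
She PULLS the rope.
They walked home.'

# Alternation picks either stem; [a-z]* takes any following letters.
hits=$(printf '%s\n' "$sample" | grep -Eio '(push|pull)[a-z]*')

echo "$hits"   # pushed / PULLS
```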

15 grep/egrep
grep '^[^a-z]*epl' README.english
grep ' epl' README.english
egrep '^[^a-z]*(epl|epw)' README.english
egrep ' (epl|epw)' README.english
Nice when you have tokenized strings…

16 More grepping
But when you don’t…
/corpora/celex/english/epw/epw.cd
Find all capitalized words:
grep '^[0-9][0-9]*.[A-Z]' epw.cd | wc -l
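A sketch of how this pattern behaves, under the assumption that each line starts with a numeric ID, then a one-character field separator (a backslash here), then the word. The four lines below are invented for illustration and are not real CELEX entries.

```shell
# Invented lines in an assumed ID\word\... layout.
# In the pattern, "." matches the separator and [A-Z] the capital first letter.
caps=$(printf '%s\n' '1\a\10' '2\Aachen\3' '3\abacus\7' '4\Zulu\2' |
  LC_ALL=C grep -c '^[0-9][0-9]*.[A-Z]')

echo "$caps"   # 2
```

Here -c gives the same number as piping the matches into wc -l.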

17 Exercises: pick a directory
How many 5-letter words?
head -10 wsj_0564 | grep -i ' [a-z][a-z][a-z][a-z][a-z] ' | wc
grep -i ' [a-z][a-z][a-z][a-z][a-z] ' * | wc
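The commands above count matching lines, and the space-delimited pattern misses words at the start or end of a line. One possible variation, sketched on two invented lines: -w restricts matches to whole words and -o prints every match separately, so words can be counted rather than lines. Both switches are assumed to be available (they are in GNU and BSD grep; -o is not POSIX).

```shell
# Invented sample text.
sample='these words count seven
only three here'

# Lines containing at least one five-letter word.
lines=$(printf '%s\n' "$sample" | grep -c -i -w '[a-z]\{5\}')
# Individual five-letter words, one per output line, then counted.
words=$(printf '%s\n' "$sample" | grep -o -i -w '[a-z]\{5\}' | wc -l | tr -d ' ')

echo "$lines"   # 2
echo "$words"   # 5
```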

18 Lab (cont.)
Are there any words with no vowels?
grep -i ' [^aeiou][^aeiou]* ' wsj_0564 | wc
grep -i ' [^aeiouy][^aeiouy.]* ' wsj_0564 | wc
grep -i ' [^aeiouy"][^aeiouy."]* ' wsj_0564
80%?
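The successive patterns keep adding exclusions because a complement class such as [^aeiou] also matches spaces, digits and punctuation. An alternative sketch, not the lab's own solution: put one word on each line with tr, keep only all-letter tokens, then drop any that contain a vowel. The sample sentence is invented.

```shell
# Invented sample sentence.
text='Mr Smith said hmm and pst to the TV crew'

novowel=$(printf '%s\n' "$text" |
  tr -s ' ' '\n' |              # one word per line
  grep -ix '[a-z][a-z]*' |      # keep tokens made of letters only
  grep -iv '[aeiouy]')          # drop tokens containing a vowel (y counted as one)

echo "$novowel"   # Mr / hmm / pst / TV
```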

19 Lab (cont.)
Find “1-syllable” words (words with exactly one vowel).
grep -i ' [^aeiouy]*[aeiouy][^aeiouy]* '
Find “2-syllable” words (words with exactly two vowels).
Delete words ending with a silent “e” from the “2-syllable” list.
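One way to approach these exercises, offered as a sketch rather than the intended answer: split the text into one word per line and use -x so the vowel count applies to the whole word. The sample sentence is invented, and "ends with a silent e" is approximated here by "ends with the letter e", which is an assumption that will also remove some words whose final e is pronounced.

```shell
# Invented sample sentence.
text='the cat sat on a blue table near music'
words=$(printf '%s\n' "$text" | tr -s ' ' '\n')

# Exactly one vowel: non-vowels, a vowel, non-vowels.
one=$(printf '%s\n' "$words" | grep -ix '[^aeiouy]*[aeiouy][^aeiouy]*')
# Exactly two vowels: the same shape with a second vowel added.
two=$(printf '%s\n' "$words" | grep -ix '[^aeiouy]*[aeiouy][^aeiouy]*[aeiouy][^aeiouy]*')
# Approximation of the silent-e filter: drop words ending in e.
two_no_e=$(printf '%s\n' "$two" | grep -iv 'e$')

echo "$one"        # the / cat / sat / on / a
echo "$two"        # blue / table / near / music
echo "$two_no_e"   # near / music
```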

20 Emacs
emacs -nw
Control-x, Control-c: exit
Control-x, Control-s: save
Control-x, Control-v: visit
Apropos

