Corpus Linguistics I ENG 617


1 Corpus Linguistics I ENG 617
Dr. Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Ain Shams University

2 What are Regular Expressions?
Regular Expressions (RegEx or RegExp, for short) are a powerful way to run complex searches. With RegEx you can, for instance, find:
Acronyms
Rhyming words
Postal codes
Phone numbers
Spelling variations
… and much more
Week 7

3 RegEx at Work
Many corpus processors support RegEx search; AntConc is one of them. For illustration purposes, we will use the Brown Corpus. The Brown University Standard Corpus of Present-Day American English (or just the Brown Corpus) was compiled in the 1960s. The corpus contains 1,023,374 tokens and 41,506 types sampled from 15 text categories, including press, religion, and non-fiction books. To get started, open AntConc and load your Brown Corpus text file. We are using a raw version of the corpus.
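The token and type counts above depend on how words are split and compared. A minimal sketch in Python (not AntConc), assuming that a token is any run of word characters and that case is ignored; the sample sentence is invented, and this scheme is not claimed to reproduce the figures quoted for the Brown Corpus:

```python
import re

# Invented sample text; the real input would be the raw Brown Corpus file.
sample = "The cat saw the other cat."

# Assumption: a token is any run of word characters, compared case-insensitively.
tokens = re.findall(r"\w+", sample.lower())
types = set(tokens)  # types are the distinct tokens

print(len(tokens), "tokens;", len(types), "types")
```

Here the sentence has 6 tokens but only 4 types, because "the" and "cat" each occur twice.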

4 RegEx at Work: Finding Acronyms
For a computer, what are acronyms? They are all-caps words of two or more characters. How can we tell the computer to look for ‘all-caps words of two or more characters’? Consider these candidates:
\b[A-Z]\b
\b[A-Z]+\b
\b[A-Z]*\b
\b[A-Z]{2,}\b
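One way to compare the four candidates is to run them outside AntConc. The sketch below uses Python's re module, whose syntax for these patterns is assumed to behave like AntConc's; the sample sentence is invented, not taken from the Brown Corpus.

```python
import re

# Invented sample sentence (not from the Brown Corpus).
text = "The UN and NATO met in A small room near the FBI office."

candidates = {
    "one capital": r"\b[A-Z]\b",
    "one or more capitals": r"\b[A-Z]+\b",
    "zero or more capitals": r"\b[A-Z]*\b",
    "two or more capitals": r"\b[A-Z]{2,}\b",
}

# Collect every match of each candidate pattern.
results = {label: re.findall(p, text) for label, p in candidates.items()}

for label, hits in results.items():
    # The * version also produces empty matches, so filter them for display.
    print(f"{label:22} {[h for h in hits if h]}")
```

The + version also returns the single letter "A", and the * version additionally yields empty matches, which is why the {2,} quantifier is the tightest fit for the definition on this slide.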

5 Quiz
Write a RegEx to find:
two-letter acronyms only
three-letter acronyms only
long acronyms of at least 4 letters

6 RegEx at Work: Finding Verb Conjugations
How can we search for all the conjugations of the verb begin (i.e., begin, begins, began, begun) in one step? Consider these candidates:
\bbeg?n\b
\bbeg.n\b
\bbeg+n\b
\bbeg*n\b
\bbeg*ns?\b
\bbeg.ns\b
\bbeg.ns?\b
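Three of the candidates above can be compared with a short Python sketch. The re module stands in for AntConc here, and the sample sentence is invented for illustration.

```python
import re

# Invented sample sentence containing all four forms plus a distractor.
text = "They begin today, she begins now, he began early, it has begun, and begging is different."

optional_g = re.findall(r"\bbeg?n\b", text)      # "g" is optional: ben or begn
any_vowel = re.findall(r"\bbeg.n\b", text)       # any one character after "beg"
with_s = re.findall(r"\bbeg.ns?\b", text)        # same, plus an optional final "s"

print(optional_g)
print(any_vowel)
print(with_s)
```

On this sentence only the last pattern returns all four forms; the second one misses "begins" because the closing \b cannot sit between "n" and "s".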

7 Quiz
Write a RegEx to find the verb conjugations of:
speak: speak, speaks, spoke, spoken
fly: fly, flies, flew, flown

8 RegEx at Work: Finding Spelling Variation
Which RegEx matches ‘colour’, ‘color’, ‘colours’, ‘colors’, ‘colouring’, and ‘coloring’?
colou?rs?(ing)?
colours?(ing)?
colou?rs?ing?
Can ‘colorful’, ‘colorless’, and ‘colored’ be matched by the same RegEx? If not, how can we modify the RegEx to match them? Is there a more concise way to write your RegEx?
colou?rs?(ing)?(less)?(ful)?(ed)?
colou?r(s|ing|less|ful|ed)?
OR
colou?r\w*
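The spelling-variation candidates can be checked against the nine target words with a small Python sketch. It assumes whole-word matching via re.fullmatch, which is stricter than a concordance search that also reports partial hits.

```python
import re

words = ["colour", "color", "colours", "colors", "colouring", "coloring",
         "colorful", "colorless", "colored"]

patterns = [
    r"colou?rs?(ing)?",
    r"colours?(ing)?",
    r"colou?rs?ing?",
    r"colou?r(s|ing|less|ful|ed)?",
    r"colou?r\w*",
]

# fullmatch requires the whole word to match, not just a part of it.
matched = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for p, hits in matched.items():
    print(f"{p:30} {len(hits)}/{len(words)}  {hits}")
```

Under this test the first pattern covers the six original forms, while the last two cover all nine. Note that colou?r\w* is the most permissive: it accepts any word characters after the stem.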

9 Quiz
Write a RegEx to find:
puppy – puppies
behavior – behaviour

10 RegEx at Work: Finding Affixes
Which RegEx matches all words ending with ‘ness’?
\b[a-zA-Z]+ness\b
\b[a-zA-Z]*ness\b
Another way to do the same thing is:
\b\w+ness\b
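The difference between + and * in the suffix patterns shows up when the suffix stands alone. A minimal Python sketch with an invented sentence:

```python
import re

# Invented sentence; the bare string "ness" is included on purpose.
text = "Kindness and happiness beat darkness, but ness alone is not a suffix."

plus_hits = re.findall(r"\b[a-zA-Z]+ness\b", text)   # at least one letter before "ness"
star_hits = re.findall(r"\b[a-zA-Z]*ness\b", text)   # zero or more letters before "ness"
word_hits = re.findall(r"\b\w+ness\b", text)         # \w also allows digits and "_"

print(plus_hits)
print(star_hits)
print(word_hits)
```

Only the * version returns the bare "ness"; the other two require a stem in front of the suffix.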

11 Quiz
Write a RegEx to find words starting with:
anti
un
Write a RegEx to find words ending with:
ation
ment

12 RegEx at Work: Finding Rhyming Words
Which RegEx finds words rhyming with ‘duck’? Check all that apply:
\b[a-zA-Z]uck\b
\b[a-zA-Z]+uck\b
\b\w+uck\b
Quiz
Write a RegEx to find words rhyming with:
soon
clip
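The three ‘duck’ candidates can be compared on a sentence with rhymes of different lengths. This is a sketch in Python with an invented sentence, treating "ends in the same letters" as a rough stand-in for rhyme.

```python
import re

# Invented sentence with short and long words ending in "uck".
text = "The duck had no luck: the truck got stuck."

one_letter = re.findall(r"\b[a-zA-Z]uck\b", text)    # exactly one letter before "uck"
many_letters = re.findall(r"\b[a-zA-Z]+uck\b", text) # one or more letters before "uck"
word_chars = re.findall(r"\b\w+uck\b", text)         # one or more word characters

print(one_letter)
print(many_letters)
print(word_chars)
```

Without a quantifier, the character class stands for exactly one letter, so "truck" and "stuck" are missed by the first pattern.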

13 RegEx at Work: Finding Specific Words
Which RegEx matches all words starting with ‘a’ in both upper and lower case?
\b[aA]\w+\b
\ba\w+\b
\b[aA]\b
Which RegEx matches ‘This’ at the beginning of a sentence?
.This
\. \bThis\b
Which RegEx matches ‘this’ at the end of a sentence?
this.
this\.
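The role of the escaped dot is easiest to see side by side with the unescaped one. A minimal Python sketch on an invented text (Python's re module is assumed to treat these patterns like AntConc does):

```python
import re

# Invented two-sentence text.
text = "An apple a day. This is all that this advice says about this."

a_words = re.findall(r"\b[aA]\w+\b", text)      # a/A followed by at least one more character
lower_a = re.findall(r"\ba\w+\b", text)         # lower-case "a" only
single_a = re.findall(r"\b[aA]\b", text)        # the one-letter word only

any_then_this = re.findall(r".This", text)      # unescaped dot: any character
stop_then_this = re.findall(r"\. \bThis\b", text)  # literal full stop, space, "This"

this_any = re.findall(r"this.", text)           # "this" plus any character
this_stop = re.findall(r"this\.", text)         # "this" plus a literal full stop

print(a_words, lower_a, single_a)
print(any_then_this, stop_then_this)
print(this_any, this_stop)
```

The unescaped dot in this. also matches "this " in mid-sentence, whereas this\. matches only the occurrence followed by a full stop.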

14 RegEx at Work: Finding Numbers
Which RegEx matches all digits?
\d
\D
Which RegEx returns years (e.g. 1999, 2010)? Check all that apply:
[0-9]{4}
[0-9][0-9][0-9][0-9]
[0-9]
\b[0-9]\b
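The year candidates can be tried on a sentence that also contains a longer number. This Python sketch uses an invented sentence; the last pattern, with word boundaries around {4}, is an extra variant added here for comparison and is not one of the options on the slide.

```python
import re

# Invented sentence with two years, a five-digit number, and a single digit.
text = "In 1999 the team logged 12345 visits; by 2010 it had 7 offices."

digits = re.findall(r"\d", text)                          # every digit, one at a time
four = re.findall(r"[0-9]{4}", text)                      # any run of four digits
four_long = re.findall(r"[0-9][0-9][0-9][0-9]", text)     # the same, written out
lone = re.findall(r"\b[0-9]\b", text)                     # a digit standing alone
bounded = re.findall(r"\b[0-9]{4}\b", text)               # extra variant, not on the slide

print(len(digits), four, four_long, lone, bounded)
```

Both four-digit patterns also pick up "1234" from inside "12345", which is worth keeping in mind when reading the hit counts.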

15 RegEx at Work: Finding Punctuation Markers
Which RegEx matches all punctuation marks?
[!?,.*()&”;’]
\W+
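The two punctuation candidates behave differently around spaces. A small Python sketch on an invented string:

```python
import re

# Invented string with several punctuation marks.
text = "Wait, what?! (Yes & no.)"

# Inside [...] the characters ? . * ( ) lose their special meaning.
listed = re.findall(r"[!?,.*()&”;’]", text)

# \W+ matches runs of non-word characters, and whitespace is a non-word character.
non_word = re.findall(r"\W+", text)

print(listed)
print(non_word)
```

The explicit list returns one mark per hit, while \W+ groups neighbouring marks together with the spaces between them.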

16 Quiz
Use RegEx to answer each of the following questions:
How many words have double vowels?
How many times is ‘moreover’ used in a sentence-initial position?
How many words start with un and end with ing?
How many full proper nouns (first and last names) are in the corpus?

17 Quiz
Which RegEx matches US price tags?
\$[0-9]+
\$[0-9]+\.[0-9][0-9]
\$[0-9]+(\.[0-9][0-9])?
Which RegEx matches ‘bucket’, ‘a bucket’, and ‘some buckets’?
(a|some)?\sbuckets?
(a|some)\sbucket?
(a|some)\sbuckets
(a|some)buckets?
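Both sets of candidates above can be tried on running text. The Python sketch below uses invented sentences; hits() is a hypothetical helper defined here so that patterns containing groups return the whole match rather than the group.

```python
import re

def hits(pattern, text):
    """Return the full text of every match (hypothetical helper for this sketch)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Invented sentences.
prices = "Coffee costs $3, lunch costs $12.50, and the book costs $20.5."
buckets = "Bring bucket, then a bucket, then some buckets."

whole = hits(r"\$[0-9]+", prices)
cents = hits(r"\$[0-9]+\.[0-9][0-9]", prices)
optional_cents = hits(r"\$[0-9]+(\.[0-9][0-9])?", prices)

opt_article = hits(r"(a|some)?\sbuckets?", buckets)
opt_t = hits(r"(a|some)\sbucket?", buckets)
plural_only = hits(r"(a|some)\sbuckets", buckets)
no_space = hits(r"(a|some)buckets?", buckets)

print(whole, cents, optional_cents)
print(opt_article, opt_t, plural_only, no_space)
```

In (a|some)\sbucket? the question mark applies only to the final "t", not to the whole word, so "some buckets" is cut short to "some bucket". The first bucket pattern finds all three phrases here, with the bare word returned together with the space before it.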

18 Quiz
Which RegEx matches IP addresses such as , , and ?
192\.168\.1\.\d{1,3}
192\.168\.1\.
192\.168\.1\.\d
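The effect of the final quantifier can be seen by running the three candidates on a few addresses. The addresses in this Python sketch are made up for illustration and are not the examples from the original slide.

```python
import re

# Hypothetical addresses chosen to end in one, two, and three digits.
text = "Hosts: 192.168.1.1, 192.168.1.25 and 192.168.1.255 are up."

up_to_three = re.findall(r"192\.168\.1\.\d{1,3}", text)  # one to three final digits
prefix_only = re.findall(r"192\.168\.1\.", text)         # stops after the last dot
one_digit = re.findall(r"192\.168\.1\.\d", text)         # exactly one final digit

print(up_to_three)
print(prefix_only)
print(one_digit)
```

Only the {1,3} version returns each address in full; the other two truncate the last part.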

19 RegEx Cheat Sheet 1
Quantifiers
+      one or more
*      zero or more
?      zero or one
{n,m}  at least n times and at most m times
{n,}   at least n times
{,m}   at most m times
{n}    exactly n times
Ranges
[0-9]  the range of all possible digits
[A-Z]  the range of all upper-case letters
[a-z]  the range of all lower-case letters
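Each quantifier in the table can be read as a set of allowed repetition counts. The Python sketch below tests strings of zero to four copies of "a". It relies on Python's re module; the {,m} form in particular works in Python but is an assumption for other tools, which may not all accept it.

```python
import re

quantifiers = ["a+", "a*", "a?", "a{2,3}", "a{2,}", "a{,3}", "a{2}"]

# For each quantifier, list the repetition counts (0 to 4) that it accepts in full.
accepted = {
    q: [n for n in range(5) if re.fullmatch(q, "a" * n)]
    for q in quantifiers
}

for q, counts in accepted.items():
    print(f"{q:8} accepts {counts} repetitions")
```

Because the test stops at four repetitions, open-ended quantifiers such as + and {2,} are shown only up to that limit.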

20 RegEx Cheat Sheet 2
Grouping
()  everything in between is treated as one unit
Characters
\w  all word characters (letters, digits, and the underscore)
\d  all digits
\D  everything except digits
\W  all non-word characters (e.g. punctuation marks)
\s  white space
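The character classes and grouping in this table can be tried on a short string. A minimal Python sketch with an invented string:

```python
import re

# Invented string mixing letters, digits, an underscore, a symbol, and spaces.
text = "Room_12 costs $40 now"

word_runs = re.findall(r"\w+", text)       # letters, digits, and "_"
digit_runs = re.findall(r"\d+", text)      # digits only
non_digit_runs = re.findall(r"\D+", text)  # everything except digits
non_word_runs = re.findall(r"\W+", text)   # spaces and symbols
spaces = re.findall(r"\s", text)           # white space, one character at a time

# Grouping: (ha)+ repeats the unit "ha", while ha+ repeats only the "a".
grouped = bool(re.fullmatch(r"(ha)+", "hahaha"))
ungrouped = bool(re.fullmatch(r"ha+", "hahaha"))

print(word_runs, digit_runs, non_digit_runs, non_word_runs, spaces)
print(grouped, ungrouped)
```

The underscore stays inside "Room_12" because \w counts it as a word character, and the quantifier after (ha) applies to the whole group.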

21 RegEx Cheat Sheet 3
Boundaries
\b  word boundary
Symbols
\   escape symbol: treats a special character literally
|   the either-or symbol
Regular expressions make raw corpora more useful. What are other ways in which raw corpora can be useful?
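The boundary, escape, and either-or symbols each change what a plain search returns. A short Python sketch on an invented sentence:

```python
import re

# Invented sentence.
text = "The cat sat. A concat call costs 3.5 or 345."

anywhere = re.findall(r"cat", text)        # also finds "cat" inside "concat"
whole_word = re.findall(r"\bcat\b", text)  # \b restricts the hit to the full word

any_char = re.findall(r"3.5", text)        # unescaped dot: any character between 3 and 5
literal_dot = re.findall(r"3\.5", text)    # escaped dot: a real decimal point

either = re.findall(r"cat|call", text)     # | accepts either alternative

print(anywhere, whole_word)
print(any_char, literal_dot)
print(either)
```

Without the escape, 3.5 also matches "345", and without the boundaries, cat is found inside "concat".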

