Corpus Linguistics I ENG 617


Corpus Linguistics I ENG 617 Dr. Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Ain Shams University rsabbagh@alsun.asu.edu.eg

What are Regular Expressions?
Regular Expressions (RegEx or RegExp, for short) are a powerful way to do complex searches. With RegEx you can, for instance, find:
  acronyms
  rhyming words
  postal codes
  phone numbers
  emails
  spelling variations
  … and much more
Week 7

RegEx at Work
Many corpus processors support RegEx search. AntConc is one of them. For illustration purposes, we will use the Brown Corpus. The Brown University Standard Corpus of Present-Day American English (or just the Brown Corpus) was compiled in the 1960s. The corpus originally contains 1,023,374 tokens and 41,506 types sampled from 15 text categories, including press, religion, and non-fiction books. To get started, open AntConc and your Brown Corpus text file. We are using a raw version of the corpus.

RegEx at Work: Finding Acronyms
For a computer, what are acronyms? They are all-caps words that include two or more characters. How can we tell the computer to look for ‘all-caps words that include two or more characters’?
  \b[A-Z]\b
  \b[A-Z]+\b
  \b[A-Z]*\b
  \b[A-Z]{2,}\b
Answer: \b[A-Z]{2,}\b
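The four candidates on this slide can be compared side by side. The sketch below uses Python's `re` module rather than AntConc, and the sample sentence is invented for illustration (it is not taken from the Brown Corpus).

```python
import re

# Invented sample sentence, not Brown Corpus data.
text = "The FBI and the U.N. met NASA staff. A plan was sent to the EU."

candidates = [r"\b[A-Z]\b", r"\b[A-Z]+\b", r"\b[A-Z]*\b", r"\b[A-Z]{2,}\b"]

results = {}
for pattern in candidates:
    # [A-Z]* can also match the empty string, so empty hits are dropped.
    results[pattern] = [m for m in re.findall(pattern, text) if m]
    print(pattern, results[pattern])
```

Only the `{2,}` version excludes one-letter capitals such as the article "A".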

Quiz
Write a RegEx to find:
  two-letter acronyms only
  three-letter acronyms only
  long acronyms of at least 4 letters

RegEx at Work: Finding Verb Conjugations
How can we search for all the verb conjugations of begin (i.e., begin, begins, began, begun) in one step?
  \bbeg?n\b
  \bbeg.n\b
  \bbeg+n\b
  \bbeg*n\b
  \bbeg*ns?\b
  \bbeg.ns\b
  \bbeg.ns?\b
Answer: \bbeg.ns?\b
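One way to check the candidates is to test each against the four target forms. This is a minimal sketch in Python's `re` module (not AntConc). It uses `re.fullmatch`, so a pattern only counts if it covers the whole word.

```python
import re

forms = ["begin", "begins", "began", "begun"]
candidates = [
    r"\bbeg?n\b", r"\bbeg.n\b", r"\bbeg+n\b", r"\bbeg*n\b",
    r"\bbeg*ns?\b", r"\bbeg.ns\b", r"\bbeg.ns?\b",
]

# For each pattern, list the forms it matches in full.
coverage = {p: [w for w in forms if re.fullmatch(p, w)] for p in candidates}

for pattern, matched in coverage.items():
    print(pattern, matched)
```

The dot stands for the vowel that changes (i, a, u), and `s?` makes the third-person ending optional.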

Quiz
Write a RegEx to find the verb conjugations of:
  speak: speak, speaks, spoke, spoken
  fly: fly, flies, flew, flown

RegEx at Work: Finding Spelling Variation
Which RegEx matches ‘colour’, ‘color’, ‘colours’, ‘colors’, ‘colouring’, and ‘coloring’?
  colou?rs?(ing)?
  colours?(ing)?
  colou?rs?ing?
Can colorful, colorless, and colored be matched by the same RegEx? If not, how can we modify the RegEx to match them? Is there a more concise way to write your RegEx?
  colou?rs?(ing)?(less)?(ful)?(ed)?
  colou?r(s|ing|less|ful|ed)?
  OR colou?r\w*
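The patterns on this slide can be tried against the listed words directly. The sketch below uses Python's `re` module and whole-word matching via `re.fullmatch`. That is an assumption made for testing, since a concordancer search would also report partial hits inside longer words.

```python
import re

targets = ["colour", "color", "colours", "colors", "colouring", "coloring"]
extras = ["colorful", "colorless", "colored"]
words = targets + extras

patterns = [
    r"colou?rs?(ing)?",
    r"colours?(ing)?",
    r"colou?rs?ing?",
    r"colou?rs?(ing)?(less)?(ful)?(ed)?",
    r"colou?r(s|ing|less|ful|ed)?",
    r"colou?r\w*",
]

# Words that each pattern matches from start to end.
matched = {p: [w for w in words if re.fullmatch(p, w)] for p in patterns}

for pattern, hits in matched.items():
    print(pattern, hits)
```

Note that `ing?` makes only the `g` optional, whereas `(ing)?` makes the whole suffix optional.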

Quiz
Write a RegEx to find:
  puppy, puppies
  behavior, behaviour

RegEx at Work: Finding Affixes
Which RegEx matches all words ending with ‘ness’?
  \b[a-zA-Z]+ness\b
  \b[a-zA-Z]*ness\b
Answer: \b[a-zA-Z]+ness\b (the * version also matches the bare string ‘ness’).
Another way to do the same thing is \b\w+ness\b
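The difference between `+` and `*` shows up when the suffix occurs as a word by itself. A small sketch in Python's `re` module, with an invented sentence:

```python
import re

# Invented sentence; "ness" (a headland) appears as a standalone word.
text = "The ness was windy, but kindness and happiness prevailed."

plus_hits = re.findall(r"\b[a-zA-Z]+ness\b", text)   # at least one letter before "ness"
star_hits = re.findall(r"\b[a-zA-Z]*ness\b", text)   # zero or more letters before "ness"
word_hits = re.findall(r"\b\w+ness\b", text)         # \w also allows digits and "_"

print(plus_hits)
print(star_hits)
print(word_hits)
```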

Quiz
Write a RegEx to find words starting with:
  anti
  un
Write a RegEx to find words ending with:
  ation
  ment

RegEx at Work: Finding Rhyming Words
Which RegEx finds words rhyming with ‘duck’? Check all that apply:
  \b[a-zA-Z]uck\b
  \b[a-zA-Z]+uck\b
  \b\w+uck\b
Answer: all three find rhyming words; the first finds only four-letter words such as ‘luck’.

Quiz
Write a RegEx to find words rhyming with:
  soon
  clip
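To see how the three candidates differ in coverage, they can be run over a short invented sentence. This is a sketch in Python's `re` module, not AntConc output.

```python
import re

# Invented sentence with several words ending in "uck".
text = "The duck had no luck when the truck got stuck near the buck."

one_letter = re.findall(r"\b[a-zA-Z]uck\b", text)    # exactly one letter before "uck"
many_letters = re.findall(r"\b[a-zA-Z]+uck\b", text)  # one or more letters before "uck"
word_chars = re.findall(r"\b\w+uck\b", text)          # one or more word characters

print(one_letter)
print(many_letters)
print(word_chars)
```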

RegEx at Work: Finding Specific Words
Which RegEx matches all words starting with ‘a’ in both upper and lower cases?
  \b[aA]\w+\b
  \ba\w+\b
  \b[aA]\b
Which RegEx matches ‘This’ at the beginning of sentences?
  .This
  \. \bThis\b
Which RegEx matches ‘this’ at the end of the sentence?
  this.
  this\.
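The last question turns on whether the dot is escaped. An unescaped `.` matches any character, while `\.` matches only a literal full stop. A minimal sketch in Python's `re` module, with invented text:

```python
import re

# Invented text; "this" appears before a space, a full stop, a letter, and "!".
text = "Is this it? I like this. But thistle is not this!"

unescaped = re.findall(r"this.", text)   # "." matches any single character
escaped = re.findall(r"this\.", text)    # "\." matches a literal full stop only

print(unescaped)
print(escaped)
```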

RegEx at Work: Finding Numbers
Which RegEx matches all digits?
  \d
  \D
Answer: \d
Which RegEx returns years (e.g. 1999, 2010)? Check all that apply:
  [0-9]{4}
  [0-9][0-9][0-9][0-9]
  [0-9]
  \b[0-9]\b
Answer: [0-9]{4} and [0-9][0-9][0-9][0-9]
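A four-digit pattern without word boundaries also picks up the first four digits of longer numbers. The sketch below (Python's `re` module, invented sentence) shows this. The `\b[0-9]{4}\b` variant is an addition for comparison and does not appear on the slide.

```python
import re

# Invented sentence with years, a five-digit number, and a single digit.
text = "In 1999 the town had 25000 people; by 2010 it had 7 schools."

four_digits = re.findall(r"[0-9]{4}", text)                 # any run of four digits
four_spelled = re.findall(r"[0-9][0-9][0-9][0-9]", text)    # same thing, written out
single_digit_words = re.findall(r"\b[0-9]\b", text)         # one digit standing alone
bounded = re.findall(r"\b[0-9]{4}\b", text)                 # assumption: tighter variant

print(four_digits)
print(four_spelled)
print(single_digit_words)
print(bounded)
```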

RegEx at Work: Finding Punctuation Markers
Which RegEx matches all punctuation markers?
  [!?,.*()&”;’]
  \W+
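The two candidates behave differently around spaces, because `\W` covers every non-word character, white space included. The sketch below uses Python's `re` module and an invented sentence. The character class is written with plain ASCII quotes, which is an assumption about what the slide intended.

```python
import re

# Invented sentence.
text = "Wait, what?! Yes (really)."

# Explicit list of punctuation characters (ASCII quotes assumed).
listed = re.findall(r'''[!?,.*()&";']''', text)

# Runs of non-word characters, which include spaces.
non_word = re.findall(r"\W+", text)

print(listed)
print(non_word)
```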

Quiz
Use RegEx to answer each of the following questions:
  How many words have double vowels?
  How many times is ‘moreover’ used in a sentence-initial position?
  How many words start with un and end with ing?
  How many full proper nouns (first and last names) are in the corpus?

Quiz
Which RegEx matches US price tags?
  \$[0-9]+
  \$[0-9]+\.[0-9][0-9]
  \$[0-9]+(\.[0-9][0-9])?
Which RegEx matches ‘bucket’, ‘a bucket’, and ‘some buckets’?
  (a|some)?\sbuckets?
  (a|some)\sbucket?
  (a|some)\sbuckets
  (a|some)buckets?
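For the price-tag question, the three candidates can be compared on a sentence containing one whole-dollar price and one price with cents. This is a sketch in Python's `re` module with an invented sentence. It uses `re.finditer` so that the full match is returned even when the pattern contains a group.

```python
import re

# Invented sentence with a whole-dollar price and a price with cents.
text = "Tickets cost $5 or $12.99."

candidates = [
    r"\$[0-9]+",
    r"\$[0-9]+\.[0-9][0-9]",
    r"\$[0-9]+(\.[0-9][0-9])?",
]

# group(0) is the whole match, regardless of any capturing groups.
found = {p: [m.group(0) for m in re.finditer(p, text)] for p in candidates}

for pattern, hits in found.items():
    print(pattern, hits)
```

Only the third pattern returns both prices in full, because the cents part is grouped and made optional.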

Quiz
Which RegEx matches IP addresses such as 192.168.1.0, 192.168.1.33, and 192.168.1.422?
  192\.168\.1\.\d{1,3}
  192\.168\.1\.
  192\.168\.1\.\d
Answer: 192\.168\.1\.\d{1,3}
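The candidates can be checked against the three example strings from the slide. This sketch uses Python's `re.fullmatch`, so a pattern must cover the entire address. Note that the pattern checks only the shape of the string: 422 is not a valid IPv4 octet, yet `\d{1,3}` accepts it.

```python
import re

addresses = ["192.168.1.0", "192.168.1.33", "192.168.1.422"]
candidates = [
    r"192\.168\.1\.\d{1,3}",
    r"192\.168\.1\.",
    r"192\.168\.1\.\d",
]

# Addresses that each pattern matches from start to end.
full_matches = {p: [a for a in addresses if re.fullmatch(p, a)] for p in candidates}

for pattern, hits in full_matches.items():
    print(pattern, hits)
```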

RegEx Cheat Sheet 1
Quantifiers
  +      one or more
  *      zero or more
  ?      zero or one
  {n,m}  at least n times and at most m times
  {n,}   at least n times
  {,m}   at most m times
  {n}    exactly n times
Ranges
  [0-9]  the range of all possible digits
  [A-Z]  the range of all upper-case alphabet letters
  [a-z]  the range of all lower-case alphabet letters
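The quantifiers in this cheat sheet can be illustrated by applying each one to the letter `a` and testing strings of increasing length. This is a sketch in Python's `re` module. Support for the `{,m}` form varies between regex engines, so it may behave differently in other tools.

```python
import re

# Strings of zero to four a's.
strings = ["", "a", "aa", "aaa", "aaaa"]
patterns = [r"a+", r"a*", r"a?", r"a{2,3}", r"a{2,}", r"a{,2}", r"a{3}"]

# Lengths of the strings that each pattern matches in full.
lengths = {p: [len(s) for s in strings if re.fullmatch(p, s)] for p in patterns}

for pattern, ok in lengths.items():
    print(pattern, ok)
```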

RegEx Cheat Sheet 2
Grouping
  ()   everything between the parentheses is treated as one unit
Characters
  \w   all word characters (letters, digits, and the underscore)
  \d   all digits
  \D   everything except digits
  \W   all non-word characters (e.g. punctuation markers and white space)
  \s   white space
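A short string is enough to see what each character class picks up. The sketch below uses Python's `re` module with an invented string. It also shows that the underscore counts as a word character.

```python
import re

# Invented string mixing letters, digits, an underscore, spaces, and punctuation.
text = "Room 42, ok_1!"

word_runs = re.findall(r"\w+", text)       # letters, digits, underscore
digit_runs = re.findall(r"\d+", text)      # digits only
non_digit_runs = re.findall(r"\D+", text)  # everything except digits
non_word_runs = re.findall(r"\W+", text)   # punctuation and spaces
spaces = re.findall(r"\s", text)           # white space characters

print(word_runs, digit_runs, non_digit_runs, non_word_runs, spaces)
```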

RegEx Cheat Sheet 3
Boundaries
  \b   word boundary
Symbols
  \    escape symbol: treats the following special character literally
  |    alternation: matches either the expression on its left or the one on its right

Regular expressions make raw corpora more useful. What are other ways in which raw corpora can be useful?