Topics in Linguistics ENG 331


Topics in Linguistics ENG 331 Dr. Rania Al-Sabbagh Department of English Faculty of Al-Alsun (Languages) Ain Shams University rsabbagh@alsun.asu.edu.eg

What are Regular Expressions?
Regular Expressions (RegEx or RegExp, for short) are a powerful way to do complex searches. With RegEx you can, for instance, find:
  Acronyms
  Rhyming words
  Postal codes
  Phone numbers
  Emails
  Spelling variations
  ... and much more
Week 8

RegEx at Work
Many corpus processors support RegEx search. AntConc is one of them. For illustration purposes, we will use the Brown Corpus. The Brown University Standard Corpus of Present-Day American English (or just the Brown Corpus) was compiled in the 1960s. The original corpus contains 1,023,374 tokens and 41,506 types sampled from 15 text categories, including press, religion, and non-fiction books. To get started, open AntConc and load your Brown Corpus text file. We are using a raw version of the corpus.

RegEx at Work: Finding Acronyms
For a computer, what are acronyms? They are words written entirely in capital letters that are two or more characters long. How can we tell the computer to look for such words?
  \b[A-Z]\b
  \b[A-Z]+\b
  \b[A-Z]*\b
  \b[A-Z]{2,}\b
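To see how the four candidate patterns differ, here is a minimal sketch in Python. It assumes Python's re module behaves like AntConc's RegEx search for these simple patterns, and the sample sentence is made up rather than taken from the Brown Corpus.

```python
import re

# Hypothetical sample sentence, not taken from the Brown Corpus.
text = "A spokesman said the FBI and NATO met UN officials."

single = re.findall(r"\b[A-Z]\b", text)          # exactly one capital letter
one_or_more = re.findall(r"\b[A-Z]+\b", text)    # one or more capital letters
zero_or_more = re.findall(r"\b[A-Z]*\b", text)   # may match nothing at all
two_or_more = re.findall(r"\b[A-Z]{2,}\b", text) # at least two capital letters

print(single)              # ['A']
print(one_or_more)         # ['A', 'FBI', 'NATO', 'UN']
print("" in zero_or_more)  # True: * also produces empty matches
print(two_or_more)         # ['FBI', 'NATO', 'UN']
```

Only the {2,} version leaves out the one-letter word 'A' while still finding every all-caps word.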

Quiz
Write a RegEx to find:
  two-letter acronyms only
  three-letter acronyms only
  long acronyms of at least 4 letters

RegEx at Work: Finding Verb Conjugations
How can we search for all the verb conjugations of begin (i.e., begin, begins, began, begun) in one step?
  \bbeg?n\b
  \bbeg.n\b
  \bbeg+n\b
  \bbeg*n\b
  \bbeg*ns?\b
  \bbeg.ns\b
  \bbeg.ns?\b
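The candidates can be compared side by side with a short Python sketch. The sentence is a made-up example, and Python's re module is assumed to stand in for AntConc's RegEx search.

```python
import re

# Hypothetical sample sentence containing the four target forms plus "beginning".
text = "They begin today, she begins now, he began, it has begun, the beginning."

candidates = [
    r"\bbeg?n\b",    # optional g: would match "ben" or "begn"
    r"\bbeg.n\b",    # any one character between "beg" and "n"
    r"\bbeg+n\b",    # one or more g
    r"\bbeg*n\b",    # zero or more g
    r"\bbeg*ns?\b",  # zero or more g, optional final s
    r"\bbeg.ns\b",   # any one character, then a required final s
    r"\bbeg.ns?\b",  # any one character, then an optional final s
]

# Map each pattern to the list of words it finds in the sample sentence.
results = {pattern: re.findall(pattern, text) for pattern in candidates}

for pattern, found in results.items():
    print(pattern, found)
```

In this sample only the last pattern returns all four forms, and none of the patterns picks up "beginning". Because the dot matches any character, not just a vowel, a stricter variant such as \bbeg[iau]ns?\b is one possible refinement.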

Quiz
Write a RegEx to find the verb conjugations of:
  speak: speak, speaks, spoke, spoken
  fly: fly, flies, flew, flown

RegEx at Work: Finding Spelling Variation
Which RegEx matches ‘colour’, ‘color’, ‘colours’, ‘colors’, ‘colouring’, ‘coloring’?
  colou?rs?(ing)?
  colours?(ing)?
  colou?rs?ing?
Can colorful, colorless, and colored be matched by the same RegEx? If not, how can we modify the RegEx to match them? Is there a more concise way to write your RegEx?
  colou?rs?(ing)?(less)?(ful)?(ed)?
  colou?r(s|ing|less|ful|ed)?
  OR
  colou?r\w*
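A quick way to check these patterns is to test each one against a word list. The sketch below uses Python's re.fullmatch, which requires the whole word to match; the helper name matched is an assumption made for this illustration.

```python
import re

# The word forms discussed on the slide.
words = ["color", "colour", "colors", "colours", "coloring", "colouring",
         "colorful", "colorless", "colored"]

def matched(pattern):
    """Return the words that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matched(r"colou?rs?(ing)?"))  # the six colour/color forms
print(matched(r"colours?(ing)?"))   # British spellings only
print(matched(r"colou?rs?ing?"))    # only the forms ending in "ing"
print(matched(r"colou?r(s|ing|less|ful|ed)?") == words)  # True
print(matched(r"colou?r\w*") == words)                   # True
```

Note that colou?r\w* is the shortest pattern but also the loosest: it accepts any run of word characters after "color" or "colour", so it may return forms you did not ask for.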

Quiz
Write a RegEx to find:
  puppy, puppies
  behavior, behaviour

RegEx at Work: Finding Affixes
Which RegEx matches all words ending with ‘ness’?
  \b[a-zA-Z]+ness\b
  \b[a-zA-Z]*ness\b
Another way to do the same thing is:
  \b\w+ness\b
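The difference between + and * in the suffix patterns shows up when the suffix also occurs as a string on its own. A minimal Python sketch, using a made-up test string:

```python
import re

# Hypothetical test string: two real words plus the bare string "ness".
text = "kindness, happiness and ness"

plus = re.findall(r"\b[a-zA-Z]+ness\b", text)  # at least one letter before "ness"
star = re.findall(r"\b[a-zA-Z]*ness\b", text)  # zero or more letters before "ness"
word = re.findall(r"\b\w+ness\b", text)        # at least one word character before "ness"

print(plus)  # ['kindness', 'happiness']
print(star)  # ['kindness', 'happiness', 'ness']
print(word)  # ['kindness', 'happiness']
```

With *, the bare string "ness" counts as a match because nothing is required in front of the suffix.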

Quiz
Write a RegEx to find words starting with:
  anti
  un
Write a RegEx to find words ending with:
  ation
  ment

RegEx at Work: Finding Rhyming Words
Which RegEx finds words rhyming with ‘duck’? Check all that apply:
  \b[a-zA-Z]uck\b
  \b[a-zA-Z]+uck\b
  \b\w+uck\b

Quiz
Write a RegEx to find words rhyming with:
  soon
  clip
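For the ‘duck’ patterns, a short Python sketch shows what each one returns on a made-up sentence. Keep in mind that these patterns match spelling, not sound, so they only approximate rhyme.

```python
import re

# Hypothetical sample sentence.
text = "The duck had bad luck when the truck got stuck in the muck."

one_letter = re.findall(r"\b[a-zA-Z]uck\b", text)     # exactly one letter before "uck"
many_letters = re.findall(r"\b[a-zA-Z]+uck\b", text)  # one or more letters before "uck"
word_chars = re.findall(r"\b\w+uck\b", text)          # one or more word characters before "uck"

print(one_letter)    # ['duck', 'luck', 'muck']
print(many_letters)  # ['duck', 'luck', 'truck', 'stuck', 'muck']
print(word_chars)    # ['duck', 'luck', 'truck', 'stuck', 'muck']
```

The first pattern is limited to four-letter words, so it misses "truck" and "stuck".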

RegEx at Work: Finding Specific Words
Which RegEx matches all words starting with ‘a’ in both upper and lower cases?
  \b[aA]\w+\b
  \ba\w+\b
  \b[aA]\b
Which RegEx matches ‘This’ at the beginning of sentences?
  .This
  \. \bThis\b
Which RegEx matches ‘this’ at the end of the sentence?
  this.
  this\.
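The patterns on this slide can be tried out with the Python sketch below. Both test strings are made up, and the sketch assumes that the second option for ‘This’ is the single pattern \. \bThis\b (a period, a space, then the word).

```python
import re

# Hypothetical test strings.
words_text = "An apple a day keeps all doctors away."
sentences = "This works. This also works. Is this it? I like this."

either_case = re.findall(r"\b[aA]\w+\b", words_text)  # a or A plus at least one more character
lower_only = re.findall(r"\ba\w+\b", words_text)      # lowercase a only
single_a = re.findall(r"\b[aA]\b", words_text)        # the one-letter word only

any_char_this = re.findall(r".This", sentences)       # unescaped dot: any character
period_this = re.findall(r"\. \bThis\b", sentences)   # literal period, space, This

this_any = re.findall(r"this.", sentences)            # unescaped dot: any character
this_period = re.findall(r"this\.", sentences)        # literal period

print(either_case)    # ['An', 'apple', 'all', 'away']
print(lower_only)     # ['apple', 'all', 'away']
print(single_a)       # ['a']
print(any_char_this)  # [' This']
print(period_this)    # ['. This']
print(this_any)       # ['this ', 'this.']
print(this_period)    # ['this.']
```

Two things to notice: \w+ requires at least one character after the first letter, so the one-letter word "a" is not returned by the first two patterns, and neither ‘This’ pattern finds the very first sentence, since nothing precedes it in the text.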

RegEx at Work: Finding Numbers
Which RegEx matches all digits?
  \d
  \D
Which RegEx returns years (e.g. 1999, 2010)? Check all that apply:
  [0-9]{4}
  [0-9][0-9][0-9][0-9]
  [0-9]
  \b[0-9]\b
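The number patterns can be compared with the Python sketch below, using a made-up sentence. The last pattern, \b[0-9]{4}\b, is not on the slide; it is added here as one possible refinement.

```python
import re

# Hypothetical sample sentence with two years, a single digit, and a longer number.
text = "In 1999 and 2010, 7 of 12345 files were lost."

digits = re.findall(r"\d", text)                       # every single digit
non_digits = re.findall(r"\D", text)                   # every character that is not a digit
four_a = re.findall(r"[0-9]{4}", text)                 # any run of four digits
four_b = re.findall(r"[0-9][0-9][0-9][0-9]", text)     # same thing, spelled out
lone_digit = re.findall(r"\b[0-9]\b", text)            # a digit standing alone
bounded = re.findall(r"\b[0-9]{4}\b", text)            # four digits as a whole token

print(len(digits))  # 14
print(four_a)       # ['1999', '2010', '1234']
print(four_b)       # ['1999', '2010', '1234']
print(lone_digit)   # ['7']
print(bounded)      # ['1999', '2010']
```

Without word boundaries, the four-digit patterns also pick up the first four digits of the longer number 12345.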

RegEx at Work: Finding Punctuation Markers
Which RegEx matches all punctuation markers?
  [!?,.*()&”;’]
  \W+
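The two punctuation patterns behave quite differently, as the Python sketch below suggests. The test string is made up, and the character class is copied from the slide as it appears there.

```python
import re

# Hypothetical test string.
text = "Wait, what? (Yes!)"

listed = re.findall(r"[!?,.*()&”;’]", text)  # only the characters listed in the class
non_word = re.findall(r"\W+", text)          # runs of any non-word characters

print(listed)    # [',', '?', '(', '!', ')']
print(non_word)  # [', ', '? (', '!)']
```

The character class returns one punctuation mark at a time but only those it lists, whereas \W+ returns whole runs of non-word characters, and these runs include spaces.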

RegEx Cheat Sheet 1
Quantifiers
  +      one or more
  *      zero or more
  ?      zero or one
  {n,m}  at least n times and at most m times
  {n,}   at least n times
  {,m}   at most m times (not supported by every RegEx flavor; {0,m} works everywhere)
  {n}    exactly n times
Ranges
  [0-9]  the range of all digits
  [A-Z]  the range of all upper-case letters
  [a-z]  the range of all lower-case letters
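The quantifiers can be checked against strings of repeated letters. This is a minimal Python sketch; the helper name full is an assumption made for the illustration, and {,2} is written in the form that Python's re module accepts.

```python
import re

# Strings of zero to four a's.
words = ["", "a", "aa", "aaa", "aaaa"]

def full(pattern):
    """Return the strings that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(full(r"a+"))      # ['a', 'aa', 'aaa', 'aaaa']
print(full(r"a*"))      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(full(r"a?"))      # ['', 'a']
print(full(r"a{2,3}"))  # ['aa', 'aaa']
print(full(r"a{2,}"))   # ['aa', 'aaa', 'aaaa']
print(full(r"a{,2}"))   # ['', 'a', 'aa']
print(full(r"a{3}"))    # ['aaa']
```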

RegEx Cheat Sheet 2
Grouping
  ()  everything in between is treated as one unit
Characters
  \w  all word characters (letters, digits, and the underscore)
  \d  all digits
  \D  everything except digits
  \W  all non-word characters (e.g. punctuation markers and spaces)
  \s  whitespace characters
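A short Python sketch of the character shorthands and of grouping, using made-up test strings:

```python
import re

# Hypothetical test string.
text = "Room 42, ok_1!"

word_runs = re.findall(r"\w+", text)       # runs of word characters
digit_runs = re.findall(r"\d+", text)      # runs of digits
non_digit_runs = re.findall(r"\D+", text)  # runs of anything except digits
non_word = re.findall(r"\W", text)         # single non-word characters
spaces = re.findall(r"\s", text)           # single whitespace characters

print(word_runs)       # ['Room', '42', 'ok_1']
print(digit_runs)      # ['42', '1']
print(non_digit_runs)  # ['Room ', ', ok_', '!']
print(non_word)        # [' ', ',', ' ', '!']
print(spaces)          # [' ', ' ']

# Grouping: the quantifier applies to the whole group, not just the last letter.
grouped = re.fullmatch(r"(ha)+", "hahaha") is not None    # "ha" repeated
ungrouped = re.fullmatch(r"ha+", "hahaha") is not None    # "h" followed by one or more a's
print(grouped, ungrouped)  # True False
```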

RegEx Cheat Sheet 3
Boundaries
  \b  word boundary
Symbols
  \   escape symbol, used to treat a special character literally
  |   the either-or (alternation) symbol

Regular expressions make raw corpora more useful. In what other ways can raw corpora be useful?