1 Regular Expressions CIS*2450 Advanced Programming Techniques Material for this lectures has been taken from the excellent book, Mastering Regular Expressions,

Slides:



Advertisements
Similar presentations
Session 3BBK P1 ModuleApril 2010 : [#] Regular Expressions.
Advertisements

BBK P1 Module2010/11 : [‹#›] Regular Expressions.
Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)
Regular Expressions in Perl By Josue Vazquez. What are Regular Expressions? A template that either matches or doesn’t match a given string. Often called.
CSCI 330 T HE UNIX S YSTEM Regular Expressions. R EGULAR E XPRESSION A pattern of special characters used to match strings in a search Typically made.
7 Searching and Regular Expressions (Regex) Mauro Jaskelioff.
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
LING/C SC/PSYC 438/538 Computational Linguistics Sandiway Fong Lecture 2: 8/23.
LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23.
Regular Expressions. u A regular expression is a pattern which matches some regular (predictable) text. u Regular expressions are used in many Unix utilities.
Regular Expressions In ColdFusion and Studio. Definitions String - Any collection of 0 or more characters. Example: “This is a String” SubString - A segment.
Scripting Languages Chapter 8 More About Regular Expressions.
Regular Expressions in ColdFusion Applications Dave Fauth DOMAIN technologies Knowledge Engineering : Systems Integration : Web.
Regular Expressions A regular expression defines a pattern of characters to be found in a string Regular expressions are made up of – Literal characters.
Last Updated March 2006 Slide 1 Regular Expressions.
1 Lab Session-III CSIT-120 Fall 2000 Revising Previous session Data input and output While loop Exercise Limits and Bounds Session III-B (starts on slide.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
System Programming Regular Expressions Regular Expressions
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
Globalisation & Computer systems Week 7 Text processes and globalisation part 1: Sorting strings: collation Searching strings and regular expressions Practical:
PHP Workshop ‹#› Data Manipulation & Regex. PHP Workshop ‹#› What..? Often in PHP we have to get data from files, or maybe through forms from a user.
INFO 320 Server Technology I Week 7 Regular expressions 1INFO 320 week 7.
1 Python & Pattern Matching with Regular Expressions (REs) OPIM 101 File:PythonREs.ppt.
WDV 331 Dreamweaver Applications Find and Replace Dreamweaver CS6 Chapter 20.
 Regular expressions are : › A language or syntax that lets you specify patterns for matching e.g. filenames or strings › Used to identify the files.
Web Design-Lecture1-QN-2003 Web Design Web Design Using HTML.
Regular Expression JavaScript Web Technology Derived from:
REGULAR EXPRESSIONS. Lexical Analysis Lexical analysers can be constructed by programs such as LEX These programs employ as input a description of the.
ASP.NET Programming with C# and SQL Server First Edition Chapter 5 Manipulating Strings with C#
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
LING 388: Language and Computers Sandiway Fong Lecture 6: 9/15.
I/O Redirection and Regular Expressions February 9 th, 2004 Class Meeting 4.
VBScript Session 13.
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
Prof. Alfred J Bird, Ph.D., NBCT Door Code for IT441 Students.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Module 6 – Generics Module 7 – Regular Expressions.
Telecooperation Technische Universität Darmstadt Copyrighted material; for TUD student use only Q&A Telecooperation Group TU Darmstadt.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
I/O Redirection & Regular Expressions CS 2204 Class meeting 4 *Notes by Doug Bowman and other members of the CS faculty at Virginia Tech. Copyright
R EGULAR E XPRESSION IN P ERL (P ART 1) Thach Nguyen.
Copyright © Curt Hill Regular Expressions Providing a Search Pattern.
Regular Expressions CS 2204 Class meeting 6 Created by Doug Bowman, 2001 Modified by Mir Farooq Ali, 2002.
CSCI 330 UNIX and Network Programming Unit IV Shell, Part 2.
Prof. Alfred J Bird, Ph.D., NBCT Office – McCormick 3rd floor 607.
Prof. Alfred J Bird, Ph.D., NBCT Door Code for IT441 Students.
Introduction to Programming the WWW I CMSC Winter 2004 Lecture 13.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
CS 101 – Sept. 11 Review linear vs. non-linear representations. Text representation Compression techniques Image representation –grayscale –File size issues.
May 2006CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
ICS611 Lex Set 3. Lex and Yacc Lex is a program that generates lexical analyzers Converting the source code into the symbols (tokens) is the work of the.
Hands-on Regular Expressions Simple rules for powerful changes.
Regular Expressions.
PROGRAMMING THE BASH SHELL PART III by İlker Korkmaz and Kaya Oğuz
Regular Expressions Copyright Doug Maxwell (
RE Tutorial.
INTERNATIONALIZATION
Looking for Patterns - Finding them with Regular Expressions
CSC 594 Topics in AI – Natural Language Processing
BASIC AND EXTENDED REGULAR EXPRESSIONS
Corpus Linguistics I ENG 617
Regular Expression Beihang Open Source Club.
CSC 594 Topics in AI – Natural Language Processing
Advanced Find and Replace with Regular Expressions
Selenium WebDriver Web Test Tool Training
Data Manipulation & Regex
CSCI The UNIX System Regular Expressions
1.5 Regular Expressions (REs)
ADVANCE FIND & REPLACE WITH REGULAR EXPRESSIONS
Presentation transcript:

1 Regular Expressions CIS*2450 Advanced Programming Techniques Material for this lectures has been taken from the excellent book, Mastering Regular Expressions, 2nd Edition by Jeffrey Friedl.

2 Regular Expressions Regular expressions are a powerful tool for text processing. They allow for the description and parsing of text. With additional support, tools employing regular expressions can modify text and data.

3 Example Problems Check many files to confirm that in each file the word SetSize appears exactly as often as ResetSize. Pay attention to lower and upper case. –similar to the problem of checking that your code has as many closing braces (}) as it has opening braces ({) Check any number of files and report any occurrence of double words, e.g. the the –should be case insensitive, i.e. the The –if you were checking HTML code then you would have to disregard HTML tags, e.g. the the

4 Required Reading A particularly interesting view of regular expressions can be found in the article, Marshall McLuhan vs. Marshalling Regular Expressions. A link is available on the website.

5 Unix Tools: egrep Given a regular expression and files to search, egrep attempts to match the regex to each line of each file. egrep '^(From|Search): ' -file egrep breaks the input file into separate text lines. There is no understanding of high-level concepts, such as words.

6 Examples Parsing a mail header egrep '^(From|Subject): ' testmail From: Deb Stacey Subject: test message From: Deb Stacey Subject: html doc From: Deb Stacey Subject: html file From: "Colleen O'Brien" Subject: Me From: Deborah Stacey Subject: PS file (fwd) From: Deb Stacey Subject: PS file

7 Examples The following instance of egrep will match the three letters cat wherever they are. egrep 'cat' file If a file contains the line While on vacation we saw a fat dog on the beach. then egrep 'cat' file will produce the result While on vacation we saw

8 What do the following mean in egrep? ^cat$ ^$ ^

9 The Language of Regular Expressions Character –most text-processing system treat ASCII bytes as characters –recently, Unicode has been introduced and this is a consideration for any tool using regular expressions because this new encoding must be accounted for

10 What is Unicode? Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unlike ASCII, which uses 8 bits (1 byte) for each character, Unicode uses 16 bits. –Thus, it can represent more than 65,000 unique characters. –This is necessary for languages, such as Arabic, Chinese and Japanese, which have needs different for those of most Western European languages which were serviced well by the ASCII representation.

11 Metacharacter a metacharacter has a special meaning in some circumstances within regular expressions a metacharacter is not a metacharacter when it has been escaped the escape character is usually the backslash \ –\* means the character * –\\* means the character \ followed by the metacharacter *

12 Subexpression a subexpression is a part of a larger expression and is usually within parentheses or are the alternatives of a | clause they are particularly important when parsing the meaning of quantifiers

13 Metacharacters ^ (carat) –matches the position at the start of the line $ (dollar) –matches the position at the end of the line. (dot) –matches any one character

14 Metacharacters [ ] (character class) –matches any character listed [^ ] (negated character class) –matches any character not listed | (or, bar) –matches either expression it separates ( ) (parentheses) –used to designate scope

15 Negation A file contains the following: While on vacation we saw a fat dog on the beach. There was also quite a fat duck which went 'Quack, quack'! The beach was in Iraq and so the dog must have been Iraqian. The regex is egrep '[Qq][^u]' file The result is The beach was in Iraq and so the dog must have been Iraqian.

16 Tricky Regex Let's create a file with the following contents: Quack Iraq Qantas Quack What will the following produce? egrep '[Qq][^u]' regfile It will produce Qantas What will the following produce? egrep '[Qq]([^u]|$)' regfile It will produce Iraq Qantas WHY?

17 Match Any Character Match the date: 07/04/76 or or Regex - version 1 07[-./]04[-./]76 Regex - version Version 1 [-./] –the dot is not a metacharacter - it is just the character dot. And the dash is not special if it is first in a class Version 2. –the dots are metacharacters that match any single character

18 Match Any Character if we use egrep ' ' on the file This strange happening was on the following date: 07/04/76. This date could also be written as or but who's counting! My id number is it gives date: 07/04/76. This date could also be written as or but who's counting! My id number is WHY?

19 Alternation Search a file for the word gray or grey. Possible regex's are gr[ea]y grey|gray gr(a|e)y The difference between class ([]) and alternation is that class concerns only single characters while alternatives can be regex's on their own. What would the following do? egrep '(From|Subject|Date): ' -file

20 Optional Items The metacharacter ? appears after an optional character. colou?r matches color and colour. Problem: Match any variation on the date July 1, such as Jul 1st, July first, Jul first, etc. July? (first|1(st)?)

21 Repetition Quantifiers Meta-Char Minimum Required Maximum to Appear ? none one * none no limit + one no limit + means “find one or more of the preceding item” * means “find zero or more of the preceding item”

22 Problem Match all legal variations of the HTML tags This works but allows no optional spaces inside the angle brackets which HTML does. Matches

23 Problem What does match? How about and

24 Case Sensitivity in egrep To make an expression case insensitive egrep -i ' ' file and in Perl it is /i

25 Problem Match all dollar amounts in text including optional cents. \$[0-9]+(\.[0-9][0-9])? But will this match $.49 ? NO -- but will the following work? \$[0-9]*(\.[0-9][0-9])? NO -- what is the alternative? \$[0-9]+(\.[0-9][0-9])?|\$\.[0-9][0-9]

26 Try it yourself To discover why use the following file (name it money): $10.49 $0.49 $.49 $10 $. $ $abc and test the following: egrep '\$[0-9]+(\.[0-9][0-9])?' money egrep '\$[0-9]*(\.[0-9][0-9])?' money egrep '\$[0-9]+(\.[0-9][0-9])?|\$\.[0-9][0- 9]' money