Regular Expressions Comp 2400: Fall 2008 Prof. Chris GauthierDickey.


What are regular expressions?

A regular expression is a compact way to specify patterns in text. It is also a compact way to specify a finite state machine without pictures. Typically it is a sequence of characters and symbols that expands to match a finite or infinite set of text strings.

What do they look like?

Regular characters are part of regular expressions: A-Z, a-z, 0-9, plus other symbols. Some characters are ‘meta-characters’: . ^ $ * + ? { } [ ] \ | ( )

Simple matches

A sequence of characters is matched literally in a regex: abc matches ‘abc’. If you use a regex for searching or matching in a string, it can match that sequence anywhere as a substring: abc would find abc in aaaabcccc starting at the 4th character position.
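
The substring search described above can be tried out in Python. This is a minimal sketch that uses Python's `re` module as a stand-in for the slide's regex engine; the variable names are illustrative only.

```python
import re

# Search for the literal pattern "abc" anywhere in the subject string.
match = re.search(r"abc", "aaaabcccc")

# Python counts positions from 0, so the 4th character is index 3.
start_index = match.start()
matched_text = match.group()

print(start_index)   # 3
print(matched_text)  # abc
```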

[]: your first meta-characters

[] denote a character ‘class’, a set of characters any one of which can be matched.
[abc] matches a, b, or c.
[a-z] matches any single character from a to z.
[0-9] matches any single digit from 0 to 9.
Classes are case sensitive and ranges can be combined: [A-Za-z] matches any letter a to z regardless of case.
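
As a quick illustration of the classes above, the following sketch collects the characters each class picks out of a sample string. The sample text and variable names are made up for the example, and Python's `re` module is assumed to behave like the slide's engine for these simple classes.

```python
import re

text = "Room 42, Floor B"

# Each call returns every single character that the class matches.
lower = re.findall(r"[a-z]", text)       # lowercase letters only
digits = re.findall(r"[0-9]", text)      # digits only
letters = re.findall(r"[A-Za-z]", text)  # letters of either case

print("".join(lower))    # oomloor
print("".join(digits))   # 42
print("".join(letters))  # RoomFloorB
```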

What if you need to match meta-characters inside the character class?

By default, most of them stand for themselves there: [a-z$] matches a-z and $. Special characters like newline are written with a backslash escape: \n matches newline, \t matches tab, and \r\n matches the end of line on Windows (classic Mac OS used a lone \r).
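
A small sketch of both points, again using Python's `re` as an assumed stand-in: `$` inside a class is an ordinary character, and the backslash escapes name the tab and line-ending characters. The sample strings are invented for the example.

```python
import re

# Inside a class, $ is an ordinary character rather than an end-of-line anchor.
in_class = re.findall(r"[a-z$]", "US$ 5 off")

# \t names the tab character and \r\n the Windows line ending.
fields = re.split(r"\t", "name\tsize")
lines = re.split(r"\r\n", "one\r\ntwo")

print(in_class)  # ['$', 'o', 'f', 'f']
print(fields)    # ['name', 'size']
print(lines)     # ['one', 'two']
```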

Introducing sed

Now that we’re starting with regular expressions, we’d like an easy way to test them out. sed, the stream editor, uses regular expressions, among other things, to edit text on the fly using the typical Unix I/O model.
sed -E 's/[a-zA-Z]/1/g'
This replaces every character in the class with 1 (the quotes keep the shell from interpreting the brackets). Try it!
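
If sed is not at hand, the same global substitution can be approximated in Python. The helper name `replace_letters` is hypothetical, and `re.sub` is only a rough counterpart of sed's `s///g`, not the same tool.

```python
import re

def replace_letters(line: str) -> str:
    """Rough counterpart of the sed substitution s/[a-zA-Z]/1/g."""
    # re.sub replaces every match by default, like sed's trailing g flag.
    return re.sub(r"[a-zA-Z]", "1", line)

print(replace_letters("Hi 42!"))  # 11 42!
```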

regexs and sed

Originally, sed supported only basic regular expressions, in which + and ? were not available. They could be represented using \{1,\} and \{0,1\} respectively. POSIX.2 defined extended regular expressions; use the -E flag with sed to get them.

Back to regexs

The ( and ) group characters together. Typically we use grouping with the repetition operators +, *, and ?, alongside the anchors ^ and $.
+ means the preceding character or group is repeated 1 or more times.
* means the preceding character or group is repeated 0 or more times.
? means 0 or 1 of the preceding character or group.
^ anchors the match to the start of the line.
$ anchors the match to the end of the line; it matches the position, not the newline character itself.
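
The operators listed above can be checked one at a time. This is a sketch under the assumption that Python's `re` treats them the same way as extended regular expressions do for these simple cases; the test words are invented.

```python
import re

# + : one or more repetitions of the group
plus = bool(re.fullmatch(r"(ab)+", "ababab"))

# * : zero or more repetitions, so the empty string also matches
star = bool(re.fullmatch(r"(ab)*", ""))

# ? : zero or one occurrence of the preceding character
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# ^ and $ anchor the pattern to the start and end of the line
anchored = (bool(re.search(r"^ab$", "ab")), bool(re.search(r"^ab$", "cab")))

print(plus, star)  # True True
print(optional)    # [True, True, False]
print(anchored)    # (True, False)
```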

Regexs and the longest sequence

Matches always take the longest sequence available: a+ will match all of aaaaaa in one go instead of just the first a (i.e., it won’t match 6 times). Try sed -E 's/a{1,2}/YES/' on the input caaat. What will it return: cYESaat or cYESat?
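
One way to check your answer without sed is the sketch below. It assumes Python's greedy matching agrees with sed's longest-match rule for these particular patterns (the two engines can differ on other patterns, notably alternation); `count=1` mimics a sed substitution without the g flag.

```python
import re

# a+ consumes the whole run in a single match.
runs = re.findall(r"a+", "aaaaaa")

# Replace only the first match, as sed does without the g flag.
first_only = re.sub(r"a{1,2}", "YES", "caaat", count=1)

print(runs)        # ['aaaaaa']
print(first_only)  # cYESat
```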

Examples

[a-z]+ matches any run of characters made up only of the letters a-z: sed -E 's/[a-z]+/1/g'
(car)* matches 0 or more repetitions of car.
unix(es)? matches unix or unixes.
^re will match recount, but not Andre.
re$, on the other hand, will match Andre.
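
The anchoring and optional-group examples can be verified against a small word list. The list itself is made up for illustration, and Python's `re` is assumed to match the slide's behaviour here.

```python
import re

words = ["recount", "Andre", "unix", "unixes", "unixe"]

# ^re only matches when the word starts with "re".
starts_with_re = [w for w in words if re.search(r"^re", w)]
# re$ only matches when the word ends with "re".
ends_with_re = [w for w in words if re.search(r"re$", w)]
# unix(es)? accepts the word with or without the "es" suffix.
unix_forms = [w for w in words if re.fullmatch(r"unix(es)?", w)]

print(starts_with_re)  # ['recount']
print(ends_with_re)    # ['Andre']
print(unix_forms)      # ['unix', 'unixes']
```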

Using { and }

{n,m} is used for bounded repetition, where n and m are integers: n is the minimum number of repetitions and m is the maximum. Leaving out m (but keeping the comma) means there is no upper limit.
{5} means repeat exactly 5 times.
{0,1} means repeat 0 or 1 times.
{1,} means repeat 1 or more times.
{1,5} means repeat 1 to 5 times.

Warnings with bounds

a{3} matches exactly 3 a’s: aaa. a{1,3} matches between 1 and 3 a’s: a, aa, aaa. But if you search aaaa for every match, a{1,3} will match twice: first aaa, then a.
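
The bounds and the double-match warning can be reproduced with `re.findall`, which returns every non-overlapping match from left to right. This is a sketch using Python's `re`; the input strings are chosen only for illustration.

```python
import re

exactly_three = re.findall(r"a{3}", "aaaa")        # one match, one a left over
one_to_three = re.findall(r"a{1,3}", "aaaa")       # matches twice
at_least_two = re.findall(r"a{2,}", "a aa aaaa")   # no upper limit

print(exactly_three)  # ['aaa']
print(one_to_three)   # ['aaa', 'a']
print(at_least_two)   # ['aa', 'aaaa']
```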

More complex regexs

The bar, ‘|’, lets the regex choose between two patterns: a|b means match a or b, and cat|car means match cat or car. How else could you match the above example?
The . matches any character, but by default doesn’t match the end-of-line character: c.t matches c followed by any one character followed by t.
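
A short sketch of alternation and the dot, with one possible answer to the question above (a character class for the final letter). The word list is invented, and Python's `re` is assumed as the engine.

```python
import re

words = ["cat", "car", "cab", "cot", "ct"]

alternation = [w for w in words if re.fullmatch(r"cat|car", w)]
# One possible equivalent: share the common prefix and use a class.
with_class = [w for w in words if re.fullmatch(r"ca[tr]", w)]
# The dot stands for exactly one character, so "ct" does not match.
dot = [w for w in words if re.fullmatch(r"c.t", w)]
# By default the dot does not match a newline.
dot_newline = re.fullmatch(r"c.t", "c\nt")

print(alternation)  # ['cat', 'car']
print(with_class)   # ['cat', 'car']
print(dot)          # ['cat', 'cot']
print(dot_newline)  # None
```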

The anti-class

We can match against all characters not in a class by starting the class with ^: [^a-z] matches anything that’s NOT a-z.
sed -E 's/[^abc]+/NOABC/g'
Given abcdef, this returns abcNOABC.
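
The negated-class substitution above has a direct Python counterpart. The function name `mark_non_abc` is hypothetical; `re.sub` is used as an approximation of the sed command.

```python
import re

def mark_non_abc(line: str) -> str:
    """Replace each run of characters other than a, b, c with NOABC."""
    return re.sub(r"[^abc]+", "NOABC", line)

print(mark_non_abc("abcdef"))  # abcNOABC
```

The whole run def is replaced once rather than three times, because + makes the negated class consume the run in a single match.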

Standard Character Classes

Any of the following names surrounded by [: :] is a standard class: alnum alpha blank cntrl digit graph lower print punct space upper xdigit. A class is used inside a bracket expression, e.g. [[:digit:]].
[:alnum:] in our locale is [0-9A-Za-z].
[:alpha:] is [A-Za-z].
[:blank:] is [ \t].

[:cntrl:] is any control character.
[:digit:] is [0-9].
[:graph:] is any printable character, but not space or space-like things.
[:lower:] is [a-z].
[:print:] is any printable character, including space.
[:punct:] is any printable character that is not a space or an [:alnum:].
[:space:] is [ \t\n\v\f\r].
[:upper:] is [A-Z].
[:xdigit:] is [0-9A-Fa-f].
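
Python's `re` module does not understand the [:name:] syntax, so the sketch below spells out the ASCII equivalents listed above as ordinary bracket expressions. The table `POSIX_ASCII` and the helper `chars_in_class` are hypothetical names introduced only for this illustration, and the ranges assume the plain ASCII locale.

```python
import re

# ASCII equivalents of some POSIX classes, written as bracket expressions.
POSIX_ASCII = {
    "alnum": r"[0-9A-Za-z]",
    "alpha": r"[A-Za-z]",
    "blank": r"[ \t]",
    "digit": r"[0-9]",
    "lower": r"[a-z]",
    "space": r"[ \t\n\v\f\r]",
    "upper": r"[A-Z]",
    "xdigit": r"[0-9A-Fa-f]",
}

def chars_in_class(name: str, text: str) -> str:
    """Return the characters of text that belong to the named class."""
    return "".join(re.findall(POSIX_ASCII[name], text))

print(chars_in_class("xdigit", "0xBEEF-g7"))  # 0BEEF7
print(chars_in_class("upper", "Hello World"))  # HW
```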

Regexs as FSAs

A regular expression is one way to express a Finite State Automaton (or machine): an FSA can be represented using a regex or a graph.
Example regex: a | bd*c
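
To make the correspondence concrete, here is a sketch of one possible automaton for a | bd*c written as a transition table, checked against the regex itself. The state names ("start", "loop", "accept") and the function `fsa_accepts` are invented for this example and are not taken from the slides.

```python
import re

# Transition table for the regex a | bd*c.
# "loop" is the state reached after b, where any number of d's may follow.
TRANSITIONS = {
    ("start", "a"): "accept",
    ("start", "b"): "loop",
    ("loop", "d"): "loop",
    ("loop", "c"): "accept",
}

def fsa_accepts(text: str) -> bool:
    """Run the automaton over text and report whether it ends in accept."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state == "accept"

samples = ["a", "bc", "bdddc", "b", "ab", "bd", ""]
# The hand-built automaton and the regex should agree on every sample.
agree = all(fsa_accepts(s) == bool(re.fullmatch(r"a|bd*c", s)) for s in samples)
print(agree)  # True
```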

Building blocks of FSAs

Together with concatenation (writing one pattern after another), all FSAs can be constructed from two basic building blocks: alternation ‘|’ (regex: a | b) and Kleene star ‘*’ (regex: a*). Q: How can we represent the others?

Exercises

Using sed, write a one-line script to translate a Windows text file into a Unix text file. Note: Windows terminates lines with \r\n, Unix terminates with \n.
Write a regular expression to parse hexadecimal numbers. Hex numbers typically begin with 0x.

Questions

Imagine that you didn’t have +. How could you represent it using the other regex constructs?
Imagine that you didn’t have ?. How could you represent it using other regex constructs?