Regular Expressions. String Matching The problem of finding a string that “looks kind of like …” is common  e.g. finding useful delimiters in a file,

Slides:



Advertisements
Similar presentations
Session 3BBK P1 ModuleApril 2010 : [#] Regular Expressions.
Advertisements

BBK P1 Module2010/11 : [‹#›] Regular Expressions.
CSCI 6962: Server-side Design and Programming Input Validation and Error Handling.
Lex -- a Lexical Analyzer Generator (by M.E. Lesk and Eric. Schmidt) –Given tokens specified as regular expressions, Lex automatically generates a routine.
Regular Expression Original Notes by Song Guo. What Regular Expressions Are Exactly - Terminology a regular expression is a pattern describing a certain.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
CS 898N – Advanced World Wide Web Technologies Lecture 8: PERL Chin-Chih Chang
ISBN Chapter 6 Data Types Character Strings Pattern Matching.
16-Jun-15 Recognizers. 2 Parsers and recognizers Given a grammar (say, in BNF) and a string, A recognizer will tell whether the string belongs to the.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
28-Jun-15 Recognizers. 2 Parsers and recognizers Given a grammar (say, in BNF) and a string, A recognizer will tell whether the string belongs to the.
Regular Expressions Comp 2400: Fall 2008 Prof. Chris GauthierDickey.
Regular expression. Validation need a hard and very complex programming. Sometimes it looks easy but actually it is not. So there is a lot of time and.
1 Overview Regular expressions Notation Patterns Java support.
1 Scanning Tokens. 2 Tokens When a Scanner reads input, it separates it into “tokens”  … at least when using methods like nextInt()  nextInt() takes.
Lesson 3 – Regular Expressions Sandeepa Harshanganie Kannangara MBCS | B.Sc. (special) in MIT.
Last Updated March 2006 Slide 1 Regular Expressions.
Regular Expressions Dr. Ralph D. Westfall May, 2011.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
Pattern matching with regular expressions A common file processing requirement is to match strings within the file to a standard form, e.g. address.
 Text Manipulation and Data Collection. General Programming Practice Find a string within a text Find a string ‘man’ from a ‘A successful man’
Regular Expressions in.NET Ashraya R. Mathur CS NET Security.
PHP Workshop ‹#› Data Manipulation & Regex. PHP Workshop ‹#› What..? Often in PHP we have to get data from files, or maybe through forms from a user.
COMP Parsing 2 of 4 Lecture 22. How do we write programs to do this? The process of getting from the input string to the parse tree consists of.
CIS 451: Regular Expressions Dr. Ralph D. Westfall January, 2009.
Regular Expressions Regular expressions are a language for string patterns. RegEx is integral to many programming languages:  Perl  Python  Javascript.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
Review: Regular expression: –How do we define it? Given an alphabet, Base case: – is a regular expression that denote { }, the set that contains the empty.
CPSC 388 – Compiler Design and Construction Scanners – JLex Scanner Generator.
Python Regular Expressions Easy text processing. Regular Expression  A way of identifying certain String patterns  Formally, a RE is:  a letter or.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 4. Document Search and Regular Expressions.
Regular Expressions CSC207 – Software Design. Motivation Handling white space –A program ought to be able to treat any number of white space characters.
Regular Expressions.
BY Sandeep Kumar Gampa.. What is Regular Expression? Regex in.NET Regex Language Elements Examples Regular Expression API How to Test regex in.NET Conclusion.
Regular Expressions – An Overview Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in.
CS 536 Fall Scanner Construction  Given a single string, automata and regular expressions retuned a Boolean answer: a given string is/is not in.
 2003 Jeremy D. Frens. All Rights Reserved. Calvin CollegeDept of Computer Science(1/8) Regular Expressions in Java Joel Adams and Jeremy Frens Calvin.
REGEX. Problems Have big text file, want to extract data – Phone numbers (503)
Overview A regular expression defines a search pattern for strings. Regular expressions can be used to search, edit and manipulate text. The pattern defined.
When you read a sentence, your mind breaks it into tokens—individual words and punctuation marks that convey meaning. Compilers also perform tokenization.
Regular Expressions. Overview Regular expressions allow you to do complex searches within text documents. Examples: Search 8-K filings for restatements.
Module 6 – Generics Module 7 – Regular Expressions.
Regular Expressions What is this line all about? while (!($search =~ /^\s*$/)) { It’s a string search just like before, but with a huge twist – regular.
Telecooperation Technische Universität Darmstadt Copyrighted material; for TUD student use only Q&A Telecooperation Group TU Darmstadt.
Introduction to Java Java Translation Program Structure
GREP. Whats Grep? Grep is a popular unix program that supports a special programming language for doing regular expressions The grammar in use for software.
May 2008CLINT-LIN Regular Expressions1 Introduction to Computational Linguistics Regular Expressions (Tutorial derived from NLTK)
Lexical Analysis S. M. Farhad. Input Buffering Speedup the reading the source program Look one or more characters beyond the next lexeme There are many.
Perl Day 4. Fuzzy Matches We know about eq and ne, but they only match things exactly We know about eq and ne, but they only match things exactly –Sometimes.
Compiler Construction By: Muhammad Nadeem Edited By: M. Bilal Qureshi.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
An Introduction to Regular Expressions Specifying a Pattern that a String must meet.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
Pattern Matching: Simple Patterns. Introduction Programmers often need to scan a file, directory, etc. for a specific substring. –Find all files that.
OOP Tirgul 11. What We’ll Be Seeing Today  Regular Expressions Basics  Doing it in Java  Advanced Regular Expressions  Summary 2.
Definition of the Programming Language CPRL
Parsing 2 of 4: Scanner and Parsing
Regular Expressions Upsorn Praphamontripong CS 1110
CS 330 Class 7 Comments on Exam Programming plan for today:
Looking for Patterns - Finding them with Regular Expressions
Lecture 19 Strings and Regular Expressions
Chapter 2 Scanning – Part 1 June 10, 2018 Prof. Abdelaziz Khamis.
Week 14 - Friday CS221.
JAVA RegEx Manish Shrivastava 11/11/2018.
Data Manipulation & Regex
Appendix B.1 Lex Appendix B.1 -- Lex.
REGEX.
Lex Appendix B.1 -- Lex.
Presentation transcript:

Regular Expressions

String Matching The problem of finding a string that “looks kind of like …” is common  e.g. finding useful delimiters in a file, checking for valid user input, filtering , … “Regular expressions” are a common tool for this  most languages support regular expressions  in Java, they can be used to describe valid delimiters for Scanner (and other places)

Matching When you give a regular expression (a regex for short) you can check a string to see if it “matches” that pattern e.g. Suppose that we have a regular expression to describe “a comma then maybe some whitespace” delimiters  The string “,” would match that expression. So would “, ” and “, \n”  But these wouldn’t: “,” “,, ” “word”

Note The “finite state machines” and “regular languages” from MACM 101 are closely related  they describe the same sets of characters that can be matched with regular expressions  (Regular expression implementations are sometimes extended to do more than the “regular language” definition)

Basics When we specified a delimiter new Scanner(…).useDelimiter(“,”);  … the “,” is actually interpreted as a regular expression Most characters in a regex are used to indicate “that character must be right here”  e.g. the regex “abc” matches only one string: “abc”  literal translation: “an ‘a’ followed by a ‘b’ followed by a ‘c’”

Repetition You can specify “this character repeated some number of times” in a regular expression  e.g. match “wot” or “woot” or “wooot” … A * says “match zero or more of those” A + says “match one or more of those”  e.g. the regex wo+t will match the strings above  literal translation: “a ‘w’ followed by one or more ‘o’ s followed by a ‘t’ ”

Example Read a text file, using “comma and any number of spaces” as the delimiter Scanner filein = new Scanner( new File(“file.txt”) ).useDelimiter(“, *”); while(filein.hasNext()) { System.out.printf(“(%s)”, filein.next()); } a comma followed by zero or more spaces

Character Classes In our example, we need to be able to match “any one of the whitespace characters” In a regular expression, several characters can be enclosed in […]  that will match any one of those characters  e.g. regex a[123][45] will match these: “a14” “a15” “a24” “a25” “a34” “a35”  “An ‘a’ ; followed by a 1,2, or 3 ; followed by 4 or 5 ”

Example Read values, separated by comma, and one whitespace character: Scanner filein = new Scanner(…).useDelimiter(“,[ \n\t]”); “Whitespace” technically refers to some other characters, but these are the most common: space, newline, tab  java.lang.Character contains the “real” definition of whitespace

Example We can combine this with repetition to get the “right” version  a comma, followed by some (optional) whitespace Scanner filein = new Scanner(…).useDelimiter(“,[ \n\t]*”); The regex matches “a comma followed by zero or more spaces, newlines, or tabs.”  exactly what we are looking for

More Character Classes A character range can be specified  e.g. [0-9] will match any digit A character class can also be “negated,” to indicate “any character except”  done by inserting a ^ at the start  e.g. [^0-9] will match anything except a digit  e.g. [^ \n\t] will match any non-whitespace

Built-in Classes Several character classes are predefined, for common sets of characters . (period): any character  \d : any digit  \s : any space  \p{Lower} : any lower case letter These often vary from language to language.  period is universal, \s is common, \p{Lower} is Java-specific (usually it’s [:lower:])

Examples [A-Z] [a-z]*  title case words (“Title”, “I” :not “word” or “AB”) \p{Upper}\p{Lower}*  same as previous [0-9].*  a digit, followed by anything (“5q”, “2345”, “2”) gr[ea]y  “grey” or “gray”

Other Regex Tricks Grouping: parens can group chunks together  e.g. (ab)+ matches “ab” or “abab” or “ababab”  e.g. ([abc] *)+ matches “a” or “a b c”, “abc “ Optional parts: the question mark  e.g. ab?c matches only “abc” and “ac”  e.g. a(bc+)?d matches “ad”, “abcd”, “abcccd”, but not “abd” or “accccd” … and many more options as well

Other Uses Regular expressions can be used for much more than describing delimiters The Pattern class (in java.util.regex ) contains Java’s regular expression implementation  it contains static functions that let you do simple regular expression manipulation  … and you can create Pattern objects that do more

In a Scanner Besides separating tokens, a regex can be used to validate a token when its read  by using the.next(regex) method  if the next token matches regex, it is returned  InputMismatchException is thrown if not This allows you to quickly make sure the input is in the right form.  … and ensures you don’t continue with invalid (possibly dangerous) input

Example Scanner userin = new Scanner(System.in); String word; System.out.println(“Enter a word:”); try{ word = userin.next(“[A-Za-z]+”); System.out.printf( “That word has %d letters.\n”, word.length() ); } catch(Exception e){ System.out.println(“That wasn’t a word”); }

Simple String Checking The matches function in Pattern takes a regex and a string to try to match  returns a boolean: true if string matches e.g. in previous example could be done without an exception: word = userin.next(); if(matches(“[A-Za-z]+”, word)) { … // a word } else{ … // give error message }

Compiling a Regex When you match against a regex, the pattern must first be analyzed  the library does some processing to turn it into some more-efficient internal format  it “compiles” the regular expression It would be inefficient to do this many times with the same expression

Compiling a Regex If a regex is going to be used many times, it can be compiled, creating a Pattern object  it is only compiled when the object is created, but can be used to match many times The function Pattern.compile(regex) returns a new Pattern object

Example Scanner userin = new Scanner(System.in); Pattern isWord = Pattern.compile(“[A-Za-z]+”); Matcher m; String word; System.out.println(“Enter some words:”); do{ word = userin.next(); m = isWord.matcher(word); if(m.matches() ) {… // a word } else {… // not a word } } while(!word.equals(“done”) );

Matchers The Matcher object that is created by patternObj.matcher(str) can do a lot more than just match the whole string  give the part of the string that actually matched the expression  find substrings that matched parts of the regex  replace all matches with a new string Very useful in programs that do heavy string manipulation