

Text manipulation
Suppose you want to build a web page that always contains the latest sports headlines, collected from several newspaper websites. You might, for example, wish to include the Guardian’s sports headlines on your page.

Adding these headlines manually
You would have to access the source of the Guardian page.

You would then have to find the text which defines the headlines, analyse it, and copy the relevant bits into the HTML for your own web page.

Examining it, we find that the source contains one HTML table for each sport in the list of top stories. Here is the table for the tennis headlines on the page seen earlier:
Murray magic books semi spot Tennis: The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open. Tough home Davis Cup tie for GB More tennis

Here is the text which defines the main tennis headline on the page shown earlier: Murray magic books semi spot Tennis: The biggest win of his career to-date saw Andy Murray stun Robby Ginepri and reach the last four at the Thailand Open.

To get this story onto your own web page, you could copy the relevant HTML segment into the source code for your page. But doing this manually is very labour-intensive. We ought to automate the complete task.

Adding headlines automatically
To add headlines automatically, you would have to write a program which would
– download the source code for the Guardian page
– analyse this source code to extract the appropriate text
– add the relevant text to the source code for your own web page

Adding headlines automatically
Later, we will see how to download page sources from other websites. For now, we will focus on the issue of text analysis.

Regular expressions
Regular expression technology provides a convenient way of searching strings for patterns of interest.

Regular expressions (contd.)
Example regular expression: /ab*c/
This searches the target string for substrings that comprise “an a followed by zero or more instances of b followed by a c”. It will match any of the following substrings: ac, abc, abbc, abbbc, …

Using regular expressions in PHP
Regular expressions are supported in several languages, including PHP. PHP provides a group of pre-defined functions for using them. For now, we will focus on just one of these, the preg_replace function.

The preg_replace function
Format of call: preg_replace (regexp, replacement, subject [, int limit])
This function returns the result of replacing substrings in subject which match regexp with replacement. The number of matching substrings which are replaced is controlled by the optional parameter limit; if limit is omitted, every match is replaced. An example application is on the next slide.

Regular expressions (contd.)
PHP code
<?php
$myString = "xyzacklmabbcpqrabbbbbcstu";
echo "myString is $myString ";
$myString = preg_replace("/ab*c/","_",$myString);
echo "myString is now $myString";
?>
Resultant output is
myString is xyzacklmabbcpqrabbbbbcstu myString is now xyz_klm_pqr_stu

Using the limit parameter in preg_replace
PHP code
<?php
$myString = "xyzacklmabbcpqrabbbbbcstu";
echo "myString is $myString ";
$myString = preg_replace("/ab*c/","_",$myString,1);
echo "myString is now $myString";
?>
Resultant output is
myString is xyzacklmabbcpqrabbbbbcstu myString is now xyz_klmabbcpqrabbbbbcstu

Meta-characters
We have seen that certain characters have a special meaning in regular expressions:
– the example on the last few slides used the * character, which means “0 or more instances of the preceding character or pattern”
These are called meta-characters. Other meta-characters are listed on the next slide.

The meta-characters include:
– the * character, which means “0 or more instances of preceding”
– the + character, which means “1 or more instances of preceding”
– the ? character, which means “0 or 1 instances of preceding”
– the { and } characters, which delimit an expression specifying a range of acceptable occurrences of the preceding character or pattern
Examples:
– {m} means exactly m occurrences of the preceding character/pattern
– {m,} means at least m occurrences of the preceding character/pattern
– {m,n} means at least m, but not more than n, occurrences of the preceding character/pattern
Thus, {0,} is equivalent to *, {1,} is equivalent to +, and {0,1} is equivalent to ?
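The equivalences between the shorthand quantifiers and their brace forms can be checked with preg_replace. The following is only a minimal sketch: the test string and the variable names ($pairs, $subject, $results) are made up for illustration and are not part of the original slides.

```php
<?php
// Each pair holds a shorthand quantifier and its brace-form equivalent.
$pairs = [
    ["/ab*c/", "/ab{0,}c/"],
    ["/ab+c/", "/ab{1,}c/"],
    ["/ab?c/", "/ab{0,1}c/"],
];
$subject = "ac abc abbc abbbc";   // illustrative test string
$results = [];

foreach ($pairs as [$short, $long]) {
    $results[$short] = preg_replace($short, "_", $subject);
    $results[$long]  = preg_replace($long, "_", $subject);
    // Report whether the two forms produced the same string.
    $same = ($results[$short] === $results[$long]) ? "same" : "different";
    echo "$short and $long: $same ({$results[$short]})\n";
}
```

Each pair should report "same", since both members of a pair describe the same number of repetitions.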

Regular expressions (contd.)
Further meta-characters are:
– the ^ character, which matches the start of a string
– the $ character, which matches the end of a string
– the . character, which matches any character except a newline
– the [ and ] characters, which delimit an equivalence class of characters, any one of which can match one character in the target string
– the ( and ) characters, which delimit a group of sub-patterns
– the | character, which separates alternative patterns

Regular expressions (contd.)
Example expression: /^a.*d$/
This matches the entire target string, provided the target string starts with an a, followed by zero or more non-newline characters, and ends with a d. An example application is on the next slide.

Example application
PHP code
<?php
$myString1 = "abcdefghijklmnopqrstuvd";
echo "myString1 is $myString1 ";
$myString1 = preg_replace("/^a.*d$/","_",$myString1);
echo "myString1 is now $myString1 ";
$myString2 = "xabcdefghijklmnopqrstuvd";
echo "myString2 is $myString2 ";
$myString2 = preg_replace("/^a.*d$/","_",$myString2);
echo "myString2 is now $myString2";
?>
Resultant output is
myString1 is abcdefghijklmnopqrstuvd myString1 is now _ myString2 is xabcdefghijklmnopqrstuvd myString2 is now xabcdefghijklmnopqrstuvd

Regular expressions (contd.)
Example expression: /^a.{2,5}d$/
This matches the entire target string, provided the target string starts with an a, followed by between two and five non-newline characters, and ends with a d. An example application is on the next slide.

Regular expressions (contd.)
PHP code
<?php
$myString1 = "adabbbbccccaaaabbbbccccd";
echo "myString1 is $myString1 ";
$myString1 = preg_replace("/^a.{2,5}d$/","_",$myString1);
echo "myString1 is now $myString1 ";
$myString2 = "afghd";
echo "myString2 is $myString2 ";
$myString2 = preg_replace("/^a.{2,5}d$/","_",$myString2);
echo "myString2 is now $myString2";
?>
Resultant output is
myString1 is adabbbbccccaaaabbbbccccd myString1 is now adabbbbccccaaaabbbbccccd myString2 is afghd myString2 is now _

Regular expressions (contd.)
Example regular expression: /(abc){2,5}d/
This matches substrings in the target that comprise “between 2 and 5 repeats of the pattern abc followed by a d”. An example application is on the next slide.

Regular expressions (contd.)
PHP code
<?php
$myString = "klmabcabcabcdpqrabcdklmabcabcabcabcdxyz";
echo "myString is $myString ";
$myString = preg_replace("/(abc){2,5}d/","_",$myString);
echo "myString is now $myString";
?>
Resultant output is
myString is klmabcabcabcdpqrabcdklmabcabcabcabcdxyz myString is now klm_pqrabcdklm_xyz

Regular expressions (contd.)
Example regular expression: /(foo|bar)/
This matches the substrings foo or bar. An example application is on the next slide.

Regular expressions (contd.)
PHP code
<?php
$myString = "abcfoodefbarghi";
echo "myString is $myString ";
$myString = preg_replace("/(foo|bar)/","_",$myString);
echo "myString is now $myString";
?>
Resultant output is
myString is abcfoodefbarghi myString is now abc_def_ghi

Regular expressions (contd.)
Although some characters have special meanings in regular expressions, we may sometimes just want to use them to match themselves in the target string. We do this by escaping them in the regular expression, that is, by preceding them with a backslash \
Example regular expression: /^a\^+.*d$/
This matches the entire target string, provided the target string starts with an a, followed by one or more caret characters, followed by zero or more non-newline characters, and ends with a d. An example application is on the next slide.

Example application
PHP code
<?php
$myString1 = "adabbbbcabbcabced";
echo "myString1 is $myString1 ";
$myString1 = preg_replace("/^a\^+.*d$/","_",$myString1);
echo "myString1 is now $myString1 ";
$myString2 = "a^^^abbbbcabbcabceed";
echo "myString2 is $myString2 ";
$myString2 = preg_replace("/^a\^+.*d$/","_",$myString2);
echo "myString2 is now $myString2";
?>
Resultant output is
myString1 is adabbbbcabbcabced myString1 is now adabbbbcabbcabced myString2 is a^^^abbbbcabbcabceed myString2 is now _

Regular expressions (contd.)
As mentioned earlier, the [ and ] characters have a special meaning in regular expressions:
– they delimit an equivalence class of characters, any one of which may be used to match one character in the target string
Example regular expression: /a[KLM]b/ matches any substring comprising “the letter a followed by one of the three letters KLM, followed by the letter b”

Regular expressions (contd.)
The ^ character has a special meaning when used as the first character between [ and ] characters; this meaning is different from its special meaning when used outside the [ and ] characters:
– when used as the first character between the [ and ] characters, the ^ character specifies the complement of the equivalence class that would have been specified if it were absent
Example regular expression: /a[^KLM]b/ matches any substring comprising “the letter a followed by any single character that is not one of KLM, followed by the letter b”

Regular expressions (contd.)
The - character also has a special meaning when used between [ and ] characters:
– it is used to join the start and end of a sequence of characters, any one of which may be used to match one character in the target string
Example regular expression: /a[0-9]b/ matches any substring comprising “the letter a followed by one digit, followed by the letter b”
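The three example expressions /a[KLM]b/, /a[^KLM]b/ and /a[0-9]b/ can be compared on a single target string. This is a sketch only; the target string and the variable names are invented for illustration.

```php
<?php
$myString = "aKb aXb a7b a-b ab";   // illustrative test string

// [KLM]: exactly one of K, L or M between a and b
$inClass = preg_replace("/a[KLM]b/", "_", $myString);
// [^KLM]: exactly one character that is not K, L or M
$notInClass = preg_replace("/a[^KLM]b/", "_", $myString);
// [0-9]: exactly one digit
$digit = preg_replace("/a[0-9]b/", "_", $myString);

echo "$inClass\n";      // _ aXb a7b a-b ab
echo "$notInClass\n";   // aKb _ _ _ ab
echo "$digit\n";        // aKb aXb _ a-b ab
```

Note that the final ab is never replaced: each class must consume exactly one character, so there has to be something between the a and the b.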

Regular expressions (contd.)
Example regular expression: /%[a-fA-F0-9]/ matches any substring comprising “a % followed by a hexadecimal digit”
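As a small illustration of the hexadecimal-digit class, the sketch below replaces every % that is followed by a hexadecimal digit. The target string and variable names are hypothetical, chosen only to show one non-matching case (%z).

```php
<?php
$myString = "50%a off, 100%z sure, %7E";   // illustrative test string

// A % followed by exactly one hexadecimal digit (either case).
$result = preg_replace("/%[a-fA-F0-9]/", "_", $myString);

echo "myString is $myString\n";
echo "myString is now $result\n";   // 50_ off, 100%z sure, _E
```

Only one hexadecimal digit is consumed per match, which is why the E of %7E survives.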

Regular expressions (contd.)
Certain escape sequences also have a special meaning in regular expressions. They define certain commonly used equivalence classes of characters:
– \w is equivalent to [a-zA-Z0-9_]
– \W is equivalent to [^a-zA-Z0-9_]
– \d is equivalent to [0-9]
– \D is equivalent to [^0-9]
– \s is equivalent to [ \n\t\f\r]
– \S is equivalent to [^ \n\t\f\r]
– \b denotes a word boundary
– \B denotes a non-word boundary
Note the SP characters in the meaning of \s and \S; that is, the white-space equivalence class includes SP. By the way, \f is form feed and \r is carriage return.
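The \b and \B sequences differ from the others in the list in that they match a position rather than a character. A minimal sketch of the difference follows; the target string and variable names are assumptions made for illustration. The patterns are written in single quotes so that PHP passes the backslashes through to the regular expression engine unchanged.

```php
<?php
$myString = "cat concat cats cat.";   // illustrative test string

// \bcat\b: cat as a whole word (a boundary on both sides)
$wholeWord = preg_replace('/\bcat\b/', "_", $myString);
// \Bcat\b: cat at the end of a longer word (no boundary before it)
$wordEnd = preg_replace('/\Bcat\b/', "_", $myString);

echo "$wholeWord\n";   // _ concat cats _.
echo "$wordEnd\n";     // cat con_ cats cat.
```

Because a boundary consumes no characters, the full stop after the last cat is left in place.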

Regular expressions (contd.)
Example regular expression: /%\d\d\d\D/ matches any substring comprising “a % followed by three decimal digits, followed by a non-digit”
Example regular expression: /\s\w\w\s/ matches any substring comprising “a white-space character, followed by two word characters, followed by another white-space character”
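The first of these two expressions can be tried out as follows. This is a sketch under assumptions: the target string and variable names are invented, and the pattern is single-quoted so that the backslashes reach the regular expression engine as written.

```php
<?php
$myString = "%123x %12x %1234 %999.";   // illustrative test string

// A %, exactly three digits, then one character that is not a digit.
$result = preg_replace('/%\d\d\d\D/', "_", $myString);

echo "myString is $myString\n";
echo "myString is now $result\n";   // _ %12x %1234 _
```

The trailing non-digit is part of the match, so the x and the full stop are replaced along with the digits. %12x fails because it has only two digits, and %1234 fails because the character after the third digit is itself a digit.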

Regular expressions (contd.)
PHP code
<?php
$myString = "This is not an apple";
echo "myString is $myString ";
$myString = preg_replace("/\s\w\w\s/","_",$myString);
echo "myString is now $myString";
?>
Resultant output is
myString is This is not an apple myString is now This_not_apple

Regular expressions (contd.)
The standard quantifiers are all “greedy”: they match as many occurrences as possible without causing the pattern to fail. It is possible to make them “frugal”, that is, make them match the minimum number of times necessary. We do this by following the quantifier with a ?
– *? Match 0 or more times, preferably only 0
– +? Match 1 or more times, preferably only 1 time
– ?? Match 0 or 1 time, preferably only 0
– {n}? Match exactly n times
– {n,}? Match at least n times, preferably only n times
– {n,m}? Match at least n but not more than m times, preferably only n times

Regular expressions (contd.)
PHP code
<?php
$myString1 = "abcabcabcabc";
echo "myString1 is $myString1 ";
$myString1 = preg_replace("/(abc){2,5}/","x",$myString1);
echo "myString1 is now $myString1 ";
$myString2 = "abcabcabcabc";
echo "myString2 is $myString2 ";
$myString2 = preg_replace("/(abc){2,5}?/","x",$myString2);
echo "myString2 is now $myString2";
?>
Resultant output is
myString1 is abcabcabcabc myString1 is now x myString2 is abcabcabcabc myString2 is now xx
What is going on here? See the next slide for contrast.

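Regular expressions (contd.)
PHP code
<?php
$myString1 = "abcabcabcabc";
echo "myString1 is $myString1 ";
$myString1 = preg_replace("/(abc){2,5}/","x",$myString1,1);
echo "myString1 is now $myString1 ";
$myString2 = "abcabcabcabc";
echo "myString2 is $myString2 ";
$myString2 = preg_replace("/(abc){2,5}?/","x",$myString2,1);
echo "myString2 is now $myString2";
?>
Resultant output is
myString1 is abcabcabcabc myString1 is now x myString2 is abcabcabcabc myString2 is now xabcabc
Contrast with the previous slide: there, no limit was given, so after the frugal quantifier had matched the first two repeats of abc, preg_replace went on to match the remaining two repeats as well, giving xx. Here, limit is 1, so only the first match is replaced. The greedy version consumes all four repeats, while the frugal version consumes only the minimum two and leaves abcabc behind.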

A digression
Before proceeding to further regexp concepts, let’s apply what we have already seen to HTML manipulation.

Example task
Suppose we have the following HTML
<ul><li>wine</li><li>f12</li><li>cheese</li></ul>
Suppose we want to eliminate from the list any list item whose content comprises only non-digits. That is, we want the HTML to become
<ul><li>f12</li></ul>

Regular expressions (contd.)
PHP code
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";
echo "myString is $myString ";
$myString = preg_replace("/<li>\D+<\/li>/","",$myString);
echo "myString is now $myString ";
?>
Resultant output, rendered by the browser as bulleted lists, is
myString is • wine • f12 • cheese myString is now • f12

Seeing the raw HTML
Suppose we want to see the raw HTML in our output. That is, suppose we wanted to see
myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>f12</li></ul>
We would have to replace all occurrences of < with &lt;
We could use regular expressions for this but
– the string to be replaced is a constant
– so we can use a simpler technology

Regular expressions (contd.)
PHP code
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";
echo "myString is ".str_replace("<","&lt;",$myString)." ";
$myString = preg_replace("/<li>\D+<\/li>/","",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>
Now the resultant output is
myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>f12</li></ul>

Suppose we want to replace the content of every list item with the fixed phrase listItem. That is, we want to see this output
myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>listItem</li><li>listItem</li><li>listItem</li></ul>

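Regular expressions (contd.)
Suppose we try this
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";
echo "myString is ".str_replace("<","&lt;",$myString)." ";
$myString = preg_replace("/<li>.+<\/li>/","<li>listItem</li>",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>
Resultant output is
myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>listItem</li></ul>
What is wrong? The greedy + quantifier lets .+ run from the first <li> all the way to the last </li>, so the three items are matched as one. We need to make the + quantifier ungreedy.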

Regular expressions (contd.)
We must do this
<?php
$myString = "<ul><li>wine</li><li>f12</li><li>cheese</li></ul>";
echo "myString is ".str_replace("<","&lt;",$myString)." ";
$myString = preg_replace("/<li>.+?<\/li>/","<li>listItem</li>",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>
Resultant output is
myString is <ul><li>wine</li><li>f12</li><li>cheese</li></ul> myString is now <ul><li>listItem</li><li>listItem</li><li>listItem</li></ul>

End of digression
Back to regular expressions...

Regular expressions (contd.) -- remembering subpattern matches
When a regular expression is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern.
– Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses
– The sub-patterns are implicitly numbered, starting from 1, and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3
– However, inside a double-quoted PHP string the backslash itself must be escaped, so we must type \\1 or \\2 or \\3 in our regular expressions

Using back-references (contd.)
PHP code
<?php
$myString1 = "klmAklmAAklmABklmBklmBBklm";
echo "myString1 is $myString1 ";
$myString1 = preg_replace("/([A-Z])\\1/","_",$myString1);
echo "myString1 is now $myString1 ";
?>
Resultant output is
myString1 is klmAklmAAklmABklmBklmBBklm myString1 is now klmAklm_klmABklmBklm_klm
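The need to double the backslash comes from PHP's string quoting rather than from the regular expression itself. The sketch below, with illustrative variable names that are not from the slides, writes the same pattern once in double quotes and once in single quotes, where \1 can be typed as is.

```php
<?php
$myString = "klmAklmAAklmABklmBklmBBklm";

// Double-quoted: the backslash must be doubled to reach the regex engine.
$viaDouble = preg_replace("/([A-Z])\\1/", "_", $myString);
// Single-quoted: \1 is passed through as written.
$viaSingle = preg_replace('/([A-Z])\1/', "_", $myString);

echo "$viaDouble\n";   // klmAklm_klmABklmBklm_klm
echo "$viaSingle\n";   // klmAklm_klmABklmBklm_klm
```

Both calls replace only the doubled capitals AA and BB; AB is left alone because the back-reference must repeat exactly what the parenthesised sub-pattern matched.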