Regular expressions are:
  › A language or syntax that lets you specify patterns for matching, e.g., filenames or strings
  › Used to identify the files or lines you want to work with
  › Used inside substitution functions to change the contents of a string

ls 14*
  › * is a wildcard here, not a regex
  › Matches 14 followed by zero or more of any character
ls 14[0-1][0-9]*
  › [0-1] and [0-9] are character classes, as in a regex: each matches a single character from the range 0 to 1 or 0 to 9, respectively
ls 14[0-1][0-9][0-3][0-9]*
  › Matches 6 digits that look like a date in YYMMDD form, mostly
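The same bracket syntax carries over to Perl patterns, with one difference: a shell glob has to cover the whole filename, while a regex matches anywhere unless it is anchored. A minimal sketch, assuming a made-up list of filenames in @files:

```perl
use strict;
use warnings;

# Hypothetical filenames, invented for illustration.
my @files = ('140115.mrc', '141231.mrc', '142001.mrc', '14notes.txt', 'x140115.mrc');

# Regex counterpart of the glob 14[0-1][0-9][0-3][0-9]*
# ^ anchors the match at the start of the name. The glob's trailing *
# needs no counterpart, because an unanchored regex already allows
# anything after the matched part.
my @dated = grep { /^14[0-1][0-9][0-3][0-9]/ } @files;

print "$_\n" for @dated;
```

Without the ^, the name x140115.mrc would also be selected.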

mv [b-z]* $data_scratch
  › An alphabetical class, which depending on your system might match the lowercase letters from b through z, OR a mix of upper and lower case: b C c D d ... Z z
grep 'MIT01$' sysnos.txt
  › Finds lines that end ($) with MIT01
  › ^ can be used to match at the beginning of a line
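The two anchors can be tried side by side in Perl. This is only a sketch: the lines in @lines are invented stand-ins for the contents of sysnos.txt.

```perl
use strict;
use warnings;

# Hypothetical lines standing in for sysnos.txt.
my @lines = ('000123456MIT01', 'MIT01000123456', '000123457MIT50', '000123458MIT01 ');

my @ends_with   = grep { /MIT01$/ } @lines;   # same test as grep 'MIT01$'
my @starts_with = grep { /^MIT01/ } @lines;   # ^ anchors at the beginning

print "ends with MIT01:   @ends_with\n";
print "starts with MIT01: @starts_with\n";
```

The last sample line has a trailing space, so it does not count as ending with MIT01; stray whitespace is a common reason an end-of-line pattern finds fewer lines than expected.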

In vi, you can use regular expressions with the s/// substitution operator
With emacs, use M-x query-replace-regexp
  › Replace $ with MIT01
  › This takes a list of system numbers and makes it valid input to an Aleph service by adding the library code to the end of each line
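The same end-of-line replacement can be sketched in Perl. The system numbers below are invented; the point is only that substituting at $ appends text rather than replacing any.

```perl
use strict;
use warnings;

# Hypothetical system numbers, invented for illustration.
my @sysnos = ('000123456', '000123457');

my @with_code;
for my $sysno (@sysnos) {
    # $ matches at the end of the string, so the substitution
    # adds MIT01 after the last character.
    (my $line = $sysno) =~ s/$/MIT01/;
    push @with_code, $line;
}

print "$_\n" for @with_code;
```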

Look through a MARC file in Aleph sequential format for lines with tag 260
  › L $$aCambridge$$bMIT Press
if ($matched =~ m/^\d{9}\s260.+/) { ... }
  › $matched is the while loop variable holding the line we're working on
  › =~ is the binding operator, used with the matching (m), substitution (s), and translation (tr) operators
  › m// is the pattern matching operator

^     start at the beginning of the line
\d    Perl-speak for the digits character class
{9}   a quantifier: find exactly 9 of \d
\s    Perl-speak for the whitespace character class
260   the MARC tag I'm looking for
.     any character
+     a quantifier: find 1 or more of .
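Perl's /x modifier lets this breakdown live inside the pattern itself as comments. A minimal sketch follows; the two sample lines and their nine-digit system numbers are invented, not taken from a real Aleph file.

```perl
use strict;
use warnings;

# Hypothetical Aleph sequential lines; the system numbers are invented.
my @lines = (
    '000000001 245 L $$aRegular expressions',
    '000000001 260 L $$aCambridge$$bMIT Press',
);

my @found;
for my $line (@lines) {
    if ($line =~ m/ ^       # start at the beginning of the line
                    \d{9}   # exactly nine digits
                    \s      # one whitespace character
                    260     # the MARC tag
                    .+      # one or more of any character
                  /x) {
        push @found, $line;
    }
}

print "260 field: $_\n" for @found;
```

With /x, whitespace and # comments inside the pattern are ignored, so the pattern matches exactly what m/^\d{9}\s260.+/ matches.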


Look for deleted records
  › LDR position 05 is d
  › $my_LDR =~ /LDR L.....d/
Look for e-resource records
  › $my_245 =~ /\$\$h\[electronic resource\]/
Look for OCLC numbers
  › $my_035 =~ /(\(OCoLC\)\d{8,10})/
  › Note the double use of () here: the outer pair captures the match, while the escaped inner pair \( \) matches literal parentheses
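The e-resource and OCLC patterns can be exercised as follows. This is a sketch under assumptions: the field values in $my_245 and $my_035 are invented, and only the two patterns come from the slide.

```perl
use strict;
use warnings;

# Hypothetical field values, invented for illustration.
my $my_245 = '245 L $$aSome title$$h[electronic resource]';
my $my_035 = '035 L $$a(OCoLC)123456789';

# \$ and \[ \] match literal dollar signs and square brackets.
my $is_eresource = $my_245 =~ /\$\$h\[electronic resource\]/ ? 1 : 0;

# The outer () capture into $1; \( and \) match the literal parentheses.
my $oclc = $my_035 =~ /(\(OCoLC\)\d{8,10})/ ? $1 : undef;

print "e-resource: $is_eresource\n";
print "OCLC number: $oclc\n" if defined $oclc;
```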

if ($hash{$tmp} =~ m/SKIP/ || $hash{$tmp} =~ m/NEW/) {
    $new_count++  if (m/ FMT L /);
    $skip_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP/);
    $bre_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Brief/);
    $bks_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Books24x7/);
    $eebo_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EEBO/);
    $epda_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EPDA/);
    $sta_count++  if (m/ FMT L / && $hash{$tmp} =~ m/SKIP STA/);
}

We have a browse index of URLs
An Aleph browse index only sorts the first 69 characters of the field
When we have many URLs from the same site, we need to get the unique part closer to the beginning
Following is an SFX OpenURL from the MARCit! service

url_ver=Z &ctx_ver=Z &ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id= &svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

_id= &url_ver=Z &ctx_ver=Z &ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

$my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;
s is the substitution operator
  › s/pattern/replacement/ replaces whatever the pattern matches with the replacement
Parentheses are used here to group different sections of the pattern, which are then rearranged

$1   The first matched parenthetical section
     ^.*sfx_local\?   From the beginning, anything up to and including sfx_local?
     ? is a special character and is escaped here to get a literal question mark
$2   The 2nd matched parenthetical section
     .*   Any number of any character, until it reaches the next match string

Now change the order from $1$2$3$4 to $1$3$2$4
$3   The 3rd parenthetical section
     rft\.object_id\=\d{1,}\&   rft.object_id= followed by one or more digits and an ampersand
     . is a special character, so it is escaped with \ to match a literal period; = and & are not special in a Perl pattern, so their backslashes are harmless but optional
     {1,} is a quantifier meaning one or more, like +
$4   The 4th and final parenthetical section
     .*$   Any number of any character to the end
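To watch the rearrangement happen, the substitution can be run against a short stand-in URL. Everything in the sample string is an assumption made for illustration (the host sfx.example.edu, the parameters aaa, bbb, ccc, and the object id); only the substitution itself comes from the slide.

```perl
use strict;
use warnings;

# Hypothetical OpenURL: host, parameters, and object id are all invented.
my $my_856 = 'http://sfx.example.edu/sfx_local?aaa=1&bbb=2&rft.object_id=954921332001&ccc=3';

# $1 = everything through "sfx_local?"
# $2 = parameters that come before the object id
# $3 = the object id parameter with its trailing ampersand
# $4 = the rest of the URL
$my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;

print "$my_856\n";
```

After the substitution the object id sits directly after sfx_local?, which puts the part that differs between URLs inside the portion of the field that the browse index sorts on.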

Thesis degree, year, and department are stored in a single free text MARC field 502
We have applied some structure to this, but it has varied over time
In DSpace, we want to get these 3 bits into separate fields, so the note is parsed on the way from MARC to Dublin Core

$MIT = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
  › ? is the zero-or-one quantifier
  › | separates alternatives: match the pattern before it or the one after it
$Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';
  › A few small character classes, to allow for case variation, and an alternation for Department vs. Dept.

$Month = 'January|February|March|April|May|June|July|August|September|October|November|December';
  › Matches any one month name when $Month is used inside a pattern
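A short sketch of $Month in use. The pattern /($Month)\s+(\d{4})/ and the sample text in $note are assumptions for illustration, not patterns shown on the slides.

```perl
use strict;
use warnings;

my $Month = 'January|February|March|April|May|June|July|August|'
          . 'September|October|November|December';

# Hypothetical note text, invented for illustration.
my $note = 'Dept. of Linguistics and Philosophy, February 2004.';

# The parentheses around $Month keep the alternation from spilling
# into the rest of the pattern, and capture the month name as well.
my ($month, $year) = $note =~ /($Month)\s+(\d{4})/;

print "month=$month year=$year\n" if defined $month;
```

Without the parentheses, the | inside $Month would split the whole pattern into alternatives, so only December would be required to be followed by a year.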

/^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
/^Thesis\.    Begin with Thesis.
\s+           1 or more spaces
(\d+)         1 or more digits = $1
\.?           0 or 1 period
\s+           1 or more spaces
([\w\.\s]+)   1 or more word chars, periods, spaces = $2
--
($MIT)        something matching $MIT = $3

/^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
\.?        0 or 1 period
\s+        1 or more spaces
($Dept)?   0 or 1 strings matching $Dept = $4
\s*        0 or more spaces
(.+)$      anything left to the end = $5
/o         An option. Compile the expression only once. The variables $MIT and $Dept are not going to change
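Putting the pieces together, the pattern can be run against a note in this older style. The note in $note is invented to fit the pattern (the transcript does not show a sample for this form), and the variable names receiving the captures are hypothetical.

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# Hypothetical 502 note; the year and wording are invented.
my $note = 'Thesis. 1975. Ph.D.--Massachusetts Institute of Technology. Dept. of Economics.';

my ($year, $degree, $school, $dept_label, $rest);
if ($note =~ /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o) {
    ($year, $degree, $school, $dept_label, $rest) = ($1, $2, $3, $4, $5);
    print "year:       $year\n";
    print "degree:     $degree\n";
    print "department: $rest\n";
}
```

The captures are copied into named variables straight away because $1 through $5 are overwritten by the next successful match.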

Massachusetts Institute of Technology. Dept. of Economics. Thesis Ph.D.
Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis Sc. D.
/^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o
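A sketch of this second pattern at work. The note below follows the second sample, but the year 1965 is an invented value added so that the \d{4} part of the pattern has something to match; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# Modeled on the second sample note; the year 1965 is invented.
my $note = 'Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis 1965. Sc. D.';

# In list context a match returns its captures, $1 through $6 here.
my ($school, $sep, $dept_label, $dept, $year, $degree) =
    $note =~ /^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o;

if (defined $school) {
    print "department: $dept\n";
    print "year:       $year\n";
    print "degree:     $degree\n";
}
```

The department capture keeps its trailing comma, because commas are part of the [\w\s\.,] class; a follow-up substitution would be needed to trim it.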

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics,
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics an Astronautics.
Thesis. (M.S.)--Sloan School of Management,
Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering,
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, February
/^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/o
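One detail of this pattern is easier to see by running it: a greedy (.*) followed by an optional group takes everything to the end of the string, so the optional year group matches nothing. The sketch below compares the pattern with a lazy (.*?) variant. The variant is a variation for comparison, not something from the slides, and the year 1999 in the sample note is invented.

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# Modeled on the first sample note; the year 1999 is invented.
my $note = 'Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1999.';

# Greedy (.*): the sixth group runs to the end of the string.
my @greedy = $note =~ /^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/;

# Variation (not from the slides): lazy (.*?) stops as soon as the
# remainder of the pattern can match, leaving the year for group 8.
my @lazy = $note =~ /^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*?)(,\s+(\d{4}))?\.?$/;

# Array index 5 holds $6 (department text); index 7 holds $8 (year).
printf "degree: %s\n", $greedy[0];
printf "greedy: dept=%s year=%s\n", $greedy[5], defined($greedy[7]) ? $greedy[7] : '(none)';
printf "lazy:   dept=%s year=%s\n", $lazy[5],   defined($lazy[7])   ? $lazy[7]   : '(none)';
```

Either form separates degree, institution, and department; which one suits depends on whether the year is picked out later by a separate step.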

Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; and the Woods Hole Oceanographic Institution),
/^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ([\w\.\s]+)\((($MIT)[\.,]?)?\s*($Dept)?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$/o