 Regular expressions are : › A language or syntax that lets you specify patterns for matching e.g. filenames or strings › Used to identify the files.


1

2 • Regular expressions are:
    › A language or syntax that lets you specify patterns for matching, e.g., filenames or strings
    › Used to identify the files or lines you want to work with
    › Used inside substitution functions to change the contents of a string

3 • ls 14*
    › * is a shell wildcard here, not a regex
    › Matches names that start with 14, followed by zero or more of any character
  • ls 14[0-1][0-9]*
    › [0-1] and [0-9] are character classes, written the same way in shell patterns and in regexes; each matches a single character from the range 0 to 1, or 0 to 9, respectively
  • ls 14[0-1][0-9][0-3][0-9]*
    › 6 digits that look like a date in YYMMDD form, mostly (the classes also admit impossible values such as month 19 or day 39)
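The difference between the shell wildcard and a regex can be sketched in Python, which is used here only as a convenient stand-in: `fnmatch` follows shell-style rules, while `re` follows regex rules. The file names are hypothetical.

```python
import fnmatch
import re

# Hypothetical file names, for illustration only.
names = ["140312_export.mrc", "141231.log", "142001.log",
         "14notes.txt", "150101.log"]

# Shell-style pattern: * on its own means "any run of characters".
loose_hits = [n for n in names if fnmatch.fnmatchcase(n, "14*")]

# Shell-style pattern with character classes for MMDD.
glob_hits = [n for n in names
             if fnmatch.fnmatchcase(n, "14[0-1][0-9][0-3][0-9]*")]

# Regex equivalent: the shell * becomes .* and re.match anchors at the start.
regex_hits = [n for n in names
              if re.match(r"14[0-1][0-9][0-3][0-9].*", n)]

print(loose_hits)
print(glob_hits)
print(regex_hits)
```

The character classes carry over unchanged; only the meaning of `*` differs between the two notations.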

4 • mv [b-z]* $data_scratch
    › An alphabetical class which, depending on your system's locale, might match only the lowercase letters b through z, OR a mix of upper and lower case: b C c D d ... Z z
  • grep 'MIT01$' sysnos.txt
    › Find lines that end ($) with MIT01
    › ^ can be used to match at the beginning of a line
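The two anchors behave the same way in most regex dialects. A minimal sketch in Python, with made-up lines standing in for sysnos.txt:

```python
import re

# Hypothetical contents of sysnos.txt, one entry per list item.
lines = [
    "000123456MIT01",
    "000123457MIT50",
    "MIT01 is the library code",
    "000123458MIT01",
]

# $ anchors the match at the end of the line, like grep 'MIT01$'.
ends_with = [ln for ln in lines if re.search(r"MIT01$", ln)]

# ^ anchors the match at the beginning of the line.
starts_with = [ln for ln in lines if re.search(r"^MIT01", ln)]

print(ends_with)
print(starts_with)
```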

5 • In vi, you can use regular expressions with the s/// substitution operator
  • With emacs, use M-x query-replace-regexp
    › Replace $ with MIT01
    › This takes a list of system numbers and makes it valid input to an Aleph service by adding the library code to the end of each line
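Replacing the end-of-line anchor is the same trick in any tool: `$` matches an empty position, so the "replacement" is really an append. A small Python sketch of the idea, using hypothetical system numbers:

```python
import re

# Hypothetical list of system numbers, one per line.
sysnos = "000123456\n000123457\n000123458\n"

# $ matches the empty position at the end of each line, so substituting
# it appends the library code without touching the digits.
fixed = [re.sub(r"$", "MIT01", line) for line in sysnos.splitlines()]

print("\n".join(fixed))
```

Processing line by line keeps the anchor unambiguous; on a multi-line string, `$` would need the multiline flag and some care around the final newline.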

6 • Look through a MARC file in Aleph sequential format for lines with tag 260
    › 001234567 260 L $$aCambridge$$bMIT Press
  • if ($matched =~ m/^\d{9}\s260.+/) {... }
    › $matched is the while loop variable holding the line we're working on
    › =~ is the binding operator, used with the matching (m), substitution (s), and transliteration (tr) operators
    › m// is the pattern matching operator
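The same pattern can be tried outside Perl. This sketch uses Python's `re` module, which accepts the pattern unchanged; apart from the first line, which comes from the slide, the sample lines are invented.

```python
import re

# Same pattern as the Perl m// above: 9 digits, whitespace, tag 260,
# then at least one more character.
TAG_260 = re.compile(r"^\d{9}\s260.+")

lines = [
    "001234567 260 L $$aCambridge$$bMIT Press",   # from the slide
    "001234567 245 L $$aA hypothetical title",    # invented: wrong tag
    "1234 260 L $$aToo few digits",               # invented: short sysno
]

hits = [ln for ln in lines if TAG_260.search(ln)]
print(hits)
```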

7 • ^ start at the beginning of the line
  • \d Perl-speak for the digits character class
  • {9} a quantifier. Find exactly 9 of \d
  • \s Perl-speak for the whitespace character class
  • 260 the MARC tag I'm looking for
  • . any character
  • + a quantifier. Find 1 or more of the preceding item (here, .)

8 ^      start at the beginning of the line
  \d     Perl-speak for the digits character class
  {9}    a quantifier. Find exactly 9 of \d
  \s     Perl-speak for the whitespace character class
  260    the MARC tag I'm looking for
  .      any character
  +      a quantifier. Find 1 or more of the preceding item (here, .)

9 • Look for deleted records
    › LDR position 05 is d
    › $my_LDR =~ /LDR L.....d/
  • Look for e-resource records
    › $my_245 =~ /\$\$h\[electronic resource\]/
  • Look for OCLC numbers
    › $my_035 =~ /(\(OCoLC\)\d{8,10})/
    › Note the double use of () here: the outer, unescaped pair captures the match, while the escaped \( and \) match literal parentheses
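The escaping in the last two patterns is easier to see with sample data. A hedged sketch in Python, where both field values are invented for illustration:

```python
import re

# Invented field contents, for illustration only.
field_245 = "$$aA hypothetical title$$h[electronic resource]"
field_035 = "$$a(OCoLC)123456789"

# $ and [ ] are metacharacters, so each one is escaped to match literally.
is_eresource = re.search(r"\$\$h\[electronic resource\]", field_245) is not None

# Outer ( ) capture; inner \( \) match the literal parentheses around OCoLC.
match = re.search(r"(\(OCoLC\)\d{8,10})", field_035)
oclc = match.group(1) if match else None

print(is_eresource, oclc)
```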

10 if ($hash{$tmp} =~ m/SKIP/ || $hash{$tmp} =~ m/NEW/) {
       $new_count++ if (m/ FMT L /);
       $skip_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP/);
       $bre_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Brief/);
       $bks_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Books24x7/);
       $eebo_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EEBO/);
       $epda_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EPDA/);
       $sta_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP STA/);
   }

11 • We have a browse index of URLs
   • An Aleph browse index only sorts on the first 69 characters of the field
   • When we have many URLs from the same site, we need to get the unique part closer to the beginning
   • The following is an SFX OpenURL from the MARCit! service

12 • http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

13 • http://owens.mit.edu/sfx_local?rft.object_id=3710000000092335&url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

14 • $my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;
   • s is the substitution operator
     › substitute/this/with this/
   • Parentheses are used here to group different sections of the pattern, and then rearrange them

15 $1                The first matched parenthetical section
   ^.*sfx_local\?    From the beginning, anything up to and including sfx_local?
                     ? is a special character and is escaped here to get a literal question mark
   $2                The 2nd matched parenthetical section
   .*                Any number of any character, until it reaches the next match string

16 • Now change the order from $1$2$3$4 to $1$3$2$4
   $3                          The 3rd parenthetical section
   rft\.object_id\=\d{1,}\&    rft.object_id= followed by one or more digits and an ampersand
                               = and & are escaped with \ here; neither is a regex metacharacter, so the backslashes are harmless but not required
                               {1,} is like +, a quantifier meaning one or more
   $4                          The 4th and final parenthetical section
   .*$                         Any number of any character to the end
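The rearrangement can be reproduced on the OpenURL shown on the earlier slide. This is a sketch in Python rather than Perl: the groups are the same, the replacement uses \1 to \4 in place of $1 to $4, and = and & are left unescaped. The variable names are illustrative.

```python
import re

# The OpenURL from the slide, assembled from its parts for readability.
base = "http://owens.mit.edu/sfx_local?"
params = [
    "url_ver=Z39.88-2004&",
    "ctx_ver=Z39.88-2004&",
    "ctx_enc=info:ofi/enc:UTF-8&",
    "rfr_id=info:sid/sfxit.com:opac_856&",
    "url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&",
    "sfx.ignore_date_threshold=1&",
    "rft.object_id=3710000000092335&",
    "svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&",
]
url = base + "".join(params)

# Four groups: prefix, everything before the object id, the object id
# parameter itself, and everything after it.
pattern = re.compile(r"(^.*sfx_local\?)(.*)(rft\.object_id=\d{1,}&)(.*$)")

# Emit the groups in the order 1, 3, 2, 4 to move the object id forward.
moved = pattern.sub(r"\1\3\2\4", url)

print(moved[:69])
```

After the substitution the object id falls inside the first 69 characters, which is the part the browse index sorts on.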

17 • Thesis degree, year, and department are stored in a single free-text MARC field, 502
   • We have applied some structure to this, but it has varied over time
   • In DSpace, we want to get these 3 bits into separate fields, so the note is parsed on the way from MARC to Dublin Core

18 • $MIT = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
     › ? is the zero-or-one quantifier
     › | separates alternatives: match the pattern before it or the pattern after it
   • $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';
     › A few small character classes, to allow for case variation, and an alternation to allow Department vs. Dept.

19 • $Month = 'January|February|March|April|May|June|July|August|September|October|November|December';
     › Matches any one month name when $Month is used inside a pattern
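These fragments can be checked on their own before they are combined into a larger pattern. A minimal sketch in Python; the helper name `full` and the sample strings are made up for the example.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")


def full(fragment, text):
    """Return True if the whole of text matches the fragment."""
    # The non-capturing group keeps the | alternatives inside the fragment.
    return re.fullmatch("(?:" + fragment + ")", text) is not None


print(full(MIT, "M.I.T."), full(MIT, "M. I. T."))
print(full(DEPT, "Dept. of"), full(DEPT, "department Of"))
print(full(MONTH, "June"), full(MONTH, "Jun"))
```

Wrapping a fragment in parentheses, as the later patterns do with ($MIT) and ($Dept), serves the same purpose as the non-capturing group here.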

20 • /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
   /^Thesis\.     Begin with Thesis.
   \s+            1 or more spaces
   (\d+)          1 or more digits = $1
   \.?            0 or 1 period
   \s+            1 or more spaces
   ([\w\.\s]+)    1 or more word chars, periods, spaces = $2
   --($MIT)       two hyphens, then something matching $MIT = $3

21 • /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
   \.?         0 or 1 period
   \s+         1 or more spaces
   ($Dept)?    0 or 1 strings matching $Dept = $4
   \s*         0 or more spaces
   (.+)$       anything left to the end = $5
   /o          An option. Compile the expression only once. The variables $MIT and $Dept are not going to change
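A worked example may help here. The slides do not show a note in this form, so the sample below is an assumption constructed to fit the pattern; the Python names are illustrative, and precompiling with `re.compile` plays roughly the role of /o.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

# The slide's pattern, with the fragments spliced in by concatenation.
THESIS = re.compile(
    r"^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--(" + MIT + r")\.?\s+("
    + DEPT + r")?\s*(.+)$"
)

# Assumed sample note, built to match the structure the pattern expects.
note = ("Thesis. 1968. Ph.D.--Massachusetts Institute of Technology. "
        "Dept. of Economics.")

m = THESIS.search(note)
year, degree, school, dept_prefix, rest = m.groups()
print(year, "|", degree, "|", dept_prefix, "|", rest)
```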

22 • Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D.
   • Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D.
   • /^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o
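Both sample notes on this slide can be run through the pattern. A sketch in Python with illustrative names; the trailing punctuation that the department group picks up is stripped afterwards, which is a choice made for this example rather than something the slides specify.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

OLD_STYLE = re.compile(
    r"^(" + MIT + r")(\.|,)?\s+(" + DEPT + r")?\s*([\w\s\.,]+)"
    r"\s+Thesis.\s*(\d{4})\.?\s*(.*)$"
)

notes = [
    "Massachusetts Institute of Technology. Dept. of Economics. "
    "Thesis. 1968. Ph.D.",
    "Massachusetts Institute of Technology, Dept. of Civil Engineering, "
    "Thesis. 1965. Sc. D.",
]

parsed = []
for note in notes:
    m = OLD_STYLE.search(note)
    # Groups 4, 5 and 6 hold the department name, year and degree.
    parsed.append((m.group(4).rstrip(".,"), m.group(5), m.group(6)))

print(parsed)
```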

23 • Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1973.
   • Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics an Astronautics.
   • Thesis. (M.S.)--Sloan School of Management, 1983.
   • Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 1951.
   • Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, February 2004.
   • /^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/o
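One detail of this pattern is worth trying out: (.*) is greedy, so in the pattern as transcribed it can absorb the trailing year before the optional year group gets a chance. The sketch below, in Python, compares the transcribed form with a lazy (.*?) variant; the lazy variant is an assumption of this example, not something shown on the slides, and the separator is written as --? on the assumption that the space in the transcript is a line-wrap artifact.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"

HEAD = (r"^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?((" + MIT
        + r")[\.,]?)?\s*(" + DEPT + r")?\s*")
TAIL = r"(,\s+(\d{4}))?\.?$"

as_transcribed = re.compile(HEAD + r"(.*)" + TAIL)   # greedy remainder
lazy_variant = re.compile(HEAD + r"(.*?)" + TAIL)    # assumed alternative

note = ("Thesis (Ph. D.)--Massachusetts Institute of Technology, "
        "Dept. of Aeronautics and Astronautics, 1973.")

greedy = as_transcribed.search(note)
lazy = lazy_variant.search(note)

# Group 1 is the degree, group 6 the remainder, group 8 the year.
print(greedy.group(1), "|", greedy.group(6), "|", greedy.group(8))
print(lazy.group(1), "|", lazy.group(6), "|", lazy.group(8))

sloan = lazy_variant.search("Thesis. (M.S.)--Sloan School of Management, 1983.")
print(sloan.group(1), "|", sloan.group(6), "|", sloan.group(8))
```

If the year is wanted in its own group, the lazy form is one way to get it; the original code may well have handled the year in a later step instead.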

24 • Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; and the Woods Hole Oceanographic Institution), 2013.
   • /^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ([\w\.\s]+)\((($MIT)[\.,]?)?\s*($Dept)?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$/o
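Trying the joint-program pattern on the sample note shows one thing to watch for. As transcribed, the class [\w\.\s] for the program name does not include a slash, so "Oceanography/Applied" stops the match. The sketch below, in Python, checks both the transcribed class and a variant with / added; the variant is an assumption of this example, and the transcript may simply have lost a character.

```python
import re

MIT = r"Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\."
DEPT = r"[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f"


def joint_pattern(program_class):
    """Build the joint-program pattern with the given program-name class."""
    return re.compile(
        r"^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ("
        + program_class + r"+)\(((" + MIT + r")[\.,]?)?\s*(" + DEPT
        + r")?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$"
    )


note = ("Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean "
        "Science and Engineering (Massachusetts Institute of Technology, "
        "Dept. of Earth, Atmospheric, and Planetary Sciences; and the "
        "Woods Hole Oceanographic Institution), 2013.")

transcribed = joint_pattern(r"[\w\.\s]").search(note)
with_slash = joint_pattern(r"[\w\.\s/]").search(note)   # assumed variant

print(transcribed)
# Group 1 is the degree, group 4 the program name, group 10 the year.
print(with_slash.group(1), "|", with_slash.group(4).strip(), "|",
      with_slash.group(10))
```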

25 orbitee@mit.edu

