

Matching in list context (Chapter 11)

    @array = ($str =~ /pattern/);

This stores the list of the special capturing variables ($1, $2, ...) into @array, but only if there are grouped expressions in the pattern to capture matches. If there are no grouped expressions, either (1) or () is assigned to @array, depending upon whether the match succeeds or not. The following results in ("cat chow", "cat", "chow") being assigned to @array:

    @array = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
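The three cases described above (captures, a successful match without groups, a failed match) can be checked with a short script. This is a minimal sketch; the variable names @captures, @hit and @miss are chosen here for illustration and are not from the slides.

```perl
use strict;
use warnings;

# Three grouped expressions: the list holds $1, $2 and $3.
my @captures = ("Purina cat chow" =~ /((cat|dog|ferret) (food|chow))/);
print join(", ", @captures), "\n";              # cat chow, cat, chow

# No grouped expressions: a successful match gives the list (1) ...
my @hit  = ("Purina cat chow" =~ /cat/);
# ... and a failed match gives the empty list ().
my @miss = ("Purina cat chow" =~ /iguana/);
print scalar(@hit), " ", scalar(@miss), "\n";   # 1 0
```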

The g command modifier causes matching to be done globally -- it doesn't quit after finding the first match.

    @array = ($str =~ /pattern/g);

Use global matching only when there are no grouped expressions in the pattern (with groups, the list holds the captured groups from every match rather than the whole matches). The following results in the list ("an ", "amp") being assigned to @array:

    @array = ("an example" =~ /a../g);

In contrast, the following would result in the one-element list ("an ") being assigned:

    @array = ("an example" =~ /(a..)/);
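A small sketch of the difference between global and non-global matching in list context. The third statement is an extra illustration (not from the slides) of what happens when /g is combined with grouped expressions; the variable names and the "k1=v1;k2=v2" string are made up for this example.

```perl
use strict;
use warnings;

# Global match, no groups: every whole match is returned.
my @whole = ("an example" =~ /a../g);            # ("an ", "amp")

# Non-global match with one group: only the first capture is returned.
my @first = ("an example" =~ /(a..)/);           # ("an ")

# Global match WITH groups: the captures of every match are returned,
# flattened into one list, instead of the whole matches.
my @pairs = ("k1=v1;k2=v2" =~ /(\w+)=(\w+)/g);   # ("k1", "v1", "k2", "v2")

print join("|", @whole), "\n";
print join("|", @first), "\n";
print join("|", @pairs), "\n";
```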

The following statement parses out all of the HTML tags and stores this list (" ", " ") in = (" Title " =~ / /g);

Suppose $document is a (perhaps long) string that contains some text document, and suppose we want to pull out all the social security numbers from the document. If we assume social security numbers consist of three digits, a dash, two digits, a dash, and four digits, then a solution is

    @array = ($document =~ /\d{3}-\d{2}-\d{4}/g);

But what if the social security numbers are inconsistent in that some are missing the dashes? Then the pattern has to allow for the missing dashes.
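The slide's own tag-parsing statement and its solution for the dash-less case did not survive, so the following is only one possible sketch. The sample strings, the patterns /<[^>]+>/ and /\b\d{3}-?\d{2}-?\d{4}\b/, and the names @tags and @ssns are assumptions made for illustration. Each dash is made optional with ?, and the \b anchors (an addition here) keep the pattern from matching inside longer runs of digits.

```perl
use strict;
use warnings;

# Hypothetical input: pull every tag out of a small HTML fragment.
my @tags = ("<h1>Title</h1>" =~ /<[^>]+>/g);
print join(" ", @tags), "\n";      # <h1> </h1>

# Hypothetical document with consistent and dash-less numbers.
my $document = "Ann: 123-45-6789, Bob: 987654321, Cy: 555-12-3456.";

# -? makes each dash optional.
my @ssns = ($document =~ /\b\d{3}-?\d{2}-?\d{4}\b/g);
print join(" ", @ssns), "\n";      # 123-45-6789 987654321 555-12-3456
```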

Two very useful functions that take patterns and return lists.

split(pattern, string)
    Returns a list consisting of the fields (the substrings not used in any matches) between successful matches of the pattern against the string. Trailing empty fields are omitted.

split(pattern, string, limit)
    Returns a list with at most limit fields.

grep(pattern, list)
    Returns a list consisting of those elements in the given list which successfully matched the pattern. (The name grep comes from the ed editor command g/re/p: globally search for a regular expression and print.)

We have used split often, even in the decoding routine, where we split about a one-character delimiter. A string with more complicated delimiting patterns can also be split. In the following case, where $str holds the string to be split, a delimiter is one or more colons.

    @array = split(/:+/, $str);
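Since the slide's sample string was lost, here is a minimal sketch with a made-up string ("red:green:::blue::") and made-up variable names. It shows the one-or-more-colons delimiter, the dropping of trailing empty fields, and the three-argument form with a limit.

```perl
use strict;
use warnings;

my $str = "red:green:::blue::";

# One or more colons count as a single delimiter;
# the trailing empty field is omitted.
my @colors = split(/:+/, $str);
print scalar(@colors), ": ", join(",", @colors), "\n";   # 3: red,green,blue

# With a limit of 2, splitting stops after the first delimiter
# and the rest of the string is returned as the last field.
my @two = split(/:+/, $str, 2);
print join(" | ", @two), "\n";                           # red | green:::blue::
```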

grep is different from split in that you send it an array rather than a string. It "filters" the array based upon the regular expression; that is, only those array elements which match the pattern are returned. Suppose an array contains some large number of named Web addresses. One simple call to grep can filter out only those addresses in the ".edu" domain. Note: The period has to be escaped in the pattern since it is a metacharacter.
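The slide's own grep call was lost, so the following is a sketch under assumptions: the array @sites, its contents, and the result name @edu are invented for illustration. The $ anchor is an addition here; without it, an address such as "www.education.com" would also pass the filter because it contains ".edu".

```perl
use strict;
use warnings;

# Hypothetical list of Web addresses.
my @sites = ("www.mit.edu", "www.perl.org", "cs.cmu.edu", "www.education.com");

# \. matches a literal period; $ requires ".edu" at the end of the name.
my @edu = grep(/\.edu$/, @sites);

print join(" ", @edu), "\n";    # www.mit.edu cs.cmu.edu
```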

Example: Analyzing log files. A typical HTTP access log. See accesslog.txt.

The 10 different fields are actually standard. Results when we split the first line around delimiting whitespace:

    @field = split(/\s+/, $line);

FIELD        First Line               Meaning
$field[0]                             Address (either IP or name) of client
$field[1]    -                        Not used anymore
$field[2]    -                        Not used anymore
$field[3]    [09/Nov/2001:10:34:01    Date and time
$field[4]    -0600]                   Time zone
$field[5]    "GET                     Request method
$field[6]    /                        Relative part of URL (here the site root)
$field[7]    HTTP/1.1"                HTTP version
$field[8]    200                      Status code (success code or error code)
$field[9]    16058                    Bytes transferred
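The split can be tried on a single log line. The client address in the slide's example was lost, so this sketch substitutes 192.0.2.1 (an address reserved for documentation); the rest of the line is assembled from the field values listed above.

```perl
use strict;
use warnings;

# Sample line; the client address 192.0.2.1 is a stand-in.
my $line = '192.0.2.1 - - [09/Nov/2001:10:34:01 -0600] "GET / HTTP/1.1" 200 16058';

# Split around runs of whitespace.
my @field = split(/\s+/, $line);

# Print each field next to its index.
for my $i (0 .. $#field) {
    print "\$field[$i]  $field[$i]\n";
}
```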

Log file analysis can get very elaborate and there are many commercial and free software packages available for that. For a simple example, we count the total number of hits (lines in the access log) and the total number of unique hits (different IP addresses). Notice that requesting one page can result in numerous lines in the access log since all of the image transfers are separate HTTP transactions. (Some hit counters you find actually report the number of lines in the file!) Counting lines is easy. To count the number of unique IP addresses, we add IP addresses to a hash as the keys. Thus a new hash entry can only originate from a new IP address. We then count the number of keys in the hash. See source file hitcount.pl
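The counting idea can be sketched as follows. This is not the contents of hitcount.pl, which is not shown in the slides; the in-memory @log array, its three sample lines, and the names %seen, $hits and $unique are assumptions made so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical access-log lines (documentation addresses as clients).
my @log = (
    '192.0.2.1 - - [09/Nov/2001:10:34:01 -0600] "GET / HTTP/1.1" 200 16058',
    '192.0.2.1 - - [09/Nov/2001:10:34:02 -0600] "GET /logo.gif HTTP/1.1" 200 2048',
    '198.51.100.7 - - [09/Nov/2001:10:35:10 -0600] "GET / HTTP/1.1" 200 16058',
);

my $hits = 0;
my %seen;                                     # keys are client addresses
foreach my $line (@log) {
    $hits++;                                  # every line is one hit
    my ($client) = split(/\s+/, $line);       # first field only
    $seen{$client} = 1;                       # repeats reuse the same key
}
my $unique = scalar(keys %seen);

print "hits: $hits, unique: $unique\n";       # hits: 3, unique: 2
```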

The substitution operator

    $scalar_variable =~ s/pattern/replacement_string/command_modifiers;

The binding operator "binds" the substitution onto the string. The substitution operator s/// takes two arguments (in contrast to the match operator m//). It attempts to find a match for the pattern in the $scalar_variable, and if successful, replaces the match with the replacement_string. Thus, the scalar variable is altered if a successful match is found. In contrast, the match operator does not alter the string onto which it is bound.

The following attempts to replace "the" with "my".

    $str = "the cat in the hat";
    $str =~ s/the/my/;

This causes $str to contain "my cat in the hat". By default, only the left-most occurrence is replaced. The g (global) command modifier causes substitutions to be made globally.

    $str = "the cat in the hat";
    $str =~ s/the/my/g;

This causes $str to contain "my cat in my hat".

The following results in $str having the value "puppy ferret category" (non-global substitution).

    $str = "puppy dog category";
    $str =~ s/(cat|dog)/ferret/;

A similar global substitution results in $str containing "puppy ferret ferretegory".

    $str = "puppy dog category";
    $str =~ s/(cat|dog)/ferret/g;

The following replaces all whitespace characters with the empty string, resulting in $str containing "hello".

    $str = "h e l l o";
    $str =~ s/\s//g;
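The substitution examples can be run side by side. This sketch copies the string before substituting so that the original stays available for comparison, and it also shows that s/// returns the number of substitutions it made; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "the cat in the hat";

# Copy first, then substitute on the copy; $str itself is untouched.
(my $once = $str) =~ s/the/my/;      # my cat in the hat
(my $all  = $str) =~ s/the/my/g;     # my cat in my hat

# s/// returns the number of substitutions made.
my $spaced  = "h e l l o";
my $removed = ($spaced =~ s/\s//g);  # $spaced is now "hello"

print "$once\n$all\n$spaced ($removed removed)\n";
```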

Captured matches can actually be included in the replacement string.

    $str = "puppy dog category";
    $str =~ s/(\w+)/$1s/g;

This results in $str having the value "puppys dogs categorys". There is only one set of grouping parentheses used in this example, so we only need to use $1. As each match is found, $1 is assigned that new match. Thus, $1 may be reused several times during a global substitution.
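A short sketch of captures in the replacement string. The first statement repeats the slide's example, written as ${1}s to make the boundary between the variable and the literal "s" explicit. The second, with two groups and the string "Hopper, Grace", is an added illustration not taken from the slides.

```perl
use strict;
use warnings;

# One group, reused for every match of the global substitution.
my $str = "puppy dog category";
$str =~ s/(\w+)/${1}s/g;
print "$str\n";                     # puppys dogs categorys

# Two groups: $1 and $2 can be reordered in the replacement.
my $name = "Hopper, Grace";
$name =~ s/(\w+), (\w+)/$2 $1/;
print "$name\n";                    # Grace Hopper
```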

The transliteration operator

    $scalar =~ tr/search_characters/replacement_characters/;

This replaces the search characters with the corresponding replacement characters. It's usually used with single characters.

    $str = "the cat in the hat";
    $str =~ tr/a/u/;

The result is "the cut in the hut". Transliteration can be done using substitutions, but tr automatically does global substitutions and works on literal characters rather than patterns, which means you don't have to escape regular expression metacharacters.
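A minimal sketch of tr in use. Beyond the slide's example it shows a character range, a period used without escaping, and the fact that tr returns the number of characters it matched (with an empty replacement list the string is left unchanged). The variable names and the address string are illustrative.

```perl
use strict;
use warnings;

my $str = "the cat in the hat";

(my $u     = $str) =~ tr/a/u/;       # the cut in the hut
(my $upper = $str) =~ tr/a-z/A-Z/;   # THE CAT IN THE HAT

# The period is an ordinary character to tr: no backslash needed.
(my $dashed = "192.0.2.1") =~ tr/./-/;   # 192-0-2-1

# Empty replacement list: nothing changes, but the count is returned.
my $t_count = ($str =~ tr/t//);      # 4

print "$u\n$upper\n$dashed\n$t_count\n";
```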

Example: Inspired by news sites which display parts of stories and provide links pointing to the full stories. See partialcontent.cgi

Each story is a text file (.news). Paragraphs must be separated by at least one blank line (\n\n). The program reads the directory and prints the first two paragraphs of only the .news files.

Acquiring only the .news stories from the directory is straightforward, especially with the power of grep: we open the directory with opendir (handle D), read its entries with readdir(D), and use grep to keep only the names ending in .news. We then loop over the .news files and process each one. For each $file, if open(STORY, "$storyDataDir$file") succeeds, we read the lines of the story from STORY, close(STORY), and join the whole story into one string, $story.
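The directory-reading steps can be sketched as below. This is not the code of partialcontent.cgi, most of which did not survive in the transcript. To keep the example self-contained it first creates a temporary directory with three made-up files; the names @newsFiles and %storyFor, the lexical handles, and the file contents are assumptions. Only $storyDataDir and $file follow the names visible in the slide.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a throwaway story directory (hypothetical files).
my $storyDataDir = tempdir(CLEANUP => 1) . "/";
for my $name ("a.news", "b.news", "notes.txt") {
    open(my $out, ">", "$storyDataDir$name") or die "cannot write $name: $!";
    print $out "First paragraph of $name.\n\nSecond paragraph.\n\n";
    close($out);
}

# Read the directory and keep only the .news files.
opendir(my $dir, $storyDataDir) or die "cannot open $storyDataDir: $!";
my @newsFiles = grep(/\.news$/, readdir($dir));
closedir($dir);
@newsFiles = sort @newsFiles;

# Slurp each story and join its lines into one string.
my %storyFor;
foreach my $file (@newsFiles) {
    if (open(my $story_fh, "<", "$storyDataDir$file")) {
        my @lines = <$story_fh>;
        close($story_fh);
        $storyFor{$file} = join("", @lines);
    }
}

print join(" ", @newsFiles), "\n";    # a.news b.news
```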

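We can then extract all of the paragraphs with one global match:

    @array = ($story =~ /((.|\n)+?\n\s*\n)/g);

Because the pattern contains two sets of grouping parentheses, each match adds two elements to the list: $1 (the paragraph) and $2 (the last character matched by the inner group). The paragraphs are therefore the elements at even indices. It's then trivial to print the first two paragraphs. But the pattern certainly needs clarification. First we need to identify the space between paragraphs.

    \n\s*\n     ## matches one or more consecutive blank lines

That is, two newline characters with zero or more whitespace characters in between. Since quantifiers are greedy, the pattern will not stop after finding the first in a sequence of blank lines.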

Now we match paragraph content.

    (.|\n)+     ## one or more of any character
                ## (the wildcard . doesn't match \n characters)

Now the whole pattern which matches a paragraph.

    /(.|\n)+?\n\s*\n/   ## one or more of anything,
                        ## then blank line(s)

Notes: One might be tempted to identify paragraphs as one or more wildcard characters (.+). But that would miss parts of paragraphs containing an inadvertent hard return (\n) between sentences. The extra metacharacter (?) specifies non-greedy matching. Otherwise, the pattern would not stop after the first paragraph.
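The paragraph pattern can be tried on a small made-up story. In this sketch the first match uses non-capturing groups, (?:...), so that the list holds whole paragraphs only; that is a substitution for the slide's capturing parentheses, not the slide's own code. The second match keeps the slide's pattern to show how many elements it returns. The story text and variable names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical story: three paragraphs, the first containing a hard return.
my $story = "Para one line one.\nPara one line two.\n\n"
          . "Para two.\n\n\n"
          . "Para three.\n\n";

# Non-capturing groups: one list element per paragraph.
my @paragraphs = ($story =~ /(?:.|\n)+?\n\s*\n/g);
print scalar(@paragraphs), "\n";        # 3
print $paragraphs[0], $paragraphs[1];   # the first two paragraphs

# The slide's pattern has two capturing groups, so each match
# contributes two elements ($1 and $2) to the list.
my @with_groups = ($story =~ /((.|\n)+?\n\s*\n)/g);
print scalar(@with_groups), "\n";       # 6
```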

There are still two subtle pitfalls regarding the structure of the news files.

First, a sequence of two or more blank lines (\n\n\n or longer) at the beginning of the file will cause the first \n to be matched as the first paragraph. (That is not a problem for multiple blank lines between paragraphs, since \n\s*\n is greedy.)

Second, if there are no blank lines after the last paragraph in the file, the last paragraph will not be matched (hence not captured). That doesn't affect this application as long as there are three or more paragraphs in a file.

How would you fix those problems?
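One possible answer, offered as a sketch rather than as the author's intended solution: strip leading whitespace before matching, and let a paragraph end either at blank line(s) or at the end of the string (\z). The story text and variable names are made up for the example, and non-capturing groups are used so that the list holds whole paragraphs.

```perl
use strict;
use warnings;

# Hypothetical story with both problems: leading blank lines,
# and no blank line after the last paragraph.
my $story = "\n\n\nPara one.\n\nPara two.\n\nPara three.";

# Fix 1: remove whitespace (including newlines) at the very start.
$story =~ s/^\s+//;

# Fix 2: a paragraph ends at blank line(s) OR at the end of the string.
my @paragraphs = ($story =~ /(?:.|\n)+?(?:\n\s*\n|\z)/g);

print scalar(@paragraphs), "\n";    # 3
print $paragraphs[2], "\n";         # Para three.
```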