Unix Programming Environment Part 3-4: Regular Expression and Pattern Matching


Unix Programming Environment
Part 3-4: Regular Expression and Pattern Matching
Prepared by Xu Zhenya
Draft: Xu Zhenya (2002/10/01)
Rev1.0: Xu Zhenya (2002/10/09)

Unix Programming Environment, Dept. of CSE, BUAA

Agenda
  1. An Introduction to Regular Expression in UNIX
       - BRE & ERE, GNU-RE
  2. grep, egrep, fgrep
  3. sed
  - Chapter 4 (4.1 & 4.2)

An Introduction to Regular Expression (1)
  - UNIX commands combined with REs allow us to perform three tasks:
      - Pattern matching: search for a particular pattern. The RE specifies the pattern.
          - UNIX commands that search: ed, ex, sed, vi, grep, awk
      - Modification: search for a particular pattern and change it. The RE specifies the pattern and sometimes how to change it.
          - UNIX commands that search and modify: ed, ex, sed, vi
      - Programming:
          - awk provides a programming language which can use REs.
          - Perl, Python, Tcl, etc.
          - lex (flex in GNU tools)
  - POSIX defines a standard regular expression library, included in libc: regcomp, regexec…
  - Three different regular expression definitions:
      - The shells: for filename/pathname expansion
      - Simple/Basic (BRE): grep, sed, vi
      - Extended (ERE): egrep, awk, perl, etc.
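The difference between the shell's pattern language and a regular expression can be sketched with the same two characters, a*. This is only an illustration: the helper name matches_glob and the sample words are made up for the example.

```shell
# Shell pattern (glob): a* means "starts with a"; * stands for any string.
matches_glob() {
  case $1 in
    a*) echo yes ;;
    *)  echo no ;;
  esac
}
glob_apple=$(matches_glob apple)
glob_banana=$(matches_glob banana)

# Regular expression: a* means "zero or more a's", so it matches every line,
# including lines that contain no "a" at all.
regex_all=$(printf '%s\n' apple banana xyz | grep -c 'a*')

# The regex that behaves like the glob a* is the anchored form ^a.
regex_anchored=$(printf '%s\n' apple banana xyz | grep '^a')

echo "$glob_apple $glob_banana $regex_all $regex_anchored"
```

The same characters select different things depending on which of the three definitions is interpreting them.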

Meta-characters
  - Meta-characters can be divided into three categories:
      - Matching characters: the primary building blocks of REs
      - Grouping and repetition characters
      - Tagging and back-referencing
  - Some of the RE meta-characters are also shell meta-characters. For example:
      $ egrep r* /etc/passwd    # the shell performs filename expansion on r* first
    => so RE meta-characters should be quoted.
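The quoting problem can be reproduced safely in a scratch directory. The sketch below uses grep -E (the standard spelling of egrep) and two made-up files, rfile and passwd.sample, in place of /etc/passwd.

```shell
old_dir=$PWD
dir=$(mktemp -d)
cd "$dir" || exit 1

# One file whose name matches the glob r*, and a small passwd-like sample.
printf 'placeholder\n' > rfile
printf '%s\n' 'root:x:0:0' 'bin:x:1:1' 'rfile:x:2:2' > passwd.sample

# Unquoted: the shell expands r* to "rfile" before grep starts, so grep
# actually searches for the literal text "rfile".
unquoted=$(grep -c -E r* passwd.sample)

# Quoted: grep receives r* as a regular expression (zero or more r's),
# which matches every line.
quoted=$(grep -c -E 'r*' passwd.sample)

cd "$old_dir" || exit 1
rm -rf "$dir"
echo "unquoted=$unquoted quoted=$quoted"
```

Single quotes are the safest default, because they also protect $, \ and the backquote from the shell.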

Matching Characters (1)

  Character   What it matches
  c           Any single character that is not one of the RE special characters matches itself
  \           Removes the special meaning from an RE special character
  .           Matches any single character
  ^           Matches the start of the line
  $           Matches the end of the line
  [chars]     Matches any ONE character within the square brackets
  [^chars]    Matches any ONE character NOT within the square brackets

Matching Characters (2)
  - Example REs:
      hello         Match the string hello
      ...\....      Match any three characters, followed by a literal ., followed by three more characters
      ^...\....     Same as the previous one, but it must appear at the start of the line
      ^hello$       Match any line which contains hello ONLY
      [a-z][^a-z]   Match any two characters where the first is in the range a-z and the second is not
      \[\]\\        The characters '[', ']', '\' in order and contiguous
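These example REs can be tried with grep on a few made-up lines. The sample text is an assumption chosen so that each pattern selects a different subset; grep -c prints the number of matching lines, and LC_ALL=C keeps the range a-z to plain ASCII letters.

```shell
sample='hello
say hello
abc.def
x abc.def
a1
ab'

# hello: any line containing the string.
contains_hello=$(printf '%s\n' "$sample" | grep -c 'hello')
# ^hello$: lines consisting of hello only.
only_hello=$(printf '%s\n' "$sample" | grep -c '^hello$')
# ...\....: three characters, a literal dot, three characters, anywhere.
dotted_anywhere=$(printf '%s\n' "$sample" | grep -c '...\....')
# ^...\....: the same, but anchored to the start of the line.
dotted_at_start=$(printf '%s\n' "$sample" | grep -c '^...\....')
# [a-z][^a-z]: a lowercase letter followed by any other character.
lower_then_other=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '[a-z][^a-z]')
# \[\]\\: the three literal characters [ ] \ in a row.
brackets=$(printf '%s\n' 'x[]\y' 'x[]y' | grep -c '\[\]\\')

echo "$contains_hello $only_hello $dotted_anywhere $dotted_at_start $lower_then_other $brackets"
```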

Grouping and Repetition Characters (1)

  Character(s)   Purpose
  *              Match 0 or more of the previous RE
  +              Match one or more of the previous RE
  ?              Match zero or one occurrences of the previous RE
  \{n\}          Match exactly N occurrences of the previous RE
  \{n,\}         Match at least N occurrences of the previous RE
  \{n,m\}        Match between N and M occurrences of the previous RE
  |              Match one of two different REs (alternation)
  ( )            Used to group a collection of REs

  Note: +, ?, | and ( ) are written this way in EREs; the interval forms are shown in their BRE spelling (an ERE writes {n}, {n,} and {n,m} without the backslashes).

Grouping and Repetition Characters (2)
  - Examples:
      OO+
          Match two or more Os
      /bin/(tcsh|bash)$
          Match /bin/tcsh or /bin/bash occurring at the end of the line
      ^[^:]*:[^:]*:[0-9]:
          - the start of a line (^), followed by
          - 0 or more characters which aren't :s ([^:]*), followed by
          - a : (:), followed by
          - 0 or more characters which aren't :s ([^:]*), followed by
          - a : (:), followed by
          - a single digit ([0-9]), followed by
          - a : (:)
      (\+|-)?[0-9]+
      [+-]?[0-9]+
          An optionally signed integer (a plus or minus or nothing, followed by an integer)
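A short sketch of these EREs with grep -E follows. The passwd-style lines and the numbers are invented sample data, the third pattern is written with the trailing : that its explanation describes, and the signed-integer pattern is anchored with ^ and $ here so that it tests whole lines.

```shell
passwd_sample='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000::/home/alice:/bin/tcsh
daemon:x:1:1::/:/sbin/nologin'

# OO+ : two or more capital Os.
two_or_more=$(printf '%s\n' O OO fOOOd | grep -c -E 'OO+')

# /bin/(tcsh|bash)$ : login shell is tcsh or bash.
shells=$(printf '%s\n' "$passwd_sample" | grep -c -E '/bin/(tcsh|bash)$')

# ^[^:]*:[^:]*:[0-9]: : third field (the UID) is a single digit.
single_digit_uid=$(printf '%s\n' "$passwd_sample" | LC_ALL=C grep -c -E '^[^:]*:[^:]*:[0-9]:')

# [+-]?[0-9]+ : optionally signed integers.
integers=$(printf '%s\n' 42 -7 +13 abc | LC_ALL=C grep -c -E '^[+-]?[0-9]+$')

echo "$two_or_more $shells $single_digit_uid $integers"
```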

Tagging and back-referencing (1)
  - \( \) is used to tag/remember the RE you wish to back-reference.
  - Contents are placed into numbered registers 1, 2, ...
  - Access the contents of a register using \N, where N is the number of the register.
  - Examples:
      \(hello\) \1
          Matches hello hello
      \([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1
          Matches patterns like 12,55,34 34,55,12 and 1023,5321, ,5321,1023
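Back-references can be checked directly with grep, which uses BREs by default. The input lines below are made-up test data: one line per pattern that should match and one that should not.

```shell
# \(hello\) \1 : the tagged text must appear again after the space.
repeated=$(printf '%s\n' 'hello hello' 'hello world' | grep '\(hello\) \1')

# Three comma-separated numbers followed by the same three in reverse order.
mirrored=$(printf '%s\n' '12,55,34 34,55,12' '12,55,34 12,55,34' \
  | LC_ALL=C grep '\([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1')

echo "$repeated"
echo "$mirrored"
```

Only the first line of each pair is printed, because \N must match exactly the text that the Nth tagged subexpression matched, not merely the same pattern.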

Tagging and back-referencing (2)

  $ sed -e 's/\([a-zA-Z]\+\) \([a-zA-Z]\+\)/\2, \1/'

  - First \( \): puts "Programming" into memory 1
  - Second \( \): puts "UNIX" into memory 2
  - \2 in the replacement: extracts memory 2
  - \1 in the replacement: extracts memory 1

  Input:  Programming UNIX
  Output: UNIX, Programming

  Note: \+ is a GNU extension to BRE; the portable spelling is \{1,\}.
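The word swap can be run as follows. This sketch replaces the GNU-only \+ with the portable interval \{1,\}, which is a substitution made for the example rather than the form shown on the slide.

```shell
# Tag two words, then emit them in reverse order separated by ", ".
swapped=$(printf '%s\n' 'Programming UNIX' \
  | sed -e 's/\([a-zA-Z]\{1,\}\) \([a-zA-Z]\{1,\}\)/\2, \1/')

echo "$swapped"
```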

BRE

  BRE precedence (from high to low)
  escaped characters                \
  bracket expression                [ ]
  subexpressions/back-references    \( \)  \n
  single-character-BRE repetition   *  \{m,n\}
  concatenation
  anchoring                         ^  $

ERE

  ERE precedence (from high to low)
  escaped characters                \
  bracket expression                [ ]
  grouping                          ( )
  single-character-ERE duplication  *  +  ?  {m,n}
  concatenation
  anchoring                         ^  $
  alternation                       |
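Two consequences of the precedence order are easy to see with grep -E: alternation binds loosest, so anchors attach to a single branch, and duplication binds tighter than concatenation, so * applies to one character unless a group is used. The input words are made up for the illustration, and -x makes grep require a match of the whole line.

```shell
words='abxx
xxcd
ab
xabcdx'

# ^ab|cd$ parses as (^ab)|(cd$): "starts with ab" OR "ends with cd".
loose=$(printf '%s\n' "$words" | grep -c -E '^ab|cd$')

# Grouping changes the meaning: the whole line must be ab or cd.
strict=$(printf '%s\n' "$words" | grep -c -E '^(ab|cd)$')

# ab* parses as a(b*): one a followed by any number of b's.
rep_char=$(printf '%s\n' abbb abab | grep -E -x 'ab*')

# (ab)* repeats the whole group.
rep_group=$(printf '%s\n' abbb abab | grep -E -x '(ab)*')

echo "$loose $strict $rep_char $rep_group"
```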

Conclusion
  1. A few metacharacters are common, in both representation and meaning, to all three definitions:
      - The backslash (\)
      - The square brackets ([ ])
      - Two metacharacters within the brackets are also common to all three forms of RE:
          - If the caret (^) is the first character within the brackets, the complement of the given set of characters is meant. (POSIX shell patterns spell the complement as [!chars]; many shells also accept [^chars].)
          - If the hyphen (-) occurs within the brackets, it indicates a range of characters.
  2. In addition to what BREs offer, extended REs add:
      - the parentheses ( ) for grouping
      - the vertical bar (|) (also called pipe) for alternation
      - the plus sign (+), meaning repeat the preceding item at least once
      - the question mark (?), meaning match the preceding item either zero or one times

Conclusion (continued)
  - In GNU grep's basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
  - Large repetition counts in the {m,n} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
  - Back-references are very slow, and may require exponential time.
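The BRE and ERE spellings of the same pattern can be compared side by side. The sketch sticks to the backslashed forms that POSIX defines for BREs (\( \) and \{ \}) and uses invented sample words.

```shell
words='abab
ab
color
colour'

# "ab" repeated exactly twice, in BRE and in ERE spelling.
bre_twice=$(printf '%s\n' "$words" | grep '\(ab\)\{2\}')
ere_twice=$(printf '%s\n' "$words" | grep -E '(ab){2}')

# In a BRE, unescaped ( ) { } are ordinary characters.
literal=$(printf '%s\n' 'f(x){1}' 'fx' | grep 'f(x){1}')

# ? is an ERE operator: the u is optional.
optional_u=$(printf '%s\n' "$words" | grep -c -E 'colou?r')

echo "$bre_twice $ere_twice $literal $optional_u"
```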

Overview of Text Manipulation Utils
  - ed was a very early UNIX line editor. It included a number of commands to manipulate the file being edited.
  - Other UNIX commands like sed, ex, and vi were built on ed and use the same commands.
  - ex/vi command syntax: [ from_address [, to_address ] ] command [ parameters ]
  - Read the textbook: Appendix A
  - Supplementary reading:
      - An Introduction to Display Editing with Vi, William Joy, Mark Horton
      - VI (and Clones) Editor Reference Manual, Miles O'Neal

2. grep, egrep, fgrep
  - grep: BRE
  - egrep: ERE
  - fgrep: f = fixed string
  - Option: -r searches directories recursively
  - Reading: textbook, p. 72
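The three matchers can be compared on the same input. The sketch uses grep -E and grep -F, the standard spellings of egrep and fgrep, and builds a throwaway directory for the -r example; all file names and contents are made up, and -r itself is a common extension rather than a POSIX option.

```shell
lines='a.c
abc
abbc'

# BRE: the dot matches any character, so both a.c and abc match.
basic_count=$(printf '%s\n' "$lines" | grep -c 'a.c')
# ERE: + means one or more b's.
extended_count=$(printf '%s\n' "$lines" | grep -c -E 'ab+c')
# Fixed string: the dot is an ordinary character.
fixed=$(printf '%s\n' "$lines" | grep -F 'a.c')

# Recursive search: -r descends into subdirectories, -l lists file names.
dir=$(mktemp -d)
mkdir -p "$dir/sub"
printf 'needle\n' > "$dir/sub/deep.txt"
printf 'hay\n' > "$dir/top.txt"
found=$(grep -r -l 'needle' "$dir")
rm -rf "$dir"

echo "$basic_count $extended_count $fixed"
echo "$found"
```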

3. sed
  - sed(1) is a (s)tream (ed)itor, which manipulates data according to given rules.
  - The sed command line syntax is:
      $ sed [OPTIONS] -e 'INSTRUCTION' [-e 'INSTRUCTION'..] FILE
      $ sed [OPTIONS] -f SCRIPT.sed FILE
      $ cat FILE | sed 'COMMANDS'
  - Some of the most used options:
      -f SCRIPT.sed            Read commands from the file SCRIPT.sed.
      -e "SED-EXPRESSION"      Expression follows immediately. You can give this option multiple times.
      -n                       Do not print lines unless the p command is used.
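The three invocation styles can be tried on a small scratch file. The file names text1.txt and script.sed and their contents are placeholders created only for this sketch.

```shell
dir=$(mktemp -d)
printf '%s\n' 'this is line one' 'another line' 'this is line three' > "$dir/text1.txt"

# A script file holds one instruction per line.
printf '%s\n' 's/line/row/' '/another/d' > "$dir/script.sed"

# -f reads the instructions from the script file.
from_script=$(sed -f "$dir/script.sed" "$dir/text1.txt")

# The same instructions given with two -e options.
from_flags=$(sed -e 's/line/row/' -e '/another/d' "$dir/text1.txt")

# Reading from a pipe, with -n so that only the p command prints.
from_pipe=$(cat "$dir/text1.txt" | sed -n -e '/another/p')

rm -rf "$dir"
printf '%s\n' "$from_script" "$from_pipe"
```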

sed (2)
  - An INSTRUCTION can take the format:

      [OPTION]/RE/COMMAND

      - OPTION: optional, can be left out. g = global option: do the COMMAND for all lines.
      - RE: the regular expression to search for
      - COMMAND: what to do
          - p = print
          - d = delete

  Note: the leading g is the global prefix of ed/ex. sed already applies each instruction to every input line, so sed instructions omit it.

sed (3)
  - Commands and options
  - 1. The delete line command:
      $ sed -e '/this/d' text1.txt
  - 2. The print lines option:
      $ sed -n -e '/this/p' text1.txt
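The two commands are complementary, which a small made-up input shows: d drops the lines that match /this/, while -n with p keeps only those lines.

```shell
text='keep me
drop this line
this too
final'

# d: delete every line that matches the RE; everything else is printed.
deleted=$(printf '%s\n' "$text" | sed -e '/this/d')

# -n suppresses the default output, so only lines printed by p appear.
printed=$(printf '%s\n' "$text" | sed -n -e '/this/p')

printf '%s\n' "$deleted" '--' "$printed"
```

Without -n, the p command would print each matching line twice: once explicitly and once through sed's default output.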

sed (4)
  - 3. The substitution of text in the line
  - The command that sed uses most is (s)ubstitute, and its syntax is a bit different:

      [address]s/RE/replacement/[flag]

      - s: the substitute command

  - Examples:
      s/this-word//g
      s/UPE/UNIX Programming Environment/g
  - The (g)lobal flag causes the regular expression search to continue to the end of the line, so all matches on the line will be replaced.
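The effect of the g flag is easiest to see on a line with two matches. The sample sentences are invented for the example.

```shell
line='UPE is fun; UPE is useful'

# Without a flag, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/UPE/UNIX Programming Environment/')

# With g, every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed -e 's/UPE/UNIX Programming Environment/g')

# An empty replacement deletes the matched text.
removed=$(printf '%s\n' 'cut this-word out' | sed -e 's/this-word//g')

printf '%s\n' "$first_only" "$all" "$removed"
```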

sed (5)
  - Address in the substitute command
  - The [address] says where the command does its work. It can be:
      1. a numeric address
      2. a special marker, like $, which denotes the last line of the file
      3. a regular expression that restricts the command to certain lines

  # 1. We refer to explicit lines by number here:
  1s/BJ/Beijing/g          Do the substitution only at line 1.
  10,20s/BJ/Beijing/g      RANGE: from line 10 to line 20.

  # 2. We can mix numbers with the other address markers:
  50,$s/BJ/Beijing/g       From line 50 to the end of file.
  1,/^$/s/BJ/Beijing/g     From line 1 to the next empty line.

  # 3. Or use only regular expressions to delimit the line area:
  $ sed -e '/BEGIN/,/END/s/variable1/variable2/g' some.code

  Sample text:
  BEGIN
  line1;
  line2
  variable1 = variable1 + variableX;
  END
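Each address form can be checked by counting how many lines a substitution actually changed. The five-line input and the small code fragment are made-up data, with line numbers scaled down from the examples above.

```shell
input='BJ one
BJ two

BJ four
BJ five'

# Numeric address: only line 1 is changed.
line1_only=$(printf '%s\n' "$input" | sed -e '1s/BJ/Beijing/g' | grep -c 'Beijing')

# Number mixed with a RE: from line 1 to the first empty line.
to_blank=$(printf '%s\n' "$input" | sed -e '1,/^$/s/BJ/Beijing/g' | grep -c 'Beijing')

# $ is the last line: from line 4 to the end of the input.
to_end=$(printf '%s\n' "$input" | sed -e '4,$s/BJ/Beijing/g' | grep -c 'Beijing')

code='variable1 = 0;
BEGIN
variable1 = variable1 + variableX;
END
variable1 = 9;'

# Two REs: only the lines from BEGIN through END are touched.
inside=$(printf '%s\n' "$code" | sed -e '/BEGIN/,/END/s/variable1/variable2/g')
renamed=$(printf '%s\n' "$inside" | grep -c 'variable2')
untouched=$(printf '%s\n' "$inside" | grep -c 'variable1')

echo "$line1_only $to_blank $to_end $renamed $untouched"
```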

sed (6)
  - Examples:
      $ sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/g'
          Input:  sweeping blade of flashing steel
          Output: sweeping flashing blade of steel
      $ sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/g'
          Input:  sweeping blade of flashing steel
          Output: evil sweeping blade of flashing steel
  - & stands for the entire text matched by the RE.
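Both substitutions can be run on the sample sentence. In this sketch the first replacement ends in "of steel", so that the word steel, which is part of the matched text, is written back to the output.

```shell
phrase='sweeping blade of flashing steel'

# \1 holds "blade" and \2 holds "flashing"; the replacement swaps them.
reordered=$(printf '%s\n' "$phrase" \
  | sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/g')

# & is replaced by the whole matched text.
prefixed=$(printf '%s\n' "$phrase" \
  | sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/g')

printf '%s\n' "$reordered" "$prefixed"
```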

Supplementary Readings
  - Regular Expression in Unix
      - Consists of two parts. The first part is a simple and incomplete introduction to regular expressions in UNIX. The other is the RE specification excerpted from SUSv2, including the formal grammar for BRE and ERE.
  - External Filters, Programs and Commands in Unix, Mendel Cooper