Regular expressions, egrep, and sed

Slides:



Advertisements
Similar presentations
Form Validation CS What is form validation?  validation: ensuring that form's values are correct  some types of validation:  preventing blank.
Advertisements

7 Searching and Regular Expressions (Regex) Mauro Jaskelioff.
ISBN Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl –(on reserve.
David Notkin Autumn 2009 CSE303 Lecture 6 HW 2A in focus anagram raga Man computer science (&) engineering epic reticence ensuring gnome notkin beard drank.
1 CSE 390a Lecture 7 Regular expressions, egrep, and sed slides created by Marty Stepp, modified by Jessica Miller and Ruth Anderson
1 CSE 303 Lecture 7 Regular expressions, egrep, and sed read Linux Pocket Guide pp , 73-74, 81 slides created by Marty Stepp
1 CSE 390a Lecture 7 Regular expressions, egrep, and sed slides created by Marty Stepp, modified by Jessica Miller
CS 497C – Introduction to UNIX Lecture 31: - Filters Using Regular Expressions – grep and sed Chin-Chih Chang
Quotes: single vs. double vs. grave accent % set day = date % echo day day % echo $day date % echo '$day' $day % echo "$day" date % echo `$day` Mon Jul.
Regular expressions Mastering Regular Expressions by Jeffrey E. F. Friedl Linux editors and commands (e.g.
Introduction to regular expression. Wéber André Objective of the training Scope of the course  We will present what are “regular expressions”
Form Validation CS What is form validation?  validation: ensuring that form's values are correct  some types of validation:  preventing blank.
Filters using Regular Expressions grep: Searching a Pattern.
CSE 154 LECTURE 11: REGULAR EXPRESSIONS. What is form validation? validation: ensuring that form's values are correct some types of validation: preventing.
slides created by Marty Stepp
Regular Expressions A regular expression defines a pattern of characters to be found in a string Regular expressions are made up of – Literal characters.
Last Updated March 2006 Slide 1 Regular Expressions.
Regular Expression Darby Tien-Hao Chang (a.k.a. dirty) Department of Electrical Engineering, National Cheng Kung University.
System Programming Regular Expressions Regular Expressions
INFO 320 Server Technology I Week 7 Regular expressions 1INFO 320 week 7.
Perl and Regular Expressions Regular Expressions are available as part of the programming languages Java, JScript, Visual Basic and VBScript, JavaScript,
Agenda Regular Expressions (Appendix A in Text) –Definition / Purpose –Commands that Use Regular Expressions –Using Regular Expressions –Using the Replacement.
I/O Redirection and Regular Expressions February 9 th, 2004 Class Meeting 4.
Working with Forms and Regular Expressions Validating a Web Form with JavaScript.
Sys Prog & Scrip - Heriot Watt Univ 1 Systems Programming & Scripting Lecture 12: Introduction to Scripting & Regular Expressions.
I/O Redirection & Regular Expressions CS 2204 Class meeting 4 *Notes by Doug Bowman and other members of the CS faculty at Virginia Tech. Copyright
2004/12/051/27 SPARCS 04 Seminar Regular Expression By 박강현 (lightspd)
Regular Expressions CS 2204 Class meeting 6 Created by Doug Bowman, 2001 Modified by Mir Farooq Ali, 2002.
1 Lecture 9 Shell Programming – Command substitution Regular expressions and grep Use of exit, for loop and expr commands COP 3353 Introduction to UNIX.
CSCI 330 UNIX and Network Programming Unit IV Shell, Part 2.
ORAFACT Text Processing. ORAFACT Searching Inside Files grep - searches for patterns within files grep [options] [[-e] pattern] filename [...] -n shows.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
CIRC Summer School 2016 Baowei Liu
PROGRAMMING THE BASH SHELL PART III by İlker Korkmaz and Kaya Oğuz
Regular Expressions Copyright Doug Maxwell (
CSE 374 Programming Concepts & Tools
Regular expressions, egrep, and sed
Regular expressions, egrep, and sed
Looking for Patterns - Finding them with Regular Expressions
CIRC Summer School 2017 Baowei Liu
Lecture 19 Strings and Regular Expressions
CSC 594 Topics in AI – Natural Language Processing
CST8177 sed The Stream Editor.
CIRC Winter Boot Camp 2017 Baowei Liu
HTML Forms and Server-Side Data
BASIC AND EXTENDED REGULAR EXPRESSIONS
Regular Expressions and perl
Lecture 9 Shell Programming – Command substitution
CSE 390a Lecture 7 Regular expressions, egrep, and sed
Regular Expression Beihang Open Source Club.
CSC 594 Topics in AI – Natural Language Processing
Pattern Matching in Strings
Folks Carelli, Instructor Kutztown University
CSC 352– Unix Programming, Spring 2016
Unix Talk #2 (sed).
CSE 390a Lecture 7 Regular expressions, egrep, and sed
Chin-Chih Chang CS 497C – Introduction to UNIX Lecture 28: - Filters Using Regular Expressions – grep and sed Chin-Chih Chang
Regular expressions, egrep, and sed
Regular Expressions
Regular expressions, egrep, and sed
Regular expressions, egrep, and sed
Regular Expressions and Grep
Regular expressions, egrep, and sed
Lecture 25: Regular Expressions
Regular expressions, egrep, and sed
Regular expressions, egrep, and sed
CSE 390a Lecture 7 Regular expressions, egrep, and sed
Lab 8: Regular Expressions
Lecture 23: Regular Expressions
Presentation transcript:

Regular expressions, egrep, and sed CSE 391 Lecture 7 Regular expressions, egrep, and sed slides created by Marty Stepp, modified by Jessica Miller and Ruth Anderson http://www.cs.washington.edu/391/

Lecture summary regular expression syntax commands that use regular expressions egrep (extended grep) - search sed (stream editor) - replace links http://www.panix.com/~elflord/unix/grep.html http://www.robelle.com/smugbook/regexpr.html http://www.grymoire.com/Unix/Sed.html http://www.gnu.org/software/sed/manual/sed.html

What is a regular expression? "[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}" regular expression ("regex"): a description of a pattern of text can test whether a string matches the expression's pattern can use a regex to search/replace characters in a string regular expressions are extremely powerful but tough to read (the above regular expression matches basic email addresses) regular expressions occur in many places: shell commands (grep) many text editors (TextPad) allow regexes in search/replace Java Scanner, String split (CSE 143 grammar solver)

egrep and regexes egrep "[0-9]{3}-[0-9]{3}-[0-9]{4}" contact.html grep uses “basic” regular expressions instead of “extended” extended has some minor differences and additional metacharacters we’ll just use extended syntax. See online if you’re interested in the details. -i option before regex signifies a case-insensitive match egrep -i "mart" matches "Marty S", "smartie", "WALMART", ... command description egrep extended grep; uses regexes in its search patterns; equivalent to grep -E

Basic regexes "abc" the simplest regexes simply match a particular substring this is really a pattern, not a string! the above regular expression matches any line containing "abc" YES : "abc", "abcdef", "defabc", ".=.abc.=.", ... NO : "fedcba", "ab c", "AbC", "Bash", ...

Wildcards and anchors . (a dot) matches any character except \n ".oo.y" matches "Doocy", "goofy", "LooPy", ... use \. to literally match a dot . character ^ matches the beginning of a line; $ the end "^fi$" matches lines that consist entirely of fi \< demands that pattern is the beginning of a word; \> demands that pattern is the end of a word "\<for\>" matches lines that contain the word "for“ Words are made up of letters, digits and _ (underscore) Exercise : Find lines in ideas.txt that refer to the C language. Exercise : Find act/scene numbers in hamlet.txt . Answer: egrep "\<C\>" ideas.txt egrep "^ACT|^Scene" hamlet.txt

Special characters | means OR () are for grouping "abc|def|g" matches lines with "abc", "def", or "g" precedence of ^(Subject|Date) vs. ^Subject|Date: There's no AND symbol. () are for grouping "(Homer|Marge) Simpson" matches lines containing "Homer Simpson" or "Marge Simpson" \ starts an escape sequence many characters must be escaped to match them: / \ $ . [ ] ( ) ^ * + ? "\.\\n" matches lines containing ".\n"

Quantifiers: * + ? * means 0 or more occurrences "abc*" matches "ab", "abc", "abcc", "abccc", ... "a(bc)*" matches "a", "abc", "abcbc", "abcbcbc", ... "a.*a" matches "aa", "aba", "a8qa", "a!?_a", ... + means 1 or more occurrences "a(bc)+" matches "abc", "abcbc", "abcbcbc", ... "Goo+gle" matches "Google", "Gooogle", "Goooogle", ... ? means 0 or 1 occurrences "Martina?" matches lines with "Martin" or "Martina" "Dan(iel)?" matches lines with "Dan" or "Daniel" Exercise : Find all ^^ or ^_^ type smileys in chat.txt . Answer: egrep "\^_*\^" chat.txt

More quantifiers {min,max} means between min and max occurrences "a(bc){2,4}" matches "abcbc", "abcbcbc", or "abcbcbcbc" min or max may be omitted to specify any number "{2,}" means 2 or more "{,6}" means up to 6 "{3}" means exactly 3

Character sets [ ] group characters into a character set; will match any single character from the set "[bcd]art" matches strings containing "bart", "cart", and "dart" equivalent to "(b|c|d)art" but shorter inside [ ], most modifier keys act as normal characters "what[.!*?]*" matches "what", "what.", "what!", "what?**!", ... Exercise : Match letter grades in 143.txt such as A, B+, or D- . Answer: egrep "[ABCDF][+\-]?" 143.txt

Character ranges inside a character set, specify a range of characters with - "[a-z]" matches any lowercase letter "[a-zA-Z0-9]" matches any lower- or uppercase letter or digit an initial ^ inside a character set negates it "[^abcd]" matches any character other than a, b, c, or d inside a character set, - can sometimes be tricky to match Try escaping it (use \) or place it last in the brackets "[+\-]?[0-9]+" matches optional + or -, followed by  one digit Exercise : Match phone #s in contact.html, e.g. (206) 685-2181 . Answer: egrep "[0-9]{3}-[0-9]{3}-[0-9]{4}" contact.html

sed Usage: Example (replaces all occurrences of 143 with 391): sed -r "s/REGEX/TEXT/g" filename substitutes (replaces) occurrence(s) of regex with the given text if filename is omitted, reads from standard input (console) sed has other uses, but most can be emulated with substitutions Example (replaces all occurrences of 143 with 391): sed -r "s/143/391/g" lecturenotes.txt command description sed stream editor; performs regex-based replacements and alterations on input

more about sed sed is line-oriented; processes input a line at a time -r option makes regexes work better recognizes ( ) , [ ] , * , + the right way, etc. s for substitute g flag after last / asks for a global match (replace all) special characters must be escaped to match them literally sed -r "s/http:\/\//https:\/\//g" urls.txt sed can use delimiters besides / to make more readable (e.g. #) : sed -r "s#http://#https://#g" urls.txt

sed exercises In movies.txt: Replace “The” with “The Super Awesome” Now do it only when The occurs at the beginning of the line. (Need the next slide for this) Move the year from the end of the line to the beginning of the line. Do this and also sort the movies by year Now do the two items above and then put the year back at the end of the line.

Back-references every span of text captured by () is given an internal number you can use \number to use the captured text in the replacement \0 is the overall pattern \1 is the first parenthetical capture ... Back-references can also be used in egrep pattern matching Match “A” surrounded by the same character: “(.)A\1” Example: swap last names with first names sed -r "s/([A-Za-z]+), ([A-Za-z]+)/\2 \1/g" names.txt Exercise : Reformat phone numbers with 685-2181 format to (206) 685.2181 format. Answer: sed -r "s/([0-9]{3})-([0-9]{3})-([0-9]{4})/(\1) \2.\3/g" facnames.txt

Other tools find supports regexes through its -regex argument find . -regex ".*CSE 14[23].*" Many editors understand regexes in their Find/Replace feature

Exercise Write a shell script that reads a list of file names from files.txt and finds any occurrences of MM/DD dates and converts them into MM/DD/YYYY dates. Example: 04/17 would be changed to: 04/17/2016

Yay Regular Expressions! Courtesy XKCD