
1 Advanced Text Processing

2 Lecture Overview
• Character manipulation commands: cut, paste, tr
• Line manipulation commands: sort, uniq, diff
• Regular expressions and grep
• Text replacement using sed

3 Cutting Lines – cut
• The cut command extracts sections from each line of the input file
• Usage:
    cut options [files]
• Command line options for cut:
    -c  output only these characters
    -f  output only these fields
    -d  use this character as the field delimiter

4 Cutting Lines – cut
• With cut, at least one of the selection options (-c or -f) must be specified
• The value given with -c or -f can be:
    A number: specifies a single position (character or field)
    A range: specifies a sequence of positions
    A comma-separated list: specifies multiple positions or ranges

5 cut – Examples
• Given a file called 'my_phones.txt':
    ADAMS, Andrew 7583
    BARRETT, Bruce 6466
    BAYES, Ryan 6585
    BECK, Bill 6346
    BENNETT, Peter 7456
    GRAHAM, Linda 6141
    HARMER, Peter 7484
    MAKORTOFF, Peter 7328
    MEASDAY, David 6494
    NAKAMURA, Satoshi 6453
    REEVE, Shirley 7391
    ROSNER, David 6830

6 cut – Examples
    head -3 my_phones.txt | cut -c3-16
    AMS, Andrew 75
    RRETT, Bruce 6
    YES, Ryan 6585

    head -3 my_phones.txt | cut -d" " -f2
    Andrew
    Bruce
    Ryan

    head -3 my_phones.txt | cut -c1-3,10,12,15-18
    ADAde7583
    BARBu 646
    BAYa 85
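
The selection forms above can be tried without any input file. The following is a minimal sketch; the two sample lines and the variable names are made up for illustration, and the open-ended range (`2-`) is a standard form of range that the slides do not list.

```shell
# Hypothetical sample data: two lines in the "LAST, First number" layout.
phones='ADAMS, Andrew 7583
BARRETT, Bruce 6466'

# A character range: characters 1 to 5 of each line.
first_chars=$(printf '%s\n' "$phones" | cut -c1-5)

# A comma-separated list of fields, with a single space as the delimiter.
name_and_number=$(printf '%s\n' "$phones" | cut -d' ' -f2,3)

# An open-ended range: everything from field 2 onwards.
from_second=$(printf '%s\n' "$phones" | cut -d' ' -f2-)

printf '%s\n' "$first_chars" "$name_and_number" "$from_second"
```

Because each line has exactly three space-separated fields, `-f2,3` and `-f2-` select the same text here.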

7 Merging Files – paste
• The paste command merges multiple files by concatenating corresponding lines
• Usage:
    paste [options] [files]
• Command line options for paste:
    -d  provide a list of separator characters
    -s  paste one file at a time instead of in parallel (each file becomes a single line)

8 paste – Examples
• Assume that we are given 3 input files:
    first.txt   last.txt    num.txt
    Andrew      ADAMS       7583
    Bruce       BARRETT     6466
    Ryan        BAYES       6585
    Bill        BECK        6346
    Peter       BENNETT     7456
    Linda       GRAHAM      6141
    Peter       HARMER      7484
    Peter       MAKORTOFF   7328
    David       MEASDAY     6494
    Satoshi     NAKAMURA    6453

9 paste – Examples
    paste first.txt last.txt num.txt | head -3
    Andrew  ADAMS   7583
    Bruce   BARRETT 6466
    Ryan    BAYES   6585

    paste -d" :" first.txt last.txt num.txt | head -3
    Andrew ADAMS:7583
    Bruce BARRETT:6466
    Ryan BAYES:6585

    paste -s last.txt first.txt num.txt | cut -f1-5,10
    ADAMS   BARRETT BAYES   BECK    BENNETT NAKAMURA
    Andrew  Bruce   Ryan    Bill    Peter   Satoshi
    7583    6466    6585    6346    7456    6453
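
The difference between parallel and serial pasting is easiest to see on very small files. This is a sketch only: it writes shortened, made-up copies of first.txt and last.txt into a temporary directory created with `mktemp -d`, and the variable names are assumptions for the example.

```shell
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' Andrew Bruce Ryan > "$tmpdir/first.txt"
printf '%s\n' ADAMS BARRETT BAYES > "$tmpdir/last.txt"

# Parallel merge: one output line per input line, joined by a comma.
parallel=$(paste -d',' "$tmpdir/first.txt" "$tmpdir/last.txt")

# Serial merge (-s): each file becomes one output line.
serial=$(paste -s -d',' "$tmpdir/first.txt" "$tmpdir/last.txt")

printf '%s\n' "$parallel" "$serial"
rm -r "$tmpdir"
```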

10 Translating Characters – tr
• The tr command is used to translate between one character set and another
• Input is read from standard input and written to standard output (no files)
• With no options, tr accepts two character sets with equal lengths, and replaces each character with the corresponding one
• Usage:
    tr [options] set1 [set2]

11 Deleting or Squeezing Characters – tr
• Sets contain literal characters, or character ranges, such as: 'a-z' or 'DEFa-z'
• With command line options, tr can also be used to delete or squeeze characters
• Command line options for tr:
    -d  delete the characters in set1
    -s  replace each run of a repeated character from set1 with a single copy

12 Defining Sets for tr
• tr has some interpreted sequences to simplify the definition of sets:
    [:alpha:]      all letters
    [:digit:]      all digits
    [:alnum:]      all letters and digits
    [:space:]      all whitespace
    [:punct:]      all punctuation characters
    [CHAR*REPEAT]  REPEAT copies of CHAR
    [CHAR*]        copies of CHAR until set1 length
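
A short sketch of these sequences in use follows. The input strings are made up, and `[:lower:]` and `[:upper:]` are standard classes that are not in the list above.

```shell
# Upper-case every letter using predefined classes.
upper=$(printf 'Bruce 6466\n' | tr '[:lower:]' '[:upper:]')

# Replace every digit with '#': [#*] repeats '#' up to the length of set1.
masked=$(printf 'Bruce 6466\n' | tr '[:digit:]' '[#*]')

# Delete all punctuation characters.
no_punct=$(printf 'ADAMS, Andrew\n' | tr -d '[:punct:]')

printf '%s\n' "$upper" "$masked" "$no_punct"
```

Quoting the sets keeps the shell from treating the square brackets as a filename pattern.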

13 tr – Examples
• Change lower case to capital, and replace the digits 6, 7, 8 with the letters x, y, z
    head -3 padded_phones.txt
    ADAMS      Andrew   7583
    BARRETT    Bruce    6466
    BAYES      Ryan     6585

    head -3 padded_phones.txt | tr 'a-z678' 'A-Zxyz'
    ADAMS      ANDREW   y5z3
    BARRETT    BRUCE    x4xx
    BAYES      RYAN     x5z5

14 tr – Examples
• Squeeze sequences of spaces into one:
    head -3 padded_phones.txt | tr -s " "
    ADAMS Andrew 7583
    BARRETT Bruce 6466
    BAYES Ryan 6585
• Delete spaces, and digits 7 and 8:
    head -3 padded_phones.txt | tr -d " 78"
    ADAMSAndrew53
    BARRETTBruce6466
    BAYESRyan655

15 Reading from Standard Input
• Many UNIX commands accept one or more input files listed in the command line (tr is one of the few that don't)
• If no input file is given, these commands will read from the standard input
• Alternatively, if the file list contains a '-', the standard input will be inserted in its place

16 Standard Input – Example
    cat last.txt | tr "A-Z" "a-z" | \
    paste -d"_" first.txt - number.txt | head -10
    Andrew_adams_7583
    Imelda_aguilar_6518
    Daniel_albers_7540
    Pierre_amaudruz_7567
    Friedhelm_ames_7581
    Willy_andersson_6238
    Andrei_andreyev_6491
    Jonathan_aoki_6820
    Donald_arseneau_6295
    Danny_ashery_6188
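
The same idea can be reproduced on a small scale. In this sketch the two files are made-up, two-line stand-ins written to a temporary directory, and the last names are supplied on standard input instead of coming from a file.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' Andrew Bruce > "$tmpdir/first.txt"
printf '%s\n' 7583 6466 > "$tmpdir/num.txt"

# The lower-cased last names arrive on standard input and fill the '-' slot.
merged=$(printf '%s\n' ADAMS BARRETT | tr 'A-Z' 'a-z' |
         paste -d'_' "$tmpdir/first.txt" - "$tmpdir/num.txt")

printf '%s\n' "$merged"
rm -r "$tmpdir"
```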

17 Lecture Overview
• Character manipulation commands: cut, paste, tr
• Line manipulation commands: sort, uniq, diff
• Regular expressions and grep
• Text replacement using sed

18 Sorting Files – sort
• The sort command reorders the lines in a file (or files), and sends the result to the standard output
• Usage:
    sort [options] [files]
• Command line options for sort:
    -f  ignore case (fold lowercase to uppercase)
    -r  sort in reverse order
    -n  sort in numeric order

19 Sorting Files – sort
• With no options given, the input is sorted in ASCII code order (this holds in the C locale; other locale settings may collate differently)
• The sort command has many more options for selecting which fields to sort by, and for changing the way input is treated
• As always, you should read the man pages for the full details

20 sort – Example: Using Ignore-Case
    Input    sort     sort -f
    Bruce    Andrew   Andrew
    Ryan     Bruce    bill
    peter    Ryan     Bruce
    Andrew   bill     peter
    bill     peter    Ryan

21 sort – Example: Sorting Numbers
    Input     sort      sort -n
    38        1256875   18
    18        18        38
    1256875   38        66
    66        575       575
    575       66        1256875
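
The numeric example can be checked directly, together with one of the field-selection options mentioned earlier. This is a sketch under a few assumptions: `LC_ALL=C` is set so the ordering does not depend on the local language settings, the name/number lines are made up, and `-k2,2n` (sort numerically on the second field only) is a standard sort option that the slides do not cover.

```shell
numbers='38
18
1256875
66
575'

# Default order compares character by character, so "1256875" comes first.
lexical=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# -n compares by numeric value; -r reverses the result.
numeric=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)
descending=$(printf '%s\n' "$numbers" | LC_ALL=C sort -nr)

# -k2,2n: sort on the second whitespace-separated field, numerically.
by_number=$(printf '%s\n' 'ADAMS 7583' 'BARRETT 6466' 'BAYES 6585' |
            LC_ALL=C sort -k2,2n)

printf '%s\n' "$lexical" "$numeric" "$descending" "$by_number"
```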

22 Removing Duplicate Lines – uniq
• The uniq command removes adjacent duplicate lines from its input file
• If the input is sorted, it removes all duplicate lines
• Command line options for uniq:
    -i  ignore case
    -c  prefix lines by the number of occurrences
    -d  only print duplicate lines
    -u  only print unique lines

23 uniq – Example
    Input    uniq     uniq -c
    Andrew   Andrew   1 Andrew
    Bill     Bill     1 Bill
    David    David    2 David
    David    Peter    3 Peter
    Peter    Ryan     1 Ryan
    Peter
    Peter
    Ryan

24 uniq – Example
    Input    uniq -u   uniq -d
    Andrew   Andrew    David
    Bill     Bill      Peter
    David    Ryan
    David
    Peter
    Peter
    Peter
    Ryan
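
Since uniq only compares adjacent lines, it is usually combined with sort. The sketch below uses a made-up, unsorted list of names to show the difference; `LC_ALL=C` is an assumption added to make the sort order predictable.

```shell
names='Peter
Andrew
Peter
David
Andrew
Peter'

# Unsorted: no two equal lines are adjacent, so uniq removes nothing.
unsorted_count=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# Sorted first: duplicates become adjacent and collapse to one line each.
sorted_unique=$(printf '%s\n' "$names" | LC_ALL=C sort | uniq)

# Names that occur more than once.
repeated=$(printf '%s\n' "$names" | LC_ALL=C sort | uniq -d)

printf '%s\n' "$unsorted_count" "$sorted_unique" "$repeated"
```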

25 Example – File Processing Using Pipes
• Task: go over the book "War and Peace" and count the appearances of each word
• Step 1: remove all punctuation marks
    cat war_and_peace.txt | tr -d '[:punct:]'
• Step 2: put each word on a separate line
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n"
• Step 3: sort words
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort

26 Example – File Processing Using Pipes
• Step 4: count appearances of each word
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c
• Step 5: sort result by number of appearances
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr
• Step 6: write output to file
    cat war_and_peace.txt | tr -d '[:punct:]' | tr " " "\n" | sort | uniq -c | sort -nr > words.txt
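
The six steps can be tried on a single sentence instead of a whole book. The following is a variation rather than the pipeline from the slides: as an assumption for this sketch, it folds everything to lower case so that "The" and "the" are counted together, and it uses `tr -cs` to turn every run of non-letters into one newline, which also avoids empty lines when words are separated by several spaces. The last two commands only normalise the padding that `uniq -c` puts in front of each count.

```shell
# Made-up sample text standing in for war_and_peace.txt.
text='The war, the peace; the WAR.'

counts=$(printf '%s\n' "$text" |
         tr '[:upper:]' '[:lower:]' |
         tr -cs '[:alpha:]' '\n' |
         LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr |
         tr -s ' ' | sed 's/^ //')

printf '%s\n' "$counts"
```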

27 Comparing Text Files – diff
• The diff command takes two input files, and compares them
• The output contains only the different lines, with their line numbers
• Command line options for diff:
    -i  ignore case
    -b  ignore changes in amount of white space
    -B  ignore insertion or deletion of blank lines

28 diff – Examples
• First file:
    ADAMS Andrew 7583
    BARRETT Bruce 6466
    BAYES Ryan 6585
    BECK Bill 6346
    BENNETT Peter 7456
• Second file (its BAYES line differs from the first file only in the amount of white space):
    ADAMS Andrew 7583
    BARRETT Bruce 3333
    BAYES  Ryan  6585
    BECK Bill 6346
    Bennett peter 7456
• Output of diff on the two files:
    2,3c2,3
    < BARRETT Bruce 6466
    < BAYES Ryan 6585
    ---
    > BARRETT Bruce 3333
    > BAYES  Ryan  6585
    5c5
    < BENNETT Peter 7456
    ---
    > Bennett peter 7456

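29 diff – Examples
• First file:
    ADAMS Andrew 7583
    BARRETT Bruce 6466
    BAYES Ryan 6585
    BECK Bill 6346
    BENNETT Peter 7456
• Second file (its BAYES line differs from the first file only in the amount of white space):
    ADAMS Andrew 7583
    BARRETT Bruce 3333
    BAYES  Ryan  6585
    BECK Bill 6346
    Bennett peter 7456
• Output of diff -b (white space changes ignored):
    2c2
    < BARRETT Bruce 6466
    ---
    > BARRETT Bruce 3333
    5c5
    < BENNETT Peter 7456
    ---
    > Bennett peter 7456
• Output of diff -bi (white space changes and case ignored):
    2c2
    < BARRETT Bruce 6466
    ---
    > BARRETT Bruce 3333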

30 Maintaining Output Consistency
• During program development, assume that we have reached the correct output
• We want to verify that it does not change
• Create reference output file:
    prog > prog.out
• After changing the program, compare output:
    prog | diff - prog.out
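
This check can be automated by testing the exit status of diff, which is 0 when the inputs are identical and 1 when they differ. A minimal sketch follows; the shell function `prog` is a hypothetical stand-in for the real program, and the reference file is kept in a temporary directory.

```shell
tmpdir=$(mktemp -d)

# Hypothetical stand-in for the program under test.
prog() { printf 'result: 42\n'; }

# Record the reference output once.
prog > "$tmpdir/prog.out"

# Later: compare fresh output (read from standard input via '-') with the reference.
if prog | diff - "$tmpdir/prog.out" > /dev/null; then
    status=unchanged
else
    status=changed
fi

printf '%s\n' "$status"
rm -r "$tmpdir"
```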

31 Lecture Overview
• Character manipulation commands: cut, paste, tr
• Line manipulation commands: sort, uniq, diff
• Regular expressions and grep
• Text replacement using sed

32 Searching For Matching Patterns – grep
• The grep command searches files for patterns, and prints matching lines
• Usage:
    grep [options] regexp [files]
• The mandatory regexp argument defines a regular expression
• A regular expression is a formula for matching strings that follow some pattern

33 Searching For Matching Patterns – grep
• The simplest regular expression is just a sequence of characters
• This regular expression matches only a single string: itself
• The following command prints all lines from any of the files that contain word:
    grep word files

34 Searching For Matching Patterns – grep
• The power of grep lies in using more sophisticated regular expressions
• Command line options for grep:
    -v  print all lines that don't match
    -c  print only a count of matched lines
    -n  print line numbers
    -h  don't print file names (for multiple files)
    -l  print file name but not matching line
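
A few of these options are sketched below on a made-up three-line file in a temporary directory; the variable names are assumptions for the example.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'big boy' 'bad bug' 'better' > "$tmpdir/bugs.txt"

# -c: number of matching lines.
match_count=$(grep -c 'b.g' "$tmpdir/bugs.txt")

# -n: prefix each matching line with its line number.
numbered=$(grep -n 'bug' "$tmpdir/bugs.txt")

# -v: lines that do NOT match.
inverted=$(grep -v 'b.g' "$tmpdir/bugs.txt")

printf '%s\n' "$match_count" "$numbered" "$inverted"
rm -r "$tmpdir"
```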

35 Regular Expressions
• Regular expressions are a powerful tool for searching and selecting text
• They became widely known through early UNIX tools such as ed and grep (and have their roots in automata theory)
• They have since been copied into many other tools and languages such as awk, sed, perl and Java

36 Regular Expressions vs. Filename Expansion
• Note that regular expressions are different from filename expansion
• Filename expansion uses some regular expression concepts and symbols, but:
    Filename expansion is done by the shell
    Regular expressions are passed as arguments to specific commands or utilities

37 Matching a Single Character
• A period (.) matches any single character
• For example:
    Regular Expression   Matches              Doesn't Match
    b.g                  bag, debug, bigger   brag, bg, bad
    U..X                 UNIX                 unix
    .                    a, b, c              An empty line

38 Matching a Character Class
• Square brackets ([]) match any single character within the brackets
• If the first character following the left bracket is a '^', the expression matches any character not in the brackets
• A '-' can be used to indicate a range, such as: [a-z]

39 Matching a Character Class
    Regular Expression   Matches                  Doesn't Match
    [Bb]ill              Bill, bill, got billed   Dill, ill, kill
    t[aeiou].k           talk, stack, stink       track, take
    number [^0-5]        number xxx, number 8:    number 59

40 Matching a Character Class
• The same predefined character classes used for tr can also be used here
• For portability reasons, [:alpha:] is always preferable to [A-Za-z]
• Note: the brackets are part of the symbolic names, and must be included in addition to the enclosing brackets, i.e. [[:alpha:]]

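41 Matching Repetitions
• An asterisk (*) represents zero or more matches of the regular expression it follows
    Regular Expression   Matches                   Doesn't Match
    ab*c                 ac, abc, aaabbbc          abd, cab
    t.*ing               thing, string, thinking   king
• A line matches if the pattern is found anywhere in it, so a line such as abac still matches ab*c through its substring ac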

42 Matching Special Characters
• Sometimes we want to literally match a character that has a special meaning, such as '*' or '['
• There are two ways to do that:
    Precede the character with a '\'
    Use square brackets: any character inside is taken literally

43 Matching Special Characters
    Regular Expression   Matches                 Doesn't Match
    a\.c                 a.c                     abc
    \.\.\.*              the end..., more.....   abc, stop.
    [*.]                 * start *, Sys.print    Hello world, abc
    C:\\bin              C:\bin                  C:\\bin
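
Both ways of matching a special character literally can be compared by counting matching lines. A small sketch on two made-up input lines:

```shell
lines='a.c
abc'

# Unescaped '.' matches any character, so both lines match.
unescaped=$(printf '%s\n' "$lines" | grep -c 'a.c')

# '\.' matches only a literal dot.
escaped=$(printf '%s\n' "$lines" | grep -c 'a\.c')

# '[.]' is the bracket form of the same thing.
bracketed=$(printf '%s\n' "$lines" | grep -c 'a[.]c')

printf '%s\n' "$unescaped" "$escaped" "$bracketed"
```

The single quotes matter: they stop the shell from interpreting the backslash and brackets before grep sees them.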

44 Matching the Beginning or the End of a Line
• A regular expression that begins with a caret (^) can match a string only at the beginning of a line
• Similarly, a regular expression that ends with a dollar sign ($) can match a string only at the end of a line

45 Matching the Beginning or the End of a Line
    Regular Expression   Matches                 Doesn't Match
    ^T                   This line, That bug     START, My Tag
    ^num.*[0-9]$         num5, num99, number 1   my num1, the number 6, num 6a
    ^t.*k$               talk, track, tk         stack, take
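
The anchored patterns in the table can be verified with grep. This sketch reuses some of the table's strings as input, writes the digit class as `[[:digit:]]` instead of `[0-9]`, and adds one extra made-up case that anchors a pattern on both sides to match whole lines only.

```shell
lines='num5
number 1
my num1
num 6a'

# Lines that start with "num" and end with a digit.
anchored=$(printf '%s\n' "$lines" | grep '^num.*[[:digit:]]$')

# Whole-line match: only letters between the start and the end of the line.
only_letters=$(printf '%s\n' 'Ryan' 'Ryan 6585' | grep '^[[:alpha:]]*$')

printf '%s\n' "$anchored" "$only_letters"
```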

46 Using Regular Expressions with grep – Examples
    cat bugs.txt
    big boy
    bad bug
    bag
    bigger bag
    better
    boogie nights

    grep 'b.g' bugs.txt
    big boy
    bad bug
    bag
    bigger bag

    grep 'b.g.' bugs.txt
    big boy
    bigger bag

    grep 'b.*g.' bugs.txt
    big boy
    bigger bag
    boogie nights

47 Using Regular Expressions with grep – Examples
    cat f.txt
    ADAMS,
    Andrew
    7583
    BARRETT,
    Bruce
    6466
    BAYES,
    Ryan
    6585

    grep '[[:alpha:]],' f.txt
    ADAMS,
    BARRETT,
    BAYES,

    grep '^[C-Z][[:lower:]]*$' f.txt
    Ryan

    grep '^[^[:alpha:]0-3]*$' f.txt
    6466
    6585

48 Pipes and Regular Expressions – Example
• Task: create a file containing the names of all source files in the current directory, sorted by the number of lines in each file
• Step 1: count lines in each file
    wc -l *
• Step 2: leave only '.c' and '.h' files
    wc -l * | grep '\.[ch]$'
• Step 3: sort in reverse order (largest first)
    wc -l * | grep '\.[ch]$' | sort -nr

49 Pipes and Regular Expressions – Example
• Step 4: squeeze leading spaces (into one)
    wc -l * | grep '\.[ch]$' | sort -nr | tr -s " "
• Step 5: remove number field
    wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3
• Step 6: write output to file
    wc -l * | grep '\.[ch]$' | sort -nr | tr -s " " | cut -d" " -f3 > sorted_source_files.txt
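
Steps 4 and 5 rely on every line of the `wc -l` output starting with at least one space, so that the file name ends up in field 3. As an alternative sketch (not the method from the slides), the count can be stripped with a single sed substitution that works with or without leading spaces. The `wc` output below is simulated and the file names are made up.

```shell
# Simulated `wc -l` output, already filtered and sorted.
wc_output='  120 main.c
   45 util.h
    7 a.c'

# Remove optional leading spaces, the number, and the space that follows it.
names_only=$(printf '%s\n' "$wc_output" | sed 's/^ *[0-9][0-9]* //')

printf '%s\n' "$names_only"
```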

50 Which grep to Use?
• In addition to grep itself, there are two more variants of it: egrep (equivalent to grep -E) and fgrep (equivalent to grep -F)
    Use grep for most standard text finding tasks
    Use egrep for complex tasks, where basic regular expressions are just not enough, and you need to use extended regular expressions
    Use fgrep when only fixed strings are searched, and speed is of the essence

51 Extended Regular Expressions – egrep
• Extended regular expressions support all basic regular expression syntax, plus some additional special characters:
    +    similar to '*', but at least one appearance
    ?    similar to '*', but zero or one appearances
    ()   grouping
    a|b  the OR operator: matches either regular expression a or regular expression b

52 Extended Regular Expressions – egrep
    Regular Expression   Matches          Doesn't Match
    num6+                num666, num654   num566, number
    num6?5               num65, num555    num6, num665
    Barret|Bennet        Barret, Bennet
    B(arr|enn)et         Barret, Bennet
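
Grouping, alternation and the optional operator can be tried with `grep -E`, the standard option form of egrep. The input list is made up; "Burnet" is included as a line that should not match, and the anchors in the second pattern are an addition for this sketch.

```shell
names='Barret
Bennet
Burnet
num65
num5'

# Grouping with alternation: "arr" or "enn" between "B" and "et".
alternation=$(printf '%s\n' "$names" | grep -E 'B(arr|enn)et')

# '?' makes the preceding '6' optional; anchors force a whole-line match.
optional=$(printf '%s\n' "$names" | grep -E '^num6?5$')

printf '%s\n' "$alternation" "$optional"
```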

53 Lecture Overview
• Character manipulation commands: cut, paste, tr
• Line manipulation commands: sort, uniq, diff
• Regular expressions and grep
• Text replacement using sed

54 Stream Editor – sed
• sed is a stream editor for text, which supports basic regular expressions
• It performs transformations on an input stream, based on simple instructions
• sed has many commands, but the most commonly used is the substitute command:
    sed 's/pattern/replacement/[g]' [file]

55 Stream Editor – sed
• pattern is any basic regular expression
• replacement is a string that will replace one or more matches of pattern
• The optional g flag makes the operation global: without it, only the first match in every line is replaced
• The special character '&' can be used inside replacement to refer to the matched text
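
The effect of the g flag and of '&' can be seen on a single line. In this sketch the input line and the square brackets in the replacement are arbitrary choices; the brackets just mark what was matched.

```shell
line='bigger bag better'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/b.g/[&]/')

# With g: every match is replaced; '&' stands for the matched text.
every_match=$(printf '%s\n' "$line" | sed 's/b.g/[&]/g')

printf '%s\n' "$first_only" "$every_match"
```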

56 sed – Examples
    cat bugs.txt
    big boy
    bad bug
    bag
    bigger bag
    better

    sed 's/b.g/XXX/' bugs.txt
    XXX boy
    bad XXX
    XXX
    XXXger bag
    better

    sed 's/b.g/XXX/g' bugs.txt
    XXX boy
    bad XXX
    XXX
    XXXger XXX
    better

57 sed – Examples
    head -2 my_phones.txt
    ADAMS, Andrew 7583
    BARRETT, Bruce 6466

    head -2 my_phones.txt | sed 's/ [[:upper:]]/ /g'
    ADAMS, ndrew 7583
    BARRETT, ruce 6466

    head -2 my_phones.txt | sed 's/[[:digit:]]*$/###/g'
    ADAMS, Andrew ###
    BARRETT, Bruce ###

58 Matching and Reusing Portions of a Pattern in sed
• It is also possible to use portions of the matching pattern
• Within the pattern, portions should be enclosed between '\(' and '\)'
• In replacement, the special sequences '\1', '\2', etc. can be used to refer to the matched portions

59 Matching and Reusing Portions of a Pattern in sed – Examples
• Remove the first name from each line:
    head -2 my_phones.txt | sed 's/ [[:upper:]][[:lower:]]* / /'
    ADAMS, 7583
    BARRETT, 6466
• Replace first name with initial:
    head -2 my_phones.txt | sed 's/ \([[:upper:]]\)[[:lower:]]* / \1. /'
    ADAMS, A. 7583
    BARRETT, B. 6466

60 Matching and Reusing Portions of a Pattern in sed – Examples
• Switch between first and last names:
    head -2 my_phones.txt | sed 's/\(.*\), \(.*\) /\2 \1 /'
    Andrew ADAMS 7583
    Bruce BARRETT 6466
• Switch names and parenthesize number:
    head -2 my_phones.txt | sed 's/\(.*\), \(.*\) \(.*\)/\2 \1: (03-555\3)/'
    Andrew ADAMS: (03-5557583)
    Bruce BARRETT: (03-5556466)
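
The patterns above use `.*`, which matches as much as it can. A stricter variant, shown here only as a sketch, describes each portion with a bracket expression and anchors the whole line, so a line that does not have the expected layout is left unchanged. The output format `(number)` is an arbitrary choice for the example.

```shell
line='ADAMS, Andrew 7583'

# \1 = last name (no commas), \2 = first name (no spaces), \3 = digits.
swapped=$(printf '%s\n' "$line" |
          sed 's/^\([^,]*\), \([^ ]*\) \([0-9]*\)$/\2 \1 (\3)/')

# A line in a different layout does not match and passes through untouched.
untouched=$(printf '%s\n' 'no phone entry here' |
            sed 's/^\([^,]*\), \([^ ]*\) \([0-9]*\)$/\2 \1 (\3)/')

printf '%s\n' "$swapped" "$untouched"
```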

