CIT 500: IT Fundamentals Text Processing 1

Topics
1. Displaying files: cat, less, od, head, tail
2. Creating and appending
3. Concatenating files
4. Comparing files
5. Printing files
6. Sorting files
7. Searching files and regular expressions
8. Sed and awk

Displaying Files
1. cat
2. less
3. od
4. head
5. tail

Displaying files: cat
cat [options] [file1 [file2 … ]]
-e    Displays $ at the end of each line.
-n    Prints line numbers before each line.
-t    Displays tabs as ^I and formfeeds as ^L.
-v    Displays nonprintable characters, except for tab, newline, and formfeed.
-vet  Combines -v, -e, -t to display all nonprintable characters.
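A quick sketch of these options in action; the file name notes.txt and its contents are made up for illustration, and GNU cat is assumed:
> printf 'one\ttwo\nthree\n' > notes.txt
> cat -n notes.txt
     1  one     two
     2  three
> cat -vet notes.txt
one^Itwo$
three$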

Displaying files: less
less [file1 [file2 … ]]
h        Displays help.
q        Quit.
space    Forward one page.
return   Forward one line.
b        Back one page.
y        Back one line.
:n       Next file.
:p       Previous file.
/        Search file.

Displaying files: od
od [options] [file1 [file2 … ]]
-c    Also display character values.
-x    Display numbers in hexadecimal.
> file /kernel/genunix
/kernel/genunix: ELF 32-bit MSB relocatable SPARC
> od -c /kernel/genunix
0000000 177   E   L   F 001 002 001  \0  \0  \0  \0  \0  \0  \0  \0  \0
…
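The same idea on a tiny file whose exact bytes are known, so the dump is easy to read; the file name is an assumption and GNU od output formatting is assumed:
> printf 'Hi\n' > tiny.txt
> od -c tiny.txt
0000000   H   i  \n
0000003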

Displaying files: head and tail
Display the first/last 10 lines of a file.
head [-#] [file1 [file2 … ]]
-#    Display first # lines.
tail [-#] [file1 [file2 … ]]
-#    Display last # lines.
-f    If data is appended to the file, continue displaying new lines as they are added.
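A hedged sketch of head and tail; the first lines of /etc/passwd vary by system, so the output below is only representative of the Debian-style entries used later in these slides, and the log path is an assumption:
> head -3 /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
> tail -f /var/log/syslog
(keeps printing new lines as they are appended; press Ctrl-c to stop)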

File Size
Determining file size:
ls -l
wc [options] file-list

Word count: wc
wc [options] target1 [target2 …]
-c    Count bytes in file only.
-l    Count lines in file only.
-w    Count words in file only.
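A minimal sketch of wc; the input is generated with printf so the counts (2 lines, 3 words, 6 bytes) are exact:
> printf 'a b\nc\n' | wc
      2       3       6
> printf 'a b\nc\n' | wc -l
2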

Creating and Appending to Files
Creating files:
> cat >file
Hello world
Ctrl-d
Appending to files:
> cat >> file
Hello world line 2
Ctrl-d
> cat file
Hello world
Hello world line 2

Concatenating Files
> cat >file1
This is file #1
> cat >file2
This is file #2
> cat file1 file2 >joinedfile
> cat joinedfile
This is file #1
This is file #2

Comparing files: diff
diff [options] oldfile newfile
-b    Ignore trailing blanks and treat other strings of blanks as equivalent.
-c    Output contextual diff format.
-e    Output ed script for converting oldfile to newfile.
-i    Ignore case in letter comparisons.
-u    Output unified diff format.

Comparing Files with diff
diff [options] [file1] [file2]

diff Example
> diff Fall_Hours Spring_Hours
1c1
< Hours for Fall
---
> Hours for Spring
a7
> 1:00 - 2:00 p.m.
9d9
< 3:00 - 4:00 p.m.
12,13d11
< 2:00 - 3:00 p.m.
< 4:00 - 4:30 p.m.
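The -u option listed earlier is worth seeing once; this is a hedged sketch on made-up files, with the header timestamps omitted:
> printf 'red\ngreen\nblue\n' > colors.old
> printf 'red\nyellow\nblue\n' > colors.new
> diff -u colors.old colors.new
--- colors.old
+++ colors.new
@@ -1,3 +1,3 @@
 red
-green
+yellow
 blue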

Removing Repeated Lines
uniq [options] [+N] [input-file] [output-file]
> cat sample
This is a test file for the uniq command.
It contains some repeated and some nonrepeated lines.
Some of the repeated lines are consecutive, like this.
Some of the repeated lines are consecutive, like this.
Some of the repeated lines are consecutive, like this.
And, some are not consecutive, like the following.
Some of the repeated lines are consecutive, like this.
The above line, therefore, will not be considered a repeated
line by the uniq command, but this will be considered repeated!
line by the uniq command, but this will be considered repeated!
> uniq sample
This is a test file for the uniq command.
It contains some repeated and some nonrepeated lines.
Some of the repeated lines are consecutive, like this.
And, some are not consecutive, like the following.
Some of the repeated lines are consecutive, like this.
The above line, therefore, will not be considered a repeated
line by the uniq command, but this will be considered repeated!

uniq
uniq [options] input [output-file]
-c    Precedes each output line with a count of the number of times the line occurred in the input.
-d    Suppresses the writing of lines that are not repeated in the input.
-u    Suppresses the writing of lines that are repeated in the input.

Removing Repeated Lines
uniq [options] [+N] [input-file] [output-file]
> uniq -c sample
1 This is a test file for the uniq command.
1 It contains some repeated and some nonrepeated lines.
3 Some of the repeated lines are consecutive, like this.
1 And, some are not consecutive, like the following.
1 Some of the repeated lines are consecutive, like this.
1 The above line, therefore, will not be considered a repeated
2 line by the uniq command, but this will be considered repeated!
> uniq -d sample
Some of the repeated lines are consecutive, like this.
line by the uniq command, but this will be considered repeated!
> uniq -d sample out
> cat out
Some of the repeated lines are consecutive, like this.
line by the uniq command, but this will be considered repeated!
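Because uniq only collapses adjacent duplicates, it is commonly combined with sort; a hedged sketch using the sample file above (count padding and tie order may differ slightly):
> sort sample | uniq -c | sort -rn
4 Some of the repeated lines are consecutive, like this.
2 line by the uniq command, but this will be considered repeated!
1 This is a test file for the uniq command.
1 The above line, therefore, will not be considered a repeated
1 It contains some repeated and some nonrepeated lines.
1 And, some are not consecutive, like the following.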

Printing Files

Submitting print jobs:
lp [options] file-list
lpr [options] file-list

Checking the print queue:
lpq [options]

Canceling Your Print Job
cancel [options] [printer]

Canceling Your Print Job (Contd)
lprm [options] [jobID-list] [user(s)]
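A hedged sketch of the whole printing workflow on a CUPS-style system; the printer name lab1, the file name, and the job ID are all assumptions:
> lp -d lab1 notes.txt
request id is lab1-42 (1 file(s))
> lpq -P lab1
> cancel lab1-42
> lprm -P lab1 42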

Sorting
Ordering a set of items by some criteria. Systems in which sorting is used include:
Words in a dictionary.
Names of people in a telephone directory.
Numbers.

Sorting: sort
sort [-d] [-f] [-i] [-k #] [-n] [-r] [-u] files
-d    Sort in dictionary order (default).
-f    Ignore case of letters.
-i    Ignore non-printable characters.
-k #  Sort by field number #.
-n    Sort in numerical order.
-r    Reverse order of sort.
-u    Do not list duplicate lines in output.

sort Example
> cat days.txt
Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
> sort days.txt
Friday
Monday
Saturday
Sunday
Thursday
Tuesday
Wednesday

sort Example
> cat days.txt
Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
> sort -r days.txt
Wednesday
Tuesday
Thursday
Sunday
Saturday
Monday
Friday

sort Example
> cat numbers.txt
> sort numbers.txt
> sort -n numbers.txt
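A minimal sketch of why -n matters and how -k selects a field; the contents of numbers.txt and fruit.txt below are made up for illustration:
> printf '100\n9\n23\n' > numbers.txt
> sort numbers.txt
100
23
9
> sort -n numbers.txt
9
23
100
> printf 'pear 5\napple 12\n' > fruit.txt
> sort -n -k 2 fruit.txt
pear 5
apple 12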

Searching Files: grep
grep [-i] [-l] [-n] [-v] pattern file1 [file2 …]
Search for pattern in the file arguments.
-i    Ignore case of letters in files.
-l    Print only the names of files that contain matches.
-n    Print line numbers along with matching lines.
-v    Print only nonmatching lines.

Simple Searches
> grep catt /usr/share/dict/words
cattail
...
wildcatting
> grep -c catt /usr/share/dict/words
29
> grep -c -v catt /usr/share/dict/words
> wc -l /usr/share/dict/words /usr/dict/words
> grep -n catt /usr/share/dict/words
28762:cattail
…
97276:wildcatting
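Two more grep options from the list above, sketched on made-up files so the output is predictable:
> printf 'Wildcatting was discussed.\n' > chapter1.txt
> printf 'Nothing relevant here.\n' > chapter2.txt
> grep -i CATT chapter1.txt chapter2.txt
chapter1.txt:Wildcatting was discussed.
> grep -l catt chapter1.txt chapter2.txt
chapter1.txt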

Regular Expressions
^        Beginning of line
$        End of line
[a-z]    Character range (all lower case)
[aeiou]  Character range (vowels)
.        Any character
*        Zero or more of previous pattern
{n}      Repeat previous match n times
{n,m}    Repeat previous match n to m times
a|b      Match a or b

Regular Expression Searches
> egrep ^dogg /usr/share/dict/words
dogged
…
doggy's
> egrep dogg$ /usr/share/dict/words
> egrep mann$ /usr/share/dict/words
Bertelsmann
…
Weizmann
> egrep ^mann /usr/share/dict/words
manna
…
mannishness's

Regular Expression Searches
> egrep 'catt|dogg' /usr/share/dict/words
boondoggle
boondoggled
...
wildcatting
> egrep 'catt|dogg' /usr/share/dict/words | wc -l
54
> egrep '^(catt|dogg)' /usr/share/dict/words
cattail
…
doggy's

Character classes
> egrep [0-9] /usr/share/dict/words
> egrep -c ^xz /usr/share/dict/words
0
> egrep -c ^[xz] /usr/share/dict/words
153
> egrep -c [xz]$ /usr/share/dict/words
321
> egrep -c [aeiou][aeiou][aeiou][aeiou] /usr/dict/words
36
> egrep [aeiou][aeiou][aeiou][aeiou][aeiou] /usr/share/dict/words
queueing
> egrep [aeiou]{5} /usr/share/dict/words
queueing
> egrep -c :[0-9][0-9]: /etc/passwd
9
> egrep -c ':[0-9]{2,3}:' /etc/passwd
18

Extracting Fields: cut
cut [-f #] [-d delim] file
Select sections from each line of file.
-f #      Select field #.
-d delim  Use delim instead of tab to separate fields.
-c #      Select specified characters instead of fields.
-b #      Select specified bytes instead of fields.

Cut Examples
> cut -d: -f 1 /etc/passwd | head -5
root
daemon
bin
sys
sync
> cut -d: -f 1,3 /etc/passwd | head -5
root:0
daemon:1
bin:2
sys:3
sync:4
> cut -d: -f 1,3-5,7 /etc/passwd | head -5
root:0:0:root:/bin/bash
daemon:1:1:daemon:/bin/sh
bin:2:2:bin:/bin/sh
sys:3:3:sys:/bin/sh
sync:4:65534:sync:/bin/sync

Cut Examples
> cut -c1-4 /etc/passwd | head -5
root
daem
bin:
sys:
sync
> cut -d: -f7 /etc/passwd | cut -c1-4 | head -5
/bin
/bin
/bin
/bin
/bin
> cut -d: -f7 /etc/passwd | cut -c6-20 | head -5
bash
sh
sh
sh
sync

Searching + Extracting: awk
awk [-F delim] '/pattern/ {action}' files
Execute the awk program on each line of the files.
-F delim    Use delim to separate fields.
Patterns are regular expressions.
Actions are extremely powerful, as awk is a simple programming language, but we'll just use print $#, where # is the field we want to print.

Awk Examples
> awk -F: '{print $1}' /etc/passwd | head -5
root
daemon
bin
sys
sync
> awk -F: '{print $1, $3}' /etc/passwd | head -5
root 0
daemon 1
bin 2
sys 3
sync 4
> awk -F: '/root/ {print $1, $3}' /etc/passwd
root 0
> awk -F: '/bin\/false/ {print $1, $3}' /etc/passwd
dhcp 101
syslog 102
klog 103
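Since awk is a full programming language, actions can do more than print $#; a small hedged sketch that sums the UID field, assuming the five /etc/passwd entries shown above:
> head -5 /etc/passwd | awk -F: '{ sum += $3 } END { print NR " users, uid total " sum }'
5 users, uid total 10
> head -5 /etc/passwd | awk -F: '$3 > 2 { print $1 }'
sys
sync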

Stream Editor: sed
sed [-n] '/pattern/action' files
sed [-n] '[line1,line2]s/pat1/pat2/options' files
Filter and modify (if specified) each line of the files.
-n    Do not print lines unless the action specifies printing.
Patterns are regular expressions.
Actions: p = print matching lines, d = delete matching lines, s = replace pattern1 with pattern2.

Using Sed like Grep
> sed -n '/catt/p' /usr/share/dict/words
cattail
…
wildcatting
> sed -n '/catt/p' /usr/share/dict/words | wc -l
29
> sed '/catt/d' /usr/share/dict/words | wc -l
> sed -n '/^dogg/p' /usr/share/dict/words
dogged
…
doggy's
> sed -n '/dogg$/p' /usr/share/dict/words
> sed -n '/mann$/p' /usr/share/dict/words
Bertelsmann
…
Weizmann

Sed Examples
> cat phones.txt
Our phone bill for last year was $859,800,
This is our list of phone numbers:

Sed Substitutions
> sed 's/859/(513)/' phones.txt | head -5
Our phone bill for last year was $(513),800,
This is our list of phone numbers:
(513)
(513)
(513)
> sed 's/859-/(513)-/' phones.txt | head -5
Our phone bill for last year was $859,800,
This is our list of phone numbers:
(513)
(513)
(513)
> sed '3,99s/859/(513)/' phones.txt | head -5
Our phone bill for last year was $859,800,
This is our list of phone numbers:
(513)
(513)
(513)

Sed Substitutions
> sed 's/[0-9]*-[0-9]*-[0-9]*/Number Redacted/' phones.txt | head -5
Our phone bill for last year was $859,800,
This is our list of phone numbers:
Number Redacted
Number Redacted
Number Redacted
> sed 's/\([0-9]*-[0-9]*-[0-9]*\)/Phone number is \1/' phones.txt | head -5
Our phone bill for last year was $859,800,
This is our list of phone numbers:
Phone number is
Phone number is
Phone number is
> sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/(\1) \2-\3/' phones.txt | head -5
Our phone bill for last year was $859,800,
This is our list of phone numbers:
(859)
(859)
(859)
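The same substitutions on a small made-up file so the full before and after is visible; the numbers in nums.txt are assumptions:
> printf '859-555-1234\n859-555-9876\n' > nums.txt
> sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/(\1) \2-\3/' nums.txt
(859) 555-1234
(859) 555-9876
> sed 's/[0-9]*-[0-9]*-[0-9]*/Number Redacted/' nums.txt
Number Redacted
Number Redacted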

Sed and Awk Applications
Sed:
Double space a file.
DOS to UNIX line endings.
Trim leading spaces.
Delete consecutive blank lines.
Remove blanks from begin/end of file.
Awk:
Manage small file db.
Generate reports.
Validate data.
Produce indexes.
Extract fields from UNIX command output.
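A hedged sketch of one-liners for a few of the applications listed above; the file names are made up, GNU sed syntax is assumed, and the ls -l column positions are the usual ones:
> sed G file.txt                   # double space a file
> sed 's/\r$//' dos.txt            # DOS to UNIX line endings
> sed 's/^[ \t]*//' file.txt       # trim leading spaces
> sed '/./,/^$/!d' file.txt        # squeeze runs of blank lines, dropping them at the start and end
> ls -l | awk '{ print $9, $5 }'   # extract the name and size fields from ls -l output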

Sed and Awk vs. Ruby and Others
Sed and Awk:
Small languages
Cryptic syntax
Best for writing one-liners in the shell
Ruby, Python, Perl, etc.:
Large languages
Easy syntax
Best for writing longer programs

References
1. Syed Mansoor Sarwar, Robert Koretsky, Syed Aqeel Sarwar, UNIX: The Textbook, 2nd edition, Addison-Wesley.
2. Nicholas Wells, The Complete Guide to Linux System Administration, Thomson Course Technology.