
1 TEXT PROCESSING UTILITIES

2 THE cat COMMAND
$ cat emp1.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m |product | 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product | 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000
3456 | anil |chman |sales | 30/02/69 |40000
6789 | lalith |mrg | mark. | 17/01/80 |60000
5678 | a | d | m | 12/12/80 |12000
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.
The trailing text lines are part of the file. Each record holds the employee number, name, designation, department, date of birth and salary.

3 DISPLAYING THE BEGINNING OF A FILE – THE head COMMAND
The head command, as the name implies, displays the top lines of a file. When used without an option, it displays the first ten lines of the argument file.

4 $ head emp.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m |product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000
3456 | anil |chman |sales | 30/02/69 |40000
6789 | lalith |mrg | mark. | 17/01/80 |60000
5678 | a | d | m | 12/12/80 |12000
This is the emp
database which stores

5 You can specify the line count and display, say, the first three lines of the file. Use the - symbol, followed by a numeric argument.
Ex: $ head -3 emp.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m |product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
If the line count specified exceeds the number of lines actually present in the file, head displays the entire file.
You can also find the "record length" by counting the characters in the first line of the file:
$ head -1 emp.lst | wc -c
47
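A minimal sketch of the head usage above, run against a scratch file. The file name sample.lst and its contents are invented for illustration. The sketch uses the -n form of the line count; the -3 shorthand shown on the slide is accepted by most implementations as well.

```shell
# Build a small pipe-delimited file to experiment with (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' '101|asha|mgr' '102|ravi|g.m' '103|mina|dir.' '104|john|mgr' > "$workdir/sample.lst"

# First two lines of the file.
first_two=$(head -n 2 "$workdir/sample.lst")
echo "$first_two"

# Length of the first record in bytes, including the trailing newline.
# tr strips the padding that some wc implementations add.
record_len=$(head -n 1 "$workdir/sample.lst" | wc -c | tr -d ' ')
echo "$record_len"

# Asking for more lines than the file holds simply yields the whole file.
all_lines=$(head -n 50 "$workdir/sample.lst" | wc -l | tr -d ' ')
echo "$all_lines"

rm -rf "$workdir"
```

Note that wc -c counts the newline too, so a 12-character record is reported as 13.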

6 head also works with multiple files. For each file it indicates the filename and the lines extracted:
$ head -2 emp.lst f1.lst
==> emp.lst <==
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03 60 | 15000
==> f1.lst <==
root tty7 2009-07-25 09:56 (:0)
root pts/1 2009-07-25 09:56 (:0)

7 DISPLAYING THE END OF A FILE – THE tail COMMAND
The tail command displays the end of the file. It provides an additional method of addressing lines, and can also extract information in units of blocks and characters.
Like head, it displays ten lines (the last ten) when used without options.
Ex: $ tail -3 emp.lst
department
date of birth
and their salary.

8 $ tail emp.lst
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.

9 [itlaxmi@snist ~]$ tail -40c emp.lst
artment
date of birth
and their salary.
Ex: $ tail -v emp.lst
==> emp.lst <==
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.

10 The disadvantage with head and tail is that neither can, on its own, display a range of lines from the middle of a file. Moreover, what is displayed is final: having displayed the first 50 lines of a file, we cannot move back and view, say, the first 10 again.
-v: With this option, tail always prints a header giving the file name.

11 tail can also address lines from the beginning of the file instead of the end. The +count option allows you to do that, where count represents the line number from which the selection should begin.
Ex: $ tail -n +8 emp.lst
5678 | a | d | m | 12/12/80 |12000
This is the emp
database which stores
the information about various
employees.
that is employeenumber.
emp name
designation
department
date of birth
and their salary.
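Although neither command selects a range of lines by itself, the two can be combined in a pipeline to do so. A sketch, assuming a ten-line scratch file (ten.txt is a made-up name):

```shell
# Create a ten-line file: "line 1" ... "line 10" (hypothetical data).
workdir=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/ten.txt"

# Lines 4 through 6: keep the first 6 lines, then the last 3 of those.
range_a=$(head -n 6 "$workdir/ten.txt" | tail -n 3)

# Same range the other way round: start at line 4, then keep 3 lines.
range_b=$(tail -n +4 "$workdir/ten.txt" | head -n 3)

echo "$range_a"
echo "$range_b"

rm -rf "$workdir"
```

Both pipelines give the same three lines. The second form stops reading early on large files, because head exits after three lines.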

12 SLITTING A FILE VERTICALLY – THE cut COMMAND
While head and tail are used to slice a file horizontally, you can slice a file vertically with the cut command. cut identifies both columns and fields.
Syntax: cut
Ex: store the first 5 lines of the file emp.lst in a file shortlist.
$ head -5 emp.lst >shortlist

13 $ cat shortlist
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000
cut can be used to extract specific columns from this file. Use the -c (columns) option for cutting columns:
$ cut -c5-20 shortlist
| shukla | g.m
| sharma |d.g.m
| akash |dir.
| tiwary |g.m
| kumar | mgr
Column numbers must immediately follow the option. Ranges are permitted, and commas are used to separate the column chunks.

14 $ cut -c2-5,10-15,40- shortlist
233 ukla || 20000
876 arma || 15000
898 ash ||9000
456 wary ||23000
234 mar ||15000
The expression 40- indicates column number 40 to the end of the line.
Tracking fields by column positions is tedious, and the file may not contain fixed-length records.
You can extract specific fields using two options: -d (delimiter) to specify the field delimiter and -f (field) to specify the field list.
When you use the -f option, don't forget to use the -d option too, unless the file has the default delimiter (the tab).

15 Ex: $ cut -d"|" -f2,3 shortlist | tee clist1
shukla | g.m
sharma |d.g.m
akash |dir.
tiwary |g.m
kumar | mgr
The tee command saves the output in the file clist1, and also displays it on the terminal.
$ cat clist1
shukla | g.m
sharma |d.g.m
akash |dir.
tiwary |g.m
kumar | mgr
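The column and field modes of cut can be tried side by side. This is a sketch on invented data (emp_sample.lst and desig.lst are hypothetical names); unlike emp.lst, the fields here have no padding spaces, so the output is easy to predict.

```shell
workdir=$(mktemp -d)
printf '%s\n' '2233|shukla|g.m|sales' '9876|sharma|d.g.m|product' > "$workdir/emp_sample.lst"

# Column mode: characters 1-4 of every line (the employee number in this layout).
ids=$(cut -c1-4 "$workdir/emp_sample.lst")
echo "$ids"

# Field mode: fields 2 and 4 with | as the delimiter.
# cut keeps the delimiter between the fields it selects.
name_dept=$(cut -d'|' -f2,4 "$workdir/emp_sample.lst")
echo "$name_dept"

# tee writes a copy to a file while still passing the data to the terminal.
cut -d'|' -f3 "$workdir/emp_sample.lst" | tee "$workdir/desig.lst"
saved=$(cat "$workdir/desig.lst")

rm -rf "$workdir"
```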

16 PASTING FILES – THE paste COMMAND
What you "cut" with the previous command can be pasted with the paste command.
In this respect it resembles the cat command. But while cat joins files one after another (end to end), paste joins them side by side, line by line.
$ cut -d"|" -f6 shortlist | tee clist2
20000
15000
9000
23000
15000
cut was used to create two files, clist1 and clist2, containing two cut-out portions of the same file.

17 $ paste clist1 clist2
shukla | g.m 20000
sharma |d.g.m 15000
akash |dir. 9000
tiwary |g.m 23000
kumar | mgr 15000
By default paste uses the tab character for pasting files. You can specify a delimiter of your choice:
$ paste -d"#" clist1 clist2
shukla | g.m # 20000
sharma |d.g.m # 15000
akash |dir. # 9000
tiwary |g.m # 23000
kumar | mgr # 15000

18 When using the -d option with several files on the command line, you can specify more than one delimiter. For example:
$ paste -d" |#~" file1 file2 file3 file4 file5
The above example uses the space character for pasting file1 and file2, the | character for pasting file2 and file3, and so forth.
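The delimiter list can be seen at work on three tiny files. A sketch with invented file names (f1, f2, f3) and contents:

```shell
workdir=$(mktemp -d)
printf '%s\n' a1 a2 > "$workdir/f1"
printf '%s\n' b1 b2 > "$workdir/f2"
printf '%s\n' c1 c2 > "$workdir/f3"

# Default: corresponding lines are joined with a tab.
tabbed=$(paste "$workdir/f1" "$workdir/f2")
echo "$tabbed"

# Delimiter list: ':' joins f1 and f2, '#' joins f2 and f3.
mixed=$(paste -d':#' "$workdir/f1" "$workdir/f2" "$workdir/f3")
echo "$mixed"

rm -rf "$workdir"
```

The delimiters are used in order, one per gap between files, which is why five files need four delimiters in the example above.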

19 ORDERING A FILE – THE sort COMMAND
sort orders the contents of a file.
It can merge multiple sorted files and store the result in the specified output file.
When the command is invoked without options, it sorts on the entire line:
Ex: $ sort shortlist
1234 | kumar | mgr |accnts | 18/03/79 |15000
2233 | shukla | g.m | sales | 12/12/52 | 20000
3456 | tiwary |g.m |product| 05/02/89 |23000
7898 | akash |dir. |mark. | 11/06/70 |9000
9876 | sharma |d.g.m|product| 12/03 60 | 15000

20 Sorting starts with the first character of each line in the file. If the first character of two lines is the same, then the second character in each line is compared, and so on.
The sorting is done according to the ASCII collating sequence. That is, it sorts the spaces and tabs first, then the punctuation marks, followed by numbers, uppercase letters and lowercase letters, in that order.
Like cut and paste, sort also works on fields; its default field separator is white space. The -t option, followed immediately by the delimiter, overrides the default. This lets you sort the file on any field, for instance, the second field (name):
$ sort -t"|" -k2 shortlist

21 The sort order can be reversed with the -r (reverse) option.
Ex: $ sort -r shortlist
9876 | sharma |d.g.m|product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
2233 | shukla | g.m | sales | 12/12/52 | 20000
1234 | kumar | mgr |accnts | 18/03/79 |15000
We can sort the contents of several files in one shot, as in:
$ sort file1 file2 file3

22 Instead of displaying the sorted output on the screen, we can store it in a file:
$ sort -o result clist1
$ cat result
akash |dir.
kumar | mgr
sharma |d.g.m
shukla | g.m
tiwary |g.m
To check whether a file is already sorted, use
$ sort -c shortlist

23 Sorting on secondary key:
You can sort on more than one key, i.e., you can provide a secondary key to sort. For example, if the primary key is the 3rd field and the secondary key is the 2nd field, then you need to specify, for every -k option, where the key ends. This is done in this way:
$ sort -t"|" -k3,3 -k2,2 shortlist
9876 | sharma |d.g.m|product| 12/03 60 | 15000
7898 | akash |dir. |mark. | 11/06/70 |9000
2233 | shukla | g.m | sales | 12/12/52 | 20000
3456 | tiwary |g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |15000
This sorts the file by designation and name. The -k3,3 option indicates that the key starts on the 3rd field and ends on the same field.
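A sketch of the primary and secondary key technique on invented data without padding spaces (staff.lst is a hypothetical name). LC_ALL=C is set here as an assumption, to force byte-wise comparison so that the result does not depend on the locale.

```shell
workdir=$(mktemp -d)
printf '%s\n' '3|tiwary|g.m' '1|kumar|mgr' '2|shukla|g.m' '4|anil|mgr' > "$workdir/staff.lst"

# Primary key: field 3 only (designation); secondary key: field 2 only (name).
by_desig_name=$(LC_ALL=C sort -t'|' -k3,3 -k2,2 "$workdir/staff.lst")
echo "$by_desig_name"

# Same keys, reversed order.
reversed=$(LC_ALL=C sort -r -t'|' -k3,3 -k2,2 "$workdir/staff.lst")
echo "$reversed"

rm -rf "$workdir"
```

Within each designation the names come out in alphabetical order, which is the job of the secondary key.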

24 Sorting on columns:
You can also specify a character position within a field as the start of the sort key. For example, to sort the file by year of birth, sort on the 7th and 8th character positions within the 5th field:
$ sort -t"|" -k5.7,5.8 shortlist
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma |d.g.m|product| 12/03 60 | 15000
1234 | kumar | mgr |accnts | 18/03/79 |15000
7898 | akash |dir. |mark. | 11/06/70 |9000
3456 | tiwary |g.m |product| 05/02/89 |23000
When -t is used, positions are counted from the first character after the delimiter, so a leading space in a field counts as position 1. Note that 79 appears before 70 in the output above; if each date in this file is preceded by exactly one space, the year digits sit at positions 8 and 9, and the key would be -k5.8,5.9.

25 Numeric sort (-n):
When sort acts on numerals, strange things can happen.
[itlaxmi@snist ~]$ cat>nfile
2
4
10
27
[itlaxmi@snist ~]$ sort nfile
10
2
27
4
This is probably not what you expected, but the ASCII collating sequence places 1 above 2, and 2 above 4. That's why 10 preceded 2 and 27 preceded 4. This can be overridden by the -n (numeric) option.

26 [itlaxmi@snist ~]$ sort -n nfile
2
4
10
27
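The difference between the two comparisons can be reproduced with the same four numbers. In this sketch, paste -s is used only to put the result on one line, and LC_ALL=C is an assumption that pins the text comparison to byte order.

```shell
workdir=$(mktemp -d)
printf '%s\n' 2 4 10 27 > "$workdir/nfile"

# Character-by-character comparison: "10" sorts before "2" because '1' < '2'.
as_text=$(LC_ALL=C sort "$workdir/nfile" | paste -s -d' ' -)
echo "$as_text"

# Numeric comparison with -n.
as_numbers=$(sort -n "$workdir/nfile" | paste -s -d' ' -)
echo "$as_numbers"

# -c only checks the order and reports the result through the exit status.
if sort -n -c "$workdir/nfile"; then numeric_order=yes; else numeric_order=no; fi
echo "$numeric_order"

rm -rf "$workdir"
```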

27 Removing Repeated Lines (-u):
The -u (unique) option lets you remove repeated lines from a file. To find out the unique designations that occur in the file, cut out the designation field and pipe it to sort:
$ cut -d"|" -f3 e.lst | sort -u |tee desg.lst
dir.
g.m
mgr
Merge sort (-m):
When sort is used with multiple filenames as arguments, it concatenates them and sorts them collectively.
When large files are sorted in this way, performance often suffers. The -m (merge) option can merge two or more files that are already sorted individually.
$ sort -m f1 f2 f3

28 sort options
Option      Description
-tchar      Uses delimiter char to identify fields
-k n        Sorts on nth field
-k m,n      Starts sort on mth field and ends sort on nth field
-k m.n      Starts sort on nth column of mth field
-u          Removes repeated lines
-n          Sorts numerically
-r          Reverses sort order
-f          Folds lowercase to equivalent uppercase (case-insensitive sort)
-m list     Merges sorted files in list
-c          Checks if the file is sorted
-o flname   Places output in file flname

29 THE uniq COMMAND
There is often a problem of duplicate entries creeping in due to faulty data entry. Unix offers a special tool to handle these records: the uniq command.
The command is most useful when placed in pipelines, and can be used as an SQL-type query tool (distinct).
Ex: $ cat dept.lst
01 | accounts | 6213
02 | admin | 5423
03 | marketing | 6521
$ uniq dept.lst
01 | accounts | 6213
02 | admin | 5423
03 | marketing | 6521

30 uniq simply fetches one copy of the redundant records, writing them to the standard output.
Since uniq requires a sorted file as input, the general procedure is to sort a file and pipe the output to uniq. The following pipeline produces the same output, except that the output is saved in a file:
$ sort dept.lst | uniq - ulist
[itlaxmi@snist d1]$ cat ulist
01 | accounts | 6213
02 | admin | 5423
03 | marketing | 6521
Like sort, uniq also accepts the output filename as an argument. Since this is done without using an option (unlike -o in sort), you should make sure that you don't specify multiple filenames as input to this command; uniq uses only one input file at a time.

31 If we use two filenames, uniq simply processes the first file and overwrites the second with its output, so you lose the data in the second file.
If uniq is merely to select unique lines, it is preferable to use sort -u. But uniq has a couple of options which can be used to make simple database queries.
Ex: To determine the designation that occurs uniquely in the file e.lst, cut out the 3rd field, sort it, and then pipe it to uniq.
$ cat e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
9876 | sharma | mgr |product| 12/03 60 | 15000
7898 | akash | dir. |mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |1500

32 The -u (unique) option selects only the non-repeated lines.
Ex: $ cut -d"|" -f3 e.lst |sort |uniq -u
dir.
The -d (duplicate) option selects only one copy of the repeated lines:
Ex: $ cut -d"|" -f3 e.lst |sort |uniq -d
g.m
mgr
And the -c (count) option displays the frequency of occurrence of all lines, along with the lines:
Ex: $ cut -d"|" -f3 e.lst |sort |uniq -c
1 dir.
2 g.m
2 mgr
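The three uniq options can be compared on one sorted list. This is a sketch on an invented list of designations (desig.txt is a hypothetical name); the sed step strips the leading padding that uniq -c puts before each count, since its width differs between implementations.

```shell
workdir=$(mktemp -d)
printf '%s\n' g.m mgr dir. g.m mgr > "$workdir/desig.txt"

# uniq only compares adjacent lines, so sort first.
sorted="$workdir/desig.sorted"
LC_ALL=C sort "$workdir/desig.txt" > "$sorted"

only_once=$(uniq -u "$sorted")                   # lines that occur exactly once
repeated=$(uniq -d "$sorted")                    # one copy of each repeated line
counts=$(uniq -c "$sorted" | sed 's/^ *//')      # frequency of every line
distinct=$(LC_ALL=C sort -u "$workdir/desig.txt")  # same as sort | uniq

echo "$only_once"
echo "$repeated"
echo "$counts"
echo "$distinct"

rm -rf "$workdir"
```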

33 LINE NUMBERING – THE nl COMMAND
There is a separate command in the UNIX system that has elaborate schemes for numbering lines: the nl command.
nl numbers only logical lines, i.e. lines containing something apart from the newline character.
By default, nl simply adds line numbers to its input, and prints them in a space six characters wide:
Ex: $ nl clist1
1 shukla | g.m
2 sharma |d.g.m
3 akash |dir.
4 tiwary |g.m
5 kumar | mgr

34 nl uses the tab character to separate the numbers from the text. Use the -w (width) option to specify the width of the number format, and -s (separator) to specify the separator:
Ex: $ nl -w2 -s":" clist1
1: shukla | g.m
2: sharma |d.g.m
3: akash |dir.
4: tiwary |g.m
5: kumar | mgr

35 To have leading zeroes in the first field, use the -n option:
Ex: $ nl -w2 -s":" -nrz clist1
01: shukla | g.m
02: sharma |d.g.m
03: akash |dir.
04: tiwary |g.m
05: kumar | mgr
The -n option, followed immediately by the parameter rz, right-justifies the number, with leading zeroes filling the gaps. The other format you can use is ln, which left-justifies the number and removes the leading zeroes.

36 In many applications, you have code tables starting from a number different from 1 (or 01 or 001). The -v option, followed by a number, determines the initial value used to number the lines. You can use the number 40 as the initial value:
Ex: $ nl -w2 -s":" -nrz -v40 clist1
40: shukla | g.m
41: sharma |d.g.m
42: akash |dir.
43: tiwary |g.m
44: kumar | mgr

37 You can set the increment too with the -i (increment) option:
Ex: $ nl -w2 -s":" -nrz -v40 -i5 clist1
40: shukla | g.m
45: sharma |d.g.m
50: akash |dir.
55: tiwary |g.m
60: kumar | mgr
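The nl options discussed so far can be combined in a single run. This is a sketch on an invented three-line file (fruit.txt is a hypothetical name), with no blank lines, so every line is numbered.

```shell
workdir=$(mktemp -d)
printf '%s\n' apple pear fig > "$workdir/fruit.txt"

# Width 2, ':' as separator, default right-justified numbers without zeroes.
plain=$(nl -w2 -s':' "$workdir/fruit.txt")
echo "$plain"

# Zero-filled numbers, starting at 40, stepping by 5.
coded=$(nl -w2 -s':' -nrz -v40 -i5 "$workdir/fruit.txt")
echo "$coded"

rm -rf "$workdir"
```

With -w2 and the default format, single-digit numbers are padded with one leading space; -nrz replaces that padding with a zero.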

38 TRANSLATING CHARACTERS - THE tr COMMAND
The tr (translate) filter manipulates individual characters in a line.
It translates characters using one or two compact expressions:
Syntax: tr options expression1 expression2 standard input
tr takes input only from the standard input; it doesn't take a filename as argument.
By default, it translates each character in expression1 to its mapped counterpart in expression2.
The 1st character in the 1st expression is replaced with the 1st character in the 2nd expression, and similarly for the other characters.

39 Ex: To replace the "|" with a ~ (tilde) and the "/" with a -:
$ tr '|/' '~-' < shortlist | head -2
2233 ~ shukla ~ g.m ~ sales ~ 12-12-52 ~ 20000
9876 ~ sharma ~d.g.m~product~ 12-03 60 ~ 15000
Changing case of text:
To change the case of the first two lines from lower to upper:
$ head -2 e.lst | tr '[a-z]' '[A-Z]'
2233 | SHUKLA | G.M | SALES | 12/12/52 | 20000
9876 | SHARMA | MGR |PRODUCT| 12/03 60 | 15000

40 Using ASCII octal values and escape sequences:
tr also uses octal values and escape sequences to represent characters.
To have each field on a separate line, replace the "|" with the LF character (octal value 012):
$ tr '|' '\012' < emp.lst |head -n 6
2233
shukla
g.m
sales
12/12/52
20000

41 Deleting characters (-d):
To delete the characters "|" and "/" from the file:
$ tr -d '|/' < shortlist | head -n 2
2233 shukla g.m sales 121252 20000
9876 sharma d.g.m product 1203 60 15000
Compressing multiple consecutive characters (-s):
We can eliminate all redundant spaces in files with delimited fields with the -s (squeeze) option.
The -s option squeezes multiple consecutive occurrences of its argument to a single character.
$ tr -s ' ' <shortlist | head -n 3
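The tr operations above can be tried on a single record. This is a sketch on an invented record string. For case conversion it uses the POSIX character classes [:lower:] and [:upper:] as an alternative to the bracketed ranges shown on the earlier slide.

```shell
# A made-up record with three spaces after the name.
record='2233|shukla   |12/12/52'

# Translate: | becomes ~ and / becomes -.
translated=$(printf '%s\n' "$record" | tr '|/' '~-')

# Delete every | and /.
deleted=$(printf '%s\n' "$record" | tr -d '|/')

# Squeeze runs of spaces down to a single space.
squeezed=$(printf '%s\n' "$record" | tr -s ' ')

# Convert lowercase to uppercase.
upper=$(printf '%s\n' "$record" | tr '[:lower:]' '[:upper:]')

# Put each field on its own line by turning | into a newline (octal 012),
# then count the resulting lines.
field_count=$(printf '%s\n' "$record" | tr '|' '\012' | wc -l | tr -d ' ')

echo "$translated"
echo "$deleted"
echo "$squeezed"
echo "$upper"
echo "$field_count"
```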

42 File Utilities
cut, paste, head, tail, cmp, comm, diff

43 Filters
A group of commands, each of which accepts some data as input, performs some manipulation on it, and produces some output. Since they perform some filtering action on the data, they are appropriately called filters.
grep, egrep, fgrep, sed, awk, sort, uniq, nl

44 SEARCHING FOR A PATTERN – THE grep COMMAND
grep (global regular expression print) scans a file for the occurrence of a pattern.
It takes a number of options and, depending on their usage, outputs the lines containing the pattern, the filenames, or the line numbers.
Syntax: grep
Most of grep's options are also shared by the other members of its family (egrep and fgrep).

45 In addition to options, grep requires an expression to represent the pattern to be searched for. The first argument (barring the options) is always treated as the expression, and the remaining ones as filenames.
grep looks for all occurrences of the expression in its input and, by default, outputs the lines containing the expression.

46 Ex: $ grep "sales" e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
When grep is used with multiple filenames, it displays the filenames along with the output.
$ grep "sales" e.lst shortlist
e.lst:2233 | shukla | g.m | sales | 12/12/52 | 20000
shortlist:2233 | shukla | g.m | sales | 12/12/52 | 20000

47 Because grep is also a filter, it can search its standard input for the pattern and store the output in a file:
$ who | grep itlaxmi > fff
Quoting in grep:
Quoting is essential if the search string consists of more than one word, or uses any of the shell's special characters like *, $ etc.
grep simply returns the prompt when the pattern can't be located.
$ grep president shortlist
$

48 grep options
Option      Significance
-c          Displays a count of the lines containing the pattern
-l          Displays only the names of the files containing the pattern
-n          Displays line numbers along with the lines
-v          Displays the lines that do not match the expression
-i          Ignores case when matching
-h          Omits filenames when handling multiple files
-f flname   Takes expressions from file flname (egrep and fgrep only)
-x          Displays lines matched in entirety (fgrep only)

49 Examples
1. $ grep -h mgr emp.lst shortlist
1234 | kumar | mgr |accnts | 18/03/79 |15000
2. $ grep -c 'mgr' e.lst emp.lst
e.lst:2
emp.lst:1
3. $ grep -n 'mgr' e.lst emp.lst
e.lst:2:9876 | sharma | mgr |product| 12/03 60 | 15000
e.lst:5:1234 | kumar | mgr |accnts | 18/03/79 |1500
emp.lst:5:1234 | kumar | mgr |accnts | 18/03/79 |15000

50 Examples
4. $ grep -v 'mgr' e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
7898 | akash | dir.|mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000
The -v option selects the lines that do not contain the pattern, so it effectively deletes the matching lines from the output.
5. $ grep -l 'mgr' *.lst
desg.lst
desig.lst
e1.lst
e.lst
emp1.lst
emp.lst

51 Examples
6. $ grep -i 'SHUKLA' e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000

52 Basic Regular Expressions (BRE)
You don't always search a file with simple strings. You may be looking for a name without knowing exactly how it is spelt. Or you may be interested in occurrences of a pattern only at a certain location, e.g. the beginning of a record.
The importance of grep lies not merely in its simple pattern-matching capability but in its acceptance of a regular expression as a pattern.
A regular expression is a string of ordinary characters and metacharacters which can be used to match more than one type of pattern.

53 The BRE Character Set Used by grep, sed and awk
Pattern    Matches
*          Zero or more occurrences of the previous character
.          Any single character
[pqr]      A single character p, q or r
[c1-c2]    A single character within the ASCII range represented by c1 and c2
[^pqr]     A single character which is not a p, q or r
^pat       Pattern pat at beginning of line
pat$       Pattern pat at end of line

54 Examples
g*          Nothing or g, gg, ggg, etc.
gg*         g, gg, ggg, etc.
.*          Nothing or any number of characters
[1-3]       A digit between 1 and 3
[^a-zA-Z]   A nonalphabetic character
bash$       bash at end of line
^bash$      bash as the only word in line
^$          Lines containing nothing
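A few of these patterns can be checked against a throwaway file. The sketch below is only illustrative: the file sample.txt, its seven lines and the variable names are invented for the purpose. It uses grep -c to count the lines each pattern selects.

```shell
tmpdir=$(mktemp -d)
# Seven lines, one of them empty.
printf '%s\n' 'g' 'ggg' 'bash' 'use bash' 'bash rocks' '' 'room 2' \
    > "$tmpdir/sample.txt"

# g* can match nothing at all, so every line qualifies.
n_star=$(grep -c 'g*' "$tmpdir/sample.txt")
# gg* needs at least one g.
n_gg=$(grep -c 'gg*' "$tmpdir/sample.txt")
# bash at end of line.
n_end=$(grep -c 'bash$' "$tmpdir/sample.txt")
# bash as the only word in the line.
n_only=$(grep -c '^bash$' "$tmpdir/sample.txt")
# Lines containing nothing.
n_empty=$(grep -c '^$' "$tmpdir/sample.txt")
# A digit between 1 and 3.
n_digit=$(grep -c '[1-3]' "$tmpdir/sample.txt")

echo "g*=$n_star gg*=$n_gg bash\$=$n_end ^bash\$=$n_only ^\$=$n_empty [1-3]=$n_digit"
```

The single quotes around each pattern keep the shell from interpreting *, $ and the brackets before grep sees them.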

55 Examples
$ grep "k.*" e.lst
2233 | shukla | g.m | sales | 12/12/52 | 20000
7898 | akash | dir.|mark. | 11/06/70 |9000
1234 | kumar | mgr |accnts | 18/03/79 |1500
$ grep "9000$" e.lst
7898 | akash | dir.|mark. | 11/06/70 |9000
$ grep '[Ss]h*arma' e.lst
9876 | sharma | mgr |product| 12/03/60 |15000
8888 | Sarma | dir.| sales | 05/09/60 |25000
$ grep '[1-2]...$' e.lst
1234 | kumar | mgr |accnts | 18/03/79 |1500

56 EXTENDING grep – THE egrep
The egrep command extends grep's pattern-matching capabilities. It offers all the options of grep, but its most useful feature is the facility to specify more than one pattern for the search. Each pattern is separated from the next by a | (pipe).

57 The extended regular expression set used by egrep and awk
Expression   Significance
ch+          Matches one or more occurrences of the character ch
ch?          Matches zero or one occurrence of the character ch
exp1|exp2    Matches the expression exp1 or exp2
(x1|x2)x3    Matches the expression x1x3 or x2x3

58 Examples
g+               At least one g
g?               Nothing or one g
GIF|JPEG         Matches GIF or JPEG
(lock|ver)wood   Matches lockwood or verwood
$ egrep 'sales |mark.' e.lst
2233 | shukla | g.m | sales | 12/12/52 |20000
7898 | akash | dir.|mark. | 11/06/70 |9000
8888 | Sarma | dir.| sales | 05/09/60 |25000
$ egrep -i '(sh|s)arma' e.lst
9876 | sharma | mgr |product| 12/03/60 |15000
8888 | Sarma | dir.| sales | 05/09/60 |25000
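The alternation and ? operators can be exercised with a short sketch. It uses grep -E, the POSIX spelling of egrep, and a six-row sample file whose contents are assumed from the e.lst listings on these slides. The variable names are illustrative only.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/e.lst" <<'EOF'
2233 | shukla | g.m | sales | 12/12/52 |20000
9876 | sharma | mgr |product| 12/03/60 |15000
7898 | akash | dir.|mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000
1234 | kumar | mgr |accnts | 18/03/79 |1500
8888 | Sarma | dir.| sales | 05/09/60 |25000
EOF

# Alternation: lines mentioning either department.
# The backslash makes the dot in mark. literal.
either=$(grep -E -c 'sales|mark\.' "$tmpdir/e.lst")

# Grouping: sharma or Sarma.
grouped=$(grep -E -c '(sh|S)arma' "$tmpdir/e.lst")

# ? makes the h optional; -i ignores case, so both spellings match.
optional=$(grep -E -i -c 'sh?arma' "$tmpdir/e.lst")

echo "either=$either grouped=$grouped optional=$optional"
```

The grouped form and the ? form select the same two rows here; which reads better depends on how many alternatives there are.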

59 $ egrep -f pat.lst emp.lst
The command takes the expressions from the file pat.lst. This file must contain the patterns, suitably delimited in the same way as they are specified on the command line.
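As a rough sketch of the pattern-file form, the lines below store a |-delimited expression in a file and hand it to grep -E -f (the POSIX spelling of egrep -f). The file contents and variable names are invented for the example.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/emp.lst" <<'EOF'
2233 | shukla | g.m | sales | 12/12/52 |20000
9876 | sharma | mgr |product| 12/03/60 |15000
1234 | kumar | mgr |accnts | 18/03/79 |15000
8888 | Sarma | dir.| sales | 05/09/60 |25000
EOF

# The pattern file holds the expression exactly as it would be typed
# on the command line, with | separating the alternatives.
printf '%s\n' 'sales|accnts' > "$tmpdir/pat.lst"

matched=$(grep -E -c -f "$tmpdir/pat.lst" "$tmpdir/emp.lst")
echo "lines matched: $matched"
```

Keeping the expression in a file avoids quoting problems and lets the same set of patterns be reused across commands.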

60 MULTIPLE STRING SEARCHING – THE fgrep
Like egrep, fgrep accepts alternative patterns, both from the command line and from a file, but unlike grep and egrep, it doesn't accept regular expressions.
If the pattern to search for is a simple string, or a group of them, then fgrep is recommended. It is faster than its two fellow members, and should be used when searching for fixed strings.
Alternative patterns in fgrep are specified by separating one pattern from another with the newline character. This is unlike egrep, which uses the | to delimit two expressions.

61 Ex:
If you search for three specific departments (without regular expressions), fgrep used in the following manner can produce a list, sorted in reverse order, of the lines containing any of the three patterns:
$ fgrep 'sales
> personnel
> admin' emp.lst | sort -t "|" +3r | tee newlist
(+3r is the older way of writing a reverse sort key that starts at the fourth field; POSIX sort writes it as -k 4r.)
Like egrep, fgrep also takes patterns from a file, except that each string has to be stored on a separate line.
Ex:
$ cat pat1.lst
sales
personnel
admin
$ fgrep -f pat1.lst emp.lst

62 Examples
$ fgrep 'sales
> mark.
> product' e.lst
2233 | shukla | g.m | sales | 12/12/52 |20000
9876 | sharma | mgr |product| 12/03/60 |15000
7898 | akash | dir.|mark. | 11/06/70 |9000
3456 | tiwary | g.m |product| 05/02/89 |23000
8888 | Sarma | dir.| sales | 05/09/60 |25000
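The difference between a fixed string and a regular expression shows up as soon as the pattern contains a metacharacter. The sketch below uses grep -F, the POSIX spelling of fgrep. The row for "marks" is a hypothetical record added to make the point, and the variable names are illustrative.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/e.lst" <<'EOF'
7898 | akash | dir.|mark. | 11/06/70 |9000
5555 | marks | mgr |admin | 01/01/75 |8000
2233 | shukla | g.m | sales | 12/12/52 |20000
EOF

# As a regular expression, the dot matches any character,
# so both "mark." and "marks" are selected.
as_regex=$(grep -c 'mark.' "$tmpdir/e.lst")

# As a fixed string, the dot must be a literal dot.
as_fixed=$(grep -F -c 'mark.' "$tmpdir/e.lst")

# Alternative fixed strings are separated by newlines.
patterns='sales
admin'
multi=$(grep -F -c "$patterns" "$tmpdir/e.lst")

echo "regex=$as_regex fixed=$as_fixed multi=$multi"
```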

63 RELATIONAL JOIN – THE join COMMAND
join helps to establish a logical relationship between two tables. It uses a common column in each table to establish this relationship and, by default, creates a single row which contains all the columns of the two tables.
The prerequisite is that both tables be sorted on the joined columns.
Syntax:
join file1 file2
When no field delimiters are specified, it assumes that the fields are delimited by spaces.

64 The join command uses numbers to identify fields, but it also uses numbers to identify files. Since you can join only two files with a single command, the file number can take the value 1 or 2, depending on the position of the file argument on the command line.

65 Examples
$ cat > emp_table
empid designation deptno
111 director 10
112 manager 10
113 dgm 20
$ cat > dept_table
deptno deptname
10 sales
20 production
$ join -j1 3 -j2 1 emp_table dept_table
deptno empid designation deptname
10 111 director sales
10 112 manager sales
20 113 dgm production
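The same join can be sketched with the POSIX options -1 and -2, which name the join field in the first and second file and correspond to the older -j1 and -j2 used above. The header rows are left out here so that both files are plainly sorted on the join field. The file and variable names are illustrative.

```shell
tmpdir=$(mktemp -d)

# Employee table, already sorted on field 3 (deptno).
cat > "$tmpdir/emp_table" <<'EOF'
111 director 10
112 manager 10
113 dgm 20
EOF

# Department table, sorted on field 1 (deptno).
cat > "$tmpdir/dept_table" <<'EOF'
10 sales
20 production
EOF

# Join field 3 of the first file with field 1 of the second.
# The join field is printed first, then the other fields of each file.
joined=$(join -1 3 -2 1 "$tmpdir/emp_table" "$tmpdir/dept_table")

first=$(printf '%s\n' "$joined" | head -n 1)
rows=$(printf '%s\n' "$joined" | wc -l | tr -d ' ')

printf '%s\n' "$joined"
```

If either file were not sorted on its join field, it could be passed through sort first, for example with sort -k 3 for the employee table.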

66 CREATING A TEE – THE tee COMMAND
tee is an external command and not a feature of the shell. It handles a character stream by splitting its input into two components: it saves one in a file and writes the other to the standard output.
Being also a filter, tee can be placed anywhere in a pipeline.
tee doesn't perform any filtering action on its input; it passes on exactly what it receives.
The following command sequence uses tee to display the output of who and save this output in a file as well.

67 Examples
$ who | tee user.lst
root tty7 2009-08-04 09:51 (:0)
root pts/1 2009-08-04 09:51 (:0)
itlaxmi pts/2 2009-08-04 12:52 (10.4.8.19)
$ cat user.lst
root tty7 2009-08-04 09:51 (:0)
root pts/1 2009-08-04 09:51 (:0)
itlaxmi pts/2 2009-08-04 12:52 (10.4.8.19)

68 Since tee uses standard output, you can pipe its output to another command, say wc:
$ who | tee user.lst | wc -l
3
The -a (append) option appends the output to the file specified as argument.
$ cal 2009 | tee -a calfile > calfile2
The sequence appends one stream to the file calfile, while overwriting the file calfile2 with the other stream.
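Both behaviours, passing the stream on and appending with -a, can be checked with a small sketch. printf stands in for who and cal so that the result does not depend on who is logged in. The file and variable names are assumptions.

```shell
tmpdir=$(mktemp -d)

# tee saves a copy in copy.lst and passes the same three lines on to wc.
piped=$(printf 'a\nb\nc\n' | tee "$tmpdir/copy.lst" | wc -l | tr -d ' ')

# With -a the new line is appended instead of overwriting the file.
printf 'd\n' | tee -a "$tmpdir/copy.lst" > /dev/null
total=$(wc -l < "$tmpdir/copy.lst" | tr -d ' ')

echo "lines piped on: $piped, lines now in file: $total"
```

Without -a, the second tee would have truncated copy.lst and left it with a single line.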

69 THE pg COMMAND
The disadvantage of head and tail is that they cannot display a range of lines. Moreover, what is displayed is final: if we have displayed the first 50 lines of a file, we cannot move back and view, say, the 10th line.
Unix provides two commands which offer more flexibility in viewing files: pg and more. They work in more or less the same manner, except for a few minor differences.

70 Each of them helps you view a file page by page, with a lot of useful options such as:
(a) Set the number of lines to be displayed per page.
(b) Move either forwards or backwards in a file at the touch of a key.
(c) Skip pages while viewing the file page by page.
(d) Search the file for a pattern in the forward or backward direction.
On executing either of these commands, one pageful of the file's contents is displayed on the screen, followed by a prompt at which the user can give various commands that are understood by pg or more.

71 Example
$ pg +10 -15 -p "Page no. %d" myfile
This command starts displaying the contents of myfile, 15 lines at a time, from the 10th line onwards. At the end of each displayed page a prompt appears which shows the number of the page on view. This prompt overrides the default ':' prompt of the pg command.

