18.12.2001 CSA405: Unix Trix for Empirical CL. How to use Unix as a toolbox for NLP applications.


1 CSA405: Unix Trix for Empirical CL. How to use Unix as a toolbox for NLP applications

2 Acknowledgements
The contents of this lecture are inspired by Gerald Gazdar (University of Sussex) and Ken Church (AT&T). Thanks.

3 Unix Tools
grep: search for a pattern
sort: sort a file
uniq: eliminate adjacent duplicate lines
tr: translate characters
wc: count words
sed: edit strings
awk: pattern-based programming language
cut: cut out selected fields of each line of a file
paste: merge corresponding or subsequent lines of files
comm: select or reject lines common to two files
join: relational database operator
Use man followed by the command name for further details of these.

4 Text
I was intrigued by the article "Cloning a human being `a long way off`" (December 3). I attended the well-presented lecture by Dr Bruce Campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained. This involves the transfer of the nucleus from an adult human mature cell, such as skin, hair or mucosa, into the denucleated human ovum of a female of the species, which is then allowed to start developing for a few days to the stage where the placental precursor cells separate from the cells destined to become the foetus.

5 Punctuation 1
sed -f markpunct.sed
File contents:
s/"/ xzzdoublequotezzx /g
s/'/ xzzquotezzx /g
s/`/ xzzquotezzx /g
s/(/ xzzleftparenzzx /g
Output:
I was intrigued by the article xzzdoublequotezzx Cloning a human being xzzquotezzx

6 Punctuation 2
sed -f angle.sed
File contents:
s/xzz//g
Output:
I was intrigued by the article Cloning a human being
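The rule as transcribed, s/xzz//g, only strips the xzz prefix of each placeholder. The file name angle.sed suggests that the placeholders were meant to become angle-bracket tags. The sketch below chains both punctuation steps under that assumption; the function names mark_punct and to_angle and the two rules inside to_angle are illustrative, not taken from the slides.

```shell
# Step 1: replace punctuation with placeholder tokens
# (same rules as markpunct.sed on the Punctuation 1 slide).
mark_punct() {
  sed -e 's/"/ xzzdoublequotezzx /g' \
      -e "s/'/ xzzquotezzx /g" \
      -e 's/`/ xzzquotezzx /g' \
      -e 's/(/ xzzleftparenzzx /g'
}

# Step 2 (assumed): turn xzzNAMEzzx placeholders into <NAME> tags.
to_angle() {
  sed -e 's/xzz/</g' -e 's/zzx/>/g'
}

printf '%s\n' 'the article "Cloning a human being`' | mark_punct | to_angle
```

Keeping punctuation as ordinary word-like tokens means the later tr-based tokenisation step does not have to discard it silently.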

7 Case
tr 'A-Z' 'a-z'
Output:
i was intrigued by the article "cloning a human being `a long way off`" (december 3). i attended the well-presented lecture by dr bruce campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained.

8 Tokenisation
tr -sc 'a-zA-Z' '\012'
Output:
I
was
intrigued
by
the
article
Cloning
a
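A minimal sketch of the tokenisation step on a one-line input. The function name tokenise and the sample sentence are chosen here for illustration. Note that digits and punctuation are dropped and hyphenated words are split, because every character that is not a letter becomes a line break.

```shell
tokenise() {
  # -c: operate on the complement of the set, i.e. every non-letter
  # -s: squeeze each run of resulting newlines into a single newline
  tr -sc 'a-zA-Z' '\012'
}

printf '%s\n' 'I attended the well-presented lecture (December 3).' | tokenise
```

The octal escape \012 is the newline character, so the output has one token per line, ready for sort and uniq.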

9 Sorting
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort
Output:
a
a
a
a
adult
allowed
an
article
as
attended
become
being
bruce
by
by

10 Making a Wordlist
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq
Output:
a
adult
allowed
an
article
as
attended
become
being
bruce
by
campbell
cell
cells
cloning

11 Counting
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c
Output:
4 a
1 adult
1 allowed
1 an
1 article
1 as
1 attended
1 become
1 being
1 bruce
2 by
1 campbell
1 cell
3 cells
2 cloning

12 Sorted Frequency List
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r
Output:
13 the
5 of
4 human
4 a
3 to
3 cells
2 was
2 i
2 from
2 for
2 cloning
2 by
1 which
1 wherein

13 Sorted Frequency List
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r | cat -n
Output (rank, frequency, word):
1  13 the
2  5 of
3  4 human
4  4 a
5  3 to
6  3 cells
7  2 was
8  2 i
9  2 from
10  2 for
11  2 cloning
12  2 by
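The whole pipeline can be wrapped in a shell function, sketched below. The name ranked_freq and the sample sentence are illustrative. The sketch uses sort -rn rather than plain sort -r: the -n flag compares the counts as numbers, so the ordering does not depend on how uniq -c pads its count column.

```shell
ranked_freq() {
  tr 'A-Z' 'a-z' |          # fold to lower case
    tr -sc 'a-z' '\012' |   # one word per line
    sort |                  # bring identical words together
    uniq -c |               # count each distinct word
    sort -rn |              # highest count first, compared numerically
    cat -n                  # prefix each line with its rank
}

printf '%s\n' 'The cat and the dog and the bird' | ranked_freq
```

The order of words that share the same count may vary with the locale, so only the relative order of different counts should be relied on.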

14 Zipf
Principle of least effort: people act so as to minimise their probable average rate of work.
The speaker's effort is conserved by having a small number of very frequent words, whilst the hearer's effort is lessened by having a large number of rare words.
Consequence (according to Zipf): a relationship between word frequency and rank.
Frequency × Rank = constant
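The product Frequency × Rank can be computed directly from a sorted frequency list. The sketch below assumes input lines of the form "count word", already sorted by descending count; the function name rank_times_freq is illustrative. The five sample lines are the top of the sorted frequency list above, and a text this small is far too short to confirm or refute the relationship.

```shell
# Reads "count word" lines sorted by descending count and prints:
# rank, frequency, rank * frequency, word
rank_times_freq() {
  awk '{ print NR, $1, NR * $1, $2 }'
}

# Top five counts from the sorted frequency list.
printf '%s\n' '13 the' '5 of' '4 human' '4 a' '3 to' | rank_times_freq
```

For these five rows the products are 13, 10, 12, 16 and 15. On a real corpus the same function can be appended to the uniq -c | sort -rn pipeline.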

15 Zipf Curve
[Figure: Zipf curve, axes labelled Rank and Frequency]

16 paste and tail
paste: The default operation of paste will concatenate the corresponding lines of the input files. The NEWLINE character of every line except the line from the last input file will be replaced with a TAB character.
tail: The tail utility copies the named file to the standard output beginning at a designated place.
These two utilities can be used to work with n-grams.

17 Bigrams
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' > foo
tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | tail -n +2 > foo1
paste foo foo1 | sort
Output:
:
human being
human mature
human ovum
human stem
:
the article
the cells
the cutting
the denucleated
the foetus
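The bigram recipe can be sketched as a single function that tokenises once and reuses the result. The name bigrams, the temporary directory and the sample sentence are illustrative. The sketch uses the tail -n +2 spelling, since current tail implementations may not accept the older +2 form.

```shell
bigrams() {
  tmpdir=$(mktemp -d) || return 1
  # foo: one lower-cased word per line
  tr 'A-Z' 'a-z' | tr -sc 'a-z' '\012' > "$tmpdir/foo"
  # foo1: the same list starting from the second word
  tail -n +2 "$tmpdir/foo" > "$tmpdir/foo1"
  # Line i of the output is word i, a TAB, then word i+1
  paste "$tmpdir/foo" "$tmpdir/foo1" | sort
  rm -rf "$tmpdir"
}

printf '%s\n' 'the human ovum of the human being' | bigrams
```

The last word of the text has no successor, so it appears once with an empty second field. Appending uniq -c | sort -rn to the output gives a bigram frequency list in the same way as for single words.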

18 grep
grep '[A-Z]': lines containing an uppercase character
grep '^[A-Z]': lines starting with an uppercase character
grep '[A-Z]$': lines ending with an uppercase character
grep '[^aeiou]': lines containing a character that is not a lowercase vowel
grep '[.]': lines containing a literal full stop (an unbracketed . matches any character)
grep '[A-Z]*': lines containing zero or more uppercase characters, which is every line
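The patterns can be checked against a handful of lines by counting matches with grep -c. The helper names sample and count and the five sample lines are made up for this sketch. LC_ALL=C is set because in some locales a range such as [A-Z] can also match lowercase letters.

```shell
LC_ALL=C
export LC_ALL

# Five sample lines, one per argument.
sample() {
  printf '%s\n' 'Dr Bruce Campbell' 'the cutting edge.' 'stem cells' 'DNA' 'aeiou'
}

# Print how many sample lines match the given pattern.
count() {
  sample | grep -c -- "$1"
}

count '[A-Z]'      # lines containing an uppercase character
count '^[A-Z]'     # lines starting with an uppercase character
count '[A-Z]$'     # lines ending with an uppercase character
count '[^aeiou]'   # lines containing a character other than a, e, i, o, u
count '[.]'        # lines containing a literal full stop
count '[A-Z]*'     # zero or more uppercase characters: every line matches
```

Only the line aeiou fails the '[^aeiou]' test, because every other line contains at least one consonant, space or uppercase letter.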

