Strings and regular expressions Day 10 LING 681.02 Computational Linguistics Harry Howard Tulane University.

1 Strings and regular expressions Day 10 LING 681.02 Computational Linguistics Harry Howard Tulane University

2 Course organization
16-Sept-2009, LING 681.02, Prof. Howard, Tulane University
- http://www.tulane.edu/~ling/NLP/
- NLTK is installed on the computers in this room!
- How would you like to use the Provost's $150?
- Please become a fan of Tulane Linguistics on Facebook.

3 NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level

4 Syntax of single-line strings
- Strings are specified with single quotes, or with double quotes if a single quote is one of the characters. A single quote inside a single-quoted string can also be escaped with a backslash:
    'Monty Python'
    "Monty Python's Flying Circus"
    'Monty Python\'s Flying Circus'
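As a quick check of the quoting rules on this slide, here is a minimal sketch. It is written for Python 3 (the slides use Python 2), and the variable names are only illustrative. It suggests that the double-quoted and the backslash-escaped spellings denote the same string:

```python
# Three ways of writing string literals; the names are illustrative only.
plain = 'Monty Python'
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'

# The two spellings with an apostrophe are the same sequence of characters.
same = (double_quoted == escaped)
print(same)
```

The choice between the two spellings is a matter of readability; the resulting values do not differ.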

5 Syntax of multi-line strings
- A sequence of strings can be joined into a single one with …
  - a backslash at the end of each line:
        'first half'\
        'second half'
    = 'first halfsecond half'
  - parentheses to open and close the sequence:
        ('first half'
         'second half')
    = 'first halfsecond half'
  - triple double quotes to open and close the sequence and maintain line breaks:
        """first half
        second half"""
    = 'first half\nsecond half'
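The three ways of spreading a string over several lines can be tried out with the sketch below. It assumes Python 3, and the variable names are made up for illustration. Note that the triple-quoted string is written flush left here, because any indentation inside the quotes would become part of the string:

```python
# Adjacent literals joined with a trailing backslash.
with_backslash = 'first half' \
                 'second half'

# Adjacent literals joined inside parentheses.
with_parens = ('first half'
               'second half')

# Triple quotes keep the line break as a newline character.
with_triple = """first half
second half"""

print(repr(with_backslash))
print(repr(with_parens))
print(repr(with_triple))
```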

6 Basic operations
- Concatenation (+)
    >>> 'really' + 'really'
    'reallyreally'
- Repetition (*)
    >>> 'really' * 4
    'reallyreallyreallyreally'

7 Your Turn, p. 88!

8 Printing strings
- Make a couple of string assignments:
    harry = 'Harry Potter'
    prince = 'Half-Blood Prince'
- Inspection of a variable produces Python's representation of its value:
    >>> harry
    'Harry Potter'
- Printing a variable produces its value:
    >>> print harry
    Harry Potter
- What do you expect?
    >>> print harry + prince
    >>> print harry, prince
    >>> print harry, 'and the', prince
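To explore the difference between inspecting and printing, one possible sketch follows. It assumes Python 3, where print is a function rather than the statement used on the slide, and it uses repr() as a stand-in for what the interactive prompt echoes. The names shown, joined, and spaced are illustrative only:

```python
harry = 'Harry Potter'
prince = 'Half-Blood Prince'

# repr() gives the quoted representation that the prompt would echo.
shown = repr(harry)

# Concatenation adds nothing between the two strings.
joined = harry + prince

# Mirrors what print(harry, prince) writes: the arguments separated by a space.
spaced = ' '.join([harry, prince])

print(shown)
print(joined)
print(harry, prince)
print(harry, 'and the', prince)
```

Running it and comparing the second and third lines of output shows how concatenation and comma-separated printing differ.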

9 Using indices
- Every character of a string is indexed from 0 (counting from the left) and from -1 (counting from the right):
    >>> harry[0]
    'H'
    >>> harry[-1]
    'r'
    >>> harry[:3]
    'Har'
    >>> harry[-12:-9]
    'Har'
    >>> for char in prince:
    ...     print char,
    ...
    H a l f - B l o o d P r i n c e
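The relation between positive and negative indices can be checked with a short sketch. It assumes Python 3 and the same value of harry as on the earlier slide; the other names are illustrative:

```python
harry = 'Harry Potter'   # 12 characters

first = harry[0]             # leftmost character
last = harry[-1]             # rightmost character
prefix = harry[:3]           # slice stops before index 3
same_prefix = harry[-12:-9]  # the same slice, counted from the right

# Index i and index i - len(harry) pick out the same character.
consistent = all(harry[i] == harry[i - len(harry)] for i in range(len(harry)))

print(first, last, prefix, same_prefix, consistent)
```

A slice [m:n] contains n - m characters, which is why three characters need [:3] rather than [:2].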

10 More string operations
- See Table 3-2

11 Strings vs. lists
- Both are sequences and so support joining by concatenation and separation by slicing.
- But they are different types, so a string cannot be concatenated with a list.
- Granularity
  - Strings have a single level of resolution, the individual character > good for writing to screen or file.
  - Lists can have any level of resolution we want: character, morpheme, word, phrase, sentence, paragraph > good for NLP.
- So the second step in the NLP pipeline is to tokenize a string into a list.
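The points on this slide can be illustrated with a small sketch, assuming Python 3. The sentence is made up, and str.split() is used only as a crude whitespace tokenizer standing in for a real tokenization step:

```python
sentence = 'the quick brown fox'

# Mixing the two sequence types in one concatenation raises a TypeError.
try:
    sentence + ['jumps']
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A list can hold the text at word resolution ...
tokens = sentence.split()

# ... or at character resolution.
chars = list(sentence)

# Joining the word tokens with spaces gives a string again.
rejoined = ' '.join(tokens)

print(mixed_ok, tokens, len(chars), rejoined)
```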

12 NLPP §3 Processing raw text §3.3 Text processing with Unicode

13 Unicode
- The format for representing special characters that go beyond ASCII
- Let's skip this until we really need it.

14 NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats

15 Getting started
- To use regular expressions in Python, we need to import the re library.
- We also need a list of words to search.
  - We'll use the Words Corpus again (Section 2.4).
  - We will preprocess it to remove any proper names.
    >>> import re
    >>> import nltk
    >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

16 Different terminologies
- In textbook, regex = «ed$»
- In re, regex = 'ed$' (i.e. a string)

17 Searching
- re.search(p, s)
  - p is a pattern (what we are looking for), and
  - s is a candidate string for matching the pattern.
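What re.search(p, s) hands back can be seen in a minimal sketch, assuming Python 3. The two candidate strings are made up for illustration. The call returns a match object when the pattern is found somewhere in the string and None otherwise, which is why it can be used directly as a condition:

```python
import re

# 'ed$' asks for the letters "ed" at the very end of the string.
hit = re.search('ed$', 'walked')
miss = re.search('ed$', 'editor')

found = hit is not None                   # a match object was returned
matched_text = hit.group() if hit else None
not_found = miss is None                  # no match, so None came back

print(found, matched_text, not_found)
```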

18 Some examples
- Find words ending in -ed:
    >>> [w for w in wordlist if re.search('ed$', w)]
- Find a word that fits a certain group of blanks in a crossword puzzle that is 8 letters long, with j as the 3rd letter and t as the 6th letter:
    >>> [w for w in wordlist if re.search('^..j..t..$', w)]
- Find the strings email or e-mail:
    >>> [w for w in wordlist if re.search('^e-?mail$', w)]
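The three searches can be tried without the Words Corpus by using a tiny hand-made word list. The list below is an assumption made for illustration, not data from NLTK, and the sketch assumes Python 3:

```python
import re

# Hypothetical stand-in for the NLTK word list.
wordlist = ['abjectly', 'adjuster', 'email', 'e-mail',
            'walked', 'editor', 'mailed', 'red']

# Words ending in -ed: '$' anchors the pattern at the end of the word.
ed_words = [w for w in wordlist if re.search('ed$', w)]

# Eight letters, j third and t sixth: each '.' stands for any one character,
# and '^' ... '$' force the pattern to cover the whole word.
crossword = [w for w in wordlist if re.search('^..j..t..$', w)]

# 'email' or 'e-mail': '?' makes the preceding hyphen optional.
email_forms = [w for w in wordlist if re.search('^e-?mail$', w)]

print(ed_words)
print(crossword)
print(email_forms)
```

Dropping the '^' and '$' from the crossword pattern would also admit longer words that merely contain such a stretch of letters, so the anchors matter for the puzzle.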

19 Next time More on RegEx

