Strings and regular expressions Day 10 LING Computational Linguistics Harry Howard Tulane University
16-Sept-2009, LING, Prof. Howard, Tulane University
Course organization
NLTK is installed on the computers in this room!
How would you like to use the Provost's $150?
Please become a fan of Tulane Linguistics on Facebook.
NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level
Syntax of single-line strings
Strings are specified with single quotes, with double quotes if a single quote is one of the characters, or with a backslash that escapes the single quote:
'Monty Python'
"Monty Python's Flying Circus"
'Monty Python\'s Flying Circus'
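A minimal sketch of the quoting options above, written in Python 3 syntax. The variable names are illustrative and do not come from the slides. The double-quoted and the escaped literal should denote the same string.

```python
# Two ways to write a string that contains an apostrophe.
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'   # the backslash escapes the quote

# No apostrophe inside, so plain single quotes are enough.
plain = 'Monty Python'

# The quoting style is not part of the value.
same = (double_quoted == escaped)
print(same)
```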
Syntax of multi-line strings
A sequence of strings can be joined into a single one with …
a backslash at the end of each line:
'first half'\
'second half'
= 'first halfsecond half'
parentheses to open and close the sequence:
('first half'
'second half')
= 'first halfsecond half'
triple double quotes to open and close the sequence and maintain line breaks:
"""first half
second half"""
= 'first half\nsecond half'
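The three multi-line forms can be checked side by side. This is only a sketch in Python 3 syntax; the variable names are made up for illustration.

```python
# Backslash continuation: adjacent literals are joined into one string.
joined_backslash = 'first half'\
    'second half'

# Parentheses: the same implicit joining, with no backslash needed.
joined_parens = ('first half'
                 'second half')

# Triple quotes: the line break is kept as a newline character.
triple = """first half
second half"""

print(repr(joined_backslash))
print(repr(joined_parens))
print(repr(triple))
```

Note that the first two forms insert nothing between the pieces, so any space between 'half' and 'second' has to be written inside one of the literals.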
Basic operations
Concatenation (+)
>>> 'really' + 'really'
'reallyreally'
Repetition (*)
>>> 'really' * 4
'reallyreallyreallyreally'
Your Turn p. 88 !!!
Printing strings
Make a couple of string assignments:
>>> harry = 'Harry Potter'
>>> prince = 'Half-Blood Prince'
Inspection of a variable produces Python's representation of its value:
>>> harry
'Harry Potter'
Printing a variable produces its value:
>>> print harry
Harry Potter
What do you expect?
>>> print harry + prince
>>> print harry, prince
>>> print harry, 'and the', prince
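The difference between inspecting and printing can be made explicit with the built-ins repr() and str(). This sketch assumes Python 3, where print is a function rather than the Python 2 statement used on the slide.

```python
harry = 'Harry Potter'

# What the interpreter shows when you type the bare variable name:
# Python's representation, quotes included.
inspected = repr(harry)

# What print writes: the value itself, without quotes.
printed = str(harry)

print(inspected)
print(printed)
```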
Using indices
Every character of a string is indexed from 0 at the left (and from -1 at the right)
>>> harry[0]
'H'
>>> harry[-1]
'r'
>>> harry[:3]
'Har'
>>> harry[-12:-9]
'Har'
>>> for char in prince:
...     print char,
H a l f - B l o o d   P r i n c e
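A small sketch of how positive and negative indices line up, in Python 3 syntax. The variable names other than harry are illustrative. A slice stops just before its end index, so the first three characters are [:3].

```python
harry = 'Harry Potter'        # 12 characters: indices 0..11 or -12..-1

first = harry[0]              # leftmost character
last = harry[-1]              # rightmost character
prefix = harry[:3]            # characters 0, 1, 2
same_prefix = harry[-12:-9]   # the same three characters, counted from the right

# Every position has two names: i and i - len(harry).
aligned = all(harry[i] == harry[i - len(harry)] for i in range(len(harry)))

print(first, last, prefix, same_prefix, aligned)
```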
More string operations
See Table 3-2
Strings vs. lists
Both are sequences and so support joining by concatenation and separation by slicing.
But they are different types, so a string cannot be concatenated with a list.
Granularity
Strings have a single level of resolution, the individual character > good for writing to screen or file.
Lists can have any level of resolution we want: character, morpheme, word, phrase, sentence, paragraph > good for NLP.
So the second step in the NLP pipeline is to tokenize a string into a list.
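Both points can be sketched in a few lines of Python 3. The example sentence is made up, and split() on whitespace is only a crude stand-in for a real tokenizer.

```python
sentence = 'Monty Python and the Holy Grail'

# Tokenize: one string of characters becomes a list of words.
tokens = sentence.split()

# A string and a list cannot be concatenated with each other.
try:
    sentence + tokens
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Going back from the list to a single string, e.g. for writing to a file.
rejoined = ' '.join(tokens)

print(tokens)
print(mixed_ok)
```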
NLPP §3 Processing raw text §3.3 Text processing with Unicode
Unicode
The format for representing special characters that go beyond ASCII
Let's skip this until we really need it.
NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats
Getting started
To use regular expressions in Python, we need to import the re library.
We also need a list of words to search. We'll use the Words Corpus again (Section 2.4).
We will preprocess it to remove any proper names, keeping only the all-lowercase entries.
>>> import re
>>> import nltk
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
Different terminologies
In textbook, regex = «ed$»
In re, regex = 'ed$' (i.e. a string)
Searching
re.search(p, s)
p is a pattern (what we are looking for), and s is a candidate string for matching the pattern.
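A minimal sketch of what re.search gives back, in Python 3 syntax. The two candidate strings are chosen for illustration. A successful search returns a match object, a failed one returns None, which is why the call can be used directly as a test in an if.

```python
import re

# 'walked' ends in -ed, so the search succeeds and returns a match object.
hit = re.search('ed$', 'walked')

# 'editor' contains "ed", but not at the end, so the search returns None.
miss = re.search('ed$', 'editor')

print(hit.group())   # the text that matched the pattern
print(miss)
```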
Some examples
Find words ending in -ed:
>>> [w for w in wordlist if re.search('ed$', w)]
Find a word that fits a certain group of blanks in a crossword puzzle that is 8 letters long, with j as the 3rd letter and t as the 6th letter:
>>> [w for w in wordlist if re.search('^..j..t..$', w)]
Find the strings email or e-mail:
>>> [w for w in wordlist if re.search('^e-?mail$', w)]
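To try these patterns without the corpus, here is a sketch that runs the same three searches over a tiny hand-made word list. The list is hypothetical and stands in for the NLTK Words Corpus; the code is Python 3.

```python
import re

# Hypothetical miniature word list (the slides use nltk.corpus.words).
wordlist = ['abjectly', 'adjuster', 'email', 'e-mail',
            'walked', 'editor', 'mailed']

# $ anchors the pattern at the end of the word.
ends_in_ed = [w for w in wordlist if re.search('ed$', w)]

# ^ and $ pin the length to 8; each . stands for exactly one character.
crossword = [w for w in wordlist if re.search('^..j..t..$', w)]

# -? makes the hyphen optional.
mail = [w for w in wordlist if re.search('^e-?mail$', w)]

print(ends_in_ed)
print(crossword)
print(mail)
```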
Next time More on RegEx