Strings and regular expressions Day 10 LING 681.02 Computational Linguistics Harry Howard Tulane University.



16-Sept-2009, LING 681.02, Prof. Howard, Tulane University

Course organization
- NLTK is installed on the computers in this room!
- How would you like to use the Provost's $150?
- Please become a fan of Tulane Linguistics on Facebook.

NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level

Syntax of single-line strings
- Strings are specified with single quotes, or with double quotes if a single quote is one of the characters. A single quote inside a single-quoted string can also be escaped with a backslash:
    'Monty Python'
    "Monty Python's Flying Circus"
    'Monty Python\'s Flying Circus'

Syntax of multi-line strings
- A sequence of strings can be joined into a single one with:
  - a backslash at the end of each line:
      'first half'\
      'second half'
    = 'first halfsecond half'
  - parentheses to open and close the sequence:
      ('first half'
      'second half')
    = 'first halfsecond half'
  - triple double quotes to open and close the sequence, which maintains line breaks:
      """first half
      second half"""
    = 'first half\nsecond half'
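The quoting and joining rules on the last two slides can be checked in a short script. This is a minimal sketch written for Python 3; the variable names are only illustrative and do not come from the slides.

```python
# Two ways to write a string that contains an apostrophe.
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'   # backslash escapes the quote

# Adjacent string literals are joined into one string, either with a
# line-continuation backslash or inside parentheses.
joined_backslash = 'first half' \
                   'second half'
joined_parens = ('first half'
                 'second half')

# Triple quotes keep the line break as a newline character.
triple = """first half
second half"""

print(double_quoted == escaped)   # True
print(repr(joined_backslash))     # 'first halfsecond half'
print(repr(joined_parens))        # 'first halfsecond half'
print(repr(triple))               # 'first half\nsecond half'
```

Note that joining adjacent literals inserts no space between them, which is why the result reads 'first halfsecond half'.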

Basic operations
- Concatenation (+)
    >>> 'really' + 'really'
    'reallyreally'
- Repetition (*)
    >>> 'really' * 4
    'reallyreallyreallyreally'

Your Turn, p. 88!

Printing strings
- Make a couple of string assignments:
    harry = 'Harry Potter'
    prince = 'Half-Blood Prince'
- Inspection of a variable produces Python's representation of its value:
    >>> harry
    'Harry Potter'
- Printing a variable produces its value:
    >>> print harry
    Harry Potter
- What do you expect?
    >>> print harry + prince
    >>> print harry, prince
    >>> print harry, 'and the', prince

Using indices
- Every character of a string is indexed from 0 counting from the left (and from -1 counting from the right):
    >>> harry[0]
    'H'
    >>> harry[-1]
    'r'
    >>> harry[:2]
    'Ha'
    >>> harry[-12:-10]
    'Ha'
    >>> for char in prince:
    ...     print char,
    H a l f - B l o o d   P r i n c e
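The relation between positive and negative indices can be spelled out as follows. This is a minimal sketch for Python 3 (the slides use the Python 2 print statement); apart from `harry` and `prince`, the names are illustrative.

```python
harry = 'Harry Potter'
prince = 'Half-Blood Prince'

first = harry[0]               # 'H': indexing starts at 0
last = harry[-1]               # 'r': -1 is the last character
prefix = harry[:2]             # 'Ha': a slice stops before its end index
same_prefix = harry[-12:-10]   # 'Ha': -12 is index 0 in a 12-character string

# A negative index i refers to position len(s) + i.
assert harry[-12] == harry[len(harry) - 12]

# Python 3 counterpart of the loop with "print char,":
# put a space between every pair of characters.
spaced = ' '.join(prince)
print(first, last, prefix, same_prefix)
print(spaced)
```

Because a slice excludes its end index, `harry[:2]` returns two characters, not three.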

More string operations
- See Table 3-2.

Strings vs. lists
- Both are sequences, and so both support joining by concatenation and separation by slicing.
- But they are different types, so a string cannot be concatenated with a list.
- Granularity
  - Strings have a single level of resolution, the individual character: good for writing to screen or file.
  - Lists can have any level of resolution we want (character, morpheme, word, phrase, sentence, paragraph): good for NLP.
- So the second step in the NLP pipeline is to tokenize a string into a list.
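The contrast between strings and lists can be tried out directly. This is a minimal sketch for Python 3; it uses `str.split()` as a stand-in for a real tokenizer, which is an assumption made for brevity and not the tokenization method of the course.

```python
sentence = 'Harry Potter and the Half-Blood Prince'

# Tokenize: split on whitespace to turn the string into a list of words.
tokens = sentence.split()

# Both types support slicing, each at its own level of resolution.
first_chars = sentence[:5]   # characters
first_words = tokens[:2]     # words

# Mixing the two types in a concatenation raises TypeError.
try:
    sentence + tokens
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Going back from a list to a string for writing to screen or file.
rejoined = ' '.join(tokens)

print(first_chars)
print(first_words)
print(mixed_ok)
print(rejoined == sentence)
```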

NLPP §3 Processing raw text §3.3 Text processing with Unicode

Unicode
- The format for representing special characters that go beyond ASCII.
- Let's skip this until we really need it.

NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats

Getting started
- To use regular expressions in Python, we need to import the re library.
- We also need a list of words to search.
  - We'll use the Words Corpus again (Section 2.4).
  - We will preprocess it to remove any proper names.
    >>> import re
    >>> import nltk
    >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

Different terminologies
- In the textbook, a regex is written «ed$».
- In re, a regex is written 'ed$' (i.e. as a string).

Searching
- re.search(p, s)
  - p is a pattern, i.e. what we are looking for, and
  - s is a candidate string for matching the pattern.
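What `re.search(p, s)` returns determines how it can be used as a test. This is a minimal sketch for Python 3; the example words are chosen here for illustration and are not from the slides.

```python
import re

# re.search returns a match object if the pattern is found anywhere
# in the string, and None otherwise.
hit = re.search(r'ed$', 'walked')    # 'ed' at the end of the string
miss = re.search(r'ed$', 'editor')   # 'ed' is present, but not at the end

print(hit is not None)   # True
print(miss is None)      # True

# A match object counts as true and None as false, so the result
# can be used directly in an if clause.
if hit:
    print(hit.group(), hit.start())   # ed 4
```

This truth-value behaviour is what lets `re.search` serve as the condition of a list comprehension.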

Some examples
- Find words ending in -ed:
    >>> [w for w in wordlist if re.search('ed$', w)]
- Find words that fit a crossword slot 8 letters long, with j as the 3rd letter and t as the 6th letter:
    >>> [w for w in wordlist if re.search('^..j..t..$', w)]
- Find the strings email or e-mail:
    >>> [w for w in wordlist if re.search('^e-?mail$', w)]
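The three searches can be tried without the Words Corpus. This is a minimal sketch for Python 3 in which `wordlist` is a small hand-made stand-in, assumed here only so that the example runs without NLTK; with the real corpus the results would be much longer.

```python
import re

# Hypothetical stand-in for the Words Corpus word list.
wordlist = ['abjectly', 'adjuster', 'dejected', 'editor',
            'email', 'e-mail', 'emails', 'walked']

# $ anchors the pattern at the end of the word.
ends_in_ed = [w for w in wordlist if re.search('ed$', w)]

# ^ anchors at the start; each . stands for exactly one character,
# so the pattern only fits 8-character words.
crossword = [w for w in wordlist if re.search('^..j..t..$', w)]

# ? makes the preceding character (the hyphen) optional.
email = [w for w in wordlist if re.search('^e-?mail$', w)]

print(ends_in_ed)   # ['dejected', 'walked']
print(crossword)    # ['abjectly', 'adjuster', 'dejected']
print(email)        # ['email', 'e-mail']
```

Without the `^` and `$` anchors, the last pattern would also accept 'emails', since `re.search` looks for the pattern anywhere in the string.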

Next time More on RegEx