Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy

Text Parsing ► The three W’s! ► Efficiency and Perfection

What is Text Parsing? ► common programming task ► extract or split a sequence of characters

Why Text Parsing? ► Simple file parsing  A tab separated file ► Data extraction  Extract specific information from a log file ► Find and replace ► Parsers - syntactic analysis ► NLP  Extract information from a corpus  POS Tagging

Text Parsing Methods ► String Functions ► Regular Expressions ► Parsers

String Functions ► String methods and the string module in Python  Faster, easier to understand and maintain ► If you can do it with string functions, DO IT! ► Different built-in functions  Find-Replace  Split-Join  Startswith and Endswith  Is methods
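A minimal sketch of the Split-Join pair on a made-up tab-separated line (the data is invented for illustration):
line = "alice\t24\tengineer"
fields = line.split('\t')      # ['alice', '24', 'engineer']
print(fields[0])               # alice
print(','.join(fields))        # alice,24,engineer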

Find and Replace ► find, index, rindex, replace ► EX: Replace a string in all files in a directory
import glob, fileinput, sys

files = glob.glob(path)                    # path is a glob pattern such as '*.txt'
for line in fileinput.input(files, inplace=1):
    if line.find(stext) >= 0:              # find() returns -1 when the string is absent
        line = line.replace(stext, rtext)
    sys.stdout.write(line)

startswith and endswith ► Extract quoted words from the given text
myString = '"123"'
if myString.startswith('"'):
    print "string with double quotes"
► Find if the sentences are interrogative or exclamative ► What an amazing game that was! ► Do you like this?
endings = ('!', '?')
sentence.endswith(endings)

isMethods ► to check for alphabetic characters, digits, character case, etc.  m = 'xxxasdf '  m.isalpha()  False (the trailing space is not alphabetic)

Regular Expressions ► concise way for complex patterns ► amazingly powerful ► wide variety of operations ► when you go beyond simple, think about regular expressions!

Real world problems ► Match IP addresses, email addresses, URLs ► Match balanced sets of parentheses ► Substitute words ► Tokenize ► Validate ► Count ► Delete duplicates ► Natural Language Processing

RE in Python ► Unleash the power - built-in re module ► Functions  to compile patterns ► compile  to perform matches ► match, search, findall, finditer  to perform operations on match objects ► group, start, end, span  to substitute ► sub, subn ► Metacharacters
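A small hedged sketch of these functions in action (the pattern and sample text are made up):
import re

pat = re.compile(r'(\d+)-(\d+)')
m = pat.search('pages 12-34 and 56-78')
if m:
    print(m.group(0))    # 12-34  (whole match)
    print(m.group(1))    # 12     (first group)
    print(m.span())      # (6, 11)
print(pat.findall('pages 12-34 and 56-78'))     # [('12', '34'), ('56', '78')]
print(pat.sub('N-N', 'pages 12-34 and 56-78'))  # pages N-N and N-N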

Compiling patterns ► re.compile() ► pattern for IP Address
 ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
 ^\d+\.\d+\.\d+\.\d+$
 ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
 ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
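A hedged usage sketch of the last pattern above for validating dotted-quad strings (the sample addresses are made up):
import re

ip_re = re.compile(r'^([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                   r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                   r'([01]?\d\d?|2[0-4]\d|25[0-5])\.'
                   r'([01]?\d\d?|2[0-4]\d|25[0-5])$')

for candidate in ('192.168.0.1', '256.1.1.1', '10.0.0'):
    print('%s -> %s' % (candidate, bool(ip_re.match(candidate))))
# 192.168.0.1 -> True
# 256.1.1.1 -> False
# 10.0.0 -> False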

Compiling patterns ► pattern for matching parentheses  \(.*\)  \([^)]*\)  \([^()]*\)
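A short sketch of why the stricter character class matters (the sample text is invented): the greedy \(.*\) swallows everything between the first '(' and the last ')', while \([^()]*\) matches each group separately.
import re

text = 'f(a) and g(b)'
print(re.findall(r'\(.*\)', text))      # ['(a) and g(b)']
print(re.findall(r'\([^()]*\)', text))  # ['(a)', '(b)']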

Substitute ► Perform several string substitutions on a given string
import re
def make_xlat(*args, **kwargs):
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate
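A hypothetical usage of make_xlat, showing that all substitutions happen in a single pass (the word pairs are made up):
translate = make_xlat({'cat': 'dog', 'dogma': 'catma'})
print(translate('the cat chased dogma'))   # the dog chased catma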

Count ► Split and count words in the given text  p = re.compile(r'\W+')  len(p.split('This is a test for split().'))

Tokenize ► Parsing and Natural Language Processing  s = 'tokenize these words'  words = re.compile(r'\b\w+\b|\$')  words.findall(s)  ['tokenize', 'these', 'words']

Common Pitfalls ► operations on fixed strings, single character classes, no case-sensitivity issues - plain string methods may be enough ► re.sub() vs. string.replace() ► re.sub() vs. string.translate() ► match vs. search ► greedy vs. non-greedy
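Two of these pitfalls in a minimal sketch (the sample strings are made up): match() anchors at the start of the string while search() scans, and .* is greedy while .*? is not.
import re

print(re.match(r'd', 'abcdef'))           # None - match() only looks at the start
print(re.search(r'd', 'abcdef').group())  # d
print(re.findall(r'<.*>', '<a><b>'))      # ['<a><b>']  (greedy)
print(re.findall(r'<.*?>', '<a><b>'))     # ['<a>', '<b>']  (non-greedy)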

PARSERS ► Flat and Nested texts ► Nested tags, Programming language constructs ► Better to do less than to do more!

Parsing non-flat texts ► Grammar ► States ► Generate tokens and act on them ► Lexer - generates a stream of tokens ► Parser - generates a parse tree out of the tokens ► Lex and Yacc

Grammar Vs RE ► Floating Point
#---- EBNF-style description of Python floats ----#
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"

Grammar Vs RE
pat = r'''(?x)
    (                      # exponentfloat
      (                    # intpart or pointfloat
        (                  # pointfloat
            (\d+)?[.]\d+   # optional intpart with fraction
          | \d+[.]         # intpart with period
        )                  # end pointfloat
        | \d+              # intpart
      )                    # end intpart or pointfloat
      [eE][+-]?\d+         # exponent
    )                      # end exponentfloat
  | (                      # pointfloat
        (\d+)?[.]\d+       # optional intpart with fraction
      | \d+[.]             # intpart with period
    )                      # end pointfloat
'''
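A hedged usage sketch of the verbose pattern defined above (the sample string is made up; pat refers to the pattern on this slide):
import re

float_re = re.compile(pat)
for m in float_re.finditer('pi is 3.14, N is 6.022e23, and x = .5'):
    print(m.group(0))
# 3.14
# 6.022e23
# .5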

PLY - The Python Lex and Yacc ► higher-level and cleaner grammar language ► LALR(1) parsing ► extensive input validation, error reporting, and diagnostics ► Two modules, lex.py and yacc.py

Using PLY - Lex and Yacc ► Lex: ► Import the lex module (ply.lex) ► Define a list or tuple variable 'tokens' naming the token types the lexer is allowed to produce ► Define token rules - by assigning regexes to specially named variables or functions ('t_TOKENNAME') ► Build the lexer  mylexer = lex.lex()  mylexer.input(mytext) # tokens are then consumed by yacc

Lex
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Integer value too large", t.value
        t.value = 0
    return t

t_ignore = " \t"
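A minimal, hedged sketch of the scaffolding PLY needs around token rules like these (assumes PLY is installed; the input string is made up):
import ply.lex as lex

tokens = ('NAME', 'NUMBER')

t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    print('Illegal character %r' % t.value[0])
    t.lexer.skip(1)

mylexer = lex.lex()
mylexer.input('width 42')
for tok in mylexer:
    print('%s %s' % (tok.type, tok.value))
# NAME width
# NUMBER 42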

Yacc ► Import the 'yacc' module ► Get a token map from a lexer ► Define a collection of grammar rules ► Build the parser  yacc.yacc()  yacc.parse('x=3')

Yacc ► Specially named functions having a 'p_' prefix
def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print p[1]
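A minimal, hedged end-to-end sketch wiring these rules to a lexer (assumes PLY is installed; the expression grammar is reduced to bare numbers and names purely for illustration):
import ply.lex as lex
import ply.yacc as yacc

# ---- lexer ----
tokens = ('NAME', 'NUMBER')
literals = ['=']

t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

# ---- parser ----
names = {}

def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print(p[1])

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_expression_name(p):
    'expression : NAME'
    p[0] = names.get(p[1], 0)

def p_error(p):
    print('Syntax error')

lex.lex()
parser = yacc.yacc()
parser.parse('x = 3')
parser.parse('x')        # prints 3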

Summary ► String Functions  A rule of thumb - if you can do it with string functions, do it. ► Regular Expressions  Complex patterns - something beyond simple! ► Lex and Yacc  Parse non-flat texts - those that follow some rules

References ► Mastering Regular Expressions by Jeffrey E. F. Friedl ► Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher ► Text Processing in Python by David Mertz

Thank You Q & A