Strings and Serialization


Strings and Serialization Damian Gordon

REGULAR EXPRESSIONS

A regular expression is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching. Regular expressions originated in 1956, when mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets.

Basic Patterns
Logical OR: A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
Grouping: Parentheses are used to define the scope and precedence of the operators. For example, gr(a|e)y matches the same two words as gray|grey.
Quantification: A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur.
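The difference between plain alternation and grouped alternation can be checked in Python with a minimal sketch like the one below. It assumes re.fullmatch is used so that the whole word has to fit the pattern; the variable names are illustrative only.

```python
import re

# Alternation across the whole pattern: "gray" OR "grey".
alternation = re.compile(r"gray|grey")

# Grouping restricts the alternation to the single letter between "gr" and "y".
grouped = re.compile(r"gr(a|e)y")

for word in ["gray", "grey", "groy"]:
    # fullmatch succeeds only if the entire word fits the pattern.
    print(word, bool(alternation.fullmatch(word)), bool(grouped.fullmatch(word)))

# The parentheses also capture the letter that was chosen.
chosen = grouped.fullmatch("grey").group(1)
print(chosen)
```

Both patterns accept the same two words; the grouped form additionally records which alternative was taken.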

Quantifiers
?: indicates zero or one occurrence of the preceding element. For example, colou?r matches both "color" and "colour".
*: indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+: indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

Quantifiers
{n}: The preceding item is matched exactly n times.
{min,}: The preceding item is matched min or more times.
{min,max}: The preceding item is matched at least min times, but not more than max times.
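As a rough illustration of the quantifiers, the sketch below tries each one against a few candidate strings. The helper matches_whole and the test strings for the brace quantifiers are assumptions made for this example; re.fullmatch is used so that extra characters cause a failure.

```python
import re

def matches_whole(pattern, text):
    """Return True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Each pattern is paired with strings that should and should not fit it.
cases = {
    r"colou?r": ["color", "colour", "colouur"],
    r"ab*c": ["ac", "abc", "abbbc"],
    r"ab+c": ["ac", "abc", "abbbc"],
    r"a{3}": ["aa", "aaa", "aaaa"],
    r"a{2,}": ["a", "aa", "aaaa"],
    r"a{2,3}": ["a", "aa", "aaa", "aaaa"],
}

for pattern, candidates in cases.items():
    for text in candidates:
        print(f"{pattern!r:12} {text!r:10} {matches_whole(pattern, text)}")
```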

The Python Standard Library module for regular expressions is called re, for example:

    # PROGRAM MatchingPatterns:
    import re

    search_string = "hello world"
    pattern = "hello world"
    match = re.match(pattern, search_string)
    if match:
        # THEN
        print("regex matches")
    # ENDIF;
    # END.

Bear in mind that the match function only matches the pattern at the beginning of the string. Thus, if the pattern were "ello world", no match would be found. With confusing asymmetry, the pattern does not have to reach the end of the string: the parser stops searching as soon as it finds a match, so the pattern "hello wo" matches successfully.

So with this code:

    import re

    pattern = "hello world"
    search_string = "hello world"
    match = re.match(pattern, search_string)
    if match:
        template = "'{}' matches pattern '{}'"
    else:
        template = "'{}' does not match pattern '{}'"
    # ENDIF;
    print(template.format(search_string, pattern))
    # END.

For search_string = "hello world":

    pattern            result
    "hello world"      MATCH
    "hello worl"       MATCH
    "ello world"       NO MATCH
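The results above follow from re.match checking only a prefix of the string. When the whole string has to fit, one option is re.fullmatch, and another is to end the pattern with $. The sketch below compares the three approaches on the same patterns; it is an illustration rather than part of the original program.

```python
import re

search_string = "hello world"

for pattern in ["hello world", "hello worl", "ello world"]:
    # re.match: the pattern must fit at the start, but may stop early.
    prefix_ok = re.match(pattern, search_string) is not None
    # re.fullmatch: the pattern must cover the entire string.
    whole_ok = re.fullmatch(pattern, search_string) is not None
    # Appending $ forces re.match to reach the end of the string as well.
    anchored_ok = re.match(pattern + "$", search_string) is not None
    print(f"{pattern!r:15} match={prefix_ok} fullmatch={whole_ok} with $={anchored_ok}")
```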

Matching Single Characters

The period character, when used in a regular expression pattern, can match any single character. Using a period in the pattern means you don't care what the character is, just that there is a character there.
'hello world' matches pattern 'hel.o world'
'helpo world' matches pattern 'hel.o world'
'hel o world' matches pattern 'hel.o world'
'helo world' does not match pattern 'hel.o world'

The square brackets, when used in a regular expression pattern, can match any one of a list of single characters.
'hello world' matches pattern 'hel[lp]o world'
'helpo world' matches pattern 'hel[lp]o world'
'helPo world' does not match pattern 'hel[lp]o world'

The square brackets, when used in a regular expression pattern, can also match a range of single characters.
'hello world' does not match pattern 'hello [a-z] world'
'hello b world' matches pattern 'hello [a-z] world'
'hello B world' matches pattern 'hello [a-zA-Z] world'
'hello 2 world' matches pattern 'hello [a-zA-Z0-9] world'
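The period, character-set and range examples can be reproduced with a short loop such as the following sketch. The list name checks is illustrative, and one extra case ('hello B world' against the lowercase-only range) is added here as an assumption to show case sensitivity.

```python
import re

# (pattern, text) pairs taken from the single-character examples above.
checks = [
    ("hel.o world", "hello world"),
    ("hel.o world", "hel o world"),
    ("hel.o world", "helo world"),
    ("hel[lp]o world", "helpo world"),
    ("hel[lp]o world", "helPo world"),
    ("hello [a-z] world", "hello b world"),
    ("hello [a-z] world", "hello B world"),   # extra case: ranges are case sensitive
    ("hello [a-zA-Z] world", "hello B world"),
    ("hello [a-zA-Z0-9] world", "hello 2 world"),
]

for pattern, text in checks:
    verdict = "matches" if re.match(pattern, text) else "does not match"
    print(f"'{text}' {verdict} pattern '{pattern}'")
```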

But what happens if we want to match the period character or the square bracket? We use the backslash:
'.' matches pattern '\.'
'[' matches pattern '\['
']' matches pattern '\]'
'(' matches pattern '\('
')' matches pattern '\)'

Other backslash characters:

    Character   Description
    \n          newline
    \t          tab
    \s          whitespace character
    \w          letters, numbers, and underscores
    \d          digit

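So for example:
'(abc]' matches pattern '\(abc\]'
' 1a' matches pattern '\s\d\w'
'\t5n' does not match pattern '\s\d\w' (here \t is a literal backslash followed by t, not a tab character; a real tab would match \s)
'5n' does not match pattern '\s\d\w' (there is no leading whitespace character)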
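The escape examples depend on whether the string being searched holds a real tab or the two characters backslash and t. The sketch below makes that distinction explicit using Python raw strings; the variable names are illustrative, and re.escape is shown as one convenient way to build a literal pattern.

```python
import re

shorthand = r"\s\d\w"   # one whitespace character, one digit, one word character

real_tab = "\t5n"       # three characters: a tab, "5", "n"
literal_t = r"\t5n"     # four characters: a backslash, "t", "5", "n"

print(re.match(shorthand, " 1a") is not None)       # space, digit, letter
print(re.match(shorthand, real_tab) is not None)    # a tab counts as whitespace
print(re.match(shorthand, literal_t) is not None)   # a backslash is not whitespace
print(re.match(shorthand, "5n") is not None)        # no leading whitespace at all

# Backslashes turn special characters into literal ones.
print(re.match(r"\(abc\]", "(abc]") is not None)

# re.escape adds the backslashes needed to match a string literally.
literal_pattern = re.escape("(abc]")
print(re.fullmatch(literal_pattern, "(abc]") is not None)
```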

Matching Multiple Characters

The asterisk (*) character says that the previous character can be matched zero or more times.
'hello' matches pattern 'hel*o'
'heo' matches pattern 'hel*o'
'helllllo' matches pattern 'hel*o'

[a-z]* matches any run of lowercase letters, including the empty string:
'A string.' matches pattern '[A-Z][a-z]* [a-z]*\.'
'No .' matches pattern '[A-Z][a-z]* [a-z]*\.'
'' matches pattern '[a-z]*.*'

The plus (+) sign in a pattern behaves similarly to an asterisk; it states that the previous character can be repeated one or more times but, unlike with the asterisk, the character is not optional. The question mark (?) ensures a character shows up either zero times or once, but not more.

Some examples:
'0.4' matches pattern '\d+\.\d+'
'1.' does not match pattern '\d+\.\d+'
'1%' matches pattern '\d?\d%'
'99%' matches pattern '\d?\d%'
'999%' does not match pattern '\d?\d%'
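These results assume the pattern is applied with re.match, which starts at the first character. The sketch below reruns the examples and then shows, as a point of contrast, that re.search would still find '99%' inside '999%' because it is free to start later in the string. The names decimal and percent are illustrative.

```python
import re

decimal = r"\d+\.\d+"   # digits, a literal period, digits
percent = r"\d?\d%"     # one or two digits followed by a percent sign

for pattern, text in [(decimal, "0.4"), (decimal, "1."),
                      (percent, "1%"), (percent, "99%"), (percent, "999%")]:
    verdict = "matches" if re.match(pattern, text) else "does not match"
    print(f"'{text}' {verdict} pattern '{pattern}'")

# re.search may begin at any position, so it finds the last two digits.
found = re.search(percent, "999%")
print(found.group())
```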

To check for a repeating sequence of characters, we can enclose any set of characters in parentheses and treat them as a single pattern:
'abccc' matches pattern 'abc{3}'
'abccc' does not match pattern '(abc){3}'
'abcabcabc' matches pattern '(abc){3}'

Combined with complex patterns, this grouping feature greatly expands our pattern-matching repertoire:
'Eat.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
'Eat more good food.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
'A good meal.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
The first word starts with a capital, followed by zero or more lowercase letters. Then, we enter a parenthetical that matches a single space followed by a word of one or more lowercase letters. This entire parenthetical is repeated zero or more times, and the pattern is terminated with a period. There cannot be any other characters after the period, as indicated by the $ matching the end of string.
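The sentence pattern can be exercised with a sketch like the one below. The two failing inputs ('eat more.' and 'Eat more. Now') are assumptions added for contrast, and the final lines show what the repeated group captures.

```python
import re

# Capitalised first word, then zero or more " word" groups, then a final period.
sentence = re.compile(r"[A-Z][a-z]*( [a-z]+)*\.$")

for text in ["Eat.", "Eat more good food.", "A good meal.",
             "eat more.", "Eat more. Now"]:
    print(repr(text), sentence.match(text) is not None)

# A repeated group keeps only its last repetition.
last_word = sentence.match("Eat more good food.").group(1)
print(repr(last_word))

# If the group repeats zero times, it captures nothing.
no_word = sentence.match("Eat.").group(1)
print(no_word)
```

Because the group is repeated, group(1) holds only the final " word" that was matched, not the full list of words.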

Let's write a Python program to determine if a particular string is a valid e-mail address or not, and if it is an e-mail address, to return the domain name part of the e-mail address. In terms of the regular expression for a valid e-mail format:

    pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"

Python's re module provides an object-oriented interface to the regular expression engine. We've been checking whether the re.match function returns a valid object or not. If a pattern does not match, that function returns None. If it does match, however, it returns a useful object that we can introspect for information about the pattern.

Let's test which of the following addresses are valid:

    search_string = "Damian.Gordon@dit.ie"
    search_string = "Damian.Gordon@ditie"
    search_string = "DamianGordon@dit.ie"
    search_string = "Damian.Gordondit.ie"

    # PROGRAM DomainDetection:
    import re

    def DetectDomain(searchstring):
        # Raw string so that \. reaches the regex engine unchanged.
        pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"
        match = re.match(pattern, searchstring)
        if match is not None:
            # The single group in parentheses captures the domain.
            domain = match.groups()[0]
            print("<<", domain, ">>", "is a legitimate domain")
        else:
            print("<<", searchstring, ">>", "is not an e-mail address")
        # ENDIF;
    # END DetectDomain

Notes on the DetectDomain program:
The pattern is the regular expression for a valid e-mail address, with the domain element in parentheses.
re.match returns None if there is no match, and a match object otherwise.
Because only the domain element is enclosed in parentheses, groups() returns a tuple containing just the domain.
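One way to try the four test addresses is a variant that returns the domain instead of printing it. The sketch below is an assumption-laden rewrite, not the original program: the names EMAIL_PATTERN and extract_domain are hypothetical, and the pattern is the same one used in DetectDomain.

```python
import re

# Same pattern as DetectDomain, compiled once; the group captures the domain.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$")

def extract_domain(address):
    """Return the domain of a well-formed address, or None otherwise."""
    match = EMAIL_PATTERN.match(address)
    if match is None:
        return None
    return match.group(1)

addresses = [
    "Damian.Gordon@dit.ie",
    "Damian.Gordon@ditie",   # no period in the domain
    "DamianGordon@dit.ie",
    "Damian.Gordondit.ie",   # no @ sign
]

for address in addresses:
    print(address, "->", extract_domain(address))
```

Returning None for invalid input leaves the decision about what to print to the caller.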

In addition to the match function, the re module provides a couple of other useful functions, search and findall. The search function finds the first instance of a matching pattern, relaxing the restriction that the pattern start at the first letter of the string. The findall function behaves similarly to search, except that it finds all non-overlapping instances of the matching pattern, not just the first one.
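A small side-by-side comparison of the three functions, using a sample sentence chosen for this sketch:

```python
import re

text = "say hello to the world"

# match only looks at the start of the string, so "hello" is not found.
print(re.match("hello", text))

# search scans forward and returns the first occurrence.
first = re.search("hello", text)
print(first.group(), first.span())

# findall returns every non-overlapping occurrence as a list of strings.
runs_of_l = re.findall("l+", text)
print(runs_of_l)
```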

    >>> import re
    >>> re.findall('a.', 'abacadefagah')
    ['ab', 'ac', 'ad', 'ag', 'ah']
    >>> re.findall('a(.)', 'abacadefagah')
    ['b', 'c', 'd', 'g', 'h']
    >>> re.findall('(a)(.)', 'abacadefagah')
    [('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'g'), ('a', 'h')]
    >>> re.findall('((a)(.))', 'abacadefagah')
    [('ab', 'a', 'b'), ('ac', 'a', 'c'), ('ad', 'a', 'd'), ('ag', 'a', 'g'), ('ah', 'a', 'h')]

etc.