1
Digital State Machines
Regular Expressions & Languages
2
Chapter Outline
Regular Expressions
  Basic Regular Expression Patterns
  Disjunction, Grouping and Precedence
  Examples
  Advanced Operators
  Regular Expression Substitution, Memory and ELIZA
  Summary
Summary
28 November 2018 Veton Këpuska
3
Regular Expressions (RE)
Regular expressions are an algebraic description of finite state automata. They can define exactly the same languages that the various forms of automata describe: the regular languages.
Regular expressions (REs) offer a declarative way to express the strings we want to accept; FSAs do not.
REs serve as the input language for many systems that process strings:
Search commands such as UNIX grep (egrep, etc.) for finding strings, WWW browsers, text-formatting systems, etc. Search systems convert REs into FSAs (D-FSA or N-FSA).
Lexical-analyzer generators, such as LEX or FLEX.
Compilers and the language modeling system in a speech recognizer.
Grammar and spell checkers.
4
FSA, RE and Regular Languages
[Figure: regular expressions and finite automata both describe the regular languages.]
5
The Operators of Regular Expressions
Regular expressions denote languages. For example, 01*+10* denotes the language consisting of all strings that are either a single 0 followed by any number of 1's, {0, 01, 011, 0111, …}, or a single 1 followed by any number of 0's, {1, 10, 100, 1000, 10000, …}.
Operations on regular languages that regular expressions represent: let L, L1 and L2 be regular languages, with L = {0, 1}, L1 = {10, 001, 111} and L2 = {ε, 001}, where ε is the empty string. Then:
The union (or disjunction) of L1 and L2: L1 ∪ L2 = {ε, 10, 001, 111}
The concatenation: L1L2 = {xy | x ∈ L1, y ∈ L2} = {10, 001, 111, 10001, 001001, 111001}
The closure (or star, *, or Kleene closure) of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ … ∪ L^i ∪ …, where L^i is the set of strings formed by concatenating i strings from L.
6
Example
L = {0, 11}
L^0 = {ε}, independent of what language L is (ε is the empty string).
L^1 = L, which represents the choice of one string from L.
L^0 ∪ L^1 = {ε, 0, 11}
L^2 = {00, 011, 110, 1111}
L^3 = {000, 0011, 0110, 01111, 1100, 11011, 11110, 111111}
To compute L* we must compute L^i for each i ≥ 0 and take the union of all of them. Here L^i has 2^i members. The union of an infinite number of terms L^i is generally an infinite language, as it is in this example.
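The powers above can be reproduced mechanically. The following is a minimal Python sketch, assuming languages are represented as finite sets of strings and the empty string "" stands for ε; the helper names `concat` and `power` are illustrative and not part of the slides.

```python
def concat(A, B):
    """Concatenation of two languages given as sets of strings."""
    return {x + y for x in A for y in B}

def power(L, i):
    """L^i: concatenate i strings chosen from L; L^0 is {""} (i.e. {ε})."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

L = {"0", "11"}
for i in range(4):
    members = sorted(power(L, i), key=lambda s: (len(s), s))
    print(i, len(members), members)
```

Only a finite prefix of the union L^0 ∪ L^1 ∪ … can be enumerated this way, which is why L* itself has to be described symbolically.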
7
Example
Let L = {ε, 0, 00, 000, …}, the set of all strings consisting of only zeros (ε is the empty string). L is an infinite language.
L^0 = {ε}, independent of what language L is.
L^1 = L, which represents the choice of one string from L.
L^0 ∪ L^1 = {ε, 0, 00, 000, 0000, …} = L
L^2 = {ε, 0, 00, 000, 0000, …} = L
L^3 = L
L* = L^0 ∪ L^1 ∪ L^2 ∪ … = L
∅, the empty set, is one of only two languages whose closure is not infinite (the other is {ε}):
∅^0 = {ε}
∅^i = ∅ for every i ≥ 1
∅* = {ε}
8
Distinction of Star (*) and Closure (*) Operator
Star: Σ* forms all strings whose symbols are chosen from the alphabet Σ.
The closure operator * is essentially the same, with a subtle difference. Let L be the language containing the strings of length 1, such that for each symbol a in Σ there is a string a in L. Then Σ is a set of symbols while L is a set of strings, yet Σ* and L* denote the same language.
9
Building Regular Expressions
The algebra of regular expressions follows the pattern of classical algebra:
Constants and variables denote languages.
Operators: union, product (concatenation), and star/closure.
We define a regular expression E, and the language L(E) that it represents, recursively.
BASIS:
The constants ε and ∅ are regular expressions, denoting the languages L(ε) = {ε} and L(∅) = ∅ respectively.
If a is any symbol, then a is a regular expression, with L(a) = {a}.
Any variable, e.g., L, typically capitalized and italic, represents any language.
10
Building Regular Expressions
INDUCTION: If E and F are regular expressions, then:
E+F is a regular expression denoting their union: L(E+F) = L(E) ∪ L(F).
EF is a regular expression denoting their concatenation: L(EF) = L(E)L(F). A dot can optionally be used to denote the concatenation operator, on languages or in a regular expression: the regular expression 0.1 is the same as 01 and represents the language {01}.
E* is a regular expression denoting the closure of L(E): L(E*) = (L(E))*.
(E) is also a regular expression, denoting the same language as E: L((E)) = L(E).
11
Example
Develop a regular expression for the language consisting of the single string 01.
0 and 1 are expressions denoting the languages {0} and {1}. Concatenating the two expressions gives the regular expression 01 for the language {01}.
As a general rule, if we want a regular expression for the language consisting of only the string w, we use w itself as the regular expression.
Write a regular expression for the set of strings that consist of alternating 0’s and 1’s.
From the above we get (01)*.
Note 1: 01* ≠ (01)*
Note 2: L((01)*) is not exactly what we want, because it misses the strings that begin with 1 and/or end with 0. The complete expression is (01)*+(10)*+1(01)*+0(10)*, where the “+” operator indicates the union of the corresponding languages.
12
Example
Alternate solution: (ε+1)(01)*(ε+0), where ε is the empty string.
Note: L(ε+1) = L(ε) ∪ L(1) = {ε} ∪ {1} = {ε, 1}
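Two regular expressions for the alternating strings can be compared by brute force. The sketch below is only an illustration in Python's `re` syntax, where union is written `|` and an optional symbol is written `?`; the names `four_way`, `compact` and `alternating` are made up for this example. It checks both expressions against a direct definition of "alternating" on every binary string up to length 8.

```python
import re
from itertools import product

# (01)* + (10)* + 0(10)* + 1(01)* written in Python syntax.
four_way = re.compile(r"(01)*|(10)*|0(10)*|1(01)*")
# (ε+1)(01)*(ε+0): an optional leading 1 and an optional trailing 0.
compact = re.compile(r"1?(01)*0?")

def alternating(s):
    """True if no two neighbouring symbols of s are equal."""
    return all(a != b for a, b in zip(s, s[1:]))

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

mismatches = [
    s for s in binary_strings(8)
    if not (bool(four_way.fullmatch(s)) == bool(compact.fullmatch(s)) == alternating(s))
]
print(mismatches)
```

A bounded check like this is evidence rather than a proof, but it catches most mistakes in hand-written expressions quickly.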
13
Precedence of Regular Expression Operators
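1. The * operator has the highest precedence.
2. Concatenation (the dot operator) comes next.
3. The union (+) operator has the lowest precedence.
The order of operations can be controlled with the grouping operator “()”.
Example: by these rules 01*+1 is grouped as (0(1*))+1. Parentheses are needed to obtain a different grouping, such as (01)*+1 or 0(1*+1).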
14
Exercise Examples
Exercise 3.1.1: Write regular expressions for the following languages:
a) The set of strings over the alphabet {a, b, c} containing at least one a and at least one b.
A first attempt, aba*b*c*, covers only one arrangement of the symbols. Allowing arbitrary symbols before, between and after the required a and b, in either order, gives:
(a+b+c)*a(a+b+c)*b(a+b+c)* + (a+b+c)*b(a+b+c)*a(a+b+c)*
b) The set of strings of 0’s and 1’s whose tenth symbol from the right end is 1.
(0+1)*1(0+1)(0+1)…(0+1)(0+1), with nine copies of (0+1) after the 1.
c) The set of strings of 0’s and 1’s with at most one pair of consecutive 1’s.
(0+10)*(ε+1+11)(0+01)*, where ε is the empty string.
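Answers to exercises of this kind are easy to get subtly wrong, so it can help to test a candidate expression against a direct definition of the language. The sketch below is an assumption-laden illustration: the two candidate expressions are one possible answer each (written in Python `re` syntax), and the helper names are hypothetical.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

# Candidate for "at least one a and at least one b" over {a, b, c}.
has_a_and_b = re.compile(r"[abc]*a[abc]*b[abc]*|[abc]*b[abc]*a[abc]*")
bad_ab = [s for s in strings("abc", 6)
          if bool(has_a_and_b.fullmatch(s)) != ("a" in s and "b" in s)]

# Candidate for "at most one pair of consecutive 1's".
at_most_one_pair = re.compile(r"(0|10)*(1|11)?(0|01)*")

def pairs_of_ones(s):
    """Number of positions where a 1 is immediately followed by a 1."""
    return sum(1 for x, y in zip(s, s[1:]) if x == y == "1")

bad_11 = [s for s in strings("01", 10)
          if bool(at_most_one_pair.fullmatch(s)) != (pairs_of_ones(s) <= 1)]

print(bad_ab, bad_11)
```

Note that a string such as 111 counts as two overlapping pairs here, so it is rejected.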
15
Finite Automata and Regular Expressions
Regular expressions describe languages in a fundamentally different form from finite automata. However, both describe the same set of languages: the regular languages. To show this, one must prove two things:
1. Every language defined by one of these automata is also defined by a regular expression. For this proof it is enough to assume that the language is accepted by some D-FSA.
2. Every language defined by a regular expression is defined by one of these automata. For this proof it is enough to show that there is an N-FSA with ε-transitions accepting the same language.
16
Finite Automata and Regular Expressions
[Figure: plan for showing the equivalence of four different notations for regular languages: ε-NFSA, NFSA, DFSA and RE.]
17
Converting Regular Expressions to Automata
We can show that every language L that is L(R) for some regular expression R is also L(E) for some ε-NFSA E.
Start by showing how to construct automata for the basis expressions: single symbols, ε, and ∅.
Then show how to combine these automata into larger automata that accept the union, concatenation, or closure of the languages accepted by the smaller automata.
18
Converting Regular Expressions to Automata
Theorem: Every language defined by a regular expression is also defined by a finite automaton.
Proof: Suppose L = L(R) for a regular expression R. We will show that L = L(E) for some ε-NFSA E with:
1. Exactly one accepting state.
2. No arcs into the initial state.
3. No arcs out of the accepting state.
The proof is by structural induction on R, following the recursive definition of regular expressions.
19
Converting Regular Expressions to Automata
BASIS: There are three parts, one automaton for each basis expression.
(a) Expression ε: the language of the automaton is {ε}.
(b) Expression ∅: there is no path from the start state to the accepting state, thus ∅ is the language of the automaton.
(c) Expression a: the language of the automaton is L(a), which consists of the one string a.
20
Converting Regular Expressions to Automata
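INDUCTION: It is assumed that the statement of the theorem is true for the immediate sub-expressions of a given regular expression. There are three cases, one automaton construction per operator:
R+S: the automaton accepts L(R) ∪ L(S)
RS: the automaton accepts L(R)L(S)
R*: the automaton accepts L(R*)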
21
Example
Convert (0+1)*1(0+1) to an ε-NFSA. The automaton is built up in stages: first for (0+1), then for (0+1)*, and finally for (0+1)*1(0+1).
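The basis and induction steps can be turned into a small program. The following Python sketch is one possible rendering, not the construction drawn on the slides: an automaton is assumed to be a tuple (start state, accepting state, list of arcs), an arc label of None stands for ε, and all function names are illustrative. It builds an ε-NFSA for (0+1)*1(0+1) and compares it with Python's own `re` engine on all short binary strings.

```python
import re
from itertools import count, product

_fresh = count()  # supplies unused state numbers

def symbol(a):
    """Basis: automaton for the single symbol a."""
    s, f = next(_fresh), next(_fresh)
    return s, f, [(s, a, f)]

def union(r, t):
    """R+S: new start and accepting states joined to both operands by ε-arcs."""
    s, f = next(_fresh), next(_fresh)
    arcs = [(s, None, r[0]), (s, None, t[0]), (r[1], None, f), (t[1], None, f)]
    return s, f, r[2] + t[2] + arcs

def concat(r, t):
    """RS: an ε-arc from the accepting state of R to the start state of S."""
    return r[0], t[1], r[2] + t[2] + [(r[1], None, t[0])]

def star(r):
    """R*: ε-arcs allow skipping R entirely or repeating it."""
    s, f = next(_fresh), next(_fresh)
    arcs = [(s, None, r[0]), (r[1], None, f), (s, None, f), (r[1], None, r[0])]
    return s, f, r[2] + arcs

def eclose(states, arcs):
    """All states reachable from `states` using only ε-arcs."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for p, label, nxt in arcs:
            if p == q and label is None and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def accepts(nfa, w):
    """Simulate the ε-NFSA on the string w."""
    start, accept, arcs = nfa
    current = eclose({start}, arcs)
    for ch in w:
        moved = {nxt for p, label, nxt in arcs if p in current and label == ch}
        current = eclose(moved, arcs)
    return accept in current

def bit():
    # Build a new copy of (0+1) each time so fragments never share states.
    return union(symbol("0"), symbol("1"))

nfa = concat(concat(star(bit()), symbol("1")), bit())

words = ["".join(t) for n in range(8) for t in product("01", repeat=n)]
mismatches = [w for w in words
              if accepts(nfa, w) != bool(re.fullmatch(r"[01]*1[01]", w))]
print(accepts(nfa, "0110"), accepts(nfa, "0101"), mismatches)
```

Each constructor keeps the three properties used in the proof: exactly one accepting state, no arcs into the start state, and no arcs out of the accepting state.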
22
Applications of Regular Expressions
23
Lexical Analysis (lex, flex, yacc) http://dinosaur.compilertools.net/
Finding Patterns in Text
24
Regular Expressions
Formally, a regular expression is an algebraic notation for characterizing a set of strings. Thus regular expressions can be used to specify search strings as well as to define a language in a formal way.
Regular expression search requires:
A pattern that we want to search for, and
A corpus of text to search through.
When we give a search pattern, we will assume that the search engine returns the line of the document in which the pattern was found. This is what the UNIX grep command does.
We will underline the exact part of the text that matches the regular expression.
A search can be designed to return all matches to a regular expression or only the first match. We will show only the first match.
25
Basic Regular Expression Patterns
The simplest kind of regular expression is a sequence of simple characters: /woodchuck/, /Buttercup/, /!/
RE               Example Patterns Matched
/woodchucks/     “interesting links to woodchucks and lemurs”
/a/              “Mary Ann stopped by Mona’s”
/Claire says,/   “Dagmar, my gift please,” Claire says,”
/song/           “all our pretty songs”
/!/              “You’ve left the burglar behind again!” said Nori
26
Basic Regular Expression Patterns
Regular expressions are case sensitive: /s/ is distinct from /S/, so /woodchucks/ will not match “Woodchucks”.
Disjunction of characters: “[” and “]”.
RE                Match                     Example Pattern
/[wW]oodchuck/    Woodchuck or woodchuck    “Woodchuck”
/[abc]/           ‘a’, ‘b’, or ‘c’          “In uomini, in soldati”
/[1234567890]/    Any digit                 “plenty of 7 to 5”
27
Basic Regular Expression Patterns
Specifying a range in regular expressions: “-”
RE        Match                  Example Patterns Matched
/[A-Z]/   An uppercase letter    “we should call it ‘Drenched Blossoms’”
/[a-z]/   A lowercase letter     “my beans were impatient to be hoed!”
/[0-9]/   A single digit         “Chapter 1: Down the Rabbit Hole”
28
Basic Regular Expression Patterns
Negative specification, i.e. what a pattern cannot be: “^”
If the first symbol after the open square bracket “[” is “^”, the resulting pattern is negated. Example: /[^a]/ matches any single character (including special characters) except a.
RE         Match (single characters)    Example Patterns Matched
/[^A-Z]/   Not an uppercase letter      “Oyfn pripetchik”
/[^Ss]/    Neither ‘S’ nor ‘s’          “I have no exquisite reason for ’t”
/[^\.]/    Not a period                 “our resident Djinn”
/[e^]/     Either ‘e’ or ‘^’            “look up ^ now”
/a^b/      The pattern ‘a^b’            “look up a^b now”
29
Basic Regular Expression Patterns
How do we specify both woodchuck and woodchucks?
Optional character specification: /?/ means “the preceding character or nothing”.
RE              Match                     Example Patterns Matched
/woodchucks?/   woodchuck or woodchucks   “woodchuck”
/colou?r/       color or colour           “colour”
30
Basic Regular Expression Patterns
The question mark “?” can be thought of as “zero or one instances of the previous character”. It is a way to specify how many of something we want.
Sometimes we need to specify regular expressions that allow repetitions of things. For example, consider the language of (certain) sheep, which consists of strings that look like the following:
baa!
baaa!
baaaa!
baaaaa!
baaaaaa!
…
31
Basic Regular Expression Patterns
Any number of repetitions is specified by “*”, which means “zero or more occurrences of the immediately preceding character or regular expression”.
Examples:
/aa*/ - a followed by zero or more a’s
/[ab]*/ - zero or more a’s or b’s; this will match aaaa, abababa or bbbb
32
Basic Regular Expression Patterns
We know enough to specify part of our regular expression for prices: multiple digits.
Regular expression for an individual digit: /[0-9]/
Regular expression for an integer: /[0-9][0-9]*/
Why not just /[0-9]*/? Because that pattern also matches the empty string, i.e. zero digits.
Because it is annoying to repeat the same pattern just to say “at least once”, there is a special character for “one or more”: “+”.
The regular expression for an integer then becomes: /[0-9]+/
Regular expression for the sheep language, which needs at least two a’s: /baaa*!/ or /baa+!/
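The difference between “zero or more” and “one or more” is easy to observe in practice. The following short sketch uses Python's standard `re` module; the sample text and variable names are made up for illustration.

```python
import re

integer = re.compile(r"[0-9]+")             # one or more digits
integer_long = re.compile(r"[0-9][0-9]*")   # the same language without "+"

text = "Chapter 12 has 7 figures"
found = integer.findall(text)
found_long = integer_long.findall(text)

# /[0-9]*/ also matches the empty string, so a search succeeds
# at the very first position, before any digit has been seen.
empty_hit = re.search(r"[0-9]*", "abc 42").group()

print(found, found_long, repr(empty_hit))
```

The empty match is the reason /[0-9]*/ on its own is a poor pattern for integers.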
33
Basic Regular Expression Patterns
One very important special character is the period, /./, a wildcard expression that matches any single character (except a carriage return).
Example: find any line in which a particular word (for example Veton) appears twice: /Veton.*Veton/
RE        Match                              Example Patterns
/beg.n/   Any character between beg and n    begin, beg’n, begun
34
Repetition Metacharacters
All examples refer to the text “The aerial acceleration alerted the ace pilot”.
*       Matches any number of occurrences of the previous character: zero or more.
        /ac*e/ matches “ae”, “ace”, “acce”, “accce”
?       Matches at most one occurrence of the previous character: zero or one.
        /ac?e/ matches “ae” and “ace”
+       Matches one or more occurrences of the previous character.
        /ac+e/ matches “ace”, “acce”, “accce”
{n}     Matches exactly n occurrences of the previous character.
        /ac{2}e/ matches “acce”
{n,}    Matches n or more occurrences of the previous character.
        /ac{2,}e/ matches “acce”, “accce”, etc.
{n,m}   Matches from n to m occurrences of the previous character.
        /ac{2,4}e/ matches “acce”, “accce” and “acccce”
.       Matches one occurrence of any character of the alphabet except the newline character.
        /a.e/ matches aae, aAe, abe, aBe, a1e, etc.
.*      Matches any string of characters, up to the next newline character.
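The behaviour of each repetition metacharacter can be checked directly on the example sentence. This is a small sketch using Python's `re.findall`, which returns all non-overlapping matches from left to right; the dictionary name `matches` is illustrative.

```python
import re

sentence = "The aerial acceleration alerted the ace pilot"

patterns = [r"ac*e", r"ac?e", r"ac+e", r"ac{2}e", r"ac{2,}e", r"ac{2,4}e", r"a.e"]
matches = {p: re.findall(p, sentence) for p in patterns}

for pattern, hits in matches.items():
    print(f"/{pattern}/ -> {hits}")
```

In this particular sentence only “ae” (in aerial), “acce” (in acceleration), “ale” (in alerted) and “ace” actually occur; the longer strings listed for each operator are what the pattern would accept if they were present.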
35
Anchors
Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are:
“^” - matches the start of a line
“$” - matches the end of a line
Examples:
/^The/ - matches the word “The” only at the start of a line.
/⌴$/ - matches a space at the end of a line (“⌴” stands for the space character).
/^The dog\.$/ - matches a line that contains only the phrase “The dog.”
Three uses of “^”:
/^xyz/ - matches at the start of the line
[^xyz] - negation inside square brackets
/^/ - just to mean a caret
36
Anchors
/\b/ - matches a word boundary
/\B/ - matches a non-boundary
/\bthe\b/ - matches the word “the” but not the word “other”.
A word is defined as any sequence of digits, underscores or letters.
/\b99/ - will match the string 99 in “There are 99 bottles of beer on the wall” but NOT in “There are 299 bottles of beer on the wall”. It will also match the 99 in “$99”, since 99 follows a “$”, which is not a digit, underscore, or letter.
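The anchor examples can be tried out with Python's `re` module, whose `^`, `$` and `\b` behave as described here for single-line strings. The helper `found` and the sample sentences are illustrative only.

```python
import re

def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

the_words = re.findall(r"\bthe\b", "the other one, the end")

results = {
    "99 after a space": found(r"\b99", "There are 99 bottles of beer on the wall"),
    "99 inside 299": found(r"\b99", "There are 299 bottles of beer on the wall"),
    "99 after $": found(r"\b99", "It costs $99"),
    "whole line": found(r"^The dog\.$", "The dog."),
    "longer line": found(r"^The dog\.$", "The dog. And more"),
}
print(the_words, results)
```

The word “other” contributes no match because there is no word boundary between its “o” and “t”.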
37
Disjunction, Grouping and Precedence.
Suppose we need to search for texts about pets; specifically, we may be interested in cats and dogs. If we want to search for either the string “cat” or the string “dog”, we cannot use any of the constructs introduced so far (why not “[]”?).
The new operator that defines disjunction, also called the pipe symbol, is “|”.
/cat|dog/ - matches either the string cat or the string dog.
38
Grouping
In many instances it is necessary to group a sequence of characters so that it is treated as one unit.
Example: search for guppy and guppies: /gupp(y|ies)/
Grouping is useful in conjunction with the “*” operator, since /*/ applies to a single character and not to a whole sequence.
Example: match “Column 1 Column 2 Column 3 …”
/Column⌴[0-9]+⌴*/ - will match a single column, “Column # …”
/(Column⌴[0-9]+⌴*)*/ - will match “Column 1 Column 2 Column 3 …”
39
Operator Precedence Hierarchy
Operator Class           Operators (precedence from highest to lowest)
Parenthesis              ()
Counters                 * + ? {}
Sequences and anchors    ^ $
Disjunction              |
40
Simple Example
Problem statement: write an RE to find cases of the English article “the”.
/the/ - misses “The”
/[tT]he/ - also matches inside “amalthea”, “Bethesda”, “theology”, etc.
/\b[tT]he\b/ - is the correct RE
Problem statement: find “the” where it might also have underlines or numbers nearby (“The-”, “the_” or “the25”). We need to specify that we want instances in which there are no alphabetic letters on either side of “the”:
/[^a-zA-Z][tT]he[^a-zA-Z]/ - will not find “the” if it begins the line.
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/ - also finds “the” at the start of a line.
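The successive refinements can be compared on a few test strings. The sketch below uses Python's `re`; note that the last pattern adds an alternative `([^a-zA-Z]|$)` at the right-hand side so that “the” at the very end of a line is also found, which is an extension assumed here and not part of the patterns above.

```python
import re

naive = re.compile(r"[tT]he")
word = re.compile(r"\b[tT]he\b")
# Assumed extension: also accept the end of the line after "the".
no_letters = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

samples = ["The end", "Bethesda", "the_ flag", "the25", "I saw the"]
table = {
    s: (bool(naive.search(s)), bool(word.search(s)), bool(no_letters.search(s)))
    for s in samples
}
for s, row in table.items():
    print(f"{s!r:12} naive={row[0]} word={row[1]} no_letters={row[2]}")
```

The `\b` version rejects “the_” and “the25” because underscores and digits count as word characters, which is exactly the case the last pattern is meant to cover.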
41
A More Complex Example
Problem statement: build an application to help a user purchase a computer on the Web. The user might want “any PC with more than 1000 MHz and 80 Gb of disk space for less than $1000”.
To solve the problem we must be able to match expressions like 1000 MHz, 1 GHz and 80 Gb, as well as dollar amounts.
42
Solution – Dollar Amounts
Complete regular expression for prices in full dollar amounts: /$[0-9]+/
Adding fractions of dollars: /$[0-9]+\.[0-9][0-9]/ or /$[0-9]+\.[0-9]{2}/
Problem: this RE will match only “$199.99” and not “$199”. To solve this we must make the cents optional and make sure the dollar amount is a word:
/\b$[0-9]+(\.[0-9][0-9])?\b/
Note that in most practical RE engines “$” is also the end-of-line anchor, so a literal dollar sign has to be written as “\$”.
43
Solution: Processor Speed
Processor speed (in megahertz = MHz or gigahertz = GHz):
/\b[0-9]+⌴*(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
⌴* is used to denote “zero or more spaces”.
44
Solution: Disk Space
Dealing with disk space (Gb = gigabytes) and memory size (Mb = megabytes); disk sizes must allow optional fractions:
/\b[0-9]+⌴*(M[Bb]|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)?⌴*(G[Bb]|[Gg]igabytes?)\b/
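The price, speed and disk patterns can be exercised on a sample advertisement. This is a hedged Python sketch: the dollar sign is escaped as `\$` because `$` is an anchor in Python's `re`, and the leading `\b` of the price pattern is dropped because no word boundary exists between a space and “$”. Both adjustments, the sample text, and the variable names are assumptions made for this illustration.

```python
import re

price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
speed = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(G[Bb]|[Gg]igabytes?)\b")

ad = "PC with 1000 MHz and 80 Gb of disk space for $999.99"

found = {
    "price": price.search(ad).group(0),
    "speed": speed.search(ad).group(0),
    "disk": disk.search(ad).group(0),
}
whole_dollars = price.search("on sale for $199 today").group(0)
print(found, whole_dollars)
```

A real shopping application would still need to convert the matched text into numbers and units before comparing values such as “more than 1000 MHz”.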
45
Solution: Operating Systems and Vendors
/\b((Windows)+⌴*(XP|Vista)?)\b/
/\b(Mac|Macintosh|Apple)\b/
46
Advanced Operators
Aliases for common sets of characters:
RE   Expansion        Match                            Example Patterns
\d   [0-9]            Any digit                        “Party of 5”
\D   [^0-9]           Any non-digit                    “Blue moon”
\w   [a-zA-Z0-9_]     Any alphanumeric or underscore   Daiyu
\W   [^\w]            A non-alphanumeric               !!!!
\s   [⌴\r\t\n\f]      Whitespace (space, tab)          “ ”
\S   [^\s]            Non-whitespace                   “in Concord”
47
Literal Matching of Special Characters & “\” Characters
Some characters that need to be backslashed (“\”):
RE   Match                          Example Patterns
\*   An asterisk “*”                “K*A*P*L*A*N”
\.   A period “.”                   “Dr. Këpuska, I presume”
\?   A question mark “?”            “Would you like to light my candle?”
\n   A newline
\t   A tab
\r   A carriage return character
48
Regular Expression Substitution, Memory, and ELIZA
Substitutions are an important use of regular expressions.
s/regexp1/regexp2/ - allows a string characterized by one regular expression (regexp1) to be replaced by a string characterized by a second regular expression (regexp2). Example: s/colour/color/
It is also important to be able to refer to a particular subpart of the string matching the first pattern.
Example: replace “the 35 boxes” with “the <35> boxes”:
s/([0-9]+)/<\1>/ - “\1” refers to the text matched by the first parenthesized part of the first regular expression.
49
Regular Expression Substitution, Memory, and ELIZA
The parenthesis and number operators can also be used to specify that a certain string or expression must occur twice in the text.
Example: “the Xer they were, the Xer they will be”. We want to constrain the two X’s to be the same string:
/[Tt]he (.*)er they were, the \1er they will be/
This RE will match “The bigger they were, the bigger they will be”, but not “The bigger they were, the faster they will be”.
The number operator can be used with other numbers: if you match two different sets of parentheses, \2 means whatever matched the second set.
/[Tt]he (.*)er they (.*), the \1er they \2/
This RE will match “The bigger they were, the bigger they were”, but not “The bigger they were, the bigger they will be”.
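Substitution with memory and back-references inside a pattern can both be tried with Python's `re` module, where `re.sub` plays the role of s/regexp1/regexp2/. The variable names below are illustrative.

```python
import re

# s/([0-9]+)/<\1>/ : wrap the matched number in angle brackets.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# \1 inside the pattern forces the second X to repeat the first.
same_x = re.compile(r"[Tt]he (.*)er they were, the \1er they will be")
hit = same_x.search("The bigger they were, the bigger they will be")
miss = same_x.search("The bigger they were, the faster they will be")

# Two registers: \1 and \2 must both repeat.
two_groups = re.compile(r"[Tt]he (.*)er they (.*), the \1er they \2")
hit2 = two_groups.search("The bigger they were, the bigger they were")
miss2 = two_groups.search("The bigger they were, the bigger they will be")

print(bracketed, hit.group(1), miss, hit2.groups(), miss2)
```

Patterns with back-references go beyond the regular languages in the formal sense, which is why they are treated as an extension of the basic operators.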
50
Registers
Numbered memories are called registers: \1 is register 1, \2 is register 2, and so on.
51
ELIZA
Substitutions using memory are very useful in implementing simple natural-language understanding programs like ELIZA. Here is an example of a dialog with ELIZA:
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED.
52
ELIZA
ELIZA worked by having a cascade of regular expression substitutions, each of which matched some part of the input lines and changed them.
The first substitutions changed all instances of:
“my” ⇨ “YOUR”
“I’m” ⇨ “YOU ARE”
The next set of substitutions looked for relevant patterns in the input and created an appropriate output:
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* ALL .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
53
ELIZA
Since multiple substitutions could apply to a given input, substitutions were assigned a rank and were applied in order. Creation of such patterns is addressed in Exercise 2.2.
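A ranked cascade of this kind can be sketched in a few lines of Python. This is only an illustration in the spirit of the substitutions described for ELIZA, not the original program: the rule list, the function name `respond`, the case-insensitive matching, and the final upper-casing of the reply are all assumptions made here.

```python
import re

# Ranked response rules, tried in order; the first one that matches wins.
RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(line):
    """Apply the pronoun substitutions, then the first matching rule."""
    line = re.sub(r"\bmy\b", "YOUR", line)
    line = re.sub(r"\bI'm\b", "YOU ARE", line)
    for pattern, template in RULES:
        match = re.search(pattern, line, flags=re.IGNORECASE)
        if match:
            # Fill \1 from the match and upper-case the whole reply.
            return match.expand(template).upper()
    return None  # no rule applies

for user_line in [
    "Men are all alike.",
    "They're always bugging us about something or other.",
    "He says I'm depressed much of the time.",
]:
    print(respond(user_line))
```

Because the rules are tried in list order, moving a rule up or down changes which reply is produced when several patterns match the same input.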
54
Algebraic Laws for Regular Expressions
55
Algebraic Laws for Regular Expressions
Algebraic laws are a collection of laws that define when two regular expressions are equivalent. Regular expressions have a number of laws similar to the laws of arithmetic:
Commutativity: x+y = y+x. Switching the order of the operands does not change the result.
Associativity: (xy)z = x(yz). The operands can be regrouped when the operator is applied twice.
56
Associativity and Commutativity
For languages L, M and N (defined by regular expressions or, equivalently, by FSAs):
Commutative law for union: L+M = M+L
Associative law for union: (L+M)+N = L+(M+N)
Associative law for concatenation: (LM)N = L(MN)
57
Identities and Annihilators
Arithmetic
Identity: 0 is the identity for addition: 0+x = x+0 = x. 1 is the identity for multiplication: 1x = x1 = x.
Annihilator: 0 is the annihilator for multiplication: 0x = x0 = 0.
Regular Expressions
Identity for union: ∅+L = L+∅ = L
Identity for concatenation: εL = Lε = L
Annihilator for concatenation: ∅L = L∅ = ∅
These laws are important in the simplification of regular expressions.
58
Distributive Laws Regular Expressions
Arithmetic
A distributive law involves two operators. The most common is the distributive law of multiplication over addition: x(y+z) = xy + xz
Regular Expressions
Left distributive law of concatenation over union: L(M+N) = LM + LN
Right distributive law of concatenation over union: (M+N)L = ML + NL
59
Distributive Laws
Theorem: If L, M, and N are any languages, then L(M ∪ N) = LM ∪ LN.
Proof: We show that a string w is in L(M ∪ N) if and only if it is in LM ∪ LN.
(Only-if) If w is in L(M ∪ N), then w = xy, where x is in L and y is in M ∪ N, so y is in M or in N.
If y is in M, then w = xy is in LM, and therefore in LM ∪ LN.
If y is in N, then w = xy is in LN, and therefore in LM ∪ LN.
(If) If w is in LM ∪ LN, then w is in LM or in LN.
If w is in LM, then w = xy with x in L and y in M. Since y is then also in M ∪ N, w is in L(M ∪ N).
If w is in LN, then w = xy with x in L and y in N. Since y is then also in M ∪ N, w is in L(M ∪ N).
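The laws can be spot-checked on small finite languages. The following Python sketch assumes languages are finite sets of strings, with "" standing for ε and the empty set standing for ∅; the example languages and the names `concat`, `EMPTY` and `EPS` are chosen arbitrarily for illustration. A check on one example does not replace the proof, but it makes the statement concrete.

```python
def concat(A, B):
    """Concatenation of two languages given as sets of strings."""
    return {x + y for x in A for y in B}

L, M, N = {"a", "ab"}, {"", "b"}, {"b", "ba"}
EMPTY, EPS = set(), {""}

checks = {
    "left distributive": concat(L, M | N) == concat(L, M) | concat(L, N),
    "right distributive": concat(M | N, L) == concat(M, L) | concat(N, L),
    "union identity": (EMPTY | L) == L,
    "concatenation identity": concat(EPS, L) == L == concat(L, EPS),
    "annihilator": concat(EMPTY, L) == EMPTY == concat(L, EMPTY),
    "idempotent union": (L | L) == L,
}
print(checks)
```

Set union `|` plays the role of the + operator here, and the annihilator check shows why concatenating with ∅ always yields ∅: there is no string y to pair with any x.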
60
The Idempotent Law
Arithmetic: the common arithmetic operators are not idempotent: in general x+x ≠ x and xx ≠ x.
Regular Expressions: union is idempotent: L+L = L
61
Laws Involving Closures
(L*)* = L*: closing an expression that is already closed does not change the language.
∅* = ε: the closure of ∅ contains only the string ε.
ε* = ε
L⁺ = LL* = L*L, which follows from:
L⁺ = L + LL + LLL + …
L* = ε + L + LL + LLL + …
LL* = Lε + LL + LLL + LLLL + … = L + LL + LLL + … = L⁺, using Lε = εL = L
L* = L⁺ + ε
L? = ε + L
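Because L* is usually infinite, the closure laws can only be checked on a truncated closure. The Python sketch below assumes a length bound n and compares all strings of length at most n; the helpers `star_upto` and `upto` and the example language are illustrative assumptions.

```python
def concat(A, B):
    """Concatenation of two languages given as sets of strings."""
    return {x + y for x in A for y in B}

def upto(A, n):
    """The strings of A whose length is at most n."""
    return {w for w in A if len(w) <= n}

def star_upto(L, n):
    """All strings of L* of length at most n ("" stands for ε)."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in L if len(x + y) <= n} - result
        result |= frontier
    return result

L, n = {"0", "11"}, 8
star = star_upto(L, n)
plus = upto(concat(L, star), n)  # L⁺ = LL*, truncated to length n

checks = {
    "LL* = L*L": plus == upto(concat(star, L), n),
    "L* = L+ + ε": star == plus | {""},
    "(L*)* = L*": star_upto(star, n) == star,
    "∅* = ε": star_upto(set(), n) == {""},
    "ε* = ε": star_upto({""}, n) == {""},
}
print(checks)
```

The loop in `star_upto` stops once no new string within the length bound can be formed, which also covers the cases L = ∅ and L = {ε}.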
62
End