Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia 2014 2015.


Lexical Analysis: The lexical analyzer is the first phase of a compiler. It reads the source program from left to right, one character at a time, and generates the sequence of tokens that the parser uses for syntax analysis. Each token is a single logical unit, such as an identifier, keyword, operator, or punctuation mark. The role of the lexical analyzer in the process of compilation is shown in the figure. Figure: Interaction of Lexical Analyzer with Parser

Functions of the lexical analyzer:
1. It reads the source text.
2. It produces a stream of tokens.
3. It eliminates comments and white space (blanks, tabs, and newlines).
4. It builds the symbol table, which stores information about the identifiers and constants encountered in the input.
5. It keeps track of line numbers.
6. It reports the errors encountered while generating the tokens.

Lexical Terminologies: Some terms are used throughout any discussion of lexical analysis:
• Tokens: A token describes the class or category of an input string. A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming language. Examples of lexical tokens are identifiers, keywords, constants, literal strings (any characters between “ and ”), and punctuation symbols such as parentheses, commas, and semicolons.

• Patterns: The set of rules that describe a token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*
• Lexemes: A sequence of characters in the source program that matches the pattern of a token. Example: in const pi = 3.1416 the substring pi is a lexeme for the token "identifier".
• Symbol table: A table with two fields, a name field and an information field. It is generally used to store information about various source language constructs. The information is collected by the analysis phase of the compiler and used by the synthesis phase to generate the target code.
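The relationship between patterns, tokens, and lexemes can be sketched with a small regex-driven scanner. This is a minimal illustration, not the lecture's implementation: the token class names, the keyword list, and the function `tokenize` are all assumptions chosen for the example.

```python
import re

# Hypothetical token classes for illustration; not a complete Pascal lexer.
# Order matters: earlier patterns win when several match at the same position.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+(?:\.\d+)?"),
    ("KEYWORD",    r"\b(?:const|var|begin|end)\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),   # letter (letter | digit)*
    ("OPERATOR",   r"[=+\-*/]"),
    ("PUNCT",      r"[();,]"),
    ("NEWLINE",    r"\n"),
    ("SKIP",       r"[ \t]+"),
    ("ERROR",      r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme, line_number) triples."""
    tokens = []
    line = 1
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "NEWLINE":
            line += 1                      # keep track of line numbers
        elif kind == "SKIP":
            continue                       # eliminate blanks and tabs
        elif kind == "ERROR":
            raise SyntaxError(f"unexpected character {lexeme!r} on line {line}")
        else:
            tokens.append((kind, lexeme, line))
    return tokens


for token in tokenize("const pi = 3.1416"):
    print(token)
```

Here pi is reported as the lexeme of an IDENTIFIER token, matching the example above. A real lexical analyzer would also enter identifiers and constants into the symbol table.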

Specification of Tokens: Regular expressions are used to specify tokens. When an input string matches the regular expression for a pattern, the corresponding token is recognized. We first review the fundamental concepts of languages.
Strings and Language: A string is a finite sequence of symbols (letters) drawn from an alphabet. Strings are also called words. The length of a string S is denoted by |S|. The empty string is denoted by ε. The empty set of strings is denoted by Ø.
Operations on Language: A language is a collection of strings. Various operations can be performed on languages:
• Union of two languages L1 and L2: L1 ∪ L2 = {set of strings that are in L1 or in L2}

• Concatenation of two languages L1 and L2: L1.L2 = {set of strings in L1 followed by set of strings in L2}
• Exponentiation: L^2 = LL
• Kleene closure of L: L* denotes zero or more concatenations of L
• Positive closure of L: L+ denotes one or more concatenations of L
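The language operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the function names are assumptions, and because L* and L+ are infinite for any L containing a non-empty string, the closures are cut off at a chosen maximum power.

```python
def union(l1, l2):
    """Strings that are in l1 or in l2."""
    return l1 | l2


def concat(l1, l2):
    """Every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def power(lang, n):
    """L^n: n-fold concatenation of lang with itself; L^0 = {empty string}."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def kleene(lang, max_power):
    """Finite approximation of L*: union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(lang, i)
    return result


def positive(lang, max_power):
    """Finite approximation of L+: union of L^1 .. L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(lang, i)
    return result


L1 = {"a", "b"}
L2 = {"0", "1"}
print(sorted(union(L1, L2)))
print(sorted(concat(L1, L2)))
print(sorted(power(L1, 2)))
print(sorted(kleene(L1, 2)))
```

The only difference between the two closures is that `kleene` includes L^0, so the empty string is always in L* but is in L+ only if it is already in L.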

Tokens are specified by patterns, called regular expressions. For example, the regular expression [a-z][a-zA-Z0-9]* recognizes all identifiers of at least one character whose first character is a lower-case letter and whose remaining characters are alphanumeric. Another example is identifier → letter (letter | digit)*, which describes identifiers that begin with a letter.
Regular expression review: We assume that you are well acquainted with regular expressions and that all this is old news to you.
Symbol: an abstract entity that we shall not define formally. Letters, digits, and punctuation are examples of symbols.
Alphabet: a finite set of symbols out of which we build larger structures. An alphabet is typically denoted using the Greek sigma Σ, e.g., Σ = {0,1}.
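The two identifier patterns can be checked directly with Python's `re` module. In this sketch the helper names are assumptions, and "letter" and "digit" are taken to mean ASCII letters and digits, which the lecture does not state explicitly.

```python
import re

# [a-z][a-zA-Z0-9]* : first character must be a lower-case letter.
LOWER_FIRST = re.compile(r"[a-z][a-zA-Z0-9]*")

# letter (letter | digit)* : first character may be any ASCII letter.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_lower_first_identifier(text):
    # fullmatch requires the whole string to match, not just a prefix.
    return LOWER_FIRST.fullmatch(text) is not None


def is_identifier(text):
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["pi", "Pi", "x1", "1x", ""]:
    print(repr(candidate),
          is_lower_first_identifier(candidate),
          is_identifier(candidate))
```

Using `fullmatch` rather than `match` matters here: `match` would accept "x1!" because the prefix "x1" fits the pattern.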

String: a finite sequence of symbols from a particular alphabet, juxtaposed. For example, a, b, and c are symbols and abcb is a string.
Empty string: denoted ε (or sometimes λ), the string consisting of zero symbols.
Formal language: a set of strings over an alphabet; Σ* denotes the set of all possible strings that can be generated from the alphabet Σ.
Regular expressions: rules that define exactly the set of words that are valid tokens in a formal language. The rules are built up from three operators:
concatenation   xy    x followed by y
alternation     x|y   x or y
repetition      x*    x repeated 0 or more times

Formally, the set of regular expressions can be defined by the following recursive rules:
1) Every symbol of Σ is a regular expression.
2) ε is a regular expression.
3) If r1 and r2 are regular expressions, so are (r1), r1r2, r1 | r2, and r1*.
4) Given an alphabet Σ, the regular expressions over Σ and their corresponding regular languages are:
a) Ø denotes Ø; ε, the empty string, denotes the language {ε}.
b) For each a in Σ, a denotes {a}, a language with one string.
c) If R denotes L_R and S denotes L_S, then R | S denotes the language L_R ∪ L_S, i.e., { x | x ∈ L_R or x ∈ L_S }.
d) If R denotes L_R and S denotes L_S, then RS denotes the language L_R L_S, that is, { xy | x ∈ L_R and y ∈ L_S }.
e) If R denotes L_R, then R* denotes the language L_R*, where L* is the union of all L^i (i = 0, ..., ∞) and L^i is just { x1x2...xi | x1 ∈ L, ..., xi ∈ L }.
f) If R denotes L_R, then (R) denotes the same language L_R.
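Rules a) to f) can be read as a recursive function from a regular expression to the language it denotes. The sketch below makes two assumptions that are not part of the lecture: regular expressions are encoded as nested Python tuples, and the result is restricted to strings of at most `max_len` characters so that R* stays finite.

```python
def language(r, max_len):
    """Strings of length <= max_len in the language denoted by r.

    r is a nested tuple (an encoding assumed for this sketch):
      ("empty",)      the empty set
      ("eps",)        the empty string
      ("sym", a)      a single symbol a
      ("alt", R, S)   R | S
      ("cat", R, S)   RS
      ("star", R)     R*
    Parentheses need no case of their own: (R) denotes the same language as R.
    """
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":
        left = language(r[1], max_len)
        right = language(r[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        base = language(r[1], max_len)
        result = {""}                      # L^0 = {empty string}
        frontier = {""}
        while frontier:                    # add L^1, L^2, ... until nothing new fits
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regular expression node: {kind!r}")


# a(a|b)* : strings over {a, b} that start with a
R = ("cat", ("sym", "a"), ("star", ("alt", ("sym", "a"), ("sym", "b"))))
print(sorted(language(R, 2)))
```

The length bound is purely a device for printing a finite set; the denoted language itself is infinite whenever a starred subexpression contains a non-empty string.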

Transition Diagrams: One way to begin the design of any program is to describe the behavior of the program by a flowchart. Remembering previously seen characters by the position in a flowchart is a valuable tool, so a specialized kind of flowchart for lexical analyzers, called a transition diagram, has evolved.
• A transition diagram is a flowchart with states and edges; each edge is labeled with characters, and a certain subset of the states are marked as "final states".
• Transition from state to state proceeds along edges according to the next input character.
• Every string that ends up at a final state is accepted.
• If we get "stuck", i.e., there is no transition for a given character, it is an error.
• Transition diagrams can be easily translated to programs using case statements (in C).
Finite automata review: Once we have all our tokens defined using regular expressions, we can create a finite automaton for recognizing them. To review, a finite automaton has three components: a finite set of states, an alphabet of input symbols, and a finite set of transitions.
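As an illustration of translating a transition diagram into a program, the sketch below hand-codes a two-state diagram for the identifier pattern letter (letter | digit)*. The state names and the function are assumptions made for this example, and Python's if/elif chain stands in for the C case statement mentioned above.

```python
START, IN_ID = 0, 1          # states of the (hypothetical) transition diagram
FINAL_STATES = {IN_ID}


def is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()


def is_ascii_digit(ch):
    return ch.isascii() and ch.isdigit()


def accepts_identifier(text):
    """Follow the diagram one character at a time; reject when stuck."""
    state = START
    for ch in text:
        if state == START:
            if is_ascii_letter(ch):
                state = IN_ID            # edge START --letter--> IN_ID
            else:
                return False             # stuck: no edge for this character
        elif state == IN_ID:
            if is_ascii_letter(ch) or is_ascii_digit(ch):
                state = IN_ID            # self-loop on letter or digit
            else:
                return False             # stuck: no edge for this character
    return state in FINAL_STATES         # accept only if we end in a final state


for candidate in ["pi", "x1", "1x", "a_b", ""]:
    print(repr(candidate), accepts_identifier(candidate))
```

The current state is the only memory the recognizer has, which is exactly the sense in which position in the flowchart "remembers" the characters already read.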

Finite automata are similar to transition diagrams: they have states and labeled edges, with one unique start state and one or more final states.
Nondeterministic Finite Automata (NFA): a) ε can label edges (these edges are called ε-transitions). b) A character can label two or more edges out of the same state.
Deterministic Finite Automata (DFA): a) No edges are labeled with ε. b) Each character can label at most one edge out of the same state.
An NFA or DFA accepts a string x if there exists a path from the start state to a final state whose edge labels spell out x. In an NFA there may be multiple such paths; in a DFA there is at most one path for any given string.

Example: NFA. An NFA accepts an input string if there is some sequence of moves that leads to a final state. Input string: aabb. Figure: one successful sequence of moves and one unsuccessful sequence.
Example: DFA. There is only one possible sequence of moves: either it leads to a final state and the input string is accepted, or the input string is rejected. Input string: aabb. Figure: the successful sequence of moves.

Transition Table: Finite automata can also be represented using transition tables.

For an NFA, each entry is a set of states:

STATE   a       b
0       {0,1}   {0}
1       -       {2}
2       -       {3}
3       -       -

For a DFA, each entry is a unique state:

STATE   a   b
0       1   0
1       1   2
2       1   3
3       1   0
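The two transition tables can be run directly as table-driven recognizers. The sketch below copies the tables into Python dictionaries; the transcript does not say which states are the start and final states, so treating 0 as the start state and 3 as the only final state is an assumption of this example.

```python
# NFA table: each entry is a set of states; a missing entry means "-".
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}

# DFA table: each entry is a single state.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}

START = 0
FINAL = {3}   # assumption: state 3 is the only final state


def nfa_accepts(text):
    """Track every state the NFA could be in after each character."""
    current = {START}
    for ch in text:
        current = {t for s in current for t in NFA[s].get(ch, set())}
    return bool(current & FINAL)


def dfa_run(text):
    """Return (sequence of states visited, accepted?) for the DFA."""
    state = START
    path = [state]
    for ch in text:
        state = DFA[state].get(ch)
        if state is None:          # no transition: the automaton is stuck
            return path, False
        path.append(state)
    return path, state in FINAL


print(nfa_accepts("aabb"))
print(dfa_run("aabb"))
```

Under these assumptions both automata accept aabb. The DFA does so by the single state sequence 0, 1, 1, 2, 3, while the NFA simulation has to carry a set of possible states because several paths exist for the same input.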

NFA with ε-transitions: An NFA can have ε-transitions, i.e., edges labeled with ε. Figure: an NFA with ε-transitions that accepts the regular language denoted by (aa*|bb*)

Components of a finite automaton:
1) A finite set of states, one of which is designated the initial state or start state, and some (maybe none) of which are designated as final states.
2) An alphabet Σ of possible input symbols.
3) A finite set of transitions that specifies, for each state and for each symbol of the input alphabet, which state to go to next.

Minimization of DFAs: Given a DFA D over the alphabet Σ with states S, where F is the set of accepting states, we construct a minimal DFA Dmin in which each state is a group of states from D. A group G is consistent if, for every input symbol, all states in G that have a transition on that symbol go to states in the same group, and either all or none of the states in G have such a transition. We minimize the DFA D in the following way:
1) We start with two groups: the set of accepting states F and the set of nonaccepting states S \ F. These are unmarked.
2) We pick any unmarked group G and check if it is consistent. If it is, we mark it. If G is not consistent, we split it into maximal consistent subgroups and replace G by these. All groups are then unmarked. A consistent subgroup is maximal if adding any other state to it will make it inconsistent.
3) If there are no unmarked groups left, we are done and the remaining groups are the states of the minimal DFA. Otherwise, we go back to step 2.
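The minimization steps can be sketched as a partition-refinement loop. This is a variation on the procedure above rather than a literal transcription: instead of marking groups one at a time, it re-splits every group on each pass and stops when a pass produces no new split. The function name, the dictionary encoding of transitions, and the example DFA are all hypothetical, and the sketch assumes every state of the DFA can reach an accepting state.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the groups of equivalent states of a DFA.

    delta maps (state, symbol) -> state; a missing key means no transition.
    """
    # Step 1: accepting and nonaccepting states (empty groups are dropped).
    groups = [g for g in (frozenset(accepting), frozenset(states - accepting)) if g]
    while True:
        index = {s: i for i, g in enumerate(groups) for s in g}
        new_groups = []
        for group in groups:
            # States with the same signature go to the same groups on every
            # symbol, so each bucket is a maximal consistent subgroup.
            buckets = {}
            for s in group:
                signature = tuple(
                    index[delta[(s, c)]] if (s, c) in delta else None
                    for c in alphabet
                )
                buckets.setdefault(signature, set()).add(s)
            new_groups.extend(frozenset(b) for b in buckets.values())
        if len(new_groups) == len(groups):   # no group was split: done
            return set(new_groups)
        groups = new_groups


# Hypothetical DFA over {a, b} accepting strings that end in "ab".
STATES = {0, 1, 2, 3, 4}
DELTA = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 3,
    (2, "a"): 1, (2, "b"): 2,
    (3, "a"): 1, (3, "b"): 4,
    (4, "a"): 1, (4, "b"): 2,
}
print(minimize(STATES, "ab", DELTA, accepting={3}))
```

For this DFA the nonaccepting group {0, 1, 2, 4} is split once, because state 1 moves to the accepting group on b while the others do not, leaving three groups: {3}, {1}, and {0, 2, 4}.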

Example: minimize the following DFA. Figure: Non-minimal DFA

As an example of minimization, take the DFA in the figure above. We make the initial division into two groups, the accepting and the nonaccepting states:

G1 = {0,6}
G2 = {1,2,3,4,5,7}

These are both unmarked. We next pick any unmarked group, say G1. To check if it is consistent, we make a table of its transitions:

G1   a    b
0    G2   -
6    G2   -

This is consistent, so we just mark it, select the remaining unmarked group G2, and make a table for it.

G2 is evidently not consistent, so we split it into maximal consistent subgroups and erase all marks (including the one on G1):

G1 = {0,6}
G3 = {1,2,5}
G4 = {3}
G5 = {4,7}

We now pick G3 for consideration:

G3   a    b
1    G5   G3
2    G4   G3
5    G5   G3

This is not consistent either, so we split again and get:

G1 = {0,6}
G4 = {3}
G5 = {4,7}
G6 = {1,5}
G7 = {2}

Figure: minimal DFA

Converting an NFA to a DFA: We now show how NFAs can be converted to DFAs so that, by combining this with the conversion of regular expressions to NFAs, we can convert any regular expression to a DFA.

The algorithm is:
1) The starting state of the DFA is the epsilon-closure of the set containing just the starting state of the NFA, i.e., the starting state together with all states reachable from it by epsilon-transitions.
2) A transition in the DFA on a symbol is found by taking the set of NFA states that make up the DFA state, following all NFA transitions on that symbol from all of these states, combining the resulting sets of states, and closing the result under epsilon-transitions.

The set S' of states in the DFA is the set of DFA states that can be reached from the DFA's starting state s'0 using the move function, i.e., the transitions constructed as described above. A state in the DFA is an accepting state if at least one of the NFA states it contains is accepting.
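The conversion described above (often called the subset construction) can be sketched as a worklist algorithm over sets of NFA states. The dictionary encoding, the use of the empty string as the label of ε-transitions, and the state numbering of the example NFA for (aa*|bb*) are assumptions of this sketch.

```python
EPS = ""   # label used for epsilon-transitions (an assumption of this sketch)


def eps_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(EPS, set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Return (start state, transition table, accepting states) of the DFA."""
    start_state = eps_closure(nfa, {start})
    dfa = {}
    worklist = [start_state]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue                      # already processed
        dfa[current] = {}
        for c in alphabet:
            # Follow all transitions on c from every NFA state in `current`.
            moved = {t for s in current for t in nfa.get(s, {}).get(c, set())}
            if not moved:
                continue                  # no transition on c
            target = eps_closure(nfa, moved)
            dfa[current][c] = target
            worklist.append(target)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {state for state in dfa if state & accepting}
    return start_state, dfa, dfa_accepting


def dfa_accepts(start_state, dfa, dfa_accepting, text):
    state = start_state
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:
            return False
    return state in dfa_accepting


# Hypothetical NFA with epsilon-transitions for (aa*|bb*).
NFA = {
    0: {EPS: {1, 3}},
    1: {"a": {2}},
    2: {"a": {2}},
    3: {"b": {4}},
    4: {"b": {4}},
}
start_state, dfa, dfa_accepting = nfa_to_dfa(NFA, 0, {2, 4}, "ab")
for state, edges in dfa.items():
    print(sorted(state), {c: sorted(t) for c, t in edges.items()})
```

For this NFA the construction yields three DFA states: the start state {0, 1, 3} and the two accepting states {2} and {4}. Only reachable sets are ever created, which is why the result is usually far smaller than the full power set of NFA states.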
