Compiler verification for fun and profit (a week from now, 8:55am in Salle Polyvalente) Xavier Leroy, Inria Paris–Rocquencourt …Compilers are, however, vulnerable to miscompilation: bugs in the compiler that cause incorrect code to be generated from correct source code, possibly invalidating the guarantees so painfully obtained by source-level formal verification. Recent experimental studies show that many widely used production-quality compilers suffer from miscompilation. The formal verification of compilers and related code generators is a radical, mathematically grounded answer to the miscompilation issue. By applying formal verification (typically, interactive theorem proving) to the compiler itself, it is possible to guarantee that the compiler preserves the semantics of the source programs it transforms, or at least preserves the properties of interest that were formally verified over the source programs…

Exercise: Unary Minus
1) Show that the grammar
   A ::= - A
   A ::= A - id
   A ::= id
is ambiguous by finding a string that has two different parse trees. Show those parse trees.
2) Make two different unambiguous grammars for the same language:
   a) one where prefix minus binds stronger than infix minus;
   b) one where infix minus binds stronger than prefix minus.
3) Show the syntax trees using the new grammars for the string you used to prove the original grammar ambiguous.
4) Give a regular expression describing the same language.

Unary Minus: Solution Sketch
1) An example of a string with two parse trees is: - id - id
The two readings correspond to the imaginary parenthesizations -(id-id) and (-id)-id, and can be generated by these derivations, which give different parse trees:
   A => - A => - A - id => - id - id
   A => A - id => - A - id => - id - id
2) a) Prefix minus binds stronger:
   A ::= B | A - id
   B ::= - B | id
   b) Infix minus binds stronger:
   A ::= C | - A
   C ::= id | C - id
3) In the two trees that used to be ambiguous, some of the A's become B's (in grammar a) or C's (in grammar b).
4) -*id(-id)*
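As a quick sanity check of part 4 (this check is my own addition, not part of the original solution), the regular expression -*id(-id)* can be tested directly in Python:

```python
import re

# The language of the unary-minus grammar, per part 4: -*id(-id)*
PATTERN = re.compile(r'(?:-)*id(?:-id)*')

def in_language(s: str) -> bool:
    """True iff the whole token string matches -*id(-id)*."""
    return PATTERN.fullmatch(s) is not None

# The ambiguous string from part 1 is in the language; both readings,
# -(id-id) and (-id)-id, flatten to the same token string.
assert in_language("-id-id")
assert in_language("id")
assert in_language("--id-id-id")
assert not in_language("id--id")  # infix minus needs a plain id on its right
```

Note that `fullmatch` is essential here: `match` would accept any string with a valid prefix.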

Exercise: Basic Dangling Else
The dangling-else problem arises when conditional statements are parsed using the following grammar:
   S ::= id := id
   S ::= if id then S
   S ::= if id then S else S
1) To prove that the grammar is ambiguous, find two parse trees that derive the same string.
2) Find an unambiguous grammar that accepts the same strings and matches each else with the nearest unmatched if.
This is a real issue in languages like C and Java; it is resolved by saying that else binds to the innermost if.

Dangling Else: Ambiguous Input 1) if id then if id then id := id else id := id

Dangling Else: Solution
In class we saw several wrong solutions, which are either still ambiguous or remove some strings from the language. The following is a correct solution:
   S  ::= id := id
   S  ::= if id then S
   S  ::= if id then S1 else S
   S1 ::= id := id
   S1 ::= if id then S1 else S1
It can be parsed using recursive descent (LL(1)) parsers, so it is unambiguous. We will see later how to show this mechanically.
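For comparison, here is how a hand-written recursive-descent parser typically implements this disambiguation in practice (a sketch of my own, not the LL(1) construction from the slides): after parsing the then-branch it greedily consumes an else if one follows, which attaches each else to the nearest unmatched if, the same choice the rewritten grammar enforces.

```python
def parse_stmt(tokens, i):
    """Parse a statement at position i; return (ast, next position)."""
    if tokens[i] == "if":
        # if id then S [else S] -- the else, if present, is taken greedily
        assert tokens[i + 1] == "id" and tokens[i + 2] == "then"
        then_branch, i = parse_stmt(tokens, i + 3)
        if i < len(tokens) and tokens[i] == "else":
            else_branch, i = parse_stmt(tokens, i + 1)
            return ("if", then_branch, else_branch), i
        return ("if", then_branch), i
    # id := id
    assert tokens[i] == "id" and tokens[i + 1] == ":=" and tokens[i + 2] == "id"
    return ("assign",), i + 3

# The ambiguous input from the previous slide:
tokens = "if id then if id then id := id else id := id".split()
ast, end = parse_stmt(tokens, 0)
assert end == len(tokens)
# The else ends up inside the inner if: the outer if has no else branch.
assert ast == ("if", ("if", ("assign",), ("assign",)))
```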

Dangling Else: Solution (Continued) It is clear that every string of the new grammar can be derived by the old one: take a parse tree of the new grammar and replace S1 with S; we obtain a valid parse tree of the old grammar. It is less obvious that the converse holds: for every parse tree in the old grammar, we can find a parse tree in the new grammar that yields the same string. We prove this as follows. Consider a derivation in the old grammar. Apply the following transformation to all subtrees in the old tree as long as possible [the transformation is shown as a figure on the slide: it moves an else down one level, onto the innermost unmatched if]. This transformation preserves the string that the parse tree yields.

Dangling Else: Solution (Continued) In the resulting tree, consider every subtree of the shape shown in the figure (an if-then-else node), and replace all occurrences of S inside such a subtree with S1. The crucial observation is that inside such a tree every "then" has a corresponding else branch. This is clear at the top level; otherwise we would have moved the else shown in the picture one level down using the transformation on the previous slide. By applying this reasoning to the if-then-else inside, we conclude (by induction) that all if nodes have else branches. This means we have a valid tree according to the new grammar.

Exercise: Dangling Else in Context
Suppose that in addition to assignments and if statements we have statement sequencing:
   S ::= S ; S
   S ::= id := id
   S ::= if id then S
   S ::= if id then S else S
Find an unambiguous grammar that accepts the same conditional statements, matches each else with the nearest unmatched if, and treats the priority of ";" as in Java.

Sources of Ambiguity in this Example
Ambiguity arises in this grammar due to:
– the dangling else
– the binary rule for sequencing (;), as for parentheses
– the priority between if-then-else and semicolon (;)
   if p1 if p2 z = x; u = z   // last assignment is not inside if
Wrong parse tree -> wrong generated code

How We Solved It
We identified a wrong tree and tried to refine the grammar to prevent it, by making a copy of the rules. We also changed some rules to disallow sequences inside if-then-else and to make the sequencing rule unambiguous. The end result is something like this:
   S  ::= ε | A S      // a way to write S ::= A*
   A  ::= id := id
   A  ::= if id then A
   A  ::= if id then A' else A
   A' ::= id := id
   A' ::= if id then A' else A'
(At some point we had a useless rule, so we deleted it.)
Note that we cannot have multiple statements inside if branches. We therefore looked at what a grammar would need in order to allow ASTs with sequences inside if-then-else. It would add a case for blocks, like this:
   A  ::= { S }
   A' ::= { S }
We could factor out some common definitions (e.g. define A in terms of A'), but that is not important for this problem.

Formalizing and Automating Recursive Descent: LL(1) Parsers

Task: Rewrite the grammar to make it suitable for a recursive descent parser. Assume the priorities of operators are as in Java.
   expr ::= expr (+|-|*|/) expr
          | name
          | ( expr )
   name ::= ident

Grammar vs Recursive Descent Parser
expr       ::= term termList
termList   ::= + term termList | - term termList | ε
term       ::= factor factorList
factorList ::= * factor factorList | / factor factorList | ε
factor     ::= name | ( expr )
name       ::= ident

def expr = { term; termList }
def termList =
  if (token == PLUS) { skip(PLUS); term; termList }
  else if (token == MINUS) { skip(MINUS); term; termList }
def term = { factor; factorList }
...
def factor =
  if (token == IDENT) name
  else if (token == OPAR) { skip(OPAR); expr; skip(CPAR) }
  else error("expected ident or (")

Note that the abstract trees we would create in this example do not strictly follow the parse trees.
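The pseudocode above can be made runnable. Here is a Python sketch of my own transcription: tokens come from str.split, ASTs are nested tuples, and the *List procedures become loops, which also makes the resulting trees left-associative like Java's operators (this is one way the abstract trees diverge from the parse trees).

```python
def parse_expr(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def skip(expected):
        nonlocal pos
        assert peek() == expected, f"expected {expected}, got {peek()}"
        pos += 1

    def expr():                      # expr ::= term termList
        left = term()
        while peek() in ("+", "-"):  # termList, written as a loop
            op = peek(); skip(op)
            left = (op, left, term())
        return left

    def term():                      # term ::= factor factorList
        left = factor()
        while peek() in ("*", "/"):  # factorList, written as a loop
            op = peek(); skip(op)
            left = (op, left, factor())
        return left

    def factor():                    # factor ::= name | ( expr )
        nonlocal pos
        if peek() == "(":
            skip("("); e = expr(); skip(")")
            return e
        tok = peek()
        assert tok not in (None, "+", "-", "*", "/", ")"), "expected ident or ("
        pos += 1
        return tok                   # name ::= ident

    result = expr()
    assert pos == len(tokens), "trailing input"
    return result

# * binds tighter than +, as the grammar layering dictates:
assert parse_expr("a + b * c".split()) == ("+", "a", ("*", "b", "c"))
# parentheses override the priorities:
assert parse_expr("( a + b ) * c".split()) == ("*", ("+", "a", "b"), "c")
```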

Rough General Idea
A ::= B1 ... Bp
    | C1 ... Cq
    | D1 ... Dr

def A =
  if (token ∈ T1) { B1 ... Bp }
  else if (token ∈ T2) { C1 ... Cq }
  else if (token ∈ T3) { D1 ... Dr }
  else error("expected T1, T2, T3")

where:
  T1 = first(B1 ... Bp)
  T2 = first(C1 ... Cq)
  T3 = first(D1 ... Dr)
  first(B1 ... Bp) = { a | B1 ... Bp ⇒* a w }
T1, T2, T3 should be disjoint sets of tokens.

Computing first in the Example
expr       ::= term termList
termList   ::= + term termList | - term termList | ε
term       ::= factor factorList
factorList ::= * factor factorList | / factor factorList | ε
factor     ::= name | ( expr )
name       ::= ident

first(name)                = { ident }
first(( expr ))            = { ( }
first(factor)              = first(name) U first(( expr )) = { ident } U { ( } = { ident, ( }
first(* factor factorList) = { * }
first(/ factor factorList) = { / }
first(factorList)          = { *, / }
first(term)                = first(factor) = { ident, ( }
first(termList)            = { +, - }
first(expr)                = first(term) = { ident, ( }

Algorithm for first
Given an arbitrary context-free grammar with a set of rules of the form X ::= Y1 ... Yn, compute first for each right-hand side and for each symbol. How to handle:
– alternatives for one non-terminal
– sequences of symbols
– nullable non-terminals
– recursion

Rules with Multiple Alternatives
A ::= B1 ... Bp | C1 ... Cq | D1 ... Dr
first(A) = first(B1 ... Bp) U first(C1 ... Cq) U first(D1 ... Dr)

Sequences:
first(B1 ... Bp) = first(B1)                   if not nullable(B1)
first(B1 ... Bp) = first(B1) U ... U first(Bk) if nullable(B1), ..., nullable(Bk-1), and (not nullable(Bk) or k = p)
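Both sequence rules collapse into one loop over the symbols. A Python transcription (the dict/set representation of first and nullable is my own choice):

```python
def first_of_seq(symbols, first, nullable):
    """first(Y1 ... Yp): union first(Yi) while all earlier symbols are nullable."""
    result = set()
    for sym in symbols:
        result |= first[sym]
        if sym not in nullable:   # stop at the first non-nullable symbol
            break
    return result

# Example: first(A B c) where A is nullable and B is not, so c is never reached.
first = {"A": {"a"}, "B": {"b"}, "c": {"c"}}
assert first_of_seq(["A", "B", "c"], first, nullable={"A"}) == {"a", "b"}
```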

Abstracting into Constraints
Recursive grammar:
   expr       ::= term termList
   termList   ::= + term termList | - term termList | ε
   term       ::= factor factorList
   factorList ::= * factor factorList | / factor factorList | ε
   factor     ::= name | ( expr )
   name       ::= ident
Constraints over finite sets, where expr' stands for first(expr):
   expr'       = term'
   termList'   = {+} U {-}
   term'       = factor'
   factorList' = {*} U {/}
   factor'     = name' U { ( }
   name'       = { ident }
nullable: termList, factorList
For this nice grammar, there is no recursion in the constraints. Solve by substitution.

Example to Generate Constraints
S ::= X | Y
X ::= b | S Y
Y ::= Z X b | Y b
Z ::= ε | a

terminals: a, b    non-terminals: S, X, Y, Z
First sets are subsets of the terminals: S', X', Y', Z' ⊆ { a, b }

S' = X' U Y'
X' =
reachable (from S):
productive:
nullable:

Example to Generate Constraints
S ::= X | Y
X ::= b | S Y
Y ::= Z X b | Y b
Z ::= ε | a

reachable (from S): S, X, Y, Z
productive: X, Z, S, Y
nullable: Z
terminals: a, b    non-terminals: S, X, Y, Z

S' = X' U Y'
X' = {b} U S'
Y' = Z' U X' U Y'
Z' = {a}

These constraints are recursive. How do we solve them? We know S', X', Y', Z' ⊆ { a, b }. How many candidate solutions are there in this case? For k tokens and n non-terminals?

Iterative Solution of first Constraints
S' = X' U Y'
X' = {b} U S'
Y' = Z' U X' U Y'
Z' = {a}

   S'      X'      Y'      Z'
   {}      {}      {}      {}
   {}      {b}     {b}     {a}
   {b}     {b}     {a,b}   {a}
   {a,b}   {a,b}   {a,b}   {a}
   {a,b}   {a,b}   {a,b}   {a}

Start from all sets empty. Evaluate the right-hand side and assign it to the left-hand side. Repeat until it stabilizes.
Sets grow in each step:
– initially they are empty, so they can only grow
– if the sets grow, the RHS grows (U is monotonic), and so does the LHS
– they cannot grow forever: in the worst case they contain all tokens
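The iteration is easy to run mechanically. A Python sketch of the fixpoint loop for exactly these four constraints (writing the right-hand sides as lambdas is my own encoding):

```python
# One lambda per constraint; v maps each non-terminal to its current first set.
constraints = {
    "S": lambda v: v["X"] | v["Y"],           # S' = X' U Y'
    "X": lambda v: {"b"} | v["S"],            # X' = {b} U S'
    "Y": lambda v: v["Z"] | v["X"] | v["Y"],  # Y' = Z' U X' U Y'
    "Z": lambda v: {"a"},                     # Z' = {a}
}

sets = {n: set() for n in constraints}  # start from all sets empty
changed = True
while changed:
    changed = False
    for name, rhs in constraints.items():
        new = rhs(sets)
        if new != sets[name]:  # sets only ever grow: U is monotonic
            sets[name] = new
            changed = True

# The least solution, as in the last row of the table:
assert sets == {"S": {"a", "b"}, "X": {"a", "b"}, "Y": {"a", "b"}, "Z": {"a"}}
```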

Constraints for Computing nullable
A non-terminal is nullable if it can derive ε.
S ::= X | Y
X ::= b | S Y
Y ::= Z X b | Y b
Z ::= ε | a

S' = X' | Y'
X' = 0 | (S' & Y')
Y' = (Z' & X' & 0) | (Y' & 0)
Z' = 1 | 0

Here S', X', Y', Z' ∈ {0, 1}, where 0 means not nullable and 1 means nullable; | is disjunction and & is conjunction. Iterating over the columns S', X', Y', Z' is again monotonically growing.

Computing first and nullable
Given any grammar we can compute:
– for each non-terminal X, whether nullable(X)
– using this, the set first(X) for each non-terminal X
General approach:
– generate constraints over finite domains, following the structure of each rule
– solve the constraints iteratively: start from the least elements, keep evaluating the RHS and re-assigning the value to the LHS, and stop when there is no more change

Summary: Algorithm for nullable
nullable = {}
changed = true
while (changed) {
  changed = false
  for each non-terminal X
    if (X is not nullable) and
       (the grammar contains a rule X ::= ε | ...
        or a rule X ::= Y1 ... Yn | ... where {Y1, ..., Yn} ⊆ nullable)
    then {
      nullable = nullable U {X}
      changed = true
    }
}
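The same algorithm in Python. The grammar encoding is my own choice: a dict mapping each non-terminal to its list of alternatives, each alternative a list of symbols, with [] for the ε alternative.

```python
def compute_nullable(grammar):
    """Fixpoint computation of the set of nullable non-terminals."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            if nt in nullable:
                continue
            for alt in alternatives:
                # ε-rule (empty alt), or every RHS symbol already nullable
                if all(sym in nullable for sym in alt):
                    nullable.add(nt)
                    changed = True
                    break
    return nullable

# The running example: S ::= X | Y, X ::= b | S Y, Y ::= Z X b | Y b, Z ::= ε | a
grammar = {
    "S": [["X"], ["Y"]],
    "X": [["b"], ["S", "Y"]],
    "Y": [["Z", "X", "b"], ["Y", "b"]],
    "Z": [[], ["a"]],
}
assert compute_nullable(grammar) == {"Z"}
```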

Summary: Algorithm for first
for each non-terminal X: first(X) = {}
for each terminal t: first(t) = {t}
repeat
  for each grammar rule X ::= Y(1) ... Y(k)
    for i = 1 to k
      if i = 1 or {Y(1), ..., Y(i-1)} ⊆ nullable then
        first(X) = first(X) U first(Y(i))
until none of the first(...) changed in the last iteration
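A Python sketch of this algorithm, under my own conventions: the grammar is a dict from non-terminal to a list of alternatives (each a list of symbols), any symbol that is not a key of the dict is treated as a terminal, and the nullable set is passed in precomputed.

```python
def compute_first(grammar, nullable):
    """Fixpoint computation of first(X) for every non-terminal X."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                for sym in alt:
                    # first of a terminal is the terminal itself
                    f = first[sym] if sym in grammar else {sym}
                    if not f <= first[nt]:
                        first[nt] |= f
                        changed = True
                    if sym not in nullable:
                        break  # later symbols cannot start a derived string
    return first

# The running example again; nullable was computed as {Z}.
grammar = {
    "S": [["X"], ["Y"]],
    "X": [["b"], ["S", "Y"]],
    "Y": [["Z", "X", "b"], ["Y", "b"]],
    "Z": [[], ["a"]],
}
first = compute_first(grammar, nullable={"Z"})
# Matches the fixpoint from the constraint slides:
assert first == {"S": {"a", "b"}, "X": {"a", "b"}, "Y": {"a", "b"}, "Z": {"a"}}
```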