Language Translation Principles Part 1: Language Specification.

Slides:



Advertisements
Similar presentations
Translator Architecture Code Generator ParserTokenizer string of characters (source code) string of tokens abstract program string of integers (object.
Advertisements

Grammars, constituency and order A grammar describes the legal strings of a language in terms of constituency and order. For example, a grammar for a fragment.
Grammars, Languages and Parse Trees. Language Let V be an alphabet or vocabulary V* is set of all strings over V A language L is a subset of V*, i.e.,
Chapter Chapter Summary Languages and Grammars Finite-State Machines with Output Finite-State Machines with No Output Language Recognition Turing.
ICE1341 Programming Languages Spring 2005 Lecture #5 Lecture #5 In-Young Ko iko.AT. icu.ac.kr iko.AT. icu.ac.kr Information and Communications University.
1 Pass Compiler 1. 1.Introduction 1.1 Types of compilers 2.Stages of 1 Pass Compiler 2.1 Lexical analysis 2.2. syntactical analyzer 2.3. Code generation.
Session 14 (DM62) / 15 (DM63) Recursive Descendent Parsing.
ISBN Chapter 3 Describing Syntax and Semantics.
CS 330 Programming Languages 09 / 13 / 2007 Instructor: Michael Eckmann.
Chapter 3 Describing Syntax and Semantics Sections 1-3.
Context-Free Grammars Lecture 7
Fall 2007CS 2251 Miscellaneous Topics Deque Recursion and Grammars.
Chapter 3 Describing Syntax and Semantics Sections 1-3.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
Chapter 3 Describing Syntax and Semantics Sections 1-3.
Normal forms for Context-Free Grammars
Chapter 3: Formal Translation Models
Specifying Languages CS 480/680 – Comparative Languages.
COP4020 Programming Languages
Languages and Grammars MSU CSE 260. Outline Introduction: E xample Phrase-Structure Grammars: Terminology, Definition, Derivation, Language of a Grammar,
Chapter 2 Syntax A language that is simple to parse for the compiler is also simple to parse for the human programmer. N. Wirth.
1 Syntax and Semantics The Purpose of Syntax Problem of Describing Syntax Formal Methods of Describing Syntax Derivations and Parse Trees Sebesta Chapter.
2.2 A Simple Syntax-Directed Translator Syntax-Directed Translation 2.4 Parsing 2.5 A Translator for Simple Expressions 2.6 Lexical Analysis.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
Syntax & Semantic Introduction Organization of Language Description Abstract Syntax Formal Syntax The Way of Writing Grammars Formal Semantic.
Languages & Strings String Operations Language Definitions.
Introduction Syntax: form of a sentence (is it valid) Semantics: meaning of a sentence Valid: the frog writes neatly Invalid: swims quickly mathematics.
BİL 744 Derleyici Gerçekleştirimi (Compiler Design)1 Syntax Analyzer Syntax Analyzer creates the syntactic structure of the given source program. This.
Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Machine-independent code improvement Target code generation Machine-specific.
Winter 2007SEG2101 Chapter 71 Chapter 7 Introduction to Languages and Compiler.
1 Chapter 3 Describing Syntax and Semantics. 3.1 Introduction Providing a concise yet understandable description of a programming language is difficult.
A sentence (S) is composed of a noun phrase (NP) and a verb phrase (VP). A noun phrase may be composed of a determiner (D/DET) and a noun (N). A noun phrase.
Languages, Grammars, and Regular Expressions Chuck Cusack Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5 th edition, by Kenneth.
Context-Free Grammars
CS Describing Syntax CS 3360 Spring 2012 Sec Adapted from Addison Wesley’s lecture notes (Copyright © 2004 Pearson Addison Wesley)
Copyright © 2006 The McGraw-Hill Companies, Inc. Programming Languages 2nd edition Tucker and Noonan Chapter 2 Syntax A language that is simple to parse.
Grammars CPSC 5135.
PART I: overview material
LANGUAGE DESCRIPTION: SYNTACTIC STRUCTURE
Lesson 3 CDT301 – Compiler Theory, Spring 2011 Teacher: Linus Källberg.
Lexical Analysis I Specifying Tokens Lecture 2 CS 4318/5531 Spring 2010 Apan Qasem Texas State University *some slides adopted from Cooper and Torczon.
Introduction to Language Theory
C H A P T E R TWO Syntax and Semantic.
ISBN Chapter 3 Describing Syntax and Semantics.
CMSC 330: Organization of Programming Languages Context-Free Grammars.
CFG1 CSC 4181Compiler Construction Context-Free Grammars Using grammars in parsers.
Review: Compiler Phases: Source program Lexical analyzer Syntax analyzer Semantic analyzer Intermediate code generator Code optimizer Code generator Symbol.
CPS 506 Comparative Programming Languages Syntax Specification.
D Goforth COSC Translating High Level Languages Note error in assignment 1: #4 - refer to Example grammar 3.4, p. 126.
ISBN Chapter 3 Describing Syntax and Semantics.
LESSON 04.
Chapter 3 Context-Free Grammars and Parsing. The Parsing Process sequence of tokens syntax tree parser Duties of parser: Determine correct syntax Build.
Unit-3 Parsing Theory (Syntax Analyzer) PREPARED BY: PROF. HARISH I RATHOD COMPUTER ENGINEERING DEPARTMENT GUJARAT POWER ENGINEERING & RESEARCH INSTITUTE.
Syntax and Semantics Form and Meaning of Programming Languages Copyright © by Curt Hill.
Copyright © 2006 The McGraw-Hill Companies, Inc. Programming Languages 2nd edition Tucker and Noonan Chapter 2 Syntax A language that is simple to parse.
Programming Languages and Design Lecture 2 Syntax Specifications of Programming Languages Instructor: Li Ma Department of Computer Science Texas Southern.
Formal Languages and Grammars
LECTURE 4 Syntax. SPECIFYING SYNTAX Programming languages must be very well defined – there’s no room for ambiguity. Language designers must use formal.
Chapter 4: Syntax analysis Syntax analysis is done by the parser. –Detects whether the program is written following the grammar rules and reports syntax.
Language Translation Part 2: Finite State Machines.
©SoftMoore ConsultingSlide 1 Context-Free Grammars.
Copyright © 2006 Addison-Wesley. All rights reserved.1-1 ICS 410: Programming Languages Chapter 3 : Describing Syntax and Semantics Syntax.
CSE 311 Foundations of Computing I Lecture 19 Recursive Definitions: Context-Free Grammars and Languages Autumn 2012 CSE
CSE 311 Foundations of Computing I Lecture 18 Recursive Definitions: Context-Free Grammars and Languages Autumn 2011 CSE 3111.
Chapter 3 – Describing Syntax CSCE 343. Syntax vs. Semantics Syntax: The form or structure of the expressions, statements, and program units. Semantics:
Modeling Arithmetic, Computation, and Languages Mathematical Structures for Computer Science Chapter 8 Copyright © 2006 W.H. Freeman & Co.MSCS SlidesAlgebraic.
Chapter 3 – Describing Syntax
A Simple Syntax-Directed Translator
Chapter 3 Context-Free Grammar and Parsing
CS 3304 Comparative Languages
Presentation transcript:

Language Translation Principles Part 1: Language Specification

Attributes of a language Syntax: rules describing use of language tokens Semantics: logical meaning of combinations of tokens In a programming language, “tokens” include identifiers, keywords, and punctuation

Linguistic correctness syntactically correctA syntactically correct program is one in which the tokens are arranged so that the code can be successfully translated into a lower-level language semantically correctA semantically correct program is one that produces correct results

Language translation tools Parser: scans source code, compares with established syntax rules Code generator: replaces high level source code with semantically equivalent low level code

Techniques to describe syntax of a language Grammars: specify how you combine atomic elements of language (characters) to form legal strings (words, sentences) Finite State Machines: specify syntax of a language through a series of interconnected diagrams Regular Expressions: symbolic representation of patterns describing strings; applications include forming search queries as well as language specification

Elements of a Language Alphabet: finite, non-empty set of characters –not precisely the same thing we mean when we speak of natural language alphabet –for example, the alphabet of C++ includes the upper- and lowercase letters of the English alphabet, the digits 0-9, and the following punctuation symbols: {,},[,],(,),+,-,*,/,%,=,>,<,!,&,|,’,”,,,.,:,;,_,\ –Pep/8 alphabet is similar, but uses less punctuation –Language of real numbers has its own alphabet; the set of characters {0,1,2,3,4,5,6,7,8,9,+,-,.}

Language as ADT A language is an example of an Abstract Data Type (ADT) An ADT has these characteristics: –Set of possible values (an alphabet) –Set of operations on those values One of the operations on the set of values in a language is concatenation

Concatenation Concatenation is the joining of two or more characters to form a string Many programming language tokens are formed this way; for example: –> and = form >= –& and & form && –1, 2, 3 and 4 form 1234 Concatenation always involves two operands – either one can be a string or a single character

String characteristics The number of characters in a string is the string’s length An empty string is a string with length 0; we denote the empty string with the symbol є The є is the identity element for concatenation; if x is string, then: єx = xє = x

Closure of an alphabet closure, T*The set of all possible strings that can formed by concatenating elements from an alphabet is the alphabet’s closure, denoted T* for some alphabet T The closure of an alphabet includes strings that are not valid tokens in the language; it is not a finite set For example, if R is the real number alphabet, then R* includes: and but also and

Languages & Grammars A language is a subset of the closure of an alphabet A grammar specifies how to concatenate symbols from an alphabet to form legal strings in a language

Parts of a grammar N: a nonterminal alphabet; each element of N represents a group of characters from: –T: a terminal alphabet –P: a set of rules for string production; uses nonterminals to describe language structure –S: the start symbol, an element of N

Terminal vs. non-terminal symbols A non-terminal symbol is used to describe or represent a set of terminal symbols For example, the following standard data types are terminal symols in C++ and Java: int, double, float, char The non-terminal symbol could be used to represent any or all of these

Valid strings S (the start symbol) is a single symbol, not a set Given S and P (rules for production), you can decide whether a set of symbols is a valid string in the language Conversely, starting from S, if you can generate a string of terminal symbols using P, you can create a valid string

Productions A  w a non-terminal produces a string of terminals & non-terminals

Derivations A grammar specifies a language through the derivation process: –begin with the start symbol –substitute for non-terminals using rules of production until you get a string of terminals

Example: a grammar for identifiers (a toy example) N = {,, } T = {a, b, c, 1, 2, 3} P = the productions: (  means “produces”) 1.  2.  3.  4.  a 5.  b 6.  c 7.  1 8.  2 9.  3 S =

Example: deriving a12bc:  (rule 2)  c (rule 6)  means  c (rule 2) derives in one  bc (rule 5) step  bc (rule 3)  2bc (rule 8)  2bc (rule 3)  12bc (rule 7)  12bc  a12bc

Closure of derivation The symbol  * means “derives in 0 or more steps” A language specified by a grammar consists of all strings derivable from the start symbol using the rules of production –provides operational test for membership in the language –if a string can’t be derived using production rules, it isn’t in the language

Example: attempting to derive 2a   a Since there is no  combination in the production rules, we can’t proceed any further This means that 2a isn’t a valid string in our language

A grammar for signed integers N = {I, F, M} –I means integer –F means first symbol; optional sign –M means magnitude T = {+,-,d} (d means digit 0-9) P = the productions: 1.I  FM 2.F  + 3.F  - 4.F  є (means +/- is optional) 5.M  dM 6.M  d S = I

Examples Deriving 14:Deriving -7: I  FMI  FM  єM  -M  dM  -d  dd  -7  14

Recursive rules Both of the previous examples (identifiers, integers) have rules in which a nonterminal is defined in terms of itself: –  and –M  dM Such rules produce languages with infinite sets of legal sentences

Context-sensitive grammar A grammar in which the production rules may contain more than one non-terminal on the left side The opposite (all of the examples we have seen thus far), have production rules restricted to a single non-terminal on the left: these are known as context-free grammars

Example N = {A,B,C} T = {a,b,c} P = the productions: 1.A  aABC 2.A  abC 3.CB  BC 4.bB  bb 5.bC  bcThis rule is context-sensitive: C can be substituted with c only if C is immediately preceded by b 6.cC  cc S = A

Context-sensitive grammar N = {A, B, C} T = {a, b, c} P = the productions 1.A --> aABC 2.A --> abC 3.CB --> BC 4.bB --> bb 5.bC --> bc 6.cC --> cc S = A Example: aaabbbcc is a valid string by: A => aABC (1) => aaABCBC (1) => aaabCBCBC (2) => aaabBCCBC (3) => aaabBCBCC (3) => aaabBBCCC (3) => aaabbBCCC (4) => aaabbbCCC (4) => aaabbbcCC (5) Here, we substituted c for C; this is allowable only if C has b in front of it => aaabbbccC (6) => aaabbbccc (6)

Valid & invalid strings from previous example: Valid: –abc –aabbcc –aaabbbccc –aaaabbbbcccc Invalid: –aabc –cba –bbbccc The grammar describes a language consisting of strings that start with a number of a’s, followed by an equal number of b’s and c’s; this language can be defined mathematically as: L = {a n b n c n | n > 0} Note: a n means the concatenation of n a’s

A grammar for expressions N = {E, T, F} where: E: expression T: term – T = {+, *, (, ), a} F: factor P: the productions: 1.E -> E + T 2.E -> T 3.T -> T * F 4.T -> F 5.F -> (E) 6.F -> a S = E

Applying the grammar You can’t reach a valid conclusion if you don’t have a valid string, but the opposite is not true For example, suppose we want to parse the string (a * a) + a using the grammar we just saw First attempt: E=> T (by rule 2) => F (by rule 4) … and, we’re stuck, because F can only produce (E) or a; so we reach a dead end, even though the string is valid

Applying the grammar Here’s a parse that works for (a*a)+a: E=> E+T (rule 1) => T+T (rule 2) => F+T (rule 4) => (E)+T (rule 5) => (T)+T (rule 2) => (T*F)+T (rule 3) => (T*a)+T (rule 6) => (F*a)+F (rule 4 applied twice) => (a*a) + a (rule 6 applied twice)

Deriving a valid string from a grammar Arbitrarily pick a nonterminal on right side of current intermediate string & select rules for substitution until you get a string of terminals Automatic translators have more difficult problem: –given string of terminals, determine if string is valid, then produce matching object code parsing –only way to determine string validity is to derive it from the start string of the grammar – this is called parsing

The parsing problem Automatic translators aren’t at liberty to pick rules randomly (as illustrated by the first attempt to translate the preceding expression) Parsing algorithm must search for the right sequence of substitutions to derive a proposed string Translator must also be able to prove that no derivation exists if proposed string is not valid

Syntax tree A parse routine can be represented as a tree –start symbol is the root –interior nodes are nonterminal symbols –leaf nodes are terminal symbols –children of an interior node are symbols from right side of production rule substituted for parent node in derivation

Syntax tree for (a*a)+a

Grammar for a programming language A grammar for a subset of the C++ language is laid out on pages of the textbook A sampling (suitable for either C++ or Java) is given on the next couple of slides

Rules for declarations -> ; -> char | int | double (remember, this is subset of actual language) -> |, -> | | -> a|b|c| … |z|A|B|…|Z -> 0|1|2|3|4|5|6|7|8|9

Rules for control structures -> if ( ) | if ( ) else -> while ( ) | do while ( ) ;

Rules for expressions -> ; -> | = -> | > | | >= etc.

Backus-Naur Form (BNF) BNF is the standardized form for specification of a programming language by its rules of production In BNF, the -> operator is written ::= ALGOL-60 first popularized the form

BNF described in terms of itself (from Wikipedia) ::= | ::= " ">" "::=" ::= " " | "" ::= | "|" ::= | ::= | " ">“ ::= '"' '"' | "'" "'"