
1 The Nature of Computer Languages

2 Grammars Since the development of ALGOL in the late 1950s, programming languages have been described using formal grammars. ANTLR leverages the power of grammars by automatically generating code that can lexically scan and parse input for certain grammars. Our target language is Java, but ANTLR can also target other languages.

3 Language Description The programming languages we want to describe are infinite. How can they be described? Two ways to describe a language: 1. Build a generator that will produce valid sentences. 2. Build a recognizer that can determine if a given input sentence is valid.

4 Language Generator Example output: System.out.println("Hi");

5 Language Recognizer Input "x = x + 1;" is recognized as valid.

6 Language Generation How can we generate words that produce a valid sentence? How can we extract meaning from a sequence of words? "Terence says Sriram likes chicken tikka." "Terence says Sriram, likes chicken tikka." The same sequence of words, but the structure determines the meaning. Compare: x = 2 + 3 * 4;

7 Grammar Background Topics: deterministic finite state machines, non-deterministic finite automata, pushdown automata, and language ambiguity.

8 Finite State Machine (Informal) A machine with states S0 through S3: from S0, the words "He", "She", or "It" move to S1; "loves", "hates", or "writes" move to S2; "ANTLR", "golf", or "gin" move to the final state. It generates sentences such as "He loves golf", "She hates gin", and "She writes golf".

9 Deterministic Finite State Machine (Formally) A deterministic finite state machine is a quintuple (Σ, S, s0, δ, F), where: Σ is the input alphabet (a finite, non-empty set of symbols); S is a finite, non-empty set of states; s0 ∈ S is the initial state; δ : S × Σ → S is the state-transition function; F ⊆ S is the (possibly empty) set of final states.

10 Deterministic Finite Automaton (DFA) A deterministic finite state machine, or deterministic finite automaton (DFA), is a finite state machine in which each pair of state and input symbol has exactly one transition to a next state. DFAs recognize the set of regular languages and are equivalent to regular grammars.
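The quintuple above can be sketched directly as a transition table. This is a hand-written illustration, not ANTLR output, and the class and method names are my own; it encodes the informal He/She machine from slide 8, with words as input symbols.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// A minimal sketch of the DFA quintuple (Σ, S, s0, δ, F): states are ints,
// δ is a nested map, and accepts() runs the machine over one word at a time.
public class SentenceDFA {
    static final Map<Integer, Map<String, Integer>> delta = new HashMap<>();
    static final int START = 0;
    static final Set<Integer> FINAL = Set.of(3);

    static {
        delta.put(0, Map.of("He", 1, "She", 1, "It", 1));
        delta.put(1, Map.of("loves", 2, "hates", 2, "writes", 2));
        delta.put(2, Map.of("ANTLR", 3, "golf", 3, "gin", 3));
    }

    static boolean accepts(String sentence) {
        int state = START;
        for (String word : sentence.split(" ")) {
            Map<String, Integer> row = delta.get(state);
            if (row == null || !row.containsKey(word)) return false; // no transition: reject
            state = row.get(word);
        }
        return FINAL.contains(state); // accept only in a final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("He loves golf"));   // true
        System.out.println(accepts("She writes golf")); // true: grammatical, not sensible
        System.out.println(accepts("golf loves He"));   // false
    }
}
```

Because δ yields at most one next state per (state, word) pair, the loop never has to choose between transitions, which is exactly the determinism the definition requires.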

11 Transitions in a DFA From a state S1, transitions on distinct symbols X and Y to states S2 and S3 are fine; two transitions out of S1 on the same symbol X would be nondeterminism.

12 Non-deterministic Finite Automaton (NDFA) A non-deterministic finite state machine, or non-deterministic finite automaton (NDFA), is a finite state machine in which a pair of state and input symbol may have two or more transitions to a next state. Every NDFA can be converted to an equivalent DFA.

13 Requirements for Generating Complex Language State machines do generate sentences, but not all the sentences they generate are valid, and grammatical doesn't imply sensible ("She writes golf"). There are dependencies between the words of a sentence: in a program, "[" precedes "]". There are also ordering requirements, violated in a[i+3)].

14 Requirements for Generating Complex Language DFAs can only generate the class of regular languages Programming languages belong to the class of context-free languages We will need to work with context-free grammars

15 Chomsky Hierarchy Noam Chomsky (1950s) described 4 classes of grammars: 1) Type 0, unrestricted grammars; 2) Type 1, context-sensitive grammars; 3) Type 2, context-free grammars; 4) Type 3, regular grammars.

16 Pushdown Automaton (Informal) Our generator needs more power: a finite state machine equipped with a stack. The pushdown automaton (PDA) can use the top of the stack to decide which transition to take, and can manipulate the stack as part of performing a transition.

17 Tree Structure of Sentences A book is not simply a sequence of words; it has an organization: chapters, sections, paragraphs, sentences, phrases, words. "Nested and ordered" means tree-structured. Even sentences have structure, and trees work perfectly to describe sentence structure.

18 The Structure of x = 0; The parse tree: a statement consists of an assignment followed by ';'; the assignment is x = expr; the expr is 0.

19 Generating x = 0; void statement() { assignment(); System.out.println(";"); } void assignment() { System.out.println("x"); System.out.println("="); expr(); } void expr() { System.out.println("0"); }

20 Generating Language A finite state machine is like a stream of instructions within a single method, unable to make method calls A pushdown automaton can freely invoke other parts of the machine to recognize more parts of the language This provides for nesting and sequencing

21 PDA Representation Pushdown automata can be represented with syntax diagrams, for example a diagram for expr whose alternative paths include INT, ID, '[' expr ']', and '(' expr ')'.

22 Ambiguity “Bush appeals to Democrats” Grammars can be ambiguous. Consider the phrase 2 + 3 * 4 Ambiguity is not tolerated in a programming language Within a syntax diagram, an ambiguous sentence is one the diagram can generate by following two distinct paths

23 Ambiguity A syntax diagram for C can generate the statement i*j; in two ways: i*j; as a multiplication of i and j, or i*j; as a declaration of j as a pointer to type i. Languages are usually described formally with syntax diagrams and informally with descriptive language that disambiguates such sentences.

24 Vocabulary Symbols Symbols have structure too Sentences are sequences of characters but your brain groups them into units “Now is the time …” Your brain constructs structures from the sequence of symbols The language processors we build will be separated into two parts: lexers and parsers

25 Characters, Tokens, ASTs The characters W I D T H = 2 0 0 ; \n arrive as a CharStream; the lexer groups them into the tokens ID WS = WS INT ; WS; and the parser builds an AST (CommonTree) with = at the root and ID and INT as its children.
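The grouping of characters into tokens can be sketched as a small hand-written lexer. This is illustrative only, not ANTLR's CharStream/Token API; token names follow the slide, and the tokenize method and its output format are my own.

```java
import java.util.ArrayList;
import java.util.List;

// A tiny lexer sketch: scan character by character, grouping letters into ID
// tokens and digits into INT tokens, skipping whitespace, and passing other
// characters through as single-character tokens.
public class TinyLexer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }  // skip WS
            if (Character.isLetter(c)) {                        // ID : letter+
                int start = i;
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                tokens.add("ID(" + input.substring(start, i) + ")");
            } else if (Character.isDigit(c)) {                  // INT : digit+
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("INT(" + input.substring(start, i) + ")");
            } else {                                            // punctuation
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("WIDTH = 200;"));
        // → [ID(WIDTH), =, INT(200), ;]
    }
}
```

The parser then works over this token stream rather than raw characters, which is the lexer/parser split the previous slide describes.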

26 Language Generation Your brain generates sentences by starting from a main idea, the root of a tree structure. The brain creates phrases and sub-phrases until it reaches the leaves of the tree. Sentence recognition occurs in reverse: scanning through the words, you must reconstruct the structure.

27 Reading a Data File BufferedReader f = new BufferedReader(new FileReader("random.txt")); int c = f.read(); // get first char StringBuffer s = new StringBuffer(); while (c != -1) { s.append((char) c); c = f.read(); }

28 Reading a Structured Data File BufferedReader f = new BufferedReader(new FileReader("letsdigs.txt")); int c = f.read(); // get first char StringBuffer lets = new StringBuffer(); StringBuffer digs = new StringBuffer(); while (Character.isLetter((char) c)) { lets.append((char) c); c = f.read(); } while (Character.isDigit((char) c)) { digs.append((char) c); c = f.read(); }

29 Reading and Recognizing With ANTLR /** Grammar to read the file */ file : .+ ; // consume to EOF /** Grammar to recognize the file */ file : LETTERS DIGITS ; LETTERS : 'a'..'z'* ; DIGITS : '0'..'9'* ;

30 ANTLR Generates Java Code void file() { LETTERS(); DIGITS(); } void LETTERS() { while (Character.isLetter((char) c)) { c = f.read(); } } void DIGITS() { while (Character.isDigit((char) c)) { c = f.read(); } }

31 ANTLR Constraints The code in the last slide is similar to code ANTLR would generate ANTLR can target other languages besides Java ANTLR can’t generate code for all grammars ANTLR generates an LL recognizer. Technically, LL(*)

32 LL Recognizers LL means recognizing the input from left to right while tracing out a leftmost derivation. For more information on derivations and grammars, listen to the grammar lecture. LL recognizers use one method per grammar rule; the method may be recursive.

33 LL Recognizers LL recognizers restrict the form of acceptable grammars: rules cannot be left-recursive. expr : expr '++' ; void expr() { expr(); // infinite loop match("++"); }

34 LL Recognizers Right-recursion is not a problem expr : ’++’ expr ; void expr() { match(”++”); expr(); } Execution terminates if the next token isn’t “++”

35 LL Recognizers Another restriction occurs when the lookahead is too weak to handle a given grammar. Lookahead is the number of tokens ahead in the input stream that the parser can see. This is similar to the number of symbols on the maze floor you can see before deciding which turn to take.

36 Lookahead Consider this rule: decl : 'int' ID '=' INT ';' | 'int' ID ';' Here are examples: int x = 32; int y; We need to see three tokens ahead in order to distinguish which form of the rule to apply. We need an LL(3) grammar.

37 Lookahead Given a language, there are infinitely many grammars that describe it Often we can alter a given grammar so that ANTLR has an easier time generating the recognizer An LL(3) grammar might be converted to LL(1) by refactoring decl : ’int’ ID (’=’ INT)? ’;’
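The refactored LL(1) rule can be sketched as a recursive-descent method that peeks one token ahead after 'int' ID to decide whether an initializer follows. This is an illustrative hand-written parser, not ANTLR output; the class, match, and lookahead names are my own.

```java
// Sketch of decl : 'int' ID ('=' INT)? ';' as an LL(1) recursive-descent
// method: a single token of lookahead picks between the optional part and ';'.
public class DeclParser {
    private final String[] tokens;
    private int pos = 0;

    DeclParser(String[] tokens) { this.tokens = tokens; }

    String lookahead() { return pos < tokens.length ? tokens[pos] : "<EOF>"; }

    void match(String expected) {
        if (!lookahead().equals(expected))
            throw new RuntimeException("expected " + expected + ", saw " + lookahead());
        pos++;
    }

    void decl() {
        match("int");
        pos++;                          // consume ID (no validation in this sketch)
        if (lookahead().equals("=")) {  // LL(1): one token decides the alternative
            match("=");
            pos++;                      // consume INT
        }
        match(";");
    }

    public static void main(String[] args) {
        new DeclParser(new String[]{"int", "x", "=", "32", ";"}).decl();
        new DeclParser(new String[]{"int", "y", ";"}).decl();
        System.out.println("both declarations parsed");
    }
}
```

With the original two-alternative rule, the parser would have needed three tokens before choosing; factoring the common prefix 'int' ID out of both alternatives is what shrinks the decision to one token.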

38 Lookahead Increasing the lookahead depth increases the power of the recognizer Even so, some grammars are not LL(k) for any value of k decl: modifier* ’int’ ID ’=’ INT ’;’ | modifier* ’int’ ID ’;’; modifier : ’static’ | ’register’ ;

39 Lookahead ANTLR can implement LL(*) to solve the previous problem LL(*) uses arbitrary lookahead to get past the problem in the grammar If LL(*) won’t work, you can tell ANTLR to try the rule alternatives in a specified order This is like trying alternatives in the maze until you hit on the escape route

40 ANTLR uses Memoization Memoization is an optimization technique that trades space for speed. Backtracking over several rule alternatives means repeatedly invoking certain sub-rules. The results of the sub-rule invocations can be memoized so they don't have to be recalculated.
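Memoizing a rule attempt can be sketched as a cache keyed by input position. This toy example is not ANTLR's implementation; the names and the stand-in parse outcome are hypothetical, and it only counts invocations to show the saving.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of memoized backtracking: the result of trying a rule at a given
// input position is cached, so when a second alternative backtracks over the
// same position, the sub-rule is not re-parsed.
public class MemoizedRule {
    static int invocations = 0;
    static final Map<Integer, Boolean> memo = new HashMap<>();

    // Stand-in for "try sub-rule at position p"; a real parser would also
    // record how far the rule consumed on success.
    static boolean tryRuleAt(int p) {
        Boolean cached = memo.get(p);
        if (cached != null) return cached; // repeated attempt: answered for free
        invocations++;                     // first attempt: do the work
        boolean result = (p % 2 == 0);     // hypothetical parse outcome
        memo.put(p, result);
        return result;
    }

    public static void main(String[] args) {
        tryRuleAt(4); // first alternative tries the sub-rule
        tryRuleAt(4); // backtracked second alternative reuses the cached result
        System.out.println("invocations = " + invocations); // 1, not 2
    }
}
```

This is the space-for-speed trade the slide describes: the cache grows with the number of (rule, position) pairs tried, but no sub-rule is ever parsed twice at the same position.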

41 Grammar Limitations Not every language can be described using only grammar rules Most programming languages are supplemented with English language descriptions for constraints that are too hard to capture with grammar rules For example, x = 3; only makes sense if x has been declared previously

42 Grammar Limitations In C++, T(i) is either a function call or a constructor-style typecast, (T)i. If T is defined as a class, then T(i) is a typecast. We say the phrase is context-sensitive. A context-sensitive rule has a more complicated left side: xTy → xby

43 Grammar Limitations We are using context-free grammars to describe programming languages. The rules are simpler than context-sensitive rules: every left side is a single non-terminal symbol. T → xT | y

44 Grammar Limitations For context-sensitive rules, ANTLR uses a feature called semantic predicates: expr : {la(1) is a function}? functionCall | {la(1) is a type}? ctorTypecast ;

45 Syntactic Predicates Syntactic predicates provide a way to prioritize rule alternatives: stat : (declaration)=> declaration | expression ; If the current statement looks like a declaration, it is one, even if it also matches the second alternative. With LL(*), semantic predicates, and syntactic predicates, ANTLR can build recognizers for most languages.

