1 Foundations of Software Design Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs Marti Hearst Fall 2002.


1 1 Foundations of Software Design Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs Marti Hearst Fall 2002

2 2 How Do Computers Work (Revisited)?
Bits & Bytes, Binary Numbers, Number Systems, Orders of Magnitude, Gates, Boolean Logic, Circuits, CPU, Machine Instructions, Assembly Language, Programming Languages, Address Space, Code vs. Data, Compiler

3 3 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Compiler
What is a compiler?
– A recognizer (of some source language L).
– A translator (of programs written in L into programs written in some object or target language L').
A compiler is itself a program, written in some host language.
It operates in phases.
(Diagram: Machine Instructions, Assembly Language, Programming Languages, Compiler)

4 4 Converting Java to Byte Code
When you compile a Java program, javac produces byte codes (stored in the class file). javac does not convert the byte codes to machine code. Instead, the VM interprets them when you run the program called java.

5 5
– C code is translated by the C compiler (gcc or cc) to assembly language and then to machine code.
– Java code is translated by the Java compiler (javac) to byte code (class file); a JIT compiler, if one is used, translates byte code to machine code at run time.
– Java Virtual Machine: the JVM is created once; each individual program is loaded & run in the JVM.

6 6 Compiler Compilers
Which came first: the compiler or the program?
– The very first one has to be written in assembly language!
– This is why most programming languages today start with the C code generator
After you have created the first compiler for a given language, say Java, then you … use that compiler to compile itself!

7 7 Compiling Your Compiler
1. Write the first Java compiler using C (javac in C); compile it using gcc.
2. Write the second Java compiler using Java (javac in Java); compile it using javac.
3. Write other Java programs; compile them using javac.

8 8 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Compiler in more detail:
Lexical analyzer (scanner) → Syntax analyzer (parser) → Semantic analyzer → Intermediate code generator → Optimizer → Code generator

9 9 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Scanner
Task:
– Translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes).
How it’s done:
– Specify lexemes using regular expressions.
– Convert these regular expressions into finite automata.

10 10 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Lexemes and Tokens
Here are some Java lexemes and the corresponding tokens:
Lexeme    Token
;         SEMI-COLON
=         ASSIGN
index     IDENT
tmp       IDENT
37        INT-LIT
102       INT-LIT
Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers).
Given the source code:
position = initial + rate * 60 ;
a Java scanner would return the following sequence of tokens:
IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON
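The lexeme-to-token mapping above can be sketched with java.util.regex. This is a minimal illustration, not how a production Java scanner is built: the class name TinyScanner, the group names, and the ASCII-only identifier rule are all assumptions made for the example, and only the six token kinds from the slide are recognized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyScanner {
    // Each row pairs a regex group name with the token name used on the slide.
    private static final String[][] KINDS = {
        {"IDENT", "IDENT"}, {"INTLIT", "INT-LIT"}, {"ASSIGN", "ASSIGN"},
        {"PLUS", "PLUS"}, {"TIMES", "TIMES"}, {"SEMI", "SEMI-COLON"},
    };

    // One alternative per token kind (ASCII-only identifiers: a simplification).
    private static final Pattern LEXEME = Pattern.compile(
        "(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)|(?<INTLIT>[0-9]+)|(?<ASSIGN>=)"
        + "|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<SEMI>;)");

    // Groups characters into lexemes and returns the token name for each one.
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            if (Character.isWhitespace(source.charAt(pos))) { pos++; continue; }
            m.region(pos, source.length());
            if (!m.lookingAt()) {
                // No token kind starts here: a lexical error (erroneous character).
                throw new IllegalArgumentException("Lexical error at offset " + pos);
            }
            for (String[] kind : KINDS) {
                if (m.group(kind[0]) != null) tokens.add(kind[1]);
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("position = initial + rate * 60 ;"));
    }
}
```

Running main on the slide's source line prints the same eight tokens listed above; a character outside the six token kinds, such as @, raises the lexical error.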

11 11 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Scanner
Also called the Lexer
How it works:
– Reads characters from the source program.
– Groups the characters into lexemes (sequences of characters that "go together").
– Each lexeme corresponds to a token; the scanner returns the next token (plus maybe some additional information) to the parser.
– The scanner may also discover lexical errors (e.g., erroneous characters).
The definitions of what is a lexeme, token, or bad character all depend on the source language.

12 12 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Two kinds of Automata
Deterministic (DFA):
– No state has more than one outgoing edge with the same label.
Non-Deterministic (NFA):
– States may have more than one outgoing edge with the same label.
– Edges may be labeled with ε (epsilon), the empty string.
– The automaton can take an ε (epsilon) transition without looking at the current input character.

13 13 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Regular Expressions to Finite Automata
Generating a scanner:
Lexical specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA
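The last stage of that pipeline, a table-driven DFA, can be sketched as follows. This is a hand-written table for a simplified identifier (an ASCII letter, underscore, or dollar sign followed by letters or digits), not the output of a real scanner generator; the class name IdentifierDfa and the three-state layout are assumptions for illustration.

```java
class IdentifierDfa {
    // Character classes: the columns of the transition table.
    private static final int LETTER = 0, DIGIT = 1, OTHER = 2;
    // States: the rows of the transition table.
    private static final int START = 0, IN_IDENT = 1, DEAD = 2;

    // NEXT[state][characterClass] gives the single next state (deterministic).
    private static final int[][] NEXT = {
        /* START    */ {IN_IDENT, DEAD,     DEAD},
        /* IN_IDENT */ {IN_IDENT, IN_IDENT, DEAD},
        /* DEAD     */ {DEAD,     DEAD,     DEAD},
    };

    // ASCII-only approximation of "Java letter" and "Java digit".
    private static int classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '$') return LETTER;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    // Runs the DFA over the whole input; IN_IDENT is the only accepting state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][classify(input.charAt(i))];
        }
        return state == IN_IDENT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("tmp") + " " + accepts("1x"));
    }
}
```

The driver loop never changes; only the table does, which is why a generator can emit the table from the regular expressions and reuse the same loop for any token set.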

14 14 BNF
Backus-Naur form, Backus-Normal form
– A set of rules (or productions)
– Each of which expresses the ways symbols of the language can be grouped together
Non-terminals are written upper-case
Terminals are written lower-case
The start symbol is the left-hand side of the first production
The rules for a context-free grammar (CFG) are often referred to as its BNF

15 15 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Java Identifier Definition
Described in the Java specification:
– http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591
– “An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
– An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).”

16 16 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html Java Identifier Definition

17 17 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Java Integer Literals
An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8).
Examples: 0, 2, 0372, 0xDadaCafe, 1996, 0x00FF00FF
(opt means optional)

18 18 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Defining Java Decimal Numerals
A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer:
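That definition translates directly into a regular expression: either 0 alone, or a nonzero digit followed by any number of digits. The sketch below checks the numeral only (no type suffix, no hexadecimal or octal forms); the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class DecimalNumeral {
    // "0" alone, or a digit 1-9 optionally followed by more digits 0-9.
    private static final Pattern DECIMAL = Pattern.compile("0|[1-9][0-9]*");

    // True only if the whole string is a decimal numeral under the rule above.
    static boolean isDecimalNumeral(String s) {
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDecimalNumeral("1996") + " " + isDecimalNumeral("0372"));
    }
}
```

Note that 0372 is rejected here because a leading zero marks an octal numeral, which is defined by a separate rule.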

19 19 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Defining Floating-Point Literals
A floating-point literal has the following parts: a whole-number part, a decimal point (represented by an ASCII period character), a fractional part, an exponent, and a type suffix.
The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer.

20 20 From the Lucene HTML Scanner

21 21 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Functionality of the Parser
Input: sequence of tokens from lexical analysis
Output: parse tree of the program
– parse tree is generated if the input is a legal program
– if input is an illegal program, syntax errors are issued
Note:
– Instead of a parse tree, some parsers directly produce: abstract syntax tree (AST) + symbol table, or intermediate code, or object code

22 22 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Parser vs. Scanner
Phase     Input                  Output
Scanner   String of characters   String of tokens
Parser    String of tokens       Parse tree

23 23 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Parser
Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
Finds syntax errors.
– Example: position = * 5 ;
– corresponds to the sequence of tokens: IDENT ASSIGN TIMES INT-LIT SEMI-COLON
– All are legal tokens, but that sequence of tokens is erroneous.
Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

24 24 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
What must the parser do?
1. Recognizer: not all strings of tokens are programs
– must distinguish between valid and invalid strings of tokens
2. Translator: must expose program structure
– e.g., associativity and precedence
– must return the parse tree
We need:
– A language for describing valid strings of tokens: context-free grammars (analogous to regular expressions in the scanner)
– A method for distinguishing valid from invalid strings of tokens (and for building the parse tree): the parser (analogous to the state machine in the scanner)

25 25 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Parser Example
position = initial + rate * 60 ;
=
    position
    +
        initial
        *
            rate
            60
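One common way to build such a tree by hand is recursive descent, with one method per grammar rule. The sketch below assumes a small grammar written for this example (statement: IDENT = expression ; / expression: term { + term } / term: factor { * factor } / factor: IDENT or INT-LIT) and assumes tokens are separated by spaces, as in the slide's source line. The class name TinyParser and the parenthesized prefix output format are illustrative choices, not taken from the lecture.

```java
class TinyParser {
    private final String[] tokens;
    private int pos = 0;

    private TinyParser(String source) {
        this.tokens = source.trim().split("\\s+");   // tokens are space-separated here
    }

    // Returns the tree in prefix form, e.g. (= x (+ a b)), or throws on a syntax error.
    static String parse(String source) {
        TinyParser p = new TinyParser(source);
        String tree = p.statement();
        if (p.pos != p.tokens.length) throw p.error("end of input");
        return tree;
    }

    // statement: IDENT = expression ;
    private String statement() {
        String target = identifier();
        expect("=");
        String value = expression();
        expect(";");
        return "(= " + target + " " + value + ")";
    }

    // expression: term { + term }   (lower precedence, so + ends up nearer the root)
    private String expression() {
        String left = term();
        while (peekIs("+")) { pos++; left = "(+ " + left + " " + term() + ")"; }
        return left;
    }

    // term: factor { * factor }   (higher precedence, so * binds tighter)
    private String term() {
        String left = factor();
        while (peekIs("*")) { pos++; left = "(* " + left + " " + factor() + ")"; }
        return left;
    }

    // factor: IDENT or INT-LIT
    private String factor() {
        if (pos < tokens.length && tokens[pos].matches("[A-Za-z_][A-Za-z_0-9]*|[0-9]+")) return tokens[pos++];
        throw error("identifier or integer literal");
    }

    private String identifier() {
        if (pos < tokens.length && tokens[pos].matches("[A-Za-z_][A-Za-z_0-9]*")) return tokens[pos++];
        throw error("identifier");
    }

    private boolean peekIs(String s) { return pos < tokens.length && tokens[pos].equals(s); }

    private void expect(String s) {
        if (!peekIs(s)) throw error("'" + s + "'");
        pos++;
    }

    private IllegalArgumentException error(String expected) {
        String found = pos < tokens.length ? "'" + tokens[pos] + "'" : "end of input";
        return new IllegalArgumentException("Syntax error: expected " + expected + " but found " + found);
    }

    public static void main(String[] args) {
        System.out.println(parse("position = initial + rate * 60 ;"));
    }
}
```

For the slide's input this yields (= position (+ initial (* rate 60))), the same shape as the tree above. The erroneous statement position = * 5 ; is rejected because factor() finds * where an identifier or literal is required.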

26 26 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Semantic Analyzer
The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors.
Annotates and/or changes the abstract syntax tree
– (e.g., it might annotate each node that represents an expression with its type).
– Example with before and after:
Before: = ( position, + ( initial, * ( rate, 60 ) ) )
After: = ( position, + ( initial, * ( rate, int-to-float( 60 ) ) ) ), with type annotations (float), (float), (int) on the expression nodes

27 27 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Intermediate Code Generator
The intermediate code generator translates from abstract-syntax tree to intermediate code.
– One possibility is 3-address code.
– Here's an example of 3-address code for the abstract-syntax tree shown above:
    temp1 = int-to-float(60)
    temp2 = rate * temp1
    temp3 = initial + temp2
    position = temp3
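The 3-address code above can be produced by a post-order walk of the tree: emit code for the children first, then one instruction for the node itself, storing the result in a fresh temporary. The sketch below is one way to do that under stated assumptions; the Node class, the generate method, and the tempN naming scheme are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

class ThreeAddressGen {
    // Hypothetical AST node: no children = leaf, one child = conversion, two = binary operator.
    static final class Node {
        final String label;
        final Node[] kids;
        Node(String label, Node... kids) { this.label = label; this.kids = kids; }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    // Emits instructions for the subtree and returns the name that holds its value.
    private String emit(Node n) {
        if (n.kids.length == 0) return n.label;              // identifier or literal
        String result;
        if (n.kids.length == 1) {                            // e.g. int-to-float(60)
            String arg = emit(n.kids[0]);
            result = "temp" + (++temps);
            code.add(result + " = " + n.label + "(" + arg + ")");
        } else {                                             // binary operator
            String left = emit(n.kids[0]);
            String right = emit(n.kids[1]);
            result = "temp" + (++temps);
            code.add(result + " = " + left + " " + n.label + " " + right);
        }
        return result;
    }

    // Generates code for "target = expr".
    static List<String> generate(String target, Node expr) {
        ThreeAddressGen g = new ThreeAddressGen();
        String value = g.emit(expr);
        g.code.add(target + " = " + value);
        return g.code;
    }

    public static void main(String[] args) {
        Node expr = new Node("+", new Node("initial"),
            new Node("*", new Node("rate"), new Node("int-to-float", new Node("60"))));
        for (String line : generate("position", expr)) System.out.println(line);
    }
}
```

Because children are visited before their parent, the conversion of 60 is emitted first and the addition last, which reproduces the four lines on the slide.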

28 28 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Optimizer
Examines the program and rewrites it in ways that preserve the meaning but are more efficient.
Incredibly complex programs and algorithms.
Example:
    int count = 0;
    for (int j=0; j < 2*5; j++) {
        int temp = j + 1;
        count += 3;
    }
– Move the declaration of temp outside the loop so it isn’t re-declared every time the loop is executed
– Change 2*5 to 10 since it is a constant (no need to do an expensive multiply at run time)
– If we removed the line with temp, the program might even skip the loop altogether
You can see in advance that count ends up equal to 30.
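The rewrites described above can be written out by hand to check that they preserve the result. The three methods below are a hand-made illustration of what an optimizer might do, not the output of any particular compiler; whether a real compiler performs each step depends on the compiler. The class and method names are invented for the example.

```java
class LoopOptimization {
    // The loop as written on the slide.
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1;   // computed but never used
            count += 3;
        }
        return count;
    }

    // After folding the constant 2*5 into 10 and removing the unused temp.
    static int simplified() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    // After noticing that the loop adds 3 exactly 10 times.
    static int precomputed() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() + " " + simplified() + " " + precomputed());
    }
}
```

All three versions return 30, which is the sense in which an optimization must preserve the meaning of the program.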

29 29 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
The Code Generator
The code generator generates object code from (optimized) intermediate code.
    LOADF rate,R1
    MULF #60.0,R1
    LOADF initial,R2
    ADDF R2,R1
    STOREF R1,position

30 30 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Tools
Scanner Generator
– Used to create a scanner automatically
– Input: a regular expression for each token to be recognized
– Output: a finite state machine
– Examples: lex or flex (produce C code), or JLex (produces Java)
Compiler Compilers
– yacc (produces C) or JavaCC (produces Java, also has a scanner generator)

31 31 From the Lucene HTML Parser

32 32 From the Lucene HTML Parser

33 33 Graphs / Networks

34 34 Slide adapted from Goodrich & Tamassia What is a Graph?

35-46 Slides adapted from Goodrich & Tamassia (figures only)

47 47 Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Next Time
– Graph Traversal
– Directed Graphs (digraphs)
– DAGs
– Weighted Graphs

