
1 Compilation With an emphasis on getting the job done quickly Copyright © 2003-2015 – Curt Hill

2 Introduction One of the goals of these courses is to develop the ability to implement simple control languages –That is what this presentation is about There are whole courses devoted to compiler theory and the construction of compilers and interpreters –We will not go that deep here

3 Stages Compilers typically have several stages These may run sequentially or concurrently They include –The lexical analyzer, or scanner –The syntax analyzer, or parser –The code generator –The optimization routines In this class we have little concern for the last two

4 Lexical analyzer or scanner The front end for a parser Takes a string of characters and produces a sequence of tagged tokens The input contains things like comments, white space, and line breaks –None of these make any difference to the parser –Neither do input buffering and numerous other details

5 Why separate scanner and parser? This simplifies the parser –Which is inherently the more complicated of the two We can optimize the scanner in different ways than the parser Separation makes both more modular The parser is mostly portable –The scanner may or may not be, since it deals more directly with files

6 Scanner Lexical errors - there are just a few –Invalid format of a number or identifier name –Unterminated two-part comments or quoted strings –A character not in the alphabet The lexical analyzer might or might not actually touch the symbol table –Depends on the format of the token stream

7 The token stream A token is usually represented by a record or class One field must contain the class of the token –This is usually a number or enumeration member, one for each reserved word and punctuation mark, plus classes for things like identifiers and constants A token may carry supplemental information as needed

8 Supplemental information A reserved word is defined sufficiently by its assigned number or enumeration value A numeric constant needs to carry the actual value and possibly the type An identifier needs the canonical representation –In languages that are not case sensitive this is usually the name converted to all upper or all lower case –In case sensitive languages it is the exact spelling of the name

9 Supplemental information Often the location of the token is also passed along so that an error message can be pinned to a usable source location –Line number and column position Not needed for parsing, but it helps the user determine how to fix the error The parser merely asks for one token at a time, picking it off the token stream –The parser sees no lines
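A minimal sketch of such a token record in Python (the class names, the handful of token kinds, and the field names are illustrative assumptions, not from the slides):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union

class TokenKind(Enum):
    # one member per reserved word and punctuation mark,
    # plus generic classes for identifiers and constants
    PROGRAM = auto()
    VAR = auto()
    BEGIN = auto()
    END = auto()
    IDENT = auto()
    NUM_CONST = auto()
    SEMICOLON = auto()
    ASSIGN = auto()

@dataclass
class Token:
    kind: TokenKind                       # the class of the token
    value: Union[str, int, None] = None   # canonical name or numeric value
    line: int = 0                         # source location, for error messages
    col: int = 0

# in a case-insensitive language the canonical form is one fixed case
t = Token(TokenKind.IDENT, "WriteLn".lower(), line=7, col=5)
```

A reserved word would carry only its `kind`; the `value` field matters for identifiers and constants, and `line`/`col` exist solely so error messages can point somewhere useful.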

10 Creating a lexical analyzer A lexical analyzer usually recognizes a Type 3 language –A regular language Thus the token language can be described rather simply –For example with regular expressions or a Finite State Automaton These are easy to code by hand –There are also programs that generate them
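Because the tokens form a regular language, a hand-written scanner can be a loop over a few regular expressions. A sketch, assuming a tiny token alphabet chosen for illustration:

```python
import re

# each pair: token class, regular expression describing its lexemes
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("PUNCT",  r"[();,:]"),
    ("SKIP",   r"[ \t\n]+|\{[^}]*\}"),   # white space and { comments }
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:                       # character not in the alphabet
            raise SyntaxError(f"lexical error at position {pos}")
        if m.lastgroup != "SKIP":           # comments and white space never reach the parser
            yield (m.lastgroup, m.group())
        pos = m.end()

tokens = list(scan("x := 5; { a comment }"))
# [('IDENT', 'x'), ('ASSIGN', ':='), ('NUMBER', '5'), ('PUNCT', ';')]
```

Note that the only errors this stage can raise are the lexical ones listed on the previous slide; everything else is the parser's problem.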

11 Relationship It is possible to have the scanner operate like a preprocessor –Read in a source code file and write out a file of tokens Usually it is just a function called by the parser to deliver the next token There is no reason why it could not be a coroutine as well

12 Generated Scanners There are programs that generate scanners from some type of formal description of the tokens –The most famous of which is lex on UNIX systems Many parser generators either come with a scanner that is easy to adapt to your particular grammar or generate one while processing the syntax

13 Simple Example
Source Code:
  Program demo(input,output); { a comment }
  var x:integer;
  begin
    x := 5;
    writeln('This is x',x)
  End.
Token Stream:
  program  ident "demo"  (  ident "input"  ,  ident "output"  )  ;
  var  ident "x"  :  integer  ;  begin  ident "x"  :=  numeric constant 5

14 The Parser Determines whether the source file conforms to the language syntax –Generates meaningful error messages Builds a structure for later stages to operate on –Mostly the code generator –This structure is usually a parse tree

15 Parsers There are several types of parsers –Top down or bottom up –Recursive descent –LL –LR –Generated or table-driven parsers

16 Top down Uses a leftmost derivation Builds the parse tree in a top down, left to right fashion Languages that may be parsed this way are termed LL(N) –The first L specifies a Left to right scan of the source code –The second L specifies that the Leftmost derivation is the one generated –N is the number of tokens of lookahead needed to decide which rule to use next; it is often one

17 Example Consider a handout for this –After the ident there is either a comma or a right parenthesis –The parser looks ahead to that item to determine what to do next –At every fork in the syntax diagrams there is a lookahead set of tokens –We determine which way to go by checking which set the next symbol is in –If we never need more than a single token of lookahead then the constant is 1
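The comma-or-right-parenthesis decision above can be sketched as a small Python routine parsing an identifier list such as `(a, b, c)`; the token stream and helper names are assumptions for illustration:

```python
def parse_ident_list(tokens):
    """ident_list -> '(' IDENT ( ',' IDENT )* ')'"""
    names = []
    if next(tokens) != "(":
        raise SyntaxError("expected '('")
    names.append(next(tokens))          # the first IDENT
    while True:
        la = next(tokens)               # the single lookahead token
        if la == ",":                   # in the lookahead set for 'another name'
            names.append(next(tokens))
        elif la == ")":                 # in the lookahead set for 'list is done'
            return names
        else:
            raise SyntaxError(f"expected ',' or ')', got {la!r}")

print(parse_ident_list(iter(["(", "a", ",", "b", ",", "c", ")"])))  # ['a', 'b', 'c']
```

One token of lookahead (`la`) is enough to pick the branch, which is exactly what the 1 in LL(1) promises.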

18 Top Down Again For most programming languages an LL(1) grammar exists –This requires just a single lookahead token Let's look through the handout and see how this works

19 Bottom up Looks at the leaves and works its way towards the root Bottom up parsers usually accept LR(n) languages Since they start at the bottom of the parse tree they use shift-reduce algorithms In this presentation we will skip how this actually works

20 LL and LR In theory you could parse a programming language in many ways other than LL or LR –However, doing so makes the running time O(N³), which is not good for something run as often as a compiler –By forcing a Left to right scan of the source (the first L) we can get O(N) compiles, which makes everyone much happier

21 Subsets LL(1) languages are a subset of the LR(1) languages Hence for any LL(1) language there is an LR(1) grammar There are LR(1) languages for which there is no LL(1) grammar There are two other classes, SLR and LALR, which are simplifications of LR

22 Commonly The LL, SLR and LALR parsers have been the dominant ones because the tables needed by a full LR parser could grow exponentially in the worst case Hence most table-driven parsers were LL or LALR However, quite a bit of work has been done since then, and there are now some decent LR table parsers

23 Recursive descent parsers Since the grammar is recursive we can make our program follow the grammar For each production/non-terminal we write a function that processes that non-terminal It simply calls another function for each non-terminal on the RHS of the production We are less interested in these
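The function-per-non-terminal idea can be sketched for a toy two-rule grammar (the grammar and all names here are illustrative assumptions, not from the slides):

```python
# expr -> term { '+' term }
# term -> NUMBER | '(' expr ')'
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):                      # function for non-terminal 'expr'
        value = self.term()
        while self.peek() == "+":        # one token of lookahead decides
            self.take()
            value += self.term()
        return value

    def term(self):                      # function for non-terminal 'term'
        if self.peek() == "(":
            self.take()
            value = self.expr()          # the recursion follows the grammar
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        return int(self.take())          # NUMBER

print(Parser(["1", "+", "(", "2", "+", "3", ")"]).expr())  # 6
```

Each non-terminal becomes one function, and a non-terminal on a RHS becomes a call to the corresponding function; the run-time call stack does the bookkeeping.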

24 Recursive Descent We have seen –LR(n) –LL(n) –LALR(n) The n determines how many tokens must be examined to decide which production is involved Most programming languages have an LL(1) grammar –A recursive descent parser can look ahead just one token and then choose the right production

25 Generated parsers These parsers are usually LALR(1) or LL(1), but a few are LR(1) There are a number of these available –YACC, a UNIX LALR(1) generator These read in some form of a grammar and generate a series of tables that are used by a parser The scanner is also generated and then plugged into the parser

26 How do these work? The scanner reads in tokens The parser then looks for tokens that fit the pattern of a production –That is, it looks for what is on the RHS When it finds the pattern it does a reduction –A reduction is moving right to left across a production –Replace the RHS with the LHS
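The shift/reduce loop behind this can be sketched for a toy grammar with two productions, E -> E + n and E -> n (the grammar and the literal-string matching are illustrative assumptions; a real table-driven parser uses generated state tables instead):

```python
def parse(tokens):
    stack = []
    tokens = list(tokens)
    while True:
        # reduce whenever the top of the stack matches a RHS
        if stack[-3:] == ["E", "+", "n"]:    # RHS of E -> E + n
            stack[-3:] = ["E"]               # replace the RHS with the LHS
        elif stack[-1:] == ["n"]:            # RHS of E -> n
            stack[-1:] = ["E"]
        elif tokens:
            stack.append(tokens.pop(0))      # otherwise shift the next token
        else:
            break
    return stack == ["E"]   # did we reduce to the distinguished symbol?

print(parse(["n", "+", "n", "+", "n"]))  # True
```

Each reduction is the right-to-left move the slide describes, and a successful parse ends with only the distinguished symbol on the stack.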

27 More When a reduction is done a semantic routine is usually called This routine may do any of the following: –Check semantic things Is a variable defined? –Update the symbol table –Generate code –Do the things not possible with BNF Eventually we should be able to reduce to the distinguished symbol; then we are done

28 Finally Lexical analysis implements a finite state automaton Parsing implements a push down automaton –In recursive descent the stack is the run-time stack of function calls –In bottom up parsing the stack contains grammar symbols As Software Engineers we are usually interested in generated parsers for their ease of construction
