Tokenizers 29-Nov-15. A tokenizer is a program that extracts tokens from an input stream. A token is a “word” or a significant punctuation mark.

Presentation transcript:

Tokenizers 29-Nov-15

Tokens
A tokenizer is a program that extracts tokens from an input stream.
A token is a “word” or a significant punctuation mark.
A token has two parts:
  Its value
  Its kind, or type
For example, if we tokenize "while (x >= 0)" we might get these tokens:
  "while", keyword
  "(", punctuation
  "x", name
  ">=", operator
  "0", integer
  ")", punctuation
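To make the value/kind pairing concrete, here is a minimal sketch that reproduces the token list above for "while (x >= 0)". It uses java.util.regex rather than a state machine, purely to keep the example short; the class name TokenKindsDemo, the method tokenize, and the regex group names are assumptions made for illustration and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenKindsDemo {
    // One alternative per kind, tried left to right.
    // The group names are illustrative labels, not part of the slides.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<keyword>\\bwhile\\b)"
        + "|(?<name>[A-Za-z][A-Za-z0-9]*)"
        + "|(?<integer>[0-9]+)"
        + "|(?<operator>>=|<=|==|[<>])"
        + "|(?<punctuation>[()])");

    private static final String[] KINDS =
        { "keyword", "name", "integer", "operator", "punctuation" };

    // Returns each token as a string of the form: "value", kind
    static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {                 // find() skips unmatched text such as spaces
            for (String kind : KINDS) {
                if (m.group(kind) != null) {   // the alternative that matched
                    result.add("\"" + m.group() + "\", " + kind);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokenize("while (x >= 0)")) {
            System.out.println(token);
        }
    }
}
```

Each printed line carries both parts of a token: the characters that were matched (the value) and the label of the alternative that matched them (the kind).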

Tokenizers as state machines
Tokenizers are almost always implemented as state machines, but with these important differences:
  To succeed (recognize a token), the tokenizer does not have to reach the end of input; it only has to reach a final state
  When the tokenizer returns a token, the remainder of the input string is kept for use in getting the remaining tokens
We’ll do a quick tokenizer to recognize tokens in arithmetic expressions:
  Integers (digits only)
  Variables (letters and digits, starting with a letter)
  Operators, + - * / %
  Parentheses, ( )
  Errors (anything not in the above list)

TokenType

    public enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

Token

    public class Token {
        private TokenType type;
        private String value;

        public Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }

        public TokenType getType() { return type; }
        public String getValue() { return value; }
    }

Additions to the Token class
For my JUnit testing, I needed to ask whether my Tokenizer was returning the correct Tokens

    public boolean equals(Object object) {
        if (object == null) return false;
        if (!(object instanceof Token)) return false;
        Token that = (Token)object;
        return this.type == that.type && this.value.equals(that.value);
    }

Since my tests were failing, I wanted to see what tokens I was actually getting

    public String toString() {
        return value + ":" + type;
    }
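The slides add equals() but show no hashCode(). Java's general contract is that equal objects must have equal hash codes, which matters as soon as tokens are stored in a HashSet or used as HashMap keys. A minimal sketch, assuming a hashCode() built with Objects.hash (my addition, not from the slides); the wrapper class TokenEqualityDemo exists only to make the example self-contained.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class TokenEqualityDemo {
    enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

    static class Token {
        private final TokenType type;
        private final String value;

        Token(TokenType type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public boolean equals(Object object) {
            if (!(object instanceof Token)) return false; // also false for null
            Token that = (Token) object;
            return this.type == that.type && this.value.equals(that.value);
        }

        // Assumption: not in the slides. Keeps hashCode consistent with equals.
        @Override
        public int hashCode() {
            return Objects.hash(type, value);
        }

        @Override
        public String toString() {
            return value + ":" + type;
        }
    }

    public static void main(String[] args) {
        Set<Token> seen = new HashSet<>();
        seen.add(new Token(TokenType.INTEGER, "42"));
        // A distinct but equal Token is found only because hashCode matches.
        System.out.println(seen.contains(new Token(TokenType.INTEGER, "42"))); // true
        System.out.println(new Token(TokenType.VARIABLE, "x"));               // x:VARIABLE
    }
}
```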

The constructor and hasNext()

    public class Tokenizer {
        private String input;
        private int position;

        public Tokenizer(String input) {
            this.input = input.trim() + " "; // to simplify getting last token
            position = -1;
        }

        public boolean hasNext() {
            return position < input.length() - 2;
        }

        public Token next() {... }
    }

The shell of next()

    public class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE, ERROR };

        public Token next() {
            States state;
            String value = "";
            if (!hasNext()) {
                throw new IllegalStateException("No more tokens!");
            }
            state = States.READY;
            while ((++position) < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY: {... }
                    case IN_VARIABLE: {... }
                    case IN_NUMBER: {... }
                    default: {... }
                        return new Token(TokenType.ERROR, value);
                }
            }
            assert false; // should never get here
            return null;
        }
    }

The READY state

    case READY:
        value = ch + "";
        if (Character.isWhitespace(ch)) break;
        if ("()".contains(ch + "")) {
            return new Token(TokenType.PARENTHESIS, value);
        }
        if ("+-*/%".contains(ch + "")) {
            return new Token(TokenType.OPERATOR, value);
        }
        if (Character.isLetter(ch)) {
            state = States.IN_VARIABLE;
            break;
        }
        if (Character.isDigit(ch)) {
            state = States.IN_NUMBER;
            break;
        }
        return new Token(TokenType.ERROR, value);

The IN_NUMBER state

    case IN_NUMBER:
        if (Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.INTEGER, value);
        }

The IN_VARIABLE state

    case IN_VARIABLE:
        if (Character.isLetter(ch) || Character.isDigit(ch)) {
            value += ch;
            break;
        } else {
            position--; // save char for next time
            return new Token(TokenType.VARIABLE, value);
        }

The default case

    default:
        return new Token(TokenType.ERROR, value);
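Putting the pieces from the preceding slides together gives something like the following. This is a sketch assembled for illustration, not the author's exact file: the outer class ExpressionTokenizerDemo and the helper tokenize are my own names, the unused ERROR state is dropped, the getters are omitted for brevity, and the "should never get here" case throws instead of returning null.

```java
import java.util.ArrayList;
import java.util.List;

class ExpressionTokenizerDemo {
    enum TokenType { INTEGER, VARIABLE, OPERATOR, PARENTHESIS, ERROR }

    static class Token {
        private final TokenType type;
        private final String value;
        Token(TokenType type, String value) { this.type = type; this.value = value; }
        @Override public String toString() { return value + ":" + type; }
    }

    static class Tokenizer {
        private enum States { READY, IN_NUMBER, IN_VARIABLE }
        private final String input;
        private int position = -1;   // index of the last character consumed

        Tokenizer(String input) {
            this.input = input.trim() + " "; // trailing space ends the last token
        }

        boolean hasNext() { return position < input.length() - 2; }

        Token next() {
            if (!hasNext()) throw new IllegalStateException("No more tokens!");
            States state = States.READY;
            String value = "";
            while ((++position) < input.length()) {
                char ch = input.charAt(position);
                switch (state) {
                    case READY:
                        value = ch + "";
                        if (Character.isWhitespace(ch)) break;   // skip, stay in READY
                        if ("()".contains(ch + ""))
                            return new Token(TokenType.PARENTHESIS, value);
                        if ("+-*/%".contains(ch + ""))
                            return new Token(TokenType.OPERATOR, value);
                        if (Character.isLetter(ch)) { state = States.IN_VARIABLE; break; }
                        if (Character.isDigit(ch)) { state = States.IN_NUMBER; break; }
                        return new Token(TokenType.ERROR, value);
                    case IN_NUMBER:
                        if (Character.isDigit(ch)) { value += ch; break; }
                        position--; // save char for next time
                        return new Token(TokenType.INTEGER, value);
                    case IN_VARIABLE:
                        if (Character.isLetter(ch) || Character.isDigit(ch)) {
                            value += ch;
                            break;
                        }
                        position--; // save char for next time
                        return new Token(TokenType.VARIABLE, value);
                    default:
                        return new Token(TokenType.ERROR, value);
                }
            }
            // The appended space always terminates a pending number or variable.
            throw new AssertionError("should never get here");
        }
    }

    // Hypothetical helper: collects the printed form of every token.
    static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        Tokenizer tokenizer = new Tokenizer(input);
        while (tokenizer.hasNext()) result.add(tokenizer.next().toString());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x1 + 42*(y)"));
        // [x1:VARIABLE, +:OPERATOR, 42:INTEGER, *:OPERATOR, (:PARENTHESIS, y:VARIABLE, ):PARENTHESIS]
        System.out.println(tokenize("3 $ 4"));
        // [3:INTEGER, $:ERROR, 4:INTEGER]
    }
}
```

Note how position-- in the IN_NUMBER and IN_VARIABLE states hands the terminating character back, so that the next call to next() starts with it. This is the "remainder of the input is kept" property mentioned earlier.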

java.util.StringTokenizer
StringTokenizer is a trivial tokenizer provided by Sun
Everything is either a “token” or a “delimiter”
The most important methods are hasMoreTokens() and nextToken()
There are three constructors:
  StringTokenizer(String str)
    Delimiters are whitespace characters; any sequence of non-whitespace characters is returned as a token
  StringTokenizer(String str, String delim)
    Same as above, except you get to specify which characters are delimiters
  StringTokenizer(String str, String delim, boolean returnDelims)
    Same as above, except you get to say you also want the delimiters returned as tokens
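The three constructors can be compared side by side with a short sketch. The class StringTokenizerDemo and the helper drain are hypothetical names introduced for this example; the StringTokenizer calls themselves are the standard java.util API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Hypothetical helper: pulls every remaining token into a list.
    static List<String> drain(StringTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace delimiters (the default).
        System.out.println(drain(new StringTokenizer("x1 + 42")));        // [x1, +, 42]
        // Caller-chosen delimiters; the delimiters themselves are discarded.
        System.out.println(drain(new StringTokenizer("x1+42", "+")));     // [x1, 42]
        // Same delimiters, but returned as tokens too.
        System.out.println(drain(new StringTokenizer("x1+42", "+", true))); // [x1, +, 42]
    }
}
```

Notice that StringTokenizer reports only the characters of each token, never its kind, which is why the slides call it trivial.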

java.io.StreamTokenizer
StreamTokenizer is a much more powerful (and much more complex) tokenizer
It is basically capable of tokenizing C and Java programs, including integers, doubles, and comments
There are a large number of possible settings, so that the tokenizer can be customized
The constructor is StreamTokenizer(Reader r), where Reader is an abstract class for reading character streams
The most important method is int nextToken(), where the returned int tells you what kind of token it found
Once you know what kind of token has been found, you access fields of the tokenizer to get its value
I’m not going to cover StreamTokenizer in my lectures
  All the details are in the Java API
  You may want to use StreamTokenizer in subsequent assignments
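For readers who do want to try it, here is a minimal sketch of the nextToken() loop with the default settings. The class StreamTokenizerDemo and the method describe are hypothetical names for this example; TT_EOF, TT_NUMBER, TT_WORD and the fields ttype, nval and sval are part of the standard java.io.StreamTokenizer API.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Hypothetical helper: describes each token as "kind value".
    static List<String> describe(String source) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(source));
        List<String> result = new ArrayList<>();
        try {
            // nextToken() returns the kind; the value is then read from a field.
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        result.add("number " + tokenizer.nval); // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        result.add("word " + tokenizer.sval);
                        break;
                    default:
                        // An ordinary character is its own token kind.
                        result.add("char " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(describe("count = 42 + x1"));
        // [word count, char =, number 42.0, char +, word x1]
    }
}
```

With the default settings, every number is reported as a double in nval, and characters such as / and the quote marks have special meaning, so real use usually starts by adjusting the settings as described in the Java API.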

The End