Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College.

1 Scanning & Regular Expressions CPSC 388 Ellen Walker Hiram College

2 Scanning
Input: characters from the source code
Output: Tokens
–Keywords: IF, THEN, ELSE, FOR …
–Symbols: PLUS, LBRACE, SEMI …
–Variable tokens: ID, NUM
  Augment with string or numeric value

3 TokenType
Enumerated type (a C/C++ construct)
  typedef enum {IF, THEN, ELSE …} TokenType;
IF, THEN, ELSE (etc.) are now constants of type TokenType

4 Using TokenType
void someFun(TokenType tt){
  …
  switch (tt){
    case IF: … break;
    case THEN: … break;
    …
  }
}

5 Token Class (partial)
class Token {
public:
  TokenType tokenval;
  string tokenchars;
  double numval;
};

6 Interlude: References and Pointers
Java has primitives and references
–Primitives are int, char, double, etc.
–References “point to” objects
C++ variables hold their values directly
–But one of the primitive types is the pointer (an “address”), which serves the purpose of a Java reference.

7 Interlude: References and Pointers
To declare a pointer, put * after the type
  char x;   // a character
  char *y;  // a pointer to a character
Using pointers:
  x = 'a';
  y = &x;    // y gets the address of x
  *y = 'b';  // thing pointed at by y becomes 'b'
             // note that x is now also 'b'!

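8 Interlude: References and Pointers
Continuing the example…
  cout << x << endl;          // prints b
  cout << *y << endl;         // prints b
  cout << (void*)y << endl;   // prints a hex address
  cout << (void*)&x << endl;  // same as above
  cout << &y << endl;         // a different address - where the pointer is stored
Note: the (void*) casts are needed because cout treats a bare char* as a C string and would print characters, not the address.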

9 GetToken(): A scanning function
Token *GetToken(istream &sin)
–Read characters from sin until a complete token is extracted, then return (a pointer to) the token
–Usually called by the parser
–Note: the version in the book uses global variables and returns only the token type

10 Using GetToken
Token *myToken = GetToken(cin);
while (myToken != NULL){
  // process the token
  switch (myToken->tokenval){
    // cases for each token type
  }
  myToken = GetToken(cin);
}

11 Result of GetToken

12 Tokens and Languages
The set of valid tokens of a particular type is a Language (in the formal sense)
More specifically, it is a Regular Language

13 Language Formalities
Language: set of strings
String: sequence of symbols
Alphabet: set of legal symbols for strings
–Generally Σ is used to denote an alphabet

14 Example Languages
L1 = {aa, ab, bb}, Σ = {a, b}
L2 = {ε, ab, abab, …}, Σ = {a, b}
L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
L4 = {ε} (one string with no symbols)
L5 = { } (no strings at all)
L5 = Ø

15 Denoting Languages
Expressions (regular languages only)
Grammars
–Set of rewrite rules that express all and only the strings in the language
Automata
–Machines that “accept” all and only the strings in the language

16 Primitive Regular Expressions
Ø
–L(Ø) = { } (no strings)
ε
–L(ε) = {ε} (one string, no symbols)
a, where a is a member of Σ
–L(a) = {a} (one string, one symbol)

17 Combining Regular Expressions
Choice: r | s (sometimes r+s)
–L(r | s) = L(r) ∪ L(s)
Concatenation: rs
–L(rs) = L(r)L(s)
–All combinations of one string from L(r) followed by one string from L(s)
Repetition: r*
–L(r*) = {ε} ∪ L(r) ∪ L(rr) ∪ L(rrr) ∪ …
–0 or more strings from L(r) concatenated

18 Precedence
Repetition before concatenation
Concatenation before choice
Use parentheses to override
  aa* vs. (aa)*
  ab|c vs. a(b|c)

19 Example Languages
L1 = {aa, ab, bb}, Σ = {a, b}
L2 = {ε, ab, abab, …}, Σ = {a, b}
L3 = {strings of N a’s where N is an odd integer}, Σ = {a}
L4 = {ε} (one string with no symbols)
L5 = { } (no strings at all)
L5 = Ø

20 R.E.’s for Examples
L1 = aa | ab | bb
L1 = a(a|b) | bb
L1 = aa | (a|b)b
L2 = (ab)*, not ab*!
L3 = a(aa)*

21 What are these languages?
a* | b* | c*
a*b*c*
(a*b*)*
a(a|b)*c
(a|b|c)*bab(a|b|c)*

22 What are the RE’s?
In the alphabet {a,b,c}:
–All strings that are in alphabetical order
–All strings that have the first a before the first b, before the first c, e.g. ababbabca
–All strings that contain “abc”
–All strings that do not contain “abc”

23 Extended Reg. Exp’s
Additional operations for convenience
r+ = rr* (one or more reps)
. (any character in the alphabet)
.* = any possible string from the alphabet
[a-z] = a|b|c|…|z
[^aeiou] = b|c|d|f|g|h|j...

