Chapter 23 Text Processing Bjarne Stroustrup www.stroustrup.com/Programming.

Slides:



Advertisements
Similar presentations
Chapter 3 Objects, types, and values Bjarne Stroustrup
Advertisements

Chapter 23 Text Processing Bjarne Stroustrup
Liang, Introduction to Java Programming, Ninth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 9 Strings.
 C++ programming facilitates a disciplined approach to program design. ◦ If you learn the correct way, you will be spared a lot of work and frustration.
Introduction to First Amendment Law. The First Amendment “Congress shall make no law respecting an establishment of religion, or prohibiting the free.
CSCE 121, Sec 200, 507, 508 Overview of Chapters 23 and 24 Fall 2010 Prof. Jennifer L. Welch.
Chapter 8. 2 Objectives You should be able to describe: One-Dimensional Arrays Array Initialization Arrays as Arguments Two-Dimensional Arrays Common.
Overview of C++ Chapter 2 in both books programs from books keycode for lab: get Program 1 from web test files.
Chapter 7. 2 Objectives You should be able to describe: The string Class Character Manipulation Methods Exception Handling Input Data Validation Namespaces.
IS 1181 IS 118 Introduction to Development Tools Chapter 4 String Manipulation and Regular Expressions.
The First Amendment. Actual Text Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging.
The First Amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom.
UNIX Filters.
More on Regular Expressions Regular Expressions More character classes \s matches any whitespace character (space, tab, newline etc) \w matches.
CSCI 1730 January 17 th, 2012 © by Pearson Education, Inc. All Rights Reserved.
CSCE 121: Introduction to Program Design and Concepts J. Michael Moore Fall 2014 Set 11: More Input and Output 1 Based on slides created by Bjarne.
Tutorial 14 Working with Forms and Regular Expressions.
Regular Expressions Dr. Ralph D. Westfall May, 2011.
CSCE 121: Introduction to Program Design and Concepts J. Michael Moore Spring 2015 Set 20: Regular Expressions 1.
Working with Strings Lecture 2 Hartmut Kaiser
 A database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. What is Database?
Chapter 11 Customizing I/O Bjarne Stroustrup
2.6 Protecting Individual Citizens 1 st & 4 th Amendments In Depth Government & Citizenship Timpanogos High School.
CS 114 – Class 02 Topics  Computer programs  Using the compiler Assignments  Read pages for Thursday.  We will go to the lab on Thursday.
CSE 232: C++ Input/Output Manipulation Built-In and Simple User Defined Types in C++ int, long, short, char (signed, integer division) –unsigned versions.
SIXTH GRADE WRITING CLASS “FREEDOM OF SPEECH” IN THE.
Journal #4 Imagine you are the shark and describe your thoughts. Use complete sentences and write ½ page minimum! 1 You have 10mins to write a CREATIVE,
CS Midterm Study Guide Fall General topics Definitions and rules Technical names of things Syntax of C++ constructs Meaning of C++ constructs.
BANNED BOOKS. #1! 2CvlU.
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of.
Looping and Counting Lecture 3 Hartmut Kaiser
GREP. Whats Grep? Grep is a popular unix program that supports a special programming language for doing regular expressions The grammar in use for software.
The Bill of Rights. Congress shall make no law The Bill of Rights Congress shall make no law a) respecting an establishment of religion,
Basics of Religious Rights. 1 st Amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof;
A first program 1. #include 2. using namespace std; 3. int main() { 4. cout
A First Book of C++: From Here To There, Third Edition2 Objectives You should be able to describe: One-Dimensional Arrays Array Initialization Arrays.
C++ for Engineers and Scientists Second Edition Chapter 7 Completing the Basics.
Unit 11 –Reglar Expressions Instructor: Brent Presley.
Fall 2015CISC/CMPE320 - Prof. McLeod1 CISC/CMPE320 Managers and “mentors” identified on projects page. All member accounts created and projects populated.
Amendment a·mend·ment P Pronunciation Key ( -m nd m nt) n. Pronunciation Key 1. The act of changing for the better; improvement:
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of.
The 1 st Amendment. Brainstorm… Imagine you are in a club or a group and you have a super important message. You need as many people as possible to hear.
A FIRST BOOK OF C++ CHAPTER 14 THE STRING CLASS AND EXCEPTION HANDLING.
Arrays Declaring arrays Passing arrays to functions Searching arrays with linear search Sorting arrays with insertion sort Multidimensional arrays Programming.
-Joseph Beberman *Some slides are inspired by a PowerPoint presentation used by professor Seikyung Jung, which was derived from Charlie Wiseman.
Amendment I Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech,
Searching, Modifying, and Encoding Text. Parts: 1) Forming Regular Expressions 2) Encoding and Decoding.
CSE 232: Moving Data Within a C++ Program Moving Data Within a C++ Program Input –Getting data from the command line (we’ve looked at this) –Getting data.
Civics. 1 st amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the.
LEA 2 Cours de civilisation américaine J. Kempf Americans and religion 1.Centrality in American life 2.An ambiguous separation of churches and State 3.The.
Chapter 4 Strings and Screen I/O. Objectives Define strings and literals. Explain classes and objects. Use the string class to store strings. Perform.
Lesson 4 String Manipulation. Lesson 4 In many applications you will need to do some kind of manipulation or parsing of strings, whether you are Attempting.
C++ First Steps.
CS 330 Class 7 Comments on Exam Programming plan for today:
Perl Regular Expression in SAS
The First Amendment.
1st Amendment Court Cases
Chapter 23 Text Processing
Chapter 2 part #3 C++ Input / Output
Chapter 23 Text Processing
Personal protections and liberties added to the Constitution for you!
Chapter 23 Text Processing
Americans and religion
The First Amendment Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom.
The First Amendment!.
String Processing 1 MIS 3406 Department of MIS Fox School of Business
Chapter 2 part #3 C++ Input / Output
Newspaper bhspioneerspirit.
Presentation transcript:

Chapter 23 Text Processing Bjarne Stroustrup

Overview Application domains Application domains Strings Strings I/O I/O Maps Maps Regular expressions Regular expressions Stroustrup/PPP - Nov'132

Now you know the basics Really! Congratulations! Really! Congratulations! Don’t get stuck with a sterile focus on programming language features Don’t get stuck with a sterile focus on programming language features What matters are programs, applications, what good can you do with programming What matters are programs, applications, what good can you do with programming Text processing Text processing Numeric processing Numeric processing Embedded systems programming Embedded systems programming Banking Banking Medical applications Medical applications Scientific visualization Scientific visualization Animation Animation Route planning Route planning Physical design Physical design Stroustrup/PPP - Nov'133

Text processing “all we know can be represented as text” “all we know can be represented as text” And often is And often is Books, articles Books, articles Transaction logs ( , phone, bank, sales, …) Transaction logs ( , phone, bank, sales, …) Web pages (even the layout instructions) Web pages (even the layout instructions) Tables of figures (numbers) Tables of figures (numbers) Graphics (vectors) Graphics (vectors) Mail Mail Programs Programs Measurements Measurements Historical data Historical data Medical records Medical records … Stroustrup/PPP - Nov'134 Amendment I Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the government for a redress of grievances.

String overview Strings Strings std::string std::string s.size() s.size() s1==s2 s1==s2 C-style string (zero-terminated array of char) C-style string (zero-terminated array of char) or or strlen(s) strlen(s) strcmp(s1,s2)==0 strcmp(s1,s2)==0 std::basic_string, e.g. Unicode strings std::basic_string, e.g. Unicode strings using string = std::basic_string ; using string = std::basic_string ; Proprietary string classes Proprietary string classes Stroustrup/PPP - Nov'135

C++11 String Conversion In, for numerical values In, for numerical values For example: For example: string s1 = to_string(12.333);// "12.333" string s2 = to_string(1+5*6-99/7);// "17" Stroustrup/PPP - Nov'136

String conversion We can write a simple to_string() for any type that has a We can write a simple to_string() for any type that has a “put to” operator<< “put to” operator<< template string to_string(const T& t) { ostringstream os; os << t; return os.str(); } For example: For example: string s3 = to_string(Date(2013, Date::nov, 14)); Stroustrup/PPP - Nov'137

C++11 String Conversion Part of, for numerical destinations Part of, for numerical destinations For example: For example: string s1 = "-17"; int x1 = stoi(s1);// stoi means string to int string s2 = "4.3"; double d = stod(s2);// stod means string to double Stroustrup/PPP - Nov'138

String conversion We can write a simple from_string() for any type that has an We can write a simple from_string() for any type that has an “get from” operator<< “get from” operator<< template T from_string(const string& s) { istringstream is(s); T t; if (!(is >> t)) throw bad_from_string(); return t; } For example: For example: double d = from_string ("12.333"); Matrix m = from_string >("{ {1,2}, {3,4} }"); Stroustrup/PPP - Nov'139

General stream conversion template template Target to(Source arg) { std::stringstream ss; Target result; if (!(ss << arg)// read arg into stream || !(ss >> result)// read result from stream || !(ss >> std::ws).eof())// stuff left in stream? || !(ss >> std::ws).eof())// stuff left in stream? throw bad_lexical_cast(); return result; } string s = to (to (" 12.7 "));// ok // works for any type that can be streamed into and/or out of a string: XX xx = to (to (XX(whatever)));// !!! Stroustrup/PPP - Nov'1310

I/O overview Stroustrup/PPP - Nov'1311 istreamostream ifstreamiostreamofstreamostringstreamistringstream fstreamstringstream Stream I/O in >> xRead from in into x according to x’s format out << x Write x to out according to x’s format in.get(c)Read a character from in into c getline(in,s)Read a line from in into the string s

Map overview Associative containers Associative containers,,,,,, map map multimap multimap set set multiset multiset unordered_map unordered_map unordered_multimap unordered_multimap unordered_set unordered_set unordered_multiset unordered_multiset The backbone of text manipulation The backbone of text manipulation Find a word Find a word See if you have already seen a word See if you have already seen a word Find information that correspond to a word Find information that correspond to a word See example in Chapter 23 See example in Chapter 23 Stroustrup/PPP - Nov'1312

Map overview Stroustrup/PPP - Nov'1313 vector multimap “John Doe” “John Q. Public” Mail_file:

A problem: Read a ZIP code U.S. state abbreviation and ZIP code U.S. state abbreviation and ZIP code two letters followed by five digits two letters followed by five digits string s; while (cin>>s) { if (s.size()==7 && isletter(s[0]) && isletter(s[1]) && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4]) && isdigit(s[5]) && isdigit(s[6])) cout << "found " << s << '\n'; } Brittle, messy, unique code Brittle, messy, unique code Stroustrup/PPP - Nov'1314

A problem: Read a ZIP code Problems with simple solution Problems with simple solution It’s verbose (4 lines, 8 function calls) It’s verbose (4 lines, 8 function calls) We miss (intentionally?) every ZIP code number not separated from its context by whitespace We miss (intentionally?) every ZIP code number not separated from its context by whitespace "TX77845", TX , and ATM77845 "TX77845", TX , and ATM77845 We miss (intentionally?) every ZIP code number with a space between the letters and the digits We miss (intentionally?) every ZIP code number with a space between the letters and the digits TX TX We accept (intentionally?) every ZIP code number with the letters in lower case We accept (intentionally?) every ZIP code number with the letters in lower case tx77845 tx77845 If we decided to look for a postal code in a different format we would have to completely rewrite the code If we decided to look for a postal code in a different format we would have to completely rewrite the code CB3 0DS, DK-8000 Arhus CB3 0DS, DK-8000 Arhus Stroustrup/PPP - Nov'1315

TX st try:wwddddd 1 st try:wwddddd 2 nd (remember ):wwddddd-dddd 2 nd (remember ):wwddddd-dddd What’s “special”? What’s “special”? 3 rd :\w\w\d\d\d\d\d-\d\d\d\d 3 rd :\w\w\d\d\d\d\d-\d\d\d\d 4 th (make counts explicit):\w2\d5-\d4 4 th (make counts explicit):\w2\d5-\d4 5 th (and “special”): \w{2}\d{5}-\d{4} 5 th (and “special”): \w{2}\d{5}-\d{4} But was optional? But was optional? 6 th : \w{2}\d{5}(-\d{4})? 6 th : \w{2}\d{5}(-\d{4})? We wanted an optional space after TX We wanted an optional space after TX 7 th (invisible space): \w{2} ?\d{5}(-\d{4})? 7 th (invisible space): \w{2} ?\d{5}(-\d{4})? 8 th (make space visible): \w{2}\s?\d{5}(-\d{4})? 8 th (make space visible): \w{2}\s?\d{5}(-\d{4})? 9 th (lots of space – or none):\w{2}\s*\d{5}(-\d{4})? 9 th (lots of space – or none):\w{2}\s*\d{5}(-\d{4})? Stroustrup/PPP - Nov'1316

#include #include using namespace std; int main() { ifstream in("file.txt");// input file if (!in) cerr << "no file\n"; regex pat ("\\w{2}\\s*\\d{5}(-\\d{4})?"); // ZIP code pattern // cout << "pattern: " << pat << '\n'; // printing of patterns is not C++11 // … } Stroustrup/PPP - Nov'1317

int lineno = 0; string line;// input buffer while (getline(in,line)) { ++lineno; smatch matches;// matched strings go here if (regex_search(line, matches, pat)) { cout << lineno << ": " << matches[0] << '\n';// whole match if (1<matches.size() && matches[1].matched) cout << "\t: " << matches[1] << '\n‘;// sub-match }} Stroustrup/PPP - Nov'1318

Results Input:address TX77845 ffff tx asasasaa ggg TX howdy zzz TX sss ggg TX cvzcv TX sdsas xxxTx77845xxxTX Output:pattern: "\w{2}\s*\d{5}(-\d{4})?" 1: TX : tx : TX : : TX : : Tx : TX : Stroustrup/PPP - Nov'1319

Regular expression syntax Regular expressions have a thorough theoretical foundation based on state machines Regular expressions have a thorough theoretical foundation based on state machines You can mess with the syntax, but not much with the semantics You can mess with the syntax, but not much with the semantics The syntax is terse, cryptic, boring, useful The syntax is terse, cryptic, boring, useful Go learn it Go learn it Examples Examples Xa{2,3}// Xaa Xaaa Xa{2,3}// Xaa Xaaa Xb{2}// Xbb Xb{2}// Xbb Xc{2,}// Xcc Xccc Xcccc Xccccc … Xc{2,}// Xcc Xccc Xcccc Xccccc … \w{2}-\d{4,5}// \w is letter \d is digit \w{2}-\d{4,5}// \w is letter \d is digit (\d*:)?(\d+) // 124: : (\d*:)?(\d+) // 124: : Subject: (FW:|Re:)?(.*)//. (dot) matches any character Subject: (FW:|Re:)?(.*)//. (dot) matches any character [a-zA-Z] [a-zA-Z_0-9]*// identifier [a-zA-Z] [a-zA-Z_0-9]*// identifier [^aeiouy] // not an English vowel [^aeiouy] // not an English vowel Stroustrup/PPP - Nov'1320

Searching vs. matching Searching for a string that matches a regular expression in an (arbitrarily long) stream of data Searching for a string that matches a regular expression in an (arbitrarily long) stream of data regex_search() looks for its pattern as a substring in the stream regex_search() looks for its pattern as a substring in the stream Matching a regular expression against a string (of known size) Matching a regular expression against a string (of known size) regex_match() looks for a complete match of its pattern and the string regex_match() looks for a complete match of its pattern and the string Stroustrup/PPP - Nov'1321

Table grabbed from the web KLASSE ANTAL DRENGE ANTAL PIGER ELEVER IALT 0A A7815 1B A A A7714 4B A A B A G358 7I7310 8A A MO325 0P1112 0P B448 10CE011 1MO8513 2CE8513 3DCE336 4MO415 6CE347 8CE448 9CE4913 REST5611 Alle klasser Stroustrup/PPP - Nov'1322 Numeric fields Text fields Invisible field separators Semantic dependencies i.e. the numbers actually mean something first row + second row == third row Last line are column sums

Describe rows Header line Header line Regular expression:^[\w ]+([\w ]+)*$ Regular expression:^[\w ]+([\w ]+)*$ As string literal:"^[\\w ]+([\\w ]+)*$" As string literal:"^[\\w ]+([\\w ]+)*$" Other lines Other lines Regular expression: ^([\w ]+)(\d+)(\d+)(\d+)$ Regular expression: ^([\w ]+)(\d+)(\d+)(\d+)$ As string literal: "^([\\w ]+)(\\d+)(\\d+)(\\d+)$" As string literal: "^([\\w ]+)(\\d+)(\\d+)(\\d+)$" Aren’t those invisible tab characters annoying? Aren’t those invisible tab characters annoying? Define a tab character class Define a tab character class Aren’t those invisible space characters annoying? Aren’t those invisible space characters annoying? Use \s Use \s Stroustrup/PPP - Nov'1323

Simple layout check int main() { ifstream in("table.txt");// input file if (!in) error("no input file\n"); string line;// input buffer string line;// input buffer int lineno = 0; regex header( "^[\\w ]+([\\w ]+)*$"); // header line regex row( "^([\\w ]+)(\\d+)(\\d+)(\\d+)$"); // data line // … check layout … // … check layout …} Stroustrup/PPP - Nov'1324

Simple layout check int main() { // … open files, define patterns … if (getline(in,line)) {// check header line smatch matches; if (!regex_match(line, matches, header)) error("no header"); } while (getline(in,line)) {// check data line ++lineno; smatch matches; if (!regex_match(line, matches, row)) error("bad line", to_string(lineno)); }} Stroustrup/PPP - Nov'1325

Validate table int boys = 0;// column totals int girls = 0; while (getline(in,line)) {// extract and check data smatch matches; if (!regex_match(line, matches, row)) error("bad line"); int curr_boy = from_string (matches[2]);// check row int curr_girl = from_string (matches[3]); int curr_total = from_string (matches[4]); if (curr_boy+curr_girl != curr_total) error("bad row sum"); if (matches[1]=="Alle klasser") {// last line; check columns: if (curr_boy != boys) error("boys don't add up"); if (curr_girl != girls) error("girls don't add up"); return 0; } boys += curr_boy; girls += curr_girl; } Stroustrup/PPP - Nov'1326

Application domains Text processing is just one domain among many Text processing is just one domain among many Or even several domains (depending how you count) Or even several domains (depending how you count) Browsers, Word, Acrobat, Visual Studio, … Browsers, Word, Acrobat, Visual Studio, … Image processing Image processing Sound processing Sound processing Data bases Data bases Medical Medical Scientific Scientific Commercial Commercial … Numerics Numerics Financial Financial Real-time control Real-time control … Stroustrup/PPP - Nov'1327