My Redneck Brother's Tire Size, and Other Unrelated Topes Christopher Scaffidi Carnegie Mellon University.


1 My Redneck Brother's Tire Size, and Other Unrelated Topes Christopher Scaffidi Carnegie Mellon University

2 Even when lives are at stake, people still make typos.
Hurricane Katrina “Person Locator” Web site
intro ● current practice ● real data ● topes ● closing vignette

3 Data errors reduce the usefulness of data.
Even little typos impede data de-duplication.
Age is not useful for flying my helicopter to come rescue you. Age belongs in the description/additional information field so I can recognize you.
And a “city name” with 1 letter is no use at all.

4 The website creators omitted input validation.
Reasons: They thought…
–it would be too hard to write the validation.
–catching obviously-wrong inputs would prevent collecting maybe-correct data.
This is the UI code for the web form where people could type in the data. A RAD tool called CodeCharge Studio was used to create the UI.

5 Outline and main points
1. Current practice
–Currently, writing validation code is hard…
2. Real data
–because how do you express, “this is questionable data?”
3. Topes
–Topes can express that, and they’re easy to create, too.
4. Closing vignette: my brother’s truck tires
–Email me some vignettes of your own.

6 Programmers have lots of tricks to simplify writing validation code.
Split inputs into multiple easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?
Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs are sometimes good!)
I implemented this code for NJTransit.com.

7 Even with these tricks, writing validation is still very time-consuming.
Overall, the site had over 1100 lines of JavaScript just for validation…
Plus equivalent server-side Java code (can’t trust some users)
Sample code below.
    if (!rfcCheckEmail(frm.primaryemail.value))
        return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");
    var atloc = frm.primaryemail.value.indexOf('@');
    if (atloc > 31 || atloc < frm.primaryemail.value.length-33)
        return messageHelper(frm.primaryemail,
            "Sorry. You may only enter 32 characters or less for your email name\r\n" +
            "and 32 characters or less for your email domain (including @).");

8 That was worst case. Best case: reusable regexps.
Many IDEs allow the programmer to enter one regular expression for validating each input field.
–Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
–Unfortunately…

9 Regexps are a good bullet but not a silver bullet, so lots of data goes unvalidated.
The world is full of programmers who can’t read regexps.
–Do a search on Google some time for “regular expressions” and read what people say in the forums.
–The USA alone has over 55 million non-expert creators of web sites, databases, and spreadsheets (which have most of the same data problems that web sites do).
Regexps only work for data where you can say, “Yes, this is definitely ok” or “No, this is definitely wrong”.
–What would a regexp for a valid company name look like?

10 So we did a preliminary review of real data needing validation.
Sources:
–Comments from Information Week readers in response to a survey
–Observations of people as they created and used websites
–Many Hurricane Katrina sites
–Cursory browsing of the EUSES spreadsheet corpus
–Browsing around the web
–My own experience as a professional webapp developer
We found 3 primary problems with regexps…

11 1. Real data doesn’t always conform well to the simple “binary” regexp model.
Data is sometimes questionable… yet valid.
–Remember the suspiciously long NJTransit email address?
–In practice, person names and other proper nouns are never validated with regexps… too brittle.
–Life is full of corner cases and exceptions.
If your code can identify questionable data, then it can double-check the data:
–Ask an application end user to confirm the input
–Flag the input for checking by a system administrator
–Compare the value to a list of known allowable exceptions
–Call up a server and see if it can confirm the value
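The idea of "questionable yet valid" data can be sketched as a validator that returns a score instead of a yes/no answer. This is only an illustration: the names `scoreEmail` and `triage`, the 0.5 score, and the 32-character threshold (borrowed from the NJTransit error message) are assumptions, not part of the talk's tools.

```javascript
// Hypothetical scoring function: 0 = definitely invalid, 1 = definitely valid,
// anything in between = questionable.
function scoreEmail(input) {
  const at = input.indexOf("@");
  // No "@", nothing before it, or nothing after it: definitely wrong.
  if (at <= 0 || at === input.length - 1) return 0;
  const name = input.slice(0, at);
  const domain = input.slice(at + 1);
  // Unusually long parts are suspicious, but they may be real addresses.
  if (name.length > 32 || domain.length > 32) return 0.5;
  return 1;
}

// Map a score to one of three actions instead of a binary accept/reject.
function triage(score) {
  if (score === 0) return "reject";
  if (score < 1) return "confirm"; // e.g. ask the user, or flag for an administrator
  return "accept";
}

console.log(triage(scoreEmail("rider@example.com"))); // accept
console.log(triage(scoreEmail("a".repeat(40) + "@example.com"))); // confirm
console.log(triage(scoreEmail("no-at-sign"))); // reject
```

The "confirm" branch is where any of the double-checking strategies listed above would plug in.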

12 2. Real data often can occur in multiple different formats.
Two different strings can be equivalent.
–How many ways can you write a date?
–What if an end user types a date in the wrong format?
–“Jan-1-2007” and “1/1/2007” mean the same thing because of the category that they are in: date.
–Sometimes the interpretation is ambiguous. In real life, we use preferences and experience to guide interpretation.
If your code can transform among formats (i.e., not just recognize formats with regexps), then it can put data in an unambiguous format of your choice.
–Display the result so users can fix interpretations if needed
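A rough sketch of transforming among date formats, rather than merely recognizing them, might look like the following. The function names are hypothetical, the canonical format (year-month-day) is an arbitrary choice, and slash dates are assumed to be month-first; range checks on the parts are deliberately left out.

```javascript
const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"];

// Each parser returns {year, month, day}, or null if the format does not match.
function parseSlashDate(s) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(s);
  // Assumption: the month comes first, as in "1/31/2007".
  return m ? { month: Number(m[1]), day: Number(m[2]), year: Number(m[3]) } : null;
}

function parseNamedDate(s) {
  const m = /^([A-Za-z]{3})-(\d{1,2})-(\d{4})$/.exec(s);
  if (!m) return null;
  const month = MONTHS.indexOf(m[1].toLowerCase()) + 1;
  return month > 0 ? { month, day: Number(m[2]), year: Number(m[3]) } : null;
}

// Transform any recognized format into one unambiguous format.
function toCanonicalDate(s) {
  const parts = parseSlashDate(s) || parseNamedDate(s);
  if (!parts) return null;
  const pad = (n) => String(n).padStart(2, "0");
  return `${parts.year}-${pad(parts.month)}-${pad(parts.day)}`;
}

// Two strings are equivalent if they transform to the same canonical value.
function sameDate(a, b) {
  const canonical = toCanonicalDate(a);
  return canonical !== null && canonical === toCanonicalDate(b);
}

console.log(toCanonicalDate("Jan-1-2007")); // 2007-01-01
console.log(sameDate("Jan-1-2007", "1/1/2007")); // true
```

Showing the canonical value back to the user gives them a chance to correct a wrong interpretation, such as a day-first date read as month-first.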

13 3. The meaning of data is often tied to its “parts”, not directly to its characters.
Real data often has parts, each with its own meaning.
–What are the parts of a date, 1/1/2007?
–Valid data conforms to intra- and inter-part constraints.
–Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others.
–No wonder most people can’t read or write regexps.
If your code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain?
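One way to picture "state the parts and the constraints" is a small declarative spec plus a generic scorer. This is a sketch under assumptions: the spec layout, the `score` function, and the "plausible year" range are invented for illustration and are not the notation of the tope editor.

```javascript
function daysInMonth(month, year) {
  const leap = (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
  return [31, leap ? 29 : 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1];
}

const dateSpec = {
  parts: ["month", "day", "year"],
  // Mandatory constraints: violating any one makes the value invalid.
  hard: [
    (d) => d.month >= 1 && d.month <= 12,          // intra-part
    (d) => d.day >= 1,                             // intra-part
    (d) => d.day <= daysInMonth(d.month, d.year),  // inter-part
  ],
  // Optional constraints: violating one makes the value questionable.
  soft: [
    (d) => d.year >= 1900 && d.year <= 2100,       // assumed plausible range
  ],
};

// Generic scorer: 0 = invalid, 0.5 = questionable, 1 = valid.
function score(spec, value) {
  const hasParts = spec.parts.every((p) => Number.isInteger(value[p]));
  if (!hasParts || !spec.hard.every((check) => check(value))) return 0;
  return spec.soft.every((check) => check(value)) ? 1 : 0.5;
}

console.log(score(dateSpec, { month: 1, day: 1, year: 2007 }));  // 1
console.log(score(dateSpec, { month: 2, day: 30, year: 2007 })); // 0
console.log(score(dateSpec, { month: 1, day: 1, year: 1492 }));  // 0.5
```

The "February has at most 29 days" rule is a one-line inter-part constraint here, whereas encoding it in a single regexp over characters is far more awkward.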

14 Imagine a world…
Where your code could say to an oracle, “Is this input a company name?”, and the oracle would say yes, no, almost definitely, probably not, and other shades of gray.
Where your code could accept an input in any reasonable format, since your code could ask the oracle to put the input into whatever format you actually want.
Where you could teach the oracle about a new category of data by concisely stating the parts and constraints of that data.

15 Tope = an abstraction for a data category
Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain
Topes in practice:
1. People implement new topes by using the basic tope editor (or another language such as JavaScript).
2. People publish tope implementations on repositories.
3. People download tope implementations to local caches.
4. Tool plug-ins let people browse their local cache and associate topes with variables and input fields.
5. Plug-ins use tope implementations from local cache to recognize, transform, and equivalence-test data.

16 A tope is a graph. Node = format, edge = transformation
A notional representation for a CMU room number tope.
–Note that edges (transformations) can be chained
Formal building name & room number: Elliot Dunlap Smith Hall 225
Building abbreviation & room number: EDSH 225
Colloquial building name & room number: Smith 225
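The graph view, including chained transformations, can be sketched as follows. The single-building lookup tables, the format names, and the choice of which edges exist (formal to abbreviation, abbreviation to colloquial) are assumptions made for this example; the slide's diagram may connect the formats differently.

```javascript
// Hypothetical lookup tables covering a single building.
const FORMAL_TO_ABBREV = new Map([["Elliot Dunlap Smith Hall", "EDSH"]]);
const ABBREV_TO_COLLOQUIAL = new Map([["EDSH", "Smith"]]);

// Build a transformation that swaps the building part and keeps the room number.
function mapBuilding(table) {
  return (s) => {
    const i = s.lastIndexOf(" "); // the room number follows the last space
    const building = table.get(s.slice(0, i));
    return building === undefined ? null : `${building} ${s.slice(i + 1)}`;
  };
}

// Edges of the graph: from-format -> to-format -> transformation function.
const edges = {
  formal: { abbrev: mapBuilding(FORMAL_TO_ABBREV) },
  abbrev: { colloquial: mapBuilding(ABBREV_TO_COLLOQUIAL) },
  colloquial: {},
};

// Breadth-first search for the shortest chain of formats from one node to another.
function findPath(from, to) {
  const queue = [[from]];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const path = queue.shift();
    const last = path[path.length - 1];
    if (last === to) return path;
    for (const next of Object.keys(edges[last])) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push([...path, next]);
      }
    }
  }
  return null;
}

// Apply each transformation along the path in turn.
function transform(value, from, to) {
  const path = findPath(from, to);
  if (path === null) return null;
  let current = value;
  for (let i = 0; i + 1 < path.length && current !== null; i++) {
    current = edges[path[i]][path[i + 1]](current);
  }
  return current;
}

console.log(transform("Elliot Dunlap Smith Hall 225", "formal", "colloquial")); // Smith 225
```

No direct formal-to-colloquial edge is defined; the result comes from chaining two edges.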

17 A tope is a conceptual abstraction. A tope implementation is code you can run.
Each tope implementation contains executable functions:
–1 isa: string → [0,1] function per format, for recognizing instances of the format
–0 or more trf: string → string functions linking formats, for transforming values from one format to another
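As a sketch of what such an implementation could look like in JavaScript, the object below holds one `isa` per format and a list of `trf` functions. The object layout, the regular expressions, the 0.5 score for odd capitalization, and the "abbreviation = initials" rule are all assumptions for illustration, not the actual tope implementation format.

```javascript
const roomTope = {
  formats: {
    // One isa per format: string -> number in [0, 1].
    abbrev: {
      isa: (s) => {
        if (/^[A-Z]{2,5} \d{1,4}$/.test(s)) return 1;
        if (/^[A-Za-z]{2,5} \d{1,4}$/.test(s)) return 0.5; // right shape, odd capitalization
        return 0;
      },
    },
    formal: {
      isa: (s) => (/^([A-Z][a-z]+ )+Hall \d{1,4}$/.test(s) ? 1 : 0),
    },
  },
  // Zero or more trf functions: string -> string.
  trfs: [
    {
      from: "formal",
      to: "abbrev",
      // Assumption: the abbreviation is the initials of the building name.
      trf: (s) => {
        const words = s.split(" ");
        const room = words.pop();
        return words.map((w) => w[0]).join("") + " " + room;
      },
    },
  ],
};

// Ask every format's isa and report the best match.
function recognize(tope, input) {
  let best = { format: null, score: 0 };
  for (const [name, format] of Object.entries(tope.formats)) {
    const score = format.isa(input);
    if (score > best.score) best = { format: name, score };
  }
  return best;
}

console.log(recognize(roomTope, "EDSH 225")); // { format: 'abbrev', score: 1 }
console.log(roomTope.trfs[0].trf("Elliot Dunlap Smith Hall 225")); // EDSH 225
```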

18 Common kinds of topes: enumerations and proper nouns
Multi-format, non-binary enums, e.g.: US states
–Fixed list of definitely valid names (e.g.: “Maryland”)
–Transformed to other formats via lookup tables (“MD”)
–Augmented with a list of unusual values that technically might be ok in some circumstances (“PR”)
Open-set proper nouns, e.g.: Company names
–You certainly can’t list all of these
–Collect a whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corporation”, “GOOG”)
–Augment with a pattern for recognizing promising inputs that are not yet on the whitelist
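A multi-format, non-binary enum for US states might be sketched like this. The tables are abbreviated to three states, and the function names and the 0.5 score for unusual values are illustrative assumptions.

```javascript
// Abbreviated, hypothetical tables; a real tope would list every state.
const NAME_TO_CODE = new Map([
  ["Maryland", "MD"],
  ["New Jersey", "NJ"],
  ["Pennsylvania", "PA"],
]);
// Reverse lookup table, derived so the two formats stay in sync.
const CODE_TO_NAME = new Map([...NAME_TO_CODE].map(([name, code]) => [code, name]));
// Unusual values that might be acceptable in some circumstances.
const UNUSUAL_CODES = new Set(["PR"]);

function isaStateName(s) {
  return NAME_TO_CODE.has(s) ? 1 : 0;
}

function isaStateCode(s) {
  if (CODE_TO_NAME.has(s)) return 1;
  if (UNUSUAL_CODES.has(s)) return 0.5; // questionable, not rejected outright
  return 0;
}

// Transformations between formats are plain table lookups.
function nameToCode(s) {
  return NAME_TO_CODE.has(s) ? NAME_TO_CODE.get(s) : null;
}

function codeToName(s) {
  return CODE_TO_NAME.has(s) ? CODE_TO_NAME.get(s) : null;
}

console.log(isaStateCode("MD")); // 1
console.log(isaStateCode("PR")); // 0.5
console.log(nameToCode("Maryland")); // MD
```

An open-set proper noun such as a company name could follow the same shape, with a pattern match returning an intermediate score for inputs that are not yet on the whitelist.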

19 Two more common kinds of topes: numeric and hierarchical
Numeric, e.g.: area codes
–Check that inputs are numeric and in a certain range
–Values outside the range might be valid but questionable
–Very rarely, numeric data are explicitly flagged with a unit
Hierarchical, e.g.: address lines
–Parts are described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
–Each part has its own internal constraints; the hierarchical tope may add inter-part constraints.
–Simple isa functions can be implemented with regexps.
–Transformations involve permutation of parts, changes to separators, simple arithmetic, and lookup tables.
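The two kinds can be sketched together: a numeric `isa` with a soft range, and a hierarchical `isa` that delegates to part-level functions. The specific ranges, the street-suffix list, and the rule "overall score = lowest part score" are assumptions chosen for this example.

```javascript
// Numeric: three digits; the 200-999 range is an illustrative assumption.
function isaAreaCode(s) {
  if (!/^\d{3}$/.test(s)) return 0;
  const n = Number(s);
  return n >= 200 && n <= 999 ? 1 : 0.5; // out of range: questionable
}

// Parts of an address line, each described by its own small isa.
const STREET_SUFFIXES = new Set(["St.", "Ave.", "Rd.", "Blvd."]); // hypothetical enum

function isaStreetNumber(s) {
  if (!/^\d+$/.test(s)) return 0;
  return Number(s) <= 99999 ? 1 : 0.5; // very large numbers are questionable
}

function isaStreetName(s) {
  if (/^[A-Z][a-z]+( [A-Z][a-z]+)*$/.test(s)) return 1; // capitalized words
  if (/^[A-Za-z ]+$/.test(s)) return 0.5;               // letters, odd capitalization
  return 0;
}

function isaStreetSuffix(s) {
  return STREET_SUFFIXES.has(s) ? 1 : 0;
}

// Hierarchical: split into parts, then combine the part scores.
function isaAddressLine(s) {
  const m = /^(\S+) (.+) (\S+)$/.exec(s);
  if (!m) return 0;
  return Math.min(isaStreetNumber(m[1]), isaStreetName(m[2]), isaStreetSuffix(m[3]));
}

console.log(isaAreaCode("412")); // 1
console.log(isaAddressLine("100 Main St.")); // 1
console.log(isaAddressLine("100 main St.")); // 0.5
```

Taking the minimum means one invalid part invalidates the whole line, while a merely questionable part makes the whole line questionable.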

20 We have a tool to help people succinctly express common kinds of topes.
Features:
–Format inference
–Format/part names
–Soft constraints
–“isa” generation
–Testing features
–Format reusability
(Similar UI style for implementing trfs)

21 And we have plug-ins for using topes in web forms, databases, and spreadsheets.
Visual Studio: drag-and-drop code generation
Microsoft Excel: buttons and menus

22 We have conducted a variety of evaluations.
Expressiveness:
–We have implemented formats for dozens of kinds of data from (1) the EUSES spreadsheet corpus, (2) Hurricane Katrina and Google Base website data, and (3) logs of admin assistants’ web browsing
–… and topes were very effective at identifying data errors.
Usability:
–Controlled experiment shows that our format editor enables admin assistants and master’s students to validate data more quickly and accurately than with Lapis patterns or with regexps.

23 For more details…
Ask me for the papers on…
–Surveys and other studies of programmers and users
–The topes model
–Our user interfaces
–Our evaluations
Ask me for the tools:
–Some modules are already open-sourced
–Modules have a clean API (if you just want a binary)
–The evaluations pointed out some places for improvement (when this is done, the rest will be open-sourced, too)

24 A closing vignette
This vignette illustrates many of the characteristics of data that I mentioned today.
I would value similar (true story) vignettes from you
–To help highlight what real data looks like
–To help communicate the concept of a tope
–To provide me with test cases for topes

25 My brother (Ben) and I hit a rake while driving his truck around the backyard. We got a flat.

26 Fortunately, Ben knows A LOT about trucks… and their tires.
Observe the pencil marks on the tire, where my brother drew while explaining what the parts of the tire size meant. (I tweaked the contrast on this image to make the lettering stand out.)

27 So Ben went online to order a tire.
Observe the red neck and boots.
My brother is very web savvy. He is an electrician by day, but he assembles computers and sets up web sites as a side job.

28 But even though Ben knows tires really well, he couldn’t implement tire size validation.
Each part has meaning (cross section, sidewall aspect ratio, internal construction, etc).
Though parts must be selected from a simple enumeration, there are inter-part constraints.
“Questionable” sizes? Dunno. Maybe those that are reasonable but hard-to-find?

29 Summary
Real data is full of input errors.
Real validation is currently hard to write.
Topes enable accurate, convenient validation by capturing soft constraints and the multiple formats of real data.
Please email me some vignettes of your own.

30 Thank you…
… to lots of people (including Mary Shaw, Brad Myers, Jim Herbsleb, and Jonathan Aldrich) for encouraging feedback about topes and lots of suggestions.
… to anybody who emails me a vignette.
… to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)

