Topes: Reusable Abstractions for Validating Data. Christopher Scaffidi, Brad Myers, Mary Shaw. Carnegie Mellon University.


1 Topes: Reusable Abstractions for Validating Data
Christopher Scaffidi, Brad Myers, Mary Shaw
Carnegie Mellon University

2 Even when lives are at stake, people still make typos.
Hurricane Katrina “Person Locator” Web site
Problem → Topes → Validation → Conclusion

3 Data errors reduce the usefulness of data.
Wrong data category
Questionable input
Incorrect formatting

4 The website creators omitted input validation.
Primary reason: rejecting obviously wrong inputs would prevent collecting questionable data
– E.g.: Would you accept a city with 1 letter?
This is the UI code for the web form where users entered data for this website.
A RAD tool called CodeCharge Studio was used to create the UI.

5 This site was not alone in lacking input validation.
E.g.: Google Base web application
– 13 primary web forms
– Even numeric fields accept unreasonable inputs (such as a salary of “-45”)
E.g.: Spreadsheets
– 40% of cells are non-numeric, non-date textual data
– Commonly used to gather and organize textual data for reports

6 Validation of these short human-readable strings must support…
Testing membership in a data category
– Categories based on standards (e.g.: email address)
– Categories lacking standards (e.g.: city name)
Ambiguously defined categories
– Identify questionable values for double-checking
Multiple formats
– Format consistency, post-validation
Platform-independent implementation
– Reuse in webapps, spreadsheets, others

7 Limitations of existing approaches
Types do not support questionable values
Grammars do not, either, nor can they reformat
Information extraction algorithms rely on grammatical cues that are absent during validation
Cues, Forms/3, λ-calculus, Slate, pollution markers, etc., infer numerical constraints but not constraints on strings, nor are they platform-independent

8 New Approach: Topes
A tope = a platform-independent abstraction describing how to recognize and transform strings in one category of data
Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain
Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification

9 A tope is a graph.
Node = format, edge = transformation
Notional representation for a CMU room number tope…
– Formal building name & room number: Elliot Dunlap Smith Hall 225
– Colloquial building name & room number: Smith 225
– Building abbreviation & room number: EDSH 225
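A minimal sketch of the graph view of the room number tope, assuming a hypothetical in-memory representation in which format names are nodes and each edge carries a transformation function. The format names, the one-entry lookup tables, and the `transform` helper are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

# Hypothetical lookup tables covering a single building.
FORMAL_TO_ABBREV = {"Elliot Dunlap Smith Hall": "EDSH"}
ABBREV_TO_COLLOQUIAL = {"EDSH": "Smith"}

def formal_to_abbrev(value):
    # Split off the trailing room number, then map the building name.
    building, room = value.rsplit(" ", 1)
    return f"{FORMAL_TO_ABBREV[building]} {room}"

def abbrev_to_colloquial(value):
    building, room = value.rsplit(" ", 1)
    return f"{ABBREV_TO_COLLOQUIAL[building]} {room}"

# Edges of the graph: (source format, target format) -> transformation.
EDGES = {
    ("formal", "abbrev"): formal_to_abbrev,
    ("abbrev", "colloquial"): abbrev_to_colloquial,
}

def transform(value, src, dst):
    """Breadth-first search for a chain of transformations from src to dst."""
    queue = deque([(src, value)])
    seen = {src}
    while queue:
        fmt, current = queue.popleft()
        if fmt == dst:
            return current
        for (a, b), fn in EDGES.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, fn(current)))
    raise ValueError(f"no transformation path from {src} to {dst}")
```

With these assumed tables, `transform("Elliot Dunlap Smith Hall 225", "formal", "colloquial")` chains two edges, since no direct edge is defined between those formats.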

10 A tope is a conceptual abstraction. A tope implementation is code.
Each tope implementation has executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format (a fuzzy set)
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
Validation function: φ(str) = max(isa_f(str)), where f ranges over the tope’s formats
– Valid when φ(str) = 1
– Invalid when φ(str) = 0
– Questionable when 0 < φ(str) < 1
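The validation rule on this slide can be sketched as follows. This is an illustration under stated assumptions: the two `isa` functions, their regular expressions, the known-name sets, and the score 0.5 for well-shaped but unknown names are all hypothetical; only the max-over-formats rule and the three outcomes come from the slide.

```python
import re

KNOWN_CODES = {"EDSH"}    # assumed list of known building abbreviations
KNOWN_NAMES = {"Smith"}   # assumed list of known colloquial names

def isa_abbrev(s):
    """isa for the 'abbreviation + room' format, returning a score in [0, 1]."""
    m = re.fullmatch(r"([A-Z]{2,4}) (\d{3,4})", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in KNOWN_CODES else 0.5

def isa_colloquial(s):
    """isa for the 'colloquial name + room' format, returning a score in [0, 1]."""
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{3,4})", s)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in KNOWN_NAMES else 0.5

ISAS = [isa_abbrev, isa_colloquial]

def phi(s, isas=ISAS):
    """Validation function: the maximum isa score over the tope's formats."""
    return max(isa(s) for isa in isas)

def classify(s, isas=ISAS):
    score = phi(s, isas)
    if score == 1:
        return "valid"
    if score == 0:
        return "invalid"
    return "questionable"
```

Taking the maximum means a string only has to fit one format well to be accepted, while a string that partially fits some format is flagged for double-checking rather than rejected.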

11 Common kinds of topes: enumerations and proper nouns
Multi-format enumerations, e.g.: US states
– “New York”, “CA”, maybe “Guam”
Open-set proper nouns, e.g.: company names
– Whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corp”, “GOOG”)
– Augmented with a pattern for promising inputs that are not yet on the whitelist
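A rough sketch of these two kinds of topes, assuming hypothetical `isa_state` and `isa_company` functions. The truncated state table, the treatment of “Guam” as questionable with score 0.5, and the "promising input" pattern are assumptions made for illustration; only the example values are taken from the slide.

```python
import re

# Multi-format enumeration: full name and postal abbreviation (truncated list).
STATES = {"New York": "NY", "California": "CA"}
TERRITORIES = {"Guam": "GU"}   # the slide's "maybe" case

def isa_state(s):
    if s in STATES or s in STATES.values():
        return 1.0
    if s in TERRITORIES or s in TERRITORIES.values():
        return 0.5   # assumed score for a questionable member
    return 0.0

# Open-set proper noun: whitelist plus a pattern for promising new names.
WHITELIST = {"Google", "Google Corp", "GOOG"}
PROMISING = re.compile(r"[A-Z][A-Za-z0-9&.' -]{1,39}")   # assumed pattern

def isa_company(s):
    if s in WHITELIST:
        return 1.0   # definitely valid
    if PROMISING.fullmatch(s):
        return 0.5   # plausible, but not yet on the whitelist
    return 0.0
```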

12 Two more common kinds of topes: numeric and hierarchical
Numeric, e.g.: human masses
– Numeric and in a certain range
– Values slightly outside range might be questionable
– (Very rarely) labeled with an explicit unit
– Transformation usually by multiplication
Hierarchical, e.g.: address lines
– Parts described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
– Simple isas can be implemented with regexps.
– Transformations involve permutation of parts, changes to separators, arithmetic, and lookup tables.
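The numeric and hierarchical cases might look like the sketch below. The range thresholds, the score 0.3 for out-of-range masses, the street-suffix table, and all function names are assumptions for illustration; the slide states only the general shape (a range with a questionable margin, multiplication, regexps, lookup tables).

```python
import re

def isa_mass_kg(s):
    """Numeric tope: human mass in kilograms (thresholds are assumed)."""
    try:
        x = float(s)
    except ValueError:
        return 0.0
    if 2 <= x <= 200:
        return 1.0
    if 0 < x <= 650:
        return 0.3   # outside the usual range: questionable, not invalid
    return 0.0

def kg_to_lb(x):
    """Transformation by multiplication (1 kg is about 2.20462 lb)."""
    return x * 2.20462

# Hierarchical tope: number (numeric) + street name (proper noun) + suffix (enum).
SUFFIXES = {"St.": "Street", "Ave.": "Avenue", "Rd.": "Road"}
ADDRESS = re.compile(
    r"(?P<number>\d+) (?P<street>[A-Z][a-z]+(?: [A-Z][a-z]+)*) (?P<suffix>St\.|Ave\.|Rd\.)"
)

def isa_address(s):
    return 1.0 if ADDRESS.fullmatch(s) else 0.0

def expand_suffix(s):
    """Transformation via a lookup table: '100 Main St.' -> '100 Main Street'."""
    m = ADDRESS.fullmatch(s)
    if m is None:
        raise ValueError(f"not in the abbreviated address format: {s!r}")
    return f"{m['number']} {m['street']} {SUFFIXES[m['suffix']]}"
```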

13 Formal tool demonstration on Friday
Features:
– Format inference
– Format/part names
– Soft constraints
– Testing features
– Format reusability

14 Formal tool demonstration on Friday
Microsoft Excel: buttons and menus
Visual Studio: drag-and-drop code generation

15 Evaluating accuracy, reusability, and usefulness for data cleaning
Implemented topes for spreadsheet data
– 32 topes based on 720 online spreadsheets
– Tested accuracy
Reused topes on web application data
– 8 data categories in Google Base and 5 data categories in Hurricane Katrina site
– Tested accuracy
Used transformations to reformat data
– 5 data categories in Hurricane Katrina site
– Measured increase in number of duplicates identified

16 Extracting spreadsheet test data
Cluster spreadsheet columns based on data category
– EUSES spreadsheet corpus “database” section
– Hierarchical agglomerative clustering
– Manual inspection
– Result = 1713 columns in 246 clusters (1 cluster per data category)
Created 1 tope for each of the 32 most common categories
– Yielding 32 topes
– Covered 70% of clustered columns

17 We considered 5 validation strategies
Strategy 1: Current spreadsheet practice (accept all inputs)
Strategy 2: Current webapp practice (validate with regexp or fixed list, when available; accept all other inputs)
– 36 regexps + 35 fixed lists, in 7 categories
Strategy 3A: Tope rejecting questionable (accept when φ(str) = 1)
Strategy 3B: Tope accepting questionable (accept when φ(str) > 0)
Strategy 4: Tope warn on questionable (simulate double-check by user when 0 < φ(str) < 1)
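The five strategies can be compared side by side as acceptance rules. This is a sketch, not the evaluation harness used in the study: the `decide` function, its parameters, and the toy `toy_phi` scorer are hypothetical, and the simulated user is modeled as a callback.

```python
def decide(strategy, s, phi, baseline=None, ask_user=None):
    """Return True if input s is accepted under the given strategy.

    phi      -- validation function mapping a string to a score in [0, 1]
    baseline -- optional regexp/fixed-list check for Strategy 2 (assumed callable)
    ask_user -- callback simulating the user's double-check for Strategy 4
    """
    if strategy == "1":      # accept all inputs
        return True
    if strategy == "2":      # use a regexp or fixed list when one is available
        return baseline(s) if baseline is not None else True
    score = phi(s)
    if strategy == "3A":     # reject questionable inputs
        return score == 1
    if strategy == "3B":     # accept questionable inputs
        return score > 0
    if strategy == "4":      # warn on questionable inputs and defer to the user
        if 0 < score < 1:
            return ask_user(s)
        return score == 1
    raise ValueError(f"unknown strategy: {strategy}")

def toy_phi(s):
    """Toy scorer used only to exercise the rules above."""
    return {"ok": 1.0, "hmm": 0.5}.get(s, 0.0)
```

For the questionable input `"hmm"`, Strategy 3A rejects, Strategy 3B accepts, and Strategy 4 returns whatever the simulated user decides.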

18 Measurements
Based on 100 random values per category
Used F1 to measure accuracy
– Standard measure of accuracy for classifiers = (precision*recall)/avg(precision,recall)
Considered topes with 1, 2, 3, 4, or 5 formats
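The F1 formula above is the harmonic mean of precision and recall, equivalent to the more familiar 2PR/(P+R). A small sketch follows, assuming that the positive class is "invalid input" (so a true positive is an invalid input that was correctly flagged); the function name and the count-based signature are choices made for this example.

```python
def f1(tp, fp, fn):
    """F1 from counts of true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # (precision * recall) / avg(precision, recall), as written on the slide
    return (precision * recall) / ((precision + recall) / 2)
```

A validator that accepts every input flags nothing, so its recall on invalid inputs is 0 and its F1 is 0 under this definition.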

19 Recognizing multiple formats and questionable inputs raises accuracy
Strategy 4: Hypothetical user has to help on ~3% of inputs
Strategy 1: Recall = 0 (fails to identify any invalid inputs)

20 Topes based on spreadsheet data were accurate on web application data.
(Chart panels: Hurricane Katrina, Google Base)

21 Putting data in a consistent format improves duplicate identification.
Randomly extracted 10000 values for each of 5 Hurricane Katrina data categories
Implemented transformations for each 5-format tope from the less commonly used formats to the most commonly used
Found approximately 8% more duplicates after transformation
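The effect of reformatting on duplicate detection can be illustrated with a toy example. The sample values, the state lookup table, and the helper names are invented for this sketch and are not drawn from the Hurricane Katrina data; the idea shown is only that values are moved to one canonical format before exact-match duplicates are counted.

```python
from collections import Counter

STATE_ABBREV = {"Louisiana": "LA", "Mississippi": "MS"}   # assumed lookup table

def to_canonical(value):
    """Rewrite 'City, State name' into the 'City, XX' format."""
    city, state = value.rsplit(", ", 1)
    return f"{city}, {STATE_ABBREV.get(state, state)}"

def count_duplicates(values):
    """Number of values that repeat an earlier, identical value."""
    return sum(n - 1 for n in Counter(values).values() if n > 1)

raw = [
    "New Orleans, LA",
    "New Orleans, Louisiana",
    "Biloxi, MS",
    "New Orleans, LA",
]
before = count_duplicates(raw)
after = count_duplicates([to_canonical(v) for v in raw])
```

Here the second value only becomes an exact duplicate once it has been transformed, so `after` is larger than `before`.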

22 Topes improve data validation
Validating with topes improves
– Accuracy of validation
– Reusability of validation code
– Subsequent duplicate identification
Contributions:
– Support for ambiguous data categories
– Support for transforming values
– Platform-independent validation

23 Future Work: Sharing topes
Repository search mechanisms based on
– Relevance to new applications
– Quality criteria
Integrate with more programming platforms
– Microsoft Excel
– Microsoft Visual Studio.NET
– A simple XML processing API
– Univ. Nebraska’s Robofox
– IBM’s CoScripter
– Your tool or platform?

24 Thank You…
To Jeff Magee, Betty Cheng, Barbara Ryder, Margaret Burnett, and others at ICSE 2007 for early feedback
To NSF for funding
To ICSE 2008 for this opportunity to present

