Advanced data preparation operators and Data-Knoller


1 Advanced data preparation operators and Data-Knoller
Prof. Felix Naumann, Lan Jiang, John Koumarelas
Campus II F-E.06

2 Robust operator implementation
Operator executability: under which preconditions can an operator run?
Example: changeDateFormat(property: shipDate, sourceFormat: ‘mmddyyyy’, targetFormat: ‘dd-mm-yyyy’)
Preconditions:
  Metadata <Date format, ‘mmddyyyy’, shipDate> exists
  targetFormat is a valid date pattern
Data Preparation for Science, Introduction WS 18/19
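The two preconditions above can be sketched as a check that runs before the operator. This is a minimal illustration, not Data-Knoller's actual API: the class name, the method `isExecutable`, and the plain `Map` from property name to date format standing in for the metadata store are all assumptions. The slide writes the pattern informally as ‘mmddyyyy’; in `java.time` the month is `MM` (`mm` means minutes), so the sketch expects Java-style patterns.

```java
import java.time.format.DateTimeFormatter;
import java.util.Map;

// Hypothetical helper; not part of Data-Knoller.
final class ChangeDateFormatPrecondition {

    /**
     * Returns true only if both preconditions hold.
     * dateFormatMetadata is an assumed stand-in for the metadata store:
     * it maps a property name to its known date format.
     */
    static boolean isExecutable(Map<String, String> dateFormatMetadata,
                                String property,
                                String sourceFormat,
                                String targetFormat) {
        // Precondition 1: metadata <Date format, sourceFormat, property> exists.
        if (!sourceFormat.equals(dateFormatMetadata.get(property))) {
            return false;
        }
        // Precondition 2: targetFormat is a syntactically valid date pattern.
        try {
            DateTimeFormatter.ofPattern(targetFormat);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

Note that this only checks pattern syntax; a pattern such as `dd-mm-yyyy` is accepted by `java.time` even though it would print minutes instead of months.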

3 Robust operator implementation
Error handling: cover the cases that might produce errors.
Example: changeDateFormat(property: shipDate, sourceFormat: ‘mmddyyyy’, targetFormat: ‘dd-mm-yyyy’)
Values that can cause errors:
  orderDate: 4/12/2018
  orderDate: null
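One way to cover the two problem cases listed on this slide (a null value and a value that does not match the source format) is to return an empty result instead of throwing. The sketch below is an assumption about how such handling could look, using only `java.time`; the class name is hypothetical and Java-style patterns (`MM` for month) replace the slide's informal ‘mm’.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper; not part of Data-Knoller.
final class SafeChangeDateFormat {

    /** Converts one value, or returns Optional.empty() if it cannot be converted. */
    static Optional<String> changeDateFormat(String value, String sourceFormat, String targetFormat) {
        // Case 1: missing value.
        if (value == null) {
            return Optional.empty();
        }
        try {
            LocalDate date = LocalDate.parse(value.trim(), DateTimeFormatter.ofPattern(sourceFormat));
            return Optional.of(date.format(DateTimeFormatter.ofPattern(targetFormat)));
        } catch (DateTimeParseException e) {
            // Case 2: the value does not match the declared source format.
            return Optional.empty();
        }
    }
}
```

With source pattern `MMddyyyy` and target pattern `dd-MM-yyyy`, the value `04122018` becomes `12-04-2018`, while `4/12/2018` and `null` both yield an empty result that the caller can log or route to an error column.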

4 Normalize date/phone format
orderDate: 11/11/2018, 4/12/2018
orderDate: 11/11/2018, 11/01/2013, 01/31/2018, 02/15/2015, 04/12/2018
Phone/Date
Rule-based approach: bound to rules, hard to extend
Programming by example: extensible

6 Normalize date/phone format
orderDate: 11/11/2018, 4/12/2018
orderDate: 11/11/2018, 11/01/2013
Phone/Date
Rule-based approach: bound to rules, hard to extend
Programming by example: extensible
APIs:
  changeDateFormat(attribute, sourceFormat, targetFormat)
  changeDateFormat(attribute, targetFormat)
  changePhoneFormat(attribute, sourceFormat, targetFormat)
  changePhoneFormat(attribute, targetFormat)
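The variant without a sourceFormat has to work out the source format itself. A rule-based sketch of that idea follows: it tries a fixed list of candidate patterns in order and uses the first one that parses. The candidate list and the class name are assumptions made for illustration, and the fixed list is exactly the "bound to rules, hard to extend" limitation named above. The approach also cannot resolve ambiguous values (4/12/2018 could be April 12 or December 4); the sketch simply assumes month-first.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper; not part of Data-Knoller.
final class DateFormatInference {

    // Assumed candidate source patterns, tried in order (month-first for slashes).
    private static final String[] CANDIDATE_FORMATS = {"M/d/yyyy", "yyyy-MM-dd"};

    /** Normalizes one value to targetFormat, or returns empty if no candidate matches. */
    static Optional<String> changeDateFormat(String value, String targetFormat) {
        if (value == null) {
            return Optional.empty();
        }
        DateTimeFormatter target = DateTimeFormatter.ofPattern(targetFormat);
        for (String candidate : CANDIDATE_FORMATS) {
            try {
                LocalDate date = LocalDate.parse(value.trim(), DateTimeFormatter.ofPattern(candidate));
                return Optional.of(date.format(target));
            } catch (DateTimeParseException e) {
                // This candidate does not fit; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

With target pattern `MM/dd/yyyy`, both `4/12/2018` and `04/12/2018` normalize to `04/12/2018`, matching the padded value shown on the slide.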

7 Discover and change encoding
Discover the encoding, then change it to another.
City (as read): Berlin, M√©xico D.F., London, Lule√•
City (after changing the encoding): Berlin, México D.F., London, Luleå
Which encoding: ASCII? UTF-8?
APIs:
  changeEncoding(sourceEncoding, targetEncoding)
  changeEncoding(targetEncoding)
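Garbled values like the ones in the City column typically arise when UTF-8 bytes are decoded with a single-byte encoding. The sketch below shows the repair step under that assumption: re-encode the garbled text with the wrongly applied encoding to recover the original bytes, then decode them with the actual one. The class and method names are hypothetical, and ISO-8859-1 is used as the wrong encoding only because it is always available in Java; it produces "MÃ©xico" rather than the exact garbling on the slide.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not part of Data-Knoller.
final class EncodingRepair {

    /**
     * Recovers text that was decoded with wrongEncoding although its bytes
     * were written in actualEncoding.
     */
    static String repair(String garbled, Charset wrongEncoding, Charset actualEncoding) {
        // Undo the wrong decoding to get the original bytes back.
        byte[] originalBytes = garbled.getBytes(wrongEncoding);
        // Decode the same bytes with the encoding they were written in.
        return new String(originalBytes, actualEncoding);
    }
}
```

For example, `repair("M\u00c3\u00a9xico", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)` returns "México". This only works when the wrong encoding maps every byte to a character without loss; discovering which encoding was wrongly applied is a separate problem that the sketch does not address.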

8 Split an attribute
Before:
  Info
  No Middle Name | Lee Child $9.99 | ISBN:
  Ready Player One | Ernest Cline $9.99 | ISBN:
  The Whistler | John Grisham $9.99 | ISBN:
  Red Sparrow | Jason Matthews $9.99 | ISBN:
  Never Never | James Patterson $9.99 | ISBN:
After:
  split1              split2                     split3
  No Middle Name      Lee Child $9.99            ISBN:
  Ready Player One    Ernest Cline $9.99         ISBN:
  The Whistler        John Grisham $9.99         ISBN:
  Red Sparrow         Jason Matthews $9.99       ISBN:
  Never Never         James Patterson $9.99      ISBN:
APIs:
  split(attribute, separator)
  split(attribute, separator, direction, times)
  split(attribute)
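The simplest variant, split(attribute, separator), can be sketched for a single value as follows. The class name is hypothetical, and trimming the surrounding whitespace of each part is an assumption rather than something the slide specifies. The separator is quoted because `String.split` takes a regular expression, and `|` would otherwise be read as alternation.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Data-Knoller.
final class SplitAttribute {

    /** Splits one value on a literal separator and trims each resulting part. */
    static String[] split(String value, String separator) {
        // Pattern.quote treats the separator literally; limit -1 keeps trailing empty parts.
        String[] parts = value.split(Pattern.quote(separator), -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }
}
```

Applied to the first row with separator `|`, this yields the three parts `No Middle Name`, `Lee Child $9.99`, and `ISBN:`.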

9 Split file
APIs:
  splitFile(fileSeparator)
  splitFile()

10 Pivot and unpivot
APIs:
  pivot(toRow, toColumn, aggregation)
  pivot()
  unpivot(unpivotedCols)
  unpivot()
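To make unpivot(unpivotedCols) concrete, here is a minimal in-memory sketch that turns each listed column of a row into its own output row. Everything in it is an assumption for illustration: rows are modeled as maps from column name to value, and the output column names `attribute` and `value` are invented. The real operator would work on Spark data rather than Java collections.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Data-Knoller.
final class Unpivot {

    /** Produces one output row per (input row, unpivoted column) pair. */
    static List<Map<String, String>> unpivot(List<Map<String, String>> rows,
                                             List<String> unpivotedCols) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> row : rows) {
            for (String col : unpivotedCols) {
                Map<String, String> out = new LinkedHashMap<>();
                // Copy the columns that are not unpivoted unchanged.
                for (Map.Entry<String, String> e : row.entrySet()) {
                    if (!unpivotedCols.contains(e.getKey())) {
                        out.put(e.getKey(), e.getValue());
                    }
                }
                // The former column name and its cell become a name/value pair.
                out.put("attribute", col);
                out.put("value", row.get(col));
                result.add(out);
            }
        }
        return result;
    }
}
```

For a made-up row {city: Berlin, 2017: 10, 2018: 12} and unpivotedCols [2017, 2018], the result is two rows: {city: Berlin, attribute: 2017, value: 10} and {city: Berlin, attribute: 2018, value: 12}.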

11 Stemming & Lemmatization
Stemming:
  “cats” -> “cat”
  “ponies” -> “poni”
  “stemmed”, “stemming” -> “stem”
Lemmatization:
  “drove”, “driving” -> “drive”
  “ponies” -> “pony”
APIs:
  stem(attribute)
  stem(attributeSet)
  stem()
  lemmatize(attribute)
  lemmatize(attributeSet)
  lemmatize()
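The outputs “cat” and “poni” are what the first plural-stripping step (step 1a) of the classic Porter stemmer produces. The sketch below implements only that one step, as an illustration of how suffix rules work; it is not a complete stemmer, so it does not handle “stemmed” or “stemming”, and the class name is hypothetical. Lemmatization (“ponies” -> “pony”) needs a dictionary and is not covered here.

```java
// Hypothetical helper; implements only step 1a of the Porter stemmer.
final class PorterStep1a {

    /** Applies the first matching suffix rule to a lowercase word. */
    static String apply(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }
}
```

A production implementation would more likely rely on an existing, tested stemming library than on hand-written rules.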

12 Sample
Sample a subset of records from the table.
With different algorithms?
APIs:
  sample(targetRecordCount, withReplacement)
  sample(probability, withReplacement)
  sample(expectCount, withReplacement)
  sample(dist, withReplacement)
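As one possible reading of sample(targetRecordCount, withReplacement), the sketch below draws a fixed number of records from an in-memory list. The class name and the explicit `Random` parameter (added so results are reproducible in tests) are assumptions; a Spark-based operator would sample distributed data instead. Without replacement, it shuffles a copy and takes a prefix, so it returns at most as many records as exist.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper; not part of Data-Knoller.
final class Sampler {

    /** Draws targetRecordCount records, with or without replacement. */
    static <T> List<T> sample(List<T> records, int targetRecordCount,
                              boolean withReplacement, Random rng) {
        List<T> result = new ArrayList<>();
        if (withReplacement) {
            if (records.isEmpty()) {
                return result; // nothing to draw from
            }
            // Each draw is independent, so a record may appear several times.
            for (int i = 0; i < targetRecordCount; i++) {
                result.add(records.get(rng.nextInt(records.size())));
            }
            return result;
        }
        // Without replacement: shuffle a copy and keep the first entries.
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, rng);
        int count = Math.max(0, Math.min(targetRecordCount, shuffled.size()));
        return new ArrayList<>(shuffled.subList(0, count));
    }
}
```

The probability-based variants would instead keep each record independently with the given probability, which makes the size of the result random rather than fixed.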

13 Preamble discovery
[Figure: file excerpt with regions labeled "Preambles" and "Data"]
APIs:
  removePreamble(lines)
  removePreamble()
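The two API variants differ in who decides how long the preamble is. The sketch below shows both on a list of file lines: removing a given number of leading lines, and a deliberately naive discovery heuristic that treats every leading line without the field separator as preamble. The heuristic, the class name, and the method names are assumptions for illustration; real preambles can contain the separator, so an actual implementation would need something more robust.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Data-Knoller.
final class Preamble {

    /** Drops the first `lines` lines (clamped to the file length). */
    static List<String> removePreamble(List<String> fileLines, int lines) {
        int from = Math.min(Math.max(lines, 0), fileLines.size());
        return new ArrayList<>(fileLines.subList(from, fileLines.size()));
    }

    /** Naive heuristic: counts leading lines that do not contain the separator. */
    static int discoverPreambleLength(List<String> fileLines, String separator) {
        int n = 0;
        while (n < fileLines.size() && !fileLines.get(n).contains(separator)) {
            n++;
        }
        return n;
    }
}
```

The parameterless variant then amounts to `removePreamble(fileLines, discoverPreambleLength(fileLines, separator))`.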

14 List of choices
Change date format
Change phone format
Discover and change encoding
Discover and remove preamble
Split attribute
Pivot and unpivot attributes
Split data file
Stem and lemmatize
Sample

15 Data-Knoller
Systematic Data Preparation
Lan Jiang

16 Data-Knoller
Systematic Data Preparation
Lan Jiang
Java, Scala: define the structure to represent the metadata
Scala: define different APIs of this preparator

17 Assignment 2
Choose a concrete operator (from the list above or from your Assignment 1).
Implement the operator:
  The implementation should handle executability (metadata handling) and potential errors.
  Implement it with Java/Scala + Spark.
Find several datasets yourself to evaluate your implementation:
  They do not have to be large or complicated.
  They should cover the issues that your preparator is meant to solve.
  Write unit tests for them in the project.

18 Assignment 2
Presentation on Dec. 4th: implementation details
GitHub: each team creates a branch and, at the end, makes a pull request.
Open for suggestions on the system itself!

