Towards Automated Data Wrangling


1 Towards Automated Data Wrangling
Curation of Example Datasets
May Yong, Research Software Engineer, The Alan Turing Institute
05/09/2017 Towards Automating Data Wrangling – Curation of Example Datasets

2 Data wrangling tools should reflect this.
Data wrangling is the process of going from “raw” data to “usable data”. (James Geddes, Principal Data Scientist, The Alan Turing Institute)
Transparency and reproducibility are essential for the data wrangling process. Data wrangling tools should reflect this.
All manner of sins

3 Datasets containing wrangling tasks
Datasets containing wrangling tasks: Surgery Outcomes, Web browsing history, Neonatal ICU, Rainfall, UK E-Petitions, Cybersecurity.
Wrangling challenges: obtaining or inferring a data dictionary; data integration; record linkage; spelling and format variability; reformatting the structure of the data; handling missing data; anomaly detection.
“Improving the Data Analytics Process”, Turing Institute workshop, 18-21 July 2016.
Task: find raw data and document the wrangling required to bring it to the stage where it can be analyzed.

4 Obtaining or inferring a data dictionary
Understanding the meaning of individual data, fields and tables or other complex structures.
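As a rough illustration of what "inferring" a data dictionary could involve, the sketch below guesses a coarse type for each field from its values and counts empty cells. It is not taken from the talk: the function names, the type labels, and the sample rows are all hypothetical, and a real data dictionary would also need the meaning of each field, which cannot be inferred this way.

```python
from datetime import date


def infer_type(values):
    """Guess a coarse type for one column from its non-empty string values."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "unknown"

    def all_parse(parser):
        # True only if every non-empty value is accepted by the parser.
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(date.fromisoformat):
        return "date"
    return "string"


def draft_dictionary(rows):
    """Build a first-draft data dictionary: inferred type and missing count per field."""
    draft = {}
    for field in rows[0].keys():
        column = [row.get(field) for row in rows]
        missing = sum(1 for v in column if v is None or v.strip() == "")
        draft[field] = {"type": infer_type(column), "missing": missing}
    return draft


# Made-up rows for illustration only; not from any of the datasets in the talk.
sample_rows = [
    {"id": "1", "rainfall_mm": "3.2", "observed": "2016-03-25", "station": "A"},
    {"id": "2", "rainfall_mm": "", "observed": "2016-03-26", "station": "B"},
]

for name, entry in draft_dictionary(sample_rows).items():
    print(name, entry)
```

The draft is only a starting point for a human to review; a column of digits may be an identifier and not a quantity.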

5 Data Integration
Combining data from multiple sources that is conceptually “about the same thing.”

6 Coping with spelling and format variability
Record linkage: recognising that two distinct pieces of information in the data do in fact concern the same entity.
Coping with spelling and format variability: recovering the value of a datum from its representation (e.g., recognising the string “25 Mar 16” as an ISO-8601-encoded date).
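Both ideas can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used in the talk: the short date is assumed to be day, abbreviated English month, two-digit year, with the year falling in the 2000s; and the linkage test uses a simple string-similarity ratio from the standard library, with a threshold chosen arbitrarily. All function names are hypothetical.

```python
import re
from datetime import date
from difflib import SequenceMatcher

# Month lookup kept explicit so parsing does not depend on the system locale.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_short_date(text, century=2000):
    """Parse 'DD Mon YY' into a date. The century is an assumption, not data."""
    day, month, year = text.split()
    return date(century + int(year), MONTHS[month[:3].casefold()], int(day))


def normalise(text):
    """Lower-case, drop punctuation, and collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.casefold())
    return " ".join(no_punct.split())


def same_entity(a, b, threshold=0.9):
    """Treat two strings as the same entity if their normalised forms are similar."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


print(parse_short_date("25 Mar 16").isoformat())
print(same_entity("Neonatal ICU", "neonatal  I.C.U."))
```

The two-digit year is the ambiguous part: the representation alone does not say which century is meant, so that choice has to be documented alongside the wrangling step.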

7 Reformat data structure
Switching from “wide” to “tall” format, normalising/de-normalising relational datasets.
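A wide-to-tall reshape can be sketched without any libraries. The example below is illustrative only: the function name, the field names, and the rainfall readings are made up, and in practice a dataframe library's reshaping function would usually do this job.

```python
def wide_to_tall(rows, id_field, value_fields, var_name="variable", value_name="value"):
    """Turn one row per entity (wide) into one row per entity and measurement (tall)."""
    tall = []
    for row in rows:
        for field in value_fields:
            tall.append({
                id_field: row[id_field],
                var_name: field,
                value_name: row[field],
            })
    return tall


# Hypothetical wide table: one column per day.
wide = [
    {"station": "A", "day_1": 3.2, "day_2": 0.0},
    {"station": "B", "day_1": 1.1, "day_2": 4.5},
]

for record in wide_to_tall(wide, "station", ["day_1", "day_2"], "day", "rainfall_mm"):
    print(record)
```

The tall form has one row per station and day, which makes it easier to filter or aggregate by day without naming each column.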

8 Missing Data
Missing data sources in ‘Neonatal’
Missing weather stations in ‘Rainfall’
Missing postcodes in ‘UK E-Petitions’
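A first step in handling cases like the missing postcodes is simply to count them. The sketch below is an assumption-laden illustration: the field names, the rows, and the list of strings treated as "missing" are all hypothetical and would need to be checked against the real data.

```python
def count_missing(rows, fields, markers=("", "na", "n/a", "null")):
    """Count, per field, the rows where the value is absent or a known placeholder."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or str(value).strip().casefold() in markers:
                counts[field] += 1
    return counts


# Made-up records for illustration only.
signatures = [
    {"signature_id": "1", "postcode": "AB1 2CD"},
    {"signature_id": "2", "postcode": ""},
    {"signature_id": "3"},
]

print(count_missing(signatures, ["signature_id", "postcode"]))
```

Whether a placeholder such as "NA" really means missing is itself a wrangling decision that belongs in the documentation.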

9 Anomaly Detection

10 https://alan-turing-institute.github.io/wrangling-tests/

11 aida-dwt-petitions/code/Wrangling tasks for UK Petitions data.ipynb

12 turing.ac.uk @turinginst

