1
Towards Automated Data Wrangling
Curation of Example Datasets
May Yong, Research Software Engineer, The Alan Turing Institute
05/09/2017 Towards Automating Data Wrangling – Curation of Example Datasets
2
Data wrangling is the process of going from "raw" data to "usable data".
James Geddes, Principal Data Scientist, The Alan Turing Institute
Transparency and reproducibility are essential for the data wrangling process. Data wrangling tools should reflect this.
All manner of sins
3
Wrangling challenges
“Improving the Data Analytics Process”, Turing Institute workshop, 18-21 July 2016
Task: To find raw data and document the wrangling required to bring it to the stage where it can be analyzed.
Datasets containing wrangling tasks: Surgery Outcomes, Web browsing history, Neonatal ICU, Rainfall, UK E-Petitions, Cybersecurity
Wrangling challenges: obtaining or inferring a data dictionary; data integration; record linkage; spelling and format variability; reformatting the structure of the data; handling missing data; anomaly detection
4
Obtaining or inferring a data dictionary
Understanding the meaning of individual data, fields and tables or other complex structures.
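When no data dictionary is supplied, one possible first step is to scan each column and record which kinds of value it appears to hold. The sketch below is a minimal illustration, not the method used at the workshop; the sample columns (`station_id`, `rain_mm`, `recorded`) and the helper names are hypothetical.

```python
import csv
import io
from datetime import datetime


def guess_type(value):
    """Guess a coarse type for one raw string value."""
    text = (value or "").strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        # Assumption: dates, when present, are written as YYYY-MM-DD.
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "string"


def infer_dictionary(csv_text):
    """Map each column name to the set of coarse types seen in that column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    observed = {name: set() for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            if name in observed:  # skip overflow fields from ragged rows
                observed[name].add(guess_type(value))
    return observed


# Hypothetical sample data, for illustration only.
SAMPLE = "station_id,rain_mm,recorded\nA1,3.2,2016-03-25\nB2,,2016-03-26\n"
inferred = infer_dictionary(SAMPLE)
```

A column that reports more than one type (for example both "float" and "missing") is a hint that further wrangling is needed. Type inference only covers the format of a field; its meaning still has to come from documentation or a domain expert.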
5
Data Integration
Combining data from multiple sources that is conceptually “about the same thing.”
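As a rough sketch of what integration can look like in the simplest case, the Python below merges records from two sources that already share an identifier. The function name, the field names and the sample records are all hypothetical, and the rule that later sources overwrite earlier ones is an assumption made for brevity.

```python
def integrate(sources, key):
    """Combine records from several sources that share a key field.

    Records with the same key are merged into one dictionary. When two
    sources disagree on a field, the later source wins (an assumption;
    real integration usually needs an explicit conflict policy).
    """
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())


# Hypothetical sample data, for illustration only.
readings = [{"station": "A1", "rain_mm": 3.2}, {"station": "B2", "rain_mm": 0.0}]
locations = [{"station": "A1", "region": "North"}, {"station": "C3", "region": "South"}]
combined = integrate([readings, locations], key="station")
```

Stations that appear in only one source are kept with whatever fields are available, so the result also shows where each source has gaps.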
6
Record linkage
Recognising that two distinct pieces of information in the data do in fact concern the same entity.
Coping with spelling and format variability
Recovering the value of a datum from its representation (e.g., recognising the string “25 Mar 16” as the corresponding ISO-8601-encoded date).
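The date example can be sketched in Python by trying a list of candidate formats until one parses. The format list and the function name are illustrative assumptions. Two further assumptions are worth stating: month names are matched in the English locale, and two-digit years follow Python's `%y` convention (00-68 map to 2000-2068, 69-99 to 1969-1999), so "16" is read as 2016.

```python
from datetime import datetime

# Candidate formats to try, in order. This list is an illustrative assumption.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d %b %y", "%d %B %Y", "%d/%m/%Y"]


def to_iso_date(raw):
    """Return the ISO 8601 date string for raw, or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit; try the next one
    return None


# Several spellings of the same day normalise to one representation.
variants = ["25 Mar 16", "25 March 2016", "25/03/2016", "2016-03-25"]
normalised = {to_iso_date(v) for v in variants}
```

Returning `None` for unparseable input, rather than raising, leaves the decision about unrecognised values to the caller, which keeps the wrangling step transparent.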
7
Reformat data structure
Switching from “wide” to “tall” format, normalising/de-normalising relational datasets
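A wide-to-tall conversion can be sketched in a few lines of plain Python. The function name, the output field names (`variable`, `value`) and the sample row are hypothetical; libraries such as pandas offer an equivalent operation, but this version avoids any dependency.

```python
def wide_to_tall(rows, id_field, variable_name="variable", value_name="value"):
    """Turn every non-id column of each wide row into its own tall row."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the identifier is repeated on every tall row instead
            tall.append({id_field: row[id_field], variable_name: column, value_name: value})
    return tall


# Hypothetical sample data: one row per station, one column per day.
wide = [{"station": "A1", "day_1": 3.2, "day_2": 0.0}]
tall = wide_to_tall(wide, id_field="station")
```

In the tall form each observation is one row, so adding another day means adding rows rather than changing the set of columns.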
8
Missing Data
Missing data sources in ‘Neonatal’
Missing weather stations in ‘Rainfall’
Missing postcodes in ‘UK E-Petitions’
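A first step in handling missing data is usually to measure how much is missing. The sketch below counts missing values per field; the function name, the set of markers treated as "missing" and the sample records are assumptions for illustration, not details of the datasets above.

```python
def missing_report(rows, fields, missing_markers=("", "NA", "N/A", None)):
    """Count, for each field, the rows where the value is absent or a missing marker."""
    counts = {field: 0 for field in fields}
    for row in rows:
        for field in fields:
            value = row.get(field)  # None when the field is absent altogether
            if isinstance(value, str):
                value = value.strip()
            if value in missing_markers:
                counts[field] += 1
    return counts


# Hypothetical sample data, for illustration only.
signatures = [
    {"name": "a", "postcode": "AB1 2CD"},
    {"name": "b", "postcode": "  "},
    {"name": "c"},
]
report = missing_report(signatures, fields=["name", "postcode"])
```

Counting before imputing or dropping anything keeps the wrangling step documented and reproducible.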
9
Anomaly Detection
10
https://alan-turing-institute.github.io/wrangling-tests/
11
aida-dwt-petitions/code/Wrangling tasks for UK Petitions data.ipynb
12
turing.ac.uk @turinginst