1 Linking data resources
Paul Lambert, University of Stirling
Presentation to the Scottish Civil Society Data Partnership Project (S-CSDP), Webinar 3 on ‘Dealing with data: Using standard measures and variables and linking together datasets’
www.thinkdata.org.uk, 10 Mar 2016

2 Linking data resources?
1) The importance of ‘identifiers’
2) Software tools for linking data
3) Key categories of data linkage
…In the ‘big data’ tradition and era of ‘datafication’ we increasingly recognise the potential of bringing data together from different sources…
– Social survey data plus administrative data
– Different sources of by-product data
Social science data analysis has always benefitted from linking (quantitative) datasets (‘data management’)
– Linking ‘microdata’ and ‘macrodata’
– Comparative analysis linking records from different years/countries/surveys
S-CSDP, 10 Mar 2016

3 1) The importance of identifiers
‘id’ variable(s)
Numeric or string format
…Should uniquely identify each row in at least one of the data files…
Value of standard categories!
Post-processing to adapt formats?
Reconstruction based on combined characteristics?

Good format!
id  sex  age
1   1    45
2   1    38
3   2    25

Decent format!
id1  id2  sex
1234567891311 21 1234567891412

Bad format
id1        sex  age
fk9 4la    1    45
FK 9 4 LA  1    38
CF12 1lw   2    25
Fk9 4      2    51
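The two checks this slide implies (post-processing an identifier into a standard format, then testing whether it uniquely identifies rows) can be sketched in a few lines. This is a minimal illustration in Python rather than the SPSS or Stata used in the talk; the function names `normalise_id` and `duplicated_ids` are hypothetical, and the cleaning rule (upper-case, remove all whitespace) is an assumption about what ‘adapting formats’ might involve.

```python
from collections import Counter


def normalise_id(raw):
    """Upper-case a string identifier and drop all whitespace (assumed rule)."""
    return "".join(raw.split()).upper()


def duplicated_ids(ids):
    """Return the identifier values that occur more than once."""
    counts = Counter(ids)
    return sorted(value for value, n in counts.items() if n > 1)


# The 'bad format' identifiers shown on the slide.
raw_ids = ["fk9 4la", "FK 9 4 LA", "CF12 1lw", "Fk9 4"]
clean_ids = [normalise_id(r) for r in raw_ids]

print(clean_ids)                  # ['FK94LA', 'FK94LA', 'CF121LW', 'FK94']
print(duplicated_ids(clean_ids))  # ['FK94LA']
```

After cleaning, the first two rows turn out to share one value and the last value looks truncated, so this variable alone would not serve as a unique row identifier.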

4 2) Software tools for data linkage
Some popular software can be used to link data ‘on the fly’ (e.g. MS Excel, Access).
Software designed for research data analysis has the attraction of purpose-built match-merge commands and their syntactical documentation:
– Appending data
    SPSS:  add files /file="file1.sav" /file="file2.sav".
    Stata: use file1.dta, clear
           append using file2.dta
– Aggregating data
    SPSS:  aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
    Stata: collapse (mean) meaninc=income, by(pid)
– One-to-one matching
    SPSS:  match files /file="file1.sav" /file="file2.sav" /by=pid.
    Stata: merge 1:1 pid using file2.dta
– One-to-many matching (‘table distribution’)
    SPSS:  match files /file="file1.sav" /table="file2.sav" /by=pid.
    Stata: merge m:1 pid using file2.dta
– Many-to-many matches (‘joinby’)
– Related cases matching
(see also www.dames.org.uk/workshops/)
Collectively known as ‘match-merge’ functions or ‘deterministic matching’

5 3) Key categories of data linkage
Probabilistic linkage versus deterministic linkage
– Algorithmic approximations versus ‘match-merge’ operations
Linked data providers versus your own data processing
– E.g. www.ipums.org (‘attach characteristics’)
Linked data in a secure/restricted setting versus linking accessible data
– E.g. Scottish Longitudinal Study, see http://sls.lscs.ac.uk/
– E.g. British Household Panel Study, see https://www.iser.essex.ac.uk/bhps/documentation/volb/allrecs.html
– E.g. Linking aggregate occupational data to survey microdata on occupations (talk 4)
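The contrast between deterministic and probabilistic linkage can be illustrated with a toy comparison of two organisation names. This Python sketch is a deliberately crude stand-in: real probabilistic linkage typically weighs agreement across several fields, whereas here a single string-similarity score from the standard library's `difflib` is compared with an arbitrary threshold. The names, the threshold of 0.85 and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def deterministic_match(a, b):
    """'Match-merge' logic: the two identifiers must be exactly equal."""
    return a == b


def similarity(a, b):
    """Similarity score between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def approximate_match(a, b, threshold=0.85):
    """Accept a link when the similarity reaches an (arbitrary) threshold."""
    return similarity(a, b) >= threshold


# Hypothetical organisation names, one containing a typing error.
name_a = "Stirling Voluntary Action"
name_b = "Stirling Voluntry Action"

print(deterministic_match(name_a, name_b))  # False
print(approximate_match(name_a, name_b))    # True
```

The approximate rule recovers a link that exact matching misses, at the cost of possibly accepting some erroneous links, which is why the threshold needs checking against the data at hand.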

6 Appending data
Add one or more datasets ‘on top of’ each other
Usually full or partial overlap of variables
Metadata preserved (but metadata from 1 file can overwrite another)
Typically used for ‘repeated cross-section’ surveys.
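As a rough illustration of appending with only partial overlap of variables, the following Python sketch stacks two small datasets held as lists of dictionaries. The function `append_datasets`, the variable names and the values are hypothetical; filling absent variables with `None` is an assumption that mimics the missing values a statistics package would create.

```python
def append_datasets(*datasets):
    """Stack lists of row-dicts; variables absent from a row become None."""
    variables = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in variables:
                    variables.append(name)
    return [
        {name: row.get(name) for name in variables}
        for rows in datasets
        for row in rows
    ]


# Two hypothetical cross-sections with partially overlapping variables.
survey_2014 = [{"pid": 1, "age": 45, "income": 21000}]
survey_2015 = [{"pid": 7, "age": 38, "region": "Fife"}]

stacked = append_datasets(survey_2014, survey_2015)
for row in stacked:
    print(row)
```

Each output row carries all four variables, with `None` where the source file did not hold that variable.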

7 Aggregating data
Refers to generating new data of summary stats about original data (‘macrodata’)
Often then want to link aggregated data back to the original records (‘microdata’)
Most stats packages also allow generation of summary values and/or variables without aggregating the cases
[Figure annotation: ‘This bit is the aggregation’]
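The two steps described here (aggregate, then link the summary back to the original records) can be sketched as follows. This is an illustrative Python version, not the SPSS or Stata commands from the talk; the variable names `pid`, `income` and `meaninc` follow the earlier slide, while the data values and the function `aggregate_mean` are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical microdata: several income records per person.
records = [
    {"pid": 1, "income": 100},
    {"pid": 1, "income": 300},
    {"pid": 2, "income": 50},
]


def aggregate_mean(rows, by, source):
    """Return {group value: mean of `source`}, one entry per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[source])
    return {key: mean(values) for key, values in groups.items()}


# Step 1: the aggregation (one summary value per pid).
mean_income = aggregate_mean(records, by="pid", source="income")

# Step 2: link the aggregated values back onto every original record.
linked = [dict(row, meaninc=mean_income[row["pid"]]) for row in records]

print(mean_income)  # {1: 200, 2: 50}
for row in linked:
    print(row)
```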

8 One-to-one match
Using a shared identifier to link records from different sources on same units
Here, responses from the same person at different time points (using ‘pid’)
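A one-to-one match can be sketched in Python as a lookup on the shared identifier, with a check that the identifier really is unique in both files. The function `merge_one_to_one`, the wave variables and the values are hypothetical. Note one assumption that differs from some packages: this sketch keeps only the matched cases.

```python
def merge_one_to_one(master, using, key):
    """Link rows sharing `key`; the key must be unique in both inputs."""
    for name, rows in (("master", master), ("using", using)):
        ids = [row[key] for row in rows]
        if len(ids) != len(set(ids)):
            raise ValueError(f"{key} is not unique in the {name} file")
    lookup = {row[key]: row for row in using}
    # Keep matched cases only (an assumption of this sketch).
    return [{**row, **lookup[row[key]]} for row in master if row[key] in lookup]


# Hypothetical responses from the same people at two time points.
wave1 = [{"pid": 1, "job_w1": "nurse"}, {"pid": 2, "job_w1": "teacher"}]
wave2 = [{"pid": 1, "job_w2": "nurse"}, {"pid": 3, "job_w2": "driver"}]

print(merge_one_to_one(wave1, wave2, "pid"))
# [{'pid': 1, 'job_w1': 'nurse', 'job_w2': 'nurse'}]
```

Raising an error on duplicated identifiers mirrors the point made earlier in the talk: the link is only safe when the identifier uniquely identifies each row.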

9 One-to-many match
Use a shared identifier to send values from a unit to multiple relevant records
Common examples include using occupational data and macro-level cross-national data
Often called ‘table’ distribution
In Stata, take care to retain suitable cases (‘_merge’) only
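The ‘table’ distribution and the advice about `_merge` can be illustrated with a small Python sketch. It imitates Stata's `_merge` codes (1 = master only, 2 = using only, 3 = matched) so that unsuitable cases can be dropped afterwards. The function `merge_many_to_one`, the occupation codes and the scores are hypothetical.

```python
def merge_many_to_one(master, table, key):
    """Distribute each `table` row to every master row sharing `key`.

    Adds a Stata-style `_merge` code: 1 = master only, 2 = using only,
    3 = matched.
    """
    lookup = {row[key]: row for row in table}
    used = set()
    out = []
    for row in master:
        match = lookup.get(row[key])
        if match is None:
            out.append({**row, "_merge": 1})
        else:
            used.add(row[key])
            out.append({**row, **match, "_merge": 3})
    for value, row in lookup.items():
        if value not in used:
            out.append({**row, "_merge": 2})
    return out


# Hypothetical survey records and an occupation-level lookup table.
people = [{"pid": 1, "occ": "A"}, {"pid": 2, "occ": "A"}, {"pid": 3, "occ": "Z"}]
occ_table = [{"occ": "A", "score": 60}, {"occ": "B", "score": 40}]

merged = merge_many_to_one(people, occ_table, "occ")

# Retain suitable cases only: drop table rows that matched no survey record.
kept = [row for row in merged if row["_merge"] != 2]
print(len(merged), len(kept))  # 4 3
```

Without the final filter, the unmatched occupation ‘B’ would remain in the data as an extra row that corresponds to no survey respondent.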

10 Many-to-many match
Special scenario – want to distribute data to all permutations of linked records
E.g. within-household links; data on events; data on financial patterns; data on illnesses
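A many-to-many match forms every pairwise combination of records that share the linking identifier. The Python sketch below is written in the spirit of a ‘joinby’ operation; the function name `joinby` is borrowed from the earlier slide, but the implementation, the household identifier `hid` and the data are illustrative assumptions. Unmatched records are dropped in this version.

```python
from collections import defaultdict


def joinby(left, right, key):
    """Form all pairwise combinations of left and right rows sharing `key`."""
    groups = defaultdict(list)
    for row in right:
        groups[row[key]].append(row)
    # Rows without a partner on the other side are dropped in this sketch.
    return [{**l, **r} for l in left for r in groups.get(l[key], [])]


# Hypothetical household members and household-level events.
members = [
    {"hid": 1, "pid": 1},
    {"hid": 1, "pid": 2},
    {"hid": 2, "pid": 3},
]
events = [
    {"hid": 1, "event": "moved house"},
    {"hid": 1, "event": "new baby"},
]

pairs = joinby(members, events, "hid")
print(len(pairs))  # 4: two members x two events in household 1
```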

11 Related cases matching
A version of one-to-one match-merge, where specific relations between units are defined and exploited
Use a purpose-built ‘alter’ identifier (e.g. ‘sppid’) or derive one from data
E.g., link data on a husband to a wife; data on a father to a daughter
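Related cases matching can be sketched as a lookup of each record's ‘alter’ within the same file. In this Python illustration `sppid` is the spouse identifier mentioned on the slide; the function `attach_alter`, the `sp_` prefix for the attached variable and the data values are hypothetical.

```python
def attach_alter(rows, id_var, alter_var, variable, prefix="sp_"):
    """Copy `variable` from each row's related case onto the row itself."""
    by_id = {row[id_var]: row for row in rows}
    out = []
    for row in rows:
        alter = by_id.get(row[alter_var])
        value = alter[variable] if alter is not None else None
        out.append({**row, prefix + variable: value})
    return out


# Hypothetical person file: persons 1 and 2 are each other's spouse.
people = [
    {"pid": 1, "sppid": 2, "age": 45},
    {"pid": 2, "sppid": 1, "age": 43},
    {"pid": 3, "sppid": None, "age": 30},
]

linked = attach_alter(people, id_var="pid", alter_var="sppid", variable="age")
for row in linked:
    print(row["pid"], row["age"], row["sp_age"])
```

Each person keeps their own age and gains the age of their spouse, with `None` where no spouse is recorded.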

12 Summary: Liking linking?
People often under-exploit their data by not implementing linkages when they might be helpful
– Technical/software challenges
– Sometimes, erroneous links are made
Clear documentation of data files can help
Syntactical documentation of tasks will help
– E.g. Long 2009; Boslaugh 2005
Standard measures (identifiers) will make things easier

References cited
Boslaugh, S. (2005). An intermediate guide to SPSS programming: Using syntax for data management. London: Sage.
Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.

