Linking the DAMES & e-Stat Nodes Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting DAMES is the Data Management through e-Social Science research.

1 Linking the DAMES & e-Stat Nodes Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting DAMES is the Data Management through e-Social Science research Node,

2 (1) Some background on DAMES; (2) First thoughts on linking DAMES and e-Stat; (3) Some proposals on usability / services

3 1) Data Management through e-Social Science (DAMES): ESRC-funded Node. Aim: useful social science provisions: specialist data topics (occupations; educational qualifications; ethnicity; social care; health); mainstream packages and accessible resources. Aim: to exploit/engage with existing DM resources: in social science, e.g. ESDS, CESSDA; in e-Science, e.g. OGSA-DAI, OMII.

4 To us, data management means… the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis […DAMES Node..]. Usually performed by social scientists themselves. Pre-analysis tasks (though often revised/updated). Inputs also from data providers. Usually a substantial component of the work process, but may not be explicitly rewarded (and sometimes penalised). Differentiate from archiving / controlling data itself.

5 Some components… Manipulating data: recoding categories / operationalising variables. Linking data: linking related data (e.g. longitudinal studies); combining / enhancing data (e.g. linking micro- and macro-data). Secure access to data: linking data with different levels of access permission; detailed access to micro-data cf. access restrictions. Harmonisation standards: approaches to linking concepts and measures (indicators); recommendations on particular variable constructions. Cleaning data: missing values; implausible responses; extreme values.

6 Example: recoding data

7 Example: linking data. Linking via ojbsoc00: c1-5 = original data / c6 = derived from data / c7 = derived from

8 Matching files (deterministic). Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...
  One-to-one matching
    SPSS: match files /file=file1.sav /file=file2.sav /by=pid.
    Stata: merge pid using file2.dta
  One-to-many matching (table distribution)
    SPSS: match files /file=file1.sav /table=file2.sav /by=pid.
    Stata: merge pid using file2.dta
  Many-to-one matching (aggregation)
    SPSS: aggregate outfile=file3.sav /break=pid /meaninc=mean(income).
    Stata: collapse (mean) meaninc=income, by(pid)
  Many-to-many matches
  Related cases matching
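The three deterministic matches on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (match_one_to_one, match_table, aggregate_mean) and the small example records are hypothetical and are not DAMES or e-Stat code. Only the key variable pid and the income and meaninc names come from the slide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical example files keyed on pid (illustrative values only).
file1 = [{"pid": 1, "sex": "F"}, {"pid": 2, "sex": "M"}, {"pid": 3, "sex": "F"}]
file2 = [{"pid": 1, "age": 34}, {"pid": 2, "age": 51}]
waves = [
    {"pid": 1, "wave": "A", "income": 100},
    {"pid": 1, "wave": "B", "income": 300},
    {"pid": 2, "wave": "A", "income": 200},
]


def match_one_to_one(left, right, key):
    """One-to-one match: keep every key found in either file."""
    left_idx = {row[key]: row for row in left}
    right_idx = {row[key]: row for row in right}
    if len(left_idx) != len(left) or len(right_idx) != len(right):
        raise ValueError("key is not unique, so this is not a one-to-one match")
    merged = []
    for k in sorted(set(left_idx) | set(right_idx)):
        # Unmatched cases simply lack the variables of the other file.
        merged.append({**left_idx.get(k, {}), **right_idx.get(k, {}), key: k})
    return merged


def match_table(records, table, key):
    """One-to-many match: spread one table row over all records sharing its key."""
    lookup = {row[key]: row for row in table}
    return [{**lookup.get(row[key], {}), **row} for row in records]


def aggregate_mean(records, key, var, out):
    """Many-to-one match: collapse records to one row per key holding the mean of var."""
    groups = defaultdict(list)
    for row in records:
        if row.get(var) is not None:  # skip missing values
            groups[row[key]].append(row[var])
    return [{key: k, out: mean(v)} for k, v in sorted(groups.items())]


one_to_one = match_one_to_one(file1, file2, "pid")
one_to_many = match_table(waves, file1, "pid")
many_to_one = aggregate_mean(waves, "pid", "income", "meaninc")
```

Raising an error when the key is not unique is a design choice of this sketch: it makes a failed one-to-one assumption visible rather than silently dropping records.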

9 A bit of focus… I tend to emphasise two data management activities: 1) Variable constructions (coding and re-coding values); 2) Linking datasets (internal and external linkages).

10 The centrality of keeping clear records of DM activities: reproducible (for self); replicable (for all); paper trail for whole lifecycle. Cf. Dale 2006; Freese 2007. In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata). Syntax examples:

11 Principal DAMES services (current status): GESDE specialist data environments (prototypes) for occupations, educational qualifications, ethnicity; data curation tool (prototype); data fusion tool (prototype); secure data demonstrator for e-Health research (complete); micro-simulation model for social care data (prototype); training workshops and events (in progress).

12 GEMDE – Grid Enabled Specialist Data Environments

13 GEODE – Occupational data

14 Data curation tool. The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way.

15 Data fusion tool

16 2. Linking DAMES and e-Stat. The high-level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities: using/adapting DAMES contributions; DAMES services for data linking; DAMES resources for recoding variables; making replication central to the data story.

17 Data and variables. DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data). Anything on educational qualifications, occupations, ethnicity is of particular interest. Generic tools for merging micro-data. Generic tools for other variable processes.

18 Data oriented review: applied research perspective; range of data resources; accessing and documenting data resource options.

19 The implementation for e-Stat. This is mostly a blank space… and we've not hitherto used Python. Data curation tool and GEODE/GEEDE use iRODS. GEMDE uses a bespoke SQL database. Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal.

20 3. A pitch for specific e-Stat facilities: harvest the best of data analysis packages from an applied data perspective. Replication in human-readable syntax. Something like Stata's est store for multiple model comparisons. Fluency in data oriented options. Training resources in data
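As a rough illustration of the "est store" idea, the sketch below keeps named sets of model coefficients and prints them side by side. It is not Stata's implementation and not an e-Stat facility: the EstimateStore class, its store and table methods, and the coefficient values are all hypothetical, invented only to show the shape of such a feature.

```python
class EstimateStore:
    """Hypothetical registry of named model results for side-by-side comparison."""

    def __init__(self):
        self._models = {}

    def store(self, name, coefs):
        """Save a copy of a model's coefficients under a chosen name."""
        self._models[name] = dict(coefs)

    def table(self, names=None):
        """Return a plain-text table with one column per stored model."""
        names = list(names) if names is not None else list(self._models)
        terms = []
        for n in names:
            for t in self._models[n]:
                if t not in terms:
                    terms.append(t)  # keep first-seen order of terms
        lines = ["term".ljust(12) + "".join(n.rjust(10) for n in names)]
        for t in terms:
            cells = []
            for n in names:
                value = self._models[n].get(t)
                # Leave the cell blank when a model does not include the term.
                cells.append(("" if value is None else f"{value:.3f}").rjust(10))
            lines.append(t.ljust(12) + "".join(cells))
        return "\n".join(lines)


# Made-up coefficients, purely to exercise the table layout.
est = EstimateStore()
est.store("m1", {"age": 0.12, "female": -0.5})
est.store("m2", {"age": 0.10, "female": -0.45, "degree": 0.8})
comparison = est.table()
print(comparison)
```

Storing results under human-readable names also supports the replication point on the same slide, since the comparison can be regenerated from the stored syntax rather than copied by hand.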

21 Est store demo here

22 Appendix items

23 Model 1: Graphics and Text interface
  Graphics (diagram labels): Data file specification; Variable manipulation & analysis; BHPS, wave A individuals; BHPS wave B individuals; Wave C; Analytical file; Gender; Current job RGSC; Spouse CAMSIS; Age (yrs); Age bands; Spouse SOC
  Text interface (invoked manually or in response to manipulating graphs):
    DAMES most common commands:
      -> usedataset{UKDA_5151}
      -> usedatafile{individuals wave A}
      -> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}
    Commands invoking other packages:
      -> SPSS{match files file=aindresp.sav /file=bindresp.sav /by=pid}
      -> SPSS{fre var=ajbrgsc}
      -> Stata{recode ageb 16/30=1 31/50=2 *=.}
      -> R{..}
      -> Stata{do $path2\}
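The Stata recode on this slide (ages 16 to 30 coded 1, 31 to 50 coded 2, everything else set to missing) could be sketched in Python as below. The recode function and the AGE_BANDS rule list are hypothetical names for illustration only, and None stands in for Stata's missing value.

```python
def recode(value, rules, otherwise=None):
    """Return the code of the first (low, high, code) rule whose closed range holds value."""
    if value is None:
        return otherwise  # missing input falls through to the catch-all
    for low, high, code in rules:
        if low <= value <= high:
            return code
    return otherwise  # plays the role of the "*=." catch-all in the Stata command


# Mirrors: recode ageb 16/30=1 31/50=2 *=.
AGE_BANDS = [(16, 30, 1), (31, 50, 2)]

ages = [15, 16, 30, 31, 50, 51, None]
bands = [recode(age, AGE_BANDS) for age in ages]
```

Keeping the rules in a named list rather than inside the function means the same annotated rule set can be reused and recorded, in the spirit of the paper trail argued for on slide 10.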

24 The significance of data management for social survey research (see The data manipulations described above are a major component of the social survey research workload: pre-release manipulations performed by distributors / archivists (coding measures into standard categories; dealing with missing records); post-release manipulations performed by researchers (re-coding measures into simple categories). We do have existing tools, facilities and expert experience to help us… but we don't make a good job of using them efficiently or consistently. So the significance of DM is about how much better research might be if we did things more effectively…

25 Some provocative examples for the UK… (1) Social mobility is increasing, not decreasing! Popularity of controversial findings associated with Blanden et al (2004); contradicted by wider ranging datasets and/or better measures of stratification position. DM: researchers ought to be able to more easily access wider data and better variables. (2) Degrees, MScs and PhDs are getting easier! {or at least, more people are getting such qualifications} Correlates with measures of education are changing over time. DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn't, but should, and could, be widespread. (3) Black-Caribbeans are not disappearing! As the immigrant cohort ages, the Black-Caribbean group is decreasingly prominent due to return migration and social integration of immigrant descendants. Data collectors are under pressure to measure large groups only. DM: it ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures.
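One simple way to standardise a qualification's relative value within a cohort is a within-group z-score, sketched below. This is an assumption made for illustration, not the method used by DAMES: the standardise_within function, the numeric qualification scores and the cohort labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_within(records, group_key, var, out):
    """Add a z-score of var computed separately within each group."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row[var])
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}
    result = []
    for row in records:
        centre, spread = stats[row[group_key]]
        z = (row[var] - centre) / spread if spread > 0 else 0.0
        result.append({**row, out: z})
    return result


# Hypothetical scores: the same qualification level (3) appears in both cohorts.
people = [
    {"pid": 1, "cohort": "1940s", "qual": 1},
    {"pid": 2, "cohort": "1940s", "qual": 3},
    {"pid": 3, "cohort": "1970s", "qual": 3},
    {"pid": 4, "cohort": "1970s", "qual": 5},
]
scored = standardise_within(people, "cohort", "qual", "qual_z")
```

In this made-up data a score of 3 sits above the cohort average in the earlier cohort and below it in the later one, which is the kind of shift in relative value the slide describes.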

26 Comment: growing interest in data management..?
  Historically, references covering DM were few and far between:
    Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.
  Recently, there's been a small burst of relevant references:
    Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics. Chicago, Il.: SPSS Inc.
    Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
    Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
  Growing interest re. documentation for replication:
    Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2).
    Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).

27 E-Science and Data Management. E-Science isn't essential to good DM, but it has capacity to improve and support conduct of DM… 1) Concern with standards setting in communication and enhancement of data. 2) Linking distributed/heterogeneous/dynamic data: coordinating disparate resources; interrogating live resources. 3) Contribution of metadata tools/standards for variable harmonisation and standardisation. 4) Linking data subject to different security levels. 5) The workflow nature of many DM tasks.
