Managing data for social survey research: key issues and concerns. Paul Lambert, Dept. Applied Social Science, Univ. Stirling, 27th January 2009.


1 Managing data for social survey research: key issues and concerns
Paul Lambert, Dept. Applied Social Science, Univ. Stirling, 27th January 2009
Presented to the workshop "The significance of data management for social survey research", University of Essex, a workshop organised by the Economic and Social Data Service and the "Data Management through e-Social Science" research Node of the National Centre for e-Social Science.

2 Data Management through e-Social Science (DAMES)
DAMES: ESRC Node funded 2008-2011
Aim: Useful social science provisions
- Specialist data topics: occupations; education qualifications; ethnicity; social care; health
- Mainstream packages and accessible resources
Aim: To exploit/engage with existing DM resources
- In social science, e.g. CESSDA
- In e-Science, e.g. OGSA-DAI; OMII

3 Data management means…
"the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis" [DAMES Node]
- Usually performed by social scientists themselves
- Most overt in quantitative survey data analysis: variable constructions, data manipulations, navigating an abundance of data (thousands of variables)
- Usually a substantial component of the work process
- Here we differentiate it from archiving / controlling the data itself

4 Some components…
- Manipulating data: recoding categories / operationalising variables
- Linking data: linking related data (e.g. longitudinal studies); combining / enhancing data (e.g. linking micro- and macro-data)
- Secure access to data: linking data with different levels of access permission; detailed access to micro-data cf. access restrictions
- Harmonisation standards: approaches to linking concepts and measures (indicators); recommendations on particular variable constructions
- Cleaning data: missing values; implausible responses; extreme values
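As an illustration of the "cleaning data" component, the following is a minimal Python sketch rather than anything taken from the presentation. The missing-value codes (-9, -8), the plausible age range, and the function name clean_age are all assumptions made for the example.

```python
# Assumed missing-value codes; real surveys document their own conventions.
MISSING_CODES = {-9, -8}


def clean_age(value, low=16, high=110):
    """Return (cleaned_value, flag) for one raw age response.

    Missing codes and out-of-range answers are set to None and flagged,
    so the decision is recorded instead of silently dropping the case.
    """
    if value in MISSING_CODES:
        return None, "missing"
    if not (low <= value <= high):
        return None, "implausible"
    return value, "ok"


# Hypothetical raw responses
raw_ages = [34, -9, 250, 17, -8, 61]
cleaned = [clean_age(a) for a in raw_ages]
```

Keeping the flag alongside the cleaned value leaves a record of why each case was altered, which fits the emphasis on documentation elsewhere in the talk.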

5 Example: recoding data
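The slide's own example was an image and is not recoverable here. As a stand-in, this short Python sketch shows one way a documented recode could look. The category codes, the labels, and the name RECODE_MAP are hypothetical and do not come from the presentation.

```python
# Hypothetical recode: detailed qualification codes -> three broad levels.
# Keeping the mapping in one named table makes the recode easy to document.
RECODE_MAP = {
    1: "degree",   # e.g. higher degree
    2: "degree",   # e.g. first degree
    3: "school",   # e.g. upper secondary
    4: "school",   # e.g. lower secondary
    5: "none",     # no qualifications
}


def recode(values, mapping, default=None):
    """Apply a recode table; codes not in the table become `default`."""
    return [mapping.get(v, default) for v in values]


original = [1, 3, 5, 2, 9]          # 9 is not in the table
broad = recode(original, RECODE_MAP)
```

Unmapped codes are returned as None rather than guessed, so unexpected categories surface during checking.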

6 Example: linking data
Linking via ojbsoc00: c1-5 = original data / c6 = derived from data / c7 = derived from

7 The significance of data management for social survey research
The data manipulations described above are a major component of the social survey research workload
- Pre-release manipulations performed by distributors / archivists: coding measures into standard categories; dealing with missing records
- Post-release manipulations performed by researchers: re-coding measures into simple categories
We do have existing tools, facilities and expert experience to help us… but we don't make a good job of using them efficiently or consistently
So the significance of DM is about how much better research might be if we did things more effectively…

8 Some provocative examples for the UK…
Social mobility is increasing, not decreasing
- Popularity of controversial findings associated with Blanden et al. (2004)
- Contradicted by wider ranging datasets and/or better measures of stratification position
- DM: researchers ought to be able to more easily access wider data and better variables
Degrees, MScs and PhDs are getting easier {or at least, more people are getting such qualifications}
- Correlates with measures of education are changing over time
- DM: facility in identifying qualification categories and standardising their relative value within age/cohort/gender distributions isn't, but should and could be, widespread
Black-Caribbeans are not disappearing
- As the 1948-70 immigrant cohort ages, the Black-Caribbean group is decreasingly prominent due to return migration and social integration of immigrant descendants
- Data collectors are under pressure to measure large groups only
- DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
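To make "standardising their relative value within age/cohort/gender distributions" concrete, here is a minimal Python sketch that computes z-scores within groups. It is one possible approach under stated assumptions, not the method used in DAMES. The field names (cohort, qual_score) and the numeric qualification scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardise_within(records, group_key, value_key):
    """Add a within-group z-score ("z") to each record.

    The score expresses a value relative to the mean and (population)
    standard deviation of the record's own group, e.g. birth cohort.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    stats = {g: (mean(v), pstdev(v)) for g, v in groups.items()}

    out = []
    for r in records:
        m, s = stats[r[group_key]]
        # A group with no variation gets z = 0 to avoid dividing by zero.
        z = (r[value_key] - m) / s if s > 0 else 0.0
        out.append({**r, "z": z})
    return out


# Hypothetical data: the same score (3) sits high in one cohort, low in another.
people = [
    {"pid": 1, "cohort": "1940s", "qual_score": 1},
    {"pid": 2, "cohort": "1940s", "qual_score": 3},
    {"pid": 3, "cohort": "1970s", "qual_score": 3},
    {"pid": 4, "cohort": "1970s", "qual_score": 5},
]
scored = standardise_within(people, "cohort", "qual_score")
```

In this toy data a score of 3 is above average in the older cohort and below average in the younger one, which is the kind of shift in relative value the slide describes.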

9 Our own motivation (in DAMES)
1. DM is a big part of the research process… but receives limited methodological attention
2. Poor practice in soc. sci. DM is easily observed
- Not keeping adequate records
- Not linking relevant data
- Not trying out relevant variable operationalisations
3. Even though…
- There are plenty of existing resources and standards relevant to data management activities
- There are suitable software and internet facilities (Scott Long 2009)
- People are working on DM support (e.g. ESDS, DAMES)

10 A bit of focus…
Most of the DAMES applications aim to facilitate one of two data management activities:
1) Variable constructions
- Coding and re-coding values
2) Linking datasets
- Internal and external linkages

11 The relevance of e-Science
Data management through e-Social Science
- E-Science refers to adopting a number of particular approaches and standards from computing science and applying them to applied research areas
- These approaches include the Grid; distributed computing; data and computing standardisation; metadata; security; research infrastructures
- DAMES (2008-11): developing services / resources using e-Science approaches which will help social scientists in undertaking data management tasks

12 National Centre for e-Social Science
Major UK investment into UK-oriented e-social science projects, typically:
- Handling and displaying large volumes of complex data, e.g. GeoVue; DReSS; Obesity e-Lab
- Resources for computationally demanding analytical tasks, e.g. CQeSS; MoSeS
- Standards setting in preparing / supporting data and research

13 E-Science and Data Management
E-Science isn't essential to good DM, but it has the capacity to improve and support the conduct of DM…
1) Concern with standards setting in communication and enhancement of data
2) Linking distributed/heterogeneous/dynamic data: coordinating disparate resources; interrogating live resources
3) Contribution of metadata tools/standards for variable harmonisation and standardisation
4) Linking data subject to different security levels
5) The workflow nature of many DM tasks

14 E.g. of GEODE: Organising and distributing specialist data resources (on occupations)

15 The contribution of DAMES: 8 project themes
1.1) Grid Enabled Specialist Data Environments (GE*DE)
1.2) Data resources for micro-simulation on social care data
1.3) Linking e-Health and social science databases
1.4) Training and interfaces for management of complex survey data
2.1) Description, discovery & service use through metadata and data abstraction
2.2) Techniques to handle data from multiple sources
2.3) Workflow modelling for social science
2.4) Security driven data management

16 DAMES research Node
"Social researchers often spend more time on data management than any other part of the research process"
[Diagram. Stages: Data access / collection; Data Management; Data Analysis. Labels: UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS]
Appendix 1: other extant resources relevant to DM

17 Some key issues and concerns for DAMES
- 4 good habits and principles
- 3 challenges

18 (a) Good habit: Keep clear records of your DM activities
- Reproducible (for self)
- Replicable (for all)
- Paper trail for whole lifecycle
- Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax examples:

19 Stata syntax example (do file)

20 Software and handling variables: a personal view
Stata is the superior package for secondary survey data analysis:
- Advanced data management and data analysis functionality
- Supports easy evaluation of alternative measures (e.g. est store)
- Culture of transparency of programming/data manipulation
- Cf. Scott Long (2009)
Problems with Stata
- Not available to all users
- {Slow estimation times}

21 (b) Principle: Use existing standards and previous research
Variable operationalisations
- Use recognised recodes / standard classifications
- ONS harmonisation standards [Shaw et al. 2007]
- Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003]
- Common vs best practices (e.g. dichotomisations)
- Use reproducible recodes / classifications (paper trail)
Other data file manipulations
- Missing data treatments
- Matching data files (finding the right data)

22 (c) Principle: Do something, not nothing
- We currently put much more effort into data collection and data analysis, and neglect data manipulation
- Survey research: the influence of what was on the archive version
- …In my experience, a common reason why people didn't do more DM was that they were frightened to…

23 (d) Principle: Learn how to match files (deterministic)
Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...
- One-to-one matching
  SPSS: match files /file='file1.sav' /file='file2.sav' /by=pid.
  Stata: merge pid using file2.dta (Stata 11 and later: merge 1:1 pid using file2.dta)
- One-to-many matching (table distribution)
  SPSS: match files /file='file1.sav' /table='file2.sav' /by=pid.
  Stata: merge pid using file2.dta (Stata 11 and later: merge m:1 pid using file2.dta)
- Many-to-one matching (aggregation)
  SPSS: aggregate outfile='file3.sav' /break=pid /meaninc=mean(income).
  Stata: collapse (mean) meaninc=income, by(pid)
- Many-to-many matches
- Related cases matching
The SPSS match files commands and the older Stata merge syntax expect both files to be sorted by pid.
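For readers without SPSS or Stata, the same three operations can be sketched in plain Python. This is an illustrative analogue under stated assumptions, not a substitute for the packages' own commands. The helper names attach and aggregate_mean, and all the data, are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def attach(master, table, key):
    """Add the fields of `table` to each `master` row with the same key.

    `key` must be unique in `table`. If it is also unique in `master`,
    this is a one-to-one match; if it repeats in `master`, it is a table
    lookup (one table row distributed to many master rows). Master rows
    with no match are kept unchanged.
    """
    lookup = {}
    for row in table:
        if row[key] in lookup:
            raise ValueError(f"duplicate key in table: {row[key]!r}")
        lookup[row[key]] = row
    return [{**row, **lookup.get(row[key], {})} for row in master]


def aggregate_mean(rows, key, value, out_name):
    """Collapse many rows per key into one row holding the mean of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [{key: k, out_name: mean(v)} for k, v in groups.items()]


# Hypothetical person-year records and a person-level file, linked by pid
person_years = [
    {"pid": 1, "year": 2005, "income": 100.0},
    {"pid": 1, "year": 2006, "income": 200.0},
    {"pid": 2, "year": 2005, "income": 50.0},
]
persons = [{"pid": 1, "sex": "f"}, {"pid": 2, "sex": "m"}]

linked = attach(person_years, persons, "pid")                    # table lookup
means = aggregate_mean(person_years, "pid", "income", "meaninc")  # aggregation
person_level = attach(persons, means, "pid")                      # one-to-one
```

Checking that the key is unique in the lookup file before matching is the main safeguard. An unnoticed duplicate key is a common cause of silently wrong merges.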

24 Some challenges for data management…
(e) Agreeing about variable constructions
- Unresolved debates about optimal measures and variables
- Esp. in comparative research, such as across time or between countries
In DAMES, we have particular interests in comparability for:
- Longitudinal comparability
- Scaling / scoring categories to achieve meaning equivalence or specific measures

25 Some challenges for data management…
(f) Worrying about data security
DM activities could challenge data security
- Inspecting individual cases
- Multiple copies of related data files
- Ability to link with other datasets
- Hands-on model of data review
New and exciting data resources
- have more individual information
- are more likely to be released with stringent conditions
- may jeopardize traditional DM approaches

26 Some routes to secure data
- Secure portals for direct access to remote data
- Secure settings (e.g. safe labs)
- Data anonymisation and attenuation
- Emphasis on the user's responsibility rather than the data provider's

27 Some challenges for data management…
(g) Incentivising documentation / replicability
- There is little to press researchers to better document DM, but much to press them not to
- Make DM and its documentation easier?
- Reward documentation (e.g. citations)?

28 Appendices

29 Appendix 1: Existing resources (i): Data providers
a) Documentation and metadata files

30 Existing resources (i): Data providers
b) Resources for variables
- CESSDA PPP on key variables
- UK Question Bank
- ONS Harmonisation
c) Resources for datasets
- UK Census data portal
- IPUMS international census data facilities
- European Social Survey
d) Data manipulations prior to data release
- Missing data imputation / documentation
- Survey design / weighting information
- Influential: most analysts use the archive version

31 Existing resources (ii): Resource projects / infrastructures
- UK ESDS: ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
- Helpdesks; online instructions; user support…
- UK ESRC NCRM / NCeSS / RDI initiatives
- Longitudinal data
- Linking micro/macro
- Other resources / projects / initiatives
- EDACwowe
- …

32 Existing resources (iii): Analytical and software support
Textbooks featuring data management
- [Levesque 2008] [Sarantakos 2007] [Scott Long 2009]
Software training covering DM
- Stata's data management manual
- SPSS user group course on syntax and data management
But generally, sustained marginalisation of DM as a topic
- Advanced methods texts use simplistic data
- Advanced software for analysis isn't usually combined with extended DM requirements

33 Existing resources (iv): Data analysts' contributions
Academic researchers often generate and publish their own DM resources
- E.g. Harry Ganzeboom on education and occupations
- Provision of whole or partial syntax programming examples
Analysts often drive wider resource provisions related to DM
- CAMSIS project on occupational scales
- CASMIN project on education and social class

34 Existing resources (v): Literatures on harmonisation and standardisation
- National Statistics Institutes' principles and practices, e.g. ONS
- Cross-national organisations, e.g. UNSTATS
- Academic studies, e.g. [Harkness et al. 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]

35 Appendix 2: Some other selected NCeSS projects (concerned with accessing/handling complex data)
- GENeSIS: Geographical data collection for visualisation and simulation analysis
- DReSS: Storing and processing high-volume qualitative data (audio/visual)
- LifeGuide: Collecting/coordinating health/lifestyle information resources for public health dissemination
- Obesity e-Lab: Collecting, linking and accessing health/social data re diet/lifestyle/obesity
- PolicyGrid: Organise/access evidence from mixed data types to assist social science policy making
- CQeSS: Statistical analysis resources for specification of models for data on complex multi-process systems

36 References
Blanden, J., Goodman, A., Gregg, P., & Machin, S. (2004). Changes in generational mobility in Britain. In M. Corak (Ed.), Generational Income Mobility in North America and Europe. Cambridge: Cambridge University Press.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-172.
Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.
Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.
Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.
Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS Users. Chicago, IL: SPSS Inc.
Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave Macmillan.
Scott Long, J. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press.
