Manipulating data: Deriving variables, handling missing data, and cleaning data - practices, services and standards Paul Lambert (Dept. Applied Social.

1 Manipulating data: Deriving variables, handling missing data, and cleaning data - practices, services and standards
Paul Lambert (Dept. Applied Social Science, Univ. Stirling)
Vernon Gayle (Dept. Applied Social Science, Univ. Stirling and ISER, Univ. Essex)
27th January 2009
Presented to the workshop "The significance of data management for social survey research", University of Essex, a workshop organised by the Economic and Social Data Service and the Data Management through e-Social Science research Node of the National Centre for e-Social Science

2 Manipulating data
Operations performed on datasets by researchers and/or data distributors
  At any stage of the research lifecycle
  Of considerable consequence to analytical results
DAMES Node: Data Management = manipulation of data, and documenting/assisting the processes of manipulation
E-Social Science approach to facilitating data manipulation (metadata resources; data access facilities; workflow models)

3 Deriving variables, handling missing data, and cleaning data
...Especially common types of data manipulation...
1) Deriving variables = computing new measures for purposes of analysis
  o E.g. recoding complex categorical variables; standardising scores; linking micro- and macro-data
  o {Creating composite vars., e.g. selection model hazards, propensity scores, weights}
2) Handling missing data = strategies for item or case non-response
  o E.g. imputation approaches; listwise/pairwise deletion
  o {Deriving missing variables via data fusion}
  o Clarifying, stating & documenting assumptions
3) Cleaning data = monitoring and adjusting responses across a given set of variables
  o E.g. extreme values; erroneous values; re-scaling distributions
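To make the missing-data strategies named on this slide concrete, here is a minimal Python sketch contrasting listwise deletion, pairwise deletion and a very simple imputation. The records, variable names and helper functions are all hypothetical and invented for illustration; the slides do not prescribe any particular implementation, and single mean imputation is used only because it is the simplest imputation approach to show.

```python
from statistics import mean

# Hypothetical survey records; None marks item non-response.
records = [
    {"age": 30, "income": 20.0, "hours": 40},
    {"age": 40, "income": None, "hours": 35},
    {"age": 50, "income": 30.0, "hours": None},
    {"age": 60, "income": 40.0, "hours": 20},
]

def listwise(rows, variables):
    """Keep only cases observed on every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise(rows, x, y):
    """Keep cases observed on both x and y, ignoring all other variables."""
    return [(r[x], r[y]) for r in rows
            if r[x] is not None and r[y] is not None]

def mean_impute(rows, variable):
    """Fill missing values with the observed mean and flag imputed cases."""
    observed = mean(r[variable] for r in rows if r[variable] is not None)
    flag = variable + "_imputed"
    return [
        {**r,
         variable: observed if r[variable] is None else r[variable],
         flag: r[variable] is None}
        for r in rows
    ]

complete_cases = listwise(records, ["age", "income", "hours"])
age_income_pairs = pairwise(records, "age", "income")
imputed = mean_impute(records, "income")
```

Listwise deletion leaves fewer cases than pairwise deletion here because it drops any case missing on any of the three variables. The imputation flag is kept so that the assumption behind each filled-in value stays documented, in line with the slide's point about stating assumptions.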

4 In this talk...
Practices, services and standards
...for deriving variables, handling missing data, and cleaning data...
Practices
  o Key, or common, features of current approaches
Services
  o Resources available/conceivable
Standards
  o Preliminary thoughts on standards setting

5 (i) A brief illustrative example from the UK RAE 2008
Research Assessment Exercise data published Dec 2008
Extended reporting on basic data by media/within HE sector, e.g.
  Cambridge leads the way
  Nursing raises its status
Numerous enhancements/amendments to data & analysis could be easily generated, and often lead to a different story
Lambert, P.S. and Gayle, V. (2008). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008, University of Stirling: Technical Paper 2008-3 of the Data Management through e-Social Science Research Node

6 ...Extending analysis of the 2008 RAE using data manipulations...
Deriving variables
  Commonly used RAE grade point average: GPA = [4 x (%4*) + 3 x (%3*) + 2 x (%2*) + (%1*)] / 100
  Calculate alternative GPA measures
  Standardise GPA within Units of Assessment
  Rate Units of Assessment by external measures of relative prestige
  Link with 2001 standard thresholds
  Other external data, e.g. Univ. typologies; RAE panel membership
Cleaning data
  Of 159 HEIs, 27 have only 1 UoA (cf. mean of 15 UoAs within HEI, max 53 (Manchester))
  The single-UoA HEIs often have outlying GPAs
  Analyses of averages might exclude these HEIs
Handling missing data
  Less conventionally missing data (admin dataset)
  But not all HEI staff are included within the RAE; consider analysis accounting for number of excluded staff?
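The derivations on this slide can be sketched in a few lines of Python. The authors report using Stata for their derived variables, so the code below is not their syntax; it is an illustrative sketch on invented quality profiles, with hypothetical field names (p4 to p1 for the percentages rated 4* to 1*) and hypothetical helper functions. It computes the GPA formula from the slide, standardises GPA within each Unit of Assessment as a z-score, and drops HEIs that submitted to only one UoA.

```python
from collections import Counter
from statistics import mean, pstdev

# Invented profiles: percentage of activity rated 4*, 3*, 2*, 1*.
profiles = [
    {"hei": "A", "uoa": "Sociology", "p4": 20, "p3": 40, "p2": 30, "p1": 10},
    {"hei": "B", "uoa": "Sociology", "p4": 10, "p3": 30, "p2": 40, "p1": 20},
    {"hei": "A", "uoa": "Nursing", "p4": 5, "p3": 25, "p2": 50, "p1": 20},
    {"hei": "C", "uoa": "Nursing", "p4": 15, "p3": 35, "p2": 35, "p1": 15},
]

def gpa(row):
    """GPA = [4(%4*) + 3(%3*) + 2(%2*) + (%1*)] / 100, as on the slide."""
    return (4 * row["p4"] + 3 * row["p3"] + 2 * row["p2"] + row["p1"]) / 100

def standardise_within_uoa(rows):
    """Attach the GPA and its z-score relative to the same UoA."""
    by_uoa = {}
    for row in rows:
        by_uoa.setdefault(row["uoa"], []).append(gpa(row))
    scored = []
    for row in rows:
        scores = by_uoa[row["uoa"]]
        mu, sd = mean(scores), pstdev(scores)
        z = (gpa(row) - mu) / sd if sd > 0 else 0.0
        scored.append({**row, "gpa": gpa(row), "z_gpa": z})
    return scored

def drop_single_uoa_heis(rows):
    """Cleaning step: exclude HEIs that appear in only one UoA."""
    counts = Counter(row["hei"] for row in rows)
    return [row for row in rows if counts[row["hei"]] > 1]

scored = standardise_within_uoa(profiles)
multi_uoa_only = drop_single_uoa_heis(profiles)
```

The population standard deviation is used here purely for simplicity; whether to standardise against the sample or population deviation, and whether to weight by staff numbers, are exactly the kinds of choice the slide suggests should be documented.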

7 Conventional RAE 2008 results for Univ. Essex

8 Alternative RAE 2008 measures for Univ. Essex (within- and between-subject standardisations)

13 RAE data manipulations example – practices, services and standards
Practices
  Media/HEI announcements concentrate upon simplistic, unweighted, unstandardised rankings/averages
  Various alternative measures tell different stories; we found...
    LSE outranks Cambridge
    Nursing ranks 6th least prestigious UoA of 67
Services
  Raw data available online
  Relevant supplementary data
Standards
  RAE-level documentation on grading criteria and approach
  Software-based workflow approach (cf. Scott Long, 2009)
    o In our paper we show Stata syntax for derived variables

14 (ii) Some wider thoughts on data manipulation practices, services and standards
Currently...
Practices are messy and painful
  Lack of replication and consistency in data manipulation tasks with complex survey data
  Few people relish data manipulations!
Services exist but are under-exploited
Standards are not agreed
  Ignoring standards is no barrier to publication

15 Practices: apparent trends
Deriving variables, handling missing data, cleaning data
More interest in harmonisation and comparability
  Longitudinal and cross-national data
  Documentation challenges encourage simplifying approaches
New data and analytical opportunities
  Increasing opportunities for enhancing data by linking at micro- or aggregate level
  Increasing availability of routines for missing values, extreme values
Raising standards in secondary analysis of large scale surveys
  Inadequacy of simple analyses which ignore multivariate relations, missing data, multiprocess systems, hierarchical structures
    o Data manipulations often conducted outside these considerations
  Desirability of replication

16 Services: key challenges
Deriving variables, handling missing data, cleaning data
Software issues
  Dominance of major proprietary database packages
  Other specialist/minority packages (e.g. MLwiN)
  Documentation / replication between packages?
Data security
  Few services can offer to let experts take over a dataset
  Approaches to reviewing data ought to avoid inspecting cases, duplicate copies
Keeping up-to-date?
  o Finding data - need for search facilities [via metadata]
  o Updating specialist advice
    E.g. of GEODE: occupational data out of date before completion
    NSIs' strict focus on contemporary data

17 Standards: key requirements
Deriving variables, handling missing data, cleaning data
Need for documentation for replication
  Detailed accounts of process
  Citation of sources
  DAMES - to facilitate with metadata and process tools
Resolving some difficult debates
  o Approaches to comparative research (measurement equivalence vs meaning equivalence)
  o Necessary standards for analysis/reporting on missing data
  o Appropriate approaches to extreme values, e.g. robust regressions
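As a rough illustration of why the treatment of extreme values is a matter for debate, the Python sketch below compares an ordinary least squares slope with a Theil-Sen slope (the median of all pairwise slopes) on invented data containing one extreme response. Theil-Sen is chosen here only as one easily coded robust estimator; the slide names robust regressions in general and does not endorse this or any specific method.

```python
from itertools import combinations
from statistics import mean, median

def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return median(slopes)

# Invented data: y = 2x except for a deliberately extreme final value.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 100]

ols = ols_slope(xs, ys)
robust = theil_sen_slope(xs, ys)
```

On these data the least squares slope is pulled far above 2 by the single extreme case, while the median-based slope stays at 2. Which estimate is reported, and how the extreme case was handled, is the kind of decision the slide argues should be documented.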

18 Forthcoming DAMES contributions
Summer workshops on documenting manipulation and analysis of complex survey data
  To Stata and beyond...
Services for improving data manipulation activities
  Specialist data on occupations, ethnicity, education
  Specialist data on social care, mental health
  Tools for performing data manipulations (linking data and operationalising variables)
Services for recording data manipulation activities
  Workflow modelling tools
  Metadata records for data linkages and variables
  Citation information
