4. Data Management and Variable Operationalisations Paul Lambert, 24-25 August 2009 Presented to Data Management for Social Survey Research, a workshop.

1 4. Data Management and Variable Operationalisations
Paul Lambert, 24-25 August 2009
Presented to "Data Management for Social Survey Research", a workshop organised by the ESRC "Data Management through e-Social Science" research node

2 Deriving variables, handling missing data, and cleaning data
Especially common types of data manipulation:
1) Deriving variables = computing new measures for purposes of analysis
   - E.g. recoding complex categorical variables; standardising scores; linking micro- and macro-data
   - {Creating composite variables, e.g. selection model hazards, propensity scores, weights}
2) Handling missing data = strategies for item or case non-response
   - E.g. imputation approaches; listwise/pairwise deletion
   - {Deriving missing variables via data fusion}
   - Clarifying, stating & documenting assumptions (see
3) Cleaning data = monitoring and adjusting responses across a given set of variables
   - E.g. extreme values; erroneous values; re-scaling distributions
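
The three types of manipulation listed on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the records, the recode table `QUAL_RECODE`, and the plausibility threshold `INCOME_CAP` are hypothetical and do not come from the workshop materials.

```python
from statistics import mean, pstdev

# Hypothetical survey records; None marks item non-response.
records = [
    {"id": 1, "qual": "degree", "income": 32000},
    {"id": 2, "qual": "a-level", "income": None},
    {"id": 3, "qual": "gcse", "income": 18000},
    {"id": 4, "qual": "none", "income": 950000},
    {"id": 5, "qual": "degree", "income": 41000},
]

# 1) Deriving a variable: recode a detailed categorical measure.
QUAL_RECODE = {"degree": "high", "a-level": "mid", "gcse": "mid", "none": "low"}
for r in records:
    r["qual3"] = QUAL_RECODE[r["qual"]]

# 2) Handling missing data: listwise deletion on income.
complete = [r for r in records if r["income"] is not None]

# 3) Cleaning: flag (rather than silently drop) extreme values.
INCOME_CAP = 500000  # assumed plausibility threshold
for r in complete:
    r["income_extreme"] = r["income"] > INCOME_CAP

# Deriving again: z-score of income among complete, non-extreme cases.
usable = [r["income"] for r in complete if not r["income_extreme"]]
mu, sd = mean(usable), pstdev(usable)
for r in complete:
    r["income_z"] = None if r["income_extreme"] else (r["income"] - mu) / sd
```

Keeping the source variable alongside each derived one, and flagging rather than deleting, leaves the assumptions visible to later analysts.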

3 Variable operationalisations
Analytical and conceptual issues
1) Harmonisation naming (!)
2) The value of trying multiple measures and standardisations
3) Multivariate vs univariate context
4) Functional form
Thinking about key variables

4 1) Why harmonisation naming
Much attention to variable operationalisations involves proposing optimum / standard measures
- UK: ONS Harmonisation
- EU: Eurostat standards
- Studies of criterion and construct validity
Standard measures impact other analyses
- Affect available data
- Affect interpretations of data

5 Key variables – Harmonisation (across countries; across time periods)
"a method for equating conceptually similar but operationally different variables.." [Harkness et al 2003, p352]
Input harmonisation [esp. Harkness et al 2003]
- "harmonising measurement instruments" [H-Z and Wolf 2003, p394]
- unlikely / impossible in longer-term longitudinal studies
- common in small cross-national and short-term longitudinal studies
Output harmonisation ("ex-post harmonisation")
- "harmonising measurement products" [H-Z and Wolf 2003, p394]
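
Output (ex-post) harmonisation amounts to mapping already-collected, context-specific codes onto a common scheme. The sketch below is only one way to picture it. The mapping table `HARMONISE`, its category labels, and the three-level target scheme are hypothetical and are not an official crosswalk.

```python
# Hypothetical national education codes mapped to a hypothetical
# common three-level scheme (1 = lower, 2 = intermediate, 3 = higher).
HARMONISE = {
    "UK": {"degree": 3, "a-level": 2, "gcse": 1},
    "DE": {"hochschule": 3, "abitur": 2, "hauptschule": 1},
}

def harmonise(country, source_code):
    """Return (harmonised_code, source_code), keeping the source value for audit."""
    return HARMONISE[country].get(source_code), source_code

rows = [("UK", "a-level"), ("DE", "abitur"), ("DE", "lehre")]
harmonised = [harmonise(country, code) for country, code in rows]
# Codes without an agreed mapping come back as None instead of being guessed.
```

Returning the source code next to the harmonised one preserves both levels of data, so a different harmonisation can be applied later.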

6 More on harmonisation [esp. HZ and Wolf 2003, p393ff]
- Numerous practical resources to help with input and output harmonisation [e.g. ONS; UN / EU / NSIs; LIS project; IPUMS]
- [Cross-national e.g.: HZ & Wolf 2003; Jowell et al. 2007]
- Room for more work in justifying / understanding interpretations after harmonisation

7 Equivalence
"the degree to which survey measures or questions are able to assess identical phenomena across two or more cultures" [Harkness et al 2003, p351]
- Measurement equivalence involves the same instruments and equality of measures (e.g. income in pounds)
- Functional equivalence involves different instruments, but addresses the same concepts (e.g. inflation-adjusted income)
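
The income example on this slide can be made concrete with a short sketch. The price index values and incomes are hypothetical, chosen so that two nominally different amounts turn out to be the same in real terms.

```python
# Hypothetical price index (base year 2000 = 100); not real statistics.
PRICE_INDEX = {1991: 80.0, 2000: 100.0, 2008: 125.0}

def real_income(nominal, year, base_year=2000):
    """Express a nominal income in base-year prices."""
    return nominal * PRICE_INDEX[base_year] / PRICE_INDEX[year]

# Same instrument (pounds), different years: (year, nominal income).
incomes = [(1991, 16000.0), (2008, 25000.0)]
adjusted = [real_income(value, year) for year, value in incomes]
# The nominal values differ, but both equal 20000.0 in base-year prices.
```

Here the "pounds" measure is identical across years (measurement equivalence), yet only the adjusted measure compares like with like (functional equivalence).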

8 "Equivalence is the only meaningful criterion if data is to be compared from one context to another. However, equivalence of measures does not necessarily mean that the measurement instruments used in different countries are all the same. Instead it is essential that they measure the same dimension. Thus, functional equivalence is more precisely what is required" [HZ and Wolf 2003, p389]
More importantly, measurement equivalence rarely achieves what it intends to in the context of longitudinal and/or cross-national comparisons…

9 Measurement equivalence and social class
[Show tabplot here]

10 Harmonisation & equivalence combined
Universality or specificity in variable constructions
- Universality: collect harmonised measures, analyse standardised schemes
- Specificity: collect localised measures, analyse functionally equivalent schemes
Most prescriptions aim for universality
- But specificity is theoretically better
- Specificity is more easily obtained than is often realised
- Especially for well-known key variables [e.g. Lambert et al., 2008]

11 2) The value of trying out different variable constructions
Meaning?
- Coding frames; re-coding decisions; metric transformations and functional forms; relative effects in multivariate models
Data collection and data analysis
- Cf. processes by which survey measures are defined and subsequently interpreted by research analysts

12 β's - Where's the action?
If we have lots of variables, lots of cases, yet often quite simple techniques and software, the action is primarily in the variable constructions…
- The example of social mobility research: see Lambert et al. (2007)
  i. How we chose between alternative measures
  ii. How much data management we try (or bother with)
Plus other issues in how we analyse & interpret the results of analysis

13 Working with variables - speculation
Data manipulation skills and inertia
- I would speculate that around 80% of applications using key variables don't consult the literature and evaluate alternative measures, but choose the first convenient and/or accessible variable in the dataset
- Data supply decisions (what is on the archive version) are critical
- Much of the explanation lies with lack of confidence in data manipulation / linking data
- Too many under-used resources – cf.

14 Working with variables – further issues
Re-inventing the wheel
- …In survey data analysis, somebody else has already struggled through the variable constructions you are working on right now…
- Increasing attention to documentation and replicability [cf Dale 2006; Freese 2007]
Guidance and support
- In the UK, use
- Most guidance concerns collecting & harmonising data
- Less is directed to analytically exploiting measures

15 3) Multivariate contexts
"Betas in Society" and "Demystifying Coefficients"
- Dorling, D., & Simpson, S. (Eds.). (1999). Statistics in Society: The Arithmetic of Politics. London: Arnold.
- Irvine, J., Miles, I., & Evans, J. (Eds.). (1979). Demystifying Social Statistics. London: Pluto Press.
Famous works on critical interpretation of social statistics tend to have a univariate / bivariate focus
- Measuring unemployment; averaging income; bivariate significance tests; correlation vs causation
But social survey analysts usually argue that complex multivariate analyses are more appropriate..
- Critical interpretation of joint relative effects
- Attention to effects of key variables in multivariate analysis

16 "Socio-economic processes require comprehensive approaches as they are very complex ('everything depends on everything else'). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available." [Crouchley and Fligelstone 2004]

17 Endogeneity and variable measures
"everything depends on everything else" [Crouchley and Fligelstone 2004]
The effects of concepts are very difficult to isolate
Example of key variables (e.g. education, occupations)
- Key variables often change the main effects of other variables
- Simple decisions about contrast categories can influence interpretations
- Interaction terms are often significant and influential
- Key variables are often endogenous (because they are key!)
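
The point about contrast categories can be shown with a small sketch. It relies on one standard fact: in a regression with a single categorical predictor entered as dummy variables, each coefficient equals the difference between that group's mean and the reference group's mean. The scores below are hypothetical.

```python
from statistics import mean

# Hypothetical outcome scores by education level.
scores = {
    "low": [10.0, 12.0, 14.0],
    "mid": [15.0, 17.0],
    "high": [22.0, 24.0, 26.0],
}
group_means = {group: mean(values) for group, values in scores.items()}

def contrasts(reference):
    """Dummy-variable effects relative to the chosen reference category."""
    base = group_means[reference]
    return {g: m - base for g, m in group_means.items() if g != reference}

vs_low = contrasts("low")    # {'mid': 4.0, 'high': 12.0}
vs_high = contrasts("high")  # {'low': -12.0, 'mid': -8.0}
```

The underlying group means are identical in both cases. Only the reported coefficients, and so the story a table of results appears to tell, change with the reference category.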

18 4) Functional form
The way in which measures are arithmetically incorporated in analysis
- Level of measurement (nominal, ordinal, interval, ratio)
- Alternative models and link functions
- Other variables and interaction effects

19 Levels of measurement and the desire to categorise
Categories are easier to envisage / communicate
- Much harmonisation work involves locating cases into categories
- Appearance of measurement equivalence
- But functional equivalence is seldom achieved
Metrics are better for functional equivalence
- E.g. standardised income
- How to deal with categorisations? "The qualitative foundation of quantity" [Prandy 2002]
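
"Standardised income" can be read as expressing each income relative to the distribution in its own context. The sketch below assumes that reading and uses hypothetical incomes in two local currencies.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical incomes in local currency units: (country, income).
rows = [("UK", 20000.0), ("UK", 30000.0), ("UK", 40000.0),
        ("SE", 200000.0), ("SE", 300000.0), ("SE", 400000.0)]

by_country = defaultdict(list)
for country, income in rows:
    by_country[country].append(income)

# Mean and (population) standard deviation within each country.
stats = {c: (mean(v), pstdev(v)) for c, v in by_country.items()}

# z-score of each income relative to its own country's distribution.
z_scores = [(c, (x - stats[c][0]) / stats[c][1]) for c, x in rows]
```

After standardising, the lowest earner in each country has the same score even though the raw amounts are not comparable.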

20 Example: categorisation and the scandalous use of collapsed EGP/NS-SEC…!
- Ignores heterogeneity within occupations
- Defines and hinges on arbitrary boundaries
- Creates artefactual gender differences
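
One way to see what collapsing ignores is to compare how much outcome variation is captured between groups (eta-squared) under a detailed and a collapsed coding. The data, occupation labels, and class labels below are hypothetical and only serve the illustration.

```python
from statistics import mean

# Hypothetical rows: (detailed occupation, collapsed class, outcome).
rows = [
    ("teacher", "I-II", 60.0), ("teacher", "I-II", 62.0),
    ("manager", "I-II", 80.0), ("manager", "I-II", 82.0),
    ("driver", "VII", 30.0), ("driver", "VII", 32.0),
    ("cleaner", "VII", 20.0), ("cleaner", "VII", 22.0),
]

def eta_squared(key_index):
    """Share of outcome variance lying between the groups of one coding."""
    grand = mean(r[2] for r in rows)
    total_ss = sum((r[2] - grand) ** 2 for r in rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[key_index], []).append(r[2])
    between_ss = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    return between_ss / total_ss

eta_detailed = eta_squared(0)   # grouping by detailed occupation
eta_collapsed = eta_squared(1)  # grouping by collapsed class
```

The gap between the two values is the within-class heterogeneity that the collapsed scheme treats as noise.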

21 The scaling alternative…
Many concepts can be reasonably regarded as metric
- cf. simplified / dichotomised categorisations
- Comparability / standardisation is easier with scales
- Complex / multi-process systems are easier with scales
  - Structural Equation Models
  - Interaction effects
Growing availability / use of distance score techniques
- Stereotyped ordered logit [slogit in Stata]
- Correspondence Analysis
- Latent variable models
…But scaling seems to be seen by some as a wicked, positivistic activity..!

22 Being creative with functional forms
Treiman (2009: 162): nonlinear specifications of time and age effects
- Year of birth effect on literacy in China: discontinuity at 1955; curve 1955-1967; knot at 1967
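
A specification of this general shape can be built from a handful of derived regressors. The layout below is an assumed illustration of a step, a curved segment, and a knot. It is not Treiman's actual specification, and the term names are hypothetical.

```python
def birth_year_terms(year):
    """Piecewise regressors for a year-of-birth effect (assumed layout)."""
    since_1955 = min(max(year - 1955, 0), 12)   # runs 0..12 across 1955-1967
    return {
        "trend": year - 1955,                   # overall linear trend
        "post_1955": 1 if year >= 1955 else 0,  # discontinuity (step) at 1955
        "curve_lin": since_1955,                # curved segment, linear part
        "curve_sq": since_1955 ** 2,            # curved segment, quadratic part
        "post_1967": max(year - 1967, 0),       # change of slope after the knot
    }

design = {year: birth_year_terms(year) for year in (1950, 1955, 1960, 1970)}
```

Each term would enter a regression as its own column, so the step, the curvature, and the post-1967 slope each receive a separate coefficient.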

23 Practical suggestions on functional form
- It's rare not to have a few alternative measures of the same concepts at different levels of measurement
- Good practice would be to
  - try alternative measures and see what difference they make
  - consider treatment of missing values in relation to measurement instrument choice
- Engage as much as possible with other studies

24 Variable operationalisations
Analytical and conceptual issues
1) Harmonisation naming (!)
2) The value of trying multiple measures and standardisations
3) Multivariate vs univariate context
4) Functional form
Thinking about key variables

25 Key variables and social science measurement
Defining key variables
- Commonly used concepts with numerous previous examples
- Methodological research on best practice / best measurement [cf. Stacey 1969; Burgess 1986]
ONS harmonisation primary standards

26 Key variables: concepts and measures
Variable     | Concept                                      | Something useful
Occupation   | Class; stratification; unemployment          |
Education    | Credentials; ability                         | [Schneider 2008]
Ethnic group | Ethnicity; race; religion; national origins  | [Bosveld et al 2006]
Age          | Age; life course stage; cohort               | [Abbott 2006]
Gender       | Gender; household / family context           |
Income       | Income; wealth; poverty                      | [SN 3909]

27 An example: Occupations
In the social sciences, occupation is seen as one of the most important things to know about a person
- Direct indicator of economic circumstances
- Proxy indicator of social class or stratification
Projects at Stirling
- GEODE: how social scientists use data on occupations
- DAMES: extending GEODE resources

28 Stage 1 - Collecting Occupational Data (and making a mess)
Example 1: BHPS
Occ description        | Employment status | SOC-2000 | EMPST
Miner (coal)           | Employee          | 8122     | 7
Police officer (Serg.) | Supervisor        | 3312     | 6
Electrical engineer    | Employee          | 2123     | 7
Retail dealer (cars)   | Self-employed w/e | 1234     | 2
Example 2: European Social Survey, parents' data
Occ description           | SOC-2000 | EMPST
Miner                     | ?8122    | ?6/7
Police officer            | ?3312    | ?6/7
Engineer                  | ?        | ?
Self employed businessman | ??       | ?1/2

30 Occupations: we agree on what we should do
Preserve two levels of data
- Source data: occupational unit groups, employment status
- Social classifications and other outputs
Use transparent (published) methods [i.e. OIRs]
- for classifying index units
- for translating index units into social classifications
For instance:
- Bechhofer, F. 1969. 'Occupations' in Stacey, M. (ed.) Comparability in Social Research. London: Heinemann.
- Jacoby, A. 1986. 'The Measurement of Social Class'. Proceedings from the Social Research Association seminar on "Measuring Employment Status and Social Class". London: Social Research Association.
- Lambert, P.S. 2002. 'Handling Occupational Information'. Building Research Capacity 4: 9-12.
- Rose, D. and Pevalin, D.J. 2003. 'A Researcher's Guide to the National Statistics Socio-economic Classification'. London: Sage.

31 …in practice we don't keep to this...
Inconsistent preservation of source data
- Alternative OUG schemes
  - SOC-90; SOC-2000; ISCO; SOC-90 (my special version)
- Inconsistencies in other index factors
  - employment status; supervisory status; number of employees
  - individual or household; current job or career
Inconsistent exploitation of Occupational Information
- Numerous alternative occupational information files
  - (time; country; format)
- Inconsistent translations to social classifications: by file or by fiat
- Dynamic updates to occupational information resources
- Strict security constraints on users' micro-social survey data

32 GEODE provides services to help social scientists deal with occupational information resources
1) Disseminate, and access other, Occupational Information Resources
2) Link together their (secure) micro-data with OIRs
External user (micro-social data):
id | oug | sex
1  | 110 | 1
2  | 320 | 1
3  | 320 | 2
4  | 874 | 1
5  | 874 | 2
Occ info (index file) (aggregate):
oug | CS-M | CS-F | EGP
110 | 60   | 58   | I
320 | 69   | 71   | II
874 | 39   | 51   | VIIa
User's output (micro-social data):
id | oug | CS
1  | 110 | 60
2  | 320 | 69
3  | 320 | 71
4  | 874 | 39
5  | 874 | 51
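
The linking step on this slide is, in essence, a lookup of each respondent's occupational unit group in an index file. The sketch below is not GEODE's implementation. The values follow one reading of the slide's example table, and the sex coding (1 = male, 2 = female) and all names are assumptions.

```python
# Index file (aggregate): occupational unit group -> scores and class.
OCC_INDEX = {
    110: {"cs_m": 60, "cs_f": 58, "egp": "I"},
    320: {"cs_m": 69, "cs_f": 71, "egp": "II"},
    874: {"cs_m": 39, "cs_f": 51, "egp": "VIIa"},
}

# User micro-data: (id, oug, sex); sex coded 1 = male, 2 = female (assumed).
micro = [(1, 110, 1), (2, 320, 1), (3, 320, 2), (4, 874, 1), (5, 874, 2)]

def link(row):
    """Attach the sex-specific scale score for the respondent's unit group."""
    pid, oug, sex = row
    info = OCC_INDEX.get(oug)  # None if the unit group is not in the index
    if info is None:
        return {"id": pid, "oug": oug, "cs": None}
    cs = info["cs_m"] if sex == 1 else info["cs_f"]
    return {"id": pid, "oug": oug, "cs": cs}

linked = [link(row) for row in micro]
```

Unmatched unit groups are returned with a missing score rather than dropped, so gaps in the index file stay visible in the output.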

33 Existing resources on occupations
Popular websites:
Emerging resource:
Some papers:
- Chan, T. W., & Goldthorpe, J. H. (2007). Class and Status: The Conceptual Distinction and its Empirical Relevance. American Sociological Review, 72, 512-532.
- Rose, D., & Harrison, E. (2007). The European Socio-economic Classification: A New Social Class Scheme for Comparative European Research. European Societies, 9(3), 459-490.
- Lambert, P. S., Tan, K. L. L., Gayle, V., Prandy, K., & Bergman, M. M. (2008). The importance of specificity in occupation-based social classifications. International Journal of Sociology and Social Policy, 28(5/6), 179-192.

34 Using data on occupations – further speculation
Growing interest in longitudinal analysis and use of longitudinal summary data on occupations
- Intuitive measures (e.g. "ever in Class I"): Lampard, R. (2007).
- Empirical career trajectories / sequences: Halpin, B., & Chan, T. W. (1998).
Growing cross-national comparisons: Ganzeboom, H. B. G. (2005).
Treatment of the non-working populations
- Seldom adequate to treat non-working as a category
- Selection modelling approaches expanding
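
An "intuitive" longitudinal summary such as "ever in Class I" is a simple derived variable over a career sequence. The sketch below uses hypothetical careers and also counts spells as a second, equally simple summary. Neither is taken from the cited papers.

```python
# Hypothetical class careers: one class code per observed year.
careers = {
    "p1": ["III", "II", "I", "I"],
    "p2": ["VII", "VII", "VI"],
    "p3": ["I", "II", "II"],
}

def ever_in(sequence, target="I"):
    """True if the target class appears anywhere in the career."""
    return target in sequence

def n_spells(sequence):
    """Number of consecutive runs of the same class."""
    return 1 + sum(a != b for a, b in zip(sequence, sequence[1:]))

summary = {
    pid: {"ever_class_I": ever_in(seq), "n_spells": n_spells(seq)}
    for pid, seq in careers.items()
}
```

Summaries like these discard ordering and timing, which is where sequence-based approaches differ.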

35 Occupations as key variables
Extensive debate about occupation-based social classifications
- Document your procedures.. you may be asked to do something different..
When choosing between occupation-based measures…
- They all measure, mostly, the same things
- Don't assume concepts measure measures
  - Lambert, P. S., & Bihagen, E. (2007). Concepts and Measures: Empirical evidence on the interpretation of ESeC and other occupation-based social classifications. Paper presented at the ISA RC28 conference, Montreal (14-17 August).

36 Data management and key variables
In DAMES, we identify three important categorical variables (occupations, educational qualifications, ethnicity), and collect information about them in order to improve data management and hence exploitation of such data
- Key social science variables
- Existing resources (and metadata & support on those resources)
- UK and beyond

37 Occupational Information Resources
- Small databases (square electronic files) linking lists of occupational positions (occupational unit groups) with information about those positions
- Many existing resources already used in academic research (> 1000)

38 Educational information resources
- Small databases (often on paper) linking lists of educational qualifications with information about them
- Many existing resources (> 500), but less communication between them
- [Part of UK scheme from ONS (2008)]

39 Ethnic Minority / Migration Information Resources
- Data which links measures of ethnicity / migration status with other information
- In high demand, but few existing resources (? < 500)

40 Summary – Variable operationalisations and social science
We argue that the route to better critical understanding of variable effects combines complex analysis with many mundane, prosaic tasks in checking data
- ANALYSIS: Coefficient effects in multivariate models; multi-process models; understanding interactions; etc.
- DATA MANAGEMENT: Re-coding data; linking data; missing data mechanisms; reviewing literature
  - Seldom central to previous methodological reviews
  - Cf.

41 Appendix

42 Existing resources (i): Data providers
a) Documentation and metadata files

43 Existing resources (i): Data providers
b) Resources for variables
   - CESSDA PPP on key variables
   - UK Question Bank
   - ONS Harmonisation
c) Resources for datasets
   - UK Census data portal
   - IPUMS international census data facilities
   - European Social Survey
d) Data manipulations prior to data release
   - Missing data imputation / documentation
   - Survey design / weighting information
   - Influential: most analysts use the archive version

44 Existing resources (ii): Resource projects / infrastructures
- UK ESDS
  - ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
  - Helpdesks; online instructions; user support..
- UK ESRC NCRM / NCeSS / RDI initiatives
  - Longitudinal data –
  - Linking micro/macro -
- Other resources / projects / initiatives
  - EDACwowe -
  - ….

45 Existing resources (iii): Analytical and software support
Textbooks featuring data management
- [Levesque 2008] [Sarantakos 2007] [Long 2009]
Software training covering DM
- Stata's data management manual
- SPSS user group course on syntax and data management
But generally, sustained marginalisation of DM as a topic
- Advanced methods texts use simplistic data
- Advanced software for analysis isn't usually combined with extended DM requirements

46 Existing resources (iv): Data analysts' contributions
Academic researchers often generate and publish their own DM resources
- E.g. Harry Ganzeboom on education and occupations
- Provision of whole or partial syntax programming examples
Analysts often drive wider resource provisions related to DM
- CAMSIS project on occupational scales
- CASMIN project on education and social class

47 Existing resources (v): Literatures on harmonisation and standardisation
- National Statistics Institutes' principles and practices, e.g. ONS
- Cross-national organisations, e.g. UNSTATS
- Academic studies, e.g. [Harkness et al 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]

48 References
Abbott, A. (2006). Mobility: What? When? How? In S. L. Morgan, D. B. Grusky & G. S. Fields (Eds.), Mobility and Inequality. Stanford University Press.
Bosveld, K., Connolly, H., & Rendall, M. S. (2006). A guide to comparing 1991 and 2001 Census ethnic group data. London: Office for National Statistics.
Burgess, R. G. (Ed.). (1986). Key Variables in Social Investigation. London: Routledge.
Crouchley, R., & Fligelstone, R. (2004). The Potential for High End Computing in the Social Sciences. Lancaster: Centre for Applied Statistics, Lancaster University, and
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Dorling, D., & Simpson, S. (Eds.). (1999). Statistics in Society: The Arithmetic of Politics. London: Arnold.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-172.
Ganzeboom, H. B. G. (2005). On the Cost of Being Crude: A Comparison of Detailed and Coarse Occupational Coding. In J. H. P. Hoffmeyer-Zlotnik & J. Harkness (Eds.), Methodological Aspects in Cross-National Research (pp. 241-257). Mannheim: ZUMA, Nachrichten Spezial.
Halpin, B., & Chan, T. W. (1998). Class Careers as Sequences: An optimal matching analysis of work-life histories. European Sociological Review, 14(2), 111-130.
Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.
Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.
Irvine, J., Miles, I., & Evans, J. (Eds.). (1979). Demystifying Social Statistics. London: Pluto Press.
Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.
Lambert, P. S., Prandy, K., & Bottero, W. (2007). By Slow Degrees: Two Centuries of Social Reproduction and Mobility in Britain. Sociological Research Online, 12(1).
Lambert, P. S., Tan, K. L. L., Gayle, V., Prandy, K., & Bergman, M. M. (2008). The importance of specificity in occupation-based social classifications. International Journal of Sociology and Social Policy, 28(5/6), 179-192.
Lampard, R. (2007). Is Social Mobility an Echo of Educational Mobility? Parents' Educations and Occupations and Their Children's Occupational Attainment. Sociological Research Online, 12(5).
Li, Y., & Heath, A. F. (2008). Socio-Economic Position and Political Support of Black and Ethnic Minority Groups in the United Kingdom, 1972-2005 [computer file]. 2nd Edition. Colchester, Essex: UK Data Archive [distributor], SN: 5666.
Prandy, K. (2002). Measuring quantities: the qualitative foundation of quantity. Building Research Capacity, 2, 3-4.
Procter, M. (2001). Analysing Survey Data. In G. N. Gilbert (Ed.), Researching Social Life, Second Edition (pp. 252-268). London: Sage.
Schneider, S. L. (2008). The International Standard Classification of Education (ISCED-97). An Evaluation of Content and Criterion Validity for 15 European Countries. Mannheim: MZES.
Simpson, L., & Akinwale, B. (2006). Quantifying Stability and Change in Ethnic Group. Manchester: University of Manchester, CCSR Working Paper 2006-05.
Stacey, M. (Ed.). (1969). Comparability in Social Research. London: Heinemann.
Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
