An introduction to the large-scale Government Surveys & Samples of Anonymised Records Jo Wathan ESDS(Government) & SARs support team CCSR, University of Manchester
Today What data is available? What is it like? Considerations when using the data How are they used in research? How do you access them? Resources & Support
Why should you want to know? Because the data are... Very cost effective: data free of charge to academic researchers Saves time: no need to conduct survey Access to high quality, well documented data Can provide nationally representative data allows generalisation to population Allows historical and geographical comparisons to be made ESRC funded data support services
What data am I talking about? UK is particularly rich in microdata which is available for secondary analysis Today focus on cross-sectional microdata from government surveys and The Census –Samples of Anonymised Records –ESDS Government Surveys (e.g. LFS, GHS) Other major sources: –Longitudinal data (e.g. LS, BHPS) –International microdata (e.g. ESS) –ESDS core function/UK Data Archive –Aggregate data
The Samples of Anonymised Records (SARs) Microdata samples from Census 1991 & 2001 Available for the first time after research into the confidentiality risk More flexible than conventional aggregate tables
SAR files (sample fraction and geography, by file type):
–1991 (GB/NI): Individual 2% with SAR area; Household 1% with Region; Small Area Microdata none
–2001 licensed data: Individual 3% with GOR (UK); Household 1% England & Wales only (special licence); Small Area Microdata 5% with LA/UA/PC
–2001 Controlled Access Microdata: Individual 3% with LA/UA/PC; Household 1% with LA/UA/PC; Small Area Microdata none
What's in the SARs? UK Census Microdata Census has high response rate because compulsory –1991 only enumerated cases in data –2001 missing people are imputed Census topics only – brief self-completion form –Accommodation, transport, socio-economic characteristics, ethnicity, religion, health Anonymised and data limited to ensure confidentiality –Most restrictive in the end user licence files for 2001, e.g. less geography in the individual and household files, age banded –Unusual cases perturbed Extremely large sample sizes!
ESDS Government Surveys General Household Survey Labour Force Survey Family Resources Survey Expenditure and Food Survey (previously the National Food Survey and Family Expenditure Survey) ONS Omnibus Survey National Travel Survey Time Use Survey British Crime Survey/Scottish Crime Survey British Social Attitudes/Scottish Social Attitudes/Northern Ireland Life & Times/Young People's Social Attitudes Health Survey for England/Wales/Scotland Survey of English Housing (England only)
What are ESDS Government data like? Nationally representative survey microdata Large sample sizes (but smaller than the SARs) Identifying information is removed Most are conducted on an annual basis Continuous surveys – always up-to-date Cross-sectional (although the LFS has a 5-quarter panel element) Specialist topic surveys – more depth than the Census
All of these microdata are: Individual information akin to the sort of data you would collect if you were conducting your own survey Need to be analysed in an appropriate software package (like SPSS or Stata) Cross-sectional snapshots (exception: the LFS is actually 5 snapshots per address!) Good quality collected by a professional data collection organisation –Office for National Statistics –National Centre for Social Research Collected for policy purposes Has good quality documentation & support services
Thinking about using the data? 1. What is your research question? 2. What evidence do you need to answer your research question? 3. Is the evidence you need already available? Check the literature and published reports. 4. Is cross-sectional secondary microdata appropriate for your research question? Is your question quantitative? Do you need to follow individuals over time? 5. Is data available?
Locating and assessing data Locating data: –What data is available for my topic? –Are the variables I need available? Assessing data for analysis: –What population is the sample drawn from? –What sampling scheme was used? –Do I need to weight?
What datasets cover my topic? Question Bank –has topic guides and a search engine across questionnaires Census topics: –Limited due to legislation, scale & self-completion; –View the codebooks to see what data is in which files on SARs web pages Finding topics in surveys: –Much wider range of topics from large number of different sources –ESDS Government topic guides on employment, health, social capital, Scotland –ESDS/UK Data Archive search engine
What variables are available for my topic? To understand the variables you have available –View the documentation/user guide –A list of variables & codings should be available –Information on how derived variables were created should be available –Double check in the dataset!
What do the variables mean? Unless... you can track your variable back to the question(s) asked on the questionnaire, know who the questions were asked of, and know what was done with the raw data to turn it into the final data... you don't understand the data
Routeing in the documentation: GHS
Variable Name : ECSTILO
Variable Label : Economic status (harmonised)
Topic : Employment
Population : Adults
Hhld/indiv.level : Individual
Range : 1 to 10
Missing values : -6, -8
1 'Working (incl Unpaid FW'
2 'Gov sch with emp'
3 'Gov sch at coll'
4 'Unemployed (ILO)'
5 'Other Unemployed'
7 'Retired'
6 'Perm unable to work'
8 'Keeping house'
9 'Student'
10 'Other inactive'
-8 'NA, ECSTA not known'
-6 'Child/No int'.
Derived variables
DO IF SCHEDTYP = 3 OR AGE LT
  COMPUTE ECSTILO = -6.
ELSE.
+ DO IF DVILO3A = 1.
+   DO IF SCHEMEET = 1.
+     DO IF TRN = 1.
+       COMPUTE ECSTILO = 2.
+     ELSE IF TRN = 2.
+       COMPUTE ECSTILO = 3.
+     END IF.
+   ELSE.
+     COMPUTE ECSTILO = 1.
+   END IF.
+ ELSE IF DVILO3A = 2.
+   COMPUTE ECSTILO = 4.
+ ELSE IF DVILO3A = 3.
+   DO IF YINACT = 1.
+     COMPUTE ECSTILO = 9.
+   ELSE IF YINACT = 2.
+     COMPUTE ECSTILO = 8.
+   ELSE IF YINACT = 3.
+     COMPUTE ECSTILO = 10.
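The visible part of this derivation can be read as a nested decision rule. The Python sketch below mirrors only the branches shown in the documentation extract. Two things are assumptions: the age cut-off is missing from the transcript, so it is passed in as a hypothetical `min_age` parameter, and any case not covered by the truncated syntax returns `None` rather than a guessed code. SPSS missing-value handling is not reproduced.

```python
from typing import Optional


def derive_ecstilo(schedtyp: int, age: int, dvilo3a: int,
                   schemeet: Optional[int] = None,
                   trn: Optional[int] = None,
                   yinact: Optional[int] = None,
                   *, min_age: int) -> Optional[int]:
    """Mirror the visible branches of the ECSTILO derivation.

    min_age is hypothetical: the threshold after 'AGE LT' is missing
    from the transcript. None means 'not covered by the visible syntax'.
    """
    if schedtyp == 3 or age < min_age:
        return -6                      # 'Child/No int'
    if dvilo3a == 1:
        if schemeet == 1:
            if trn == 1:
                return 2               # 'Gov sch with emp'
            if trn == 2:
                return 3               # 'Gov sch at coll'
            return None                # no COMPUTE shown for other TRN values
        return 1                       # 'Working'
    if dvilo3a == 2:
        return 4                       # 'Unemployed (ILO)'
    if dvilo3a == 3:
        # Only YINACT 1 to 3 are visible before the extract is cut off.
        return {1: 9, 2: 8, 3: 10}.get(yinact)
    return None


# Example: on a scheme at college, with an illustrative cut-off of 16.
print(derive_ecstilo(1, 30, 1, schemeet=1, trn=2, min_age=16))
```

Tracing a few invented cases through a function like this is one way to check that you have understood who ends up in each category before you analyse the variable.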
The population base: nation Most large scale surveys seek to be nationally representative but what is a nation? –Labour Force Survey = UK –General Household Survey = GB (but strange things can happen North of the Caledonian Canal) –Health Survey for England = England –Not always apparent from the name –Increase of country-specific surveys following devolution Over 80% of the population live in England (9% Scotland, 5% Wales, 3% NI) so surveys designed for UK wide analyses will not generally have large enough samples to analyse separate countries
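To see why UK-wide samples struggle with separate-country analysis, it helps to work through the arithmetic. The sketch below applies the population shares quoted above to a hypothetical UK sample of 10,000 respondents, assuming proportionate sampling with no country boosts. The sample size is invented for illustration.

```python
# Population shares quoted on the slide (England is given only as "over 80%").
shares = {"Scotland": 0.09, "Wales": 0.05, "Northern Ireland": 0.03}

uk_sample = 10_000  # hypothetical sample size, chosen for illustration


def expected_cases(sample_size: int, share: float) -> int:
    """Expected respondents from a country under proportionate sampling."""
    return round(sample_size * share)


expected = {country: expected_cases(uk_sample, share)
            for country, share in shares.items()}

for country, n in expected.items():
    print(f"{country}: about {n} respondents")
```

A few hundred cases for a whole country shrinks quickly once you subset by age, sex or employment status.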
Population base: type of survey Most large scale surveys are household surveys: they interview 1+ person in private households –This will exclude people in institutions –Has knock-on effects for particular topics: health, age etc. Surveys tend to gather limited information about children –May only relate to their existence, age and relationships to other household members –There may also be other age restrictions on all or part of the survey
Population base - setting You may need to subset to obtain a reasonable database –SARs 1991 could double count visitors (at place of residence AND location on Census night) –SARs 2001 can double count students (at place of termtime residence AND parental address) –Need to subset to prevent double counting
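Subsetting of this kind is usually a one-line filter once the right flag has been identified in the codebook. The sketch below uses toy records and hypothetical field names (`student`, `at_termtime_address`), not actual SARs variable names. It keeps students only at their term-time address, which is one possible choice; which record to keep depends on your research question.

```python
# Toy records; field names are hypothetical, not actual SARs variables.
records = [
    {"id": 1, "student": True, "at_termtime_address": True},
    {"id": 2, "student": True, "at_termtime_address": False},  # parental address
    {"id": 3, "student": False, "at_termtime_address": None},
]


def keep_record(rec: dict) -> bool:
    """Keep non-students, and students only at their term-time address."""
    return (not rec["student"]) or rec["at_termtime_address"] is True


analysis_base = [rec for rec in records if keep_record(rec)]
print([rec["id"] for rec in analysis_base])
```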
The sampling strategy will affect your results Few data sources approximate simple random sampling – the SARs does Stratification increases the precision of estimates – the Labour Force Survey is stratified Clustering reduces the precision of estimates – e.g. the General Household Survey Many major surveys use stratification and clustering Guidance should be available in the documentation PEAS website
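A common rule of thumb for the cost of clustering is Kish's approximate design effect, deff = 1 + (m - 1) * rho, where m is the average number of respondents per cluster and rho is the intra-cluster correlation. This formula is not taken from the slides, and the values of m and rho below are invented for illustration. Real design effects should come from the survey documentation or from survey-aware software.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for a clustered sample."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n: int, deff: float) -> float:
    """Size of the simple random sample that would give the same precision."""
    return n / deff


# Illustrative values only: 10,000 respondents, 20 per cluster, rho = 0.05.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(10_000, deff)
print(f"design effect {deff:.2f}, effective sample size {n_eff:.0f}")
```

With these made-up values the clustered sample behaves like a simple random sample of roughly half its nominal size, which is why standard errors that ignore the design tend to be too small.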
Disproportionate sampling The British Social Attitudes survey takes only 1 person per household –If left like this the chance of selection in the sample would be inversely proportional to the size of one's household Over-sampling in order to obtain satisfactory sample sizes for minority groups (often referred to as boosts) –Health Survey for England has done this with ethnic minorities
Weighting can be used to prevent bias from disproportionate sampling [Table: number in household including respondent; frequency and % of all, weighted and unweighted] Dataset: British Social Attitudes Survey, 2003
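The logic of a selection weight can be shown with a few invented respondents. When one person is interviewed per household, a person in a household of size k has a 1-in-k chance of being the one chosen, so a weight of k restores the balance. The sketch below uses made-up household sizes, not British Social Attitudes data, and it ignores the non-response and calibration adjustments that real survey weights usually include.

```python
from collections import Counter

# Made-up data: household size for each of eight respondents.
household_size = [1, 1, 1, 2, 2, 2, 2, 3]


def distribution(values, weights):
    """Share of the (weighted) total falling in each category."""
    totals = Counter()
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {value: totals[value] / grand_total for value in sorted(totals)}


unweighted = distribution(household_size, [1] * len(household_size))
# Selection weight = household size (inverse of the chance of selection).
weighted = distribution(household_size, household_size)

for size in unweighted:
    print(size, f"{unweighted[size]:.1%} unweighted", f"{weighted[size]:.1%} weighted")
```

In this toy example people living alone make up 37.5% of the unweighted sample but about 21% after weighting, the same direction of correction the weighted and unweighted columns of the table are meant to show.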
Non-response trends – another reason for weighting Source: Barton in ESDS weighting guide
Imputation: 2001 SARs [Table: ethnic group (White, Mixed, Asian, Black, Chinese/Other, All) by Not ONC imputed / ONC imputed]
Exercise Suggest datasets which would fulfil the following criteria, for a range of employment projects: 1. A large up-to-date UK dataset with extensive questions on employment and training 2. The maximum possible sample size for a single time point to allow minority groups to be distinguished in analysis 3. Any 1960s employment microdata 4. A dataset with extensive questions on income from sources other than just earnings 5. A dataset which could be used to look at attitudes to work
What would you use the data for? Straightforward secondary analysis –To assess theoretical accounts –To quantify characteristics or behaviours –To challenge official views –To apply alternative definitions Context to your own primary research –Your research could be quantitative or qualitative –To assess the national context of an area study –To assess whether your sample is typical –To assess the scale of behaviours
Practical research uses of the data Looking at change over time Look at sub-populations Using the flexibility of the data to look at alternative definitions Looking within households
Secondary analysis: change for subpopulations Marmot, M (2003)
Using successive cross-sectional data over time Pros… Reasonable amount of comparability Can pool years/quarters Data is representative at each time point Good at looking at impacts on groups Cons… Limits to continuity in the data (e.g. ethnic) Cannot establish individual change
Looking at small populations Many surveys with 10+k respondents –Permits minority groups to be represented –Rare subpopulations sample size may be too small… can consider combining years if appropriate Largest sample sizes available from the Samples of Anonymised Records –The Small Area Microdata file contains nearly 3 million records!
Survey data is subject to sampling error! Example: Pregnancy and Employment Using General Household Survey data alone there are only 168 pregnant women aged … Confidence interval for % of pregnant women economically inactive: 34.2 – 49.1% Combined 3 years' data to obtain sample of 465 pregnant women Confidence interval using 3 years' data: 34.9 – 43.9% Combining datasets to increase sample size
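The narrowing of the interval follows from the usual normal-approximation formula, p ± z * sqrt(p(1 - p) / n). The sketch below is a rough check, not the original calculation: the point estimates are assumed to be the midpoints of the intervals quoted above, z = 1.96 (a 95% interval) is an assumption, and simple random sampling is assumed even though the General Household Survey is clustered.

```python
from math import sqrt


def wald_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumed point estimates: midpoints of the intervals quoted on the slide.
one_year = wald_interval(0.4165, 168)
three_years = wald_interval(0.394, 465)

for label, (low, high) in [("1 year", one_year), ("3 years", three_years)]:
    print(f"{label}: {low:.1%} to {high:.1%}")
```

Under these assumptions the results come out within about a tenth of a percentage point of the quoted intervals. Pooling three years roughly triples the sample, which shrinks the interval width by a factor of about the square root of 465/168, or roughly 1.7.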
Using the flexibility of the data to look at alternative definitions What are hours worked? Is it just paid work? Or unpaid as well? Hours usually worked, or actually worked last week? In main job, or in any job? What about students? Overtime – paid? Overtime – unpaid? Lunch hours? Do non-workers work zero hours or should they be excluded?
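Because microdata hold the component variables separately, each of these choices becomes an explicit switch in your own code. The sketch below uses invented people and hypothetical field names (real variable names differ between surveys) to show how the average changes with the definition and with the treatment of non-workers.

```python
# Invented records; field names are hypothetical, not real survey variables.
people = [
    {"works": True, "usual_main": 35.0, "usual_other": 5.0, "unpaid_overtime": 2.0},
    {"works": True, "usual_main": 20.0, "usual_other": 0.0, "unpaid_overtime": 0.0},
    {"works": False, "usual_main": 0.0, "usual_other": 0.0, "unpaid_overtime": 0.0},
]


def hours(person, *, any_job=False, unpaid_overtime=False):
    """Weekly hours for one person under the chosen definition."""
    total = person["usual_main"]
    if any_job:
        total += person["usual_other"]
    if unpaid_overtime:
        total += person["unpaid_overtime"]
    return total


def mean_hours(records, *, include_non_workers, **definition):
    """Mean hours, counting non-workers as zero or excluding them."""
    base = [p for p in records if include_non_workers or p["works"]]
    return sum(hours(p, **definition) for p in base) / len(base)


narrow = mean_hours(people, include_non_workers=False)
broad = mean_hours(people, include_non_workers=False,
                   any_job=True, unpaid_overtime=True)
with_zeros = mean_hours(people, include_non_workers=True)
print(narrow, broad, round(with_zeros, 1))
```

The same three people yield three different "average hours worked", so the definition needs to be stated alongside any published figure.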
Hierarchical data: conceptually
Household 1: North West, Social rented
–Person 1: HoH, Female, 28, GCSE, P/T Work, No LTILL
–Person 2: Son of HoH, Male, 12, N/A, No LTILL
Household 2: Wales, Owner occupier
–Person 1: HoH, Male, 33, Degree, F/T Employee, No LTILL
–Person 2: Spouse of HoH, Female, 31, Degree, P/T Employee, No LTILL
–Person 3: Parent of HoH, Female, 72, No quals, Econ Inactive, LTILL
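The two example households can be written as a nested structure, which makes it easy to derive household-level variables from person-level ones and to attach household characteristics to each person. The sketch below is one possible representation using the values from the diagram. The field names and the set of categories counted as "working" are choices made here for illustration.

```python
households = [
    {"region": "North West", "tenure": "Social rented", "persons": [
        {"relation": "HoH", "sex": "Female", "age": 28,
         "quals": "GCSE", "activity": "P/T work", "ltill": False},
        {"relation": "Son of HoH", "sex": "Male", "age": 12,
         "quals": None, "activity": None, "ltill": False},
    ]},
    {"region": "Wales", "tenure": "Owner occupier", "persons": [
        {"relation": "HoH", "sex": "Male", "age": 33,
         "quals": "Degree", "activity": "F/T employee", "ltill": False},
        {"relation": "Spouse of HoH", "sex": "Female", "age": 31,
         "quals": "Degree", "activity": "P/T employee", "ltill": False},
        {"relation": "Parent of HoH", "sex": "Female", "age": 72,
         "quals": "No quals", "activity": "Econ inactive", "ltill": True},
    ]},
]

# Categories treated as 'working' in this sketch (an illustrative choice).
WORKING = {"P/T work", "F/T employee", "P/T employee"}


def household_summary(household: dict) -> dict:
    """Derive household-level variables from the person records."""
    persons = household["persons"]
    return {
        "region": household["region"],
        "size": len(persons),
        "n_working": sum(1 for p in persons if p["activity"] in WORKING),
        "any_ltill": any(p["ltill"] for p in persons),
    }


summaries = [household_summary(h) for h in households]

# Flatten to one row per person, carrying the household's tenure along.
person_rows = [{**p, "tenure": h["tenure"]}
               for h in households for p in h["persons"]]
print(summaries)
```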
Source: Richard Dickens, Paul Gregg and Jonathan Wadsworth (2000) New Labour and the Labour Market, CMPO Working Paper Series 00/19 Table 5
Finding out about what's been/being done with the data User meetings –General Household Survey –Labour Force Survey –Health Surveys –Samples of Anonymised Records ESDS Government –Publications database –Usage pages
Accessing & Support Services The data teams: –ESDS Government –SARs team at CCSR Registering to use the data Special license and CAM data Getting support
SARs Data team CENSUS MICRODATA SUPPORT Register for the data Access SARs documentation for all SARs datasets Explore data online or download datasets in SPSS, Stata, or tab delimited form for: –1991 data, 2001 Individual licensed file, 2001 Small Area Microdata Information about 2001 Special Licence Household SAR – link to UK Data Archive for download
ESDS Government MAJOR CROSS-SECTIONAL UK SURVEYS Survey pages Introductory guides and resources including topic guides, weighting guide, software guides Links to relevant external resources Links to the UK Data Archive for –Register for the data –Download the data in Stata, SPSS etc. –Explore the data online in Nesstar –Access documentation
The licence All users need to be licensed Academics complete the licence as part of the Census Registration System process Non-academic users contact UK Data Archive (Surveys) or CCSR (SARs) to arrange registration – charges may apply Cannot pass the data to an unlicensed user Cannot attempt to identify an individual
The licence – good practice Keep your data password protected Destroy your data when you have finished using it Remove files before passing on your PC to someone else Tell the data team about your publications Tell the data team if you leave your institution
Special licence files Special licence is a new way of making more detailed data available to social researchers –Annual Population Survey data –Household SAR 2001 Full & legally binding paper registration process – requires institutional signature & ONS approval Must agree to extensive data stewardship conditions
Controlled Access Microdata SARs Controlled Access Microdata designed for professional researchers who have no other data options open to them Access in safe setting only at ONS site Specification on SARs website Individual file and Household file Files contain much more detail; e.g. –Individual year of age (topcoded at 95) –Full coding on country of birth –SOC Unit Group –Local authority geography –Index of Deprivation for SOAs –Index of Deprivation for migrants' last address Further information and appropriate forms at Contact for more details
User support SARs: helpdesk tel: (0161) SARS jiscmail list ESDS Government: helpdesk tel: (0161) ESDS-Govsurveys jiscmail list