Quantitative Data Preparation Alasdair Crockett, Data Services Manager UK Data Archive
What characterises a good quantitative dataset?
i) Accurate data
ii) Well labelled data
iii) Well documented data
iv) Data that can be stored in user-friendly dissemination formats, but can also be archived in a future-proof preservation format.
Accuracy of data: validation checks
Computer aided surveys (CAPI, CATI or CAWI): These are the most accurate way of gathering survey data, but the software (e.g. Blaise) and hardware (e.g. a laptop for every interviewer) may be beyond project resources. Computer aided surveys allow one to build in as many logical checks - on question routing and responses - as possible at the point of data creation.
Non computer aided surveys: There is less control over initial responses, but checks can still be performed:
i) At the point of data entry/transcription, if data entry software is used. However, there are few cheap data entry packages around.
ii) After data entry. The only feasible option may be to enter data without checks directly into a spreadsheet style interface (e.g. Excel worksheet, SPSS data view), and perform validation checks afterwards - via command files in statistical packages or Visual Basic code in Excel or Access.
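The post-entry checks described above might look something like the following minimal sketch. It is written in Python rather than in a statistical package's command language, and the variables (age, employed, hours), their codes and the permitted age range are all hypothetical, chosen only to show one range check and one routing check.

```python
# Hypothetical records, as they might be keyed into a spreadsheet without checks.
# employed: 1 = yes, 2 = no; hours is only asked of respondents with employed == 1.
records = [
    {"id": 1, "age": 34, "employed": 1, "hours": 38},
    {"id": 2, "age": 340, "employed": 1, "hours": 40},   # keying error in age
    {"id": 3, "age": 52, "employed": 2, "hours": 25},    # routing error
    {"id": 4, "age": 61, "employed": 2, "hours": None},
]

def validate(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Range check on a response (assumed valid range: 16 to 99).
    if not 16 <= record["age"] <= 99:
        problems.append("age out of range")
    # Routing check: hours should be blank unless the respondent is employed.
    if record["employed"] != 1 and record["hours"] is not None:
        problems.append("hours given but respondent not employed")
    return problems

# Keep only the records that failed at least one check.
report = {}
for record in records:
    problems = validate(record)
    if problems:
        report[record["id"]] = problems

for record_id, problems in report.items():
    print(record_id, problems)
```

Flagged records can then be checked against the paper questionnaires before any derived variables are created.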
An example of data seemingly untouched by the human eye:
Originating error in text variables:
Occupation: sole trader
Description of Occupation: purveyor of seafood
Propagated error in derived numeric variables: the respondent was coded under the Standard Industrial Classification (SIC) code relating to food retailers: 52.2 Retail sale of food, beverages and tobacco in specialised stores
Labelling of data
All variables should be named. Variable names should not exceed 8 characters where possible, as the most common format for disseminating data is SPSS.
All variables should be labelled. Labels should be brief (preferably < 80 characters) but precise, and should always make explicit the unit of measurement for continuous (interval) variables.
Where possible, all variable labels should reference the question number (and, if necessary, the questionnaire). For example, the variable q11bhexc might have the label "q11b: hours spent taking physical exercise in a typical week". This gives the unit of measurement and a reference to the question number (q11b), so the user can quickly and easily cross-reference to it.
For categorical variables, all codes (values) should be given a brief label (preferably < 60 characters). For example, p1sex (gender of person 1) might have these value labels: 1 = male, 2 = female, -8 = don't know, -9 = not answered.
Where possible, all such labelling should be created and supplied to the UKDA as part of the data file itself. This is the expectation with data supplied in one of the three major statistical packages - SPSS, STATA or SAS.
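As an illustration, the labelling guidelines above could be checked automatically before deposit. The sketch below assumes the variable metadata has already been read into a plain Python dictionary. The dictionary layout and the variable household_income are hypothetical; the limits (8, 80 and 60 characters) are those given above.

```python
MAX_NAME_LENGTH = 8        # variable names should not exceed 8 characters
VARIABLE_LABEL_LIMIT = 80  # variable labels preferably < 80 characters
VALUE_LABEL_LIMIT = 60     # value labels preferably < 60 characters

# Hypothetical metadata: variable name -> variable label and value labels.
variables = {
    "q11bhexc": {
        "label": "q11b: hours spent taking physical exercise in a typical week",
        "values": {},
    },
    "p1sex": {
        "label": "gender of person 1",
        "values": {1: "male", 2: "female", -8: "don't know", -9: "not answered"},
    },
    "household_income": {"label": "", "values": {}},  # deliberately poor example
}

def check_labelling(variables):
    """Return (variable name, problem) pairs for every guideline not met."""
    issues = []
    for name, meta in variables.items():
        if len(name) > MAX_NAME_LENGTH:
            issues.append((name, "name longer than 8 characters"))
        if not meta["label"]:
            issues.append((name, "missing variable label"))
        elif len(meta["label"]) >= VARIABLE_LABEL_LIMIT:
            issues.append((name, "variable label of 80 characters or more"))
        for code, text in meta["values"].items():
            if len(text) >= VALUE_LABEL_LIMIT:
                issues.append((name, f"value label for code {code} too long"))
    return issues

issues = check_labelling(variables)
for name, problem in issues:
    print(f"{name}: {problem}")
```

A check of this kind cannot judge whether a label is precise or states the unit of measurement; those points still need a human reader.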
Documentation
Core documentation:
Questionnaire.
Methodology: details of sample design, response rate, etc.
Codebook, i.e. a comprehensive list of variable names, variable descriptions, code names and variable formatting information. This is essential if the package being used for data management does not allow this sort of variable and code labelling to be stored within the data file.
Technical report describing the research project.
Other useful documentation that is seldom supplied:
Code used to create derived variables or check data (e.g. SPSS, STATA or SAS command files).
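Where the data management package cannot hold labels in the data file, a codebook of the kind described above can be generated from whatever metadata the project keeps. The following is a minimal sketch only: the metadata structure and the plain-text layout are assumptions, not a UKDA-prescribed codebook format.

```python
# Hypothetical metadata kept alongside an unlabelled data file.
variables = [
    {
        "name": "p1sex",
        "label": "gender of person 1",
        "format": "numeric",
        "codes": {1: "male", 2: "female", -8: "don't know", -9: "not answered"},
    },
    {
        "name": "q11bhexc",
        "label": "q11b: hours spent taking physical exercise in a typical week",
        "format": "numeric",
        "codes": {},  # continuous variable, so no value labels
    },
]

def codebook_lines(variables):
    """Build one line per variable, followed by one indented line per code."""
    lines = []
    for var in variables:
        lines.append(f'{var["name"]}  {var["label"]}  [{var["format"]}]')
        for code, text in var["codes"].items():
            lines.append(f"    {code} = {text}")
    return lines

codebook = codebook_lines(variables)
print("\n".join(codebook))
```

The resulting text can be saved with the data and supplied as part of the documentation.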
Good and bad data and documentation formats
For full details for all types of data see: http://www.data-archive.ac.uk/depositingData/howtoDeposit.asp#format
Data held in a statistical package
Preferred format(s): SPSS - portable (.por) or system (.sav) file.
Acceptable format(s): STATA; SAS (with formats information); delimited text.
Problematic format(s): Fixed-width (undelimited) text format.
Data held in a spreadsheet
Preferred format(s): Delimited text (tab delimited or comma separated), Excel, Lotus.
Acceptable format(s): Quattro Pro.
Data held in a database
Preferred format(s): Delimited text with SQL data definition statements, MS ACCESS, dBase, FoxPro, SIR export, XML.
Acceptable format(s): Filemaker Pro, Paradox.
Problematic format(s): Fixed-width (undelimited) text format.
Documentation (e.g. questionnaires, codebooks, interviewers' instructions, project description, etc.)
Preferred format(s): Microsoft Word, Adobe PDF, Rich text format (RTF).
Acceptable format(s): SGML, HTML, XML, WordPerfect.
Problematic format(s): Hard copy (paper).