Data Handling I: Data Preparation and Data Cleaning. Dr Yanzhong Wang, Lecturer in Medical Statistics, Division of Health and Social Care Research, King's College London. Email: email@example.com@kcl.ac.uk. Drug Development Statistics & Data Management.
Session objectives
Part 1:
– How to create a simple data file ready for analysis
– How to create data files for large-scale studies
– Case study on data checking/cleaning
Part 2:
– Advantages and disadvantages of various summary statistics
– Selecting appropriate summary statistics for categorical, binary, ordinal and continuous data
Reading: Statistics at Square One, Chapter 2.
Outline of computerisation of data: Plan at the protocol stage of the study. Data entry. Data checking and editing of individual files. Merging and appending files if necessary. Cross-checking of merged files. Data analysis.
Planning stage: Design of data collection forms, e.g. questionnaires, clinical data sheets. Coding instructions. Decide on the data entry program. Decide on the eventual data analysis program. Ensure compatibility between data to be entered in different files. Set up data entry.
Design of data collection forms. Example: European Community Respiratory Health Survey (ECRHS).
Questionnaire layout: Readability and attractiveness to the responder or interviewer. Readability and lack of ambiguity for the data entry clerk. Collect the date of birth and the date of the occasion (e.g. interview or examination), not age. Other issues later in the course.
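One reason to collect dates rather than age is that age can then be derived exactly for any occasion. The following is a minimal Python sketch of that derivation; the function name `age_in_years` and the example dates are hypothetical and not part of the course material.

```python
from datetime import date


def age_in_years(date_of_birth: date, occasion: date) -> int:
    """Completed years of age on the occasion (e.g. the interview date)."""
    if occasion < date_of_birth:
        raise ValueError("occasion precedes date of birth")
    years = occasion.year - date_of_birth.year
    # Subtract one year if the birthday has not yet been reached in the occasion year.
    if (occasion.month, occasion.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical respondent: born 29 Feb 1960, interviewed 28 Feb 1992.
print(age_in_years(date(1960, 2, 29), date(1992, 2, 28)))
```

Storing both dates also allows age at a later follow-up to be computed without re-contacting the respondent.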
Coding instructions: A unique identification is required for each individual in the study and must be included in each separate set of data. Assign a unique NUMERIC code to all categorical/qualitative data, e.g. male = 1, female = 2. Codes may be printed on the questionnaire, implemented in data entry, or applied later by text-to-numeric conversion. Decide on a code for ‘missing’ data: it should be a number well away from possible data, e.g. 9 for gender, 999 for weight in kg (if ‘blank’ is used, make sure that it will be transferred as ‘missing’).
ECRHS coding instructions: General (ECRHS: European Community Respiratory Health Survey).
Data entry: Excel is part of the Microsoft Office package, so it is almost always available. Access is part of Microsoft Office Professional. Stata and SPSS are statistical analysis programs, available at King’s. Epi-Data is freeware from http://www.epidata.dk/
Data entry programs: Small amounts of data can be entered in Excel, Stata or SPSS. Verification by double entry is necessary except for small amounts of data, and is most easily carried out if the data are entered using Epi-Data. Microsoft Access offers sophisticated data entry, but verification requires complex programming.
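The idea behind double entry is that the same forms are keyed twice, independently, and every disagreement is checked against the paper form. The Python sketch below illustrates that comparison only; it is not how Epi-Data implements verification, and `compare_entries` and the example records are hypothetical.

```python
def compare_entries(first, second, id_field="id"):
    """List every disagreement between two independent entries of the same forms."""
    first_by_id = {row[id_field]: row for row in first}
    second_by_id = {row[id_field]: row for row in second}
    discrepancies = []
    for record_id in sorted(first_by_id.keys() | second_by_id.keys()):
        a, b = first_by_id.get(record_id), second_by_id.get(record_id)
        if a is None or b is None:
            # Record keyed in one pass only: (id, None, in first?, in second?).
            discrepancies.append((record_id, None, a is not None, b is not None))
            continue
        for field in a:
            if a[field] != b.get(field):
                discrepancies.append((record_id, field, a[field], b.get(field)))
    return discrepancies


# Hypothetical data: weight for id 1 was mistyped in the second pass,
# and id 3 was never keyed in the second pass.
entry_1 = [
    {"id": 1, "gender": 1, "weight": 72},
    {"id": 2, "gender": 2, "weight": 64},
    {"id": 3, "gender": 2, "weight": 58},
]
entry_2 = [
    {"id": 1, "gender": 1, "weight": 27},
    {"id": 2, "gender": 2, "weight": 64},
]
for discrepancy in compare_entries(entry_1, entry_2):
    print(discrepancy)
```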
Data analysis programs: Stata is popular with medical statisticians and epidemiologists (flexible, powerful, very few drawbacks). SPSS is popular with sociologists and psychologists. SAS is popular with statisticians in pharmaceutical R&D. R or S-plus is popular with academic statisticians. Beware of little-known packages: unknown, limited validation.
File transfer between programs: Excel can read and write delimited text files. (A delimited text file is one in which each line of text is a record and the fields are separated by a known character such as a comma or a tab.) Stata, SPSS and most statistical packages can read and write delimited text files or Excel files. Data can be exported directly from Epi-Data to Stata or SPSS. Variable names/labels are preserved in most cases.
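As an illustration of what a delimited text file looks like, the following Python sketch writes the same records with a comma and with a tab as the delimiter and reads them back. It uses Python's standard `csv` module; the helper names and the records are hypothetical, and every value is read back as text, so numeric conversion is a separate step.

```python
import csv
import io


def write_delimited(records, delimiter):
    """Write records as delimited text: one record per line, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def read_delimited(text, delimiter):
    """Read delimited text back into a list of records (all values as text)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical records; 999 is the 'missing' code for weight.
rows = [
    {"id": "1", "gender": "1", "weight": "72"},
    {"id": "2", "gender": "2", "weight": "999"},
]
comma_text = write_delimited(rows, ",")
tab_text = write_delimited(rows, "\t")
print(comma_text)
```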
Program formats: Each program has its own special format, and the file extension tells you the file format:
– Excel: .xls
– Stata: .dta
– SPSS: .sav
– Access: .mdb
– Epi-Data: .rec
– Comma-separated file: .csv
Spreadsheet & comma-separated files: Each spreadsheet program has its own ‘format’, but it is possible to write a ‘comma-separated file’ which can be read by other programs.
Steps for most studies: Data entered in Excel (or Epi-Data). Data transferred (‘exported’) to Stata or SPSS. Data checking and editing. Data analysis.
Setting up data entry in Epi-Data: The entry screen should correspond to the questionnaire or other data collection form. Allowable data are determined by the coding instructions. Set ranges for quantitative data, e.g. height. The data entry “clerk” should not be constantly checking. Decide how dates are to be handled.
Preliminary editing: If data were entered as text codes, convert them to numeric codes. Text-to-numeric conversion is simple in Excel.
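The conversion itself is a lookup from each text code to its numeric code. The slide refers to doing this in Excel; the sketch below shows the same idea in Python. The codes male = 1, female = 2 and 9 for missing follow the coding example given earlier, while the function name is hypothetical.

```python
GENDER_CODES = {"male": 1, "female": 2}
MISSING_GENDER = 9


def to_numeric_code(text, codes, missing_code):
    """Convert a text code to its numeric code; blank text becomes the missing code."""
    cleaned = text.strip().lower()
    if cleaned == "":
        return missing_code
    # Text that is not in the coding instructions raises KeyError,
    # so that it is corrected rather than silently coded.
    return codes[cleaned]


entered = [" Male", "female", "", "FEMALE"]
print([to_numeric_code(value, GENDER_CODES, MISSING_GENDER) for value in entered])
```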
Data checking: Data should correspond with the coding instructions. Data should correspond with a plausible/possible distribution. Graphs and tables can identify whether there is a problem, e.g. outliers and missing values. Listing selected data is required to identify where the problem is.
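Both kinds of check can be automated: compare each coded value with the allowed codes, compare each measurement with a plausible range, and then list the identifiers of the records that fail. The following Python sketch assumes a hypothetical set of coding instructions; the weight bounds in particular are illustrative only.

```python
# Hypothetical coding instructions: allowed codes, plausible ranges, missing codes.
ALLOWED_CODES = {"gender": {1, 2, 9}}
PLAUSIBLE_RANGES = {"weight": (30, 200)}  # kg; illustrative bounds only
MISSING_CODES = {"weight": 999}


def check_record(record):
    """Return a list of problems found in one record."""
    problems = []
    for field, allowed in ALLOWED_CODES.items():
        if record[field] not in allowed:
            problems.append(f"{field}={record[field]} is not an allowed code")
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record[field]
        if value == MISSING_CODES.get(field):
            continue  # coded as missing, so not a range error
        if not low <= value <= high:
            problems.append(f"{field}={value} is outside {low} to {high}")
    return problems


records = [
    {"id": 101, "gender": 1, "weight": 72},
    {"id": 102, "gender": 3, "weight": 64},
    {"id": 103, "gender": 2, "weight": 999},
    {"id": 104, "gender": 2, "weight": 640},
]
# List only the records with at least one problem, keyed by identifier.
flagged = {r["id"]: check_record(r) for r in records if check_record(r)}
for record_id, problems in flagged.items():
    print(record_id, problems)
```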
Multiple files: Data from different centres need to be appended, which adds more rows (more records). Data from different sources/questionnaires/time periods for the same individuals need to be merged (e.g. cohort studies and RCTs), which adds more columns (more variables). It is efficient to enter data in separate files if not all data apply to all individuals, e.g. a special questionnaire for women.
Compatible files: If files are to be appended, they must contain the same variable names (columns) for different people (rows). If files are to be merged, they should contain different data (columns) for the same individuals. The identification number(s) must be the same in each file for each individual, and the identification variable name must be the same in each file.
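The difference between appending and merging, and the compatibility each requires, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Stata or SPSS procedure: the function names, the files and the variable `pregnancies` are hypothetical, and individuals missing from the second file are given `None` for its variables.

```python
def append_files(first, second):
    """Stack two files that hold the same variables for different people (more rows)."""
    if set(first[0]) != set(second[0]):
        raise ValueError("files to be appended must have the same variable names")
    return first + second


def merge_files(left, right, id_field="id"):
    """Join two files on the identifier (more columns).

    Individuals absent from `right` get None for its variables.
    """
    right_by_id = {row[id_field]: row for row in right}
    new_fields = [field for field in right[0] if field != id_field]
    merged = []
    for row in left:
        match = right_by_id.get(row[id_field], {})
        merged.append({**row, **{field: match.get(field) for field in new_fields}})
    return merged


centre_a = [{"id": 1, "gender": 1}, {"id": 2, "gender": 2}]
centre_b = [{"id": 3, "gender": 2}]
women = [{"id": 2, "pregnancies": 1}]  # special questionnaire, women only

everyone = append_files(centre_a, centre_b)
combined = merge_files(everyone, women)
print(combined)
```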
Graphs for checking data (1): Single continuous variable: a histogram can detect ‘outliers’, e.g. in height (also dot plots and some box-and-whisker plots).
[Figures: histogram of age of SLSR patients; box plot of age of SLSR patients]
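Many box-and-whisker plots mark as outliers the points lying more than 1.5 interquartile ranges beyond the quartiles. The Python sketch below applies that common convention numerically; the ages are invented for illustration and are not the SLSR data, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from a given package's definition.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values lying more than k interquartile ranges beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical ages; 171 might be a keying error for 71.
ages = [45, 52, 58, 61, 63, 66, 68, 70, 71, 73, 75, 78, 80, 84, 171]
print(iqr_outliers(ages))
```

A flagged value is only a candidate error: it still has to be checked against the original form before it is changed.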
Graphs for checking data (2): Two continuous variables: a scatter plot can detect ‘outliers’, e.g. in weight for height. Follow with a list of aberrant values. Graphs are less useful for categorical data.
[Figure: the relationship between aortic pulse wave velocity (Ao-PWV) and ambulatory arterial stiffness index (AASI) in patients with type 2 diabetes, microalbuminuria and systolic hypertension at baseline]
Tables for categorical data

Wheeze in last 12 months   Frequency (n)       %
No                                  1945    75.0
Yes                                  642    24.7
Not known                              8     0.3
Total                               2595   100.0
Tables for checking categorical data

. tab q1

         q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1945       74.95       74.95
          2 |        642       24.74       99.69
          9 |          8        0.31      100.00
------------+-----------------------------------
      Total |       2595      100.00

. list area id if q1==9

        area     id
 640.    110    640
1853.    110   1853
3280.    110   3280
3624.    110   3624
3663.    110   3663
4509.    110   4509
4623.    110   4623
4923.    110   4923
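The same two steps, tabulating a coded variable and then listing the records that carry the ‘not known’ code, can be sketched in Python. The counts for `q1` are taken from the table above; the function name and the three listed records are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Rows of (code, frequency, percent, cumulative percent), ordered by code."""
    counts = Counter(values)
    total = len(values)
    table, running = [], 0
    for code in sorted(counts):
        running += counts[code]
        table.append(
            (
                code,
                counts[code],
                round(100 * counts[code] / total, 2),
                round(100 * running / total, 2),
            )
        )
    return table


# Counts from the slide: 1945 'no' (1), 642 'yes' (2), 8 'not known' (9).
q1 = [1] * 1945 + [2] * 642 + [9] * 8
for row in frequency_table(q1):
    print(row)

# Listing the records coded 9 shows where the problem is (hypothetical records).
records = [
    {"area": 110, "id": 640, "q1": 9},
    {"area": 110, "id": 641, "q1": 1},
    {"area": 110, "id": 1853, "q1": 9},
]
not_known = [(r["area"], r["id"]) for r in records if r["q1"] == 9]
print(not_known)
```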
Missing data: Convert to the program's missing value code before calculating summary statistics or plotting graphs. E.g. in Stata:
– mvdecode gender, mv(9)
– mvdecode weight, mv(999)
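To see why the conversion must come first, the Python sketch below computes a mean weight with and without the 999 code treated as missing. It only mimics the effect of the Stata commands above; `decode_missing` and the records are hypothetical, and `None` stands in for the program's missing value.

```python
import statistics

# Missing-value codes from the coding instructions: 9 for gender, 999 for weight.
MISSING_CODES = {"gender": 9, "weight": 999}


def decode_missing(record, missing_codes=MISSING_CODES):
    """Replace each variable's missing code with None."""
    return {
        field: None if missing_codes.get(field) == value else value
        for field, value in record.items()
    }


records = [
    {"id": 1, "gender": 1, "weight": 72},
    {"id": 2, "gender": 9, "weight": 64},
    {"id": 3, "gender": 2, "weight": 999},
    {"id": 4, "gender": 2, "weight": 80},
]
cleaned = [decode_missing(r) for r in records]

# The raw mean is distorted by the 999 code; the cleaned mean skips it.
raw_mean = statistics.mean(r["weight"] for r in records)
clean_mean = statistics.mean(r["weight"] for r in cleaned if r["weight"] is not None)
print(raw_mean, clean_mean)
```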
Case study: Scottish Family History Study Data
SFHS Data Quality Report: PCQ data
Report description: Draft Summary Report
Prepared by: Yanzhong Wang
Last run on: 07/11/2008 by Yanzhong Wang
Report file name: SFHS_PCQ_check_report.doc
Created by program: //Rcb-file-2000/Filestore/Studies/SFHS/statistics/programs/PCQ_datacheck_prog/SFHS_PCQ_check_v7.R
Created using software: R version 2.5.1 (2007-06-27) for Windows
Checked by: The program for producing this report has not been checked by a second statistician.
Overview of SFHS PCQ data
Duplicate records
The analysis data set PCQ combines subjects from both version 1 and version 2 of the pre-clinic questionnaire. There are 6882 records in total in the PCQ data set, and each record has 415 variables. There are 6863 distinct subject numbers, of which 19 appear more than once; none appears more than twice. The duplicated subject numbers are SFT0400662 SFT0400659 SFT0400656 SFT0400660 SFT0400985 SFT0434277 SFG9500780 SFT0441461 SFT0435134 SFT0441530 SFT0435427 SFT0435155 SFT0441473 SFT0435157 SFT0441602 SFT0435584 SFT0435438 SFG9501775 SFT0435630. The 19 duplicate records are omitted from further analysis, leaving a total of 6863 records with unique subject numbers.
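A duplicate check of this kind counts how often each subject number occurs and then keeps one record per subject. The report was produced by an R program that is not shown, so the Python sketch below is only an illustration: the subject numbers are made up, and keeping the first record of each pair is an assumption, since the report does not say which copy was retained.

```python
from collections import Counter


def duplicate_report(subject_numbers):
    """Summarise how often each subject number occurs."""
    counts = Counter(subject_numbers)
    return {
        "records": len(subject_numbers),
        "distinct": len(counts),
        "duplicated": [s for s, n in counts.items() if n > 1],
        "more_than_twice": [s for s, n in counts.items() if n > 2],
    }


def drop_duplicates(records, id_field):
    """Keep the first record for each subject number (an assumption, see above)."""
    seen, kept = set(), []
    for record in records:
        if record[id_field] not in seen:
            seen.add(record[id_field])
            kept.append(record)
    return kept


# Made-up subject numbers, not SFHS identifiers.
records = [
    {"subject": "A1", "q": 1},
    {"subject": "A2", "q": 2},
    {"subject": "A2", "q": 2},
    {"subject": "A3", "q": 1},
]
report = duplicate_report([r["subject"] for r in records])
unique_records = drop_duplicates(records, "subject")
print(report, len(unique_records))
```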
Overview of SFHS PCQ data (continued)
Blank variables
There are 5 variables that are NA for every subject:
– PCQCIV4: “Prescribed injection / Suppository 4”
– PCQCIV6: “Prescribed injection / Suppository 6”
– PCQEHB: “Family Health / Breast cancer – brother (version 1)”
– PCQEKS: “Family Health / Prostate cancer – sister (version 1)”
– PCQELB: “Family Health / Hip fracture – brother (version 1)”
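Finding blank variables amounts to testing, for each variable, whether every record is missing. A minimal Python sketch follows, with `None` standing in for R's NA; the function name, variable names and records are hypothetical.

```python
def blank_variables(records):
    """Names of variables that are missing (None) in every record."""
    return [
        field
        for field in records[0]
        if all(record.get(field) is None for record in records)
    ]


# Hypothetical records: 'injection_4' was never filled in for anyone.
records = [
    {"subject": "A1", "smoker": 1, "injection_4": None},
    {"subject": "A2", "smoker": None, "injection_4": None},
    {"subject": "A3", "smoker": 2, "injection_4": None},
]
print(blank_variables(records))
```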
Overview of SFHS PCQ data (continued)
Pre-processing data