Data cleaning: hints and tips
Felicity Clemens, 18 May 2005
Stata Users’ Group meeting, London, 17 & 18 May 2005



Introduction
• Data cleaning is one of the most time-consuming jobs of all!
• Stata offers many ways of attacking the same problem
• The talk describes some common problems and proposes possible solutions
• These are mostly reminders!

Contents
1) Introduction to the first datasets
2) Identifying and removing duplicates – by hand
3) Merging data and uses of the merge command
4) Generating a moving target variable

The study
• A case-control study carried out across 3 central European countries
• Exposure of interest: exposure to chemicals in the environment
• Outcome of interest: cancer


Identifying duplicates in a dataset
• This can be done automatically (using the duplicates set of commands)
• We will demonstrate a manual method of identifying duplicates
• Two different possibilities:
  ◦ The same data have been entered on more than one occasion;
  ◦ Different data have been entered using the same identifier (id numbers)
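A minimal sketch of how the two kinds of duplicate might be told apart by hand. The toy data and the variable names (id, age, sex, n_id, n_same, dup_id) are assumptions made for illustration; they are not taken from the study datasets.

```stata
* Toy data: id 2 is an exact re-entry, id 3 is one id with different data
clear
input id age str1 sex
1 34 "M"
2 51 "F"
2 51 "F"
3 47 "M"
3 62 "F"
end

* Manual check: how many records share each id?
bysort id: gen n_id = _N

* How many records are identical on every original variable?
bysort id age sex: gen n_same = _N

* n_id > 1 and n_same == n_id : the same data entered more than once
* n_id > n_same               : one id attached to different data
list id age sex n_id n_same if n_id > 1, sepby(id)

* The automatic route, for comparison
duplicates report id
duplicates tag id, generate(dup_id)

* Remove exact re-entries by hand, keeping one copy of each
bysort id age sex: keep if _n == 1
```

Records where one id carries different data are left in place by the last line; they usually need checking against the source forms rather than automatic deletion.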

The merge command
A necessary command in the data management of most big studies.
There are many different uses of the merge command. We look at two of them:
• Simple merge on id
• Multiple merge on id
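The two uses might look roughly like the sketch below. It uses the merge 1:1 and merge 1:m syntax of Stata 11 and later, which the talk predates; reading "multiple merge" as a one-to-many merge is an assumption, and the datasets and variable names (exposure, iscase, rescode) are invented for illustration.

```stata
* Using dataset: one exposure record per subject (toy values)
clear
input id exposure
1 0.4
2 1.1
3 0.7
end
tempfile exposures
save `exposures'

* Using dataset: several residence records per subject (toy values)
clear
input id rescode
1 10
1 20
2 10
end
tempfile residences
save `residences'

* Master dataset: one row per subject
clear
input id iscase
1 1
2 0
4 1
end
tempfile subjects
save `subjects'

* Simple merge: one row per id on both sides
merge 1:1 id using `exposures'
tabulate _merge

* One-to-many merge: each subject matched to all of their residence rows
use `subjects', clear
merge 1:m id using `residences'
tabulate _merge
```

Tabulating _merge after each merge shows which ids appear in only one of the two files, which is often the first sign of an id problem.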

Identifying a moving target
• Scenario: we have data for each town giving the chemical concentration for each year between 1982 and 2002
• Problem: we need to identify the year, counting backwards from 2002, in which the chemical changed from its 2002 level
• Why? We need to overwrite the 2002 value with a new value, and continue overwriting backwards as far as the year in which the value changed

Identifying a moving target (2)
[Data table; column headings: rescode, y1990, y1991, y…]

Identifying a moving target (3)
We will use the forval loop to examine the relationship between each year’s observed value and the observed value for the previous year
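One way such a loop could be written is sketched below. It assumes a wide layout with one row per town and one variable per year, named in the rescode, y1990, y1991 style of the data slide; the year range is shortened to 1998 to 2002 to keep the toy data small, and changeyear is a hypothetical variable name.

```stata
* Toy data: one row per town, one concentration variable per year
clear
input rescode y1998 y1999 y2000 y2001 y2002
10 4 4 6 6 6
20 3 3 3 3 3
30 2 5 5 7 8
end

* changeyear will hold the earliest year that already has the 2002 level
gen changeyear = .

* Walk backwards from 2002, comparing each year with the year before it
forvalues yr = 2002(-1)1999 {
    local prev = `yr' - 1
    * Only the first change met on the way back is recorded
    replace changeyear = `yr' if missing(changeyear) & y`yr' != y`prev'
}

list rescode changeyear
```

In this toy data town 10 gets changeyear 2000, town 30 gets 2002, and town 20 stays missing because its level never changes. Missing concentrations compare as different from any observed value, so real data with gaps would need an explicit rule for them.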

Summary
• Identifying duplicates: can be done by hand or automatically using the “duplicates” set of commands
• Use of the merge command: to merge on a specific variable, to multiply merge datasets
• Generating a moving target variable: the use of the “forval” loop