SADC Course in Statistics Exploratory Data Analysis (EDA) in the data analysis process Module B2 Session 13.



2 Learning Objectives
Students should be able to: construct a dot plot for a numeric variable split by a categorical variable; apply EDA concepts to a large dataset; explain the use of Excel's pivot tables and filters in the EDA process; explain the importance of EDA for data checking and at the start of the analysis; relate EDA to the principles of official statistics.

3 EDA with small and large data sets
Session 12 stressed the importance of EDA, introduced two new tools (the dot plot and the stem and leaf plot), and gave practice with small data sets. In this session we scale up and look at large data sets. The tools do not scale up easily, but the concepts do. EDA becomes even more crucial, because most data sets are large, at least compared with teaching examples.

4 The essence of a stem and leaf plot
Figures: a stem and leaf plot and a stacked dot plot. In the stem and leaf plot, the leaf shows the next digit. This can be useful in the exploration phase.
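The idea of a leaf holding the next digit can be sketched in a few lines of Python. This is only an illustration, assuming non-negative whole numbers with the tens as the stem and the units as the leaf; the function name and the sample values are invented and do not come from the course data.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return text rows for non-negative integers (stem = tens, leaf = units)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    lines = []
    # Walk every stem in range so that empty stems still appear as gaps.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical illustrative values, not taken from the course data.
sample = [12, 15, 21, 21, 24, 30, 38, 45]
for line in stem_and_leaf(sample):
    print(line)
```

Each row is a stacked dot plot in which the dots have been replaced by digits, so the actual numbers stay visible.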

5 What are the key points?
At this stage we look at individual data points, not summaries; this is general for EDA. The stem and leaf plot in particular keeps the actual numbers as far as possible, which can be important. An example uses the Tanzania survey.

6 Tanzania agriculture survey
This is the variable we wish to explore. It is a value between 0 and 100.

7 The data in Excel
The variable to explore before analysis.

8 How to explore this value
Can we do a stem and leaf plot? By hand in Excel, but there are … values! Even if automated, that is too many. The essence of a stem and leaf plot is to look at all the possible values, so try a pivot table: a powerful feature in Excel, used previously on categorical data.
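Outside Excel, the same "count every distinct value" table can be approximated with a simple tally. The sketch below is not the course's method; it assumes the variable is a whole-number percentage, and the values are invented for illustration.

```python
from collections import Counter

def value_counts(values):
    """Tally how often each distinct value occurs, sorted by value."""
    return sorted(Counter(values).items())

# Hypothetical percentages (0 to 100), invented for illustration only.
percentages = [50, 50, 25, 2, 100, 50, 33, 25, 2, 100]
for value, count in value_counts(percentages):
    print(f"{value:>3}  {count}")
```

Like the pivot table, this shows every value that actually occurs, so unexpected values stand out without any summarising.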

9 The pivot table

10 Some results


12 What do you deduce?
There are oddities in rounding, perhaps from enumerator differences. Can this question be answered to 1%? So what should be done before analysis? First, look further at the data. Excel can help: it can drill down to examine individual records. The concept is to use the table to look for oddities, then examine them in more detail.

13 Drilling down: an example
Make the 6 corresponding to 2% the active cell, then double-click to give the detail. 4 of these values are from the same village, so from the same enumerator.
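Drilling down amounts to pulling out the full records behind one cell of the table and then looking at their other columns. A minimal sketch follows, assuming records stored as dictionaries; the field names `village` and `pct` and all the records are hypothetical, chosen only to mirror the shape of the slide's example.

```python
from collections import Counter

def drill_down(records, column, value):
    """Return the full records behind one cell of a frequency table."""
    return [r for r in records if r[column] == value]

# Hypothetical records; the field names and values are assumptions.
records = [
    {"household": 1, "village": "A", "pct": 2},
    {"household": 2, "village": "A", "pct": 2},
    {"household": 3, "village": "B", "pct": 50},
    {"household": 4, "village": "A", "pct": 2},
    {"household": 5, "village": "C", "pct": 2},
    {"household": 6, "village": "A", "pct": 2},
    {"household": 7, "village": "B", "pct": 2},
]

detail = drill_down(records, "pct", 2)
by_village = Counter(r["village"] for r in detail)
print(len(detail), by_village.most_common())
```

Counting the selected records by village is one way to see whether an odd value clusters with a single enumerator.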


15 What do you conclude: technique and results
Technique: use stem and leaf plots when looking at small datasets and pivot tables when datasets are large. But the principle is general: numbers must be looked at carefully! The principle can be adapted for the data and explored effectively in Excel. Results: did enumerators have different interpretations of the precision required in the percentages? This needs further exploration, and the analysis needs to take account of it.
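One possible way to follow up the rounding question is to compare, per enumerator, how often the recorded percentage is a round number. The sketch below is an assumption about how such a check might look, not part of the course material; the enumerator labels, the choice of multiples of 5 as "round", and the data are all invented.

```python
from collections import defaultdict

def share_rounded(records, step=5):
    """Per enumerator, the fraction of values that are exact multiples of `step`."""
    totals = defaultdict(int)
    rounded = defaultdict(int)
    for enumerator, value in records:
        totals[enumerator] += 1
        if value % step == 0:
            rounded[enumerator] += 1
    return {e: rounded[e] / totals[e] for e in totals}

# Hypothetical (enumerator, percentage) pairs, invented for illustration.
records = [
    ("E1", 50), ("E1", 25), ("E1", 100), ("E1", 75),
    ("E2", 33), ("E2", 2), ("E2", 50), ("E2", 17),
]
for enumerator, share in sorted(share_rounded(records).items()):
    print(enumerator, share)
```

A large difference between enumerators would suggest that they interpreted the required precision differently, which is the question the slide raises.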

16 Another new element in this session
Exploratory analysis includes looking for oddities in the data. Unexplained oddities cause variation that can make it difficult to detect the pattern, because they add unnecessary noise to the data. How do you tame the variation? One way is to examine related variables. This is important in the analysis (the next slide is a repeat from Session 3). It is also a key weapon in data exploration and is covered in the practical.

17 Slide from Module B2 Session 3
To do good statistics you must fight the curse of variation. There are two main strategies to overcome variation. 1. Take enough observations: in the Tanzania survey there were 3223 households just from this one region. 2. Measure characteristics that explain variation. Variation itself is not necessarily the problem; variation you do not understand is the problem. Here we start understanding variation at the exploration stage.
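The second strategy can be illustrated numerically: once a related variable is taken into account, the variation left over is smaller. The sketch below compares the overall standard deviation with the standard deviation around each group's own mean. The grouping labels and values are invented, and the population standard deviation is used purely for simplicity.

```python
from statistics import mean, pstdev

def within_group_sd(groups):
    """Standard deviation of values measured around their own group mean."""
    residuals = [v - mean(vals) for vals in groups.values() for v in vals]
    return pstdev(residuals)

# Hypothetical measurements for two groups, invented for illustration.
data = {"group A": [1.0, 1.2, 0.8], "group B": [3.0, 3.2, 2.8]}

overall = pstdev([v for vals in data.values() for v in vals])
within = within_group_sd(data)
print(f"overall SD {overall:.2f}, within-group SD {within:.2f}")
```

Here most of the overall variation is explained by the grouping, so what remains unexplained is much smaller.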

18 Practical: three parts
Tanzania data: practice what has been done in these slides. Dot plots split by a factor: demonstration and practice. Swaziland data: apply the concepts, checking factors as well as numeric columns. Then the key points are reviewed.

19 Points for review after the practical
Look for individual problems and surprising patterns. Exploratory graphics need to help the analyst and data checker (see the dot plots on the next slide). Tables are also useful, especially with the facility to drill down. Look at individual variables and at records as a whole. Trust your common sense: it is useful to estimate results, and to question the computer if they are very different.

20 Dot plots: yield by variety
Outliers (typing errors) are clear, but only because of the 2nd variable. They are not outliers overall.
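A rough text version of a dot plot split by a factor shows why the second variable matters. In this sketch each row is one group and each character position is one value on a whole-number scale, holding the count of observations at that value. The variety names, the yields, and the integer scale are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def dot_rows(pairs, lo, hi):
    """One text row per group; position x shows the count at x, or '.' if none."""
    groups = defaultdict(Counter)
    for group, value in pairs:
        groups[group][value] += 1
    rows = {}
    for group in sorted(groups):
        counts = groups[group]
        rows[group] = "".join(
            str(min(counts[x], 9)) if counts[x] else "."
            for x in range(lo, hi + 1)
        )
    return rows

# Hypothetical (variety, yield) pairs on an integer scale from 1 to 10.
pairs = [
    ("local", 2), ("local", 3), ("local", 3), ("local", 4), ("local", 9),
    ("improved", 8), ("improved", 9), ("improved", 9), ("improved", 10),
]
for group, row in dot_rows(pairs, 1, 10).items():
    print(f"{group:>8} {row}")
```

In the invented data, the value 9 is unremarkable overall but sits far from the rest of its own group, which only becomes visible once the rows are split.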

21 EDA is a continuous process
EDA is effectively a continuation of the data checking process. The example on the previous slide shows how some oddities only become clear once the analysis is undertaken. This continues into the formal analysis, where it involves looking at the residuals: they are the unexplained variation, as discussed in Session 3. So analysis is not just a set of rules. It is a thoughtful process in which you become the data detective!

22 Swaziland data was for checking

23 Investigating the column called Presence
What does 0 mean? Why are there blanks? Next steps: 1. Look at the questionnaire. 2. Select these records. You are becoming detectives!

24 Codes for the column
Seems clear enough. Zeros and blanks are still a puzzle.

25 Selecting the blank records
Annotations on the selected records: missing also; too young and all the same; crop code not recognised; areas too large. That is, there are serious problems with the whole record.
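Selecting the puzzling records can be sketched as a filter on the coded column. The valid codes below are hypothetical, since the questionnaire's actual code list is not reproduced in this transcript; the field names and records are likewise invented.

```python
def suspect_records(records, column, valid_codes):
    """Return records whose value in `column` is blank or not a recognised code."""
    return [r for r in records if r.get(column) not in valid_codes]

# Hypothetical code list and records; the real codes are in the questionnaire.
VALID_PRESENCE = {1, 2, 3}
records = [
    {"id": 1, "presence": 1, "area": 0.5},
    {"id": 2, "presence": 0, "area": 0.7},
    {"id": 3, "presence": None, "area": 250.0},
    {"id": 4, "presence": 2, "area": 0.4},
    {"id": 5, "presence": "", "area": 310.0},
]

for r in suspect_records(records, "presence", VALID_PRESENCE):
    print(r)
```

Printing the whole record, rather than just the odd column, is what lets problems in the other columns show up alongside the blank.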

26 Dot plot of area by Presence
Odd crop areas were ALL associated with odd codes for the column PRESENCE. It was found to be a data transfer problem, with one byte missing in these records.

27 Checking data quality and EDA
Where | Why | How | By whom
Before data entry | To ensure complete data set received | Manual check | Supervisor
During data entry | To highlight anomalies | Filter, dot plots etc. | Supervisor and helpers
Before analysis | Double check | As above | Analyst/statistician
During analysis | Remain critical | Residuals | Analyst/statistician

28 Importance: principles of official statistics
Principle 2, Professional standards: it is unprofessional to analyse the data and report results without exploring critically at all stages. Principle 4, Prevention of misuse: we risk misusing the data unless we explore the data critically. Principle 5, Sources of statistics: this includes a requirement to avoid undue burden on respondents. We must process the data fully and effectively, and this needs EDA. Otherwise the burden imposed on respondents is to some extent wasted.

29 Can you now:
Apply EDA concepts to a large dataset? Explain the importance of EDA for data checking and at the start of the analysis? Relate EDA to the principles of official statistics?

30 Now you can organise the data for analysis, and then do an exploratory analysis. We show next how the analysis is easy IF your objectives are clear.