FDI - Imputation. Overview Introduction Overview of Imputation Methods Overview of Outliering methods Overview of Estimation methods Aggregation Disclosure.

FDI - Imputation

Overview
- Introduction
- Overview of Imputation Methods
- Overview of Outliering methods
- Overview of Estimation methods
- Aggregation
- Disclosure
- Quality information

Results process: Validation → Analysis → Imputation → Outliering → Estimation → Aggregation → Disclosure → Outputs

Methodology review of methods for FDI
- Methods were reviewed back in 2011 as part of ESA10.
- Changes in international regulations require changes to be made to the FDI questionnaire, plus the survey data take-on and processing system.
- This is an opportunity to harmonise the Annual and Quarterly FDI methods and improve data quality.

What is imputation?
Imputation is defined as “A procedure for entering a value for a specific data item where the response is missing or unusable” (UNECE Glossary of Terms). In practice, imputation is a way to estimate a value for a non-responder or for an unusable response, for example a response that contains errors or is inconsistent.

Types of non-response
There are two types of non-response, complete and partial. These are known respectively as unit non-response and item non-response.
- Unit non-response occurs when a respondent answers no survey questions.
- Item non-response occurs when a respondent answers some but not all survey questions.
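As a minimal sketch of the distinction, the function below classifies one respondent's return. The function name and the convention that `None` marks an unanswered question are assumptions made for illustration, not part of the FDI system.

```python
from typing import Optional, Sequence


def classify_response(answers: Sequence[Optional[float]]) -> str:
    """Classify one respondent's return; None marks an unanswered question."""
    answered = sum(1 for a in answers if a is not None)
    if answered == 0:
        # No questions answered at all: complete (unit) non-response.
        return "unit non-response"
    if answered < len(answers):
        # Some but not all questions answered: partial (item) non-response.
        return "item non-response"
    return "complete response"


print(classify_response([None, None, None]))
print(classify_response([12.0, None, 3.5]))
```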

Avoiding non-response
Ideally, non-response should be avoided completely, by solving the issues which cause it:
- negative attitude towards ONS
- problems contacting the ONS
- problems with questionnaire design
- problems with timing, burden, sensitivity etc.
However, the reality is that non-response always occurs in sample surveys.

Dealing with non-response
Once it has occurred, non-response can be dealt with by the following:
- do nothing
- re-contact
- imputation
- or, more subjectively, manual construction

When CORA uses the different imputation methods

Method | When applied | Questions applied to | Question descriptions
Ratio of means imputation | Annual and Qtr | 1011, 1012, 1111, 1112, 1211, 1212, 1311, 1312, 1321, 1322, 3412, 3422, 3712, 3722 | Profit/loss, tax credits, closing balances
Default to zero | Annual and Qtr | 2039, 2111, 2112, 2121, 2122, 2211, 2212, 2221, 2222, 2611, 2612, 2621, 2622 | Exceptional dividends, acquisitions and disposals, and increases and decreases in equity
Copy forward previous period | Annual and Qtr | 3191 (impute prev 3192), 3291 (impute prev 3292), 3411 (impute prev 3412), 3421 (impute prev 3422), 3691 (impute prev 3692), 3711 (impute prev 3712), 3721 (impute prev 3722) | Opening balances
Impute a median value | Qtr | 2019 | Ordinary dividend

Ratio of Means
- This is the main imputation method.
- Used for profit/loss, tax credits and closing balances questions.
- The next few slides will walk you through how the Ratio of Means is calculated.
- It is important to note that the calculations will all be done in CORA.

How Ratio of Means is Calculated
For each question relevant to Ratio of Means:
- Group the question data by company type (i.e. branch or subsidiary) and by industry.
- Sum the question responses of the companies within the group for the current period.
- Sum the question responses of the companies within the group for the previous period.
- Question ratio = current period question total / previous period question total.
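The calculation can be sketched as below. This is an illustration only, not CORA's implementation: the function name, the tuple layout and the individual response values are made up, with the values chosen so that the totals reproduce the slide's 45 / 62 ratio. The transcript does not say whether only companies responding in both periods are included, so the sketch simply sums whatever records it is given, and skipping groups with a zero previous total is also an assumption.

```python
from collections import defaultdict


def question_ratios(current, previous):
    """Return {(company_type, industry): current total / previous total}.

    `current` and `previous` are lists of (company_type, industry, value)
    records for a single question.
    """
    cur_total = defaultdict(float)
    prev_total = defaultdict(float)
    for company_type, industry, value in current:
        cur_total[(company_type, industry)] += value
    for company_type, industry, value in previous:
        prev_total[(company_type, industry)] += value

    ratios = {}
    for group, prev in prev_total.items():
        if prev != 0:  # assumption: no ratio is produced for a zero denominator
            ratios[group] = cur_total.get(group, 0.0) / prev
    return ratios


# Hypothetical responses for one group; only the totals (45 and 62) come from the slide.
current = [("subsidiary", "69", 20.0), ("subsidiary", "69", 25.0)]
previous = [("subsidiary", "69", 30.0), ("subsidiary", "69", 32.0)]
ratios = question_ratios(current, previous)
print(round(ratios[("subsidiary", "69")], 2))  # 0.73
```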

Ratio of Means Example
Ratio = 45 / 62 = 0.73.
[Slide tables: Company, Country, Industry and Value for the current and previous periods, with one value for Company B missing ("?") and imputed using the ratio. The individual cell values are not recoverable from the transcript.]

Application of Ratio of Means
The previous slides show how the ratio is created, but how is it applied? Where the company responded in the last period:
- If the previous response was a positive number, multiply that response by the ratio to create the imputed value.
- If the previous response was a negative number, the current period value is set to 0.

Application of Ratio of Means
Where the company did not respond in the last period, the ratio of means is not used, as there is no previous value to apply the ratio to. A trimmed mean is used to impute instead.
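The application rules can be sketched as follows. The function names, the use of `None` for "no previous response" and the 10% trimming fraction are assumptions; the transcript does not give the trimming proportion or say how a previous response of exactly zero is treated (here it falls through to the multiplication and yields 0).

```python
from statistics import mean


def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest `trim_fraction` of values.

    `trim_fraction` is assumed to be below 0.5; an empty input raises
    statistics.StatisticsError.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)


def impute_ratio_of_means(previous_value, ratio, group_responses, trim_fraction=0.1):
    """Impute a current-period value for one company and question."""
    if previous_value is None:
        # No previous response to scale: fall back to a trimmed mean of the group.
        return trimmed_mean(group_responses, trim_fraction)
    if previous_value < 0:
        # Negative previous response: current period value is set to 0.
        return 0.0
    # Previous response available and not negative: scale it by the ratio.
    return previous_value * ratio


print(impute_ratio_of_means(100.0, 0.5, []))   # 50.0
print(impute_ratio_of_means(-5.0, 0.5, []))    # 0.0
print(impute_ratio_of_means(None, 0.5, [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))  # 5.5
```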

Checking!
It is important to check the ratio values that are output. A big ratio means that there is a big difference between the aggregated total for the current period and the previous period. Check the data: is a big company's data missing, or are the units incorrect?

Copy forward previous period
Used to move closing balances from the previous period into the opening balances for the new period.
Method:
- If the respondent has not completed a question, the system looks for data in the previous period.
- If it finds data for the missing question, the previous period's data is copied forward.
- If no value is found, it calculates the median value.
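A minimal sketch of this method is shown below. The opening-to-closing question mapping is taken from the imputation methods table; the function name, the dictionary layout and the choice to take the fallback median over the current period's responses to the same question are assumptions, since the transcript does not say which values the median is computed from.

```python
from statistics import median

# Opening-balance question -> previous period's closing-balance question.
OPENING_FROM_CLOSING = {
    "3191": "3192", "3291": "3292", "3411": "3412", "3421": "3422",
    "3691": "3692", "3711": "3712", "3721": "3722",
}


def copy_forward(opening_question, previous_period, current_responses):
    """Impute a missing opening balance for one company.

    `previous_period` maps question number -> that company's previous value.
    `current_responses` holds other companies' current values for the
    opening-balance question (assumed source for the fallback median).
    """
    closing_question = OPENING_FROM_CLOSING[opening_question]
    previous_value = previous_period.get(closing_question)
    if previous_value is not None:
        return previous_value          # copy the closing balance forward
    return median(current_responses)   # nothing to copy: use the median


print(copy_forward("3191", {"3192": 40.0}, [10, 30, 20]))  # 40.0
print(copy_forward("3191", {}, [10, 30, 20]))              # 20
```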

Copy forward previous period (example)
[Slide tables: Company, Country, Industry and Value for the current and previous periods, with a missing value for Company B marked "?" and the value 40 shown. The remaining cell values are not recoverable from the transcript.]

Median Imputation
Used to impute for the Ordinary dividends questions.
Method: the system orders the question values by size, counts the number of observations for the question and identifies the middle number (the median). It then imputes the blank cell with that middle value.
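The ordering-and-counting steps can be sketched as below. The function name and the use of `None` for a blank cell are assumptions. The slide only mentions "the middle number", so averaging the two middle values when the count is even is the usual convention rather than something the transcript confirms.

```python
def median_impute(values):
    """Replace blanks (None) in one question's values with the median."""
    observed = sorted(v for v in values if v is not None)  # order by size
    n = len(observed)                                      # count observations
    if n == 0:
        raise ValueError("no observed values to take a median from")
    mid = n // 2
    if n % 2:
        med = observed[mid]                            # odd count: middle value
    else:
        med = (observed[mid - 1] + observed[mid]) / 2  # even count: assumption
    return [med if v is None else v for v in values]


print(median_impute([5, None, 1, 9]))  # [5, 5, 1, 9]
```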

Example for Median Imputation Data for question 3272

What is an outlier?
Non-typical, unusual or extreme (large or small) values, relative to the rest of the data.
Outliers can be:
- non-representative: one-off values (often errors)
- representative: there are similar values in the population

Why do outliers occur?
- The 'shape' of the population: skewness, large variability
- Problems with the frame and sample design: misclassifications, poor relationship between stratification and survey variables
- Errors: data capture error, response error

Outliering methods used in FDI
1) Distance from the Mean: trims the data according to a set number of standard deviations.
2) Winsorisation: an outliering process used to identify responses that differ from the other responses within their group. These data points are then amended to values deemed to be within an acceptable range before other processes are run.

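Distance from the Mean
A unit is an outlier if $y_i$ is outside the tolerance interval $\bar{y}_t \pm c\,s$, where $\bar{y}_t$ = trimmed sample mean, $s$ = sample standard deviation and $c$ = the set number of standard deviations. (The slide's own formula was lost in the transcript; the interval is reconstructed from these definitions, and the symbols $\bar{y}_t$ and $c$ are assumed.) An outlier is excluded from estimation.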
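A minimal sketch of this check follows, assuming the tolerance interval is the trimmed mean plus or minus a set number of standard deviations. The function name, the 10% trimming fraction, the default of 3 standard deviations and the choice to compute the standard deviation over all (untrimmed) values are assumptions; the transcript gives none of these settings.

```python
from statistics import mean, stdev


def flag_outliers(values, n_sd=3.0, trim_fraction=0.1):
    """Return one boolean per value: True if it lies outside the interval."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    core = ordered[k:len(ordered) - k] if k else ordered
    centre = mean(core)   # trimmed sample mean
    s = stdev(values)     # sample standard deviation (needs at least 2 values)
    lower, upper = centre - n_sd * s, centre + n_sd * s
    return [not (lower <= y <= upper) for y in values]


values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 500]
flags = flag_outliers(values, n_sd=2.0)
# Flagged units are excluded from estimation.
kept = [y for y, is_outlier in zip(values, flags) if not is_outlier]
print(kept)
```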

Winsorisation
Assumption: sampled outliers are true values, not necessarily unique in the population.
Method:
- identify outliers
- decrease the values of sampled outliers that seem “too high”
- non-outliers remain unchanged

One-Sided Winsorisation
[Figure: axis y with the origin 0 and the cut-off k marked.]

How Winsorisation is calculated
In the case of an expansion estimator, the optimal cut-off is calculated from a Winsorisation parameter L [formula not captured in the transcript]. L is:
- computed from past data
- chosen to minimise the Mean Square Error of the estimator
- in need of regular updating (by Methodology)

Winsorisation
- Use a trimmed mean to ensure robustness.
- FDI uses model-based estimation (more on this to follow), so no outlier weights are required.
- Reduce the value of the outlier to the k-value and include the reduced value in estimation.

Main differences between the methods
- Distance from the Mean excludes data.
- Winsorisation does not exclude data points but alters their values to bring them closer to the mean.

Winsorisation
One-sided Winsorisation only treats large positive values as outliers. Some questions can contain both positive and negative numbers, so these data need to be split into two parts:
- positive data
- negative data, which is converted to absolute values to remove the negative signs
Winsorisation is then applied to both parts before the data is recombined.
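The split-and-recombine step can be sketched as below, with each value above its cut-off reduced to the cut-off. The function names and the two cut-off arguments are hypothetical; how the system derives the cut-offs is not reproduced here, so they are simply passed in.

```python
def winsorise_one_sided(value, cutoff):
    """Reduce a non-negative value to the cut-off if it exceeds it."""
    return min(value, cutoff)


def winsorise_signed(values, cutoff_positive, cutoff_negative):
    """Winsorise positive and negative data separately, then recombine.

    Negative values are converted to absolute values, winsorised against
    `cutoff_negative`, and given their sign back. The original order of
    the values is preserved.
    """
    result = []
    for v in values:
        if v >= 0:
            result.append(winsorise_one_sided(v, cutoff_positive))
        else:
            result.append(-winsorise_one_sided(abs(v), cutoff_negative))
    return result


print(winsorise_signed([5, 200, -3, -150], cutoff_positive=100, cutoff_negative=80))
# [5, 100, -3, -80]
```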

Application of methods
In the first instance, Distance from the Mean will be used for outliering all data. Once we have a better understanding of the new data coming in, Winsorisation will be turned on in the system.

What is estimation?
- A method of deriving values for companies that weren't sampled
- Ensures an overall data output can be provided for the population
- Only applied to strata that are not fully enumerated

Method applied
Weighted stratum mean: apply the mean for each stratum and question group to every non-sampled business. That's it!
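A minimal sketch for one stratum and question group is given below. The function names are hypothetical, and because the transcript does not say what the weights in the "weighted" mean are, they are an optional argument that defaults to equal weights.

```python
def weighted_mean(values, weights=None):
    """Weighted mean of the sampled values; equal weights if none are given."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def stratum_total(sampled_values, population_size, weights=None):
    """Sampled values plus the stratum mean for every non-sampled business."""
    non_sampled = population_size - len(sampled_values)
    estimate_per_business = weighted_mean(sampled_values, weights)
    return sum(sampled_values) + non_sampled * estimate_per_business


# Three sampled businesses in a stratum of five: the two non-sampled
# businesses each receive the stratum mean of 20.
print(stratum_total([10, 20, 30], population_size=5))  # 100.0
```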

Aggregation
The population is then added up over industry and country groups to produce the final set of results.

Primary Disclosure
Three rules are applied to test for disclosive data. If a value meets any one of these rules, the value is disclosive and is suppressed.
Rules:
1. Fewer than 3 wowentrefs within a cell
2. Largest value > 91% of the total within a cell
3. Fewer than 19 RUs within a cell and (total value - (largest + 2nd largest value)) < 0.1 * largest value
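The three rules can be sketched as a single check on one cell. The function name and argument layout are assumptions, as is treating the cell's values as non-negative contributions with one value per RU; the thresholds (3, 91%, 19 and 0.1) are the ones listed on the slide.

```python
def is_disclosive(ru_values, n_wowentrefs):
    """Return True if a cell meets any of the three primary disclosure rules.

    `ru_values` holds one value per RU in the cell; `n_wowentrefs` is the
    number of distinct wowentrefs contributing to it.
    """
    # Rule 1: fewer than 3 wowentrefs within the cell.
    if n_wowentrefs < 3:
        return True

    ordered = sorted(ru_values, reverse=True)
    total = sum(ru_values)
    largest = ordered[0] if ordered else 0
    second_largest = ordered[1] if len(ordered) > 1 else 0

    # Rule 2: the largest value is more than 91% of the cell total.
    if largest > 0.91 * total:
        return True

    # Rule 3: fewer than 19 RUs, and what remains after removing the two
    # largest values is less than 10% of the largest value.
    remainder = total - (largest + second_largest)
    if len(ru_values) < 19 and remainder < 0.1 * largest:
        return True

    return False


print(is_disclosive([95, 3, 2], n_wowentrefs=3))        # True (rule 2)
print(is_disclosive([30, 30, 20, 20], n_wowentrefs=4))  # False
```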

Secondary Disclosure
Further suppression is required because, if only one value within a group has been suppressed, it is possible to recalculate it from the group total and the other values.

Example
Malta can be calculated as EU total - all other countries: Malta = 1403. Suppress a 2nd country to hide Malta's value.
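The recalculation risk can be sketched as below. Only Malta's value of 1403 comes from the slide; the EU total, the other countries and their values, and the function name are made up for illustration, with `None` standing for a suppressed cell.

```python
def recoverable_value(group_total, cells):
    """Return (name, value) if exactly one suppressed cell can be recalculated.

    `cells` maps cell name -> published value, or None if suppressed.
    Returns None when zero or several cells are suppressed, because the
    subtraction then no longer pins down a single value.
    """
    suppressed = [name for name, value in cells.items() if value is None]
    if len(suppressed) != 1:
        return None
    published = sum(value for value in cells.values() if value is not None)
    return suppressed[0], group_total - published


# Hypothetical figures, chosen so that the subtraction gives the slide's 1403.
eu_total = 5000
only_malta_hidden = {"France": 2000, "Germany": 1597, "Malta": None}
print(recoverable_value(eu_total, only_malta_hidden))  # ('Malta', 1403)

# After secondary suppression of a 2nd country, Malta can no longer be derived.
two_hidden = {"France": 2000, "Germany": None, "Malta": None}
print(recoverable_value(eu_total, two_hidden))  # None
```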