DATA INTEGRATION AND ERROR: BIG DATA FROM THE 1930’S TO NOW.


Presentation on theme: "DATA INTEGRATION AND ERROR: BIG DATA FROM THE 1930’S TO NOW."— Presentation transcript:

1 DATA INTEGRATION AND ERROR: BIG DATA FROM THE 1930’S TO NOW

2 CONTENTS
Copyright ©2012 The Nielsen Company. Confidential and proprietary.
- Big Data in the 1930’s and why that matters now
- TV measurement and Return Path Data (STB)
- Interesting questions for understanding error

3 BIG DATA 1930’S STYLE

4 PROBABILITY SAMPLING 1930’S STYLE

5 EVOLUTION OF STATISTICAL CONCEPTS IN RESEARCH
- Early days: Novel, non-scientific
- 1930’s: Scientific sampling
- Since the 1950’s: weighting, probability models, imputation techniques, data fusion, time series analyses, hybrid (Big Data/sample integration)

6 NIELSEN AND AUDIENCE MEASUREMENT
- 1923: Nielsen founded
- 1950: Introduces TV audience measurement
- Current technology: People Meter
- Electronic measurement
- Probability samples
- All people and sets in home measured
- Nielsen Ratings are the currency for US TV advertising

7 THE CHANGING TV ENVIRONMENT
- Fragmentation of Viewing Choices
- Proliferation of Devices
- Increasing Population Diversity

8 RESEARCH DATA - STATISTICAL TOOLS
From: Sample/Measure/Project (Panel Data)
To: Sample/Measure/Project + Integrate
- Data Fusion
- Probability Modeling
- Calibration
- Predictive Modeling
Using Multiple Panels, Census Data, Surveys

9 WHAT STB AND PANELS CAN GIVE US
- STB (DATA): Large convenience samples, stable results
- Panels (RESEARCH PRODUCTS): Completeness of audience measurement
- In combination, STB + Panels offer the possibility of stable, UNBIASED RESEARCH

10 STB GAPS AND BIAS
1. Data quality/coverage/timeliness/representativeness
2. Set Activity (On/Off/Other Source)
3. Household Characteristics
4. Persons viewing (including visitors in the home)
5. Other Viewing Activity
[Figure: Bias and Standard Error as components of Total Survey Error, shown for STB, People Meter, and "STB + People Meter?"]

11 STB DATA QUALITY – EXAMPLE ANALYSES
[Charts labeled "Good…" and "Not so good…"; chart titles: Machine Reboot Activity, Program junction spikes]

12 ARE WE IMPROVING THE MEASUREMENT?
1. Transparency and validation at each step and overall
2. Total Survey Error
[Figure: Total Survey Error, with components Bias and Standard Error]
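This slide pairs Total Survey Error with Bias and Standard Error. Assuming it refers to the usual mean squared error decomposition (an assumption; the transcript does not spell out the formula), the relationship can be written as follows, with illustrative numbers that are not taken from the deck:

```latex
\mathrm{TSE}^2 = \mathrm{Bias}^2 + \mathrm{SE}^2
\qquad\Longrightarrow\qquad
\mathrm{TSE} = \sqrt{\mathrm{Bias}^2 + \mathrm{SE}^2}

% Illustrative only: Bias = 3, SE = 4 rating points
\mathrm{TSE} = \sqrt{3^2 + 4^2} = 5
```

Under this reading, a large convenience sample can shrink the standard error term while leaving the bias term untouched, which is why the two components have to be assessed together.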

13 ASSESSING INTEGRATION ERROR
- Input Error (GIGO)
- Matching Error
- Statistical Error
- Validity Levels
- Multiple Database error compounding

14 ASSESSING INTEGRATION ERRORS
Input Error (GIGO)
- Coverage gaps, definitional problems, input errors, etc.
- But possible improvement through integration weighting effects
Most problems remain but some can be mitigated through integration

15 ASSESSING INTEGRATION ERRORS
Matching Error (e.g. address matching)
- Good: correct match. Bad: no match. Ugly: incorrect match.
- Trade-off between match rates and error rates
Multiple databases may have correlated errors. That may be preferable to random errors, since the overall effect is restricted to a smaller group (e.g. new householders in some address lists).

16 STATISTICAL ERROR (SAMPLE-BASED IMPUTATION)
- Model bias leads to attenuation (regression to mean)
- Individual data point bias can be undetectable due to sampling error

17 SEPARATING MODEL BIAS AND SAMPLING ERROR
- Z-tests on each comparison and evaluation of Z-score distributions
- Deviation from expected distribution gives bias estimate
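One way to read the Z-score approach on this slide is sketched below. It is not the deck's actual procedure: the function names, the inputs (imputed estimates, benchmark values, and their standard errors) and the choice of summary statistics are all assumptions made for illustration. The idea is that with sampling error alone the Z-scores should look like draws from a standard normal (mean 0, mean square 1), so any excess in the mean square hints at model bias.

```python
import statistics


def z_scores(estimates, benchmarks, std_errors):
    """Z-score for each comparison: (estimate - benchmark) / standard error."""
    return [(e - b) / se for e, b, se in zip(estimates, benchmarks, std_errors)]


def bias_summary(z):
    """Compare the observed Z-scores with what pure sampling error predicts.

    Under sampling error only, E[z] = 0 and E[z^2] = 1. The amount by which
    the mean square exceeds 1 is a rough estimate of squared bias, expressed
    in standard-error units (hypothetical summary, not the deck's metric).
    """
    mean_z = statistics.fmean(z)
    mean_sq = statistics.fmean(v * v for v in z)
    return {
        "mean_z": mean_z,                          # systematic over/under-estimation
        "mean_square": mean_sq,                    # 1.0 expected with no bias
        "excess_squared_bias": max(0.0, mean_sq - 1.0),
        "share_outside_95": sum(abs(v) > 1.96 for v in z) / len(z),  # ~0.05 expected
    }
```

A usage example would pass one entry per comparison (for instance, one per network and product category) and inspect `share_outside_95`: a value well above 0.05 suggests the deviations are not explained by sampling error alone.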

18 STATISTICAL ERROR - MULTIPLE DATA SETS
[Diagram: two ways of linking the TV, Buy and Web data sets: Hub and Spoke vs. Sequential]
Comparison with Single Source Data: Nielsen National People Meter TV and Internet matched with Credit Card Purchase Data

19 ACCURACY TEST
Correlation of 8 product categories with 14 TV Networks and 60 Websites
[Diagram: Hub and Spoke vs. Sequential linkage of TV, Buy and Web. Correlations shown on the diagram, in extraction order: R = 0.4, R = 0.5, R = 0.67, R = 0.44]

20 SEQUENTIAL VS HUB AND SPOKE
- Unless the Hub has all the relevant linking information, a sequential approach gives better results
- In our example, we captured interactions between web and purchase behavior through the sequential fusion
- However, sequential fusions can fall down with too many data sets, as error compounds.

21 VALIDITY LEVELS – INDIVIDUAL VS AGGREGATED
Individual Prediction
- IDEAL SCENARIO: You can predict every individual’s behavior
- REALITY: With most imputation methods we can do better than random, but rarely can we get close to 100% accuracy.
- E.g. ~40% improvement on random when predicting product users based on cookies, i.e. 14% of online ad impressions delivered to product users rather than 10%
Aggregate Prediction
- Imputation methods can reliably predict aggregate-level behavior given good predictive variables
- E.g. 90% accuracy (10% regression to mean) for TV audience estimates by product users
- Errors compound with multiple sources, but the extent varies by case
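The two figures on this slide can be checked with a few lines of arithmetic. The sketch below uses hypothetical function names. The first function reproduces the "~40% improvement" from the 14% vs 10% delivery rates; the second shows one possible reading of "10% regression to mean", namely that an aggregate estimate keeps 90% of its true distance from the average. That second reading is an assumption, not something the deck states.

```python
def relative_lift(targeted_rate, baseline_rate):
    """Improvement over random targeting, as a fraction of the baseline."""
    return targeted_rate / baseline_rate - 1.0


def attenuate(true_value, mean_value, regression_to_mean):
    """Shrink a true value toward the mean by the given fraction (assumed model)."""
    return mean_value + (1.0 - regression_to_mean) * (true_value - mean_value)


# 14% of impressions reach product users instead of the 10% expected at random.
lift = relative_lift(0.14, 0.10)            # about 0.40, i.e. ~40% better than random

# Illustrative only: a true audience index of 120 against an average of 100.
fused_index = attenuate(120.0, 100.0, 0.10)  # about 118
```

The contrast is the point of the slide: a 40% lift still means 86% of impressions miss product users at the individual level, while the aggregate estimate in the second example loses only a tenth of its signal.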

22 CONCLUSION
- Data Everywhere!
- Data quality and relevance are essential
- Integration brings insights and error
- Statistical integrity is as important now as it was in the 1930’s

23 APPENDIX

24 AD EFFECTIVENESS - MORE COMPLICATED
Imagine a data set of 10,000 people for whom you have tracked exposure to a brand’s website and subsequent purchase of that brand. In our initial thought experiment, 76% converted.
[Diagram: HUB (matching info) linked to PURCHASE, Website visit, TBD...]

25 A BASIC EXPERIMENT
Now imagine that you have measurement error in 10% of your cases. We ran a simulation of 1000 datasets which had incorrect data on site visits in 10% of cases. The difference between the original conversion rate and that in the 1000 error-ridden test cases is about 8.5%. SD is xx.
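A minimal sketch of this kind of experiment follows. It is not the authors' simulation: the deck gives only the 10,000 people, the 76% conversion rate, the 10% error rate and the 1000 datasets. The visit rate, the purchase rate among non-visitors, the "flip the visit flag" error model and all function names are assumptions, so the output should not be expected to reproduce the 8.5% figure.

```python
import random
import statistics


def make_population(n_people, visit_rate, conv_if_visit, conv_if_no_visit, rng):
    """Error-free data: one (visited_site, purchased) pair per person."""
    people = []
    for _ in range(n_people):
        visited = rng.random() < visit_rate
        p_buy = conv_if_visit if visited else conv_if_no_visit
        people.append((visited, rng.random() < p_buy))
    return people


def conversion_rate(people):
    """Share of recorded site visitors who purchased."""
    purchases = [bought for visited, bought in people if visited]
    return sum(purchases) / len(purchases) if purchases else float("nan")


def corrupt_visits(people, error_rate, rng):
    """Flip the site-visit flag for a random share of cases (assumed error model)."""
    corrupted = []
    for visited, bought in people:
        if rng.random() < error_rate:
            visited = not visited
        corrupted.append((visited, bought))
    return corrupted


def simulate(n_datasets=1000, n_people=10_000, error_rate=0.10,
             visit_rate=0.5, conv_if_visit=0.76, conv_if_no_visit=0.30, seed=0):
    """Return (true rate, mean shortfall, SD of shortfall) across noisy datasets."""
    rng = random.Random(seed)
    truth = make_population(n_people, visit_rate, conv_if_visit, conv_if_no_visit, rng)
    true_rate = conversion_rate(truth)
    diffs = [true_rate - conversion_rate(corrupt_visits(truth, error_rate, rng))
             for _ in range(n_datasets)]
    return true_rate, statistics.fmean(diffs), statistics.stdev(diffs)
```

Calling `simulate()` with the defaults mirrors the slide's sizes; smaller values of `n_datasets` and `n_people` run much faster and are enough to see the direction of the effect. The size of the shortfall depends heavily on the assumed gap between visitor and non-visitor purchase rates.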

26 A BASIC EXPERIMENT
What happens when we add another data set?
[Diagram: HUB (matching info) linked to PURCHASE!, Website visit, Saw TV ad, TBD...]

27 MORE DATA – SAME ERROR
Given two types of ad exposure data to measure, the impact of error in a single data source should be less...
Imagine that you have measurement error in 10% of your cases for one data source, the same error as in the previous experiment. As expected, conversion values are closer to our error-free data set. SD =

28 MORE DATA – MORE ERROR
Next, we introduced error into the TV data set as well. Performance worsens. SD is xx. But the effect looks more additive than exponential.

29 MORE DATA – EVEN MORE ERROR
Next, we imagined combining 6 data sets, each with 10% error. WHAT DO WE SEE?
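The deck's own result for six data sets is not in the transcript. As a back-of-the-envelope companion, the sketch below computes how many records would carry at least one error if each of k data sets independently had a 10% error rate, and compares that with a simple additive bound. Independence of errors is an assumption, the function names are hypothetical, and this record-level share is not the same quantity as the conversion-rate error the slides measure.

```python
def share_with_any_error(n_datasets, error_rate):
    """Share of records hit by at least one error, assuming independent errors."""
    return 1.0 - (1.0 - error_rate) ** n_datasets


def additive_bound(n_datasets, error_rate):
    """Simple sum of error rates, capped at 1.0 (upper bound on the share)."""
    return min(1.0, n_datasets * error_rate)


def compounding_table(max_datasets=6, error_rate=0.10):
    """Rows of (k, share with any error, additive bound) for k = 1..max_datasets."""
    return [(k, share_with_any_error(k, error_rate), additive_bound(k, error_rate))
            for k in range(1, max_datasets + 1)]
```

For six data sets at 10% each, the independent-error share comes to roughly 47% of records, below the additive bound of 60%, because some records are hit by more than one error.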

30 MATCHING ERROR
In any data combination, there is an additional source of error: mismatches to the HUB or identity variable. Misspelled names can lead to false negatives. Non-deterministic matching can lead to false positives.
Introducing 10% matching error (to the first data set only, to both, and to the second only) suggests that the impact on conversion, relative to the error-free data, is negligible. This suggests that the quality of the data is more important than the matching quality.

31 ASIDE: THE IMPORTANCE OF WEIGHT
Here, TV data was heavily weighted toward exposure. That overwhelmed any error from website visit data. Indeed, it appeared to counterbalance it.

32 ASIDE: THE IMPORTANCE OF CORRELATION
The greater the correlation between the dependent and independent variable, the greater the impact of error.
[Charts: weaker correlation between web visit and purchase (xx); strong correlation between web visit and purchase (xx)]

33 WHAT DO WE KNOW THUS FAR?
Still more work to do, certainly. But we have formed certain hypotheses:
- When combining multiple data sets, the error appears additive.
- Error rates being equal, the underlying aspects of the data are more likely to impact the outcome than the combination.
- It is important, however, to qualify basic relatedness between each independent variable and the dependent outcome.
- This argues for a hub and spoke approach to data combination.
So how did these hypotheses fare in a quick test using real-world data? (next slide on your recent error work)

34 COMBINING DATA SETS
There are two basic paths to integrating data.
A serial integration: (A+B)+C
Each data set resulting from an integration is smaller than either original source due to non-matches.
[Diagram: Data Source A + Data Source B = Data Source A+B; Data Source A+B + Data Source C = Data Source A+B+C]
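The shrinkage described on this slide can be seen in a toy serial integration. The sketch below assumes each source is a dictionary keyed by a shared person or household ID and that integration keeps only IDs found in both inputs; the IDs, field names and `inner_join` helper are made up for illustration.

```python
def inner_join(left, right):
    """Keep only IDs present in both sources and merge their fields."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


# Toy sources keyed by a hypothetical shared ID.
source_a = {1: {"tv": True}, 2: {"tv": False}, 3: {"tv": True}, 4: {"tv": True}}
source_b = {2: {"web": True}, 3: {"web": False}, 4: {"web": True}, 5: {"web": True}}
source_c = {3: {"buy": True}, 4: {"buy": False}, 6: {"buy": True}}

a_plus_b = inner_join(source_a, source_b)        # IDs 2, 3, 4 survive
a_plus_b_plus_c = inner_join(a_plus_b, source_c)  # only IDs 3 and 4 survive
```

Each step can only lose records: four IDs in A become three in A+B and two in A+B+C, and any mismatch introduced at the first step is carried silently into the second.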

35 COMBINING DATA SETS
Another approach is a hub and spoke model: (A+B)+(A+C)...etc.
While the final integrated set is still reduced due to non-matches, the error from each match to the HUB is known.
[Diagram: HUB (matching info) linked to spokes labeled TBD]
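A hub and spoke integration can be sketched in the same toy style. Here the hub is assumed to be a set of reference IDs and each spoke a dictionary keyed by ID; the names, data and the use of a per-spoke match rate as the "known" matching quality are illustrative assumptions rather than the deck's method.

```python
def match_to_hub(hub_ids, spoke):
    """Match one spoke to the hub; return matched records and the spoke's match rate."""
    matched = {key: value for key, value in spoke.items() if key in hub_ids}
    match_rate = len(matched) / len(spoke) if spoke else 0.0
    return matched, match_rate


def integrate(hub_ids, spokes):
    """Match every spoke to the hub independently, then keep IDs found in all of them."""
    matched, rates = {}, {}
    for name, spoke in spokes.items():
        matched[name], rates[name] = match_to_hub(hub_ids, spoke)
    common = set(hub_ids)
    for records in matched.values():
        common &= set(records)
    combined = {key: {name: matched[name][key] for name in spokes} for key in common}
    return combined, rates


hub = {1, 2, 3, 4, 5}
spokes = {
    "tv": {1: True, 2: False, 3: True, 9: True},
    "web": {2: True, 3: False, 4: True},
    "buy": {3: True, 4: False, 7: True, 8: False},
}
combined, match_rates = integrate(hub, spokes)
```

The combined set is still small (only ID 3 appears in every spoke), but `match_rates` reports each spoke's quality separately (0.75 for TV, 1.0 for web, 0.5 for purchase), which is the visibility a serial chain does not give.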

36 AD EFFECTIVENESS - MORE COMPLICATED
Ad effectiveness captures the correlation between exposure to advertising and subsequent purchase of a product. When someone who sees an ad buys a product, we say they have CONVERTED.
[Diagram: HUB (matching info) linked to PURCHASE and spokes labeled TBD]

