How to Fake Data, if you must. Rachel Fewster, Department of Statistics.


Who wants to fake data? Electoral finance returns… Toxic emissions reports… Business tax returns…

Land areas of world countries: real or fake?

[Figure: tally charts of first digits for candidate lists of country land areas] This one seems more even… This one has as many 1s as 5-9s put together! This one is right!

Real land areas of world countries: [Figure: tally chart of first digits] 11 of them begin with digits 1 – 4… Only 5 begin with digits 5 – 9…

Friday's newspaper: [Figure: tally chart of first digits] 10 out of 34 numbers began with a 1… None out of 34 began with a 9!

The Curious Case of the Grimy Log-books In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…

The Curious Case of the Grimy Log-books: The books always seemed grubby on the first pages… but clean on the last pages. The first pages are for numbers beginning with digits 1 and 2… The last pages are for numbers beginning with digits 8 and 9…

The Curious Case of the Grimy Log-books People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9. Why? Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!

Newcomb's Law (American Journal of Mathematics, 1881): Over 30% of numbers begin with a 1!! Fewer than 5% of numbers begin with a 9!!

The First Digits… Over 30% of numbers begin with a 1 Only 5% of numbers begin with a 9

The First Digits… [Figure: numbers beginning with a 1; numbers beginning with a 9] There is the same opportunity for numbers to begin with 9 as with 1… but for some reason they don't!

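Chance of a number starting with digit d: log10((d + 1)/d). So the chance of a 1 is log10(2/1) = 0.301, the chance of a 2 is log10(3/2) = 0.176, the chance of a 3 is log10(4/3) = 0.125, and so on.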
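The formula on this slide can be tabulated in a few lines. This is a minimal sketch; the function name `benford_probability` is made up for illustration and does not come from the talk.

```python
import math


def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d (1 to 9) under the logarithmic law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10((d + 1) / d)


probabilities = {d: benford_probability(d) for d in range(1, 10)}

for d, p in probabilities.items():
    print(f"first digit {d}: {p:.3f}")

# The chances for 5, 6, 7, 8 and 9 add up to log10(10/5) = log10(2),
# which is exactly the chance of a 1.
print(sum(probabilities[d] for d in range(5, 10)))
```

The last line ties back to the land-area slide: under this law there should be as many 1s as 5-9s put together.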

Reactions to Newcomb's law: Nothing! …for 57 years!

Enter Frank Benford, 1938: Physicist with the General Electric Company. Assembled over 20,000 numbers and counted their first digits! "A study as wide as time and energy permitted."

Populations
Numbers from newspapers
Drainage rates of rivers
Numbers from Reader's Digest articles
Street addresses of American Men of Science

About 30% begin with a 1. About 5% begin with a 9.

Benford gave the law its name… but no explanation. Anomalous numbers!!

…The logarithmic law applies to outlaw numbers that are without known relationship, rather than to those that follow an orderly course; and so the logarithmic relation is essentially a Law of Anomalous Numbers.

Explanations for Benford's Law: Numbers from a wide range of data sources have about 30% of 1s, down to only 5% of 9s. Benford called these outlaw or anomalous numbers. They include street addresses of American Men of Science, populations, areas, and numbers from magazines and newspapers. Benford's orderly numbers, like atomic weights and physical constants, don't follow the law. What is the explanation?

Popular Explanations:
Scale Invariance
Base Invariance
Complicated Measure Theory
Divine choice
Mystery of Nature

Scale Invariance and Base Invariance say that IF there is a universal law, it must be Benford's. They don't explain why there should be a law to start with!

Complicated Measure Theory, in a nutshell: If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford's Law.
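A small simulation can illustrate this convergence. The sketch below is only a toy, not the measure-theoretic result: the three distribution families, the six-decade range of scales, the seed, and the names `first_digit` and `random_mixture_sample` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def first_digit(x: float) -> int:
    """Leading significant digit of a positive number, read off its scientific notation."""
    return int(f"{x:.17e}"[0])


def random_mixture_sample(rng: random.Random) -> float:
    """Draw one number from a randomly chosen distribution with a randomly chosen scale."""
    scale = 10 ** rng.uniform(0, 6)  # assumed: scales spread over six orders of magnitude
    family = rng.choice(["uniform", "exponential", "lognormal"])
    if family == "uniform":
        return rng.uniform(0, scale)
    if family == "exponential":
        return rng.expovariate(1 / scale)  # exponential with mean equal to scale
    return rng.lognormvariate(math.log(scale), 1.0)


rng = random.Random(2024)  # fixed seed so the run is repeatable
samples = []
while len(samples) < 100_000:
    x = random_mixture_sample(rng)
    if x > 0:  # a draw of exactly zero has no first digit
        samples.append(x)

counts = Counter(first_digit(x) for x in samples)
frequencies = {d: counts[d] / len(samples) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: simulated {frequencies[d]:.3f}, Benford {math.log10((d + 1) / d):.3f}")
```

Much of the agreement here comes from letting the scale roam over several orders of magnitude, which is itself a modelling choice rather than something the slide specifies.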

That's why THIS works well.

It doesn't explain why street addresses of American Men of Science works well! It doesn't really explain WHAT will work well, nor why.

The Key Idea… If a hat is covered evenly in red and white stripes… it will be half red and half white. (Photo: Eric Pouhier, http://commons.wikimedia.org/wiki/Napoleon)

The red stripes and the white stripes even out over the shape of the hat. If the red stripes cover half the base, they'll cover about half the hat.

What if the red stripes cover 30% of the base? Then they'll cover about 30% of the hat.

What if the red stripes cover precisely fraction 0.301 = log10(2/1) of the base? Then they'll cover fraction ~0.301 of the hat.

Think of X as a random number… We want the probability that X has first digit = 1. Let the hat be a probability density curve for X. Then AREAS on the hat give PROBABILITIES for X. For example, Pr(1 < X < 5) = 0.95 means the area from 1 to 5 is 0.95, out of a total area of 1.

In the same way… If the red stripes somehow represent the X values with first digit = 1, and the red stripes have area ~ 0.301, then Pr(X has first digit 1) ~ 0.301.

So X values with first digit = 1 somehow lie on a set of evenly spaced stripes? Write X in scientific notation: X = r × 10^n, where r is between 1 and 10 and n is an integer.

For example… with X = r × 10^n, where r is between 1 and 10 and n is an integer, only r matters for the first digit of X! The first digit is 1 when 1 ≤ r < 2, and it is something else when r ≥ 2.

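Take logs to base 10… log10(X) = log10(r) + n. Or in other words… log10(X) = n + log10(r), where n is an integer and log10(r) is between 0 and 1 because r is between 1 and 10. The first digit is 1 when 1 ≤ r < 2, which is when 0 ≤ log10(r) < log10(2) = 0.301.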

X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n:
n = 0: 0 ≤ log(X) < 0.301 (X from 1 to 2)
n = 1: 1 ≤ log(X) < 1.301 (X from 10 to 20)
n = 2: 2 ≤ log(X) < 2.301 (X from 100 to 200)
and so on! STRIPES!! The hat is the probability density curve for log(X).
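The stripe condition can be checked directly: a number has first digit 1 when the fractional part of its base-10 logarithm is below log10(2). This is a minimal sketch; the function name `on_red_stripe` is made up for illustration.

```python
import math

LOG10_2 = math.log10(2)  # width of each red stripe, about 0.301


def on_red_stripe(x: float) -> bool:
    """True when log10(x) lies between n and n + log10(2) for some integer n."""
    fractional_part = math.log10(x) % 1  # position of log10(x) past the integer below it
    return fractional_part < LOG10_2


for x in [1.5, 15, 150, 1999, 0.0012, 3, 45, 900]:
    print(f"{x}: log10 = {math.log10(x):.3f}, first digit 1? {on_red_stripe(x)}")

# Values sitting exactly on a stripe edge (such as 2 or 20) can be misjudged
# because of floating-point rounding, so they are left out of this demonstration.
```

Note that the same test works for numbers below 1, such as 0.0012, because the stripes repeat at every integer on the log scale, including the negative ones.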

So X values with first digit = 1 DO lie on evenly spaced stripes, on the log scale! The PROBABILITY of getting first digit 1 is the AREA of the red stripes, which is approximately the fraction on the base, = 0.301.

We've done it! We've shown that we really should expect the first digit to be 1 about 30% of the time!

Intuitively… The log scale distorts: small numbers (e.g. 100) are stretched out; larger numbers (e.g. 900) are bunched up. The first digit corresponds to regularly spaced stripes on the log scale. So the smallest numbers (first digit = 1) are stretched out, and get the highest probability!

When is this going to work? We need a lot of stripes to balance out big ones and little ones! We get one stripe every integer… So we need a lot of integers! The distribution of X needs to be WIDE on the log scale!

When is this going to work? X ranges from 0 to 6 on the log scale… So it ranges from 1 to 10^6 on the usual scale!

These are Benford's Outlaw Numbers! All we need is a distribution that is: WIDE (4 – 6 orders of magnitude or more) and reasonably SMOOTH… Then the red stripes will even out to cover about 30% of the total area.
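To see the effect of width, the area of the red stripes can be computed exactly for one smooth choice of hat. The sketch below assumes that log10(X) follows a normal curve; that choice, the example means and standard deviations, and the names `normal_cdf` and `stripe_area` are assumptions for illustration only.

```python
import math

LOG10_2 = math.log10(2)  # width of each red stripe on the log scale


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def stripe_area(mean_log: float, sd_log: float) -> float:
    """Pr(first digit = 1) when log10(X) is normal with the given mean and sd."""
    lowest = math.floor(mean_log - 10 * sd_log)
    highest = math.ceil(mean_log + 10 * sd_log)
    total = 0.0
    for n in range(lowest, highest + 1):
        # Area of the hat over the stripe from n to n + log10(2).
        upper = normal_cdf((n + LOG10_2 - mean_log) / sd_log)
        lower = normal_cdf((n - mean_log) / sd_log)
        total += upper - lower
    return total


wide = stripe_area(mean_log=3.0, sd_log=1.5)     # hat spanning many integers
narrow = stripe_area(mean_log=5.8, sd_log=0.03)  # hat sitting between two stripes

print(f"wide hat:   Pr(first digit 1) = {wide:.4f}")
print(f"narrow hat: Pr(first digit 1) = {narrow:.4f}")
print(f"Benford:    {LOG10_2:.4f}")
```

The wide hat covers many stripes, so its red area settles at the base fraction; the narrow hat misses the stripes altogether.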

In Real Life… World Populations: From 50 for the Pitcairn Islands… to 1.3 × 10^9 for China… Wide (9 integers => 9 stripes). First digits are a very good fit to Benford!

Electorate populations? From 583,000 to 773,000 in California: Of course not! All the first digits are 5, 6, or 7… The hat has less than one stripe! Benford doesn't work here.

But naturally occurring populations are a different story! Cities in California: from 94 in the city of Vernon… to 3.9 million in Los Angeles… Wide enough (5 integers => 5 stripes). Yes! It's Benford!

Powerball Jackpots? - from $10 million to $365 million… Not bad! Orders of magnitude only 1.5 … … but sometimes you just hit lucky! Data with kind permission from
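The width of each of these examples on the log scale can be worked out from the smallest and largest values quoted on the slides. This is a rough sketch; the function name `orders_of_magnitude` and the dictionary layout are made up for illustration, and the ranges are simply the figures quoted above.

```python
import math


def orders_of_magnitude(smallest: float, largest: float) -> float:
    """Width of the range from smallest to largest on the log10 scale."""
    return math.log10(largest / smallest)


# (smallest, largest) pairs as quoted on the slides.
examples = {
    "World populations": (50, 1.3e9),
    "California electorates": (583_000, 773_000),
    "California cities": (94, 3.9e6),
    "Powerball jackpots": (10e6, 365e6),
}

widths = {name: orders_of_magnitude(lo, hi) for name, (lo, hi) in examples.items()}

for name, width in widths.items():
    print(f"{name}: about {width:.1f} orders of magnitude")
```

A width well below 1, as for the electorates, means the hat cannot even span one full stripe-and-gap cycle, so Benford has no chance to apply.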

Your tax return….? If you plan to fake data, you should first check whether it ought to be Benford! BUT the IRD has a few other tricks up its sleeve too….

To find out more: A Simple Explanation of Benford's Law, by R. M. Fewster, The American Statistician, to appear. PDF from Judy Paterson's CMCT course, Term : Centre for Mathematical Content in Teaching. Thanks for listening!