Data Perturbation: An Inference Control Method for Database Security. Dissertation Defense, Bob Nielson, Oct 23, 2009.


I. Introduction Most security concerns can be handled with the SQL GRANT command. Others require a view-based approach. But what happens if we wish to disclose partial information about a table field, but not the individual records?

I. Introduction – The Problem The problem is to allow statistical analysis of the data while still protecting individual records. Example: given a database of cancer patients, allow a researcher to learn the cancer rate, but not that patient X has cancer.

I. Introduction – The Problem
Name    Dept  Sex  Salary
Bob     CS    M     30,000
Fred    CS    M    100,000
Mary    CS    F     50,000
Tim     IT    M     50,000
Tom     IT    M     60,000
Martha  IT    F     70,000
Ken     IT    M     50,000
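
To make the inference risk concrete, here is a minimal Python sketch over the example table. The `avg_salary` helper is hypothetical; it stands in for an aggregate query such as an SQL AVG with a WHERE clause, and is not taken from the presentation.

```python
# Rows from the example table: (name, dept, sex, salary)
EMPLOYEES = [
    ("Bob", "CS", "M", 30_000),
    ("Fred", "CS", "M", 100_000),
    ("Mary", "CS", "F", 50_000),
    ("Tim", "IT", "M", 50_000),
    ("Tom", "IT", "M", 60_000),
    ("Martha", "IT", "F", 70_000),
    ("Ken", "IT", "M", 50_000),
]


def avg_salary(rows, dept=None, sex=None):
    """Hypothetical aggregate query: average salary over the matching rows."""
    matched = [
        salary
        for (_, d, s, salary) in rows
        if (dept is None or d == dept) and (sex is None or s == sex)
    ]
    return sum(matched) / len(matched)


# A legitimate statistical query over the whole table.
overall = avg_salary(EMPLOYEES)

# The same kind of query, narrowed until it matches a single person:
# Mary is the only woman in CS, so the "average" is her exact salary.
leaked = avg_salary(EMPLOYEES, dept="CS", sex="F")
```

The second query is a valid aggregate, yet it discloses one individual's record, which is the situation inference control tries to prevent.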

II. Related Work Suppression, Anonymization, Partitioning, Data Logging, Conceptual, Hybrid, Perturbation

II. Related Work- Suppression Each query must access at least n records. Only n queries are allowed per day. There are known methods to get around these protections.
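
One well-known way around a minimum query-size rule is a difference attack: two permitted queries are subtracted to isolate one record. The sketch below is illustrative only; the threshold of 2 and the `restricted_sum` helper are assumptions, since the slide gives no concrete value for n or any implementation.

```python
MIN_QUERY_SIZE = 2  # assumed threshold; the slide does not give a value for n

# (name, dept, sex, salary), taken from the example table in the introduction
ROWS = [
    ("Bob", "CS", "M", 30_000),
    ("Fred", "CS", "M", 100_000),
    ("Mary", "CS", "F", 50_000),
    ("Tim", "IT", "M", 50_000),
    ("Tom", "IT", "M", 60_000),
    ("Martha", "IT", "F", 70_000),
    ("Ken", "IT", "M", 50_000),
]


def restricted_sum(rows, predicate, min_size=MIN_QUERY_SIZE):
    """Hypothetical SUM query that is refused when too few records match."""
    matched = [r for r in rows if predicate(r)]
    if len(matched) < min_size:
        raise PermissionError("query set too small")
    return sum(r[3] for r in matched)


# Asking for Mary's group directly is refused: only one record matches.
try:
    restricted_sum(ROWS, lambda r: r[1] == "CS" and r[2] == "F")
    direct_allowed = True
except PermissionError:
    direct_allowed = False

# Two larger queries are each allowed, and their difference is Mary's salary.
all_cs = restricted_sum(ROWS, lambda r: r[1] == "CS")
cs_men = restricted_sum(ROWS, lambda r: r[1] == "CS" and r[2] == "M")
inferred = all_cs - cs_men
```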

II. Related Work- Anonymization Replace the identifying fields with special characters. This method can still be compromised.

II. Related Work- Anonymization
Name  Dept  Sex  Salary
*     CS    M     30,000
*     CS    M    100,000
*     CS    F     50,000
*     IT    M     50,000
*     IT    M     60,000
*     IT    F     70,000
*     IT    M     50,000

II. Related Work- Partitioning All queries must access more than one band of records.

II. Related Work- Partitioning
Name    Dept  Sex  Salary
Bob     CS    M     30,000
Fred    CS    M    100,000
Mary    CS    F     50,000
Tim     IT    M     50,000
Tom     IT    M     60,000
Martha  IT    F     70,000
Ken     IT    M     50,000

II. Related Work – Logging A log of every query run is kept. Before a query is allowed, all possible inferences are checked. If the query would release an individual record, it is not permitted. Soon no queries are allowed.

II. Related work – Conceptual Design the database so that no confidential information is stored.

II. Related Work – Hybrid Try using a combination of several of these methods.

II. Related Work - Perturbation
Output Perturbation
Data Perturbation
Liew Perturbation
Nielson Perturbation
Note: Perturbation means changing the data.

II. Related Work – Output Perturbation Output perturbation works by changing the output of the query, not the physical data.
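
A minimal sketch of the idea, assuming uniform relative noise of up to 5% on the query result. Both the noise model and the `perturbed_average` name are illustrative assumptions; the slide does not specify how the output is changed.

```python
import random

SALARIES = [30_000, 100_000, 50_000, 50_000, 60_000, 70_000, 50_000]


def perturbed_average(values, max_relative_noise=0.05, rng=random):
    """Return the average with random noise added to the result only.

    The stored values are never modified; max_relative_noise is an
    assumed parameter, not a value from the presentation.
    """
    true_avg = sum(values) / len(values)
    noise = rng.uniform(-max_relative_noise, max_relative_noise)
    return true_avg * (1 + noise)


answer = perturbed_average(SALARIES, rng=random.Random(0))
```

Because only the answer is changed, repeating the same query can return a different value each time while the table itself stays exact.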

II. Related Work – Output Perturbation

II. Related Work – Data Perturbation Data perturbation works by changing the physical data. Two common methods:
1. Add a random value to each value.
2. Multiply each value by a random value.
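
The two methods can be sketched as follows. The uniform noise distribution and the function names are assumptions made for illustration. The default of 0.20 borrows the 20% figure attached to data perturbation in the hypotheses; reading that figure as a maximum relative change is also an assumption.

```python
import random


def additive_perturbation(values, max_offset, rng=random):
    """Method 1: add a random value in [-max_offset, +max_offset] to each value."""
    return [v + rng.uniform(-max_offset, max_offset) for v in values]


def multiplicative_perturbation(values, max_fraction=0.20, rng=random):
    """Method 2: multiply each value by a random factor near 1.

    max_fraction=0.20 is an assumed reading of the "(20%)" in the hypotheses.
    """
    return [v * rng.uniform(1 - max_fraction, 1 + max_fraction) for v in values]


salaries = [30_000, 100_000, 50_000, 50_000, 60_000, 70_000, 50_000]
stored_additive = additive_perturbation(salaries, max_offset=5_000)
stored_multiplicative = multiplicative_perturbation(salaries)
```

Unlike output perturbation, the perturbed lists are what would be stored, so every later query sees the same changed values.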

II. Related Work – Data Perturbation

II. Related Work – Liew Perturbation Liew perturbation steps:
1. Calculate the average, standard deviation, and count of the data.
2. Generate a new data set with the same average, standard deviation, and count.
3. Sort both data sets in ascending order.
4. Swap the perturbed values with each other.
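
The four steps might look like this in Python. This is a sketch under stated assumptions, not the published algorithm: the new data set is drawn from a normal distribution and then rescaled so that its average and standard deviation match exactly, and step 4 is read as replacing each original value with the generated value of the same rank.

```python
import random
import statistics


def liew_perturbation(values, rng=random):
    """Rank-preserving replacement of values (assumed reading of the steps)."""
    n = len(values)
    # Step 1: average, standard deviation, and count of the data.
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)

    # Step 2: generate n new values, then rescale them so that their mean
    # and standard deviation match the originals (normal draw is assumed).
    raw = [rng.gauss(mu, sigma) for _ in range(n)]
    raw_mu = statistics.mean(raw)
    raw_sigma = statistics.pstdev(raw)
    if raw_sigma == 0:
        return list(values)  # degenerate case: nothing to perturb
    generated = [mu + sigma * (g - raw_mu) / raw_sigma for g in raw]

    # Step 3: sort both data sets in ascending order.
    order = sorted(range(n), key=lambda i: values[i])
    generated.sort()

    # Step 4: the record holding the k-th smallest original value
    # receives the k-th smallest generated value.
    result = [0.0] * n
    for rank, index in enumerate(order):
        result[index] = generated[rank]
    return result


salaries = [30_000, 100_000, 50_000, 50_000, 60_000, 70_000, 50_000]
perturbed = liew_perturbation(salaries, random.Random(1))
```

Under this reading, averages and standard deviations over the whole field are preserved, while individual records no longer hold their true values.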

II. Related Work–Liew Perturbation

III Hypothesis and Proof Prove:
H1: Nielson perturbation is better than no perturbation.
H2: Nielson perturbation is better than data perturbation (20%).
H3: Nielson perturbation is better than Liew perturbation (20%).

III Hypothesis and Proof Disprove:
H1: Nielson perturbation is not better than no perturbation.
H2: Nielson perturbation is not better than data perturbation (20%).
H3: Nielson perturbation is not better than Liew perturbation (20%).

IV. Methodology
What is Nielson Perturbation?
Calculating the absolute error...
Finding optimal values for Nielson perturbation...
Experimental design...
Conducting the experiment...

IV. Methodology- Nielson Perturbation Nielson perturbation is a form of data perturbation. For the first gamma records in the data set, each value is multiplied by a random value between alpha and beta. This value is randomly negated.
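
A literal reading of this description can be sketched as below. Several points are assumptions: the random factor is drawn uniformly, negation happens with probability 0.5, gamma is treated as a record count, and the signed product replaces the original value (the slide does not say whether it replaces or is added to it). The parameter values in the usage line are placeholders, not the optimized values from the dissertation.

```python
import random


def nielson_perturbation(values, alpha, beta, gamma, rng=random):
    """Perturb the first `gamma` records (assumed, literal interpretation)."""
    perturbed = list(values)
    for i in range(min(gamma, len(perturbed))):
        factor = rng.uniform(alpha, beta)  # random value between alpha and beta
        if rng.random() < 0.5:             # randomly negate (probability assumed)
            factor = -factor
        perturbed[i] = perturbed[i] * factor
    return perturbed


salaries = [30_000, 100_000, 50_000, 50_000, 60_000, 70_000, 50_000]
# alpha, beta, and gamma here are illustrative placeholders only.
stored = nielson_perturbation(salaries, alpha=0.5, beta=1.5, gamma=3)
```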

IV Methodology- Nielson Perturbation

IV. Methodology - Nielson

IV. Methodology- Alpha/Beta/Gamma What are the best values? An evolutionary algorithm was deployed. The results after several days of computation were: 1.Alpha = Beta = Gamma = 66.87

IV. Methodology- Evolutionary Results

IV. Methodology- Nielson Perturbation

IV. Methodology- The Method Calculate the average error of each method. Use the central limit theorem: the distribution of sample averages approaches a normal distribution as the sample size grows.

IV. Methodology- The Method Use a t-test to determine whether two sample means are statistically different from each other at the 95% confidence level (a significance level of 0.05).
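
A standard-library sketch of such a comparison follows. It assumes Welch's unequal-variance form of the t statistic and, because the samples in a large simulation are big, approximates the p-value with the normal distribution. The slides do not say which t-test variant was used, so both choices are assumptions.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


def significantly_different(sample_a, sample_b, significance_level=0.05):
    """Two-sided test; the p-value uses a large-sample normal approximation."""
    t = welch_t(sample_a, sample_b)
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return p_value < significance_level


# Illustrative data only: average errors from two hypothetical methods.
errors_a = [10 + (i % 5) for i in range(200)]
errors_b = [12 + (i % 5) for i in range(200)]
different = significantly_different(errors_a, errors_b)
```

For small samples the normal approximation is too optimistic, and the p-value should come from the t distribution instead.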

IV. Methodology- Monte Carlo Simulation Randomly generate 100,000 databases and execute hundreds of queries against each. I will use arrays to test the accuracy, because speed is of major importance here. Whether the data sits in arrays or in a database does not matter for calculating the accuracy of query outputs.
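
The shape of such a simulation can be sketched with plain lists. Everything concrete here is an assumption chosen for illustration: the value range, the use of random-subset average queries, the default sizes (kept far below 100,000 databases so the example runs quickly), and the `perturb(values, rng)` callback signature.

```python
import random


def average_query_error(perturb, n_databases=100, db_size=50,
                        n_queries=20, rng=None):
    """Mean absolute error of average queries on perturbed vs. original data."""
    rng = rng or random.Random()
    total_error = 0.0
    count = 0
    for _ in range(n_databases):
        # Each "database" is just an array of random values.
        original = [rng.uniform(10_000, 100_000) for _ in range(db_size)]
        perturbed = perturb(original, rng)
        for _ in range(n_queries):
            # A query selects a random subset of records of random size.
            size = rng.randint(1, db_size)
            chosen = rng.sample(range(db_size), size)
            true_avg = sum(original[i] for i in chosen) / size
            perturbed_avg = sum(perturbed[i] for i in chosen) / size
            total_error += abs(true_avg - perturbed_avg)
            count += 1
    return total_error / count


def no_perturbation(values, rng):
    return list(values)


def twenty_percent(values, rng):
    return [v * rng.uniform(0.8, 1.2) for v in values]


baseline = average_query_error(no_perturbation, rng=random.Random(0))
noisy = average_query_error(twenty_percent, rng=random.Random(0))
```

Running the same harness once per perturbation method yields the per-method average errors that the t-test then compares.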

IV. Methodology- Calculating the average error The error should be bigger with smaller query sizes. The error should be smaller with larger query sizes.

IV. Methodology- The Fitness Function
e = |x - x’|
If q < n/2: fitness = 100 - e
Else: fitness = e
Smaller fitness scores are better.
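
Written as a function, the rule looks like this. The meaning of the symbols is inferred from context and is an assumption: x is the true query result, x’ the result on perturbed data, q the number of records the query touches, and n the number of records in the database.

```python
def fitness(x, x_perturbed, q, n):
    """Fitness score from the slide; smaller scores are better.

    Symbol meanings are assumed: x is the true result, x_perturbed the
    result on perturbed data, q the query size, n the database size.
    """
    e = abs(x - x_perturbed)
    if q < n / 2:
        # Small queries: a large error is desirable, so it lowers the score.
        return 100 - e
    # Large queries: a small error is desirable.
    return e


small_query_score = fitness(50, 40, q=2, n=10)
large_query_score = fitness(50, 40, q=8, n=10)
```

This matches the stated goal: errors should be large for small queries, which threaten individual records, and small for large queries, which carry the statistical value.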

V. Results and Conclusions

V. Results and Conclusions Significance: There is a real need for partial disclosure of a field in a table. My method ensures a higher degree of security. My method still allows for the release of averages and totals.

VI. Further Studies
Transformation times
On-the-fly perturbing