Data Quality is Bad? Deal With It Dennis Shasha New York University.


1 Data Quality is Bad? Deal With It Dennis Shasha New York University

2 Data Quality Problem – challenges Two companies merge, or two divisions want to share data. Problem: identify common customers even though their names are spelled differently (work with Bellcore/Telcordia colleagues Munir Cochinwala, Verghese Kurien, and Gail Lalk). Real-time sensor network. Problem: sensors fail; want to avoid false alarms (work with physicist Alan Mincer and student Yunyue Zhu).

3 My Approach Let’s look at fields that have dealt with data quality problems for years though they consider these problems part of business as usual. We will ask: what do these fields do and how might that help us?

4 Data Quality Problem – biology Take two genetically identical plants, treat them in the same way, and measure the RNA expression levels. Get vastly different results. Differences increase if the experiments are done in different labs or by different people in the same lab. Even breathing can be dangerous… Goal: find causal relationships among genes.

5 What Can One Do? One way to tease out causality is to perform a time series experiment on closely spaced time points. Want close spacing to be able to say gene expression level at time t depends on gene expression levels at t-1. Start with noise-free model.

6 Noise-free Modeling of Transcriptome Time Series Data [Figure: gene expression versus time (t through t + 4) for genes z_i and z_k. Red squares represent a transition function f to be learned, mapping TFs at one time point (e.g. t, t + 3) to TFs + targets at the next (t + 1, t + 4).] Explain target gene expression as a function of up to 4 input TFs. (Krouk et al 2010 submitted [19])

7 Modeling Noise (poor quality) There is reason to believe that Gaussian noise is a decent model of the inconsistencies in biological replicates. So model the relationship between observations and “true” value by a Gaussian noise component. We’ll see whether this is a good idea or not.

8 [Figure, three panels. (A) Transcriptome data set: time series at 0, 3, 6, 9, 12, 15, and 20 min. (B) Noisy model (black box is Gaussian noise), with observation model g and dynamic model f relating Z(t), …, Z(t+5) and Y(t), …, Y(t+5). “Leave-out-last” test: train on 0 to 15 min, predict direction of change of each gene at 20 min: 71% correct. (C) Naive “trend-forecast” test: train on 12 and 15 min, predict direction of change of each gene at 20 min: 51% correct.] (Krouk et al 2010 submitted [19])
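
One way to write down the noisy model on this slide, as a sketch only: it assumes that Z(t) is the unobserved "true" expression state, that Y(t) is the measured value, and that the Gaussian noise is additive with a single variance σ². None of these details are stated on the slide.

```latex
\begin{aligned}
Z(t+1) &= f\big(Z(t)\big) && \text{(dynamic model)} \\
Y(t)   &= g\big(Z(t)\big) + \varepsilon(t), \qquad \varepsilon(t) \sim \mathcal{N}(0, \sigma^2) && \text{(observation model)}
\end{aligned}
```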

9 Test and Adaptation Test the model by predicting values at a time point not used in the training. Predictions are not generally perfect, so adaptation is to figure out which other time points to test. One way to do this is to perform the training and testing process with one fewer experiment. If the most critical experiment is at time t, then gather more data at time t.
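
The "one fewer experiment" idea can be sketched as follows. This is an illustration under assumptions, not the method used in the talk: the model is a plain least-squares line (`fit_line`, a hypothetical stand-in for the real network model), and "most critical" is taken to mean the time point whose removal hurts the held-out prediction the most.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, float]  # (time, measured value)


def fit_line(samples: List[Sample]) -> Callable[[float], float]:
    """Toy stand-in model: ordinary least-squares line through the samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in samples) / var_t
    intercept = mean_y - slope * mean_t
    return lambda t: slope * t + intercept


def most_critical_time(
    train: List[Sample],
    test: Sample,
    fit: Callable[[List[Sample]], Callable[[float], float]] = fit_line,
) -> Tuple[float, Dict[float, float]]:
    """Retrain with each training point left out and score the held-out test point.

    Returns the time whose removal causes the largest test error, plus all errors.
    """
    errors: Dict[float, float] = {}
    for i, (t, _) in enumerate(train):
        reduced = train[:i] + train[i + 1:]   # one fewer experiment
        model = fit(reduced)
        errors[t] = abs(model(test[0]) - test[1])
    critical = max(errors, key=errors.get)
    return critical, errors


# Made-up measurements: train on 0..15 min, test on 20 min.
train_points = [(0, 0.0), (5, 5.0), (10, 10.0), (15, 30.0)]
critical, errors = most_critical_time(train_points, (20, 40.0))
print(critical, errors)
```

In this toy run the point at 15 min is the one to replicate, because dropping it degrades the 20 min prediction the most.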

10 Lessons from Network Inference The objective is predictive power. Use the training set to train noise model and causal relationships among the genes. If predictions work out, then good. Modeling data quality is part of the learning problem.

11 Physics -- supernovas Look at sky and observe showers of gamma particles. Model the background as a Poisson process. Look for exceptionally high bursts (these can last seconds, minutes, hours, up to days). Aim telescopes in the appropriate part of the sky.

12 Astrophysical Application Motivation: In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual event burst may signal an event interesting to physicists. Technical Overview: 1. The sky is partitioned into 1800 × 900 buckets. 2. 14 sliding window lengths are monitored, from 0.1 s to 39.81 s.
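
A minimal sketch of burst detection against a Poisson background, for a single sky bucket. The binning of arrivals into fixed time bins, the background rate, the p-value threshold, and the function names are all assumptions made for illustration; the slides do not say how the thresholds are set.

```python
import math
from typing import List, Tuple


def poisson_tail(k: int, mu: float) -> float:
    """P(X >= k) for X ~ Poisson(mu), via the complementary CDF."""
    if k <= 0:
        return 1.0
    term = math.exp(-mu)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= mu / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def find_bursts(
    counts: List[int],
    rate_per_bin: float,
    window_bins: List[int],
    p_threshold: float = 1e-6,
) -> List[Tuple[int, int, int]]:
    """Flag every window whose total count is improbably high under the background.

    Returns (start_bin, window_length_in_bins, total_count) triples.
    """
    bursts = []
    for w in window_bins:
        if w > len(counts):
            continue
        mu = rate_per_bin * w              # expected background count in the window
        total = sum(counts[:w])
        for start in range(len(counts) - w + 1):
            if start > 0:                  # slide the window by one bin
                total += counts[start + w - 1] - counts[start - 1]
            if poisson_tail(total, mu) < p_threshold:
                bursts.append((start, w, total))
    return bursts


# Made-up counts for one bucket: quiet background with a burst in bins 6 to 8.
counts = [1, 0, 2, 1, 0, 1, 9, 10, 8, 1, 0, 1]
bursts = find_bursts(counts, rate_per_bin=1.0, window_bins=[1, 2, 4], p_threshold=1e-4)
print(bursts)
```

One spacing consistent with the stated endpoints is geometric: 0.1 · 10^(k/5) seconds for k = 0, …, 13 gives 14 lengths from 0.1 s to about 39.81 s. Whether the monitored windows actually follow that spacing is not stated on the slide.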

13 Physics -- adaptation A burst is only the first filter for detecting a supernova. If certain kinds of bursts (e.g. 10 second long bursts) lead to false positives often, then adjust the thresholds.

14 Physics -- lessons Once again the noise model is an integral part of the problem setting. Adaptation is ongoing (no fixed training set). Because physicists are looking for a single piece of information, e.g. there is a supernova at location X,Y, redundancy can overcome noise.

15 Drug Testing Give N patients a drug and N patients a placebo. This is a classic “data quality”/“biological variation” situation. Different patients will react differently to a drug and almost all patients will benefit from a placebo. Two questions: is the drug better than the placebo and how much?

16 Drug Testing -- Resampling Suppose you arrange the results in a table (patient id, drug/placebo, improvement). Compute the average improvement for the drug population. Evaluate significance using a permutation test. Evaluate the level using confidence intervals. Neither step requires an assumption about the distribution.

17 Typical table
Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Drug
4                      Placebo
Drug improvement: 38/3; Placebo: 13/3

18 One Permutation of table
Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Placebo
4                      Drug
Drug improvement: 22/3; Placebo: 29/3

19 Significance Test – is the drug’s apparent effect due to luck?
count = 0
do 10,000 times
    permute the drug/placebo column
    recompute improvement under permutation
    if recomputed improvement >= measured improvement in real test then count += 1
P-value = count/10,000: the probability of seeing an improvement at least this large if the drug/placebo labels made no difference.
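
The permutation test above can be sketched in Python. One assumption is made loudly here: "improvement" is taken to be the drug group's mean minus the placebo group's mean, which the slide does not spell out. The data are the six rows of the typical table; the function names and the fixed seed are illustrative choices.

```python
import random
from typing import List, Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def improvement(values: Sequence[float], labels: Sequence[str]) -> float:
    """Assumed statistic: mean(drug) - mean(placebo)."""
    drug = [v for v, lab in zip(values, labels) if lab == "Drug"]
    placebo = [v for v, lab in zip(values, labels) if lab == "Placebo"]
    return mean(drug) - mean(placebo)


def permutation_p_value(
    values: Sequence[float],
    labels: Sequence[str],
    trials: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of label shuffles whose improvement is >= the measured one."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    observed = improvement(values, labels)
    shuffled: List[str] = list(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(shuffled)          # permute the drug/placebo column
        if improvement(values, shuffled) >= observed:
            count += 1
    return count / trials


# Rows of the typical table: improvement and drug/placebo label.
values = [10, 12, 8, -3, 20, 4]
labels = ["Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo"]
p_value = permutation_p_value(values, labels)
print(p_value)
```

With only six patients there are just 20 distinct ways to assign three drug labels, and 3 of them reach the measured difference, so the estimate should land near 0.15: not significant at the usual 5% level.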

20 Confidence interval – what’s a good estimate of the drug’s benefit?
do 10,000 times
    take 2N elements from the original table with replacement
    compute improvement and record it
sort the 10,000 improvement scores
95% confidence interval = 250th score to 9,750th score
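
A sketch of the bootstrap confidence interval, under the same assumption that "improvement" means the drug mean minus the placebo mean. One detail is an addition not found on the slide: a resample that happens to contain no drug rows or no placebo rows has no defined improvement, so this sketch redraws it. Function names and the seed are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[float, str]  # (improvement, "Drug" or "Placebo")


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def bootstrap_ci(rows: List[Row], trials: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """95% percentile interval for mean(drug) - mean(placebo)."""
    rng = random.Random(seed)
    scores: List[float] = []
    while len(scores) < trials:
        sample = rng.choices(rows, k=len(rows))   # 2N rows, with replacement
        drug = [v for v, lab in sample if lab == "Drug"]
        placebo = [v for v, lab in sample if lab == "Placebo"]
        if not drug or not placebo:
            continue                              # assumption: redraw degenerate resamples
        scores.append(mean(drug) - mean(placebo))
    scores.sort()
    low = scores[trials * 25 // 1000 - 1]         # 250th score when trials = 10,000
    high = scores[trials * 975 // 1000 - 1]       # 9,750th score when trials = 10,000
    return low, high


# Rows of the typical table.
rows = [(10, "Drug"), (12, "Placebo"), (8, "Drug"), (-3, "Placebo"), (20, "Drug"), (4, "Placebo")]
low, high = bootstrap_ci(rows)
print(low, high)
```

With so few patients the interval is wide, which is the honest answer to "how much does the drug help?" for a table this small.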

21 Lessons from Drug Testing Assume different patients can react differently. Is the drug benefit significant? How much of a benefit does it have? Lesson: questions are simple; individual noise is overcome with redundancy.

22 Data Quality Problem – adversaries A farmer in the developing world wants to do a banking transaction. The bank has appointed the shopkeeper the bank agent. The shopkeeper will call the bank over an insecure phone line. The farmer doesn’t know whether the shopkeeper is truly honest and even whether messages can be intercepted and mangled (poor quality due to adversary).

23 Basic Solution Bank provides a collection of (essentially) one-time nonces and one-time pads to both the farmer and the shopkeeper ahead of time. Per transaction: the farmer and the shopkeeper each send a one-time nonce and a message to the bank listing the amount of the transaction. The bank verifies their identities via the nonces, and the farmer and shopkeeper verify the amounts via the one-time pads.

24 “Quality Issues” this Solves Replay is impossible because nonces are one-time. Mangling will be detected because of one-time pads. False confederates and hacking of telephone network will be detected thanks to one-time pads. Even a determined adversary can be overcome. Never mind a little random noise.
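
A small sketch of the replay-protection half of this scheme only: the bank hands out nonces ahead of time and accepts each one once. The class and method names are hypothetical, and the one-time-pad verification of amounts is deliberately not shown, since the slides do not give enough detail to reproduce it faithfully.

```python
import secrets
from typing import Dict, List, Set, Tuple


class Bank:
    """Toy bank that accepts a message only if it carries an unused nonce."""

    def __init__(self) -> None:
        self._unused: Dict[str, Set[str]] = {}
        self.ledger: List[Tuple[str, int]] = []

    def issue_nonces(self, party: str, n: int) -> List[str]:
        """Generate n random nonces for a party (delivered ahead of time, out of band)."""
        nonces = [secrets.token_hex(16) for _ in range(n)]
        self._unused[party] = set(nonces)
        return nonces

    def accept(self, party: str, nonce: str, amount: int) -> bool:
        """Record the claimed amount if the nonce is valid; each nonce works once."""
        unused = self._unused.get(party, set())
        if nonce not in unused:
            return False              # unknown or already used: possible replay
        unused.remove(nonce)          # burn the nonce
        self.ledger.append((party, amount))
        return True


bank = Bank()
farmer_nonces = bank.issue_nonces("farmer", 3)

first = bank.accept("farmer", farmer_nonces[0], 50)     # fresh nonce
replayed = bank.accept("farmer", farmer_nonces[0], 50)  # same message sent again
forged = bank.accept("farmer", "not-a-real-nonce", 50)  # guessed nonce
print(first, replayed, forged)
```

The replayed and forged messages are rejected without the bank needing to know whether the cause was an attacker or a faulty line, which is the point of designing for the adversary.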

25 Application – record matching Develop noise model: how sounds are misheard or how symbols are mistyped? Develop training set having correct outcomes but also metadata properties (e.g. who took the information and when was it taken) in case noise characteristics/probabilities depend on that. Model cost of errors vs. cost to clean.
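
As a placeholder for a learned noise model, record matching is often started with plain edit distance, where every inserted, deleted, or substituted character costs 1. That uniform cost is a stand-in chosen for this sketch, not the model the slide proposes; a trained model would replace it with costs learned from how symbols are actually mistyped. The names and the distance cutoff are illustrative.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def best_match(name: str, candidates: List[str], max_distance: int = 2) -> Optional[str]:
    """Closest candidate by case-insensitive edit distance, or None if all are too far.

    Assumes a non-empty candidate list.
    """
    scored = [(edit_distance(name.lower(), c.lower()), c) for c in candidates]
    distance, match = min(scored)
    return match if distance <= max_distance else None


print(best_match("Jon Smith", ["John Smith", "Jane Smythe"]))
```

The cutoff `max_distance` is where the cost of errors versus the cost to clean enters: a looser cutoff merges more true duplicates but also more distinct customers.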

26 Application – sensor reading Be conscious of what the goals of the sensor are, e.g. fire/no fire; earthquake/no earthquake. Use burst detection to locate possibly troublesome sensors in quiet times. Error model is key: could there be an adversary? Can you use non-parametric stats?

27 Lessons Data quality problems (i.e. noise or adversarial attacks) are an everyday occurrence in many fields. First lesson: model the amount of noise and design system to answer critical question (e.g. what is causal network, is drug effective, where is supernova) in spite of noise.

28 More Lessons Second lesson: If you can design for an adversary, then you get noise correction for free. Third lesson: Use the metadata to localize bursts of errors and shut down the source of the noise.

