Probabilistic Data Management

Probabilistic Data Management Chapter 1: An Overview of Probabilistic Data Management

Objectives
In this chapter, you will:
- Get to know what uncertain data look like
- Explore the causes of uncertain data in different applications
- Learn the importance of studying uncertain data management
- Become aware of the classifications of uncertain data

Objectives (cont'd)
- Discover the pros and cons of uncertain data management, compared with traditional certain data management
- Become familiar with the history of uncertain data management, including some existing systems

Outline
- Introduction
- Applications of Probabilistic Data Management
- Classifications of Uncertain Data
- Comparisons: Uncertain vs. Certain Data
- The Existing Systems

Introduction
- Uncertain data are pervasive in real-world applications
  - Also known as probabilistic data, imprecise data, inaccurate data, or noisy data
- Data uncertainty may occur during:
  - Data collection
  - Data transmission
  - Data processing
[Figure: axis labels "probability", "reported data", "actual data"]

Data Collection
Data collection devices are sometimes imperfect:
- Sensors
  - Abnormal sensor readings
- RFID readers
  - Missed reads
  - Cross reads

Data Collection (cont'd)
Data extraction techniques are often inaccurate:
- Information extraction from unstructured text
- Different techniques can produce different extraction results
Example:
- Unstructured text: "I live at 203W Sugar Road"
- Technique 1 extracts Address: West Sugar Road
- Technique 2 extracts Address: Sugar Road

Data Transmission
Errors may occur during data transmission:
- Sensor networks
  - Packet losses lead to fewer or biased samples
  - Transmission errors lead to erroneous sensor data
[Figure: a sensor network and its sink]

Data Transmission (cont'd)
Errors may occur during data transmission:
- Global Positioning System (GPS)
[Figure: refraction and reflection]

Data Processing
Data can become imprecise when we manipulate them:
- Privacy preserving
  - Add synthetic noise to protect users' privacy before publishing data
- Lossy data compression
  - Trade data accuracy for space
- Data integration
  - Merge data from multiple data sources

Real-World Applications
Applications of Probabilistic Data Management:
- Sensor networks
- Location-based services
- Moving object search
- Data extraction and integration
- Privacy preserving

Applications (1) – Sensor Networks
Causes of data uncertainty:
- Environmental factors
- Low battery power
- Packet losses
[Figure: sensor networks]
Figure sources:
- www.dei.unipd.it/~schenato/
- http://particle.teco.edu/devices/devices.html
- http://www.olsr.org/
- www.robotstorehk.com/sensors/sensor.html

Applications (2) – Global Positioning System (GPS)
Causes of data uncertainty:
- Reflection or refraction of the satellite signal
[Figure: refraction and reflection of the signal]

Applications (3) – Data Extraction and Integration
Causes of data uncertainty:
- Unreliability of data sources
Each document entity carries the confidence that the document is true:

Document  Confidence
Doc 1     0.2
Doc 2     0.4
…         …
Doc l     0.3

[Figure: labels "data sources", "near duplicate documents", "a document entity"]

Applications (4) – Privacy Preserving
Medical data analysis:
- Generalize attribute values to uncertain intervals
- Avoid identifying sensitive information of patients

Original data:
Age  Sex  Zipcode  Disease
21   M    11000    pneumonia
50        37000    flu
51        31000    AIDS

Generalized data:
Age       Sex  Zipcode         Disease
[20, 30)  M    [10000, 20000]  pneumonia
[50, 60)       [30000, 40000]  flu
[50, 60)       [30000, 40000]  AIDS
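
The generalization step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `generalize` and the bucket widths (10 years for age, 10000 for zip code) are inferred from the intervals shown in the table, and the code uses half-open intervals for both attributes, whereas the slide writes the zip code intervals as closed.

```python
def generalize(value, width):
    """Return the half-open interval [lo, lo + width) that contains value."""
    lo = (value // width) * width
    return (lo, lo + width)

# Rows from the slide: (age, zipcode, disease).
patients = [(21, 11000, "pneumonia"), (50, 37000, "flu"), (51, 31000, "AIDS")]

# Assumed bucket widths: 10 years for age, 10000 for zip code.
generalized = [
    (generalize(age, 10), generalize(zipcode, 10000), disease)
    for age, zipcode, disease in patients
]

for row in generalized:
    print(row)
```

After generalization, the second and third patients share the same age and zip code intervals, so those attributes alone no longer tell them apart.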

Applications (5) – Privacy Preserving
Location-Based Services (LBS):
- Cloak the trajectories of GPS users
- Protect the places that users visited

Classification of Data Uncertainty
Sources of data uncertainty:
- Undesirable uncertainty
  - Noisy sensor data
  - Imprecise GPS data
  - Unreliable extracted/integrated data
- Desirable uncertainty
  - Medical data with generalized attributes
  - Cloaked trajectory data

Classification of Data Uncertainty (cont'd)
Granularity:
- Tuple uncertainty: each tuple is associated with an existence probability (t.p)

  Witnessed Person  t.p
  PID1              0.9
  PID2              0.2
  PID3              0.1

- Attribute uncertainty: each attribute of a tuple has several possible values, each associated with a probability

  Person ID  Zip code                      Disease
  PID1       (110000, 0.5), (110001, 0.5)  (pneumonia, 0.3), (flu, 0.7)
  PID2       (310000, 1)                   (AIDS, 0.9)
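
The two granularities can be written down as plain Python data structures. This is a minimal sketch, not the representation used by any particular system: the dictionary layout and the helper names `total_mass` and `most_likely` are assumptions made for illustration, while the numbers are the ones on the slide.

```python
# Tuple uncertainty: one existence probability per tuple.
witnessed = {"PID1": 0.9, "PID2": 0.2, "PID3": 0.1}

# Attribute uncertainty: each attribute maps to (value, probability) alternatives.
patients = {
    "PID1": {
        "zip": [(110000, 0.5), (110001, 0.5)],
        "disease": [("pneumonia", 0.3), ("flu", 0.7)],
    },
    "PID2": {
        "zip": [(310000, 1.0)],
        "disease": [("AIDS", 0.9)],
    },
}

def total_mass(alternatives):
    """Sum of the probabilities of all listed alternatives."""
    return sum(p for _, p in alternatives)

def most_likely(alternatives):
    """Value with the highest probability among the alternatives."""
    return max(alternatives, key=lambda vp: vp[1])[0]

# Expected number of witnessed persons (linearity of expectation).
expected_count = sum(witnessed.values())

print(expected_count)
print(most_likely(patients["PID1"]["disease"]))
print(total_mass(patients["PID2"]["disease"]))
```

Note that the alternatives for PID2's disease sum to 0.9 rather than 1; the slide does not say how the remaining 0.1 should be interpreted.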

Classification of Data Uncertainty (cont'd)
Correlations:
- Independent uncertainty
  - Uncertain objects are independent of each other
- Correlated uncertainty
  - Attributes of uncertain objects are correlated with each other
- Uncertainty with local correlations
  - Uncertain objects from different groups are independent
  - Within each group, uncertain objects are locally correlated
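
Under independent uncertainty, the probability of a possible world is the product of the per-tuple probabilities. The sketch below illustrates this with the existence probabilities from the tuple-uncertainty example (PID1 0.9, PID2 0.2, PID3 0.1); the function name `possible_worlds` is hypothetical, and the independence of the three tuples is an assumption made here, not something the slides state for that table.

```python
from itertools import product

def possible_worlds(existence):
    """Yield (world, probability) pairs, assuming tuples exist independently."""
    ids = list(existence)
    for flags in product([True, False], repeat=len(ids)):
        prob = 1.0
        for tid, present in zip(ids, flags):
            # Multiply by p if the tuple is present, by (1 - p) if it is absent.
            prob *= existence[tid] if present else 1.0 - existence[tid]
        world = frozenset(tid for tid, present in zip(ids, flags) if present)
        yield world, prob

witnessed = {"PID1": 0.9, "PID2": 0.2, "PID3": 0.1}
worlds = dict(possible_worlds(witnessed))

print(len(worlds))                    # number of possible worlds
print(sum(worlds.values()))           # probabilities add up to 1 (up to rounding)
print(worlds[frozenset({"PID1"})])    # world in which only PID1 exists
```

With correlated uncertainty this product rule no longer holds, and the joint distribution over the correlated objects has to be specified explicitly.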

Certain Data Management
- Assumes the underlying data are precise and certain
- Many existing techniques target certain data
- Query answering is efficient
However, …
[Figure: nearest neighbor query q over a certain database of objects a to e, with their distances to q]

Certain Data Management (cont'd)
However, not all application data are clean and precise:
- Sensor data, GPS data, etc.
Even with data cleaning techniques, we:
- Cannot guarantee 100% data accuracy
- May, even worse, introduce more errors
- Cannot guarantee the confidence of query answers
So, …

Probabilistic Data Management
Advantages of probabilistic data management:
- Directly model uncertain data without corrupting the original data
- Avoid introducing new errors
- Answer queries with confidence guarantees

Probabilistic Data Management (cont'd)
Disadvantages of probabilistic data management:
- Effectiveness issue
  - How to obtain the probabilities of uncertain data
  - How to guarantee the confidence of query answers
- Efficiency issue
  - Each object/attribute has several possible values
  - In total, there is an exponential number of possible combinations of object/attribute instances
  - Efficient query answering over uncertain data is therefore challenging

Example of Nearest Neighbor Search in Uncertain Databases
[Figure: nearest neighbor query q over a probabilistic database of uncertain objects a to e, showing the instances of object a and the distances to q]

Exercises
Assume that:
- Uncertain object a has 6 possible instances, and
- Each of the remaining uncertain objects, b through e, has 2 possible instances
How many possible combinations of object instances are there in this database?
Answer: 6 × 2^4 = 6 × 16 = 96
[Figure: nearest neighbor query q over the probabilistic database of objects a to e]
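
The answer to this exercise follows from the product rule: the number of combinations is the product of the per-object instance counts. A short Python check, with variable names chosen here for illustration, computes the product and cross-checks it by explicit enumeration:

```python
from itertools import product
from math import prod

# Instance counts from the exercise: a has 6 instances, b through e have 2 each.
instance_counts = {"a": 6, "b": 2, "c": 2, "d": 2, "e": 2}

# Product rule: multiply the number of choices per object.
num_worlds = prod(instance_counts.values())

# Cross-check by enumerating every combination of instance indices.
enumerated = sum(1 for _ in product(*(range(n) for n in instance_counts.values())))

print(num_worlds, enumerated)
```

The count grows exponentially with the number of uncertain objects, which is why exhaustive enumeration is only feasible for toy databases like this one.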

Exercises (cont'd)
Assume that:
- For each uncertain object, its instances have equal appearance probabilities
What is the NN probability of uncertain object d when a is located at the red point?
Answer: when a is at the red point, object d is the NN with probability 1/2
[Figure: nearest neighbor query q over the probabilistic database of objects a to e]
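
A nearest neighbor probability of this kind can be computed by brute force: enumerate all combinations of instances and count those in which the target object is closest to q. The sketch below is an illustration only. The figure's coordinates are not available in this transcript, so the distances to q are hypothetical values chosen so that the outcome agrees with the slide's answer of 1/2; the function name `nn_probability` is also an assumption.

```python
from fractions import Fraction
from itertools import product

def nn_probability(target, instances):
    """Probability that `target` is the nearest neighbor of q.

    `instances` maps each object to the distances of its instances to q.
    Instances of an object are equally likely and objects are independent,
    so every combination of instances has the same probability.
    """
    names = list(instances)
    worlds = list(product(*(instances[n] for n in names)))
    hits = 0
    for world in worlds:
        dist = dict(zip(names, world))
        # target wins if it is strictly closer to q than every other object
        if all(dist[target] < d for n, d in dist.items() if n != target):
            hits += 1
    return Fraction(hits, len(worlds))

# Hypothetical distances to q; object a is fixed at a single instance.
instances = {
    "a": [5.0],
    "b": [7.0, 8.0],
    "c": [9.0, 10.0],
    "d": [2.0, 6.0],
    "e": [6.5, 11.0],
}

print(nn_probability("d", instances))
```

With these assumed distances, d is the nearest neighbor exactly when its closer instance appears, which happens in half of the combinations.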

Existing Systems for Managing Data Uncertainty
Existing projects that deal with data uncertainty:
- MystiQ, University of Washington, 2005
- Orion, Purdue, 2003
- TRIO, Stanford Info Lab, 2005
- MayBMS, Cornell, 2007
- MCDB, IBM, 2008
- BayesStore, 2008

Summary
- Data uncertainty occurs throughout the process of data collection, transmission, and processing
- Uncertain data are ubiquitous in many real applications
  - Sensor networks
  - GPS systems
  - Data extraction/integration
  - Privacy preserving

Summary (cont'd)
- Classifications of data uncertainty
  - Data sources
  - Granularity
  - Correlations
- Uncertain vs. certain data
  - Many techniques have been proposed for certain data, but not for uncertain data
  - Query answering for certain data is much more efficient than for uncertain data

Summary (cont'd)
- Real-world application data are not always certain, and are often uncertain
- Applying techniques proposed for certain data to uncertain data may lead to erroneous results without confidence guarantees, whereas uncertain data management can provide such guarantees
- Existing probabilistic data management systems