A Survey Based Seminar: Data Cleaning & Uncertain Data Management
Speaker: Shawn Yang
Supervisors: Dr. Reynold Cheng, Prof. David Cheung
2011.4.29

Presentation transcript:


Outline
  – Introduction
  – Traditional Data Quality and Cleaning
  – Uncertainty Management in Traditional Data Cleaning
  – Cleaning Uncertain Database
  – Conclusion


Example: Report of Bird Sightings

Observer  Bird-ID  Bird-Name    Prob
Mary      Bird-1   Finch        0.8
Mary      Bird-1   Toucan       0.2
Susan     Bird-1   Nightingale  0.7
Susan     Bird-1   Toucan       0.3
Another   Bird-1   Hummingbird  0.65
Another   Bird-1   Toucan       0.35

After cleaning:

Observer  Bird-ID  Bird-Name
Mary      Bird-1   Finch
Susan     Bird-1   Nightingale
Another   Bird-1   Hummingbird
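The cleaning step in this example can be sketched as follows. The slide does not state the cleaning rule; this sketch assumes it keeps the most probable bird name for each observer, and the function name clean_most_probable is introduced here only for illustration.

```python
# Uncertain sightings copied from the slide: (observer, bird_id, bird_name, prob).
sightings = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]


def clean_most_probable(rows):
    """Keep, for each (observer, bird_id), only the most probable bird name.

    Assumption: "cleaning" means picking the top alternative; the
    probabilities themselves are discarded, which is the information loss
    that uncertain data management tries to avoid.
    """
    best = {}
    for observer, bird_id, name, prob in rows:
        key = (observer, bird_id)
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    return [(obs, bid, name) for (obs, bid), (name, _) in best.items()]


cleaned = clean_most_probable(sightings)
```

The result has one certain row per observer, matching the cleaned table on the slide, while the less likely alternatives such as Toucan are lost.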

Philosophy
  – Data Cleaning: to remove dirty data
  – Uncertain Data Management: to preserve more information


Data Quality Issues
(Taxonomy: Single Source vs. Multi-Source, Schema Level vs. Instance Level)
Single Source, Schema Level: Inconsistency
  – Example (dirty data that violates a constraint): 010 Shanghai, 021 Beijing

Data Quality Issues
Single Source, Schema Level: Inconsistency
Single Source, Instance Level: Missing Values, Outliers
  – Sensor Network
    o Temperature
  – Census Data
    o Birth Year

Data Quality Issues
Single Source, Schema Level: Inconsistency
Single Source, Instance Level: Missing Values, Outliers
Multi-Source, Schema Level: Integration
Multi-Source, Instance Level: Duplication

Single Source & Schema Level
Inconsistent Repairs
  – Example: 010 Shanghai, 021 Beijing
  – Solutions
Optimize some Objective Function
  – Minimize the number of changes
  – Cost Function
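The "minimize the number of changes" objective can be sketched as follows. This is a minimal illustration rather than the method of any cited paper: the reference set VALID_PAIRS and the function name are assumptions introduced here, and the constraint is taken to be that an area code determines its city.

```python
# Hypothetical set of consistent (area_code, city) pairs.
VALID_PAIRS = {("010", "Beijing"), ("021", "Shanghai")}


def min_change_repairs(record):
    """Return every consistent pair reachable with the fewest attribute changes."""

    def cost(candidate):
        # Number of attribute values that differ from the dirty record.
        return sum(a != b for a, b in zip(record, candidate))

    best = min(cost(pair) for pair in VALID_PAIRS)
    return sorted(pair for pair in VALID_PAIRS if cost(pair) == best)


repairs = min_change_repairs(("010", "Shanghai"))
```

For the dirty record ("010", "Shanghai") both candidates cost exactly one change, so this objective alone cannot decide whether the code or the city is wrong.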

Single Source & Schema Level
Inconsistent Repairs
  – Example: 010 Shanghai, 021 Beijing
  – Solutions
Certain Fix (VLDB'10)
  – Master Data
  – Certain Region
  – Some attribute values are asserted to be correct
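A rough sketch of the certain-fix idea, assuming a hypothetical master table keyed by area code and a user who asserts which attributes are correct. The names MASTER and certain_fix are illustrative; the editing rules in the cited work are richer than this single lookup.

```python
# Hypothetical master data: trusted mapping from area code to city.
MASTER = {"010": "Beijing", "021": "Shanghai"}


def certain_fix(record, asserted):
    """Fix the city from master data, but only when the area code is asserted correct.

    record:   dict with keys "area_code" and "city"
    asserted: set of attribute names the user guarantees to be correct
    """
    fixed = dict(record)
    if "area_code" in asserted and record["area_code"] in MASTER:
        fixed["city"] = MASTER[record["area_code"]]
    return fixed


dirty = {"area_code": "010", "city": "Shanghai"}
fixed = certain_fix(dirty, asserted={"area_code"})
untouched = certain_fix(dirty, asserted=set())
```

With the area code asserted, the fix is unambiguous; without any assertion the record is left unchanged instead of being guessed at.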

Single Source & Schema Level
Cleaning Operations
  – Deletion & Insertion
  – Update attribute values
Efficiency Issues
  – NP-Complete
  – Heuristic Methods
(Diagram: Inconsistent Repairs [Objective Function, Certain Fix]; Cleaning Operation [Deletion & Insertion, Update]; Efficiency Issues)

Others
Single Source, Instance Level
  – Infer missing values; detect and correct outliers with machine learning / statistical methods
Multi-Source, Schema Level
  – Schema Mapping
Multi-Source, Instance Level
  – Data Deduplication (Record Linkage)


Single Source & Schema Level
Cardinality-Set-Minimal Repair: a repair I' of I is cardinality-set-minimal iff there is no repair I'' of I such that Δ(I, I'') ⊂ Δ(I, I'), where Δ(I, I') is the set of cells whose values differ between I and I'.
(Diagram: Inconsistent Repairs [Objective Function, Certain Fix, Possible Repair]; Cleaning Operation [Deletion & Insertion, Update]; Efficiency Issues; …)
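The definition can be checked mechanically once each candidate repair is represented by its set of changed cells. The sketch below assumes that representation; the cell identifiers and the function name are hypothetical.

```python
def cardinality_set_minimal(change_sets):
    """Keep the repairs whose changed-cell set has no proper subset among the others.

    change_sets: list of frozensets, one per candidate repair, each holding
    the cells (here: (tuple_id, attribute) pairs) changed by that repair.
    """
    return [
        delta
        for delta in change_sets
        if not any(other < delta for other in change_sets)  # "<" is proper subset
    ]


# Hypothetical candidate repairs of one instance.
r1 = frozenset({("t1", "city")})
r2 = frozenset({("t1", "city"), ("t2", "city")})
r3 = frozenset({("t1", "area_code")})

minimal = cardinality_set_minimal([r1, r2, r3])
```

Here r2 is dropped because r1 changes a proper subset of its cells, while r1 and r3 are incomparable and both remain as possible repairs.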

Single Source & Instance Level
Missing Value & Outliers
  – Census Database
ERACER (SIGMOD'10)
  – User inputs a dependency model
    o Death age
    o Parent age
  – Learn the parameters
  – Infer the missing values
    o Infer the missing birth year from the death year and the death age distribution
    o Further infer the child's birth year
  – Repeat until the distributions converge
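The two inference steps on this slide can be illustrated with discrete distributions. This is only a toy sketch of the idea, not ERACER's actual algorithm: all distributions below are made-up values, the two quantities are assumed independent, and the helper name combine is introduced here.

```python
from collections import defaultdict


def combine(dist, offset_dist, sign=1):
    """Distribution of X + sign * D for independent discrete X and D."""
    out = defaultdict(float)
    for x, px in dist.items():
        for d, pd in offset_dist.items():
            out[x + sign * d] += px * pd
    return dict(out)


# Hypothetical inputs: a known death year and learned age distributions.
death_year = {1950: 1.0}
death_age = {70: 0.5, 80: 0.5}        # age at death
parent_age = {25: 0.5, 30: 0.5}       # parent's age when the child is born

# Step 1: missing birth year = death year - death age.
birth_year = combine(death_year, death_age, sign=-1)

# Step 2: child's birth year = parent's birth year + parent age.
child_birth_year = combine(birth_year, parent_age)
```

In the full method such inferences are repeated across related records until the distributions stop changing.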

Multi-Source
Schema Level
  – Uncertain Schema Matching
Instance Level
  – Possible Repairs in Data Deduplication (VLDB'09)


Cleaning Uncertain Database
Applying Integrity Constraints
  – Exact Method
  – Sampling Method
Quality of Uncertain Query Results
  – PWS-Quality
Efficiency Issues

Integrity Constraints
Difference from Traditional Database
  – Traditional database: locate errors in the original database
  – Uncertain database: locate errors in the possible worlds
Difficulties
  – Exponential number of possible worlds
Statistical Description
  – Posterior probabilities, e.g. Prob[j=7|C]
Approaches
  – Exact Method
  – Approximate Method
Example: an uncertain table (Name, SSN, Prob) with tuples for John and Bill, and one possible world (Name, SSN) in which John and Bill both have SSN 7.
Constraint set (C): SSN is unique
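The posterior Prob[j=7|C] can be computed by brute force on a tiny example, which also shows why the number of possible worlds is the bottleneck. The probabilities below are hypothetical (the slide's values were not preserved), tuples are assumed independent, and the function name posterior is introduced here.

```python
from itertools import product

# Hypothetical SSN alternatives with prior probabilities.
table = {
    "John": {1: 0.2, 7: 0.8},
    "Bill": {4: 0.3, 7: 0.7},
}


def posterior(table, constraint, event):
    """P(event | constraint), enumerating every possible world.

    The loop visits the product of all alternatives, so its cost grows
    exponentially with the number of uncertain tuples.
    """
    names = list(table)
    valid_mass = 0.0
    event_mass = 0.0
    for combo in product(*(table[n].items() for n in names)):
        world = {n: value for n, (value, _) in zip(names, combo)}
        prob = 1.0
        for _, p in combo:
            prob *= p
        if not constraint(world):
            continue  # world violates C and is discarded
        valid_mass += prob
        if event(world):
            event_mass += prob
    return event_mass / valid_mass


def ssn_unique(world):
    return len(set(world.values())) == len(world)


p_john_7 = posterior(table, ssn_unique, lambda w: w["John"] == 7)
```

With these numbers the world where both SSNs are 7 is removed, and the posterior for John's SSN being 7 drops from the prior 0.8 to 0.24 / 0.44.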

Exact Method (Christoph VLDB'08)
  – Model the constraints as assignments
  – Compress the assignments into a tree structure
  – Calculate the posterior probabilities
(Diagram: tree of assignments, e.g. j = 1; j = 7, b = 4; …)

Approximate Method (Haiquan Chen, ICDE'10 Workshop)
  – Aggregate Constraints
  – Model the constraints as scoring functions
  – Get the posterior probability by sampling
Example: table (Employee, Salary (k), Confidence) with tuples for Alice, Bob, Charles, …
Constraint: total salary in [50k, 70k]
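A sampling estimate for an aggregate constraint might look like the sketch below. It swaps in plain rejection sampling with a 0/1 scoring function, which is simpler than whatever sampler the cited work uses; the salary alternatives and confidences are hypothetical, since the slide's table values were not preserved.

```python
import random

# Hypothetical salary alternatives (in k) with confidences.
employees = {
    "Alice": {20: 0.6, 30: 0.4},
    "Bob": {20: 0.5, 40: 0.5},
    "Charles": {10: 0.7, 30: 0.3},
}


def score(world):
    """Indicator scoring function: 1 if the total salary lies in [50, 70], else 0."""
    return 1 if 50 <= sum(world.values()) <= 70 else 0


def sample_world(rng):
    """Draw one possible world by sampling each salary independently."""
    return {
        name: rng.choices(list(alts), weights=list(alts.values()))[0]
        for name, alts in employees.items()
    }


def estimate_posterior(name, value, n_samples=20000, seed=0):
    """Estimate P(name's salary == value | constraint) from accepted samples."""
    rng = random.Random(seed)
    accepted = hits = 0
    for _ in range(n_samples):
        world = sample_world(rng)
        if score(world):
            accepted += 1
            hits += world[name] == value
    return hits / accepted


estimate = estimate_posterior("Alice", 20)
```

For this toy table the exact answer is 0.51 / 0.65, so the estimate can be checked against enumeration; sampling pays off only when enumeration is infeasible.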

Quality of Uncertain Query Results (Reynold VLDB'08)
Different queries have different properties
  – Range Query: each tuple qualifies independently
  – Min/Max Query: otherwise (tuples depend on each other)
Uniform metric for all uncertain queries
  – Quality on Possible World Answers
Clean the uncertain tuples so as to improve query quality as much as possible
  – Oracle Assumption
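One way to make "quality on possible world answers" concrete is sketched below. It assumes the score is the negated entropy of the distribution of query answers over all possible worlds, so 0 is the best value and more ambiguous answers score lower. The data, the independence of tuples, and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def answer_distribution(tuples, query):
    """Probability of each distinct query answer over all possible worlds.

    tuples: list of dicts mapping an alternative value to its probability.
    query:  function from a list of values (one world) to a hashable answer.
    """
    dist = defaultdict(float)
    for combo in product(*(t.items() for t in tuples)):
        values = [value for value, _ in combo]
        prob = math.prod(p for _, p in combo)
        dist[query(values)] += prob
    return dict(dist)


def pws_quality(dist):
    """Negated entropy of the answer distribution (assumed definition)."""
    return sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical sensor readings: the first one is uncertain.
before = [{20: 0.5, 30: 0.5}, {25: 1.0}]
after = [{30: 1.0}, {25: 1.0}]  # the oracle has cleaned the first tuple

quality_before = pws_quality(answer_distribution(before, max))
quality_after = pws_quality(answer_distribution(after, max))
```

Before cleaning, the MAX answer is 25 or 30 with equal probability, giving a score of -1; after cleaning there is a single answer and the score rises to 0.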

Efficiency Issue
A more "realistic" Oracle
  – Cleaning may fail
  – Even a successful cleaning cannot remove all false values
  – Cleaning may involve a cost
Objective
  – Remove as much uncertainty as possible
  – With a limited number of cleaning operations
Discussion
  – Instance Level Cleaning (clean a particular instance)
  – Schema Level Cleaning (clean the entire DB)
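The trade-off between success probability, cost, and a limited budget can be illustrated with a simple greedy heuristic. This is an assumption made for illustration, not the algorithm from the cited work: candidates are ranked by expected quality gain per unit cost, and every field name and number below is hypothetical.

```python
def plan_cleaning(candidates, budget):
    """Greedily pick cleaning operations by expected gain per unit cost.

    candidates: list of dicts with keys
        "name"    - identifier of the uncertain tuple to clean
        "success" - probability that the cleaning operation succeeds
        "gain"    - quality improvement if it succeeds
        "cost"    - cost of attempting the operation
    budget: total cost that may be spent
    """
    ranked = sorted(
        candidates,
        key=lambda c: c["success"] * c["gain"] / c["cost"],
        reverse=True,
    )
    chosen, spent = [], 0
    for c in ranked:
        if spent + c["cost"] <= budget:
            chosen.append(c["name"])
            spent += c["cost"]
    return chosen


candidates = [
    {"name": "a", "success": 0.9, "gain": 1.0, "cost": 2},
    {"name": "b", "success": 0.5, "gain": 1.0, "cost": 1},
    {"name": "c", "success": 0.8, "gain": 0.2, "cost": 1},
]
plan = plan_cleaning(candidates, budget=3)
```

A greedy ranking like this is not guaranteed to be optimal, but it shows how a failure-prone, costly oracle changes which tuples are worth cleaning first.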

Conclusion
Improve Data Quality
  – Data Cleaning -> Remove Errors
  – Uncertain Data Management -> Maintain Information
2 Directions
(Diagram: Constraints; Traditional Database; Repairs; Consistent Possible World(s); Uncertain Database)

Discussion
Thank You :)