Computer Science Department University of California, Irvine

1 RelDC Projects
Dmitri V. Kalashnikov
Computer Science Department, University of California, Irvine (RESCUE)
July 2005
Work supported by NSF Grants IIS and IIS
Copyright(c) by Dmitri V. Kalashnikov, 2005

2 Project Team Members
RelDC project team: Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Rabia Nuray
SAT project team: Carter Butts, Ram Hariharan, Dmitri V. Kalashnikov, Yiming Ma, Sharad Mehrotra

3 RelDC Overview
RelDC project: Relationship-based Data Cleaning
Research area
  data cleaning
  information quality
Key points
  domain-independent framework, so that it can be integrated into a DBMS (e.g. Microsoft SQL Server 2005)
  based on analysis of relationships
  views the dataset as a graph (ARG)
    nodes for entities
    edges for relationships
  significantly improves the quality of data cleaning (DC)
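To make the graph view concrete, here is a minimal sketch of a dataset held as an undirected graph, with nodes for entities and edges for relationships. The `build_graph` helper and the toy author/paper data are hypothetical illustrations, not part of RelDC itself.

```python
from collections import defaultdict

def build_graph(relationships):
    """Build an undirected adjacency map from (entity, entity) pairs."""
    graph = defaultdict(set)
    for u, v in relationships:
        graph[u].add(v)
        graph[v].add(u)
    return graph

# Hypothetical toy data: authors linked to the papers they wrote.
relationships = [
    ("author:A. Smith", "paper:P1"),
    ("author:B. Jones", "paper:P1"),
    ("author:B. Jones", "paper:P2"),
]
graph = build_graph(relationships)

# Entities that share a neighbor (here, co-authors of P1) are connected
# through a relationship path of length two.
print(sorted(graph["paper:P1"]))
```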

4 Data Cleaning
One data cleaning scenario
  collecting data from various sources to create a unified database
    can have errors
    can be entered manually
    massive data
Problems with raw data
  duplicate entries
  missing entries
  erroneous (e.g. misspelled) entries
  inherent ambiguity, etc.
Goal of data cleaning
  correct such errors, disambiguation
  why? because analysis on bad data leads to bad results

5 RelDC Framework
Data processing flow (diagram; labels: Naveen, RelDC, SAT)

6 Problems we have addressed
Fuzzy lookup
  match references to objects
  list of all objects is given
  FBS + Rel + solving NLP
Fuzzy grouping
  group together object representations that correspond to the same object
  FBS + Rel + Clustering
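As a rough illustration of fuzzy lookup, the sketch below scores each candidate object by a string-similarity term plus a relationship term and picks the best one. It reads FBS as a feature-based similarity and Rel as a relationship-derived score, which is an assumption about the slide's abbreviations. It also swaps in a simple weighted sum with an argmax in place of the "solving NLP" step named on the slide. The names `feature_similarity`, `fuzzy_lookup`, `rel_score`, and `rel_weight` are hypothetical.

```python
from difflib import SequenceMatcher

def feature_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a feature-based score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_lookup(reference, candidates, rel_score, rel_weight=1.0):
    """Return the candidate with the highest combined score.

    rel_score maps a candidate to an assumed, precomputed relationship
    score (e.g. how strongly it is connected to the reference's context).
    """
    def score(candidate):
        return (feature_similarity(reference, candidate)
                + rel_weight * rel_score.get(candidate, 0.0))
    return max(candidates, key=score)

# Hypothetical data: two known objects and one ambiguous reference.
candidates = ["John Smith", "Jon Smith"]
rel_score = {"John Smith": 0.6, "Jon Smith": 0.1}
best = fuzzy_lookup("J. Smith", candidates, rel_score)
print(best)
```

In this toy setup the two names are nearly equally similar to the reference as strings, so the relationship term decides the match.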

7 Ongoing Work
Learning importance of relationships from data
  RelDC relies on a “connection strength” model c(u,v)
  c(u,v) tells how strongly u and v are connected to each other via relationships
  we calibrate one such model from data
Probabilistic ARG
  an ARG with probabilistic edges
  study the feasibility of pARG as a representation for mining low-quality data
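One way to picture a connection strength c(u,v) is as a sum over the simple paths linking u and v, with longer paths counting less. The sketch below is an assumed toy model, not the model RelDC uses or calibrates: the path-length cap `max_len` and the per-edge `decay` factor are hypothetical parameters, and u and v are assumed to be distinct nodes.

```python
def connection_strength(graph, u, v, max_len=3, decay=0.5):
    """Sum decay**length over simple u-v paths with at most max_len edges."""
    total = 0.0
    # Each stack entry: (current node, nodes visited so far, edges used).
    stack = [(u, {u}, 0)]
    while stack:
        node, visited, length = stack.pop()
        if length == max_len:
            continue  # path budget exhausted
        for nxt in graph.get(node, ()):
            if nxt == v:
                total += decay ** (length + 1)  # a complete u-v path
            elif nxt not in visited:
                stack.append((nxt, visited | {nxt}, length + 1))
    return total

# Hypothetical graph: a and b are linked via x and via y; c is isolated.
graph = {
    "a": {"x", "y"},
    "x": {"a", "b"},
    "y": {"a", "b"},
    "b": {"x", "y"},
    "c": set(),
}
print(connection_strength(graph, "a", "b"))  # two paths of length 2
print(connection_strength(graph, "a", "c"))  # no connecting path
```

In a learned version of such a model, a parameter like `decay` (or a weight per relationship type) would be fitted from data rather than fixed by hand.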

