RelDC Projects

Dmitri V. Kalashnikov
Computer Science Department
University of California, Irvine

http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)

July 2005

Work supported by NSF Grants IIS-0331707 and IIS-0083489
Copyright (c) by Dmitri V. Kalashnikov, 2005
Project Team Members
- RelDC project team: Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Rabia Nuray
- SAT project team: Carter Butts, Ram Hariharan, Dmitri V. Kalashnikov, Yiming Ma, Sharad Mehrotra
RelDC Overview
- RelDC project: Relationship-based Data Cleaning
- Research area
  - data cleaning
  - information quality
- Key points
  - domain-independent framework, so that it can be integrated into a DBMS (e.g. Microsoft SQL Server 2005)
  - based on analysis of relationships
  - views the dataset as a graph (ARG)
    - nodes for entities
    - edges for relationships
  - significantly improves the quality of data cleaning (DC)
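To make the graph view concrete, here is a minimal sketch of a dataset held as a graph with nodes for entities and edges for relationships. The class name `ARG`, its methods, and the author/paper sample data are hypothetical and chosen only for illustration; they are not taken from the RelDC implementation.

```python
from collections import defaultdict


class ARG:
    """Toy graph view of a dataset: nodes are entities, edges are relationships."""

    def __init__(self):
        self.node_type = {}           # node id -> entity type (hypothetical labels)
        self.adj = defaultdict(set)   # node id -> set of (neighbor id, relationship label)

    def add_entity(self, node_id, entity_type):
        self.node_type[node_id] = entity_type

    def add_relationship(self, u, v, label):
        # Both endpoints must already be known entities.
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added as entities first")
        # Relationships are stored as undirected edges.
        self.adj[u].add((v, label))
        self.adj[v].add((u, label))

    def neighbors(self, u):
        return sorted(v for v, _ in self.adj[u])


# Hypothetical sample data: two authors who share one paper.
g = ARG()
g.add_entity("author:1", "author")
g.add_entity("author:2", "author")
g.add_entity("paper:1", "paper")
g.add_relationship("author:1", "paper:1", "writes")
g.add_relationship("author:2", "paper:1", "writes")

print(g.neighbors("paper:1"))
```

An adjacency-set layout like this keeps neighbor lookups cheap, which matters when relationship analysis has to explore many paths between entities.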
Data Cleaning
- One data cleaning scenario
  - collecting data from various sources to create a unified database
    - the sources can have errors
    - the data can be entered manually
  - massive data
- Problems with raw data
  - duplicate entries
  - missing entries
  - erroneous (e.g. misspelled) entries
  - inherent ambiguity, etc.
- Goal of data cleaning
  - correct such errors, disambiguate
  - why? because analysis on bad data leads to bad results
RelDC Framework
- Data processing flow
  [Figure: data processing flow diagram; surviving labels: Naveen, RelDC, SAT]
Problems we have addressed
- Fuzzy lookup
  - match references to objects
  - the list of all objects is given
  - FBS + Rel + solving NLP
- Fuzzy grouping
  - group together object representations that correspond to the same object
  - FBS + Rel + Clustering
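The first stage of fuzzy lookup can be sketched as follows, assuming that "FBS" refers to a feature-based similarity step that shortlists candidate objects for a reference. The function name `candidate_objects`, the threshold value, and the sample names are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever similarity function the project actually uses. The relationship analysis ("Rel") that would then choose among the candidates is not shown.

```python
from difflib import SequenceMatcher


def candidate_objects(reference, objects, threshold=0.7):
    """Return the objects whose string similarity to the reference meets the threshold.

    This is only a feature-based shortlist; picking one candidate would need
    further analysis, e.g. of relationships in the dataset.
    """
    matches = []
    for obj in objects:
        score = SequenceMatcher(None, reference.lower(), obj.lower()).ratio()
        if score >= threshold:
            matches.append(obj)
    return matches


# Hypothetical data: the full list of objects is given, the reference is ambiguous.
known_authors = ["John Smith", "Jane Smith", "Alan Turing"]
shortlist = candidate_objects("J. Smith", known_authors)
print(shortlist)
```

In this sample the reference remains ambiguous after the feature-based step, which is the situation where relationship analysis is meant to help.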
Ongoing Work
- Learning importance of relationships from data
  - RelDC relies on a “connection strength” model c(u,v)
  - c(u,v) tells how strongly u and v are connected to each other via relationships
  - we calibrate one such model from data
- Probabilistic ARG (pARG)
  - an ARG with probabilistic edges
  - study the feasibility of pARG as a representation for mining low-quality data
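As an illustration of what a connection strength model c(u,v) might look like, the sketch below sums a decaying weight over all simple paths between u and v up to a length limit. This particular formula, the `decay` and `max_len` parameters, and the sample graph are assumptions made for illustration; the slides do not specify the model RelDC uses or how it is calibrated.

```python
def connection_strength(adj, u, v, max_len=3, decay=0.5):
    """Sum decay**length over simple paths from u to v with at most max_len edges.

    adj maps each node to an iterable of its neighbors. Shorter and more
    numerous paths yield a larger value (an assumed model, not RelDC's).
    """
    total = 0.0
    stack = [(u, frozenset([u]), 0)]   # (current node, visited nodes, path length)
    while stack:
        node, visited, length = stack.pop()
        if node == v and length > 0:
            total += decay ** length   # one complete path from u to v
            continue
        if length == max_len:
            continue                   # do not extend paths past the limit
        for nxt in adj.get(node, ()):
            if nxt not in visited:     # keep paths simple (no repeated nodes)
                stack.append((nxt, visited | {nxt}, length + 1))
    return total


# Hypothetical graph: authors a and b share two papers, c is unrelated to both.
adj = {
    "a": ["p1", "p2"],
    "b": ["p1", "p2"],
    "c": ["p3"],
    "p1": ["a", "b"],
    "p2": ["a", "b"],
    "p3": ["c"],
}
strength_ab = connection_strength(adj, "a", "b")
strength_ac = connection_strength(adj, "a", "c")
print(strength_ab, strength_ac)
```

Calibrating such a model from data would amount to choosing values like `decay`, or per-relationship weights, so that the resulting strengths best separate correct from incorrect matches.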