Computer Science Department University of California, Irvine

Presentation transcript:

RelDC Projects
Dmitri V. Kalashnikov
Computer Science Department, University of California, Irvine
http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)
RESCUE, July 2005
Work supported by NSF Grants IIS-0331707 and IIS-0083489
Copyright (c) by Dmitri V. Kalashnikov, 2005

Project Team Members
  RelDC project team
    Stella Chen
    Dmitri V. Kalashnikov
    Sharad Mehrotra
    Rabia Nuray
  SAT project team
    Carter Butts
    Ram Hariharan
    Dmitri V. Kalashnikov
    Yiming Ma
    Sharad Mehrotra

RelDC Overview
  RelDC project
    Relationship-based Data Cleaning
  Research area
    data cleaning
    information quality
  Key points
    domain-independent framework
      so that it can be integrated into a DBMS (e.g. Microsoft SQL Server 2005)
    based on analysis of relationships
    views the dataset as a graph (ARG)
      nodes for entities
      edges for relationships
    significantly improves the quality of data cleaning (DC)
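The graph view described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ARG`, its methods, and the toy paper/author data are hypothetical and do not come from the RelDC implementation.

```python
from collections import defaultdict


class ARG:
    """Toy undirected graph: nodes are entities, edges are relationships.

    Hypothetical structure for illustration; not the RelDC data model.
    """

    def __init__(self):
        self.node_type = {}            # node id -> entity type
        self.adj = defaultdict(set)    # node id -> set of neighbor ids

    def add_entity(self, node_id, entity_type):
        self.node_type[node_id] = entity_type

    def add_relationship(self, u, v):
        # Both endpoints must already exist as entities.
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added as entities first")
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbors(self, u):
        # Read-only lookup: does not create an entry for unknown nodes.
        return sorted(self.adj.get(u, ()))


# Hypothetical toy dataset: one paper with two authors.
arg = ARG()
arg.add_entity("paper:1", "paper")
arg.add_entity("author:chen", "author")
arg.add_entity("author:mehrotra", "author")
arg.add_relationship("paper:1", "author:chen")
arg.add_relationship("paper:1", "author:mehrotra")
```

Here `arg.neighbors("paper:1")` lists the two author nodes, and the two authors are connected to each other only indirectly, through the paper they share.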

Data Cleaning
  One data cleaning scenario
    Collecting data from various sources
      can have errors
      can be entered manually
    to create a unified database
    massive data
  Problems with raw data
    duplicate entries
    missing entries
    erroneous (e.g. misspelled) entries
    inherent ambiguity, etc.
  Goal of data cleaning
    correct such errors, disambiguation
    why? because analysis on bad data leads to bad results
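The raw-data problems listed on this slide can be made concrete with a small Python sketch. The records, field names, and helper functions below are all hypothetical, chosen only to show a duplicate entry, a missing entry, and a misspelled entry side by side.

```python
# Hypothetical records gathered from several sources.
records = [
    {"name": "John Smith", "dept": "CS"},
    {"name": "john  smith ", "dept": "CS"},   # duplicate after normalization
    {"name": "Jon Smiht", "dept": None},      # misspelled name, missing dept
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_duplicates(rows):
    """Return (first_index, duplicate_index) pairs with equal normalized names."""
    seen = {}
    pairs = []
    for i, row in enumerate(rows):
        key = normalize(row["name"])
        if key in seen:
            pairs.append((seen[key], i))
        else:
            seen[key] = i
    return pairs


def missing_fields(rows):
    """Return (index, field) pairs for every field whose value is None."""
    return [(i, k) for i, row in enumerate(rows)
            for k, v in row.items() if v is None]
```

Normalization catches the second record as a duplicate of the first, but the misspelled third record slips through. Exact comparison cannot handle that case, which is where fuzzy matching and disambiguation become necessary.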

RelDC Framework
  [Figure: data processing flow; component labels: Naveen, RelDC, SAT]

Problems we have addressed
  Fuzzy lookup
    match references to objects
    list of all objects is given
    FBS + Rel + solving NLP
  Fuzzy grouping
    group together object representations that correspond to the same object
    FBS + Rel + Clustering
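A rough Python sketch of the fuzzy lookup setting: a reference is matched against a given list of objects, first by string similarity and then by relationships. Everything here is an assumption made for illustration. `difflib.SequenceMatcher` stands in for the FBS step, and a simple count of shared related entities stands in for the relationship analysis; the slide's NLP-solving step is not reproduced. The function names, ids, and toy data are hypothetical.

```python
from difflib import SequenceMatcher


def fbs(a, b):
    """Stand-in string similarity in [0, 1]; not the similarity used by RelDC."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(reference, context, objects, related, threshold=0.8):
    """Pick the object id that best matches `reference`.

    reference: the string to resolve
    context:   ids of entities that co-occur with the reference
    objects:   object id -> name (the given list of all objects)
    related:   object id -> set of ids it is already related to
    """
    # Step 1: shortlist candidates by string similarity.
    candidates = [oid for oid, name in objects.items()
                  if fbs(reference, name) >= threshold]
    if not candidates:
        return None

    # Step 2: prefer the candidate sharing the most related entities
    # with the reference's context; string similarity breaks ties.
    def score(oid):
        shared = len(related.get(oid, set()) & context)
        return (shared, fbs(reference, objects[oid]))

    return max(candidates, key=score)


# Hypothetical toy data: two equally plausible "J. Smith" candidates.
objects = {"a1": "John Smith", "a2": "Jane Smith", "a3": "Alan White"}
related = {"a1": {"a3"}, "a2": set()}   # a1 has worked with a3 before
match = fuzzy_lookup("J. Smith", {"a3"}, objects, related, threshold=0.7)
```

Both "John Smith" and "Jane Smith" pass the string-similarity filter with the same score, so string features alone cannot decide. The reference co-occurs with `a3`, and only `a1` is already related to `a3`, so the relationship step selects `a1`.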

Ongoing Work
  Learning importance of relationships from data
    RelDC relies on a “connection strength” model c(u,v)
    c(u,v) tells how strongly u and v are connected to each other via relationships
    we calibrate one such model from data
  Probabilistic ARG
    an ARG with probabilistic edges
    study the feasibility of pARG as a representation for mining low-quality data
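The slide does not define c(u,v), so the following Python sketch uses one possible model purely as an assumption: c(u,v) is the sum, over simple paths from u to v with at most `max_len` edges, of the product of the edge weights along each path. The edge weights play the role of relationship importance that could be calibrated from data. The function name, the `max_len` parameter, and the toy graph are hypothetical.

```python
def connection_strength(graph, u, v, max_len=3):
    """Assumed model: sum over simple u-v paths (at most max_len edges)
    of the product of edge weights. `graph` maps node -> {neighbor: weight}.
    """
    if max_len < 1:
        return 0.0
    total = 0.0
    # Each stack entry: (current node, weight product so far, nodes on path).
    stack = [(u, 1.0, frozenset([u]))]
    while stack:
        node, weight, visited = stack.pop()
        for nbr, w in graph.get(node, {}).items():
            if nbr in visited:
                continue                      # keep paths simple
            if nbr == v:
                total += weight * w           # path ends at v
            elif len(visited) < max_len:
                # Extend only if one more edge could still reach v in budget.
                stack.append((nbr, weight * w, visited | {nbr}))
    return total


# Hypothetical toy graph with weighted relationships.
graph = {
    "u": {"a": 0.5, "v": 0.2},
    "a": {"u": 0.5, "v": 0.5},
    "v": {"u": 0.2, "a": 0.5},
}
```

With `max_len=1` only the direct edge counts, giving 0.2. With `max_len=2` the path through `a` adds 0.5 * 0.5, for a total of 0.45. Changing the edge weights changes which connections dominate, which is the quantity a calibration procedure would tune.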