Overcoming the Quality Curse. Sharad Mehrotra, University of California, Irvine.

Presentation transcript:

Overcoming the Quality Curse
Sharad Mehrotra, University of California, Irvine
Collaborators/Students (current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang
Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari

Beyond the DASFAA 2003 paper
Directions from DASFAA 2003:
– Improving efficiency
– Improving quality
– New domains: video data, image data, speech data, sensor data
– Entity search: people search, location search

Data Cleaning: a vital component of the enterprise data processing workflow
– Data sources: OLTP, point of sale, organizational customer data
– Pipeline stages shown: data sources, ETL, data cleaning, data, analysis/mining, decisions
– Analysis/mining: historical data analyses; trends, patterns, rules, models, ..
– Decisions: long-term strategies, business decisions
– Quality(Data) → Quality(Decisions): quality of data, quality of analysis, quality of decisions

Entity Resolution Problem (figure: Real World, Digital World)

Standard Approach to Entity Resolution
– Deciding whether two references u and v co-refer
– Analyzing their features, e.g. the names “J. Smith” and “John Smith”, Feature 2, Feature 3
– s(u,v) = f(u,v), where s is the “similarity function” and f is the “feature-based similarity”
– If s(u,v) > t, then u and v are declared to co-refer
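The thresholding rule on this slide can be sketched as follows. This is a minimal illustration, not the method from the talk: the per-feature comparison (`difflib.SequenceMatcher`), the feature weights, and the names `feature_similarity` and `co_refer` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def feature_similarity(u: dict, v: dict, weights: dict) -> float:
    """Weighted average of per-feature string similarities, in [0, 1].

    The string comparison and the weights are illustrative assumptions.
    """
    total = 0.0
    weight_sum = 0.0
    for feature, weight in weights.items():
        a, b = u.get(feature), v.get(feature)
        if a is None or b is None:
            continue  # skip features missing on either side
        total += weight * SequenceMatcher(None, a.lower(), b.lower()).ratio()
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def co_refer(u: dict, v: dict, weights: dict, t: float = 0.8) -> bool:
    """Declare u and v co-referent when s(u, v) > t."""
    return feature_similarity(u, v, weights) > t
```

For example, `co_refer({"name": "J. Smith"}, {"name": "John Smith"}, {"name": 1.0})` compares only the name feature; whether it passes depends entirely on the chosen threshold `t`, which is the weakness the following slides discuss.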

Measuring Quality of Entity Resolution
– Entity dispersion: for an entity, the number of clusters into which its representations are placed; the ideal is 1
– Cluster diversity: for a cluster, the number of distinct entities it contains; the ideal is 1
– Measures: F-Measure, B-Cubed F-Measure, Variation of Information (VI), Generalized Merge Distance (GMD), …
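A small sketch of how dispersion, diversity, and the B-Cubed F-Measure could be computed from a clustering and a ground-truth labeling. The function names and the input layout (a list of clusters plus a reference-to-entity map) are assumptions for illustration, and the sketch assumes every labeled reference appears in exactly one cluster.

```python
from collections import defaultdict


def dispersion_and_diversity(clusters, truth):
    """clusters: list of lists of reference ids; truth: reference id -> entity id.

    Returns (dispersion per entity, diversity per cluster); the ideal for both is 1.
    """
    entity_clusters = defaultdict(set)
    diversity = []
    for idx, cluster in enumerate(clusters):
        entities = {truth[r] for r in cluster}
        diversity.append(len(entities))
        for e in entities:
            entity_clusters[e].add(idx)
    dispersion = {e: len(c) for e, c in entity_clusters.items()}
    return dispersion, diversity


def b_cubed(clusters, truth):
    """Per-reference precision and recall, averaged, plus their harmonic mean."""
    cluster_of = {r: i for i, c in enumerate(clusters) for r in c}
    precisions, recalls = [], []
    for r, i in cluster_of.items():
        same_cluster = clusters[i]
        same_entity = [x for x in truth if truth[x] == truth[r]]
        correct = sum(1 for x in same_cluster if truth[x] == truth[r])
        precisions.append(correct / len(same_cluster))
        recalls.append(correct / len(same_entity))
    p = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    f = 2 * p * rec / (p + rec)
    return p, rec, f
```

With `truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}` and `clusters = [["r1", "r2"], ["r3", "r4"]]`, entity A is dispersed over 2 clusters and the second cluster has diversity 2, giving B-Cubed precision 0.75, recall 2/3, and F-Measure 12/17.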

The Quality Curse: Why the Standard “Feature-based” Approach Leads to Poor Results
– Significant entity dispersion
– Significant cluster diversity
Example snippets:
– Photo collection of Sharad Mehrotra from Beijing, China, June 2007 SIGMOD trip
– Sharad Mehrotra, research interests: data management, Professor, UC Irvine
– S Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India
– S. Mehrotra, PhD from University of Illinois is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India.

Overcoming the Quality Curse (1): Look more carefully at the data for additional evidence

Exploiting Relationships among Entities

Author table (clean):
<A1, ‘Dave White’, ‘Intel’>
<A2, ‘Don White’, ‘CMU’>
<A3, ‘Susan Grey’, ‘MIT’>
<A4, ‘John Black’, ‘MIT’>
<A5, ‘Joe Brown’, unknown>
<A6, ‘Liz Pink’, unknown>

Publication table (to be cleaned):
<P1, ‘Databases...’, ‘John Black’, ‘Don White’>
<P2, ‘Multimedia...’, ‘Sue Grey’, ‘D. White’>
<P3, ‘Title3...’, ‘Dave White’>
<P4, ‘Title5...’, ‘Don White’, ‘Joe Brown’>
<P5, ‘Title6...’, ‘Joe Brown’, ‘Liz Pink’>
<P6, ‘Title7...’, ‘Liz Pink’, ‘D. White’>

ER graph (figure)

Context Attraction Principle (CAP): nodes that are more connected have a higher chance of co-referring to the same entity.
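To make the CAP concrete on the tables above, here is a rough sketch that measures "connectedness" by counting short simple paths in the ER graph. Path counting is a stand-in chosen for illustration; the cited papers define their own connection-strength models. The sketch also assumes that ‘Sue Grey’ in P2 has already been resolved to A3, and the names `build_graph`, `count_paths`, and `resolve` are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def count_paths(graph, src, dst, max_len):
    """Count simple paths from src to dst that use at most max_len edges."""
    def dfs(node, visited, depth):
        if node == dst:
            return 1
        if depth == max_len:
            return 0
        return sum(dfs(n, visited | {n}, depth + 1)
                   for n in graph[node] if n not in visited)
    return dfs(src, {src}, 0)


def resolve(graph, context, candidates, max_len=6):
    """Pick the candidate most connected to the context node (ties: first listed)."""
    return max(candidates, key=lambda c: count_paths(graph, context, c, max_len))


# Unambiguous edges only; the two 'D. White' references are left unresolved.
# ('P2', 'A3') assumes 'Sue Grey' is Susan Grey.
EDGES = [
    ("A1", "Intel"), ("A2", "CMU"), ("A3", "MIT"), ("A4", "MIT"),
    ("P1", "A4"), ("P1", "A2"),
    ("P2", "A3"),
    ("P3", "A1"),
    ("P4", "A2"), ("P4", "A5"),
    ("P5", "A5"), ("P5", "A6"),
    ("P6", "A6"),
]
GRAPH = build_graph(EDGES)
```

Here `resolve(GRAPH, "P2", ["A1", "A2"])` picks A2 (Don White): P2 reaches A2 through Susan Grey, MIT, John Black, and P1, while no path connects P2 to A1. The same holds for P6, which reaches A2 through Liz Pink, P5, Joe Brown, and P4.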

Exploiting Relationships for ER (Ph.D. thesis, Stella Chen)
– Formalizing the CAP principle [SDM 05, IQIS 05]
– Scaling to large graphs [TODS 06]
– Self-tuning [DASFAA 07, JCDL 07, Journal IQ 11]
  – Not all relationships are equal
  – E.g., mutual interest in Bruce Lee movies is possibly not as important as being colleagues at a university for predicting co-authorship
– Merging relationship evidence with other evidence [SIGMOD '09]
– Applying to people search on the Web [ICDE '07, TDKE 08, ICDE 09 (demo)]

Effectiveness of Exploiting Relationships (result charts: WEPS, Multimedia)

Smart Video Surveillance
– Camera array to track human activities
– CS building at UC Irvine
– Pipeline: video collection, surveillance video database, semantic extraction, event database, query/analysis

Event Model
– Pipeline: surveillance video database, semantic extraction, event database, query/analysis
– Event attributes: who, what, where, when, other property
– Extraction components: activity recognition, face recognition, localization, temporal placement extraction
– Query examples: Who was the last visitor to Mike Carey’s office yesterday? Who spends more time in labs: database students or embedded computing students?

Person Identification Challenge
– Event attributes: who, what, where, when, other property
– Extraction components: activity recognition, face recognition, localization, temporal placement extraction
– Person identification: who is each detected person (Bob, Alice, other)?

Traditional Approach
– Face detection, then face recognition
– Detects 70 faces per 1000 images; 2~3 images per person
– Poor performance

Rationale for Poor Performance
– Poor quality of data: no faces, small faces, low resolution, low temporal resolution
– Resolution: original (original performance), 1/2 original (drop to 70%), 1/3 original (drop to 30%)
– Sampling rate: 1 frame/sec (original performance), 1/2 frame/sec (drop to 53%), 1/3 frame/sec (drop to 35%)

Effectiveness of Exploiting Relationships (result charts: WEPS, Multimedia [IQ2S PERCOM 2011])

Results on Face Clustering [ACM ICMR 2013 Best Paper Award]

Results
– High precision, 662 clusters; 31 real persons, 631 merges
– High precision, 203 clusters; 31 real persons, 172 merges
– 4 times

Overcoming the Quality Curse (2): Look outside the box

Exploiting Search Engine Statistics
– Google search results of “Andrew McCallum”
– Correlations amongst context entities provide an additional source of information to resolve entities
– Context entities: Sebastian Thrun, Tom Mitchell, Machine Learning, Text Retrieval, CRF, UAI 2003
– Search engine queries to learn correlations amongst contexts:
  Sebastian Thrun AND Tom Mitchell
  Andrew McCallum AND Sebastian Thrun AND Tom Mitchell
  (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
  Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
  Andrew McCallum AND Sebastian Thrun AND (CRF OR UAI 2003)
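The query patterns on this slide can be generated mechanically from the name being resolved and two groups of context terms. The sketch below only builds the query strings; it does not call any search engine. The helper names `or_group` and `context_queries`, and the decision to quote each term as a phrase, are assumptions made for the example.

```python
def or_group(terms):
    """Join context terms into one OR-group; a single term needs no parentheses."""
    quoted = [f'"{t}"' for t in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def context_queries(name, context_a, context_b):
    """Build the two queries for a pair of context groups:
    one without the name (how often the contexts co-occur at all)
    and one with it (how often they co-occur together with the name).
    """
    a, b = or_group(context_a), or_group(context_b)
    return [f"{a} AND {b}", f'"{name}" AND {a} AND {b}']
```

For instance, `context_queries("Andrew McCallum", ["Machine Learning", "Text Retrieval"], ["CRF", "UAI 2003"])` yields the third and fourth queries listed on the slide, with each term quoted.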

Exploiting Web Search Engine Statistics (Ph.D. thesis, Rabia Nuray)
– Web queries to learn correlations [SIGIR 08]
– Application to Web people search [WePS 09]
– Cluster refinement to overcome the singleton cluster problem [TODS 11-a]
– Making Web querying robust to server-side fluctuations [tech. report]
– Scaling up the Web query technique [TODS 11-a]

Comparing with the State of the Art on the WEPS-2 Dataset (slide dated 9/8/2015)

Fluctuations in External Server Behaviour
– Architecture: the client system exchanges requests and replies with Yahoo! YQL, used as a mediator service for Bing, Google, and Yahoo!
– Chart: Microsoft Bing throughput (queries/sec) for different batch sizes at different times
– Chart: throughput by batch size, Bing versus Yahoo

Robustness to Fluctuations in External Server Behaviour
– Architecture: the client system exchanges requests and replies with Yahoo! YQL, used as a mediator service for Bing, Google, and Yahoo!
– A reinforcement learning optimizer takes measurements and sets the configuration parameters
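The slide does not say which reinforcement learning method the optimizer uses. As one simple possibility, the sketch below treats batch size as the only configuration parameter and tunes it with an epsilon-greedy bandit over measured throughput. The class name `BatchSizeTuner`, the parameter names, and the choice of epsilon-greedy are all assumptions, not the technique from the tech report.

```python
import random


class BatchSizeTuner:
    """Epsilon-greedy choice among candidate batch sizes (illustrative only)."""

    def __init__(self, batch_sizes, epsilon=0.1, seed=0):
        self.batch_sizes = list(batch_sizes)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.count = {b: 0 for b in self.batch_sizes}
        self.mean = {b: 0.0 for b in self.batch_sizes}

    def choose(self):
        """Try every batch size once, then mostly exploit the best mean throughput."""
        untried = [b for b in self.batch_sizes if self.count[b] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.batch_sizes)  # explore
        return max(self.batch_sizes, key=lambda b: self.mean[b])  # exploit

    def record(self, batch_size, throughput):
        """Update the running mean throughput (queries/sec) for a batch size."""
        self.count[batch_size] += 1
        n = self.count[batch_size]
        self.mean[batch_size] += (throughput - self.mean[batch_size]) / n
```

A client would call `choose()` before each batch of requests and `record()` with the measured throughput afterwards. Because server behaviour drifts over time, a fixed step size in place of the `1 / n` running mean would weight recent measurements more heavily.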

Scaling Web Querying
– Number of queries: 4K^2
  – Too many to submit to a search engine (40K for 100 search results)
  – Network, search engine load
  – ~6-8 minutes (with optimal batch and concurrency)
– Solutions:
  – Local caching of the Web
  – Minimize the number of queries submitted to the search engine
  – Choose the n most effective queries that will maximize the expected quality
  – NP-hard by reduction from knapsack!
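The query count follows from the formula on the slide: with K = 100 search results, 4K^2 = 40,000 queries. Because choosing the best subset is NP-hard, a practical system needs a heuristic. The sketch below uses a greedy gain-per-cost ranking, which is a generic knapsack heuristic substituted here for illustration and not necessarily what the authors used. The function names, the tuple layout, and the gain and cost values are hypothetical.

```python
def total_queries(k: int) -> int:
    """Number of Web queries for k search results, per the slide's 4K^2."""
    return 4 * k * k


def select_queries(candidates, budget):
    """Greedy selection under a cost budget.

    candidates: list of (query, expected_gain, cost) with cost > 0.
    Ranks by gain per unit cost and takes whatever still fits the budget.
    This is a heuristic; it does not guarantee an optimal subset.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, spent = [], 0.0
    for query, gain, cost in ranked:
        if spent + cost <= budget:
            chosen.append(query)
            spent += cost
    return chosen
```

With unit costs and a budget of n, this reduces to picking the n queries with the highest expected gain, which matches the "choose the n most effective queries" bullet.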

Heuristic Approach [TODS-11]
– Create initial clusters using direct features only
– Repeat until the time limit is reached:
  – Find the most promising queries whose answers are likely to change the clustering
  – Select a batch of queries from the promising queries
  – Query the Web and update the similarity graph
  – Generate new clusters
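The loop above can be sketched as follows, under several assumptions that are not in the slides: clusters are connected components of edges whose similarity exceeds a threshold `t`, an edge is "promising" when its current similarity is close to `t`, and `web_similarity` is a hypothetical callback standing in for the Web query step.

```python
import time


def cluster(nodes, sim, t):
    """Connected components over edges with similarity > t (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), s in sim.items():
        if s > t:
            parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


def progressive_er(nodes, sim, web_similarity, t=0.5, batch_size=2, time_limit=1.0):
    """sim: {(u, v): similarity from direct features}.

    web_similarity(u, v) is a hypothetical callback returning a refined score.
    """
    sim = dict(sim)
    asked = set()
    clusters = cluster(nodes, sim, t)  # initial clusters from direct features only
    deadline = time.monotonic() + time_limit
    while time.monotonic() < deadline:
        # Promising = not yet queried, closest to the decision threshold first.
        promising = sorted((e for e in sim if e not in asked),
                           key=lambda e: abs(sim[e] - t))
        batch = promising[:batch_size]
        if not batch:
            break  # nothing left to ask
        for edge in batch:
            sim[edge] = web_similarity(*edge)  # query the Web, update the graph
            asked.add(edge)
        clusters = cluster(nodes, sim, t)  # generate new clusters
    return clusters
```

Because the clustering is recomputed after every batch, the loop can be stopped at any point and still return the best clustering found so far.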

Efficiency Experiments
– Promising-edges analysis saved 30-40% of the edges when there are only a few initial clusters.

Observation/Conclusion
– Additional evidence can be exploited to improve data quality
– BUT it is expensive!
– Example: Web queries approach
  – Number of queries: 4K^2 (~40K for 100 results)
  – Too many to submit to a search engine and expect real-time results
  – ~6-8 minutes (network costs, search engine load)
– Solutions:
  – Local caching of the Web
  – Ask only important queries
  – Reduces to 1-2 min. without degrading quality much

(Near) Future: Addressing the Efficiency Curse
– Directions from DASFAA 2003: improving efficiency, improving quality, new domains
– Two complementary approaches:
  – Pay-as-you-go data cleaning: a progressive algorithm to obtain the best quality given a budget constraint
  – Query-driven data cleaning: perform minimal cleaning to answer the query/analysis task, which avoids cleaning unnecessary data