
Overcoming the Quality Curse Sharad Mehrotra University of California, Irvine Collaborators/Students (Current) Dmitri Kalashnikov, Yasser Altowim, Hotham.


1 Overcoming the Quality Curse. Sharad Mehrotra, University of California, Irvine. Collaborators/Students (Current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang. Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari.

2 Beyond the DASFAA 2003 paper. Starting point: DASFAA 2003. Directions: Improving Efficiency; Improving Quality; New Domains (video data, image data, speech data, sensor data); Entity Search (People Search, Location Search).

3 Data Cleaning – a vital component of the Enterprise Data Processing Workflow. Data Sources: OLTP, point of sale, organizational customer data. Workflow stages: ETL, Data Cleaning, Data, Analysis/Mining, Decisions. Analysis/Mining: historical data analyses; trends, patterns, rules, models, .. Decisions: long-term strategies, business decisions. Quality(Data) → Quality(Decisions): Quality of Data → Quality of Analysis → Quality of Decisions.

4 Entity Resolution Problem. Real World; Digital World.

5 Standard Approach to Entity Resolution. Deciding whether two references u and v co-refer by analyzing their features: s(u,v) = f(u,v), where s is the “similarity function” and f is the “feature-based similarity”. If s(u,v) > t, then u and v are declared to co-refer. Example references: u = J. Smith, js@google.com; v = John Smith, sm@yahoo.com; the features (name, Feature 2, Feature 3) are compared pairwise.
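The thresholded test on slide 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which function f is used, so the per-feature string similarity (`difflib.SequenceMatcher`), the feature names, the weights, and the threshold t = 0.8 are all hypothetical choices.

```python
from difflib import SequenceMatcher


def feature_similarity(a: str, b: str) -> float:
    """Similarity of one feature pair in [0, 1] (illustrative choice of metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def s(u: dict, v: dict, weights: dict) -> float:
    """s(u, v) = f(u, v): weighted average of per-feature similarities."""
    total = sum(weights.values())
    return sum(w * feature_similarity(u[k], v[k]) for k, w in weights.items()) / total


def co_refer(u: dict, v: dict, weights: dict, t: float = 0.8) -> bool:
    """Declare u and v co-referent when s(u, v) > t."""
    return s(u, v, weights) > t


# References from the slide; feature names and weights are assumptions.
u = {"name": "J. Smith", "email": "js@google.com"}
v = {"name": "John Smith", "email": "sm@yahoo.com"}
weights = {"name": 0.5, "email": 0.5}

print(round(s(u, v, weights), 3), co_refer(u, v, weights))
```

Any decision made this way depends entirely on the chosen features and on t, which is the weakness the following slides discuss.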

6 Measuring Quality of Entity Resolution
– Entity dispersion: for an entity, into how many clusters its representations are clustered; ideal is 1
– Cluster diversity: for a cluster, how many distinct entities it contains; ideal is 1
– Measures: F-Measure, B-Cubed F-Measure, Variation of Information (VI), Generalized Merge Distance (GMD), …
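Entity dispersion and cluster diversity, as defined on slide 6, can be computed directly from a ground-truth labelling and a clustering. The sketch below assumes both are given as dictionaries keyed by reference id; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def dispersion_and_diversity(truth: dict, clusters: dict):
    """truth maps reference -> true entity; clusters maps reference -> cluster id.

    Returns (dispersion per entity, diversity per cluster); the ideal value is 1.
    """
    clusters_per_entity = defaultdict(set)
    entities_per_cluster = defaultdict(set)
    for ref, entity in truth.items():
        cluster_id = clusters[ref]
        clusters_per_entity[entity].add(cluster_id)
        entities_per_cluster[cluster_id].add(entity)
    dispersion = {e: len(cs) for e, cs in clusters_per_entity.items()}
    diversity = {c: len(es) for c, es in entities_per_cluster.items()}
    return dispersion, diversity


# Toy example: entity A is split over two clusters, cluster 1 mixes A and B.
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
clusters = {"r1": 0, "r2": 1, "r3": 1, "r4": 1}
dispersion, diversity = dispersion_and_diversity(truth, clusters)
print(dispersion)  # {'A': 2, 'B': 1}
print(diversity)   # {0: 1, 1: 2}
```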

7 The Quality Curse: Why the Standard “Feature-based” Approach Leads to Poor Results
– Significant entity dispersion.
– Significant cluster diversity.
Example references:
– Photo Collection of Sharad Mehrotra from Beijing, China, June 2007 SIGMOD Trip
– Sharad Mehrotra, research interests: data management, Professor, UC Irvine
– S Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India
– S. Mehrotra, PhD from University of Illinois, is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India.

8 Overcoming the Quality Curse (1): Look more carefully at the data for additional evidence

9 Exploiting Relationships among Entities
Author table (clean):
⟨A1, ‘Dave White’, ‘Intel’⟩
⟨A2, ‘Don White’, ‘CMU’⟩
⟨A3, ‘Susan Grey’, ‘MIT’⟩
⟨A4, ‘John Black’, ‘MIT’⟩
⟨A5, ‘Joe Brown’, unknown⟩
⟨A6, ‘Liz Pink’, unknown⟩
Publication table (to be cleaned):
⟨P1, ‘Databases...’, ‘John Black’, ‘Don White’⟩
⟨P2, ‘Multimedia...’, ‘Sue Grey’, ‘D. White’⟩
⟨P3, ‘Title3...’, ‘Dave White’⟩
⟨P4, ‘Title5...’, ‘Don White’, ‘Joe Brown’⟩
⟨P5, ‘Title6...’, ‘Joe Brown’, ‘Liz Pink’⟩
⟨P6, ‘Title7...’, ‘Liz Pink’, ‘D. White’⟩
ER Graph: which author does each ‘D. White’ refer to?
Context Attraction Principle (CAP): Nodes that are more connected have a higher chance of co-referring to the same entity.
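To make the Context Attraction Principle concrete, the sketch below builds a graph from the two tables on slide 9 and asks which candidate author is better connected to the paper containing the ambiguous ‘D. White’. It is a deliberate simplification: shortest-path distance is used as a crude stand-in for connection strength, which is an assumption and not the model from the cited papers. It also assumes ‘Sue Grey’ has already been matched to A3, and it leaves out the unresolved ‘D. White’ edges.

```python
from collections import defaultdict, deque

# Edges taken from the slide's tables: author-affiliation and paper-author.
EDGES = [
    ("A1", "Intel"), ("A2", "CMU"), ("A3", "MIT"), ("A4", "MIT"),
    ("P1", "A4"), ("P1", "A2"),
    ("P2", "A3"),              # assumes 'Sue Grey' was resolved to A3
    ("P3", "A1"),
    ("P4", "A2"), ("P4", "A5"),
    ("P5", "A5"), ("P5", "A6"),
    ("P6", "A6"),
]


def build_graph(edges):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distance(graph, src, dst):
    """Breadth-first shortest-path length, or None if dst is unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def resolve(graph, paper, candidates):
    """Pick the candidate closest to the paper; None if none is connected."""
    reachable = [(distance(graph, paper, c), c) for c in candidates]
    reachable = [(d, c) for d, c in reachable if d is not None]
    return min(reachable)[1] if reachable else None


graph = build_graph(EDGES)
# 'D. White' in P2 could be A1 (Dave White) or A2 (Don White).
print(resolve(graph, "P2", ["A1", "A2"]))  # A2
```

In this toy graph P2 reaches Don White through Susan Grey, MIT, John Black, and P1, while Dave White is not connected to P2 at all, so the relationship evidence favours Don White.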

10 Exploiting Relationships for ER (Ph.D. Thesis, Stella Chen)
Formalizing the CAP principle [SDM 05, IQIS 05]
Scaling to large graphs [TODS 06]
Self-Tuning [DASFAA 07, JCDL 07, Journal IQ 11]
–Not all relationships are equal
–E.g., mutual interest in Bruce Lee movies is possibly not as important as being colleagues at a university for predicting co-authorship.
Merging relationship evidence with other evidence [SIGMOD ‘09]
Applying to People Search on the Web [ICDE ‘07, TDKE 08, ICDE 09 (demo)]

11 Effectiveness of Exploiting Relationships: WEPS; Multimedia

12 Smart Video Surveillance. Camera array to track human activities (CS Building at UC Irvine). Pipeline: Video collection, Surveillance Video Database, Semantic Extraction, Event Database, Query/Analysis.

13 Event Model. Pipeline: Surveillance Video Database, Semantic Extraction, Event Database, Query/Analysis. Event model: event (who, what, when, where, other property). Semantic extraction: activity recognition, face recognition, localization, temporal placement.
Query examples:
–Who was the last visitor to Mike Carey’s office yesterday?
–Who spends more time in labs: database students or embedded computing students?

14 Person Identification Challenge. Event model: event (who, what, when, where, other property). Semantic extraction: activity recognition, face recognition, localization, temporal placement. Person identification: who? Bob, Alice, or other?

15 Traditional Approach: Face Detection, then Face Recognition. Detects 70 faces per 1000 images; 2~3 images per person. Poor performance.

16 Rationale for Poor Performance
Poor quality of data: no faces, small faces, low resolution, low temporal resolution.
Resolution: original: original performance; 1/2 original: drops to 70%; 1/3 original: drops to 30%.
Sampling rate: 1 frame/sec: original performance; 1/2 frame/sec: drops to 53%; 1/3 frame/sec: drops to 35%.

17 Effectiveness of Exploiting Relationships: WEPS; Multimedia [IQ2S PERCOM 2011]

18 Results on Face Clustering [ACM ICMR 2013 Best Paper Award]

19 Results. High precision, 662 clusters for 31 real persons: 631 merges. High precision, 203 clusters for 31 real persons: 172 merges. 4 times.

20 Overcoming the Quality Curse (2): Look outside the box

21 Exploiting Search Engine Statistics. Google search results for “Andrew McCallum”. Correlations amongst context entities provide an additional source of information to resolve entities.
Context entities: Sebastian Thrun, Machine Learning, Text Retrieval; Tom Mitchell, CRF, UAI 2003.
Search engine queries to learn correlations amongst contexts:
–Sebastian Thrun AND Tom Mitchell
–Andrew McCallum AND Sebastian Thrun AND Tom Mitchell
–(Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
–Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
–Andrew McCallum AND Sebastian Thrun AND (CRF OR UAI 2003)
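The five queries on slide 21 follow a regular pattern that a small builder can reproduce. The sketch below assumes one reading of the slide: each of two web pages contributes a context made of people and concepts, and the queries combine those contexts with and without the ambiguous name. The function names and the page/context dictionary layout are hypothetical, and no search engine is contacted.

```python
def or_group(terms):
    """Join alternatives with OR; a single term needs no parentheses."""
    return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"


def and_query(*parts):
    return " AND ".join(parts)


def correlation_queries(name, page_i, page_j):
    """Build the slide's five query shapes for two page contexts."""
    people_i, concepts_i = or_group(page_i["people"]), or_group(page_i["concepts"])
    people_j, concepts_j = or_group(page_j["people"]), or_group(page_j["concepts"])
    return [
        and_query(people_i, people_j),
        and_query(name, people_i, people_j),
        and_query(concepts_i, concepts_j),
        and_query(name, concepts_i, concepts_j),
        and_query(name, people_i, concepts_j),
    ]


# Contexts as they appear on the slide; the grouping into two pages is assumed.
page_i = {"people": ["Sebastian Thrun"], "concepts": ["Machine Learning", "Text Retrieval"]}
page_j = {"people": ["Tom Mitchell"], "concepts": ["CRF", "UAI 2003"]}

queries = correlation_queries("Andrew McCallum", page_i, page_j)
for q in queries:
    print(q)
```

Comparing hit counts for the queries with and without the name is what would indicate whether the two contexts are correlated for this particular person.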

22 Exploiting Web Search Engine Statistics (Ph.D. Thesis, Rabia Nuray)
–Web queries to learn correlations [SIGIR 08]
–Application to Web People Search [WePS 09]
–Cluster refinement to overcome the singleton cluster problem [TODS 11-a]
–Making Web querying robust to server-side fluctuations [tech. report]
–Scaling up the Web query technique [TODS 11-a]

23 Comparing with the State of the Art on the WEPS-2 Dataset

24 Fluctuations in External Server Behaviour. Setup: the Client System sends requests to, and receives replies from, Bing, Google, and Yahoo!, with Yahoo! YQL as a mediator service. Throughput (queries/sec) at different batch sizes, Bing versus Yahoo (columns: Batch size, Bing, Yahoo): 35222 5012359 1009525. Throughput (queries/sec) at different times: Microsoft Bing.

25 Robustness to Fluctuations in External Server Behaviour. Setup: the Client System sends requests to, and receives replies from, Bing, Google, and Yahoo!, with Yahoo! YQL as a mediator service. A Reinforcement Learning Optimizer takes measurements and sets the configuration parameters.
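The slide names a reinforcement learning optimizer but gives no details, so the following is only one simple possibility: an epsilon-greedy bandit that treats each batch size as an arm and the measured throughput as the reward. The function name, the parameter defaults, and the throughput figures in the usage example are all hypothetical, and the `measure` callback stands in for real timing of a batch of requests.

```python
import random


def tune_batch_size(measure, batch_sizes, rounds=200, epsilon=0.1, seed=0):
    """Epsilon-greedy choice of the batch size with the best average throughput.

    measure(batch_size) must return an observed throughput (queries/sec).
    """
    rng = random.Random(seed)
    counts = {b: 0 for b in batch_sizes}
    average = {b: 0.0 for b in batch_sizes}
    for _ in range(rounds):
        untried = [b for b in batch_sizes if counts[b] == 0]
        if untried:
            choice = untried[0]                      # try every setting once
        elif rng.random() < epsilon:
            choice = rng.choice(batch_sizes)         # explore
        else:
            choice = max(batch_sizes, key=average.get)  # exploit
        reward = measure(choice)
        counts[choice] += 1
        average[choice] += (reward - average[choice]) / counts[choice]  # running mean
    return max(batch_sizes, key=average.get)


# Made-up, noise-free throughput per batch size, for illustration only.
fake_throughput = {35: 10.0, 50: 30.0, 100: 20.0}
best = tune_batch_size(lambda b: fake_throughput[b], [35, 50, 100])
print(best)  # 50
```

Because server behaviour drifts over time, a deployed version would likely weight recent measurements more heavily than a plain running mean does.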

26 Scaling Web Querying
Number of queries: 4K^2
–Very large to submit to a search engine (40K for 100 search results)
–Network, search engine load
–~6-8 minutes (with optimal batch and concurrency)
Solutions:
–Local caching of the Web
–Minimize the number of queries submitted to the search engine
–Choose the n most effective queries that will maximize the expected quality
–NP-hard by reduction from knapsack!
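The query count on slide 26 is easy to check, and the knapsack connection suggests a familiar heuristic. The sketch below computes 4K^2 and then applies the textbook greedy rule for knapsack-like problems (highest gain per unit cost first); that rule is a swapped-in illustration and not necessarily the selection method used in the cited work. The expected-gain and cost values are hypothetical.

```python
def num_queries(k: int) -> int:
    """Number of web queries for k search results, per the slide: 4K^2."""
    return 4 * k * k


def greedy_select(candidates, budget):
    """Pick queries by descending gain/cost ratio until the budget is used up.

    candidates: list of (query, expected_gain, cost) tuples with cost > 0.
    """
    chosen, spent = [], 0.0
    for query, gain, cost in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(query)
            spent += cost
    return chosen


print(num_queries(100))  # 40000, the "40K for 100 search results" on the slide

# Hypothetical candidates: (query, expected quality gain, cost in query units).
candidates = [("q1", 0.9, 1.0), ("q2", 0.1, 1.0), ("q3", 0.5, 1.0)]
print(greedy_select(candidates, budget=2.0))  # ['q1', 'q3']
```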

27 Heuristic Approach [TODS-11]
Create initial clusters using direct features only
Repeat until the time limit is reached:
–Find the most promising queries, whose answers are likely to change the clustering
–Select a batch of queries from the promising queries
–Query the web and update the similarity graph
–Generate new clusters
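The loop on slide 27 can be sketched as follows. Several pieces are assumptions made for illustration: clusters are taken to be connected components over edges whose similarity exceeds a threshold t, "most promising" is approximated by how close an edge's similarity is to t, and `web_evidence` is a caller-supplied stub that stands in for querying a search engine.

```python
import time


def cluster(nodes, sim, t):
    """Connected components over edges with similarity > t (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in sim.items():
        if score > t:
            parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


def progressive_er(nodes, sim, web_evidence, t=0.5, batch_size=2, time_limit=1.0):
    sim = dict(sim)                       # do not mutate the caller's graph
    asked = set()
    deadline = time.monotonic() + time_limit
    clusters = cluster(nodes, sim, t)     # initial clusters: direct features only
    while time.monotonic() < deadline:
        # Promising edges: not yet queried, similarity closest to the threshold.
        promising = sorted((p for p in sim if p not in asked),
                           key=lambda p: abs(sim[p] - t))
        batch = promising[:batch_size]
        if not batch:
            break
        for pair in batch:                # "query the web", update the graph
            sim[pair] = web_evidence(pair)
            asked.add(pair)
        clusters = cluster(nodes, sim, t)  # generate new clusters
    return clusters


nodes = ["a", "b", "c"]
direct = {("a", "b"): 0.9, ("b", "c"): 0.45, ("a", "c"): 0.1}
# Hypothetical web evidence: only the borderline pair (b, c) changes.
evidence = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.1}

print(cluster(nodes, direct, 0.5))                    # [['a', 'b'], ['c']]
print(progressive_er(nodes, direct, evidence.get))    # [['a', 'b', 'c']]
```

Querying borderline edges first means that stopping early still spends the budget where the clustering is most likely to change.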

28 Efficiency Experiments. Promising-edge analysis saved 30-40% of the edges when the initial clusters are few.

29 Observation/Conclusion…
Additional evidence can be exploited to improve data quality
BUT …it is expensive!!
Example: Web queries approach
–Number of queries: 4K^2 (~40K for 100 results)
–Very large to submit to a search engine and expect real-time results
–~6-8 minutes (network costs, search engine load)
Solutions:
–Local caching of the Web
–Ask only important queries
–Reduces to 1-2 min. without degrading quality much

30 (Near) Future: Addressing the Efficiency Curse … (Diagram: DASFAA 2003; Improving Efficiency, Improving Quality, New Domains.)
Two complementary approaches:
–Pay-as-you-go data cleaning: progressive algorithm to obtain the best quality given a budget constraint
–Query-driven data cleaning: perform minimal cleaning to answer the query/analysis task; avoid having to clean unnecessary data.

