Self-tuning in Graph-Based Reference Disambiguation


Self-tuning in Graph-Based Reference Disambiguation
Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine

Overview
- Intro to Data Cleaning
  - Entity resolution
- RelDC Framework
  - Past work
- Adapting to data
  - The new part
  - Reduction to an optimization problem
  - Linear programming
- Experiments
11 January 2019 DASFAA 2007, Bangkok, Thailand

Data Cleaning
- Analysis of bad data leads to wrong conclusions

Example of the problem: CiteSeer top-k
- CiteSeer: the top-k most cited authors
- Suspicious entries
  - Let's go to the DBLP website, which stores bibliographic entries of many CS authors
  - Let's check two people: “A. Gupta” and “L. Zhang”
  - They are in the top 20 because there are many of them: several distinct authors share each name, and their citations are counted together

Two Most Common Entity-Resolution Challenges
- Fuzzy lookup (reference disambiguation)
  - Match references to objects
  - The list of all objects is given
- Fuzzy grouping
  - Group together object representations that correspond to the same object

Standard Approach to Entity Resolution

Overview
- Intro to Data Cleaning
- RelDC Framework
  - Past work
- Adapting to data
  - The new part
  - Reduction to an optimization problem
  - Linear programming
- Experiments

RelDC Framework

RelDC Framework
- Past work: SDM’05, TODS’06
- Domain-independent framework
  - Views the dataset as an entity-relationship graph
  - Analyzes paths in this graph
- Solid theoretical foundation
  - Optimization problem
- Scales to large datasets
- Robust under uncertainty
- High disambiguation quality
- No self-tuning
  - This paper solves this challenge

Entity-Relationship Graph
- Choice node
  - For uncertain references
  - Encodes the options/possibilities yr1, …, yrN
- Among the options yr1, …, yrN
  - Pick the most strongly connected one (CAP principle)
  - Analyze the paths in G that exist between xr and yrj, for all j
  - Use a model to measure connection strength
- “Connection strength” model c(u,v), for nodes u and v in G
  - Measures how strongly u and v are connected in G
  - RandomWalk-based
  - Fixed, and based on intuition
- This paper, instead, learns such a model from data.
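The selection step on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `pick_option`, the node labels, and the strength values are made up here, and the connection-strength model `c` is passed in as an arbitrary callable rather than being the RandomWalk-based model of the earlier RelDC work.

```python
from typing import Callable, Hashable, Sequence


def pick_option(
    x: Hashable,
    options: Sequence[Hashable],
    c: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option most strongly connected to x under model c.

    Ties are broken in favour of the option listed first.
    """
    return max(options, key=lambda y: c(x, y))


# Hypothetical connection strengths for one uncertain reference x1.
toy_strength = {("x1", "y11"): 0.7, ("x1", "y12"): 0.2}


def toy_c(u: Hashable, v: Hashable) -> float:
    """Toy stand-in for c(u, v); unknown pairs count as unconnected."""
    return toy_strength.get((u, v), 0.0)


best = pick_option("x1", ["y11", "y12"], toy_c)
print(best)
```

Keeping `c` as a parameter mirrors the point of the slide: the selection rule stays the same, and only the connection-strength model is swapped for a learned one.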

Overview
- Intro to Data Cleaning
- RelDC Framework
  - Past work
- Adapting to data
  - The new part
  - Reduction to an optimization problem
  - Linear programming
- Experiments

Adaptive Solution
- Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN}
- If paths p1 and p2 are of the same type, they are treated as identical
- The connection between nodes u and v can then be represented by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v
- If each path type Ti can be associated with a weight wi, then the connection strength will be: [equation not preserved in the transcript]
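The equation on this slide did not survive in the transcript. One natural reading, given a count vector and one weight per path type, is a weighted sum of the counts. The sketch below assumes that reading; the function name and the numbers are illustrative and not taken from the talk.

```python
from typing import Sequence


def connection_strength(counts: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed form: c(u, v) = sum over i of c_i * w_i.

    counts[i]  -- number of u-v paths of type T_i (the vector Tuv)
    weights[i] -- weight w_i associated with path type T_i
    """
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))


# Made-up example with three path types.
t_uv = [2, 0, 1]
w = [0.5, 1.0, 0.25]
strength = connection_strength(t_uv, w)
print(strength)
```

Under this assumption the strength is linear in the weights, which is what would make a linear-programming formulation for learning them possible.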

Problems to Answer
- How will we classify the paths?
- How will we associate each path type with a weight?

Classifying Paths
- Path Type Model (PTM):
  - Views each path as a sequence of edges <e1, e2, e3, …, en>
  - Each edge ei has a type Ei associated with it
  - Thus, each path p can be associated with a string <E1, E2, E3, …, En>
  - Different strings correspond to different path types
  - Each string is associated with a weight
- Different models are also possible
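A minimal sketch of the Path Type Model as described on this slide: each path is mapped to the sequence of its edge types, and paths are counted per type. The edge identifiers, the edge-type names ("writes", "affiliated"), and the helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

PathType = Tuple[str, ...]


def path_type(path: Sequence[str], edge_type: Dict[str, str]) -> PathType:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path)


def count_path_types(
    paths: Iterable[Sequence[str]], edge_type: Dict[str, str]
) -> Counter:
    """Count how many of the given paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)


# Hypothetical edges and edge types.
edge_type = {
    "e1": "writes",
    "e2": "writes",
    "e3": "affiliated",
    "e4": "affiliated",
}
paths = [["e1", "e3"], ["e2", "e4"], ["e1", "e2"]]
counts = count_path_types(paths, edge_type)
print(counts)
```

The first two paths use different edges but share the type ("writes", "affiliated"), so they are treated as identical and counted together.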

Learning Path Weights: Optimization Problem
- The CAP principle states that the right option will be better connected
- Linear programming
- Learn the path-type weights wi

Final Solution
- The value of c(xr,yrj) - c(xr,yrl) should be maximized for all r, l ≠ j
- Then the final solution: [equation not preserved in the transcript]
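The final linear program is not preserved in the transcript, so the following is only a sketch under explicit assumptions: (1) connection strength is a weighted sum of path-type counts, (2) the objective is the sum of c(xr,yrj) - c(xr,yrl) over all training references r and wrong options l, where yrj is the correct option, and (3) the only constraints are 0 ≤ wi ≤ 1. None of these is confirmed by the slides. Under these assumptions the objective is linear in the weights, and a linear objective over a box is maximized coordinate by coordinate, so no LP solver is needed. The function name and the training data are made up.

```python
from typing import List, Optional, Sequence, Tuple

# One training item: (counts for the correct option, counts for each wrong option).
TrainingItem = Tuple[Sequence[int], Sequence[Sequence[int]]]


def learn_weights(
    training: Sequence[TrainingItem], n_types: int
) -> Tuple[List[int], List[Optional[float]]]:
    """Maximize sum_i a_i * w_i subject to 0 <= w_i <= 1 (assumed constraints).

    a_i accumulates (correct count - wrong count) for path type i.
    Returns the coefficients and the weights; None marks a weight that may
    take any value in [0, 1] because its coefficient is zero.
    """
    coeff = [0] * n_types
    for correct, wrongs in training:
        for wrong in wrongs:
            for i in range(n_types):
                coeff[i] += correct[i] - wrong[i]
    weights: List[Optional[float]] = [
        1.0 if a > 0 else 0.0 if a < 0 else None for a in coeff
    ]
    return coeff, weights


# Made-up training data: one reference, one wrong option, four path types.
training = [([2, 1, 0, 0], [[0, 1, 1, 2]])]
coeff, weights = learn_weights(training, 4)
print(coeff, weights)
```

A fuller formulation would typically add further constraints and be handed to a general LP solver; this closed form holds only for the box-constrained case assumed here.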

Example - Graph
- P1 = e1-e3-e1
- P2 = e1-e1-e3
- P3 = e1-e2-e2-e3
- P4 = e1-e2-e3-e2-e3

Example - Solution
- w1 = 1
- w3 = w4 = 0
- w2 can be anything between 0 and 1

Overview
- Intro to Data Cleaning
- RelDC Framework
  - Past work
- Adapting to data
  - The new part
  - Reduction to an optimization problem
  - Linear programming
- Experiments

Experimental Setup
- Parameters
  - When looking for L-short simple paths, L = 5
  - L is the path-length limit
- RealMov dataset:
  - movies (12K)
  - people (22K): actors, directors, producers
  - studios (1K): producing, distributing
  - ground truth is known
- SynPub datasets:
  - many datasets of five different types
  - emulation of RealPub
  - publications (5K)
  - authors (1K)
  - organizations (25K)
  - departments (125K)
  - ground truth is known
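To make the parameter L concrete, here is a small sketch that enumerates simple paths (no repeated node) between two nodes whose length does not exceed a limit. It assumes that length is measured in edges and that the graph is given as an adjacency dictionary; the graph and the function name are illustrative, and this is a plain depth-first search rather than the path-discovery procedure used in RelDC.

```python
from typing import Dict, Hashable, Iterator, List, Sequence, Set

Graph = Dict[Hashable, Sequence[Hashable]]


def l_short_simple_paths(
    graph: Graph, source: Hashable, target: Hashable, max_len: int
) -> Iterator[List[Hashable]]:
    """Yield simple paths from source to target with at most max_len edges."""

    def dfs(
        node: Hashable, path: List[Hashable], visited: Set[Hashable]
    ) -> Iterator[List[Hashable]]:
        if node == target:
            yield list(path)
            return
        if len(path) - 1 == max_len:  # edge budget used up
            return
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                yield from dfs(nxt, path, visited)
                path.pop()
                visited.remove(nxt)

    yield from dfs(source, [source], {source})


# Small undirected toy graph, stored as symmetric adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}
short = list(l_short_simple_paths(graph, "a", "d", max_len=2))
longer = list(l_short_simple_paths(graph, "a", "d", max_len=3))
print(short)
print(len(longer))
```

Raising the limit from 2 to 3 admits the two extra paths that pass through both b and c, which shows how quickly the number of candidate paths can grow with L.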

Experimental Results on Movies
- Parameters:
  - Fraction: the fraction of uncertain references in the dataset
  - Each reference has 2 choices

Experimental Results on Movies - II
- Number of options based on PMF distribution

Experimental Results on SynPub
- RandomWalk, PTM, and the Hybrid Model have the same accuracy
- Is RandomWalk the optimum model for the publications domain?
- Hybrid Model: [definition not preserved in the transcript]

Effect of Random Relationships in the Publications Domain

Summary
- Main contribution
  - An adaptive solution for connection strength
  - The model learns the weights of different path types
- Ongoing work
  - Using different models to learn the importance of paths in the connection strength
  - Using standard machine learning techniques, such as decision trees, for learning
  - Different ways to classify paths

Contact Information
- RelDC project: www.ics.uci.edu/~dvk/RelDC, www.itr-rescue.org (RESCUE)
- Rabia Nuray-Turan (contact author): www.ics.uci.edu/~rnuray
- Dmitri V. Kalashnikov: www.ics.uci.edu/~dvk
- Sharad Mehrotra: www.ics.uci.edu/~sharad

Thank you!