
1 Exploiting Context Analysis for Combining Multiple Entity Resolution Systems
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
University of California, Irvine
ACM SIGMOD 2009 Conference, Providence, RI, USA, June 30 – July 2, 2009
© 2009 Dmitri V. Kalashnikov

2 Information Quality
–Quality of data is critical: a $1 billion market (estimated by Forrester Group)
–Data processing flow: (raw) data → analysis → decisions
–Correspondingly: quality of data → quality of analysis → quality of decisions

3 Entity Resolution
Entity Resolution (ER)
–One of the information quality challenges
–Disambiguating uncertain references to objects (in raw data)
Lookup
–A list of all objects is given
–Match references to objects
Grouping
–No list of objects is given
–Group references that corefer

4 Example of Analysis on Bad Data: CiteSeer
–CiteSeer: top-k most cited authors
–Unexpected entries: let's check two people in DBLP, "A. Gupta" and "L. Zhang"
–Analysis on bad data can lead to incorrect results; fix errors before analysis
–Pipeline: raw data → data quality engine → analysis → decisions

5 Motivating ER Ensembles
–Many ER solutions exist, but no single ER solution is consistently the best in terms of quality
–Different ER solutions perform better in different contexts
–Example: let K be the true number of clusters (K is part of the context); assume we use agglomerative clustering (merging)
–if (K is large) then use Solution 1: high threshold
–if (K is small) then use Solution 2: low threshold
–Observe that K is unknown beforehand in this case!
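A minimal sketch (illustrative Python, not from the slides) of threshold-based agglomerative merging over a pairwise similarity matrix, showing why the best threshold depends on the true number of clusters K:

def merge_clusters(sim, threshold):
    # Single-link merging: two references end up in the same cluster whenever
    # a chain of similarities above the threshold connects them.
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# A high threshold produces many small clusters (good when K is large);
# a low threshold produces few large clusters (good when K is small).
sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.3],
       [0.2, 0.3, 1.0]]
print(merge_clusters(sim, 0.8))   # [[0, 1], [2]]
print(merge_clusters(sim, 0.1))   # [[0, 1, 2]]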

6 Graphical View of ER Problem: Virtual Connected Subgraphs
–Use simple techniques to create similarity edges (or connect all references)
–Similarity edges form virtual connected subgraphs (VCSs)
VCS properties:
1. Virtual – contains only similarity edges
2. Connected – a path exists between any 2 nodes
3. Subgraph – a subgraph of the ER graph
4. Complete – adding more nodes/edges would violate (1) or (2)
Logically, the goal of ER is to partition each VCS correctly [CKM: JCDL 2007]
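As a minimal sketch (illustrative Python, not the authors' code), VCSs can be computed as connected components over the similarity edges; the nodes and edges below mirror the toy example on the next slide:

from collections import defaultdict

def virtual_connected_subgraphs(nodes, similarity_edges):
    # Adjacency over similarity edges only (property 1: virtual).
    adj = defaultdict(set)
    for u, v in similarity_edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, vcss = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Grow a maximal connected component (properties 2-4).
        component, frontier = set(), [start]
        while frontier:
            u = frontier.pop()
            if u in component:
                continue
            component.add(u)
            frontier.extend(adj[u] - component)
        seen |= component
        vcss.append(sorted(component))
    return vcss

print(virtual_connected_subgraphs(
    ["A", "B", "C", "D", "E", "F", "G"],
    [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G")]))
# -> [['A', 'B', 'C', 'D'], ['E', 'F', 'G']]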

7 Problem Definition
–Figure: raw dataset → base-level ER systems S_1, S_2, …, S_N (black boxes) → output of S_1, output of S_2, …, output of S_N → ensemble techniques → final result
–Apply each base-level system to the dataset
–Outputs are viewed as graphs: a node per reference, edges connecting each pair of references
–For each edge e_j, system S_i makes a decision d_ji ∈ {-1, +1}
–Goal: combine d_j1, d_j2, …, d_jn into the final decision a_j* for e_j, such that the final clustering is as close to the ground truth as possible
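A minimal sketch (illustrative Python; names are assumptions, not the paper's code) of how each black-box output can be turned into the per-edge decisions d_ji described above:

from itertools import combinations

def clustering_to_edge_decisions(clustering):
    # clustering: list of clusters, each a collection of reference ids.
    # Returns {(u, v): +1 or -1} for every pair of references in the VCS.
    label = {ref: cid for cid, cluster in enumerate(clustering) for ref in cluster}
    refs = sorted(label)
    return {(u, v): (+1 if label[u] == label[v] else -1)
            for u, v in combinations(refs, 2)}

# Two base-level systems that disagree on whether B and C co-refer:
s1 = [["A", "B", "C"], ["D"]]
s2 = [["A", "B"], ["C", "D"]]
d1 = clustering_to_edge_decisions(s1)
d2 = clustering_to_edge_decisions(s2)
print(d1[("B", "C")], d2[("B", "C")])   # 1 -1: the ensemble must produce the final a_j*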

8 Toy Example: Notation
–Figure: a graph with two VCSs, VCS 1 over nodes A, B, C, D and VCS 2 over nodes E, F, G, shown alongside the clusterings produced by ER systems S1 and S2

9 Naïve Solutions: Voting and Weighted Voting
Voting
–For each edge e_j, sum the decisions d_ji made by each S_i: if (sum ≥ 0) then e_j is positive (+1), else e_j is negative (-1)
–Notice: if (n-1) systems perform poorly and only one performs well, the majority will win…
Weighted Voting
–Assign weight w_i to each system S_i
–For e_j, sum the weighted decisions d_ji made by the S_i's
–Proceed as in voting
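The two naive combiners above in a minimal sketch (illustrative Python):

def vote(decisions):
    # decisions: the list of d_ji in {-1, +1} for one edge, one per base system.
    return +1 if sum(decisions) >= 0 else -1

def weighted_vote(decisions, weights):
    # Each system S_i contributes its decision with weight w_i.
    score = sum(w * d for w, d in zip(weights, decisions))
    return +1 if score >= 0 else -1

print(vote([+1, -1, -1]))                      # -1: the (possibly wrong) majority wins
print(weighted_vote([+1, -1, -1], [3, 1, 1]))  # +1: a heavily weighted system can override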

10 Limitations of Weighted Voting
–No matter how we choose the weights, accuracy ≤ 56% in our running example
–Problem: weighted voting is static, non-adaptive to the context

11 Choosing Context Features
–Effectiveness: should capture well which ER systems work well in the given context
–Generality: should be generic, not present in just a few datasets
Number of Clusters (K)
–The true number of clusters K+ can help (merging example), but K+ is unknown!
–Use regression to predict K*: K_1, K_2, …, K_n → K*, where K_i is the number of clusters produced by S_i
–This yields features for edge e_j
Node Fanout
–N_v+ is the number of positive edges of node v; also unknown
–Use regression to predict N_v*: N_v1, N_v2, …, N_vn → N_v*, where N_vi is the count according to S_i
–This yields features for edge e_j
Error Features
–Measure how far S_i's prediction of a parameter is from the estimated true value of that parameter
–The larger the error, the more likely it is that S_i's solution is off
These features are then combined
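A minimal sketch of the regression step (Python with scikit-learn assumed; the toy numbers are made up) for the number-of-clusters feature: predict K* from the per-system counts K_1, …, K_n and derive error features from it:

import numpy as np
from sklearn.linear_model import LinearRegression

# Training rows: one VCS each, with the K_i reported by systems S_1..S_3
# and the true K+ known from the training ground truth.
K_per_system = np.array([[4, 5, 3],
                         [2, 2, 4],
                         [6, 7, 6]])
K_true = np.array([4, 2, 6])
reg = LinearRegression().fit(K_per_system, K_true)

# At test time K+ is unknown: predict K* and measure how far each S_i is off.
K_test = np.array([[5, 6, 4]])
K_star = reg.predict(K_test)[0]
error_features = np.abs(K_test[0] - K_star)   # larger error -> S_i's solution is more suspect
print(K_star, error_features)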

12 Training & Testing
–Figure: the training and testing pipelines (one input is used in training only)

13 Approach 1: Context-Extended Classification
–Figure: an example decision tree over decision features d_1, d_2 and a context feature f_2 (split at f_2 ≤ 0.9 vs. > 0.9), with leaves labeled C=-1 or C=+1
–Context features: three methods for using them; the third reduces 2n features → n via the confidence in "merge" and learns the resulting mapping
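One plausible reading of the context-extended idea, as a minimal sketch (Python with scikit-learn assumed; toy data and feature values are made up, and this is not claimed to be the authors' exact method): append the context features f_j to the base-system decisions and train a single meta-classifier for the final edge label:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training edges: 3 base-system decisions plus 2 context features each.
decisions = np.array([[+1, +1, -1], [-1, -1, -1], [+1, -1, +1], [-1, +1, -1]])
context   = np.array([[0.2, 3.0],  [0.9, 1.0],   [0.4, 2.0],   [0.8, 1.5]])
labels    = np.array([+1, -1, +1, -1])        # ground-truth edge labels from training data

X = np.hstack([decisions, context])           # context-extended feature vector per edge
meta = DecisionTreeClassifier(max_depth=3).fit(X, labels)

new_edge = np.hstack([[+1, -1, -1], [0.3, 2.5]])
print(meta.predict([new_edge])[0])            # final decision a_j* for this edge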

14 Approach 2: Context-Weighted Classification
Idea
–For each S_i, learn a model M_i of how well S_i performs in a given context
–Learn f_j → c_j
Algorithm
–Apply S_i; get d_ji and f_j for e_j
–Apply M_i on f_j; get c*_ji and p_ji, where p_ji is the confidence in c*_ji
–v_ji = d_ji · c*_ji · p_ji; v_j = (v_j1, v_j2, …, v_jn)
–This may reverse some decisions
–Learn/use the v_j → a*_j mapping
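A minimal sketch of the per-system weighting (Python with scikit-learn assumed; the models, data, and names are illustrative): each M_i predicts from the context features whether S_i's decision is correct (c*_ji) with confidence p_ji, and the products v_ji = d_ji · c*_ji · p_ji form the vector fed to the second-level classifier:

import numpy as np
from sklearn.linear_model import LogisticRegression

def context_weighted_vector(models, decisions, f_j):
    v = []
    for M_i, d_ji in zip(models, decisions):
        proba = M_i.predict_proba([f_j])[0]           # [P(S_i wrong), P(S_i correct)]
        c_star = +1 if proba[1] >= proba[0] else -1   # c*_ji: is S_i right in this context?
        p_ji = max(proba)                             # confidence in that judgment
        v.append(d_ji * c_star * p_ji)                # may flip S_i's original decision
    return np.array(v)

# Toy correctness models M_1, M_2 trained on context features f_j
# (label 1 = that system's decision was correct on the training edge).
f_train = np.array([[0.1, 2.0], [0.8, 1.0], [0.5, 3.0], [0.9, 0.5]])
correct = [np.array([1, 0, 1, 0]), np.array([0, 1, 0, 1])]
models = [LogisticRegression().fit(f_train, y) for y in correct]

v_j = context_weighted_vector(models, decisions=[+1, -1], f_j=[0.2, 2.5])
print(v_j)   # v_j is then mapped to the final decision a_j* by a learned classifier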

15 Clustering
Correlation Clustering (CC)
–Once the a*_j ∈ {-1, +1} are known, we need to cluster
–CC is designed to handle conflicts in labeling: it finds the clustering that agrees the most with the labeling
–CC can behave as agglomerative clustering if its parameters are set accordingly; it is the more generic scheme
Example
–2 negative edges vs. 1 positive edge: simple merging will merge, CC will not
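A minimal sketch (a simple greedy heuristic in Python, not the exact correlation clustering algorithm used in the paper) of clustering that respects conflicting edge labels: two groups are merged only when the +1/-1 labels between them are positive on balance:

def greedy_correlation_clustering(nodes, edge_labels):
    # edge_labels: {(u, v): +1 or -1} with u < v; missing pairs count as 0.
    clusters = [{n} for n in nodes]

    def agreement(c1, c2):
        return sum(edge_labels.get((min(u, v), max(u, v)), 0)
                   for u in c1 for v in c2)

    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if agreement(clusters[i], clusters[j]) > 0:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return [sorted(c) for c in clusters]

# D is attached to the {A, B, C} cluster by 1 positive but 2 negative edges:
# merging on any positive edge would pull D in; the conflict-aware clustering keeps it out.
labels = {("A", "B"): +1, ("B", "C"): +1, ("A", "D"): -1, ("B", "D"): -1, ("C", "D"): +1}
print(greedy_correlation_clustering(["A", "B", "C", "D"], labels))
# -> [['A', 'B', 'C'], ['D']]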

16 Experimental Setup
Datasets
–Web domain: [WWW'05]
–Publication domain: RealPub [TODS'05]
Baseline Algorithms
–BestBase: the S_i that produces the best overall result
–Majority voting
–Weighted voting
–Three clustering-aggregation algorithms from [GMT05]
–Standard ER ensemble [ZR05]
Base-level Systems S_i
–TF-IDF + merging, with different merging thresholds
–Feature + relationship + correlation clustering
–Etc.

17 Sample of Base-level Systems
–Table: the base-level systems used in the experiments

18 Experiment 1: "Sanity Check"
–Introduce one "perfect" base-level system that always gets perfect results
–Such a system does not exist in practice; it utilizes the (normally unknown) ground truth
–As expected, the algorithms learned to use that "perfect" system and to ignore the results of the other base-level systems

19 Comparing Various Aggregation Algorithms
–Measures: F_P, F_B, F_1
–Number of systems: 5, 10, 20
–MajorVot < BestBase; many base algorithms do not perform well
–WeightedERE is #1, ExtendedERE is #2
–Both are statistically better according to a t-test with α = 0.05
–Consistent improvement as the number of systems grows: 5 → 10 → 20

20 Detailed Results for 20 Systems and F_P
–None of the baselines is consistently better (see "BestIndiv")
–That is why ER Ensemble outperforms the rest

21 Results on RealPub
–Results are similar to those on the WePS data

22 Comparing Different Combinations of Base-level Systems on RealPub
–Combination 1: 1 Context, 3 RelER (t = 0.05; 0.01; 0.005), and 1 RelAA (t = 0.1)
–Combination 2: 3 RelER (t = 0.0005; 0.0001; 0.00005) and 2 RelAA (t = 0.01; 0.001)
–W_ERE is #1, E_ERE is #2; Combination 2 > Combination 1

23 Efficiency Issues
Running time consists of:
–Running the base-level systems (in parallel) to get the decision features
–Running two regression classifiers (in parallel) to get the context features
–Applying the meta-classifier: depends on the type of classifier; usually not a bottleneck (1-5 sec on 5K to 50K data)
–Applying correlation clustering: not a bottleneck (under a second)
Blocking
–1-2 orders of magnitude of improvement
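A minimal sketch of blocking (a standard ER optimization; the blocking key here is an illustrative choice, not the one used in the paper): group references by a cheap key so that only within-block pairs are ever passed to the base-level systems and the ensemble:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(references, key=lambda r: r.split()[-1][0].lower()):
    # Illustrative key: first letter of the last token (roughly, the surname initial).
    blocks = defaultdict(list)
    for ref in references:
        blocks[key(ref)].append(ref)
    for block in blocks.values():
        yield from combinations(block, 2)

refs = ["A. Gupta", "Anoop Gupta", "L. Zhang", "Lei Zhang"]
print(list(candidate_pairs(refs)))
# Only the Gupta-Gupta and Zhang-Zhang pairs survive; pairs across blocks are never generated.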

24 Future Work
Efficiency
–How to determine which base-level systems to run, and on which parts of the data
–Trade efficiency for quality
Features
–Look into more feature types
–Improve the quality of predictions
–Apply the framework iteratively

25 Questions?
Dmitri V. Kalashnikov – www.ics.uci.edu/~dvk
Stella Chen
Sharad Mehrotra – www.ics.uci.edu/~sharad
GDF Project – www.ics.uci.edu/~dvk/GDF

