Disambiguation Algorithm for People Search on the Web


1 Disambiguation Algorithm for People Search on the Web
Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen, Rabia Nuray-Turan, Naveen Ashish
For questions visit: Computer Science Department, University of California, Irvine
If somebody has questions during the presentation which you cannot answer, please ask them to contact Dmitri V. Kalashnikov; the URL is provided on the slide. That webpage contains the e-mail address of Dmitri V. Kalashnikov, which is not given here to avoid spam.

2 Entity (People) Search
The top-K webpages returned for a name may refer to several namesakes (Person1, Person2, Person3); which page belongs to which person is unknown beforehand. Problem definition: the goal is to group the webpages that co-refer, i.e., talk about the same person. Why? Better (next-generation) search capabilities!

3 Standard Approach to Entity Resolution
So, how do you solve the disambiguation problem? This slide shows the standard approach: using features. A feature in this case can be the TF/IDF representation of a webpage.

4 Key Observation: More Info is Available
This is the key observation for understanding the RelDC framework: relational data can be represented in the form of an Entity-Relationship (ER) graph. In an ER graph, nodes are entities and edges are relationships.

5 RelDC Framework
The RelDC framework combines methods for using features, context (features derived from context), and relationships. Intuition: the presence of a path between X and Y might indicate that they co-refer. Thus, analyze paths!
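The path intuition can be sketched with a toy graph. All node names below are made up for illustration; the real RelDC analysis weighs paths by type rather than merely checking their existence:

```python
from collections import deque

# A toy entity-relationship graph as an adjacency list. In RelDC the
# nodes would be extracted entities (person references, organizations)
# and edges would be extracted relationships.
GRAPH = {
    "J. Smith (page 1)": ["MIT"],
    "MIT": ["J. Smith (page 1)", "J. Smith (page 2)"],
    "J. Smith (page 2)": ["MIT"],
    "J. Smith (page 3)": ["Acme Corp"],
    "Acme Corp": ["J. Smith (page 3)"],
}

def path_exists(graph, start, goal, max_len=4):
    """Breadth-first search: is there a path of length <= max_len?"""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        node, depth = frontier.popleft()
        if node == goal:
            return True
        if depth < max_len:
            for nbr in graph.get(node, []):
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, depth + 1))
    return False
```

Here the shared "MIT" node connects the first two page references (suggesting they co-refer), while page 3 stays disconnected.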

6 Where is the Graph here? Use Extraction!
Unlike in a regular (structured) database, the graph is not readily available here. Solution: from each webpage extract:
- named entities (using GATE)
- hyperlinks/e-mails (parse links/e-mails); these connect the entities, as explained in the paper
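A minimal stand-in for this extraction step (the actual system uses GATE; the regexes below are a toy approximation, and `page_id` is a hypothetical identifier for the source page):

```python
import re

def extract_entities(page_id, text):
    """Toy extraction: pull out capitalized name-like phrases and
    e-mail addresses, and link each one to the page it came from,
    producing edges of an entity-relationship graph."""
    # 1-3 consecutive capitalized words, e.g. "John Smith"
    names = re.findall(r"\b(?:[A-Z][a-z]+ ){1,2}[A-Z][a-z]+\b", text)
    # simple e-mail pattern
    emails = re.findall(r"[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}", text)
    return [(page_id, ent) for ent in names + emails]
```

Running the extractor over every retrieved page and taking the union of the edge lists yields the graph used in the previous slides.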

7 Overall Algorithm Overview
1. User Input. A user submits a query to the middleware via a web-based interface.
2. Web page Retrieval. The middleware queries a search engine's API and gets the top-K Web pages.
3. Preprocessing. The retrieved Web pages are preprocessed:
   - TF/IDF. Preprocessing steps for computing TF/IDF are carried out.
   - Ontology. Ontologies are used to enrich the Webpage content.
4. Extraction. Named entities and Web-related information are extracted from the Webpages.
5. Graph Creation. The Entity-Relationship Graph is generated.
6. Enhanced TF/IDF. Ontology-enhanced TF/IDF values are computed.
7. Clustering. Correlation clustering is applied.
8. Cluster Processing. Each resulting cluster is then processed as follows:
   - Sketches. A set of keywords that represents the Web pages within a cluster is computed for each cluster. The goal is that the user should be able to find the person of interest by looking at the sketch.
   - Cluster Ranking. All clusters are ranked by a chosen criterion to be presented in a certain order to the user.
   - Web page Ranking. Once the user hones in on a particular cluster, the Web pages in this cluster are presented in a certain order, computed in this step.
9. Visualization of Results. The results are presented to the user in the form of clusters (and their sketches) corresponding to namesakes, which can be explored further.
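The stages above can be sketched as a runnable skeleton. Every helper here is a trivial stub standing in for the real component (search engine API, GATE extraction, correlation clustering, and so on); the point is only the order of the stages:

```python
def search_api(query, top_k):            # stage 2: Web page retrieval (stub)
    return [f"page{i} about {query}" for i in range(top_k)]

def preprocess(page):                    # stage 3: TF/IDF prep + ontology (stub)
    return page.lower().split()

def cluster(docs):                       # stage 7: correlation clustering
    return [docs]                        # (stub: one cluster with everything)

def sketch(c):                           # stage 8: keyword sketch for a cluster
    return sorted({w for doc in c for w in doc})[:3]

def people_search(query, top_k=3):
    docs = [preprocess(p) for p in search_api(query, top_k)]
    clusters = cluster(docs)             # the real system also uses the ER graph
    return [(sketch(c), c) for c in clusters]
```

A real middleware would plug the components of the previous slides into each stub.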

8 Correlation Clustering
In CC, each pair of nodes (u,v) is labeled with "+" or "-":
- edge labeling is done according to a similarity function s(u,v)
- if s(u,v) believes u and v are similar, the edge is labeled "+"; else it is labeled "-"
- s(u,v) is typically trained on past data
Clustering looks at the edges and tries to minimize disagreement:
- the disagreement for an element x placed in a cluster C is the number of "-" edges that connect x to the other elements in C
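The disagreement quantity can be written down directly. A minimal sketch, assuming edge labels are stored in a dict keyed by unordered pairs:

```python
def disagreement(x, cluster, labels):
    """Number of '-' edges between element x and the other members of
    its cluster; correlation clustering tries to minimize this
    (together with '+' edges cut across clusters)."""
    return sum(1 for y in cluster
               if y != x and labels.get(frozenset((x, y))) == "-")
```

For example, if x has one "-" edge into its cluster of three elements, its disagreement is 1.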

9 Similarity Function
Connection strength between u and v: c(u,v) = Σk wk·ck, where ck is the number of u-v paths of type k, and wk is the weight of u-v paths of type k. The similarity s(u,v) is a combination of the feature-based similarity and the connection strength.
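The formula above as code. The weighted sum follows the slide; the specific `alpha`-mixing of feature similarity and connection strength is an assumed combination scheme, not taken from the slides:

```python
def connection_strength(path_counts, weights):
    """c(u,v) = sum_k w_k * c_k over path types k."""
    return sum(weights[k] * path_counts.get(k, 0) for k in weights)

def similarity(feature_sim, path_counts, weights, alpha=0.5):
    # One simple way to combine feature-based similarity with the
    # connection strength; `alpha` is a hypothetical mixing parameter.
    return alpha * feature_sim + (1 - alpha) * connection_strength(path_counts, weights)
```

Here `path_counts` maps each path type to its count ck, and `weights` maps it to wk.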

10 Training s(u,v) on pre-labeled data
It is a "linear" optimization problem (LPs have efficient solutions). The system requires that for edges labeled "+" the similarity should exceed a threshold t, and for "-" edges it should be below the threshold. A margin delta is used so the similarity is "clearly" above or below t. Since the inequalities constructed this way might not have a solution that satisfies them all, a slack variable is added to each inequality. The goal is to minimize the overall slack.
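The construction of those inequalities can be sketched as follows. This builds the LP in standard form A x <= b (a sketch of the formulation only; the resulting `A`, `b`, `cost` could then be handed to any LP solver, e.g. `scipy.optimize.linprog`):

```python
def build_lp(pairs, t=0.5, delta=0.1):
    """Variables x = (w_1..w_m, xi_1..xi_n): m path-type weights and
    one slack xi per labeled pair. Objective: minimize sum of slacks.
    pairs: list of (path_count_vector, label), label '+' or '-'."""
    m = len(pairs[0][0])                 # number of path types / weights
    n = len(pairs)                       # number of labeled pairs
    A, b = [], []
    for i, (c_vec, label) in enumerate(pairs):
        slack = [0.0] * n
        slack[i] = -1.0                  # -xi_i on the left-hand side
        if label == "+":                 # w.c + xi >= t + delta
            A.append([-x for x in c_vec] + slack)
            b.append(-(t + delta))
        else:                            # w.c - xi <= t - delta
            A.append(list(c_vec) + slack)
            b.append(t - delta)
    cost = [0.0] * m + [1.0] * n         # minimize total slack
    return A, b, cost
```

Each "+" constraint is rewritten as -w·c - xi <= -(t + delta) so that all rows share the same <= direction.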

11 Experiments: Quality of Disambiguation
These two experiments show quality on two datasets: one from SIGIR'05 (by Artiles, et al.) and one from WWW'05 (by Bekkerman & McCallum, from the well-known group of Andrew McCallum). The "+" values in brackets are improvements over what is reported in those papers. In WWW'05, the authors do a different experiment from what we do in this paper; we implemented that experiment as well and achieve a 9.5% improvement over them (the measure is computed the same way they compute it, so everything is comparable).

12 Experiments: Effect on Search
Effect on Precision, Recall, and F-measure for "Andrew McCallum" for different representative clusters. For the UMass professor, his cluster is first and dominant (large). For the customer-support person, his cluster is small (3 webpages) and appears towards the end of Google's results. In all experiments, the goal is to find all the pages of a given person; the experiments show that the new interface allows the user to do so more quickly.

