Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop.


Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop on Web information and data management 2004)

Abstract They present a technique to group search-engine returned citations for person-name queries. The objective is to put the returned citations in groups such that each group relates to one person. They use a multi-faceted approach that considers evidence from three facets (attributes, links, page similarity). They construct a relatedness confidence matrix for pairs of citations. They merge pairs whose matching confidence value is above a threshold.

Related work The problem is related to cross-document coreferencing and object identity. G. Mann and D. Yarowsky (2003) –They use document vectors over biographical information such as birth year, birth place, and spouse name. S. Tejada (2001) –For object identification, one technique is vector space modeling, and the other is probabilistic modeling.

A multi-faceted approach They use a multi-faceted method to group relevant citations. Each facet represents one aspect of the question of whether two citations reference the same person or different persons. In this paper, they consider attributes about a person, links within and among sites, and page similarity as facets.

Facet 1: Attributes Attributes they found by manual inspection are phone number, address, state, city and zip code. In order to extract values from a web page, they write regular expressions for each attribute.
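A minimal Python sketch of attribute extraction with regular expressions. The two patterns below (US-style phone numbers and ZIP codes) and the helper names `extract_attributes` and `share_value` are illustrative assumptions; the slides do not give the authors' actual expressions.

```python
import re

# Hypothetical patterns; the authors' real regular expressions are not shown in the slides.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})\b")
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def extract_attributes(text):
    """Return the sets of phone numbers and ZIP codes found in the page text."""
    # Phone numbers are normalized to their ten digits so formats can be compared.
    phones = {"".join(m.groups()) for m in PHONE_RE.finditer(text)}
    zips = set(ZIP_RE.findall(text))
    return {"phone": phones, "zip": zips}


def share_value(attrs_a, attrs_b, name):
    """True if the two pages have at least one value of the attribute in common."""
    return bool(attrs_a[name] & attrs_b[name])
```

Normalizing each value before comparing it lets "(801) 422-1234" and "801.422.1234" count as the same phone number.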

Facet 2: Links (1) If two URLs share a common host, they may refer to the same person. If the URL of one citation has the same host as one of the URLs that belongs to the web page referred by the other citation, they may refer to the same person.

Facet 2: Links (2) Because many names often appear on popular hosts, when two citations share a popular host, there is less confidence that they refer to the same person. They need a way to determine whether a host is popular or not. The query link:siteURL in Google shows all pages that point to that URL. A host h is popular for person-name queries if more than 400 pages point to h.
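The link facet can be sketched in Python as follows. The function names, the `inlink_counts` dictionary (standing in for the counts obtained from link: queries), and the three result labels are assumptions made for illustration; only the 400-page threshold comes from the slides.

```python
from urllib.parse import urlparse

POPULARITY_THRESHOLD = 400  # from the slides: popular if more than 400 pages point to the host


def host_of(url):
    """Return the lowercased host part of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def is_popular(host, inlink_counts):
    """inlink_counts is a hypothetical dict mapping host -> number of in-linking pages."""
    return inlink_counts.get(host, 0) > POPULARITY_THRESHOLD


def link_evidence(url_a, url_b, outlinks_b, inlink_counts):
    """Classify the link relationship between citations a and b.

    outlinks_b holds the URLs found on the page referenced by citation b.
    Returns "none", "popular" (shared popular host), or "shared".
    """
    host_a = host_of(url_a)
    candidates = {host_of(url_b)} | {host_of(u) for u in outlinks_b}
    if not host_a or host_a not in candidates:
        return "none"
    return "popular" if is_popular(host_a, inlink_counts) else "shared"
```

This checks one direction only; a caller would presumably also test citation b's host against the links on citation a's page.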

Facet 3: Page Similarity (1) If two different web pages are similar, they may refer to the same person. They use pairs of words that start with a capital letter and that are either adjacent or separated by a connector (and, or, but), by a preposition which may be followed by an article (a, an, the), or by a single capital letter followed by a dot. –David Embley, who is a professor of the Data Extraction Research Group in the Computer Science Department at Brigham Young University.

Facet 3: Page Similarity (2) They construct a stop-word list, which is a list of frequently appearing adjacent cap-word pairs –Home Page, Privacy Policy They collected approximately 10,000 web documents taken at random from the Open Directory Project. They constructed all adjacent cap-word pairs, sorted them by frequency, and treated every pair with a frequency greater than two as a stop word.

Facet 3: Page Similarity (3) They consider the number of adjacent cap-word pairs that two web pages share as an indicator of the similarity between the pages. The greater the number of shared adjacent cap-word pairs, the greater the similarity between the pages.
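A rough Python sketch of the cap-word-pair facet. The tokenization, the partial preposition list, and the reading of similarity as the count of pairs the two pages have in common are assumptions; the slides describe the idea but not these details.

```python
import re

CONNECTORS = {"and", "or", "but"}
PREPOSITIONS = {"of", "in", "at", "for", "on", "to", "with", "by"}  # assumed, partial list
ARTICLES = {"a", "an", "the"}


def is_cap(tok):
    """True for an alphabetic token that starts with a capital letter."""
    return tok[:1].isupper() and tok.isalpha()


def cap_word_pairs(text, stop_pairs=frozenset()):
    """Collect pairs of capitalized words that are adjacent or separated by a
    connector, a preposition (optionally followed by an article), or an initial."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    n = len(tokens)
    pairs = set()
    for i, tok in enumerate(tokens):
        if not is_cap(tok):
            continue
        j = i + 1
        if j < n and tokens[j] in CONNECTORS:
            j += 1
        elif j < n and tokens[j] in PREPOSITIONS:
            j += 1
            if j < n and tokens[j] in ARTICLES:
                j += 1
        elif j + 1 < n and len(tokens[j]) == 1 and tokens[j].isupper() and tokens[j + 1] == ".":
            j += 2  # skip a middle initial such as "W."
        if j < n and is_cap(tokens[j]):
            pair = (tok, tokens[j])
            if pair not in stop_pairs:
                pairs.add(pair)
    return pairs


def page_similarity(text_a, text_b, stop_pairs=frozenset()):
    """Number of non-stop cap-word pairs that both pages contain."""
    return len(cap_word_pairs(text_a, stop_pairs) & cap_word_pairs(text_b, stop_pairs))
```

Passing a stop list such as `{("Home", "Page"), ("Privacy", "Policy")}` keeps boilerplate pairs from inflating the similarity count.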

Confidence Matrix Construction (1) They construct one confidence matrix for each facet. First, they construct a training set to compute the conditional probabilities. The names in the training set must satisfy some restrictions. –They should include male, female, and gender-neutral names. –They should include names whose returned citations fall into groups of different sizes. –They should include names whose returned citations fall into different numbers of groups. They entered each of the 9 names as a query to Google and collected the first 50 returned citations for each name.

Confidence Matrix Construction (2) They use the training set to estimate the conditional probabilities. P(Same Person = “Yes” | = “Yes”) P(Same Person = “Yes” | City = “Yes” and State = “Yes”)
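One way such a conditional probability could be estimated from labelled citation pairs is a simple relative frequency, sketched below in Python. The data layout (a feature dictionary plus a same-person label per pair) and the function name `estimate_confidence` are assumptions, not the authors' implementation.

```python
def estimate_confidence(training_pairs, evidence):
    """Estimate P(Same Person = "Yes" | every attribute in `evidence` matches).

    training_pairs: iterable of (features, same_person), where features maps an
        attribute name to True when the two citations agree on it, and
        same_person is the hand-assigned boolean label.
    evidence: attribute names that must all match, e.g. ("city", "state").
    Returns None when no training pair exhibits the evidence.
    """
    labels = [same for feats, same in training_pairs
              if all(feats.get(name, False) for name in evidence)]
    if not labels:
        return None
    # Fraction of evidence-bearing pairs that really refer to the same person.
    return sum(labels) / len(labels)
```

The value obtained for a piece of evidence then serves as the matrix entry for every citation pair that shows that evidence.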

Final Confidence Matrix They generate the final confidence matrix by combining the confidence matrices for the three facets using Stanford certainty theory. Stanford certainty theory gives the following rule to combine the evidence from two independent observations. Suppose CF(E1) is the certainty factor associated with evidence E1 for some observation B, and CF(E2) is the certainty factor associated with evidence E2. The compounded CF of B is calculated by CF(E1)+CF(E2)-(CF(E1)*CF(E2)).
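The combination rule is easy to check in code. The Python sketch below applies it pairwise and folds it over any number of facets; the function names are illustrative, and the sample values 0.96, 0 and 0.78 are the ones that appear in the worked example of the slides.

```python
from functools import reduce


def combine_cf(cf1, cf2):
    """Stanford certainty rule for two independent pieces of positive evidence."""
    return cf1 + cf2 - cf1 * cf2


def combine_all(cfs):
    """Fold the pairwise rule over several certainty factors.

    Starting from 0.0 is safe because combine_cf(0.0, x) == x.
    """
    return reduce(combine_cf, cfs, 0.0)


# Three facets with confidences 0.96, 0 and 0.78 combine to about 0.9912.
final = combine_all([0.96, 0.0, 0.78])
```

Because the rule is commutative and associative, the order in which the facets are combined does not matter, and inputs in [0, 1] always yield a result in [0, 1].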

Grouping Algorithm If there is high confidence between two citations Ci, Cj, they are grouped into a set S1. If there is high confidence between two citations Cj, Ck, they are grouped into a set S2. Because S1 and S2 share one or more citations, they are merged into one group S3. Keep merging any two sets of citations that share one or more citations until no citation is shared between any two sets. The confidence threshold is 0.8.
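Merging sets that share a citation until nothing more can be merged amounts to finding connected components over the high-confidence pairs. The Python sketch below does this with a small union-find structure, which is a swapped-in technique rather than the authors' stated implementation; the strict "greater than" comparison against the 0.8 threshold is also an assumption.

```python
def group_citations(confidence, threshold=0.8):
    """Group citation indices whose pairwise confidence exceeds the threshold.

    confidence: symmetric n x n list of lists (the final confidence matrix).
    Returns a sorted list of groups, each a sorted list of citation indices.
    """
    n = len(confidence)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if confidence[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two sets

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With citations 0 and 1 linked at 0.9 and citations 1 and 2 linked at 0.85, all three land in one group even though 0 and 2 have no direct high-confidence link.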

Example (1) They apply their technique to the first 10 returned citations for the person-name query “ Kelly Flanagan ”. Pages referenced by the two citations C4 and C7 have the same city and state. They have P( Same Person = “ Yes ” | City = “ Yes ” and State = “ Yes ” )=0.96.

Example (2) The final confidence value between citations C1 and C8, using Stanford certainty theory, is 0.96 + 0 + 0.78 - 0.96*0 - 0.96*0.78 - 0.78*0 + 0.96*0*0.78 = 0.9912.

Experimental results (1) They chose 10 names by opening an arbitrary page from a phone book and choosing an arbitrary name from the page. The system returned the grouping result for the first 50 returned citations for each name. The test set contains 500 citations.

Experimental results (2) To evaluate the performance of their system, they use split and merge measures. First, they count how many splits are needed over all the groups so that the citations in each group relate to one person. Then, they count how many merges are needed between the groups so that no two groups relate to the same person. They normalize the split and merge scores to range between 0 and 1. See the evaluation example at the end of the slides.

Experimental results (3)

Experimental results (4) Using a multi-faceted approach gives much better performance than using each facet separately. For groups that should have been merged, no evidence or only weak evidence was found to group them. A human expert may look at pictures or rely on a deeper understanding of the meaning of distinguishing phrases.

Concluding remarks They designed and implemented a system that can automatically group the returned citations from a search engine person-name query. They used a multi-faceted approach that considers three facets. They gave experimental evidence to show that their approach can be successful.

Evaluation example Correct grouping result for 8 citations: –G1: {C1, C2, C4, C6, C7} –G2: {C3, C8} –G3: {C5} The grouping result of their system: –G1: {C1, C2, C4} –G2: {C3, C6, C7} –G3: {C5, C8} The number of splits over all the groups is 0+1+1=2, and the total number of merges is 2.
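The counts in this example can be reproduced with a short Python sketch. It encodes one reading of the two measures (splits per system group = number of distinct persons in it minus one; merges = pieces left after splitting minus the number of correct groups). That reading, the function name, and the omission of the normalization step are assumptions; the slides do not spell out the exact definitions.

```python
def split_merge_counts(truth, predicted):
    """Count splits and merges needed to turn `predicted` into `truth`.

    truth, predicted: lists of non-empty sets of citation ids covering the
    same citations. Returns (splits, merges), both unnormalized.
    """
    person_of = {c: idx for idx, group in enumerate(truth) for c in group}
    splits = 0
    pieces = 0
    for group in predicted:
        persons = {person_of[c] for c in group}
        splits += len(persons) - 1   # cut the group once per extra person
        pieces += len(persons)       # pure pieces left after cutting
    merges = pieces - len(truth)     # join pieces that belong to one person
    return splits, merges


truth = [{"C1", "C2", "C4", "C6", "C7"}, {"C3", "C8"}, {"C5"}]
system = [{"C1", "C2", "C4"}, {"C3", "C6", "C7"}, {"C5", "C8"}]
result = split_merge_counts(truth, system)  # (2, 2), matching the slide
```

A perfect grouping yields (0, 0) under this reading.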