Grouping search-engine returned citations for person-name queries
Reema Al-Kamha, David W. Embley
(Proceedings of the 6th annual ACM international workshop on Web information and data management, 2004)

Abstract They present a technique to group search-engine returned citations for person-name queries. The objective is to put the returned citations in groups such that each group relates to one person. They use a multi-faceted approach that considers evidence from three facets (attributes, links, page similarity). They construct a relatedness confidence matrix for pairs of citations and merge pairs whose matching confidence value is above a threshold.

Related work The problem is related to cross-document coreferencing and object identity.
–G. Mann and D. Yarowsky (2003): they use document vectors over biographical information such as birth year, birth place, and spouse name.
–S. Tejada (2001): for object identification, one technique is vector space modeling, and the other is probabilistic modeling.

A multi-faceted approach They use a multi-faceted method to group relevant citations. Each facet captures one kind of evidence about whether two citations reference the same person or different persons. In this paper, they consider three facets: attributes about a person, links within and among sites, and page similarity.

Facet 1: Attributes The attributes they identified by manual inspection are phone number, email address, state, city, and zip code. To extract values from a web page, they write a regular expression for each attribute.
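The attribute facet can be illustrated with a minimal Python sketch. The patterns below are assumptions written for this example (US-style phone numbers and zip codes, a simplified email pattern); the slides do not show the authors' actual expressions. City and state are left out because recognizing them would need a place-name list that the slides do not describe.

```python
import re

# Hypothetical patterns for illustration only; not the authors' expressions.
ATTRIBUTE_PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract_attributes(page_text):
    """Return the set of values found on a page for each attribute."""
    return {
        name: set(pattern.findall(page_text))
        for name, pattern in ATTRIBUTE_PATTERNS.items()
    }

def shared_attributes(text_a, text_b):
    """Return the names of attributes for which two pages share a value."""
    attrs_a, attrs_b = extract_attributes(text_a), extract_attributes(text_b)
    return {name for name in ATTRIBUTE_PATTERNS if attrs_a[name] & attrs_b[name]}
```

Two pages that both mention the same email address would yield `{"email"}` from `shared_attributes`, which is the kind of "Email = Yes" evidence the later slides condition on.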

Facet 2: Links (1) If the URLs of two citations share a common host, they may refer to the same person. Likewise, if the URL of one citation has the same host as one of the URLs that appear on the web page referenced by the other citation, they may refer to the same person.

Facet 2: Links (2) Because many names appear on popular hosts, there is less confidence that two citations refer to the same person when the host they share is popular. They therefore need a way to determine whether a host is popular. The query link:siteURL in Google shows all pages that point to that URL. A host h is popular for person-name queries if more than 400 pages point to h.
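The two link rules and the popularity test can be sketched as follows. This is a rough illustration, not the authors' code: `inlink_counts` is a hypothetical, pre-collected mapping from host to the number of pages pointing at it (standing in for the link: query), URLs are assumed to be absolute, and the choice to report "non-popular" whenever at least one shared host is non-popular is an assumption of this sketch.

```python
from urllib.parse import urlparse

POPULARITY_THRESHOLD = 400  # a host is popular if more than 400 pages point to it

def host_of(url):
    """Return the lower-cased host part of an absolute URL."""
    return urlparse(url).netloc.lower()

def link_evidence(url_a, links_on_a, url_b, links_on_b, inlink_counts):
    """Classify the link evidence between two citations.

    Returns "none", "popular", or "non-popular" depending on whether the
    citations share a host and how popular that host is.
    """
    host_a, host_b = host_of(url_a), host_of(url_b)
    shared = set()
    if host_a == host_b:                            # rule 1: same host
        shared.add(host_a)
    if host_a in {host_of(u) for u in links_on_b}:  # rule 2: b's page links to a's host
        shared.add(host_a)
    if host_b in {host_of(u) for u in links_on_a}:  # rule 2: a's page links to b's host
        shared.add(host_b)
    if not shared:
        return "none"
    # Assumption: one non-popular shared host is enough for the stronger evidence.
    if any(inlink_counts.get(h, 0) <= POPULARITY_THRESHOLD for h in shared):
        return "non-popular"
    return "popular"
```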

Facet 3: Page Similarity (1) If two different web pages are similar, they may refer to the same person. To measure similarity, they use pairs of words that each start with a capital letter (cap-words) and that are either adjacent or separated by a connector (and, or, but), by a preposition that may be followed by an article (a, an, the), or by a single capital letter followed by a dot.
–Example: David Embley, who is a professor of the Data Extraction Research Group in the Computer Science Department at Brigham Young University.

Facet 3: Page Similarity (2) They construct a stop-word list, which is a list of frequently appearing adjacent cap-word pairs.
–Home Page, Privacy Policy
They collected approximately 10,000 web documents taken at random from the Open Directory Project, constructed all adjacent cap-word pairs, and sorted the pairs by frequency. Every pair with a frequency greater than two is treated as a stop word.
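The stop-word construction can be sketched in a few lines of Python. The function name and input shape are hypothetical: each document is assumed to be already reduced to its list of cap-word pairs, and frequency is assumed to mean total occurrences across the collection (the slides do not say whether occurrences or documents are counted).

```python
from collections import Counter

def build_stop_pairs(pairs_per_document):
    """Return the cap-word pairs whose frequency in the collection exceeds two.

    pairs_per_document: iterable of lists of (word, word) tuples, one list per document.
    """
    counts = Counter(pair for pairs in pairs_per_document for pair in pairs)
    return {pair for pair, frequency in counts.items() if frequency > 2}
```

Pairs such as ("Home", "Page") that recur across unrelated documents end up in the returned set, while a pair seen only once or twice does not.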

Facet 3: Page Similarity (3) They take the number of adjacent cap-word pairs that two web pages share as an indicator of the similarity between the pages. The greater the number of shared cap-word pairs, the greater the similarity between the pages.
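A simplified sketch of the page-similarity facet follows. It is an interpretation of the pairing rule in the slides, not the authors' implementation: the tokenizer, the short preposition list, and the function names are assumptions made for this example, and a cap-word is taken to be a capitalized word of at least two letters.

```python
import re

CONNECTORS = {"and", "or", "but"}
ARTICLES = {"a", "an", "the"}
# Assumption: a small illustrative preposition list; the slides do not enumerate one.
PREPOSITIONS = {"of", "in", "at", "for", "on", "with", "by", "to", "from"}

def tokenize(text):
    """Split into words; every other non-space character is its own token,
    so punctuation between two words breaks adjacency."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)

def is_cap_word(token):
    return len(token) > 1 and token.isalpha() and token[0].isupper()

def cap_word_pairs(text):
    """Collect cap-word pairs that are adjacent or separated by a connector,
    a preposition (optionally followed by an article), or an initial such as 'W.'."""
    tokens = tokenize(text)

    def at(k):
        return tokens[k] if k < len(tokens) else ""

    pairs = set()
    for i, first in enumerate(tokens):
        if not is_cap_word(first):
            continue
        nxt = at(i + 1)
        if is_cap_word(nxt):
            j = i + 1
        elif nxt.lower() in CONNECTORS:
            j = i + 2
        elif nxt.lower() in PREPOSITIONS:
            j = i + 3 if at(i + 2).lower() in ARTICLES else i + 2
        elif len(nxt) == 1 and nxt.isupper() and at(i + 2) == ".":
            j = i + 3
        else:
            continue
        if is_cap_word(at(j)):
            pairs.add((first, at(j)))
    return pairs

def shared_pair_count(text_a, text_b, stop_pairs=frozenset()):
    """Number of non-stop cap-word pairs that appear on both pages."""
    return len((cap_word_pairs(text_a) & cap_word_pairs(text_b)) - set(stop_pairs))
```

On the example sentence about David Embley, this sketch yields pairs such as (David, Embley), (Group, Computer), and (Department, Brigham), the last two bridging a preposition.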

Confidence Matrix Construction (1) They construct one confidence matrix for each facet. First, they build a training set from which to compute the conditional probabilities. The names in the training set must satisfy some restrictions.
–They should include male, female, and gender-neutral names.
–They should include names whose returned citations fall into groups of different sizes.
–They should include names whose returned citations fall into different numbers of groups.
They entered each of the 9 names as a query for Google and collected the first 50 returned citations for each name.

Confidence Matrix Construction (2) They use the training set to estimate the conditional probabilities, for example:
–P(Same Person = “Yes” | Email = “Yes”)
–P(Same Person = “Yes” | City = “Yes” and State = “Yes”)
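One plausible way to estimate such a conditional probability is a simple relative frequency over labeled citation pairs. The sketch below assumes a hypothetical training format (one dict per citation pair with boolean evidence flags and a `same_person` label); the slides do not describe the authors' data layout or estimator in detail.

```python
def estimate_probability(training_pairs, evidence):
    """Estimate P(Same Person = Yes | evidence holds) by relative frequency.

    training_pairs: list of dicts with boolean flags and a "same_person" label.
    evidence: function that takes one pair and returns True if the evidence holds.
    Returns None when no training pair exhibits the evidence.
    """
    matching = [pair for pair in training_pairs if evidence(pair)]
    if not matching:
        return None
    same = sum(1 for pair in matching if pair["same_person"])
    return same / len(matching)
```

For example, `estimate_probability(pairs, lambda p: p["city"] and p["state"])` would estimate the second probability listed above from whatever labeled pairs are available.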

Final Confidence Matrix They generate the final confidence matrix by combining the confidence matrices for the three facets using Stanford certainty theory, which gives the following rule for combining the evidence from two independent observations. Suppose CF(E1) is the certainty factor associated with evidence E1 for some observation B, and CF(E2) is the certainty factor associated with evidence E2 for the same observation. The compounded CF of B is CF(E1) + CF(E2) - CF(E1)*CF(E2).
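The combination rule is short enough to sketch directly. The function names are illustrative, and the sketch assumes every certainty factor lies between 0 and 1, which is the case the slides describe.

```python
from functools import reduce

def combine(cf1, cf2):
    """Compound two certainty factors for the same observation."""
    return cf1 + cf2 - cf1 * cf2

def combine_all(factors):
    """Fold the pairwise rule over any number of certainty factors."""
    return reduce(combine, factors, 0.0)
```

Folding the rule over three factors gives the same value as the expanded three-term expression, and a factor of 0 (no evidence from a facet) leaves the combined value unchanged.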

Grouping Algorithm If the confidence between two citations Ci and Cj is high, they are grouped into a set S1. If the confidence between two citations Cj and Ck is high, they are grouped into a set S2. Because S1 and S2 share a citation (Cj), they are merged into one group S3. The algorithm keeps merging any two sets that share one or more citations until no citation is shared between any two sets. The confidence threshold is 0.8.
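Repeatedly merging sets that share a citation produces the connected components of the graph whose edges are the high-confidence pairs. The sketch below computes those components with a union-find structure, which is a substitute technique chosen for brevity rather than the authors' stated procedure. It assumes the final confidence matrix is a square list of lists and that "high" means strictly above the threshold.

```python
def group_citations(confidence, threshold=0.8):
    """Group citation indices so that any pair above the threshold ends up together."""
    n = len(confidence)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if confidence[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two sets

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With high confidence only between citations 0 and 1 and between 1 and 2, the result places 0, 1, and 2 in one group even though 0 and 2 were never directly matched.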

Example (1) They apply their technique to the first 10 returned citations for the person-name query “Kelly Flanagan”. The pages referenced by the two citations C4 and C7 have the same city and state. They have P(Same Person = “Yes” | City = “Yes” and State = “Yes”) = 0.96.

Example (2) The final confidence value between citations C1 and C8, using Stanford certainty theory, is 0.96 + 0 + 0.78 - 0.96*0 - 0.96*0.78 - 0.78*0 + 0.96*0*0.78 = 0.9912.

Experimental results (1) They chose 10 names by opening an arbitrary page of a phone book and choosing an arbitrary name from the page. The system returned the grouping result for the first 50 returned citations for each name, so the test set contains 500 citations.

Experimental results (2) To evaluate the performance of their system, they use split and merge measures. First, they count how many splits are needed over all the groups so that the citations in each group relate to one person. Then, they count how many merges are needed between the groups so that no two groups relate to the same person. They normalize the split and merge scores to range between 0 and 1. See the evaluation example at the end of the presentation.

Experimental results (3)

Experimental results (4) Using the multi-faceted approach gives much better performance than using each facet separately. For groups that should have been merged, no evidence or only weak evidence was found to group them. A human expert might look at pictures or rely on a deeper understanding of the meaning of distinguishing phrases.

Concluding remarks They designed and implemented a system that can automatically group the returned citations from a search engine person-name query. They used a multi-faceted approach that considers three facets. They gave experimental evidence to show that their approach can be successful.

Evaluation example Correct grouping result for 8 citations:
–G1: {C1, C2, C4, C6, C7}
–G2: {C3, C8}
–G3: {C5}
The grouping result of their system:
–G1: {C1, C2, C4}
–G2: {C3, C6, C7}
–G3: {C5, C8}
The number of splits over all the groups is 0+1+1=2, and the total number of merges is 2.
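The raw counts in this example can be reproduced with a short sketch. The counting rule here is an interpretation that matches the example: a predicted group containing citations of k different people needs k - 1 splits, and a person whose citations are spread over m predicted groups needs m - 1 merges. The function name is hypothetical, groups are assumed to be non-empty, and the normalization to the range 0 to 1 is not shown because the slides do not give its formula.

```python
def split_merge_counts(correct, predicted):
    """Return (splits, merges) needed to turn the predicted grouping into the correct one."""
    # Map each citation to the index of the person (correct group) it belongs to.
    person_of = {c: i for i, group in enumerate(correct) for c in group}

    splits = 0
    fragments = {}  # person index -> number of predicted groups containing that person
    for group in predicted:
        persons = {person_of[c] for c in group}
        splits += len(persons) - 1
        for person in persons:
            fragments[person] = fragments.get(person, 0) + 1

    merges = sum(count - 1 for count in fragments.values())
    return splits, merges

correct = [{"C1", "C2", "C4", "C6", "C7"}, {"C3", "C8"}, {"C5"}]
predicted = [{"C1", "C2", "C4"}, {"C3", "C6", "C7"}, {"C5", "C8"}]
print(split_merge_counts(correct, predicted))  # (2, 2)
```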
