Wikipedia as Sense Inventory to Improve Diversity in Web Search Results
Celina Santamaria, Julio Gonzalo, Javier Artiles
nlp.uned.es, UNED, c/ Juan del Rosal, 16, 28040 Madrid, Spain
celina.santamaria@gmail.com, julio@lsi.uned.es, javart@bec.uned.es
ACL 2010

Introduction
Word Sense Disambiguation (WSD)
– Promote diversity in the search results
– Present the results as a set of clusters
– Complement search results with search suggestions
Two lexical resources
– Wikipedia
– WordNet 3.0

Introduction

Problem
– Coverage
– Estimate search results diversity using our senses
– Sense frequencies
– Classification

Test Set
Nouns that are likely to be used as one-word queries
Nouns that denote one or more named entities
40 nouns
– 15 nouns from the Senseval-3 lexical sample dataset
– 25 nouns that satisfy two conditions
1. They are ambiguous
2. They are names of music bands in one of their senses

Test Set
Average of 22 senses per noun in Wikipedia
Average of 4.5 senses per noun in WordNet
Wikipedia has a larger coverage
Retrieve 150 documents for each noun (Google)
Annotate each document with the senses of each of the two dictionaries

Coverage of Web Search Results
If we focus on the top ten results, in the band subset Wikipedia covers 68% of the top ten documents
In the top ten results that are not covered by Wikipedia
– a majority of the missing senses consists of names of companies (45%) and products or services (26%)
– the other frequent type (12%) of non-annotated document is disambiguation pages

Coverage of Web Search Results
Wikipedia seems to extend the coverage of WordNet rather than providing complementary sense information
If we want to extend the coverage of Wikipedia, the best strategy seems to be to consider lists of companies, products and services

Diversity in Google Search Results
Use Wikipedia senses to test how well search results respect diversity in terms of this subset of senses
63% of the pages in search results belong to the most frequent sense of the query word
Diversity may not play a major role in the current Google ranking algorithm

Sense Frequency Estimators for Wikipedia
Frequency information is crucial in a lexicon
But Wikipedia does not provide the relative importance of senses for a given word
Attempt to use two estimators of the expected sense distribution
– Incoming links for the sense page
– The number of visits for the sense page (May, June and July 2009, http://stats.grok.se/)
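Either estimator yields a raw count per sense page, which can be normalized into a relative frequency per sense. The following is a minimal sketch of that normalization, not the authors' implementation; the function name, the sense labels and the counts are hypothetical.

```python
def estimate_sense_distribution(counts):
    """Turn raw per-sense counts (incoming links or page visits)
    into relative frequencies that sum to 1."""
    if not counts:
        return {}
    total = sum(counts.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution (assumption).
        return {sense: 1.0 / len(counts) for sense in counts}
    return {sense: count / total for sense, count in counts.items()}


# Hypothetical incoming-link counts for three senses of one noun.
inlinks = {"sense_a": 300, "sense_b": 100, "sense_c": 100}
distribution = estimate_sense_distribution(inlinks)
```

The same function applies unchanged to visit counts, so the two estimators can be compared on the same footing.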

Association of Wikipedia Senses to Web Pages
Test whether the information can be used to classify search results accurately
Approaches that require manually created training data are not considered
Given a web page p and the set of senses w1, …, wn listed in Wikipedia
Approaches
1. Vector Space Model (VSM)
2. Word Sense Disambiguation (WSD) system
3. Random
4. Assign the most frequent sense to all documents

VSM
Represent the page in a vector space model (tf*idf weights)
VSM: compute idf in the collection of retrieved documents
VSM-GT: use the statistics provided by the Google Terabyte collection
VSM-mix: combine statistics from the collection and from the Google Terabyte collection
VSM-GT+freq
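To make the plain VSM variant concrete, here is a minimal sketch under stated assumptions: pages and senses are already tokenized, each sense is represented by a bag of words (for example, taken from its Wikipedia page, which is an assumption here), idf is computed over the retrieved pages only, and the page is assigned the sense with the highest cosine similarity. All names and data are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """idf computed over the collection of retrieved documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf(tokens, idf):
    """tf*idf vector as a sparse dict; unseen terms get weight 0."""
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(page_tokens, sense_tokens, idf):
    """Return (best sense, similarity) for one page."""
    page_vec = tfidf(page_tokens, idf)
    scores = {
        sense: cosine(page_vec, tfidf(tokens, idf))
        for sense, tokens in sense_tokens.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy data: two retrieved pages and two candidate senses.
pages = [["album", "tour", "tickets"], ["juice", "fruit", "recipe"]]
senses = {
    "band": ["rock", "album", "tour", "guitar"],
    "fruit": ["tree", "fruit", "juice", "orchard"],
}
idf = idf_table(pages)
labels = [classify(page, senses, idf)[0] for page in pages]
```

Swapping `idf_table` for statistics from a larger external collection would correspond to the VSM-GT idea; that collection is not reproduced here.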

WSD system
Extract learning examples from Wikipedia automatically
Disambiguate all occurrences of word w in the page p
TiMBL-core: use only the examples found in the Wikipedia page
TiMBL-inlinks: use the examples found in Wikipedia pages pointing to the page
TiMBL-all: use both sources of examples
TiMBL-core+freq

Classification Results
VSM is a simpler and more efficient approach
The results may indicate that using frequency estimates is only helpful up to a certain precision ceiling

Precision/Coverage Trade-off
All systems assign a sense to every document in the test collection
It is possible to enhance search results diversity without annotating every document
Set a threshold between 0.00 and 0.90
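The trade-off can be illustrated with a short sketch. It assumes that the threshold is applied to the classifier's similarity score, that coverage means the fraction of documents that keep a sense assignment, and that precision is measured on those documents only. The function name and the sample predictions are hypothetical.

```python
def precision_and_coverage(predictions, threshold):
    """predictions: list of (predicted_sense, score, gold_sense) tuples.
    Documents scoring below the threshold are left unannotated."""
    kept = [(pred, gold) for pred, score, gold in predictions if score >= threshold]
    coverage = len(kept) / len(predictions) if predictions else 0.0
    precision = sum(pred == gold for pred, gold in kept) / len(kept) if kept else 0.0
    return precision, coverage


# Hypothetical classifier output for four documents.
predictions = [
    ("a", 0.9, "a"),
    ("b", 0.6, "b"),
    ("a", 0.3, "b"),
    ("b", 0.1, "a"),
]
no_threshold = precision_and_coverage(predictions, 0.0)
with_threshold = precision_and_coverage(predictions, 0.5)
```

In this toy data, raising the threshold drops the two low-scoring (and wrong) assignments, so precision rises while coverage falls.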

Using Classification to Promote Diversity
Use our best classifier (VSM-GT+freq)
Make a list of the top ten documents
– Maximize the number of senses
– Maximize the similarity scores of the documents to their assigned senses
Algorithm
1. Fill each position in the rank with the document that has the highest similarity to a sense not yet represented in the rank
2. Once all senses are represented, start choosing a second representative for each sense
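The two-step algorithm above can be sketched as a round-robin over senses. This is an illustrative reading of the slide, not the authors' code: the input format (document id, assigned sense, similarity score) and the ordering of documents within a round by similarity are assumptions.

```python
def diversify(classified, k=10):
    """classified: list of (doc_id, sense, similarity) tuples.
    Returns up to k tuples, covering as many senses as possible first."""
    # Group documents by sense, best similarity first within each sense.
    by_sense = {}
    for item in sorted(classified, key=lambda x: x[2], reverse=True):
        by_sense.setdefault(item[1], []).append(item)

    ranking = []
    while len(ranking) < k and any(by_sense.values()):
        # One round takes the best remaining document of every sense,
        # so round 1 gives first representatives, round 2 second ones, etc.
        round_picks = [docs.pop(0) for docs in by_sense.values() if docs]
        round_picks.sort(key=lambda x: x[2], reverse=True)
        ranking.extend(round_picks)
    return ranking[:k]


# Hypothetical classifier output for five retrieved documents.
classified = [
    ("d1", "band", 0.9),
    ("d2", "band", 0.8),
    ("d3", "fruit", 0.5),
    ("d4", "band", 0.7),
    ("d5", "company", 0.4),
]
top = [doc_id for doc_id, _, _ in diversify(classified, k=4)]
```

With `k=4`, the three senses each get one representative before the second "band" document is admitted.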

Using Classification to Promote Diversity
Other approaches
– Clustering (centroids)
– Clustering (top ranked)
– Random
– Upper bound

Using Classification to Promote Diversity
coverage = the number of senses in the top ten results / the number of senses in all search results
Using Wikipedia to enhance diversity seems to work much better than clustering
Note: our evaluation has a bias towards using Wikipedia, because only Wikipedia senses are considered to estimate diversity
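The coverage ratio defined above is straightforward to compute once each document carries a sense label. A minimal sketch follows; the function name and the sample labels are hypothetical.

```python
def sense_coverage(top_senses, all_senses):
    """Distinct senses in the top results divided by
    distinct senses in all search results."""
    all_distinct = set(all_senses)
    if not all_distinct:
        return 0.0
    return len(set(top_senses) & all_distinct) / len(all_distinct)


# Hypothetical sense labels: the top results cover 2 of the 4 senses found overall.
top_labels = ["band", "band", "fruit"]
all_labels = ["band", "band", "fruit", "company", "film", "band"]
ratio = sense_coverage(top_labels, all_labels)
```

Because only labeled senses enter the ratio, documents with no sense in the inventory do not affect it, which is the source of the bias noted on the slide.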

Conclusion
Wikipedia has a much better coverage than WordNet
The distribution of senses can be estimated
Search results diversity for one-word queries can be improved with a simple and efficient algorithm
Our results do not imply that the Wikipedia-modified rank is better than the original Google rank
