Wikipedia as Sense Inventory to Improve Diversity in Web Search Results. Celina Santamaría, Julio Gonzalo, Javier Artiles. nlp.uned.es, UNED, c/Juan del Rosal,

Presentation transcript:

Wikipedia as Sense Inventory to Improve Diversity in Web Search Results. Celina Santamaría, Julio Gonzalo, Javier Artiles. nlp.uned.es, UNED, c/Juan del Rosal, 16, Madrid, Spain. ACL 2010

Introduction: Word Sense Disambiguation (WSD) – Promoting diversity in the search results – Presenting the results as a set of clusters – Complementing search results with search suggestions. Two lexical resources: – Wikipedia – WordNet 3.0

Introduction

Problem – Coverage – Estimate search results diversity using our senses – Sense frequencies – Classification

Test Set: The words are likely to form a one-word query and denote one or more named entities. 40 nouns: – 15 nouns from the Senseval-3 lexical sample dataset – 25 nouns which satisfy two conditions: 1. They are ambiguous 2. They are all names of music bands in one of their senses

Test Set: An average of 22 senses per noun in Wikipedia. An average of 4.5 senses per noun in WordNet. Wikipedia has larger coverage. Retrieve 150 documents for each noun (Google). Annotate each document with respect to each of the dictionaries.

Coverage of Web Search Results: If we focus on the top ten results, in the band subset Wikipedia covers 68% of the top ten documents. In the top ten results that are not covered by Wikipedia: – a majority of the missing senses consists of names of companies (45%) and products or services (26%) – the other frequent type (12%) of non-annotated document is disambiguation pages

Coverage of Web Search Results: Wikipedia seems to extend the coverage of WordNet rather than providing complementary sense information. If we want to extend the coverage of Wikipedia, the best strategy seems to be to consider lists of companies, products and services.

Diversity in Google Search Results: Use Wikipedia senses to test how well search results respect diversity in terms of this subset of senses. 63% of the pages in search results belong to the most frequent sense of the query word. Diversity may not play a major role in the current Google ranking algorithm.

Sense Frequency Estimators for Wikipedia: Frequency information is crucial in a lexicon. However, Wikipedia does not provide the relative importance of senses for a given word. We attempt to use two estimators of the expected sense distribution: – Incoming links for the sense page – The number of visits for the sense page (May, June and July
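
Either estimator yields raw per-sense counts, which can then be normalized into a relative frequency distribution. The following is a minimal sketch of that normalization, not the authors' implementation; the function name, the sense labels and the counts are hypothetical, and the uniform fallback for all-zero counts is an assumption.

```python
def estimate_sense_distribution(counts):
    """Normalize raw per-sense counts (e.g. incoming links or page visits)
    into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: with no evidence at all, fall back to a uniform distribution.
        return {sense: 1.0 / len(counts) for sense in counts}
    return {sense: count / total for sense, count in counts.items()}


# Hypothetical incoming-link counts for three senses of one word.
inlinks = {"sense_a": 600, "sense_b": 300, "sense_c": 100}
distribution = estimate_sense_distribution(inlinks)
```

The same function could be applied to visit counts, which allows the two estimators to be compared on equal footing.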

Association of Wikipedia Senses to Web Pages: Test whether the information can be used to classify search results accurately. Approaches that require manually created training data are not considered. Input: a web page p and the set of senses w1, …, wn listed in Wikipedia. Approaches: 1. Vector Space Model (VSM) 2. Word Sense Disambiguation (WSD) system 3. Random 4. Assign the most frequent sense to all documents

VSM: Represent pages in a vector space model (tf*idf weights). VSM: compute idf in the collection of retrieved documents. VSM-GT: use the statistics provided by the Google Terabyte collection. VSM-mix: combine statistics from the collection and from the Google Terabyte collection. VSM-GT+freq
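
The basic VSM variant can be illustrated with a small sketch: pages and sense descriptions become tf*idf vectors, and a page receives the sense with the highest cosine similarity. This is an illustrative toy under stated assumptions, not the system from the slides: the smoothed idf formula, the function names and the token lists are all hypothetical, and idf is computed over the toy collection itself.

```python
import math
from collections import Counter


def compute_idf(docs):
    """idf per term over a collection of token lists.
    Assumption: log(N / df) + 1, so terms present in every document keep some weight."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / df[term]) + 1.0 for term in df}


def tfidf_vector(tokens, idf):
    """Sparse tf*idf vector as a dict; terms missing from idf get weight 0."""
    tf = Counter(tokens)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(page_tokens, sense_tokens, idf):
    """Return (best_sense, similarity) for one page."""
    page_vec = tfidf_vector(page_tokens, idf)
    scores = {sense: cosine(page_vec, tfidf_vector(tokens, idf))
              for sense, tokens in sense_tokens.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical sense descriptions and web page.
senses = {"band": ["rock", "band", "album", "tour"],
          "fruit": ["fruit", "tree", "juice"]}
page = ["new", "album", "tour", "dates"]
idf = compute_idf(list(senses.values()) + [page])
best_sense, score = classify(page, senses, idf)
```

Swapping `compute_idf` for statistics taken from a larger external collection would correspond to the idea behind the VSM-GT variant, although the exact weighting used there is not given in the slides.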

WSD system: Extract learning examples from Wikipedia automatically. Disambiguate all occurrences of word w in the page p. TiMBL-core: use only the examples found in the Wikipedia page. TiMBL-inlinks: use the examples found in Wikipedia pages pointing to the page. TiMBL-all: use both sources of examples. TiMBL-core+freq

Classification Results: VSM is a simpler and more efficient approach. The results may indicate that using frequency estimations is only helpful up to a certain precision ceiling.

Precision/Coverage Trade-off: All systems assign a sense to every document in the test collection. It is possible to enhance search results diversity without annotating every document. Set threshold[ ]
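
The trade-off can be sketched as follows: only assignments whose similarity reaches a threshold are kept, which tends to raise precision while leaving some documents unannotated. This is a hypothetical illustration; the function names, the scores and the gold labels are invented, and "coverage" here is assumed to mean the fraction of documents that receive an annotation.

```python
def annotate_with_threshold(scored, threshold):
    """scored: list of (doc_id, sense, similarity).
    Keep only assignments whose similarity is at least the threshold."""
    return [(doc, sense) for doc, sense, sim in scored if sim >= threshold]


def precision_and_coverage(annotated, gold, total_docs):
    """Precision over annotated documents; coverage as the share of documents annotated."""
    correct = sum(1 for doc, sense in annotated if gold.get(doc) == sense)
    precision = correct / len(annotated) if annotated else 0.0
    coverage = len(annotated) / total_docs if total_docs else 0.0
    return precision, coverage


# Hypothetical classifier output and gold-standard senses.
scored = [("d1", "A", 0.9), ("d2", "B", 0.4), ("d3", "A", 0.2), ("d4", "B", 0.7)]
gold = {"d1": "A", "d2": "A", "d3": "A", "d4": "B"}

strict = precision_and_coverage(annotate_with_threshold(scored, 0.5), gold, len(scored))
loose = precision_and_coverage(annotate_with_threshold(scored, 0.0), gold, len(scored))
```

In this toy data the stricter threshold annotates fewer documents but makes no errors, while the threshold of zero annotates everything at lower precision.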

Using Classification to Promote Diversity: Use our best classifier (VSM-GT+freq). Make a list of the top-ten documents that: – Maximizes the number of senses – Maximizes the similarity scores of the documents to their assigned senses. Algorithm: 1. Fill each position in the rank with the document with the highest similarity to a sense that is not yet represented in the rank 2. Once all senses are represented, start choosing a second representative for each sense
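
The two-step algorithm above can be read as a round-based greedy selection. The sketch below is one possible interpretation under that assumption, not the authors' code: in each round every sense contributes its next-best document, and the contributions are ordered by similarity. The function name and the sample data are hypothetical.

```python
def diversify(classified, k=10):
    """classified: list of (doc_id, sense, similarity).
    Returns up to k (doc_id, sense) pairs, covering every sense once
    before any sense gets a second representative."""
    by_sense = {}
    for doc, sense, sim in classified:
        by_sense.setdefault(sense, []).append((sim, doc))
    for docs in by_sense.values():
        docs.sort(reverse=True)  # best document of each sense first

    ranking = []
    round_idx = 0
    while len(ranking) < k:
        # Candidates for this round: the round_idx-th best document of each sense.
        candidates = [(docs[round_idx][0], docs[round_idx][1], sense)
                      for sense, docs in by_sense.items()
                      if len(docs) > round_idx]
        if not candidates:
            break  # every document has been placed
        candidates.sort(reverse=True)
        for sim, doc, sense in candidates:
            if len(ranking) >= k:
                break
            ranking.append((doc, sense))
        round_idx += 1
    return ranking


# Hypothetical classifier output.
classified = [("d1", "A", 0.9), ("d2", "A", 0.8), ("d3", "B", 0.5),
              ("d4", "C", 0.3), ("d5", "A", 0.7)]
top = diversify(classified, k=4)
```

With this input the three senses A, B and C are each represented once before sense A receives its second document.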

Using Classification to Promote Diversity: Other approaches: – Clustering (centroids) – Clustering (top ranked) – Random – Upper bound

Using Classification to Promote Diversity: coverage = the number of senses in the top ten results / the number of senses in all search results. Using Wikipedia to enhance diversity seems to work much better than clustering. Note: our evaluation has a bias towards using Wikipedia, because only Wikipedia senses are considered to estimate diversity.
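
The coverage measure defined above is simple to compute from a ranked list of sense labels. A minimal sketch follows, assuming one label per result with `None` marking unannotated results; the function name and the sample ranking are hypothetical.

```python
def sense_coverage(ranked_senses, k=10):
    """Number of distinct senses in the top k results divided by the
    number of distinct senses in all results. None entries are ignored."""
    all_senses = {s for s in ranked_senses if s is not None}
    if not all_senses:
        return 0.0
    top_senses = {s for s in ranked_senses[:k] if s is not None}
    return len(top_senses) / len(all_senses)


# Hypothetical ranking: the top ten results contain only senses A and B,
# while four senses occur in the full result list.
ranking = ["A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "C", "D"]
coverage = sense_coverage(ranking, k=10)
```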

Conclusion: Wikipedia has much better coverage than WordNet. The distribution of senses can be estimated. Search results diversity for one-word queries can be improved with a simple and efficient algorithm. Our results do not imply that the Wikipedia-modified rank is better than the original Google rank.