Redeeming Relevance for Subject Search in Citation Indexes
Shannon Bradshaw, The University of Iowa

Citation Indexes
- Valuable tools for research
- Examples: SCI, CiteSeer, arXiv, CiteBase
- Permit traversal of citation networks
- Help identify significant contributions
- Subject search is often the entry point

Subject search
- Query similarity
- Citation frequency

PageRank
Example: two papers of similar relevance, published at roughly the same time
- Paper A cited only by its author
- Paper B cited 10 times by other authors
- Paper B is likely to have greater priority for reading

Problem
- Boolean retrieval metrics: many top documents are not relevant
- Effective for Web searches, where any one of several popular pages will do
- Not so for users of citation indexes

Reference Directed Indexing (RDI)
- Objective: to combine strong measures of both relevance and significance in a single metric
- Intuition: the opinions of authors who cite a document effectively distinguish both what a document is about and how important a contribution it makes
- Similar to the use of anchor text to index Web documents

Example
- Paper by Ron Azuma and Gary Bishop on tracking the heads of users in augmented reality systems
- Head tracking is necessary in order to generate the correct perspective view

A single reference to Azuma
"Azuma et al. [2] developed a 6DOF tracking system using linear accelerometers and rate gyroscopes to improve the dynamic registration of an optical beacon ceiling tracker."

Summarizes the Azuma paper as…
- A six-degrees-of-freedom tracking system
- With additional details: improves dynamic registration; optical beacon ceiling tracker; linear accelerometers; rate gyroscopes

Leveraging multiple citations
- For any document cited more than once, we can compare the words of all citing authors
- Terms used by many referrers make good index terms for a document

Repeated use of "tracking" and "augmented reality"
- "Whereas several augmented reality environments are known (cf. State et al. [1], Azuma and Bishop [3]) …"
- "e.g. landmark tracking for determining head pose in augmented reality [2, 3, 4, 5]"
- "Azuma and Holloway analyze sources of registration and tracking errors in AR systems [2, 11, 12]."
- "Azuma et al. [2] developed a 6DOF tracking system using linear accelerometers …"

A voting technique
- RDI treats each citing document as a voter
- The presence of a query term in referential text is a vote of "yes"; the absence of that term, a "no"
- The documents with the most votes for the query terms rank highest
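A minimal sketch of this voting scheme in Python (the data layout and names are assumptions for illustration, not the actual Rosetta code):

```python
from collections import defaultdict

def rank_by_votes(query_terms, referential_texts):
    """Rank cited documents by referential-text votes.

    referential_texts maps a cited document id to the list of
    citation-context strings written about it by citing documents.
    Each citing context casts one "yes" vote per query term it contains.
    """
    scores = defaultdict(int)
    for doc_id, contexts in referential_texts.items():
        for context in contexts:
            words = set(context.lower().split())
            for term in query_terms:
                if term.lower() in words:
                    scores[doc_id] += 1
    # Documents with the most votes for the query terms rank highest.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy usage: two contexts citing "azuma96" mention "tracking".
contexts_by_cited_doc = {
    "azuma96": [
        "a 6DOF tracking system using linear accelerometers",
        "landmark tracking for head pose in augmented reality",
    ],
    "holloway95": ["sources of registration error in AR systems"],
}
print(rank_by_votes(["tracking"], contexts_by_cited_doc))
# -> [('azuma96', 2)]  (documents with no votes are simply omitted)
```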

Related Work
- McBryan – World Wide Web Worm
- Brin & Page – Google
- Chakrabarti et al. – CLEVER
- Mendelzon et al. – TOPIC
- Bharat et al. – Hilltop
- Craswell et al. – Effective Site Finding

Contributions
- Application to scientific literature
- "Anchor text" for unrestricted subject search
- "Anchor text" for combining measures of relevance and significance

Rosetta
- Experimental system in which we implemented RDI
- Term weighting metric: (formula shown on slide)
- Ranking metric: (formula shown on slide)

Experiments
- 10,000 research papers gathered from CiteSeer
- Each document cited at least once
- Evaluated: retrieval precision and impact of search results

Comparison system
- We compared Rosetta to a traditional content-based retrieval system
- The comparison system uses TFIDF for term weighting and the Cosine ranking metric (formulas shown on slide)
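The formula images are not captured in this transcript; a standard formulation of both (one common variant, not necessarily the exact one used in the comparison system) is:

```latex
% TF-IDF weight of term t in document d
% (tf: term frequency, N: collection size, df_t: number of documents containing t)
w_{t,d} = tf_{t,d} \cdot \log\frac{N}{df_t}

% Cosine similarity between query q and document d
\cos(q,d) = \frac{\sum_t w_{t,q}\, w_{t,d}}
                 {\sqrt{\sum_t w_{t,q}^{2}}\;\sqrt{\sum_t w_{t,d}^{2}}}
```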

Indexing
- Indexed the collection in both Rosetta and the TFIDF/Cosine system
- Rosetta indexed documents based on references to them
- The TFIDF/Cosine system indexed documents based on the words used within them
- Required that each document was cited at least once, to ensure that both systems indexed the same set of documents

As referential text, Rosetta used CiteSeer’s “contexts of citation”
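For concreteness, a small sketch of collecting this referential text into an index, assuming the contexts are available as (citing id, cited id, context text) triples; the exact CiteSeer data format is not described in the talk:

```python
from collections import defaultdict

def build_referential_index(citation_contexts):
    """Group citation-context text by the document it cites.

    citation_contexts: iterable of (citing_id, cited_id, context_text) triples.
    Returns a mapping cited_id -> list of context strings, the structure
    consumed by the voting ranker sketched earlier.
    """
    index = defaultdict(list)
    for citing_id, cited_id, context_text in citation_contexts:
        index[cited_id].append(context_text)
    return index
```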

Queries
- 32 queries in our test set
- Queries were key terms extracted from the "Keywords" sections of documents
- Queries were extracted from a sample of 24 documents
- The document from which a key term was extracted established the topic of interest

Queries

Relevance assessments
- The topic of interest for a query was the idea identified by the corresponding key term
- Relevant documents directly addressed this same topic
- Example: for the query "force feedback", relevant documents were work on providing a sense of touch in VR applications or other computer simulations

Retrieval interface
- Meta-interface that queried both systems
- Used the top 10 search results from each system
- Integrated all 20 search results and presented them in random order
- No way to determine the source of a retrieved document
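A minimal sketch of such a blinded pooling step (the function and variable names are illustrative assumptions):

```python
import random

def pooled_results(rosetta_top10, tfidf_top10, seed=0):
    """Combine the top 10 from each system and shuffle them so an
    assessor cannot tell which system returned which document."""
    pooled = rosetta_top10 + tfidf_top10   # all 20 results
    random.Random(seed).shuffle(pooled)    # random presentation order
    return pooled
```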

Experimental summary
- 32 queries drawn from document key terms
- The source document identified the topic of interest
- Relevant documents addressed the same topic
- Used a meta-search interface; evaluated the top 10 from both systems
- Origin of search results hidden

Precision at top 10
- On average, RDI provided a 16.6% improvement over TFIDF/Cosine
- 1 or 2 more relevant documents in the top 10
- Result is significant: t-test of the mean paired difference
- Test statistic = (value shown on slide)
- Significant at a confidence level of 99.5%
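The test statistic value appears only on the slide image, but the test described here is a standard paired t-test over per-query precision@10 and could be reproduced like this (the numbers below are made up for illustration):

```python
from scipy import stats

# Hypothetical per-query precision@10 values (the study used 32 queries)
rosetta_p10 = [0.6, 0.4, 0.7, 0.5, 0.8, 0.3]
tfidf_p10   = [0.4, 0.3, 0.6, 0.4, 0.5, 0.2]

# Paired t-test of the mean per-query difference
t_stat, p_value = stats.ttest_rel(rosetta_p10, tfidf_p10)
print(t_stat, p_value)
# A one-sided p-value (p_value / 2) below 0.005 corresponds to the
# 99.5% confidence level reported on the slide.
```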

Precision at top 10 (cont’d)

Many retrieval errors avoided
- Example query: "software architecture diagrams"
- Most papers about software architecture frequently use the term "diagrams"; few are about tools for diagramming
- TFIDF/Cosine system: 0/10 relevant
- Rosetta: 4/10 relevant (3 in the top 5)
- Rosetta made the correct distinction more often

Rosetta Shortcomings
- The retrieval metric sorts search results by the number of query terms matched
- Some authors reuse portions of text in which other documents are cited

Impact of search results
- A look at the number of citations to the documents retrieved for each query
- Compared RDI to a baseline provided by the TFIDF/Cosine system
- TFIDF/Cosine includes no measure of impact
- Seeking only a measure of the relative impact of documents retrieved by RDI on a given topic

Experiment
For each query…
- Calculated the average citations/year for each retrieved document (average publication year: Rosetta – 1994, TFIDF/Cosine – 1995)
- Found the median number of citations/year for each set of search results
- Found the difference between the median for Rosetta and the median for TFIDF/Cosine
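A minimal sketch of this impact computation for a single query (the document records and the reference year are illustrative assumptions):

```python
from statistics import median

def citations_per_year(doc, current_year=2003):
    """Average citations per year since publication."""
    age = max(current_year - doc["year"], 1)
    return doc["citations"] / age

def impact_difference(rosetta_results, tfidf_results):
    """Median citations/year for each result set, and Rosetta minus baseline."""
    rosetta_med = median(citations_per_year(d) for d in rosetta_results)
    tfidf_med = median(citations_per_year(d) for d in tfidf_results)
    return rosetta_med, tfidf_med, rosetta_med - tfidf_med

# Toy result sets for one query
rosetta = [{"year": 1994, "citations": 90}, {"year": 1993, "citations": 40}]
tfidf = [{"year": 1995, "citations": 16}, {"year": 1996, "citations": 7}]
print(impact_difference(rosetta, tfidf))
```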

Difference in impact
On average, the median citations/year was…
- 8.9 for Rosetta
- 1.5 for the baseline

Difference in impact (cont’d)

Summary of Experiments
- Small study – results are tentative
- Surpassed the retrieval precision of a widely used relevance-based approach
- Consistently retrieved documents that have had a significant impact

Future Work
- A retrieval metric that eliminates the Boolean component
- Large-scale implementation with CiteSeer data
- Studies with more sophisticated relevance-based retrieval systems
- Comparison with popularity-based retrieval techniques

Contact
Shannon Bradshaw, The University of Iowa