Searching CiteSeer Metadata Using Nutch
Larry Reeve
INFO624 – Information Retrieval
Dr. Lin – Winter 2005



CiteSeer

CiteSeer
- Search Issues
  - Keyword-based full-text search
  - Boolean search syntax
- How to…
  - search by author name?
  - search author affiliation?
  - search by publication date?

CiteSeer
- Example:
  - Suggested author search approach:
    - For authors, list all variants that appear in citations, separated by "OR"
- Examples:
  - m jordan or michael jordan or m i jordan or michael i jordan
  - howard w/2 white or h w/2 white
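The variant-listing approach above can be sketched as a small helper. This is only an illustration, not CiteSeer code: the function names are hypothetical, and it assumes the variants of interest are the first initial and full first name, each with and without a middle initial.

```python
def author_variants(first, last, middle=None):
    """Return lowercase name variants: first initial and full first name,
    each with and without the middle initial (if one is given)."""
    first, last = first.lower(), last.lower()
    firsts = [first[0], first]                      # e.g. "m", "michael"
    variants = [f"{f} {last}" for f in firsts]
    if middle:
        initial = middle[0].lower()
        variants += [f"{f} {initial} {last}" for f in firsts]
    return variants


def citeseer_author_query(first, last, middle=None):
    """Join the variants with "or", as the suggested approach describes."""
    return " or ".join(author_variants(first, last, middle))


print(citeseer_author_query("Michael", "Jordan", "I"))
# m jordan or michael jordan or m i jordan or michael i jordan
```

The second slide example uses the `w/2` proximity operator instead of plain phrases; the same helper could emit `howard w/2 white` style clauses by changing the format string.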

CiteSeer – phrase search

CiteSeer – term search

Goal
- Search selected metadata fields
  - Author name
  - Author affiliation
  - Publication Date (month, day, year)
  - Title
  - Others…
- Increase precision

Methodology - Nutch
- An open-source web search engine
- Includes crawling, indexing, searching
- Technologies: Java, JSP, Tomcat
- Extensible
  - new fields
  - new parsing/indexing facilities
  - adapt UI for searching

Methodology - Metadata

Methodology
1) Split XML file into HTML documents
  - Each HTML doc contains metadata
  - Allows existing crawler to be used/extended
2) Crawl and index HTML documents on local filesystem
3) Search generated index using JSP page
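Step 1 might look roughly like the sketch below. It is not the project's split program: the XML layout (`records`, `record`, `author` attributes, `pubyear`, `pubmonth`) and the sample values are assumptions made for illustration, since the transcript does not show the CiteSeer metadata schema. Only the `cs_` field names come from the slides.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical input; the real CiteSeer metadata schema is not shown in the slides.
SAMPLE_XML = """<records>
  <record id="1">
    <title>Example Paper</title>
    <author first="Peter" last="Lee" affiliation="Example University"/>
    <pubyear>1998</pubyear>
    <pubmonth>06</pubmonth>
  </record>
</records>"""


def record_to_html(record):
    """Render one metadata record as a small HTML page with <meta> tags."""
    fields = []
    for author in record.findall("author"):
        fields.append(("cs_authorfirst", author.get("first", "")))
        fields.append(("cs_authorlast", author.get("last", "")))
        fields.append(("cs_authoraffiliation", author.get("affiliation", "")))
    fields.append(("cs_pubyear", record.findtext("pubyear", "")))
    fields.append(("cs_pubmonth", record.findtext("pubmonth", "")))
    title = escape(record.findtext("title", ""))
    metas = "\n".join(
        f'<meta name="{name}" content="{escape(value)}">'
        for name, value in fields
        if value
    )
    return f"<html><head><title>{title}</title>\n{metas}\n</head><body>{title}</body></html>"


def split_records(xml_text):
    """Map a file name to the HTML page for each record in the XML text."""
    root = ET.fromstring(xml_text)
    return {f"{rec.get('id')}.html": record_to_html(rec) for rec in root.findall("record")}


pages = split_records(SAMPLE_XML)
print(pages["1.html"])
```

Returning a dictionary instead of writing files keeps the sketch easy to test; a real split step would write each page to the local filesystem for the crawler to pick up.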

Methodology (pipeline diagram): XML File (100 records) -> Split Program -> 100 HTML Documents -> Nutch Crawler (Parse Filter, Index Filter) -> Nutch Search (JSP; Query Filter). Diagram legend: "Implemented as part of project".

XML to HTML Split

Methodology - Split

Methodology – Crawl/Index
- Requires 2 filters to process metadata
- CSParseFilter
  - Parses HTML for metadata values
  - Implements Nutch HtmlParseFilter interface
- CSIndexingFilter
  - Uses metadata generated by ParseFilter
  - Adds metadata to index
  - Implements Nutch IndexingFilter interface
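The division of labour between the two filters can be illustrated without Nutch. The sketch below is a stand-in written in Python, not the Java `HtmlParseFilter` or `IndexingFilter` interfaces: `MetaExtractor`, `parse_metadata`, and `add_to_index` are hypothetical names, and the index is a plain nested dictionary rather than a Lucene index.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Parse step: collect <meta name="cs_..."> values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        content = attrs.get("content")
        if name.startswith("cs_") and content:
            self.metadata[name].append(content.lower())


def parse_metadata(html_text):
    """Return {field: [values]} for the cs_ meta tags in html_text."""
    parser = MetaExtractor()
    parser.feed(html_text)
    return dict(parser.metadata)


def add_to_index(index, doc_id, metadata):
    """Index step: record doc_id under index[field][term] for every term."""
    for field, values in metadata.items():
        for value in values:
            for term in value.split():
                index.setdefault(field, {}).setdefault(term, set()).add(doc_id)


html_doc = (
    '<html><head><meta name="cs_authorfirst" content="Peter">'
    '<meta name="cs_authorlast" content="Lee"></head></html>'
)
index = {}
add_to_index(index, "doc1", parse_metadata(html_doc))
print(index)
```

Keeping extraction and indexing separate mirrors the slide: the indexing step only consumes the metadata the parse step produced.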

Parse Filter – extract metadata

Index Filter

Methodology – Query
- Modification of Nutch search page
  - Change URL from filesystem metadata HTML to CiteSeer
  - Change to 20 hits, to match CiteSeer
- Query filter
  - Handles custom fields from index filter
  - Prefixed with cs_
  - Implements Nutch QueryFilter interface
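A rough picture of what handling `cs_`-prefixed fields involves is sketched below. It is not the Nutch `QueryFilter` interface: the function names, the `content` default field, the toy index, and the document ids are all hypothetical. It also assumes every clause must match, which is consistent with the combined first and last name query narrowing the result set.

```python
def parse_query(query, default_field="content"):
    """Split a query into (field, term) clauses; cs_ prefixes select a metadata field."""
    clauses = []
    for token in query.lower().split():
        field, sep, term = token.partition(":")
        if sep and field.startswith("cs_") and term:
            clauses.append((field, term))
        else:
            clauses.append((default_field, token))
    return clauses


def search(index, query):
    """Return ids of documents that match every clause (AND semantics, assumed)."""
    result = None
    for field, term in parse_query(query):
        hits = index.get(field, {}).get(term, set())
        result = set(hits) if result is None else result & hits
    return result or set()


# Toy index with hypothetical document ids: index[field][term] -> set of ids.
index = {
    "cs_authorlast": {"lee": {"d1", "d2", "d3"}, "ullman": {"d4"}},
    "cs_authorfirst": {"peter": {"d1"}, "jeffrey": {"d4"}},
}

print(sorted(search(index, "cs_authorlast:lee")))
# ['d1', 'd2', 'd3']
print(sorted(search(index, "cs_authorlast:lee cs_authorfirst:peter")))
# ['d1']
```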

Query Filter

Evaluation
- Testing for precision/recall
  - 100 documents
- Stress test
  - 10,000 documents
  - Approx 10 mins to crawl/index
- 575,000 documents in CiteSeer metadata download
  - (716,797 documents in CiteSeer)
  - 3.5 hours to split XML into HTML
  - 12 hours to crawl/index
  - ~551,000 indexed during crawling
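For a sense of scale, the figures above can be turned into rough rates. The inputs are the approximate numbers from the slide, so the results are estimates only; the variable names are illustrative.

```python
# Approximate figures taken from the slide.
stress_docs, stress_minutes = 10_000, 10
download_docs, citeseer_docs = 575_000, 716_797
indexed_docs, crawl_hours = 551_000, 12

stress_rate = stress_docs / stress_minutes        # documents per minute, small run
full_rate = indexed_docs / (crawl_hours * 60)     # documents per minute, full run
coverage = indexed_docs / download_docs           # share of the download that was indexed
download_share = download_docs / citeseer_docs    # share of CiteSeer in the download

print(f"stress test: {stress_rate:.0f} docs/min")
print(f"full crawl:  {full_rate:.0f} docs/min")
print(f"indexed:     {coverage:.1%} of the metadata download")
print(f"download:    {download_share:.1%} of CiteSeer")
```

On these numbers the full crawl ran somewhat slower per document than the stress test, and roughly 96% of the downloaded records ended up in the index.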

Evaluation
- Precision & recall
  - Use first 100 docs (easy to measure recall)
  - Issue queries
    - Author last name
    - Author first & last name
    - Author affiliation
- Precision
  - Use max docs in each system
  - Issue author search queries to both systems
  - Measure precision on each page of 20 hits

Evaluation – P & R
- Look for all papers where Peter Lee is an author (1 document)
- cs_authorlast:lee
  - Returns 3 documents, all with last name of Lee
  - P=.33, R=1
- cs_authorlast:lee cs_authorfirst:peter
  - Returns single document
  - P=1, R=1
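The P and R values above follow from the standard definitions: precision is relevant retrieved over retrieved, and recall is relevant retrieved over relevant. A minimal sketch, with hypothetical document ids standing in for the three papers returned:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"peter_lee_paper"}  # the single paper by Peter Lee

# cs_authorlast:lee returns three documents, one of them relevant.
p1, r1 = precision_recall({"peter_lee_paper", "other_lee_1", "other_lee_2"}, relevant)
print(f"P={p1:.2f}, R={r1:.2f}")  # P=0.33, R=1.00

# cs_authorlast:lee cs_authorfirst:peter returns only the relevant document.
p2, r2 = precision_recall({"peter_lee_paper"}, relevant)
print(f"P={p2:.2f}, R={r2:.2f}")  # P=1.00, R=1.00
```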

Evaluation - Precision
- Author search:
  - Q1: Peter Lee
    - Project: cs_authorfirst:peter cs_authorlast:lee
    - CiteSeer: peter w/2 lee
  - Q2: Jeffrey Ullman
    - Project: cs_authorfirst:jeffrey cs_authorlast:ullman
    - CiteSeer: jeffrey w/2 ullman
  - Q3: John Smith
    - Project: cs_authorfirst:john cs_authorlast:smith
    - CiteSeer: john w/2 smith

Evaluation - Precision

Search Demo
- Available fields:
  - cs_authorfirst
  - cs_authorlast
  - cs_authoraffiliation
  - cs_pubyear
  - cs_pubmonth