Searching CiteSeer Metadata Using Nutch Larry Reeve INFO624 – Information Retrieval Dr. Lin – Winter 2005.


1 Searching CiteSeer Metadata Using Nutch
Larry Reeve
INFO624 – Information Retrieval
Dr. Lin – Winter 2005

2 CiteSeer

3 CiteSeer
- Search Issues
  - Keyword-based full-text search
  - Boolean search syntax
- How to…
  - search by author name?
  - search author affiliation?
  - search by publication date?

4 CiteSeer
- Example:
- Suggested author search approach:
  - For authors, list all variants that appear in citations, separated by "OR"
  - Examples:
    - m jordan or michael jordan or m i jordan or michael i jordan
    - howard w/2 white or h w/2 white
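The variant-listing approach above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical, it covers just the four variants shown in the first example, and it assumes a non-empty first and last name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class AuthorVariants {

    /**
     * Builds an "or"-separated list of name variants for one author.
     * Hypothetical helper; assumes first and last are non-empty.
     */
    static String orQuery(String first, String middle, String last) {
        String f = first.toLowerCase(Locale.ROOT);
        String l = last.toLowerCase(Locale.ROOT);
        String fi = f.substring(0, 1); // first initial

        List<String> variants = new ArrayList<>();
        variants.add(fi + " " + l);
        variants.add(f + " " + l);
        if (middle != null && !middle.isEmpty()) {
            String mi = middle.toLowerCase(Locale.ROOT).substring(0, 1);
            variants.add(fi + " " + mi + " " + l);
            variants.add(f + " " + mi + " " + l);
        }
        return String.join(" or ", variants);
    }
}
```

Calling `AuthorVariants.orQuery("Michael", "I", "Jordan")` yields the same string as the first example on the slide, which shows how quickly the query grows for a single author.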

5 CiteSeer – phrase search

6 CiteSeer – term search

7 Goal
- Search selected metadata fields
  - Author name
  - Author affiliation
  - Publication Date (month, day, year)
  - Title
  - Others…
- Increase precision

8 Methodology - Nutch
- An open-source web search engine
- Includes crawling, indexing, searching
- Technologies: Java, JSP, Tomcat
- Extensible
  - new fields
  - new parsing/indexing facilities
  - adapt UI for searching

9 Methodology - Metadata

10 Methodology
1) Split XML file into HTML documents
   - Each HTML doc contains metadata
   - Allows existing crawler to be used/extended
2) Crawl and index HTML documents on local filesystem
3) Search generated index using JSP page
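Step 1 can be illustrated with a minimal sketch using only the Java standard library. The XML element names (`record`, `authorfirst`, and so on), the `cs_`-prefixed `<meta>` tag layout, and the class name are assumptions made for illustration; the slides do not show the CiteSeer schema or the project's actual split program.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MetadataSplitter {

    // Hypothetical element names; the real metadata schema is not shown in the slides.
    static final String[] FIELDS =
            {"title", "authorfirst", "authorlast", "authoraffiliation", "pubyear", "pubmonth"};

    /** Returns one HTML page per <record>, with each field value written as a <meta> tag. */
    static List<String> split(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = doc.getElementsByTagName("record");
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            StringBuilder html = new StringBuilder("<html><head>\n");
            for (String field : FIELDS) {
                // A field may repeat, for example one authorlast per co-author.
                NodeList values = record.getElementsByTagName(field);
                for (int j = 0; j < values.getLength(); j++) {
                    html.append("<meta name=\"cs_").append(field)
                        .append("\" content=\"")
                        .append(escape(values.item(j).getTextContent().trim()))
                        .append("\">\n");
                }
            }
            html.append("</head><body></body></html>\n");
            pages.add(html.toString());
        }
        return pages;
    }

    /** Escapes characters that would break an HTML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;")
                .replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

This sketch loads the whole XML document into memory, which is fine for a 100-record test file; a download with hundreds of thousands of records would more plausibly call for a streaming parser.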

11 Methodology
[Diagram: XML File (100 records) -> Split Program -> 100 HTML Documents -> Nutch Crawler (Parse Filter, Index Filter) -> Nutch Search (JSP) (Query Filter). Legend: Implemented as part of project]

12 XML to HTML Split

13 Methodology - Split

14 Methodology – Crawl/Index
- Requires 2 filters to process metadata
- CSParseFilter
  - Parses HTML for metadata values
  - Implements Nutch HtmlParseFilter interface
- CSIndexingFilter
  - Uses metadata generated by ParseFilter
  - Adds metadata to index
  - Implements Nutch IndexingFilter interface
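The two-stage flow (parse out metadata, then hand it to the index) can be sketched without Nutch as follows. This is not the project's CSParseFilter or CSIndexingFilter and it does not use the real Nutch HtmlParseFilter or IndexingFilter signatures; the class name, the `<meta>` tag layout, and the `field:value` output are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetadataStages {

    // Matches <meta name="cs_..." content="..."> as written by a hypothetical split step.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"(cs_[a-z]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    /** Parse stage: collect every cs_ meta value, keeping repeated fields such as co-authors. */
    static Map<String, List<String>> parse(String html) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            metadata.computeIfAbsent(m.group(1).toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                    .add(m.group(2));
        }
        return metadata;
    }

    /** Index stage: turn the parsed metadata into lowercase field:value entries. */
    static List<String> index(Map<String, List<String>> metadata) {
        List<String> fields = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            for (String value : e.getValue()) {
                fields.add(e.getKey() + ":" + value.toLowerCase(Locale.ROOT));
            }
        }
        return fields;
    }
}
```

Keeping the stages separate mirrors the split described on the slide: the index stage only sees what the parse stage extracted, so adding a new searchable field means touching both.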

15 Parse Filter – extract metadata

16 Index Filter

17

18 Methodology – Query
- Modification of Nutch search page
  - Change URL from filesystem metadata HTML to CiteSeer
  - Change to 20 hits, to match CiteSeer
- Query filter
  - Handles custom fields from index filter
  - Prefixed with cs_
  - Implements Nutch QueryFilter interface
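A rough sketch of what handling `cs_`-prefixed fields involves is shown below. It does not implement the Nutch QueryFilter interface; the class name, the whitespace tokenization, and the rule that every clause must match are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class FieldedQuery {

    static final String DEFAULT_FIELD = "DEFAULT";

    /** Splits a query into field -> values; terms without a cs_ prefix go to DEFAULT. */
    static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int colon = token.indexOf(':');
            String field = DEFAULT_FIELD;
            String value = token;
            if (colon > 0 && token.startsWith("cs_")) {
                field = token.substring(0, colon);
                value = token.substring(colon + 1);
            }
            clauses.computeIfAbsent(field, k -> new ArrayList<>())
                   .add(value.toLowerCase(Locale.ROOT));
        }
        return clauses;
    }

    /** True if every cs_ clause matches one of the document's values for that field. */
    static boolean matches(Map<String, List<String>> clauses,
                           Map<String, List<String>> docFields) {
        for (Map.Entry<String, List<String>> e : clauses.entrySet()) {
            if (e.getKey().equals(DEFAULT_FIELD)) {
                continue; // full-text terms are out of scope for this sketch
            }
            List<String> docValues =
                    docFields.getOrDefault(e.getKey(), Collections.<String>emptyList());
            for (String wanted : e.getValue()) {
                boolean found = false;
                for (String have : docValues) {
                    if (have.equalsIgnoreCase(wanted)) {
                        found = true;
                        break;
                    }
                }
                if (!found) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

One limitation of this sketch: because first and last names are matched as independent fields, a paper co-authored by a "Peter Smith" and a "John Lee" would also satisfy `cs_authorfirst:peter cs_authorlast:lee`.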

19 Query Filter

20 Evaluation
- Testing for precision/recall
  - 100 documents
- Stress test
  - 10,000 documents
  - Approx 10 mins to crawl/index
- 575,000 documents in CiteSeer metadata download
  - (716,797 documents in CiteSeer)
  - 3.5 hours to split XML into HTML
  - 12 hours to crawl/index
  - ~551,000 indexed during crawling
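For orientation, the figures above imply the following rough rates. These are derived here by simple division and are not stated on the slide; since the inputs are themselves approximate, the results are only indicative.

```latex
\begin{align*}
\text{stress test:} \quad & \frac{10{,}000\ \text{docs}}{10 \times 60\ \text{s}} \approx 16.7\ \text{docs/s} \\
\text{full crawl/index:} \quad & \frac{551{,}000\ \text{docs}}{12 \times 3600\ \text{s}} \approx 12.8\ \text{docs/s} \\
\text{share of download indexed:} \quad & \frac{551{,}000}{575{,}000} \approx 95.8\% \\
\text{share of CiteSeer in download:} \quad & \frac{575{,}000}{716{,}797} \approx 80.2\%
\end{align*}
```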

21 Evaluation
- Precision & recall
  - Use first 100 docs (easy to measure recall)
  - Issue queries
    - Author last name
    - Author first & last name
    - Author affiliation
- Precision
  - Use max docs in each system
  - Issue author search queries to both systems
  - Measure precision on each page of 20 hits

22 Evaluation – P & R
- Look for all papers where Peter Lee is an author (1 document)
- cs_authorlast:lee
  - Returns 3 documents, all with last name of Lee
  - P=.33, R=1
- cs_authorlast:lee cs_authorfirst:peter
  - Returns single document
  - P=1, R=1
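The P and R values above follow from the standard set-based definitions, sketched here in a few lines. The class name and the document identifiers in the usage note are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class PrecisionRecall {

    /** Fraction of retrieved documents that are relevant (0 if nothing was retrieved). */
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    /** Fraction of relevant documents that were retrieved (0 if nothing is relevant). */
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    /** Number of documents that are both retrieved and relevant. */
    private static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }
}
```

With three retrieved documents (say `doc1`, `doc2`, `doc3`) of which only `doc1` is by Peter Lee, precision is 1/3 (about .33) and recall is 1; once the first-name clause narrows the result to `doc1` alone, both are 1.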

23 Evaluation - Precision
- Author search:
  - Q1: Peter Lee
    - Project: cs_authorfirst:peter cs_authorlast:lee
    - CiteSeer: peter w/2 lee
  - Q2: Jeffrey Ullman
    - Project: cs_authorfirst:jeffrey cs_authorlast:ullman
    - CiteSeer: jeffrey w/2 ullman
  - Q3: John Smith
    - Project: cs_authorfirst:john cs_authorlast:smith
    - CiteSeer: john w/2 smith

24 Evaluation - Precision

25 Search Demo
- Available fields:
  - cs_authorfirst
  - cs_authorlast
  - cs_authoraffiliation
  - cs_pubyear
  - cs_pubmonth

