IRTools Software Overview Gregory B. Newby UNC Chapel Hill



Download & Participate
- IRTools is a work in progress. Check back in the spring for more software and test cases.
- Currently, only some parts work.
- Want to help? We use CVS for distributed development.
- Our project page:

Design Principles
- For IR researchers
- A programming toolkit, not an IR system
- Implements major approaches to IR (Boolean, VSM, probabilistic & LSI)
- Scalable to billions of documents
- High-performance algorithms and structures
- Expandable
- Documented:

Major Components
- Spider: gathers documents on the live Web
- Indexer: builds internal representations of documents
- Retrieval Engine: processes queries and generates results

Implementation
- Mostly in C++, using the GNU compiler
- Uses the Standard Template Library
- Tested on Solaris & Linux (Alpha & 386)
- Designed for modularity, so IR researchers can add their own components

Why Might You Use IRTools?
- If you have your own IR software, there's probably no need.
- If you are looking for experimental IR software, this might be a good alternative (goal: to be suitable for general use in mid-2002).
- IRTools should be useful for classroom use and demonstration.
- For production use, consider ht://dig.

Design Snippet: Word List
- Berkeley DB is used to store the term-to-termID lookup table
- A single file, accessed by hash in a B+ tree

    struct term_termID {
        char *term;
        irt_int termID;
    };
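The lookup the word list performs can be sketched in a few lines. This is a minimal in-memory illustration, not the IRTools code: a `std::unordered_map` stands in for the Berkeley DB file, `irt_int` is assumed to be a 32-bit integer (the slides do not define it), and the `WordList` class and its method names are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

// Hypothetical in-memory stand-in for the on-disk term-to-termID table.
class WordList {
public:
    // Return the termID for `term`, assigning the next free ID on first sight.
    irt_int lookup_or_assign(const std::string& term) {
        auto it = ids_.find(term);
        if (it != ids_.end()) return it->second;
        const irt_int id = static_cast<irt_int>(ids_.size());
        ids_.emplace(term, id);
        return id;
    }

    // Return the termID for `term`, or -1 if the term is unknown.
    irt_int lookup(const std::string& term) const {
        auto it = ids_.find(term);
        return it == ids_.end() ? -1 : it->second;
    }

private:
    std::unordered_map<std::string, irt_int> ids_;
};
```

Assigning IDs densely from zero matters for the next structure: a dense termID can be used directly as a record index in a fixed-length file.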

Design Snippet: 1st Inverted Index File
- Binary file with fixed-length records
- Accessed by termID * sizeof(struct) offset
- Gives basic info needed for weighting
- Points to more files for inverted entries (the actual documents for this term)
- Some duplication (e.g., meantf) to prevent additional I/O

Design Snippet: 1st Inverted Index File

    struct inv_file1 {
        irt_int termID;
        irt_int term_doccount;    // Frequency
        irt_int meantf;           // For weighting
        irt_int nt;               // # terms in this doc
        irt_int file2_location;   // File for entries
        irt_int starting_offset;  // File 2 loc
        irt_int entry_count;      // # occurrences of this term in file 2
    };
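Because the records are fixed-length, finding a term's record is a single seek to termID * sizeof(struct). The sketch below illustrates that access pattern under stated assumptions: `irt_int` is taken to be 32 bits, an in-memory string stands in for the binary file, and `build_index_image` and `read_record` are hypothetical helper names, not IRTools functions.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

struct inv_file1 {
    irt_int termID;
    irt_int term_doccount;
    irt_int meantf;
    irt_int nt;
    irt_int file2_location;
    irt_int starting_offset;
    irt_int entry_count;
};

// Serialize records back to back. Assumes recs[i].termID == i, so that
// record i starts at byte i * sizeof(inv_file1).
std::string build_index_image(const std::vector<inv_file1>& recs) {
    std::ostringstream out;
    for (const inv_file1& r : recs)
        out.write(reinterpret_cast<const char*>(&r), sizeof r);
    return out.str();
}

// Seek straight to the record for `termID` and read it.
// Returns false if the offset lies outside the stream.
bool read_record(std::istream& in, irt_int termID, inv_file1& rec) {
    const std::streamoff off = static_cast<std::streamoff>(termID) *
                               static_cast<std::streamoff>(sizeof(inv_file1));
    in.seekg(off, std::ios::beg);
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return static_cast<bool>(in);
}
```

The same `read_record` would work on a `std::ifstream` opened in binary mode. Writing raw structs ties the file to one machine's byte order and padding, which may or may not match what IRTools does.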

Design Snippet: 2nd Inverted Index File
- Info about documents with this term
- Using Page Rank, best docs can be listed earliest (avoiding subsequent disk I/O)
- Multiple 2nd files for larger collections

    struct inv_file2 {
        irt_int termID;                        // Sanity check
        irt_int file_location;                 // Next file
        irt_int starting_offset, num_entries;  // As for file1
    };

Design Snippet: 2nd Inverted Index File
- For each document with this term:

    struct inv_docentry {
        irt_int term_in_doc_count;
        // For weighting:
        irt_int doc_unique_terms;
        irt_int doc_total_terms;
        // 3rd file offset
        irt_int file3_location;
    };

Design Snippet: 3rd Inverted Index File
- This lists a term's locations in documents

    irt_int termID;  // Sanity check

- Followed by term_in_doc_count irt_ints indicating the positions of this term in this document
- Usable for a NEAR operator
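A NEAR operator over two such position lists can be sketched as a linear two-pointer scan. This is an illustration only: NEAR is listed as planned in the slides, so the function name `near_match`, the `window` parameter, the 32-bit `irt_int`, and the assumption that positions are stored in ascending order are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

// True if some position in `a` lies within `window` positions of some
// position in `b`. Both lists must be sorted in ascending order.
bool near_match(const std::vector<irt_int>& a,
                const std::vector<irt_int>& b,
                irt_int window) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (std::abs(a[i] - b[j]) <= window) return true;
        // Advance the list whose current position is smaller; moving the
        // larger one could only widen the gap.
        if (a[i] < b[j]) ++i; else ++j;
    }
    return false;
}
```

The scan touches each position at most once, so the cost is linear in the combined length of the two lists.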

Planned & Current Components

Current:
- Various stemmers and stoplists
- Various weighting schemes
- Sparse matrix formats for LSI etc.
- Boolean AND & OR
- TREC output
- Visual interfaces

Designed & Planned:
- Page Rank
- Integrated spider
- Boolean NEAR
- Update & delete entries
- Concurrent retrieval engine clients
- Concurrent indexers

Global Collection Variables
- maxn: highest # of terms in any doc
- maxUn: highest # of unique terms
- Nterms: total known terms
- Ndocs: total known documents

Design Snippet: Boolean Candidate Merging
- Works for OR or AND
- Minimal disk I/O (needed for the inverted index only)
- Doesn't require the inverted index to be sorted in docID order
- The STL map can be problematic for more than about 20K candidates; using documents that are Page Rank'ed can help shrink the candidate set (and speed up everything)
- Start with the terms with the lowest frequency; we only continue until we have enough hits

Design Snippet: Boolean Candidate Merging

    irt_int NFULL = 0;  // stop with enough hits
    vector full         // docIDs w. all q terms
    map                 // Candidates

    struct candidate_info {      // For each doc
        irt_int docID;           // this doc's ID
        irt_int nt;              // # terms in this doc, for weighting
        irt_int meantf;          // mean tf in this doc, for weighting
        float tf[NQUERYTERMS];   // For weighting
        irt_short qtcount;       // # query terms in doc
    };

- The map eliminates sorting!
- We must allocate memory for every candidate
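The AND case of candidate merging can be sketched as follows. This is a simplified illustration, not the IRTools implementation: `merge_and` is a hypothetical name, the posting lists are passed in memory rather than read from the index files, `candidate_info` is cut down to the `qtcount` field, `irt_int` is assumed to be 32 bits, and each docID is assumed to appear at most once per posting list.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

struct candidate_info {
    irt_int qtcount = 0;  // # query terms seen in this doc so far
};

// postings[t] holds the docIDs containing query term t, in any order.
// Returns docIDs containing every query term, stopping after `nfull` hits.
std::vector<irt_int> merge_and(std::vector<std::vector<irt_int>> postings,
                               std::size_t nfull) {
    std::vector<irt_int> full;
    if (postings.empty() || nfull == 0) return full;

    // Start with the rarest term so the candidate set stays small.
    std::sort(postings.begin(), postings.end(),
              [](const std::vector<irt_int>& x, const std::vector<irt_int>& y) {
                  return x.size() < y.size();
              });

    const irt_int nterms = static_cast<irt_int>(postings.size());
    std::map<irt_int, candidate_info> candidates;  // keyed by docID

    for (std::size_t t = 0; t < postings.size(); ++t) {
        const bool last = (t + 1 == postings.size());
        for (irt_int doc : postings[t]) {
            std::map<irt_int, candidate_info>::iterator it;
            if (t == 0) {
                it = candidates.emplace(doc, candidate_info{}).first;
            } else {
                it = candidates.find(doc);
                if (it == candidates.end()) continue;  // cannot satisfy AND
            }
            ++it->second.qtcount;
            if (last && it->second.qtcount == nterms) {
                full.push_back(doc);
                if (full.size() >= nfull) return full;  // enough hits
            }
        }
    }
    return full;
}
```

For OR, the same loop would insert every docID into the map regardless of `t`, which is where the candidate set can grow large.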

Design Snippet: LSI & Information Space
- We use a modified Harwell-Boeing sparse matrix format on disk (modified = binary files)
- Berry's svdpackc has been integrated
- We're doing scaling experiments now. Scaling is a major challenge for LSI
- One solution: do smaller eigensystem problems on a candidate subset on the fly, rather than pre-computing the entire collection's semantic space. But this eliminates possibly interesting documents!

Hyperlink Map
- The hyperlink map is a sparse asymmetric matrix of size D x D
- We use a modified Harwell-Boeing format to store the matrix
- An index file structure similar to the inverted index gives us rapid access to any document's link list
- We must store both sides of the matrix
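Harwell-Boeing stores a sparse matrix in compressed column form: one array of start offsets per column and one array of row indices. The sketch below applies that layout to link lists and builds it twice, once keyed by source and once by target, which is one reading of "both sides of the matrix". The `LinkMatrix` type, the function names, and the 32-bit `irt_int` are assumptions for illustration, and the arrays are kept in memory rather than in the modified binary file format.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using irt_int = std::int32_t;  // assumption: the slides do not define irt_int

// Compressed link lists: the links of document d are
// other[start[d]] .. other[start[d + 1] - 1].
struct LinkMatrix {
    std::vector<irt_int> start;  // size D + 1
    std::vector<irt_int> other;  // docID at the far end of each link
};

// Build link lists from (source, target) pairs. With by_source == true the
// lists hold each document's out-links; otherwise they hold its in-links.
LinkMatrix build_links(irt_int ndocs,
                       const std::vector<std::pair<irt_int, irt_int>>& links,
                       bool by_source) {
    LinkMatrix m;
    m.start.assign(static_cast<std::size_t>(ndocs) + 1, 0);
    for (const auto& l : links)  // count links per document
        ++m.start[(by_source ? l.first : l.second) + 1];
    for (irt_int d = 0; d < ndocs; ++d)  // prefix sums give start offsets
        m.start[d + 1] += m.start[d];

    m.other.resize(links.size());
    std::vector<irt_int> next(m.start.begin(), m.start.end() - 1);
    for (const auto& l : links) {
        const irt_int key = by_source ? l.first : l.second;
        const irt_int val = by_source ? l.second : l.first;
        m.other[next[key]++] = val;
    }
    return m;
}

// Copy out one document's link list.
std::vector<irt_int> neighbours(const LinkMatrix& m, irt_int doc) {
    return std::vector<irt_int>(m.other.begin() + m.start[doc],
                                m.other.begin() + m.start[doc + 1]);
}
```

Keeping both orientations doubles the storage but makes "who links to this page" as cheap as "what does this page link to", which link-analysis methods such as Page Rank need.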

Web Document Metadata
- Items stored during spidering. These are kept in a Berkeley DB B+ hash file, with the document URL (or name) as key:
  - Docname (key)
  - docID
  - HTTP last update, as reported
  - Our last visit/update
  - HTTP-reported size
  - Checksum (simple)
  - # links out

Design Snippet: Tokenizer
- The tokenizer reads files (via spider or local disk)
- Goal: few passes through the file
- Goal: any character set
- Process:
  - Keep a static array of word boundaries
  - Keep a static array of tag delimiters (<)
  - Fold everything to lower case
  - termID lookup can happen now or later
  - Simple transformations (like ditching extra white space) can happen now
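The table-driven process above can be sketched as a single pass over the input. This is an assumption-laden illustration rather than the IRTools tokenizer: the function names are hypothetical, non-alphanumeric ASCII bytes are treated as word boundaries, bytes of 128 and above are kept as word characters, and everything between `<` and `>` is skipped, which the slides do not specify.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Build the 256-entry word-boundary table once.
static std::array<bool, 256> make_boundary_table() {
    std::array<bool, 256> t{};  // all false
    for (int c = 0; c < 128; ++c)
        t[static_cast<std::size_t>(c)] = !std::isalnum(c);
    return t;  // bytes >= 128 stay false, so they count as word characters
}

// One pass: skip tags, split on boundaries, fold ASCII to lower case.
std::vector<std::string> tokenize(const std::string& text) {
    static const std::array<bool, 256> boundary = make_boundary_table();
    std::vector<std::string> tokens;
    std::string cur;
    bool in_tag = false;
    for (unsigned char c : text) {
        if (in_tag) {  // assumption: tag contents are not indexed
            if (c == '>') in_tag = false;
            continue;
        }
        if (c == '<') in_tag = true;
        if (boundary[c]) {
            if (!cur.empty()) {  // a boundary closes the current token
                tokens.push_back(cur);
                cur.clear();
            }
        } else {
            cur.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    if (!cur.empty()) tokens.push_back(cur);
    return tokens;
}
```

Runs of boundary characters produce no empty tokens, so extra white space is dropped in the same pass. A termID lookup could be applied to each token as it is emitted or deferred to a later stage.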