

Lucene Part 3

High-Level Infrastructure
When you build a search solution, the process usually splits into two main tasks: building an index, and searching that index. This is definitely the case with Lucene (the only time it isn’t is when your search goes directly to the database). We wanted to keep the search interface fairly simple, so code that interacts with the search system sees two main interfaces: IndexBuilder and IndexSearch.

Index Builder
Any process that needs to build an index goes through the IndexBuilder. This simple interface gives you two entry points to the indexing process:
- By passing individual configuration settings to the class, e.g. the path to the index, whether to do an incremental build, and how often to optimize the Lucene index as you add records.
- By passing in an index plan name. IndexBuilder then looks up the settings it needs from the configuration system, which lets you tweak your variables in an external file rather than in code.
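The two entry points could look roughly like the sketch below. The slides do not show the real signatures, so `IndexBuilderSketch`, `IndexSettings`, `registerPlan` and the in-memory plan map (standing in for the configuration system) are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical settings holder; the field names are illustrative only. */
final class IndexSettings {
    final String indexPath;     // where the index lives on disk
    final boolean incremental;  // true: add to an existing index instead of rebuilding
    final int optimizeEvery;    // optimize after this many added records

    IndexSettings(String indexPath, boolean incremental, int optimizeEvery) {
        this.indexPath = indexPath;
        this.incremental = incremental;
        this.optimizeEvery = optimizeEvery;
    }
}

/** Hypothetical stand-in for the IndexBuilder described above. */
final class IndexBuilderSketch {
    // Stand-in for the external configuration system: plan name -> settings.
    private final Map<String, IndexSettings> plans = new HashMap<>();

    void registerPlan(String planName, IndexSettings settings) {
        plans.put(planName, settings);
    }

    /** Entry point 1: the caller passes the individual settings directly. */
    String build(String indexPath, boolean incremental, int optimizeEvery) {
        return build(new IndexSettings(indexPath, incremental, optimizeEvery));
    }

    /** Entry point 2: the caller passes a plan name and the settings are looked up. */
    String build(String planName) {
        IndexSettings settings = plans.get(planName);
        if (settings == null) {
            throw new IllegalArgumentException("Unknown index plan: " + planName);
        }
        return build(settings);
    }

    // A real builder would drive Lucene here; this sketch only reports what it would do.
    private String build(IndexSettings s) {
        return (s.incremental ? "incremental" : "full") + " build at " + s.indexPath
                + ", optimize every " + s.optimizeEvery + " records";
    }
}
```

Both entry points end up in the same private method, so the plan-based call is just a lookup layered on top of the explicit one.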

Index Sources
The Index Builder abstracts away the details of Lucene and of the Index Sources that are used to create the index itself. The search interface is also kept very simple. A search is done via:
    IndexSearch.search(String inputQuery, int resultsStart, int resultsCount);
E.g. look for the terms EJB and WebLogic, returning up to the first 10 results:
    IndexSearch.search("EJB and WebLogic", 0, 10);
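A minimal sketch of how the `resultsStart` and `resultsCount` arguments might select a window of the ranked hits. It assumes a zero-based start offset (consistent with the `0, 10` example returning the first 10 results); `ResultPager` and `page` are hypothetical names, and the slides do not say what `IndexSearch.search` returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper showing how resultsStart/resultsCount select a page of hits. */
final class ResultPager {
    /** Returns at most resultsCount hits, starting at the zero-based offset resultsStart. */
    static <T> List<T> page(List<T> rankedHits, int resultsStart, int resultsCount) {
        if (resultsStart < 0 || resultsCount < 0) {
            throw new IllegalArgumentException("start and count must be non-negative");
        }
        // Clamp both ends so a request past the end yields a short or empty page.
        int from = Math.min(resultsStart, rankedHits.size());
        int to = from + Math.min(resultsCount, rankedHits.size() - from);
        return new ArrayList<>(rankedHits.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("doc-a", "doc-b", "doc-c", "doc-d", "doc-e");
        System.out.println(page(hits, 0, 3)); // first page
        System.out.println(page(hits, 3, 3)); // second, shorter page
    }
}
```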

How to tweak the ranking of records
One piece of logic goes above and beyond munging the data into a Lucene-friendly form: it is in this class that we calculate any boosts that we want to place on fields, or on the document itself. We ended up with several boosters, and the date boost has been really important for us. Our data goes back a long time, and searches seemed to return “old reports” too often. The date-based booster has gotten around this, allowing the newest content to bubble up.
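The slides do not give the formula behind the date boost, so the following is only one plausible shape for it: an exponential decay with a half-life, producing a multiplier between 1.0 (very old) and 2.0 (published today). The class name, the method name, and the formula itself are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical date-based booster; the decay formula is an assumption, not the authors'. */
final class DateBoostSketch {
    /**
     * boost = 1 + 0.5^(ageDays / halfLifeDays)
     * A document published today gets 2.0, one that is halfLifeDays old gets 1.5,
     * and very old documents approach 1.0 (no boost).
     */
    static double dateBoost(LocalDate published, LocalDate today, double halfLifeDays) {
        if (halfLifeDays <= 0) {
            throw new IllegalArgumentException("halfLifeDays must be positive");
        }
        // Dates in the future are treated as age 0 rather than boosted further.
        long ageDays = Math.max(0, ChronoUnit.DAYS.between(published, today));
        return 1.0 + Math.pow(0.5, ageDays / halfLifeDays);
    }
}
```

The resulting value would be applied as the document-level boost at indexing time, which means the index needs periodic rebuilding for the ages to stay current.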

Lucene Index Anatomy
You can successfully use Lucene without understanding the index directory structure. Feel free to skip this section and treat the directory as a black box. When you are ready to dig deeper, you'll find that the files you created in the last section contain statistics and other data that facilitate rapid searching and ranking. An index contains a sequence of documents. In our indexing example, each document represents information about a text file.

Documents
Documents are the primary retrievable units from a Lucene query.
- Documents consist of a sequence of fields.
- Fields have a name ("contents" and "filename" in our example).
- Field values are a sequence of terms.

Terms
A term is the smallest piece of a particular field. Fields have three attributes of interest:
- Stored -- Original text is available in the documents returned from a search.
- Indexed -- Makes this field searchable.
- Tokenized -- The text added is run through an analyzer and broken into relevant pieces (only makes sense for indexed fields).
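The relationship between the three attributes and the terms a field contributes can be sketched as follows. `FieldSketch` is a hypothetical class, not Lucene's own `Field`, and plain whitespace splitting stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of a field and its three attributes (not Lucene's Field class). */
final class FieldSketch {
    final String name;
    final String value;
    final boolean stored;     // original text is returned with search results
    final boolean indexed;    // field is searchable
    final boolean tokenized;  // value is broken into pieces before indexing

    FieldSketch(String name, String value, boolean stored, boolean indexed, boolean tokenized) {
        if (tokenized && !indexed) {
            throw new IllegalArgumentException("tokenized only makes sense for indexed fields");
        }
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    /** Terms this field contributes to the index. */
    List<String> terms() {
        if (!indexed) {
            return Collections.emptyList();           // not searchable: no terms
        }
        if (!tokenized) {
            return Collections.singletonList(value);  // whole value is a single term
        }
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));  // whitespace split stands in for an analyzer
    }
}
```

In the running example, "contents" would be indexed and tokenized, while "filename" would typically be stored and indexed as a single term.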

Analysis
Tokenized fields are where the real fun happens. In our example, we are indexing the contents of text files. The goal is to make the words in the text file searchable, but for practical purposes it doesn't make sense to index every word. Some words, like "a", "and", and "the", are generally considered irrelevant for searching and can be optimized out -- these are called stop words.
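A minimal sketch of stop-word removal during analysis. It is a toy stand-in rather than one of Lucene's analyzers: the class name is hypothetical, the stop list contains only the three words mentioned above, and tokenization is a simple split on non-alphanumeric characters after lowercasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer: lowercases, splits on non-alphanumerics, drops stop words. */
final class StopWordAnalyzerSketch {
    // Only the three stop words named in the text; real stop lists are longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "the"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            // split() can yield an empty leading token when the text starts with punctuation
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

The same analysis has to be applied to the query string, otherwise a query term such as "WebLogic" would never match the lowercased term stored in the index.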

Some Inverted Index Strategies
batch-based: use file-sorting algorithms (textbook)
  + fastest to build
  + fastest to search
  - slow to update
b-tree based: update in place
  + fast to search
  - update/build does not scale
  - complex implementation
segment based: lots of small indexes (Verity)
  + fast to build
  + fast to update
  - slower to search

What you have to do
Lucene handles the indexing, searching and retrieving, but it doesn't handle:
- managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
- selecting the data files
- parsing the data files
- getting the search string from the user
- displaying the search results to the user

Indexing In Depth
two basic algorithms:
- make an index for a single document
- merge a set of indices
incremental algorithm:
- maintain a stack of segment indices
- create an index for each incoming document
- push new indexes onto the stack
- let b=10 be the merge factor; M=∞
    for (size = 1; size < M; size *= b) {
      if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
      } else {
        break;
      }
    }
optimization: single-doc indexes kept in RAM, saves system calls
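The incremental algorithm above can be sketched in runnable form by tracking only the document count of each segment on the stack. `SegmentStackSketch` is a hypothetical name, and a real implementation would merge actual index files rather than add up sizes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Tracks segment sizes (in documents) under the stack-based merge policy. */
final class SegmentStackSketch {
    private final int mergeFactor;                        // b in the pseudocode
    private final Deque<Long> stack = new ArrayDeque<>(); // head of the deque = top of the stack

    SegmentStackSketch(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("merge factor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
    }

    /** Indexes one document: push a single-doc segment, then merge while possible. */
    void addDocument() {
        stack.push(1L);
        // M is infinite in the pseudocode, so the loop ends only via break.
        for (long size = 1; ; size *= mergeFactor) {
            if (!topSegmentsAllHaveSize(size)) {
                break;
            }
            for (int i = 0; i < mergeFactor; i++) {
                stack.pop();                              // pop the b equal-sized segments
            }
            stack.push(size * mergeFactor);               // push the merged segment
        }
    }

    private boolean topSegmentsAllHaveSize(long size) {
        if (stack.size() < mergeFactor) {
            return false;
        }
        Iterator<Long> fromTop = stack.iterator();        // iterates from the top downwards
        for (int i = 0; i < mergeFactor; i++) {
            if (fromTop.next() != size) {
                return false;
            }
        }
        return true;
    }

    /** Segment sizes, top of the stack first. */
    List<Long> segmentSizes() {
        return new ArrayList<>(stack);
    }
}
```

With a merge factor of 2, the stack behaves like a binary counter: after 7 documents it holds segments of 1, 2 and 4 documents, and the 8th document collapses them into a single segment of 8.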

Lucene's Disjunctive Search Algorithm
- since all postings must be processed, the goal is to minimize per-posting computation
- merges postings through a fixed-size array of accumulator buckets
- performs boolean logic with bit masks
- scales well with large queries
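A simplified sketch of the bucket idea, not Lucene's actual code: postings are processed one window of doc IDs at a time, each posting costs one OR and one increment into a fixed-size bucket table, and required/prohibited clauses are checked with bit masks when the window is flushed. The class name, the tiny table size, and the use of a matched-term count as the score are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy disjunctive scorer using a fixed-size table of accumulator buckets. */
final class BucketDisjunctionSketch {
    private static final int BUCKETS = 8; // deliberately tiny so the demo spans several windows

    /**
     * postings[i] holds the ascending, non-negative doc IDs containing query term i.
     * Bit i of requiredMask / prohibitedMask marks term i as required / prohibited.
     * Returns doc ID -> number of matching terms, in increasing doc ID order.
     */
    static Map<Integer, Integer> search(int[][] postings, int requiredMask, int prohibitedMask) {
        if (postings.length > 31) {
            throw new IllegalArgumentException("this sketch supports at most 31 terms");
        }
        int[] bits = new int[BUCKETS];            // which terms hit each bucket
        int[] scores = new int[BUCKETS];          // accumulated score per bucket
        int[] cursor = new int[postings.length];  // next unread posting per term
        Map<Integer, Integer> hits = new LinkedHashMap<>();

        int maxDoc = -1;
        for (int[] list : postings) {
            if (list.length > 0) {
                maxDoc = Math.max(maxDoc, list[list.length - 1]);
            }
        }

        for (int windowStart = 0; windowStart <= maxDoc; windowStart += BUCKETS) {
            int windowEnd = windowStart + BUCKETS;
            // Accumulate: every posting in the window costs one OR and one increment.
            for (int term = 0; term < postings.length; term++) {
                int[] list = postings[term];
                while (cursor[term] < list.length && list[cursor[term]] < windowEnd) {
                    int slot = list[cursor[term]] - windowStart;
                    bits[slot] |= 1 << term;
                    scores[slot]++;
                    cursor[term]++;
                }
            }
            // Flush: apply the boolean logic with bit masks, then reset the buckets.
            for (int slot = 0; slot < BUCKETS; slot++) {
                int b = bits[slot];
                if (b != 0 && (b & requiredMask) == requiredMask && (b & prohibitedMask) == 0) {
                    hits.put(windowStart + slot, scores[slot]);
                }
                bits[slot] = 0;
                scores[slot] = 0;
            }
        }
        return hits;
    }
}
```

Because the table has a fixed size, memory use does not grow with the number of query terms, which is one way to read the claim that the approach scales well with large queries.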

Summary
Lucene is a good fit:
- for fast retrieval of key-value pairs
- for read-oriented data