Lucene in action. Information Retrieval, academic year 2010-11. P. Ferragina, U. Scaiella. Dipartimento di Informatica, Università di Pisa

What is Lucene
  Full-text search library
  Indexing + Searching components
  100% Java, no dependencies, no config files
  No crawler, document parsing, or search UI: see Apache Nutch, Apache Solr, Apache Tika
  Probably the most widely used search engine library
  Applications: many (famous) websites, many commercial products

Basic Application
  1. Parse docs
  2. Write Index
  3. Make query
  4. Display results
  [Diagram: IndexWriter writes the Index; IndexSearcher reads it]
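The four steps above can be sketched without Lucene at all. The class below is a toy inverted index in plain Java, not Lucene's implementation: the name TinyIndex, its methods, and the AND-only query semantics are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene): parse, write index, query, display. */
final class TinyIndex {
    // term -> sorted ids of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // original text of each document, addressed by id
    private final List<String> stored = new ArrayList<>();

    /** Steps 1-2: split the text into lowercase terms and record them in the index. */
    int add(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Step 3: AND query, returning ids of documents that contain every query term. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so the postings stay untouched
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    /** Returns the stored text of a document. */
    String get(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("Official Michael Jackson website");
        index.add("Fender Music, the guitar company");
        index.add("Michael Jackson's new video");
        // Step 4: display results
        for (int id : index.search("michael jackson")) {
            System.out.println(id + " - " + index.get(id));
        }
    }
}
```

In Lucene the same roles are played by IndexWriter (add) and IndexSearcher (search), with an Analyzer replacing the naive lowercase-and-split step.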

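Indexing
    // Imports for the pre-4.0 Lucene API used on this slide
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.Field.Index;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.index.IndexWriter;

    // create the index (directory and analyzer are set up elsewhere)
    IndexWriter writer = new IndexWriter(directory, analyzer);

    // create the document structure once; the field objects are reused below
    Document doc = new Document();
    NumericField id = new NumericField("id", Store.YES, true); // stored and indexed
    Field title = new Field("title", "", Store.YES, Index.ANALYZED);
    Field body = new Field("body", "", Store.NO, Index.ANALYZED);
    doc.add(id);
    doc.add(title);
    doc.add(body);

    // scroll all documents, fill the fields and index them
    // (documents, Article and parse() are placeholders from the slide)
    for (Object document : documents) {
        Article a = parse(document);
        id.setIntValue(a.id);
        title.setValue(a.title);
        body.setValue(a.body);
        writer.addDocument(doc); // doc is just a container
    }

    // IMPORTANT! close the writer to commit all operations
    writer.close();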

How to represent text?
  TEXT                                    QUERY              ISSUE
  … Official Michael Jackson website …    michael jackson    Lower and upper case
  … Michael Jackson's new video …         michael jackson    Tokenizer issues
  … Fender Music, the guitar company …    Fender guitars     Stemming
  … Microsoft WindowsXP …                 windows xp         Word delimiter
  … the cat is on the table …             cat table          Dictionary size: stopwords

Analyzer
  Text processing pipeline
  [Diagram: String → Tokenizer → TokenStream → TokenFilter1 → TokenStream → TokenFilter2 → TokenStream → TokenFilter3 → tokens]
  [Diagram: Indexing: Docs → Analyzer → tokens → Index]

Analyzer
  [Diagram: Searching: Query String → Analyzer → tokens → Index → Results]

Analyzer
  Built-in Tokenizers
    – WhitespaceTokenizer, LetterTokenizer, …
    – StandardTokenizer: good for most European-language docs
  Built-in TokenFilters
    – LowerCase, Stemming, Stopwords, AccentFilter, many others in contrib packages (language-specific)
  Built-in Analyzers
    – Keyword, Simple, Standard, …
    – PerField wrapper

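Analyzer example
  TEXT: the LexCorp BFG-900 is a printer
    WhitespaceTokenizer    the | LexCorp | BFG-900 | is | a | printer
    WordDelimiterFilter    the | Lex, LexCorp | Corp | BFG | 900 | is | a | printer
    LowerCaseFilter        the | lex, lexcorp | corp | bfg | 900 | is | a | printer
    StopwordFilter         lex, lexcorp | corp | bfg | 900 | printer
    StemmerFilter          lex, lexcorp | corp | bfg | 900 | print-

  QUERY: Lex corp bfg900 printers
    WhitespaceTokenizer    Lex | corp | bfg900 | printers
    WordDelimiterFilter    Lex | corp | bfg | 900 | printers
    LowerCaseFilter        lex | corp | bfg | 900 | printers
    StopwordFilter         lex | corp | bfg | 900 | printers
    StemmerFilter          lex | corp | bfg | 900 | print-

  MATCH!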
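The LexCorp walk-through can be reproduced with a small plain-Java pipeline. This is a sketch only: it does not use Lucene's Tokenizer or TokenFilter classes, the class and method names are hypothetical, the stopword list is just the three words dropped in the example, and the stemmer is a toy suffix stripper rather than a real stemming algorithm. The catenate flag is an assumption that mimics the example, where the joined form "LexCorp" appears on the text side only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy analysis chain (hypothetical, not Lucene): whitespace, word delimiter, lowercase, stopwords, stemming. */
final class AnalyzerSketch {
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "is", "a"));

    /** Splits on non-alphanumerics, lower-to-upper case changes and letter/digit changes. */
    static List<String> splitWord(String token, boolean catenate) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(parts, cur);
                continue;
            }
            if (cur.length() > 0) {
                char p = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(p) && Character.isUpperCase(c);
                boolean kindChange = Character.isDigit(p) != Character.isDigit(c);
                if (caseChange || kindChange) flush(parts, cur);
            }
            cur.append(c);
        }
        flush(parts, cur);
        // Optionally add the joined form when every part is alphabetic (LexCorp, not BFG900).
        boolean allLetters = parts.stream().allMatch(s -> Character.isLetter(s.charAt(0)));
        if (catenate && parts.size() > 1 && allLetters) {
            parts.add(String.join("", parts));
        }
        return parts;
    }

    private static void flush(List<String> parts, StringBuilder cur) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    /** Toy stemmer: strips a trailing "s", then a trailing "er". */
    static String stem(String term) {
        if (term.length() > 3 && term.endsWith("s")) term = term.substring(0, term.length() - 1);
        if (term.length() > 4 && term.endsWith("er")) term = term.substring(0, term.length() - 2);
        return term;
    }

    /** Runs the whole chain over one string. */
    static List<String> analyze(String text, boolean catenate) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            for (String part : splitWord(raw, catenate)) {
                String lower = part.toLowerCase();
                if (!STOPWORDS.contains(lower)) out.add(stem(lower));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> doc = analyze("the LexCorp BFG-900 is a printer", true);
        List<String> query = analyze("Lex corp bfg900 printers", false);
        System.out.println("doc tokens:   " + doc);
        System.out.println("query tokens: " + query);
        System.out.println("match: " + doc.containsAll(query));
    }
}
```

The point of the exercise is that the same chain runs at index time and at query time, so that differently written surface forms end up as the same terms.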

Field Options
  Field.Store – YES, NO
  Field.Index – ANALYZED, NOT_ANALYZED, NO
  Field.TermVector – NO, YES (with positions and/or offsets)

Analysis tips
  Use PerFieldAnalyzerWrapper
    – Don't analyze keyword fields
    – Store only needed data
  Use NumberUtils for numbers
  Add same field more than once, analyze it differently
    – Boost exact case/stem matches

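Searching
    // Imports for the pre-4.0 Lucene API used on this slide
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Best practice: keep one IndexSearcher and reuse it
    IndexSearcher s = new IndexSearcher(directory);

    // Build the query from the input string; "body" is the default field
    QueryParser qParser = new QueryParser("body", analyzer);
    Query q = qParser.parse("title:Jaguar");

    // Do the search
    TopDocs hits = s.search(q, maxResults);
    System.out.println("Results: " + hits.totalHits);

    // Scroll all retrieved docs
    for (ScoreDoc hit : hits.scoreDocs) {
        Document doc = s.doc(hit.doc);
        System.out.println(doc.get("id") + " - " + doc.get("title")
            + " Relevance=" + hit.score);
    }
    s.close();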

Building the Query
  Built-in QueryParser
    – does text analysis and builds the Query object
    – good for human input, debugging
    – not all query types supported
    – specific syntax: see JavaDoc for QueryParser
  Programmatic query building, e.g.:
      Query q = new TermQuery(new Term("title", "jaguar"));
    – many types: Boolean, Term, Phrase, SpanNear, …
    – no text analysis!

Scoring
  Lucene = Boolean Model + Vector Space Model
  Similarity = Cosine Similarity
    – Term Frequency
    – Inverse Document Frequency
    – Other stuff
        Length Normalization
        Coord. factor (matching terms in OR queries)
        Boosts (per query or per doc)
  To build your own: implement Similarity and call Searcher.setSimilarity
  For debugging: Searcher.explain(Query q, int doc)
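The vector space part of the scoring slide can be illustrated with a textbook TF-IDF cosine similarity. This sketch is not Lucene's Similarity implementation: it leaves out length normalization, the coord factor and boosts, and the class name, method names and the exact weighting (raw term frequency times 1 + ln(N / (df + 1))) are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Textbook TF-IDF cosine similarity (hypothetical sketch, not Lucene's scoring formula). */
final class TfIdfSketch {

    /** Inverse document frequency: terms found in fewer documents weigh more. */
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) df++;
        }
        return 1.0 + Math.log((double) corpus.size() / (df + 1));
    }

    /** Builds a sparse vector term -> tf * idf for one token list. */
    static Map<String, Double> vector(List<String> tokens, List<List<String>> corpus) {
        Map<String, Double> v = new HashMap<>();
        for (String t : tokens) {
            v.merge(t, 1.0, Double::sum);               // raw term frequency
        }
        for (Map.Entry<String, Double> e : v.entrySet()) {
            e.setValue(e.getValue() * idf(e.getKey(), corpus));
        }
        return v;
    }

    /** Cosine of the angle between two sparse vectors; 0 when they share no term. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("jaguar", "car", "speed"),
            Arrays.asList("jaguar", "cat", "jungle"),
            Arrays.asList("guitar", "music"));
        Map<String, Double> query = vector(Arrays.asList("jaguar", "cat"), corpus);
        for (int i = 0; i < corpus.size(); i++) {
            double score = cosine(query, vector(corpus.get(i), corpus));
            System.out.println("doc " + i + " score=" + score);
        }
    }
}
```

With this toy corpus the document that contains both query terms ranks first, the one that contains only "jaguar" ranks second, and the unrelated one scores zero. In Lucene itself, Searcher.explain shows how the real score of a hit is composed.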

Performance
  Indexing
    – Batch indexing
    – Raise mergeFactor
    – Raise maxBufferedDocs
  Searching
    – Reuse IndexSearcher
    – Optimize: IndexWriter.optimize()
    – Use cached filters: QueryFilter
  [Diagram: index structure, showing a segment labeled Segment_3]

Exercises