Assignment 2: Full text search with Lucene Mathias Mosolf, Alexander Frenzel.


General
o Problems?
o Which documentation?
  o Getting Started
  o mainly examples
  o Javadoc

Indexer
o XML parsing with Digester
o Lucene classes used:
  o org.apache.lucene.index.IndexWriter
  o org.apache.lucene.store.Directory
  o org.apache.lucene.analysis.Analyzer
  o org.apache.lucene.analysis.WhitespaceAnalyzer
  o org.apache.lucene.document.Document
  o org.apache.lucene.document.Field

    // Tokenize on whitespace only; the text is lowercased by hand below.
    Analyzer analyzer = new WhitespaceAnalyzer();
    boolean createFlag = true; // create a new index instead of appending
    writer = new IndexWriter(indexDir, analyzer, createFlag,
            IndexWriter.MaxFieldLength.UNLIMITED);
    [..]
    digester.parse(xml);
    [..]
    public void addMedlineDocument(MedlineDocument doc) throws IOException {
        this.counter++;
        String title = doc.getTitle().replaceAll("\\ ", " ").toLowerCase();
        String text = ((doc.getAbstract() != null) ? doc.getAbstract() : "")
                .replaceAll("\\ ", " ").toLowerCase();
        Document medlineDocument = new Document();
        // The PMID is stored verbatim and indexed as one untokenized term.
        medlineDocument.add(new Field("pmid", doc.getPmid(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        medlineDocument.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED));
        medlineDocument.add(new Field("abstract", text,
                Field.Store.YES, Field.Index.ANALYZED));
        // The separating space keeps the last title word from fusing
        // with the first abstract word into a single token.
        medlineDocument.add(new Field("combined", title + " " + text,
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(medlineDocument);
    }
    [..]
    // optimize and close the index
    writer.optimize();
    writer.close();
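The indexer above relies on whitespace-only tokenization plus manual lowercasing. The following is a minimal sketch of what that combination does to a piece of text, written in plain Java so it runs without Lucene. The class `WhitespaceTokens` and its method `tokenize` are hypothetical names introduced for illustration; they are not Lucene APIs and only approximate the behaviour of `WhitespaceAnalyzer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for whitespace analysis plus manual lowercasing; not Lucene code.
final class WhitespaceTokens {
    // Splits on runs of whitespace only, so punctuation stays attached to its word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) { // blank input yields one empty string; skip it
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Duchenne's muscular  Dystrophy, a review"));
    }
}
```

In this sketch the token for "Dystrophy," keeps its trailing comma and "Duchenne's" keeps its apostrophe. With tokenization of this kind, a query term only matches if it carries the same punctuation as the indexed text.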

BoolSearch
o Collect IDs in a Set
o Lucene classes used:
  o org.apache.lucene.search.IndexSearcher
  o org.apache.lucene.index.Term
  o org.apache.lucene.search.BooleanClause
  o org.apache.lucene.search.BooleanQuery
  o org.apache.lucene.search.ScoreDoc
  o org.apache.lucene.search.TermQuery
  o org.apache.lucene.search.TopDocs

    searcher = new IndexSearcher(indexDir);
    pmids = new HashSet<String>();

    public void search(String field, String[] keywords) throws IOException {
        BooleanQuery query = new BooleanQuery();
        // build the Boolean query: every keyword must occur (AND)
        for (String word : keywords) {
            TermQuery tq = new TermQuery(new Term(field, word.toLowerCase()));
            query.add(tq, BooleanClause.Occur.MUST);
        }
        // extract PMIDs; maxDoc() as the hit limit returns every match
        TopDocs docs = this.searcher.search(query, searcher.maxDoc());
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            pmids.add(doc.get("pmid")); // add to set
        }
    }
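The Boolean search above requires every keyword to occur (`Occur.MUST`) and collects the matching PMIDs in a set. As a rough illustration of that AND semantics, here is a toy inverted index in plain Java that intersects per-term PMID sets. It is a sketch under simplifying assumptions, not how Lucene evaluates a `BooleanQuery` internally; the class `ToyBooleanIndex` and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy inverted index; not part of Lucene.
final class ToyBooleanIndex {
    // term -> set of PMIDs whose text contains the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String pmid, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new HashSet<String>()).add(pmid);
            }
        }
    }

    // AND semantics: a PMID is returned only if every keyword occurs in its text.
    Set<String> searchAll(String... keywords) {
        Set<String> result = null;
        for (String word : keywords) {
            Set<String> hits = postings.getOrDefault(
                    word.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
            if (result == null) {
                result = new HashSet<String>(hits); // first keyword seeds the result
            } else {
                result.retainAll(hits);             // later keywords narrow it
            }
        }
        return result == null ? new HashSet<String>() : result;
    }

    public static void main(String[] args) {
        ToyBooleanIndex index = new ToyBooleanIndex();
        index.add("1", "protein complex structure");
        index.add("2", "protein folding");
        System.out.println(index.searchAll("protein", "complex"));
    }
}
```

A set is a natural container here for the same reason as in the slide: a PMID that matches several times is still reported once.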

PhraseSearch
o PhraseQuery vs SpanQuery
o Lucene classes used:
  o org.apache.lucene.search.IndexSearcher
  o org.apache.lucene.index.Term
  o org.apache.lucene.search.spans.SpanNearQuery
  o org.apache.lucene.search.spans.SpanQuery
  o org.apache.lucene.search.spans.SpanTermQuery
  o org.apache.lucene.search.spans.Spans

    searcher = new IndexSearcher(indexDir);
    pmids = new HashSet<String>();

    public void search(String field, String[] phrase) throws IOException {
        // generate the query: one SpanTermQuery per word of the phrase
        int l = phrase.length;
        SpanQuery[] sq = new SpanQuery[l];
        for (int i = 0; i < l; i++) {
            sq[i] = new SpanTermQuery(new Term(field, phrase[i]));
        }
        // slop 0, in order: the words must be adjacent and in sequence
        SpanNearQuery query = new SpanNearQuery(sq, 0, true);
        // search
        Spans sp = query.getSpans(this.searcher.getIndexReader());
        int id = -1;
        Document doc;
        // runs through all occurrences of the phrase
        while (sp.next()) {
            this.occ++; // number of occurrences
            if (id != sp.doc()) { // next doc
                id = sp.doc(); // save current id
                // add pmid
                doc = searcher.doc(id);
                this.pmids.add(doc.get("pmid"));
            }
        }
    }
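The phrase search above asks for the words adjacent and in order (slop 0, ordered), counts every occurrence, and records each PMID once. The sketch below reproduces that bookkeeping in plain Java with a brute-force scan over token arrays. This is an assumption-laden simplification: Lucene works on indexed term positions rather than scanning text, and the class `ToyPhraseSearch` with its methods is hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical brute-force phrase matcher; not part of Lucene.
final class ToyPhraseSearch {
    private final Map<String, String[]> docs = new LinkedHashMap<>(); // pmid -> tokens
    private final Set<String> pmids = new HashSet<>();
    private int occurrences;

    void add(String pmid, String text) {
        docs.put(pmid, text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Finds every position where the phrase words appear adjacent and in order.
    void search(String... phrase) {
        pmids.clear();
        occurrences = 0;
        if (phrase.length == 0) {
            return; // an empty phrase matches nothing in this sketch
        }
        for (Map.Entry<String, String[]> entry : docs.entrySet()) {
            String[] tokens = entry.getValue();
            for (int start = 0; start + phrase.length <= tokens.length; start++) {
                if (matchesAt(tokens, start, phrase)) {
                    occurrences++;             // count each occurrence
                    pmids.add(entry.getKey()); // the set keeps each PMID once
                }
            }
        }
    }

    private static boolean matchesAt(String[] tokens, int start, String[] phrase) {
        for (int i = 0; i < phrase.length; i++) {
            if (!tokens[start + i].equals(phrase[i].toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    int occurrences() { return occurrences; }

    Set<String> pmids() { return pmids; }

    public static void main(String[] args) {
        ToyPhraseSearch search = new ToyPhraseSearch();
        search.add("10", "the protein complex binds the protein complex");
        search.search("protein", "complex");
        System.out.println(search.occurrences() + " occurrence(s) in " + search.pmids());
    }
}
```

The occurrence count and the PMID set can differ, as in the slide's code: a document that contains the phrase twice adds two occurrences but only one PMID.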

Tests/Examples
Examples
o "reduce the appeal of" => ~0.399s
o "duchenne's muscular" => ~0.398s
o dicyclocarbodimide => ~0.399s
o experiment => ~0.458s
o protein complex => ~0.446s
o duchenne's disease => ~0.447s
"Complex"
o and => ~1.446s
o and the => ~1.298s
o and to the => ~1.212s
o and to the you => ~0.449s
o "operation, the patient presents no signs of" => ~0.400s
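The slides do not say how these timings were taken. One simple way to collect wall-clock figures in the same "query => ~seconds" form is sketched below; the class `QueryTimer` and the placeholder workload are hypothetical, and whether the original numbers include start-up or index-opening time is not known.

```java
// Hypothetical timing helper; the measured workload is a placeholder, not a Lucene query.
final class QueryTimer {
    // Runs the given task once and returns the elapsed wall-clock time in seconds.
    static double secondsFor(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        double seconds = secondsFor(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i; // stand-in for running a search
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable; keeps the loop from being optimized away");
            }
        });
        System.out.printf(Locale.ROOT, "%s => ~%.3fs%n", "example query", seconds);
    }

    private static final class Locale {
        // Local alias so the sketch needs no import line.
        static final java.util.Locale ROOT = java.util.Locale.ROOT;
    }
}
```

A single run per query, as here, is sensitive to caching and JVM warm-up, so repeated runs would give steadier numbers.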