Word Up! Using Lucene for full-text search of your data set.


Full-text search
- Review of full-text search options
- Focus on Lucene
- Integrating Lucene with JPA/Hibernate

Full-text search options
- ‘LIKE’ queries
- SQL extensions
- Kludge with web search engine
- Kludge with web search appliance
- Embeddable search library

‘LIKE’ queries

- Simple, straightforward
- Fast, easy to implement
- Large result set
- Limited fuzziness (wildcard or regex)
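The "limited fuzziness" point can be made concrete with a small sketch. The helper `likeToRegex` below is hypothetical (it is not part of JDBC or any database driver). It assumes only the standard `%` and `_` wildcards, no escape character, and case-insensitive matching, which in a real database depends on the collation. It suggests why a LIKE query finds substrings but misses a simple misspelling.

```java
import java.util.regex.Pattern;

// Toy model of SQL LIKE matching; a hypothetical helper, not a database API.
final class LikeDemo {

    // Translate a LIKE pattern to a regex: % = any run of characters, _ = exactly one.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                // Quote everything else so regex metacharacters are matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Assumption: case-insensitive comparison (collation-dependent in practice).
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    static boolean like(String value, String pattern) {
        return likeToRegex(pattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        // Substring search works.
        System.out.println(like("Apache Lucene in Action", "%lucene%"));
        // A one-letter misspelling in the data is not found.
        System.out.println(like("Apache Lucine in Action", "%lucene%"));
    }
}
```

The wildcards only widen what surrounds the search term; they do nothing for a typo inside the term itself.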

Full-text search extensions
- No standard syntax (Sybase, MSSQL, DB2, etc. all different)
- Administrative overhead for text search indices
- Other limitations

Kludge with search engine
- External indexing/search software: ht://Dig, mnoGoSearch, Sphinx, Xapian
- Not necessarily pure Java
- Can be database-intensive
- Lag in updating search index

Kludge with search appliance
- “Black-box” solutions: Thunderstone, Google Search Appliance
- Your data set mixes with public content
- Doesn’t always work as advertised
- Can’t fine-tune search

Embeddable search library

Search library
- Example: Apache Lucene
- Deploys as part of your application
- 100% Java
- Fuzzy full-text search (Levenshtein algorithm)
- Searches against text, numeric, boolean fields with multiple options
- Can be integrated with JPA/Hibernate via Hibernate Search, Compass
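For readers unfamiliar with the Levenshtein algorithm mentioned above, here is a minimal sketch of the textbook dynamic-programming edit distance. It is a teaching version only, not Lucene's implementation, and the class and method names are made up for this example.

```java
// Textbook Levenshtein edit distance; a teaching sketch, not Lucene's own code.
final class Levenshtein {

    // Number of single-character insertions, deletions, or substitutions
    // needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a -> first j chars of b: j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // first i chars of a -> empty prefix of b: i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucine"));  // one substitution
        System.out.println(distance("kitten", "sitting")); // classic example, distance 3
    }
}
```

A fuzzy search accepts terms whose distance from the query term is small, which is how a misspelled word can still find the intended record.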

About Lucene
- Search index stored on file system (also JDBC and BDB options)
- Can store/retrieve data to/from search index (Lucene Projections)
- Can index HTML, XML, Office docs, PDFs, Exchange mail with external tools
- Supports extended and multi-byte character sets by default

More about Lucene
- Indexes records as Lucene Document objects
- A Lucene Document doesn’t have to be a literal document; it can represent any arbitrary object
- A Document can have any number of name-value pairs
- Synchronizing your data with the search index is someone else’s problem …
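The idea of a document as a bag of name-value pairs can be sketched without Lucene at all. The `ToyIndex` class below is a hypothetical, in-memory stand-in (none of its names come from the Lucene API). It assumes whitespace and punctuation tokenizing and lowercase terms, and it keeps the original values so they can be read back from the index, loosely in the spirit of stored fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory index; illustrates the concept only, not Lucene's API or storage.
final class ToyIndex {
    // Each "document" is just a set of name-value pairs.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds a document and returns its id.
    int add(Map<String, String> document) {
        int id = documents.size();
        documents.add(new LinkedHashMap<>(document));
        for (Map.Entry<String, String> field : document.entrySet()) {
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(id);
                }
            }
        }
        return id;
    }

    // Exact-term lookup within one named field.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Reads the stored name-value pairs back without consulting any database.
    Map<String, String> stored(int id) {
        return documents.get(id);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> record = new LinkedHashMap<>();
        record.put("name", "Drill site alpha"); // made-up sample data
        record.put("county", "Travis");
        int id = index.add(record);
        System.out.println(index.search("name", "drill").contains(id));
        System.out.println(index.stored(id).get("county"));
    }
}
```

Nothing here notices when the source record changes, which is the synchronization problem the last bullet refers to.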

Integrating with JPA / Hibernate
- Most common method: Hibernate Search
- Supports only Hibernate provider
- Automatically updates search index when object persisted to database
- Entity classes mapped to separate indexes
- Entity fields mapped to Lucene index fields using Java annotations
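How annotations can drive a field-to-index mapping is easier to see in miniature. The sketch below uses plain Java reflection and two made-up annotations, `@ToyIndexed` and `@ToyField`. These are stand-ins defined in the example itself, not the Hibernate Search annotation types, and the `Place` class and its values are invented for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for mapping annotations; NOT the Hibernate Search types.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ToyIndexed {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ToyField {}

// Invented sample entity.
@ToyIndexed
class Place {
    @ToyField long id = 42L;
    @ToyField String name = "Well 7";
    String internalNote = "not indexed"; // no annotation, so the mapper skips it
}

final class AnnotationMapper {

    // Collects the annotated fields of an entity as name-value pairs.
    static Map<String, String> toIndexFields(Object entity) {
        Class<?> type = entity.getClass();
        if (!type.isAnnotationPresent(ToyIndexed.class)) {
            throw new IllegalArgumentException(type.getName() + " is not marked for indexing");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (Field f : type.getDeclaredFields()) {
            if (f.isAnnotationPresent(ToyField.class)) {
                try {
                    f.setAccessible(true);
                    fields.put(f.getName(), String.valueOf(f.get(entity)));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot read field " + f.getName(), e);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints the name-value pairs that would be handed to the index.
        System.out.println(toIndexFields(new Place()));
    }
}
```

A real integration would also hook into the persistence lifecycle so the index is updated whenever an entity is saved. This sketch covers only the mapping step.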

Integrating with JPA/Hibernate …
- Alternate method: Compass Project
- Supports Hibernate, OpenJPA, others
- No release since 2009; effectively unsupported

Annotated class
    ... schema="MAPLINK")
    public class Marker extends MarkerA implements ...
    @Field(store=Store.YES)
    private long ...
    ... nullable ...
    private Double ...
    ... nullable ...
    private Double ...
- @Indexed: tells Hibernate that this entity class should be indexed

Annotated class
- @Field: tells Hibernate to create a matching name-value pair in the search index for the annotated field of this entity class
- Store.YES: stores the value for retrieval directly from the index, without touching the database

Annotated class
- @NumericField: indexes the value as a number, which enables greater than / less than / range searches
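Why numeric indexing matters for range searches can be shown with sorted maps alone. The sketch below is not Lucene code. It compares a range lookup over keys kept as numbers with the same lookup over keys kept as plain text, using made-up sample values. Lucene's numeric fields use their own encoding; this only illustrates the ordering problem they avoid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy comparison of numeric vs. textual ordering; sample data is invented.
final class RangeDemo {

    // Keys sorted as numbers: 9.5 < 30.25 < 100.0
    static NavigableMap<Double, String> sampleNumeric() {
        NavigableMap<Double, String> index = new TreeMap<>();
        index.put(9.5, "A");
        index.put(30.25, "B");
        index.put(100.0, "C");
        return index;
    }

    // The same keys sorted as text: "100.0" < "30.25" < "9.5"
    static NavigableMap<String, String> sampleText() {
        NavigableMap<String, String> index = new TreeMap<>();
        index.put("9.5", "A");
        index.put("30.25", "B");
        index.put("100.0", "C");
        return index;
    }

    // Inclusive range lookup over numeric keys.
    static List<String> numericRange(NavigableMap<Double, String> index, double lo, double hi) {
        return new ArrayList<>(index.subMap(lo, true, hi, true).values());
    }

    // Inclusive range lookup over text keys, compared character by character.
    static List<String> textRange(NavigableMap<String, String> index, String lo, String hi) {
        return new ArrayList<>(index.subMap(lo, true, hi, true).values());
    }

    public static void main(String[] args) {
        // Numeric keys: 9.5 and 30.25 fall between 5 and 50.
        System.out.println(numericRange(sampleNumeric(), 5, 50));
        // Text keys: nothing sorts between "5" and "50", so the same range finds nothing.
        System.out.println(textRange(sampleText(), "5", "50"));
    }
}
```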

Let’s take a Luke at the index …

Practical search exercise

Questions!