Searching and Indexing

Slides:



Advertisements
Similar presentations
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Advertisements

Advanced Indexing Techniques with Apache Lucene - Payloads Advanced Indexing Techniques with Michael Busch
Advanced Indexing Techniques with
Lucene Part3‏. Lucene High Level Infrastructure When you look at building your search solution, you often find that the process is split into two main.
For ITCS 6265 Professor: Wensheng Wu Present by TA: Xu Fei.
Information Retrieval in Practice
XP Chapter 3 Succeeding in Business with Microsoft Office Access 2003: A Problem-Solving Approach 1 Analyzing Data For Effective Decision Making.
Tutorial 11: Connecting to External Data
Overview of Search Engines
Exploring Microsoft® Office Grauer and Barber 1 Committed to Shaping the Next Generation of IT Experts. Robert Grauer and Maryann Barber Using.
Simple Web SQLite Manager/Form/Report
Implementing search with free software An introduction to Solr By Mick England.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.
Word Up! Using Lucene for full-text search of your data set.
Apache Lucene in LexGrid. Lucene Overview High-performance, full-featured text search engine library. Written entirely in Java. An open source project.
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć – sematext.com.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
Elasticsearch in Dashboard Data Management Applications David Tuckett IT/SDC 30 August 2013 (Appendix 11 November 2013)
1 Lucene Jianguo Lu School of Computer Science University of Windsor.
Building Search Portals With SP2013 Search. 2 SharePoint 2013 Search  Introduction  Changes in the Architecture  Result Sources  Query Rules/Result.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Revolutionizing enterprise web development Searching with Solr.
Searching Business Data with MOSS 2007 Enterprise Search Presenter: Corey Roth Enterprise Consultant Stonebridge Blog:
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to CS520/CS596_026 Lecture Two Gordon Tian Fall 2015.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
“ Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and.NET ”
Design a full-text search engine for a website based on Lucene
DAY 21: MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Aliya Farheen October 29,2015.
807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.
Clusterpoint Margarita Sudņika ms RDBMS & NoSQL Databases & tables → Document stores Columns, rows → Schemaless documents Scales UP → Scales UP.
Lucene Jianguo Lu.
1 CS 8803 AIAD (Spring 2008) Project Group#22 Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain Smart Seek.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
CS520 Web Programming Full Text Search Chengyu Sun California State University, Los Angeles.
Leveraging SharePoint Search In SharePoint 2013 Jameson Bozeman.
Introduction to Enterprise Search Corey Roth Blog: Twitter: twitter.com/coreyrothtwitter.com/coreyroth.
1 Using the Lucene Search Engine. 2 Team Phil Corcoran Project Leader 10 Years Software Telecoms, Finance, Manufacturing Reqs, Design, Test Derek O’ Keeffe.
Data mining in web applications
A presentation on ElasticSearch
Information Retrieval in Practice
PHP using MySQL Database for Web Development (part II)
CS520 Web Programming Full Text Search
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
CS276 Lucene Section.
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Bridging the Data Science and SQL Divide for Practitioners
Experience in CMS with Analytic Tools for Data Flow and HLT Monitoring
Safe by default, optimized for efficiency
CHAPTER 3 Architectures for Distributed Systems
NOSQL databases and Big Data Storage Systems
Database Performance Tuning and Query Optimization
Searching Business Data with MOSS 2007 Enterprise Search
CS6604 Digital Libraries IDEAL Webpages Presented by
Chapter 11: Indexing and Hashing
Ashutosh Rana Rahul Nori 7/17/2018
Introduction to Apache
Lucene/Solr Architecture
Spreadsheets, Modelling & Databases
Getting Started With Solr
Academic & More Group 4 谢知晖 王逸雄 郭嘉宋 程若愚.
NoSQL & Document Stores
Indexing with ElasticSearch
Microsoft Azure Data Catalog
Plug-In Architecture Pattern
Presentation transcript:

Searching and Indexing Indexing is the initial part of all search applications. Its goal is to process the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. The job is simple when the content is already textual in nature and its location is known.

Indexing Steps: acquiring the content. build documents This process gathers and scopes the content that needs to be indexed. build documents The raw content that needs to be indexed has to be translated into the units (usually called documents) used by the search application. document analysis The textual fields in a document cannot be indexed directly. Rather, the text has to be broken into a series of individual atomic elements called tokens. This happens during the document analysis step. Each token corresponds roughly to a word in the language, and the analyzer determines how the textual fields in the document are divided into a series of tokens. index the document The final step is to index the document. During the indexing step, the document is added to the index.

Lucene Lucene is a free, open source project implemented in Java. licensed under Apache Software Foundation. Lucene itself is a single JAR (Java Archive) file, less than 1 MB in size, and with no dependencies, and integrates into the simplest Java stand-alone console program as well as the most sophisticated enterprise application. Rich and powerful full-text search library. Lucene to provide full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, and so on). supporting full-text search using Lucene requires two steps: creating a lucence index creating a lucence index on the documents and/or database objects. Parsing looking up parsing the user query and looking up the prebuilt index to answer the query.

Architecture

c:\docs\einstein.txt: The important thing is not to stop questioning. Query: not c:\docs\einstein.txt: The important thing is not to stop questioning. String comparison slow! Solution: Inverted index c:\docs\shakespeare.txt: To be or not to be. Advanced Indexing Techniques with Apache Lucene - Payloads

Advanced Indexing Techniques with Apache Lucene - Payloads Inverted index Query: not be important is not or questioning stop to the thing 1 0 1 c:\docs\einstein.txt: The important thing is not to stop questioning. c:\docs\shakespeare.txt: To be or not to be. 1 Document IDs Advanced Indexing Techniques with Apache Lucene - Payloads

Advanced Indexing Techniques with Apache Lucene - Payloads Inverted index Query: ”not to” be important is not or questioning stop to the thing 1 0 1 c:\docs\einstein.txt: The important thing is not to stop questioning. 0 1 2 3 4 5 0 1 2 3 4 5 6 7 c:\docs\shakespeare.txt: To be or not to be. 1 Document IDs Advanced Indexing Techniques with Apache Lucene - Payloads

Advanced Indexing Techniques with Apache Lucene - Payloads Inverted index Query: ”not to” be important is not or questioning stop to the thing 1 1 3 4 2 7 6 5 5 c:\docs\einstein.txt: The important thing is not to stop questioning. 0 1 2 3 4 5 1 6 7 c:\docs\shakespeare.txt: To be or not to be. 1 1 0 4 0 1 2 3 4 5 Document IDs Positions Advanced Indexing Techniques with Apache Lucene - Payloads

Advanced Indexing Techniques with Apache Lucene - Payloads Inverted index with Payloads be important is not or questioning stop to the thing 1 1 3 4 2 7 6 5 1 5 c:\docs\einstein.txt: The important thing is not to stop questioning. 0 1 2 3 4 5 6 7 c:\docs\shakespeare.txt: To be or not to be. 1 4 0 1 2 3 4 5 Document IDs Positions Advanced Indexing Techniques with Apache Lucene - Payloads

Creating an Index (IndexWriter Class) The first step in implementing full-text searching with Lucene is to build an index. To create an index, the first thing that need to do is to create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries (i.e., Documents) to this index. You can create an IndexWriter as follows IndexWriter indexWriter = new IndexWriter("index-directory", new StandardAnalyzer(), true);

Parsing the Documents (Analyzer Class) The job of Analyzer is to "parse" each field of your data into indexable "tokens" or keywords. Several types of analyzers are provided out of the box. Table 1 shows some of the more interesting ones. StandardAnalyzer A sophisticated general-purpose analyzer. WhitespaceAnalyzer A very simple analyzer that just separates tokens using white space. StopAnalyzer Removes common English words that are not usually useful for indexing. SnowballAnalyzer An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on).

Adding a Document/object to Index (Document Class) .To index an object, we use the Lucene Document class, to which we add the fields that you want indexed. Document doc = new Document(); doc.add(new Field("description", hotel.getDescription(), Field.Store.YES, Field.Index.TOKENIZED));

Elasticsearch (NO-SQL) What is ElasticSearch ? ● Open Source (No-sql DB) ● Distributed (cloud friendly) ● Highly-available ● Designed to speak JSON (JSON in, JSON out)

Highly available ● For each index you can specify: ● Number of shards – Each index has fixed number of shards ● Number of replicas – Each shard can have 0-many replicas, can be changed dynamically

Admin API ● Indices – Status – CRUD operation – Mapping, Open/Close, Update settings – Flush, Refresh, Snapshot, Optimize ● Cluster – Health – State – Node Info and stats – Shutdown

Rich query API ● There is rich Query DSL for search, includes: ● Queries – Boolean, Term, Filtered, MatchAll, ... ● Filters – And/Or/Not, Boolean, Missing, Exists, ... ● Sort -order:asc, order:desc ● Facets Facets allows to provide aggregated data for the search request –Terms, Term, Terms_stat, Statistical,….

Scripting support ● There is a support for using scripting languages ● mvel (default) ● JS ● Groovy ● Python

ES_Query syntax { “fields”:[“name”,”Id”] } ------------------------------------------- “from”:0, “size”:100,

ES_Query syntax { “fields”:[“name”,”Id”] “query”:{ } ------------------------------------------- {“from”:0, “size”:100, “fields”:[“name”,”Id”], “filtered”:{ “match_all”:{} }, “filter”:{ “terms”:{ “name”:[“ABC”,”XYZ”]

Filter_syntax {“filter”:”terms”:{ “name”:[“ABC”,”XYZ”] } ----------------------- { “filter”: “range”: “from”:0, “to”:10

Facets_Syntax { “facets”:{ “facet_name”:{ “terms”:{ “name”:[”ABC”, ”XYZ”] }

Facets_Syntax { “facets”:{ “facet_name”:{ “term_stats”:{ “key_field”:”name” “value_field”:”salary” }

Indexing and Searching with LUCENE Assignment Indexing and Searching with LUCENE