Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.

What is Apache Solr? Solr is an open source enterprise search platform written in Java, from the Apache Lucene project. Its major features include full-text search, dynamic clustering, and rich document handling. Solr runs as a standalone full-text search server within a servlet container such as Tomcat or Jetty. It uses the Lucene library at its core for full-text indexing and search, and exposes REST-like HTTP/XML and JSON APIs.

Architecture

Features Scalable: Solr scales by distributing work (indexing and query processing) to multiple servers in a cluster. Ready to deploy: Solr is open source, easy to install and configure, and provides a preconfigured example to help you get started. Optimized for search: Solr is fast and can execute complex queries in under a second, often in only tens of milliseconds. Large volumes of documents: Solr is designed to deal with indexes containing many millions of documents. Text-centric: Solr is optimized for searching natural-language text, like emails, web pages, resumes, PDF documents, and social messages such as tweets or blogs. Results sorted by relevance: Solr returns documents in ranked order based on how relevant each document is to the user’s query.

Core The main configuration file in Solr, solrconfig.xml, contains numerous settings, some of which (such as cache management settings) are useful when just starting out, while others are intended for advanced users. Solr adds a simple, declarative way to define the structure of your index and how you want fields to be represented and analyzed: an XML-configuration document named schema.xml. Solr uses schema.xml to represent all of the possible fields and data types necessary to map documents into a Lucene index.
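To make the role of schema.xml concrete, here is a minimal sketch that reads a small schema fragment and lists each field with its type. The fragment is a hypothetical example written for this illustration (the field names and the `text_general` type are assumptions, not taken from the slides), and `field_types` is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml fragment, written only for this illustration.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
  </fields>
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text_general" class="solr.TextField"/>
  </types>
</schema>
"""


def field_types(schema_xml):
    """Return a {field name: field type} mapping for every <field> element."""
    root = ET.fromstring(schema_xml.strip())
    return {field.get("name"): field.get("type") for field in root.iter("field")}


print(field_types(SCHEMA_XML))
```

The point of the sketch is only that the index structure is declared as data: a tool (or Solr itself) can read the fields and their types without any code written against the Lucene API.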

Core (cont.) Solr provides copy and dynamic fields. Copy fields provide a way to take the raw text contents of one or more fields and have them applied to a different field. Dynamic fields allow you to apply the same field type to many different fields without explicitly declaring them in schema.xml. Lucene provides a powerful library for indexing documents, executing queries, and ranking results. With schema.xml we define the index structure using an XML-configuration document instead of having to program to the Lucene API. Solr runs as a Java web application and integrates with other technologies, using proven standards such as XML, JSON, and HTTP.
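The behaviour of copy and dynamic fields can be sketched in a few lines. This is a simplified model, not Solr's implementation: the patterns, field names, and the helpers `resolve_field_type` and `apply_copy_fields` are all hypothetical, and the sketch picks the first matching pattern rather than reproducing Solr's own matching rules.

```python
from fnmatch import fnmatchcase

# Hypothetical declarations, standing in for <dynamicField> and <copyField> entries.
DYNAMIC_FIELDS = {"*_s": "string", "*_i": "int", "*_txt": "text_general"}
COPY_FIELDS = [("title", "text"), ("body", "text")]  # (source, destination)


def resolve_field_type(name, explicit_fields):
    """Return the declared type, else the type of the first matching dynamic pattern."""
    if name in explicit_fields:
        return explicit_fields[name]
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(name, pattern):
            return field_type
    return None


def apply_copy_fields(doc):
    """Append the raw values of each source field to its destination field."""
    out = {key: list(values) for key, values in doc.items()}
    for source, dest in COPY_FIELDS:
        if source in doc:
            out.setdefault(dest, []).extend(doc[source])
    return out
```

With these definitions, a field such as `color_s` gets the `string` type without being declared, and a document with `title` and `body` values gains a combined `text` field holding both.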

Core (cont.)

We can index data by running a tool that calls the Solr URL.
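As a rough illustration of what such a tool does, the sketch below builds (but does not send) an HTTP POST carrying documents as JSON. The endpoint is an assumption: it uses Solr's default port 8983, a core named collection1, and the `/update` handler with `commit=true`; older Solr versions may expect a different update path. `build_update_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed endpoint: default port 8983 and a core named collection1.
UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs, url=UPDATE_URL):
    """Build (but do not send) an HTTP POST that carries documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request([{"id": "1", "title": "Hello Solr"}])
# urllib.request.urlopen(request) would send it to a running Solr instance.
```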

Core (cont.) From the Solr admin GUI we can query the data that we just added.

Core (cont.)

In Solr, a core is composed of a set of configuration files, Lucene index files, and Solr’s transaction log. One Solr server running in Jetty can host multiple cores. Solr also uses the term collection, which only has meaning in the context of a Solr cluster in which a single index is distributed across multiple servers. Solr home is a directory structure that encapsulates one or more cores, which historically were configured by a configuration file named solr.xml.

Core (cont.) Requests to Solr are made over HTTP. Queries are sent as HTTP GET requests.
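A query is therefore just a URL with parameters. The following sketch assembles one; the base URL (default port 8983), the core name collection1, and the helper `build_select_url` are assumptions for illustration, while `q`, `wt`, and `rows` are standard Solr query parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(query, core="collection1",
                     base="http://localhost:8983/solr", **params):
    """Return the GET URL for a query against the core's /select handler."""
    all_params = {"q": query, "wt": "json"}
    all_params.update(params)
    return f"{base}/{core}/select?{urlencode(all_params)}"


url = build_select_url("title:solr", rows=10)
print(url)
print(parse_qs(urlsplit(url).query))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, would return the matching documents from a running Solr instance.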

Core (cont.) The main purpose of the request dispatcher is to locate the correct core to handle the request, such as collection1, and then route the request to the appropriate request handler registered in the core, in this case /select.
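The routing step can be modelled as two lookups: first the core, then the handler registered in it. This is a toy model under stated assumptions, not Solr's dispatcher: the `CORES` registry, the handler functions, and `dispatch` are hypothetical names.

```python
def dispatch(path, cores):
    """Map a request path such as /solr/collection1/select to a handler."""
    parts = path.strip("/").split("/", 2)
    if len(parts) != 3 or parts[0] != "solr":
        raise ValueError(f"unexpected path: {path}")
    _, core_name, handler_name = parts
    handlers = cores.get(core_name)
    if handlers is None:
        raise LookupError(f"no such core: {core_name}")
    handler = handlers.get("/" + handler_name)
    if handler is None:
        raise LookupError(f"no handler /{handler_name} in core {core_name}")
    return handler


# Hypothetical registry: one core with a search handler and an update handler.
CORES = {
    "collection1": {
        "/select": lambda params: f"search with {params}",
        "/update": lambda params: f"index with {params}",
    }
}

handler = dispatch("/solr/collection1/select", CORES)
print(handler({"q": "*:*"}))
```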

Core (cont.) Search (select) request handlers are implemented by solr.SearchHandler. Solr has two main types of request handlers: – search handlers for querying – update handlers for indexing

Core (cont.) Solr provides a number of built-in caches to improve query performance. There are four main concerns when working with Solr caches: – Cache sizing and eviction policy – Hit ratio and evictions – Cached-object invalidation – Autowarming new caches

Core (cont.) Solr requires you to set an upper limit on the number of objects in each cache. Solr will evict objects when the cache reaches the upper limit, using either a Least Recently Used or Least Frequently Used eviction policy. Solr creates a new searcher after a commit, but it doesn’t close the old searcher until the new searcher is fully warmed. Some of the keys in the soon-to-be-closed searcher’s cache can be used to populate the new searcher’s cache, a process known as autowarming in Solr. Every Solr cache supports an autowarmCount attribute that indicates either the maximum number of objects or a percentage of the old cache size to autowarm.
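The sizing, eviction, and autowarming behaviour described above can be sketched with a small Least Recently Used cache. This is a simplified model and not Solr's cache classes: `LRUCache`, `autowarm`, and the `recompute` callback (standing in for re-running a cached query against the new searcher) are hypothetical, and only an absolute autowarm count is modelled, not the percentage form.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.evictions = 0

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1


def autowarm(old_cache, new_cache, recompute, autowarm_count):
    """Recompute the most recently used keys of the old cache for the new one."""
    keys = list(old_cache.entries)
    recent_keys = keys[-autowarm_count:] if autowarm_count > 0 else []
    for key in recent_keys:
        new_cache.put(key, recompute(key))
```

Note that only the keys are carried over: the values are recomputed, because results cached by the old searcher may no longer be valid after the commit.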

Core (cont.)

Solr Concepts Solr is a document storage and retrieval engine. Each document contains one or more fields, modeled as a particular field type: string, tokenized text, Boolean, date/time, lat/long, etc. Each field is defined in Solr’s schema as a particular field type, which allows Solr to know how to handle the content as it’s received.

Solr Concepts (cont.) A document is a collection of fields that map to particular field types defined in a schema. Solr uses Lucene’s inverted index to power its fast searching capabilities.
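The idea behind an inverted index is to map each term to the set of documents that contain it, so a query only touches the postings of its own terms. The sketch below is a toy version under stated assumptions: it lowercases and splits on whitespace instead of using a real analyzer, requires every query term to match, and does no relevance ranking. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each lowercased term in the given field to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc.get(field, "").lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents.
DOCS = {
    1: {"title": "Apache Solr search"},
    2: {"title": "Apache Lucene library"},
}
INDEX = build_inverted_index(DOCS, "title")
print(search(INDEX, "apache solr"))
```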

Configuration Solr configuration centers on three main XML files: – solr.xml: defines properties related to administration, logging, sharding, and SolrCloud. – solrconfig.xml: defines the main settings for a specific Solr core. – schema.xml: defines the structure of your index, including fields and field types. The Solr web application uses a global Java system property (solr.solr.home) to identify the root directory in which to look for configuration files. Solr scans the home directory for subdirectories containing a core.properties file, which defines basic properties for autodiscovered cores. In the simplest case, core.properties contains a single line defining the name of the core (e.g., name=collection1).
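A core.properties file is a plain key=value file, so reading it takes little code. The sketch below is a simplified reader and does not cover the full Java properties syntax (no escapes, no `:` separators, no line continuations); `parse_core_properties` is a hypothetical helper.

```python
def parse_core_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


print(parse_core_properties("name=collection1\n"))
```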

Configuration (cont.) Below is a list of the parameters allowed in core.properties.

Configuration (cont.)

Solr can autodiscover cores during startup using core.properties. Once a core is discovered, Solr locates the solrconfig.xml file under $SOLR_HOME/$instanceDir/conf/solrconfig.xml, where $instanceDir/ is the directory containing the core.properties file. Solr uses the solrconfig.xml file to initialize the core. The solrconfig.xml file holds index-management settings along with a large number of other, often complex, sections.
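The discovery step can be approximated as a directory walk. This is a sketch under stated assumptions rather than Solr's actual logic: it searches the whole tree for core.properties files and does not reproduce any rules Solr applies about where the scan stops. `discover_cores` is a hypothetical helper.

```python
from pathlib import Path


def discover_cores(solr_home):
    """Map each instance directory under solr_home to its expected solrconfig.xml path."""
    home = Path(solr_home)
    cores = {}
    for props_file in sorted(home.rglob("core.properties")):
        instance_dir = props_file.parent
        relative_name = instance_dir.relative_to(home).as_posix()
        cores[relative_name] = instance_dir / "conf" / "solrconfig.xml"
    return cores
```

Calling `discover_cores` on a Solr home that contains `collection1/core.properties` would yield one entry whose value is the `collection1/conf/solrconfig.xml` path under that home.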

Conclusions High performance. High scalability, up to very large clusters. Solr has a web-based admin GUI.

Bibliography Manning – Solr in Action