Solr 4 The NoSQL Search Server

Solr 4: The NoSQL Search Server
Yonik Seeley
May 30, 2013

NoSQL Databases
Wikipedia says: "A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as 'Not only SQL' to emphasize that some NoSQL systems do allow SQL-like query languages to be used."
- Non-traditional data stores
- Doesn't use / isn't designed around SQL
- May not give full ACID guarantees
- Offers other advantages, such as greater scalability, as a tradeoff
- Distributed, fault-tolerant architecture

Solr Cloud Design Goals
- Automatic Distributed Indexing
- HA for Writes
- Durable Writes
- Near Real-time Search
- Real-time get
- Optimistic Concurrency

Solr Cloud
- Distributed indexing designed from the ground up to accommodate the desired features
- CAP Theorem: Consistency, Availability, Partition tolerance (the saying goes "choose 2")
  - Reality: you must handle P, so the real choice is the tradeoff between C and A
- We ended up with a (roughly) CP system: Consistency is valued over Availability
  - Eventual consistency is incompatible with optimistic concurrency
  - Closest to MongoDB in architecture
- We still do well with Availability; it isn't simply "on" or "off":
  - All N replicas of a shard must go down before we lose writability for that shard
  - During a network partition, the "big" partition remains active

Solr 4

Solr 4 at a glance
- Document-oriented NoSQL search server
- Data-format agnostic (JSON, XML, CSV, binary)
- Schema-less options (more coming soon)
- Distributed
- Multi-tenanted
- Fault tolerant: HA + no single points of failure
- Atomic Updates
- Optimistic Concurrency
- Near Real-time Search
- Full-text search + hit highlighting
- Tons of specialized queries: faceted search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the "SolrCloud" architecture.

Quick Start
1. Unzip the binary distribution (.zip file); note: no "installation" is required
2. Start Solr:
   $ cd example
   $ java -jar start.jar
3. Go! Browse to http://localhost:8983/solr for the new admin interface
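As a quick sanity check, you can query the (still empty) example index; "collection1" below is the default collection shipped with the example distribution:

$ curl 'http://localhost:8983/solr/collection1/select?q=*:*&wt=json'
# a fresh index returns "numFound":0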

New admin UI

Add and Retrieve a Document
Add:
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id" : "book1", "title" : "American Gods", "author" : "Neil Gaiman" } ]'
Retrieve:
$ curl http://localhost:8983/solr/get?id=book1
{ "doc": {
    "id" : "book1",
    "author": "Neil Gaiman",
    "title" : "American Gods",
    "_version_": 1410390803582287872 } }
- This is an example of real-time get: no "commit" is needed before documents are visible via /get
- Writes are durable: you can "kill -9" the JVM right after the add, restart, and the doc will still be there (see the sketch below)
- Notice that a _version_ field was automatically added to the document
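A minimal way to exercise that durability claim (a sketch; the pgrep-based PID lookup is illustrative, and any way of killing the JVM works):

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id" : "book1", "title" : "American Gods", "author" : "Neil Gaiman" } ]'
$ kill -9 $(pgrep -f start.jar)    # simulate a hard crash before any commit
$ java -jar start.jar              # restart; Solr replays its transaction log
$ curl http://localhost:8983/solr/get?id=book1    # the document is still there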

Simplified JSON Delete Syntax
Single delete-by-id:
{"delete":"book1"}
Multiple delete-by-id:
{"delete":["book1","book2","book3"]}
Delete with optimistic concurrency:
{"delete":{"id":"book1", "_version_":123456789}}
Delete-by-query:
{"delete":{"query":"tag:category1"}}
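Each of these bodies is posted to the same /update endpoint as any other command; for example, the delete-by-query above:

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
{"delete":{"query":"tag:category1"}}'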

Atomic Updates
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ {"id" : "book1",
   "pubyear_i" : { "add" : 2001 },
   "ISBN_s" : { "add" : "0-380-97365-1"} } ]'

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ {"id" : "book1",
   "copies_i" : { "inc" : 1},
   "cat" : { "add" : "fantasy"},
   "ISBN_s" : { "set" : "0-380-97365-0"},
   "remove_s" : { "set" : null } } ]'

Optimistic Concurrency
A conditional update based on document version; the client loop is:
1. /get the document from Solr
2. Modify the document, retaining its _version_
3. /update the resulting document
4. Go back to step 1 if the update fails with code=409
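A shell sketch of that loop (assumes the jq tool for JSON parsing; the document contents and the "modification" here are illustrative):

#!/bin/bash
while true; do
  # 1. /get the document and retain its _version_
  VERSION=$(curl -s 'http://localhost:8983/solr/get?id=book1' | jq '.doc._version_')
  # 2-3. resubmit the modified document with the retained _version_
  CODE=$(curl -s -o /dev/null -w '%{http_code}' \
      http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
      [ { "id":"book1", "author":"Neil Gaiman", "_version_":'"$VERSION"' } ]')
  # 4. a 409 (Conflict) means another writer got there first; go back to step 1
  [ "$CODE" != "409" ] && break
done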

Version semantics
Specifying _version_ on any update invokes optimistic concurrency.

_version_   Update semantics
> 1         Document version must exactly match the supplied _version_
= 1         Document must exist
< 0         Document must not exist
= 0         Don't care (normal overwrite if the document exists)
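For example, a negative _version_ turns an add into a create-only operation (a sketch; "book3" is made up):

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id":"book3", "title":"Snow Crash", "_version_":-1 } ]'
# succeeds only if book3 does not already exist; otherwise Solr returns a 409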

Optimistic Concurrency Example
Get the document:
$ curl http://localhost:8983/solr/get?id=book2
{ "doc" : {
    "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":7,
    "copiesOut_i":3,
    "_version_":123456789 }}
Modify and resubmit, using the same _version_:
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":6,
    "copiesOut_i":4,
    "_version_":123456789 } ]'
After you successfully add the modified document, it will have a new version (i.e. not equal to the _version_ you sent in).
Note: optimistic concurrency works with atomic updates too!
Alternatively, specify the _version_ as a request parameter:
$ curl http://localhost:8983/solr/update?_version_=123456789 -H 'Content-type:application/json' -d '[…]'

Optimistic Concurrency Errors
HTTP code 409 (Conflict) is returned on a version mismatch:
$ curl -i http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[{"id":"book1", "author":"Mr Bean", "_version_":54321}]'

HTTP/1.1 409 Conflict
Content-Type: text/plain;charset=UTF-8
Transfer-Encoding: chunked

{ "responseHeader":{
    "status":409,
    "QTime":1},
  "error":{
    "msg":"version conflict for book1 expected=54321 actual=1408814192853516288",
    "code":409}}

Schema

Schema REST API
Restlet is now integrated with Solr.
Get a specific field:
curl http://localhost:8983/solr/schema/fields/price
{"field":{
    "name":"price",
    "type":"float",
    "indexed":true,
    "stored":true }}
Get all fields:
curl http://localhost:8983/solr/schema/fields
Get the entire schema!
curl http://localhost:8983/solr/schema

Dynamic Schema {"type":”float", "indexed":"true”} ‘ Add a new field (Solr 4.4) curl -XPUT http://localhost:8983/solr/schema/fields/strength -d ‘ {"type":”float", "indexed":"true”} ‘ Works in distributed (cloud) mode too! Schema must be managed & mutable (not currently the default) <schemaFactory class="ManagedIndexSchemaFactory"> <bool name="mutable">true</bool> <str name="managedSchemaResourceName">managed-schema</str> </schemaFactory>

Schemaless
- "Schemaless" really normally means that the client(s) have an implicit schema
- "No schema" is impossible for anything based on Lucene: a field must be indexed the same way across documents
- Dynamic fields: convention over configuration
  - Only the types of fields are pre-defined, not the fields themselves
  - No guessing: any field name ending in _i is an integer (see the sketch below)
- "Guessed schema" or "type guessing"
  - For previously unknown fields, guess using the JSON type as a hint
  - Coming soon (4.4?) based on the dynamic schema work
  - Many disadvantages to guessing: you lose the ability to catch field-naming errors, you can't optimize based on types, and guessing incorrectly means having to start over
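For example, with the dynamic field definitions in the stock example schema, the suffix alone determines how a field is handled (the field names here are made up):

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id":"book4", "pages_i":304, "price_f":9.99, "publisher_s":"Morrow" } ]'
# pages_i is indexed as an integer, price_f as a float, publisher_s as a string,
# with none of these fields declared individually in the schema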

Solr Cloud

Solr Cloud
A query such as http://.../solr/collection1/query?q=awesome can be sent to any node; load-balanced sub-requests go to one replica of each shard (the diagram shows shard1 and shard2, each with replica1, replica2, and replica3).
A ZooKeeper quorum holds the cluster state, roughly:
  /configs
    /myconf
      solrconfig.xml
      schema.xml
  /clusterstate.json
  /aliases.json
  /livenodes
    server1:8983/solr
    server2:8983/solr
    server3:8983/solr
    server4:8983/solr
  /collections
    /collection1 (configName=myconf)
      /shards
        /shard1
        /shard2
ZooKeeper holds:
- Nodes in the cluster
- Collections in the cluster
- Schema & config for each collection
- Shards in each collection
- Replicas in each shard
- Collection aliases
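One way to stand up a small cluster like the one above, using the example distribution and an embedded ZooKeeper (a sketch following the SolrCloud getting-started examples; ports and paths are the stock ones):

# node 1: runs embedded ZooKeeper, uploads the config as "myconf", 2 shards
$ cd example
$ java -Dbootstrap_confdir=./solr/collection1/conf \
       -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

# node 2: joins the cluster via the embedded ZooKeeper (Jetty port + 1000)
$ cd example2
$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar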

Distributed Indexing
http://.../solr/collection1/update
1. An update is sent to any node
2. Solr determines which shard the document belongs to and forwards it to that shard's leader
3. The shard leader versions the document and forwards it to all other shard replicas
This gives HA for updates: if one leader fails, another takes its place.

Collections API
Create a new document collection:
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
Delete a collection:
http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection
Create an alias to a collection (or a group of collections):
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=tri_state&collections=NY,NJ,CT

http://localhost:8983/solr/#/~cloud

Distributed Query Requests
- A distributed query across all shards in the collection:
  http://localhost:8983/solr/collection1/query?q=foo
- Explicitly specify node addresses to load-balance across:
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  - A list of equivalent nodes is separated by "|"
  - Different phases of the same distributed request use the same node
- Specify logical shards to search across: shards=NY,NJ,CT
- Specify multiple collections to search across: collection=collection1,collection2
- ZK-aware SolrJ Java client that load-balances across all nodes in the cluster:
  public CloudSolrServer(String zkHost)
  - Calculates where a document belongs and sends it directly to the shard leader (new)

Durable Writes
- Lucene flushes writes to disk on a "commit"; uncommitted docs are lost on a crash (at the Lucene level)
- Solr 4 maintains its own transaction log, which:
  - Contains uncommitted documents
  - Services real-time get requests
  - Supports recovery (log replay on restart)
  - Supports distributed "peer sync"
- Writes are forwarded to multiple shard replicas
  - A replica can go away forever without collection data loss
  - A replica can do a fast "peer sync" if it's only slightly out of date
  - A replica can do a full index replication (copy) from a peer

Near Real Time (NRT)
- softCommit opens a new view of the index without flushing + fsyncing files to disk
  - Decouples update visibility from update durability
- commitWithin now implies a soft commit
- Current autoCommit defaults from solrconfig.xml:
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!--
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>
-->
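An explicit soft commit can also be requested per update via a request parameter (a sketch, reusing the book1 document from earlier):

$ curl 'http://localhost:8983/solr/update?softCommit=true' -H 'Content-type:application/json' -d '
[ { "id":"book1", "title":"American Gods" } ]'
# the update is searchable immediately, but durability still comes from the
# transaction log until the next hard commit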

Document Routing
numShards=4, router=compositeId
- Each shard owns a quarter of the 32-bit hash ring: 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff
- For id = BigCo!doc5, the "BigCo" shard key is hashed (MurmurHash3) into the high bits and "doc5" into the low bits (e.g. 9f27 3c71), so all of BigCo's documents fall in the range 9f270000 to 9f27ffff, on a single shard
- A query can then be scoped to that range: q=my_query&shard.keys=BigCo!
- You can see the range of any shard in clusterstate.json
- Hashing based only on the "id" has advantages over hashing on a different field: clients can be more generic and not know/care what addressing scheme is being used when dealing with individual documents, and the "id" always fully defines where a document lives
- Enables highly scalable multi-tenanted applications (see the sketch below)
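A sketch of that multi-tenant pattern: prefix every document of a tenant with the same shard key, then scope queries to that tenant's hash range (the tenant and document names are made up):

$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[ { "id":"BigCo!doc5", "title":"BigCo quarterly report" } ]'
$ curl 'http://localhost:8983/solr/collection1/query?q=report&shard.keys=BigCo!'
# only the shard(s) covering the BigCo! hash range receive the query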

Seamless Online Shard Splitting
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
1. New sub-shards (e.g. Shard2_0 and Shard2_1) are created in "construction" state
2. The leader starts forwarding applicable updates, which are buffered by the sub-shards
3. The leader's index is split and installed on the sub-shards
4. The sub-shards apply the buffered updates, then become "active" leaders, and the old shard becomes "inactive"

Questions?