
Solr 4 The NoSQL Search Server Yonik Seeley May 30, 2013.


1 Solr 4 The NoSQL Search Server Yonik Seeley May 30, 2013

2 NoSQL Databases
Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
Non-traditional data stores
Doesn’t use / isn’t designed around SQL
May not give full ACID guarantees
Offers other advantages such as greater scalability as a tradeoff
Distributed, fault-tolerant architecture

3 Solr Cloud Design Goals
Automatic Distributed Indexing
HA for Writes
Durable Writes
Near Real-time Search
Real-time get
Optimistic Concurrency

4 Solr Cloud
Distributed Indexing designed from the ground up to accommodate desired features
CAP Theorem
  Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
  Reality: Must handle P – the real choice is tradeoffs between C and A
Ended up with a CP system (roughly)
  Value Consistency over Availability
  Eventual consistency is incompatible with optimistic concurrency
  Closest to MongoDB in architecture
We still do well with Availability
  All N replicas of a shard must go down before we lose writability for that shard
  For a network partition, the “big” partition remains active (i.e. Availability isn’t “on” or “off”)

5 Solr 4

6 Solr 4 at a glance
Document Oriented NoSQL Search Server
  Data-format agnostic (JSON, XML, CSV, binary)
  Schema-less options (more coming soon)
Distributed
Multi-tenanted
Fault Tolerant
  HA + No single points of failure
Atomic Updates
Optimistic Concurrency
Near Real-time Search
Full-Text search + Hit Highlighting
Tons of specialized queries: Faceted search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the “SolrCloud” architecture

7 Quick Start
1. Unzip the binary distribution (.ZIP file)
   Note: no “installation” required
2. Start Solr
   $ cd example
   $ java -jar start.jar
3. Go! Browse to http://localhost:8983/solr for the new admin interface

8 New admin UI

9 Add and Retrieve document
$ curl -H 'Content-type:application/json' -d '
[
  { "id" : "book1",
    "title" : "American Gods",
    "author" : "Neil Gaiman"
  }
]'
$ curl
{ "doc": {
    "id" : "book1",
    "author": "Neil Gaiman",
    "title" : "American Gods",
    "_version_":
  }
}
Note: no type of “commit” is necessary to retrieve documents via /get (real-time get)

10 Simplified JSON Delete Syntax
Single delete-by-id
  {"delete":"book1"}
Multiple delete-by-id
  {"delete":["book1","book2","book3"]}
Delete with optimistic concurrency
  {"delete":{"id":"book1", "_version_": }}
Delete by Query
  {"delete":{"query":"tag:category1"}}
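The four delete forms can also be generated programmatically. The following is a minimal Python sketch, not part of the original slides: the helper names are hypothetical, and the version number 12345 is a made-up placeholder, because the transcript does not preserve a real _version_ value.

```python
import json


def delete_by_id(*ids):
    """One id gives {"delete": "id"}; several ids give {"delete": [ids...]}."""
    return {"delete": ids[0] if len(ids) == 1 else list(ids)}


def delete_by_id_versioned(doc_id, version):
    """Delete that only succeeds if the stored _version_ matches."""
    return {"delete": {"id": doc_id, "_version_": version}}


def delete_by_query(query):
    """Delete every document matching a query string."""
    return {"delete": {"query": query}}


if __name__ == "__main__":
    print(json.dumps(delete_by_id("book1")))
    print(json.dumps(delete_by_id("book1", "book2", "book3")))
    print(json.dumps(delete_by_id_versioned("book1", 12345)))  # placeholder version
    print(json.dumps(delete_by_query("tag:category1")))
```

Each helper returns a plain dict, so the result can be serialized with json.dumps and sent as the body of an update request.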

11 Atomic Updates
$ curl -H 'Content-type:application/json' -d '
[
  {"id" : "book1",
   "pubyear_i" : { "add" : 2001 },
   "ISBN_s" : { "add" : " "}
  }
]'
$ curl -H 'Content-type:application/json' -d '
[
  {"id" : "book1",
   "copies_i" : { "inc" : 1},
   "cat" : { "add" : "fantasy"},
   "ISBN_s" : { "set" : " "},
   "remove_s" : { "set" : null }
  }
]'
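To make the effect of the "add", "inc" and "set" modifiers concrete, here is a toy Python model that applies them to an in-memory dict. It is only a sketch of the semantics suggested by the slide, not Solr's implementation: the function name is hypothetical, and details such as how "add" behaves on a single-valued field are simplifying assumptions.

```python
def apply_atomic_update(doc, update):
    """Return a copy of doc with set/add/inc modifiers from update applied."""
    result = dict(doc)
    for field, change in update.items():
        if field == "id":
            continue  # the id only selects the document
        if not isinstance(change, dict):
            result[field] = change  # plain value: overwrite (toy behavior)
            continue
        for op, value in change.items():
            if op == "set":
                if value is None:
                    result.pop(field, None)  # "set": null removes the field
                else:
                    result[field] = value
            elif op == "add":
                current = result.get(field)
                if current is None:
                    result[field] = value
                elif isinstance(current, list):
                    result[field] = current + [value]
                else:
                    result[field] = [current, value]
            elif op == "inc":
                result[field] = result.get(field, 0) + value
            else:
                raise ValueError("unsupported modifier: " + op)
    return result


if __name__ == "__main__":
    book = {"id": "book1", "title": "American Gods", "copies_i": 2, "remove_s": "x"}
    change = {"id": "book1", "copies_i": {"inc": 1},
              "cat": {"add": "fantasy"}, "remove_s": {"set": None}}
    print(apply_atomic_update(book, change))
```

The point of the model is that the client sends only the changes, and the server combines them with the stored document.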

12 Optimistic Concurrency
Conditional update based on document version
Exchange between client and Solr:
1. /get document
2. Modify document, retaining _version_
3. /update resulting document
4. Go back to step #1 if fail code=409
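The four-step loop can be sketched in Python against a small in-memory stand-in for the server. Everything here is illustrative: FakeStore, VersionConflict and update_with_retry are hypothetical names, the version counter is arbitrary, and the exception merely plays the role of an HTTP 409 response.

```python
import itertools


class VersionConflict(Exception):
    """Stands in for an HTTP 409 (Conflict) response."""


class FakeStore:
    """In-memory stand-in for /get and /update with version checking."""

    def __init__(self):
        self._docs = {}
        self._versions = itertools.count(1000)  # arbitrary increasing versions

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def update(self, doc):
        current = self._docs.get(doc["id"])
        if ("_version_" in doc and current is not None
                and doc["_version_"] != current["_version_"]):
            raise VersionConflict(doc["id"])
        stored = dict(doc, _version_=next(self._versions))
        self._docs[doc["id"]] = stored
        return stored["_version_"]


def update_with_retry(store, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = store.get(doc_id)       # 1. get the document
        modify(doc)                   # 2. modify it, retaining _version_
        try:
            return store.update(doc)  # 3. update the resulting document
        except VersionConflict:
            continue                  # 4. conflict: go back to step 1
    raise RuntimeError("gave up after repeated version conflicts")
```

A caller passes a function that mutates the fetched document, for example one that decrements copiesIn_i and increments copiesOut_i; the loop re-reads and re-applies it whenever another writer got there first.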

13 Version semantics
Specifying _version_ on any update invokes optimistic concurrency
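As a rough illustration of how the _version_ value on an update can be interpreted, here is a small Python sketch. The four rules it encodes follow the behavior described in the Solr reference documentation rather than anything recoverable from this transcript, and the function name is hypothetical.

```python
def version_check_passes(requested, existing):
    """Decide whether an update carrying _version_=requested may proceed.

    requested: the _version_ value sent by the client
    existing:  the stored document's version, or None if there is no document
    """
    if requested == 0:
        return True                  # no concurrency check requested
    if requested == 1:
        return existing is not None  # document must already exist
    if requested < 0:
        return existing is None      # document must not exist yet
    return existing == requested     # otherwise the versions must match exactly
```

A failed check corresponds to the version-conflict error returned to the client.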

14 Optimistic Concurrency Example
Get the document
$ curl
{ "doc" : {
    "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":7,
    "copiesOut_i":3,
    "_version_": }}
Modify and resubmit, using the same _version_
$ curl -H 'Content-type:application/json' -d '
[ { "id":"book2",
    "title":["Neuromancer"],
    "author":"William Gibson",
    "copiesIn_i":6,
    "copiesOut_i":4,
    "_version_": } ]'
Alternately, specify the _version_ as a request parameter
$ curl -H 'Content-type:application/json' -d […]

15 Optimistic Concurrency Errors
HTTP Code 409 (Conflict) returned on version mismatch
$ curl -i -H 'Content-type:application/json' -d '
[{"id":"book1", "author":"Mr Bean", "_version_":54321}]'
HTTP/ Conflict
Content-Type: text/plain;charset=UTF-8
Transfer-Encoding: chunked
{
  "responseHeader":{
    "status":409,
    "QTime":1},
  "error":{
    "msg":"version conflict for book1 expected=12345 actual= ",
    "code":409}}

16 Schema

17 Schema REST API
Restlet is now integrated with Solr
Get a specific field
curl
{"field":{
   "name":"price",
   "type":"float",
   "indexed":true,
   "stored":true }}
Get all fields
curl
Get Entire Schema!
curl

18 Dynamic Schema
Add a new field (Solr 4.4)
curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '
{"type":"float", "indexed":"true"}
'
Works in distributed (cloud) mode too!
Schema must be managed & mutable (not currently the default): in solrconfig.xml, mutable is set to true and the managed schema resource is named managed-schema

19 Schemaless
“Schemaless” normally just means that the client(s) have an implicit schema
“No Schema” is impossible for anything based on Lucene
  A field must be indexed the same way across documents
Dynamic fields: convention over configuration
  Only pre-define types of fields, not fields themselves
  No guessing. Any field name ending in _i is an integer
“Guessed Schema” or “Type Guessing”
  For previously unknown fields, guess using JSON type as a hint
  Coming soon (4.4?) based on the Dynamic Schema work
Many disadvantages to guessing
  Lose ability to catch field naming errors
  Can’t optimize based on types
  Guessing incorrectly means having to start over
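The difference between dynamic fields and type guessing can be shown with a short Python sketch. The suffix table and the guessed type names below are hypothetical examples chosen for illustration, not Solr's actual configuration or guessing rules.

```python
# Hypothetical suffix conventions, in the spirit of "_i is an integer".
DYNAMIC_FIELD_TYPES = {"_i": "int", "_s": "string"}


def type_from_dynamic_field(name):
    """Convention over configuration: the field name suffix decides the type."""
    for suffix, field_type in DYNAMIC_FIELD_TYPES.items():
        if name.endswith(suffix):
            return field_type
    return None  # unknown field: a strict schema can reject it as a naming error


def guess_type_from_json(value):
    """Type guessing: use the JSON value of the first document as a hint."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"
```

With the suffix convention, pubyear_i is an integer no matter what the first document contains. With guessing, a first document that sends the year as the string "2001" fixes the field as a string, which illustrates why a wrong guess can mean starting over.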

20 Solr Cloud

21 Solr Cloud
Diagram: shard1 and shard2, each with replica1, replica2 and replica3, receive load-balanced sub-requests; a ZooKeeper quorum holds the tree below
ZK node
  /configs
    /myconf
      solrconfig.xml
      schema.xml
  /clusterstate.json
  /aliases.json
  /livenodes
    server1:8983/solr
    server2:8983/solr
  /collections
    /collection1 configName=myconf
      /shards
        /shard1
          server1:8983/solr
          server2:8983/solr
        /shard2
          server3:8983/solr
          server4:8983/solr
ZooKeeper holds cluster state
  Nodes in the cluster
  Collections in the cluster
  Schema & config for each collection
  Shards in each collection
  Replicas in each shard
  Collection aliases

22 Distributed Indexing
Update sent to any node
Solr determines what shard the document is on, and forwards to shard leader
Shard Leader versions document and forwards to all other shard replicas
HA for updates (if one leader fails, another takes its place)

23 Collections API
Create a new document collection
  action=CREATE&name=mycollection&numShards=4&replicationFactor=3
Delete a collection
  action=DELETE&name=mycollection
Create an alias to a collection (or a group of collections)
  action=CREATEALIAS&name=tri_state&collections=NY,NJ,CT
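The parameter lists above can be assembled into request URLs with a few lines of Python. This is a sketch under stated assumptions: the endpoint http://localhost:8983/solr/admin/collections is not preserved in this transcript and is assumed here, and collections_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint; the transcript does not preserve the request URL.
COLLECTIONS_ENDPOINT = "http://localhost:8983/solr/admin/collections"


def collections_url(action, **params):
    """Build a Collections API URL from an action and its parameters."""
    query = {"action": action}
    query.update(params)
    # Keep commas readable so lists such as NY,NJ,CT are not percent-encoded.
    return COLLECTIONS_ENDPOINT + "?" + urlencode(query, safe=",")


if __name__ == "__main__":
    print(collections_url("CREATE", name="mycollection", numShards=4, replicationFactor=3))
    print(collections_url("DELETE", name="mycollection"))
    print(collections_url("CREATEALIAS", name="tri_state", collections="NY,NJ,CT"))
```

The helper only builds the URL; sending it is left to whatever HTTP client is in use.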


25 Distributed Query Requests
Distributed query across all shards in the collection
Explicitly specify node addresses to load-balance across
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  Equivalent nodes are separated by “|”
  Different phases of the same distributed request use the same node
Specify logical shards to search across
  shards=NY,NJ,CT
Specify multiple collections to search across
  collection=collection1,collection2
ZK aware SolrJ Java client that load-balances across all nodes in cluster
  public CloudSolrServer(String zkHost)
  Calculate where document belongs and directly send to shard leader (new)
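The shards parameter format (commas between shards, "|" between equivalent nodes of one shard) is easy to build and parse. The following Python sketch uses hypothetical helper names and the node addresses from the slide.

```python
def shards_param(shards):
    """Build a shards value from a list of shards, each a list of equivalent nodes."""
    return ",".join("|".join(nodes) for nodes in shards)


def parse_shards_param(value):
    """Inverse of shards_param: split on ',' for shards, then '|' for nodes."""
    return [shard.split("|") for shard in value.split(",")]


if __name__ == "__main__":
    layout = [
        ["localhost:8983/solr", "localhost:8900/solr"],  # nodes serving one shard
        ["localhost:7574/solr", "localhost:7500/solr"],  # nodes serving another
    ]
    print("shards=" + shards_param(layout))
```

Any node within a "|" group can serve that shard's part of the request, which is what makes load balancing across the group possible.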

26 Durable Writes
Lucene flushes writes to disk on a “commit”
  Uncommitted docs are lost on a crash (at Lucene level)
Solr 4 maintains its own transaction log
  Contains uncommitted documents
  Services real-time get requests
  Recovery (log replay on restart)
  Supports distributed “peer sync”
Writes forwarded to multiple shard replicas
  A replica can go away forever w/o collection data loss
  A replica can do a fast “peer sync” if it’s only slightly out of date
  A replica can do a full index replication (copy) from a peer
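The role of the transaction log can be illustrated with a toy in-memory model in Python. This is a deliberately simplified sketch, not how Solr or Lucene store data: ToyShard and its attributes are hypothetical, and "durable" simply means the attribute is left untouched by the simulated crash.

```python
class ToyShard:
    """Toy model: committed index + in-memory pending docs + transaction log."""

    def __init__(self):
        self.committed = {}  # survives a crash
        self.pending = {}    # uncommitted, in memory only; lost on a crash
        self.tlog = []       # log of uncommitted docs; survives a crash

    def add(self, doc):
        self.tlog.append(dict(doc))          # log first, for durability
        self.pending[doc["id"]] = dict(doc)

    def realtime_get(self, doc_id):
        # Serve the newest uncommitted version from the log if there is one.
        for doc in reversed(self.tlog):
            if doc["id"] == doc_id:
                return dict(doc)
        return self.committed.get(doc_id)

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()

    def crash(self):
        self.pending.clear()  # in-memory state is gone

    def recover(self):
        for doc in self.tlog:  # log replay on restart
            self.pending[doc["id"]] = dict(doc)
        self.commit()
```

In this model a document is retrievable by id immediately after add, and it still reaches the committed index after a crash followed by recover, because the log entry outlives the lost in-memory state.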

27 Near Real Time (NRT) softCommit
softCommit opens a new view of the index without flushing + fsyncing files to disk
Decouples update visibility from update durability
commitWithin now implies a soft commit
Current autoCommit defaults from solrconfig.xml: false

28 Document Routing
Hash ring diagram: numShards=4, router=compositeId; the hash ring is divided among shard1, shard2, shard3 and shard4
id = BigCo!doc5 hashes (MurmurHash3) to 9f27 3c71
q=my_query with shard.keys=BigCo! is routed by hash range to shard1
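A composite-id router of this kind can be sketched in Python. The sketch rests on several assumptions that the slide does not confirm: that the upper 16 bits of the hash come from the part before "!" and the lower 16 bits from the rest, that MurmurHash3 (x86, 32-bit) is applied to UTF-8 bytes with seed 0, and that the four shards split the hash ring into equal ranges starting at 0x80000000. All function names are hypothetical, and the hashes produced here are not guaranteed to match Solr's.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def composite_hash(doc_id: str) -> int:
    """Assumed scheme: top 16 bits from the shard key, low 16 bits from the rest."""
    if "!" in doc_id:
        key, rest = doc_id.split("!", 1)
        upper = murmur3_32(key.encode("utf-8")) & 0xFFFF0000
        lower = murmur3_32(rest.encode("utf-8")) & 0x0000FFFF
        return upper | lower
    return murmur3_32(doc_id.encode("utf-8"))


def shard_index(hash_value: int, num_shards: int) -> int:
    """0-based shard index, with equal ranges assumed to start at 0x80000000."""
    position = (hash_value - 0x80000000) & 0xFFFFFFFF
    return position * num_shards // 2**32
```

Under these assumptions every id that starts with BigCo! shares the same upper 16 bits and therefore falls in a single 65536-value slice of the ring, so a query restricted to that key needs to visit only the shard covering the slice. A hash such as 0x9f273c71 gives shard_index 0, the first of the four shards.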

29 Seamless Online Shard Splitting
Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; Shard2 is split into sub-shards Shard2_0 and Shard2_1
1. lection=mycollection&shard=Shard2
2. New sub-shards created in “construction” state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
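The core idea of splitting a shard's hash range in two and forwarding each update to the sub-shard that covers it can be sketched in Python. The range used in the usage example is a hypothetical one, the split into two equal halves is an assumption, and the function names are illustrative only.

```python
def split_range(lo, hi):
    """Split the inclusive hash range [lo, hi] into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


def sub_shard_for(hash_value, sub_ranges):
    """Pick the sub-shard whose range covers hash_value, or None if none does."""
    for name, (lo, hi) in sub_ranges.items():
        if lo <= hash_value <= hi:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical range for Shard2; real ranges depend on the collection.
    first, second = split_range(0xC0000000, 0xFFFFFFFF)
    sub_ranges = {"Shard2_0": first, "Shard2_1": second}
    print(sub_shard_for(0xC1234567, sub_ranges))
    print(sub_shard_for(0xF0000000, sub_ranges))
```

While the split is in progress, the leader would use a lookup like sub_shard_for to decide which sub-shard should buffer each forwarded update.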

30 Questions?

