© 2008-2009 NYC Apache Lucene/Solr Meetup. Lucid Imagination, Inc.

Presentation transcript:

NYC Apache Lucene/Solr Meetup

Agenda
- Welcome
- "Faster. Better. Solr! What to look for in Solr 1.4", Yonik Seeley, Lucid Imagination
- "How fast is it? Assessing Performance in Lucene and Solr", Mark Miller, Lucid Imagination
- "Finding more than music: how MTV Networks drives Viacom entertainment brands with Solr search", Michael Rosencrantz, MTV Networks
- Lightning Talks

What’s New In Solr 1.4 Yonik Seeley

Performance! Scalability/Concurrency!
- FastLRUCache: ConcurrentHashMap based
  - Reads are lockless, writes are partitioned
  - Can be slower if hit rate is low with few cores
  - filterCache, queryCache, documentCache
- NIOFSDirectory!
  - sync{ seek(pos), read(nBytes) } => pread(pos, nBytes)
  - Windows still defaults to synchronized (JVM bug)
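The seek-then-read versus pread point can be illustrated outside Lucene. The following is a minimal Python sketch, not Lucene's code: it assumes a POSIX system (os.pread is not available on Windows), and the helper names read_synchronized and read_positional are made up for illustration.

```python
import os
import tempfile
import threading

_lock = threading.Lock()  # guards the shared file position

def read_synchronized(f, pos, n_bytes):
    """Old style: seek() and read() share one file position, so the pair
    must run under a lock or concurrent readers corrupt each other."""
    with _lock:
        f.seek(pos)
        return f.read(n_bytes)

def read_positional(fd, pos, n_bytes):
    """pread style: the offset is an argument, so there is no shared
    position and no lock is needed."""
    return os.pread(fd, n_bytes, pos)

# Write a small scratch file to read from.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

with open(path, "rb") as f:
    a = read_synchronized(f, 3, 4)
    b = read_positional(f.fileno(), 3, 4)
os.remove(path)
```

Both calls return the same four bytes; only the first one serializes concurrent readers.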

Performance! IndexReader.reopen()
[Diagram: Lucene index segments on disk (S1, S2, S3, and a new segment); segment readers SR1, SR2, SR3, SR4; un-inverted, RAM-resident Field Cache entries ("popularity") for SR1, SR2, SR3; MultiReader1 and MultiReader2, connected by reopen().]
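One way to read the reopen() diagram: unchanged segments keep their readers, so field-cache entries keyed by segment stay valid and only the new segment has to be loaded. A toy Python sketch of that idea follows. The classes SegmentReader and FieldCache and the function reopen are illustrative stand-ins, not Lucene's API.

```python
class SegmentReader:
    """Stand-in for a per-segment reader; `popularity` mimics field values on disk."""
    def __init__(self, name, popularity):
        self.name = name
        self.popularity = popularity

class FieldCache:
    """Un-inverted, RAM-resident values, cached per segment reader."""
    def __init__(self):
        self._entries = {}
        self.loads = 0  # counts how often values had to be loaded

    def get(self, reader):
        if reader.name not in self._entries:
            self.loads += 1
            self._entries[reader.name] = list(reader.popularity)
        return self._entries[reader.name]

def reopen(old_readers, segments_on_disk):
    """Reuse readers for unchanged segments; open readers only for new ones."""
    by_name = {r.name: r for r in old_readers}
    return [by_name.get(name) or SegmentReader(name, values)
            for name, values in segments_on_disk]

cache = FieldCache()
readers1 = reopen([], [("S1", [3, 1]), ("S2", [7])])
for r in readers1:
    cache.get(r)                      # warm the cache: 2 loads

readers2 = reopen(readers1, [("S1", [3, 1]), ("S2", [7]), ("S3", [5])])
for r in readers2:
    cache.get(r)                      # only S3 is loaded now
```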

Performance! Faceting!
- New UnInvertedField (FieldCache-like method)
  - Good for many unique terms, but relatively few values per doc
  - Builds a doc-id => values mapping, for multi-valued fields
  - Lots of tricks to reduce memory footprint
  - Hybrid approach: filters used for "big" terms (>5% of index)
- Default for multi-valued fields
  - facet.method=enum switches back to old behavior
- How big is it? Check out admin/stats.jsp, go to fieldValueCache
- Result: up to 50x faster, 5x smaller (100K unique values, 1-5/doc)
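The hybrid approach can be sketched in a few lines of Python. This is only a conceptual model with plain dicts and sets. It leaves out the memory tricks mentioned above, and the names uninvert and facet_counts are hypothetical.

```python
from collections import Counter

def uninvert(postings, num_docs, big_term_fraction=0.05):
    """postings maps term -> list of doc ids (the inverted view).
    Returns a doc-id => terms mapping for ordinary terms, plus a
    filter (doc-id set) for each "big" term above the threshold."""
    doc_to_terms = {}
    big_terms = {}
    for term, docs in postings.items():
        if len(docs) > big_term_fraction * num_docs:
            big_terms[term] = set(docs)
        else:
            for doc in docs:
                doc_to_terms.setdefault(doc, []).append(term)
    return doc_to_terms, big_terms

def facet_counts(result_docs, doc_to_terms, big_terms):
    """Count terms by walking the result docs; big terms are counted
    by intersecting their filter with the result set instead."""
    counts = Counter()
    for doc in result_docs:
        counts.update(doc_to_terms.get(doc, ()))
    for term, docs in big_terms.items():
        n = len(docs & result_docs)
        if n:
            counts[term] = n
    return dict(counts)

postings = {"solr": list(range(10)), "lucene": [1, 2], "tika": [2]}
doc_to_terms, big_terms = uninvert(postings, num_docs=40)
counts = facet_counts({1, 2, 30}, doc_to_terms, big_terms)
```

With 40 documents the 5% threshold is 2 documents, so only "solr" (10 documents) is handled as a filter.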

Performance! TrieRangeQuery
- Trie* fields index multiple precisions
- Works for numerics & dates… renamed NumericField in Lucene
- 175 is indexed as hundreds:1 tens:17 ones:175
- TrieRange:[154 TO 183] is executed as tens:[16 TO 17] OR ones:[154 TO 159] OR ones:[180 TO 183]
- Result: up to 40x faster than standard range queries
- Configurable precision step
- Only for single valued fields!
- Not completely integrated into Solr yet (no faceting)
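The decimal example can be reproduced with a short Python sketch. It assumes base-10 precision levels named ones, tens, and hundreds, as in the example. The real implementation works on bits with a configurable precision step, and the function names here are made up for illustration.

```python
LEVELS = [(1, "ones"), (10, "tens"), (100, "hundreds")]
UNIT = {name: unit for unit, name in LEVELS}

def index_terms(value):
    """Terms indexed for one value, coarsest precision first."""
    return [(name, value // unit) for unit, name in reversed(LEVELS)]

def split_range(lo, hi, level=0):
    """Cover the inclusive range [lo, hi], given in units of `level`,
    with as few (precision, lo, hi) clauses as possible."""
    if lo > hi:
        return []
    name = LEVELS[level][1]
    if level == len(LEVELS) - 1:
        return [(name, lo, hi)]
    step = LEVELS[level + 1][0] // LEVELS[level][0]
    inner_lo = -(-lo // step)          # first coarser bucket fully inside
    inner_hi = (hi + 1) // step - 1    # last coarser bucket fully inside
    if inner_lo > inner_hi:
        return [(name, lo, hi)]
    clauses = []
    if lo < inner_lo * step:           # ragged lower edge stays fine-grained
        clauses.append((name, lo, inner_lo * step - 1))
    clauses.extend(split_range(inner_lo, inner_hi, level + 1))
    if hi > (inner_hi + 1) * step - 1: # ragged upper edge stays fine-grained
        clauses.append((name, (inner_hi + 1) * step, hi))
    return clauses

def matches(value, clauses):
    """True if any clause matches one of the value's indexed terms."""
    return any(lo <= value // UNIT[name] <= hi for name, lo, hi in clauses)

terms = index_terms(175)
clauses = split_range(154, 183)
```

Here terms is [("hundreds", 1), ("tens", 17), ("ones", 175)] and clauses is [("ones", 154, 159), ("tens", 16, 17), ("ones", 180, 183)], so the query touches 12 terms instead of 30.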

Performance!
- Binary format for updates (no XML parsing)
  - Use SolrJ, it's the default transfer syntax
- SolrJ's StreamingUpdateSolrServer
  - Streams multiple documents over multiple connections
  - Simple test went from 231 docs/sec to docs/sec!
- omitTermFreqAndPositions
  - Omits number of terms in that specific field & list of positions
  - Saves time and index space for non-text fields

Performance!
- Avoid scoring when generating docsets/filters
  - Enabled by new Collector classes in Lucene
- Filters now apply before main query
  - 300% faster in some cases
- New small set filter implementation
  - Used when cardinality < maxDoc/64
  - 40% smaller, good news for the filterCache
  - 60% faster at calculating intersections (facet.method=enum)
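A rough Python sketch of the small-set idea follows, assuming the small representation is a sorted array of 32-bit doc ids and the large one is a bitset of maxDoc bits. Both function names are hypothetical, and this is not Solr's implementation.

```python
def use_small_set(cardinality, max_doc):
    """Threshold from the slide. Assuming 32-bit ids, a sorted array
    below this size takes at most half the space of a maxDoc-bit bitset."""
    return cardinality < max_doc / 64

def intersect_sorted(a, b):
    """Intersect two sorted doc-id lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

common = intersect_sorted([1, 4, 9, 12], [4, 5, 12, 20])
```

Facet counting with facet.method=enum intersects one filter per term with the result set, which is why a faster intersection matters there.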

Solr Indexing Architecture
[Diagram: update handlers and their paths: XML Update Handler (/update), CSV Update Handler (/update/csv), XML Update with custom processor chain (/update/xml), Solr CELL Extracting RequestHandler for PDF, Word, … via Apache Tika (/update/extract), and Data Import Handler (database pull, RSS pull, simple transforms; sources: SQL DB, RSS feed). Input arrives by HTTP POST or by pull. Each handler has an Update Processor Chain (signature processor, logging processor, index processor, custom transform processor). Text index analyzers feed the Lucene index.]

New update components
- Solr Cell (Content Extraction Library)
  - Allow apps to send in Office, PDF, etc. and index it
  - Integrates Apache Tika (v0.4) into Solr
- SignatureUpdateProcessor
  - Detect duplicates during indexing and handle them
  - Adds a signature field to the document (could be uniqueKey)
  - Exact (hash on certain fields) or Fuzzy duplicate detection
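Exact duplicate detection by hashing selected fields can be sketched as follows. This is a plain-Python model, not the SignatureUpdateProcessor itself. The choice of MD5, the field names, and the overwrite behaviour are assumptions made for illustration.

```python
import hashlib

def signature(doc, fields):
    """Hash the chosen fields into one hex digest (exact matching)."""
    h = hashlib.md5()
    for name in fields:
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()

def index_with_dedupe(docs, fields, overwrite=True):
    """Use the signature as the key, so a later duplicate either
    replaces the earlier document or is dropped."""
    index = {}
    for doc in docs:
        sig = signature(doc, fields)
        if sig in index and not overwrite:
            continue
        index[sig] = dict(doc, signature=sig)
    return index

docs = [
    {"id": 1, "name": "a", "features": "x"},
    {"id": 2, "name": "a", "features": "x"},   # duplicate of id 1
    {"id": 3, "name": "b", "features": "x"},
]
index = index_with_dedupe(docs, ["name", "features"])
kept_ids = sorted(d["id"] for d in index.values())
```

Keying the index on the signature mirrors the note that the signature field could serve as the uniqueKey.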

Replication
- Old:
  - UNIX only
  - Difficult/Annoying to setup
- New:
  - Java-based, self contained
  - Replication of configuration files!
  - Simple configuration

Replication configuration example
- Master: replicate after commit; configuration files to replicate: schema.xml,stopwords.txt
- Worker: poll interval 00:00:60

Multi-select support
- Very generic support
  - Ability to tag filters
  - Ability to exclude certain filters when faceting, by tag
- Example:
  q=index replication&facet=true
  &fq={!tag=proj}project:(lucene OR solr)
  &facet.field={!ex=proj}project
  &facet.field={!ex=src}source
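When sent over HTTP, the local-params syntax ({!tag=...}, {!ex=...}) has to be URL-encoded. The Python sketch below builds the example request with the standard library. The base URL is a placeholder assumption, not something given in the talk.

```python
from urllib.parse import urlencode, parse_qsl

# A list of pairs rather than a dict, because facet.field repeats.
params = [
    ("q", "index replication"),
    ("facet", "true"),
    ("fq", "{!tag=proj}project:(lucene OR solr)"),   # tag the filter
    ("facet.field", "{!ex=proj}project"),            # ignore it for this facet
    ("facet.field", "{!ex=src}source"),
]

query_string = urlencode(params)
base_url = "http://localhost:8983/solr/select"       # hypothetical endpoint
request_url = base_url + "?" + query_string
```

Excluding the proj filter when counting the project facet keeps the counts for unselected projects visible, which is what a multi-select UI needs.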

New Request Handler Components
- ClusteringComponent
  - Uses Carrot2 to dynamically cluster the top N search results
  - Like dynamically discovered facets
- Terms Component
  - Returns indexed terms + docfreq in a field; use for auto-suggest, etc.
- TermVector Component
  - Returns term info per document (tf, positions)
- Stats Component
  - min, max, sum, sumOfSquares, count, missing, mean, stddev
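The statistics listed for the Stats Component can all be derived from a single pass over a field's values. The Python sketch below shows one way to compute them. Treating None as a missing value and using the sample standard deviation are assumptions of this sketch, and field_stats is a made-up name.

```python
import math

def field_stats(values):
    """Compute the listed statistics for one numeric field.
    None stands for a document with no value in the field."""
    present = [v for v in values if v is not None]
    count = len(present)
    total = sum(present)
    sum_sq = sum(v * v for v in present)
    if count > 1:
        # Sample standard deviation from count, sum and sumOfSquares only.
        stddev = math.sqrt((count * sum_sq - total * total) / (count * (count - 1)))
    else:
        stddev = 0.0
    return {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "sum": total,
        "sumOfSquares": sum_sq,
        "count": count,
        "missing": len(values) - count,
        "mean": total / count if count else None,
        "stddev": stddev,
    }

stats = field_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, None])
```

Because stddev needs only count, sum, and sumOfSquares, partial results from several index shards can be merged by adding those three numbers.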

Solr Request Plugins
[Diagram: a query enters a request handler and the response leaves through a response writer. Request handlers: /select RequestHandler, /admin/luke Request Handler (non-component based), /mypath Request Handler (custom). Search components: Query Component, Facet Component, Highlight Component, Debug Component, Distributed Search. Additional plug-n-play search components: MoreLikeThis, Statistics, Terms, Spellcheck, TermVector, QueryElevation, Clustering, My Custom. Response writers: XML, XSLT, JSON, Binary, Velocity.]

Tons more new features!
- Ranges over arbitrary functions: {!frange l=1 u=2}sqrt(sum(a,b))
- Nested queries, for function queries too
- solrjs: JavaScript client library
- commitWithin: doc must be committed within x seconds
- Binary field type
- Merge one index into another
- SolrJ client for load balancing and failover
- Field globbing for some params: hl.fl=*_text
- Doublemetaphone, Arabic stemmer, etc.
- VelocityResponseWriter: template responses using Velocity
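The function-range example can be modelled in plain Python: compute the function per document and keep those whose value falls between l and u. This sketch treats both bounds as inclusive, which is an assumption here. The sample documents and the frange helper are made up for illustration.

```python
import math

def frange(docs, func, lower=None, upper=None):
    """Keep docs whose function value lies in [lower, upper];
    a bound of None means unbounded on that side."""
    hits = []
    for doc in docs:
        value = func(doc)
        if (lower is None or value >= lower) and (upper is None or value <= upper):
            hits.append(doc)
    return hits

docs = [
    {"id": 1, "a": 1, "b": 0},   # sqrt(1) = 1
    {"id": 2, "a": 2, "b": 2},   # sqrt(4) = 2
    {"id": 3, "a": 4, "b": 5},   # sqrt(9) = 3
]

# Equivalent in spirit to {!frange l=1 u=2}sqrt(sum(a,b))
hits = frange(docs, lambda d: math.sqrt(d["a"] + d["b"]), lower=1, upper=2)
hit_ids = [d["id"] for d in hits]
```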

Now it is much easier to find my plans to get Bugs Bunny with Solr. I am a super genius to use Solr! Contributed by Kayla Seeley, 9