Presentation is loading. Please wait.

Presentation is loading. Please wait.

Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.

Similar presentations


Presentation on theme: "Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands."— Presentation transcript:

1 Full-Text Search with Lucene Yonik Seeley yonik@apache.org 02 May 2007 Amsterdam, Netherlands

2 What is Lucene High performance, scalable, full-text search library Written by Doug Cutting, 100% Java Focus: Indexing + Searching Documents Easily embeddable, no config files No crawlers or document parsing

3 Inverted Index aardvark hood red little riding robin women zoo Little Red Riding Hood Robin Hood Little Women 01 02 0 0 2 1 0 1 2

4 Basic Application IndexWriterIndexSearcher Lucene Index Document field1: value1 field2: value2 field3: value3 Hits (Matching Docs) Query addDocument()search()

5 Indexing Documents IndexWriter writer = new IndexWriter(directory, analyzer, true); Document doc = new Document(); doc.add(new Field("title", "Lucene in Action", Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("author", "Erik Hatcher", Field.Store.YES, Field.Index.TOKENIZED)); doc.add(new Field("author", "Otis Gospodnetic", Field.Store.YES, Field.Index.TOKENIZED)); writer.addDocument(doc); writer.close();

6 Field Options Indexed –Necessary for searching or sorting Tokenized –Text analysis done before indexing Stored Compressed Binary –Currently for stored-only fields

7 Searching an Index IndexSearcher searcher = new IndexSearcher(directory); QueryParser parser = new QueryParser("defaultField", analyzer); Query query = parser.parse("title:Lucene"); Hits hits = searcher.search(query); System.out.println(“matches:" + hits.length()); Document doc = hits.doc(0); System.out.println(“first:" + doc.get("title")); searcher.close();

8 Scoring VSM – Vector Space Model tf – numer of terms in field lengthNorm – number of tokens in field idf – number of documents containing term coord – coordination factor, number of matching terms document boost query clause boost http://lucene.apache.org/java/docs/scoring.html

9 Query Construction Lucene QueryParser Example: queryParser.parse("title:spiderman"); good for IPC, human entered queries, debug does text analysis and constructs appropriate queries not all query types supported Programmatic query construction Example: new TermQuery(new Term(“title”,”spiderman”)) explicit, no escaping necessary

10 Query Examples 1.mission impossible EQUIV: mission OR impossible QueryParser default is “optional” 2.+mission +impossible –actor:cruise EQUIV: mission AND impossible NOT cruise 3.“mission impossible” –actor:cruise 4.title:spiderman^10 description:spiderman 5.description:“spiderman movie”~10

11 Query Examples2 1.releaseDate:[2000 TO 2007] Range search: lexicographic ordering, so beware of numbers 2.Wildcard searches: te?t, te*t, test* 3.spider~ Fuzzy search: Levenshtein distanceLevenshtein distance Optional minimum similarity: spider~0.7 4.*:* 5.(a AND b) OR (c AND d)

12 Deleting Documents IndexReader.deleteDocument(int id) –exclusive with IndexWriter –powerful Deleting with IndexWriter –deleteDocuments(Term t) –updateDocument(Term t, Document d) Deleting does not immediately reclaim space

13 Performance Decrease index segments Lower merge factor Optimize Use cached filters ‘+title:spiderman +released:true’ ‘title:spiderman’ filtered by ‘released:true’

14 Index Structure _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del segments_3 IndexWriter params MaxBufferedDocs MergeFactor MaxMergeDocs MaxFieldLength

15 Search Relevancy PowerShot SD 500 PowerShotSD500 SD500PowerShot PowerShot sd500powershot powershot WhitespaceTokenizer WordDelimiterFilter catenateWords=1 LowercaseFilter power-shot sd500 power-shotsd500 sd500powershot sd500powershot WhitespaceTokenizer WordDelimiterFilter catenateWords=0 LowercaseFilter Query Analysis A Match! Document Analysis

16 Tokenizers Tokenizers break field text into tokens StandardTokenizer –source string: “full-text lucene.apache.org” –“full” “text” “lucene.apache.org” WhitespaceTokenizer –“full-text” “lucene.apache.org” LetterTokenizer –“full” “text” “lucene” “apache” “org”

17 TokenFilters LowerCaseFilter StopFilter LengthFilter ISOLatin1AccentFilter SnowballPorterFilter –stemming: reducing words to root form –rides, ride, riding => ride –country, countries => countri contrib/analyzers for other languages

18 Analyzers class MyAnalyzer extends Analyzer { private Set myStopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_ WORDS); public TokenStream tokenStream(String fieldname, Reader reader) { TokenStream ts = new StandardTokenizer(reader); ts = new StandardFilter(ts); ts = new LowerCaseFilter(ts); ts = new StopFilter(ts, myStopSet); return ts; }

19 Analysis Tips Use PerFieldAnalyzerWrapper Add same field more than once, analyze differently –Boost exact case matches –Boost exact tense matches –Query with or without synonyms –Soundex for sounds-like queries

20 Nutch Open source web search application Crawlers Link-graph database Document parsers (HTML, word, pdf, etc) Language + charset detection Utilizes Hadoop (DFS + MapReduce) for massive scalability

21 Solr XML/HTTP, JSON APIs Faceted search / navigation Flexible Data Schema Hit Highlighting Configurable Caching Replication Web admin interface Solr Flare: Ruby on Rails user interface

22 Questions?


Download ppt "Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands."

Similar presentations


Ads by Google