Introduction to Lucene Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.


1 Introduction to Lucene
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata

2 Open source search engines
• Academic
  – Terrier (Java, University of Glasgow)
  – Indri, Lemur (C++, and Java too, UMass & CMU)
  – Zettair (University of Melbourne)
• Apache project (non-academic)
  – Lucene
  – Apache license, legally easier for commercial use
• Lucene
  – Java search engine library, with many features
  – Ports/integration to other languages available (C/C++, C#, Python, Ruby, …)
  – Other projects on top of Lucene: Solr and others
  – Used by: LinkedIn, Twitter, CiteSeer, …

3 Lucene: overview
[Diagram: two pipelines that meet at the index.
 Indexing: Documents → (User: document building) → Lucene document → (Lucene: Analyzing) → Tokens → (Lucene: Indexing) → Index and dictionary.
 Searching: Text query from the UI → (User: query building) → Lucene query → (Lucene: Analyzing, Lucene: Searching) → Search results → (User: reading search results) → Results display.
 Here "User" means the programmer.]

4 Lucene document building
• A document is a collection of Fields
• Document: CV of a student, several fields
  – Last_Name: Banerjee
  – First_Name: Arkadeep
  – Age: 24
  – Gender: M
  – Institute: ISI Kolkata
  – Location: Kolkata
  – Description: Arkadeep is a highly motivated student who simply loves challenge. He is …
• A field can be a text, a number, a range

5 Building document
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    …
    Document doc = new Document();
    doc.add(new StringField("Last_Name", "Banerjee", …));
    doc.add(new StringField("First_Name", "Arkadeep", …));
    doc.add(new IntField("Age", 24, …));
    doc.add(new TextField("Description", description, …));
Now, let’s understand the fields

6 Building document
• Field.Store
  – NO: Don’t store the field value in the index
  – YES: Store the field value in the index
• Field.Index
  – ANALYZED: Tokenize with an Analyzer
  – NOT_ANALYZED: Do not tokenize
  – NO: Do not index this field
  – Other options
• In Lucene 4.x the field class itself carries the indexing choice: a StringField is indexed as a single token (not analyzed), while a TextField is analyzed.
    doc.add(new StringField("Last_Name", "Banerjee", Field.Store.YES));
    doc.add(new StringField("First_Name", "Arkadeep", Field.Store.YES));
    doc.add(new IntField("Age", 24, Field.Store.YES));
    doc.add(new TextField("Description", desc, Field.Store.NO));

7 Indexing
• IndexWriter: primary class for indexing
  – Stores the index in a directory
  – RAMDirectory also possible (in memory)
    StandardAnalyzer analyzer = new StandardAnalyzer();
    FSDirectory dir = FSDirectory.open(new File("directory_path"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    // Pass the directory opened above to the writer
    IndexWriter w = new IndexWriter(dir, config);
    w.addDocument(document);
    w.close();
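To make "index" concrete, here is a minimal plain-Java sketch of the data structure an indexer builds: a dictionary that maps each term to the sorted list of documents containing it. It does not use the Lucene API; the class name InvertedIndexSketch and its methods are hypothetical, and the tokenization (lowercase, split on non-alphanumerics) is a simplifying assumption.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: a toy inverted index, not Lucene's on-disk format.
final class InvertedIndexSketch {
    // Dictionary: term -> sorted set of document ids (the postings list).
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text naively and records the new document id under each term.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Arkadeep loves challenge");
        index.addDocument("ISI Kolkata student loves Lucene");
        System.out.println(index.lookup("loves"));   // [0, 1]
        System.out.println(index.lookup("Kolkata")); // [1]
    }
}
```

A real index also stores term frequencies, positions and stored field values; the sketch keeps only document ids to show the lookup direction (term to documents).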

8 Lucene document building
• A document is a collection of Fields
• Document: CV of a student, several fields
  – Last_Name: Banerjee
  – First_Name: Arkadeep
  – Age: 24
  – Gender: M
  – Institute: ISI Kolkata
  – Location: Kolkata
  – Description: Arkadeep is a highly motivated student who simply loves challenge. He is …
• A field can be a text, a number, a range
What would we like to search with?

9 Lucene Analyzer
• Tokenizes the text in the Fields
• Common Analyzers
  – WhitespaceAnalyzer: splits tokens on whitespace
  – SimpleAnalyzer: splits tokens on non-letters, and then lowercases
  – StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
  – StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

10 Analysis example
• "The quick brown fox jumped over the lazy dog"
• WhitespaceAnalyzer
  – [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• SimpleAnalyzer
  – [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• StopAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
• StandardAnalyzer
  – [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

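11 Analysis example 2
• "XY&Z Corporation - xyz@example.com"
• WhitespaceAnalyzer
  – [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
• StopAnalyzer
  – [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer
  – [xy&z] [corporation] [xyz@example.com]
  – Note: this output reflects the older StandardAnalyzer grammar (kept as ClassicAnalyzer); since Lucene 3.1, StandardAnalyzer follows Unicode text segmentation rules and may split company names and e-mail addresses differently.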
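The differences between the first three analyzers can be imitated in a few lines of plain Java. This is only a sketch of the behaviour described on the slides, not Lucene's implementation: the class AnalyzerSketch, its method names, and the short stop-word list are assumptions made for illustration (Lucene ships its own stop list).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: mimics the slide's analyzer descriptions in plain Java.
final class AnalyzerSketch {
    // Assumed stop list for the demo; not Lucene's actual list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: SimpleAnalyzer output minus stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String s = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(s)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(s));     // [xy, z, corporation, xyz, example, com]
        System.out.println(stop("The quick brown fox jumped over the lazy dog"));
    }
}
```

Whichever analysis is chosen, the same one has to be applied to queries, otherwise a query token such as "XY&Z" will never match an indexed token such as "xy".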

12 Searching
• IndexReader and IndexSearcher
    int k = 10;
    StandardAnalyzer analyzer = new StandardAnalyzer();
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("./index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    // query holds the user's query text (a String)
    Query q = new QueryParser("Description", analyzer).parse(query);
    TopDocs docs = searcher.search(q, k);
    ScoreDoc[] hits = docs.scoreDocs;

13 Query and QueryParser
    // q holds the user's query text (a String)
    QueryParser parser = new QueryParser(Version.LUCENE_40, "Last_Name", new StandardAnalyzer());
    Query query = parser.parse(q);
• QueryParser
  – The query must be parsed in the same way the documents were indexed
  – Tell the parser which field to search by default (field-based search)
  – Use the same analyzer

14 TopDocs and ScoreDoc
    TopDocs docs = searcher.search(q, k);
    ScoreDoc[] hits = docs.scoreDocs;
• Search returns TopDocs
  – Reference to the top-ranked documents returned by the search
• TopDocs has ScoreDoc(s)
  – Each ScoreDoc refers to a single document (its document id and score)
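One common way to keep only the k best-scoring documents is a bounded min-heap, sketched below in plain Java. This illustrates the idea of a top-k result list; it is not Lucene's collector code. The class TopKSketch and its nested Hit class (a stand-in for ScoreDoc) are hypothetical, and a score of zero is assumed to mean "no match".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: top-k selection with a bounded min-heap.
final class TopKSketch {
    // Stand-in for a ScoreDoc: a document id plus its score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // scores[i] is the score of document i; returns the k best hits, best first.
    static List<Hit> topK(float[] scores, int k) {
        // The heap's head is always the weakest hit kept so far.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) continue;   // assumed: zero means no match
            heap.add(new Hit(doc, scores[doc]));
            if (heap.size() > k) heap.poll();  // evict the weakest hit
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0f, 1.5f, 0.9f, 0.4f};
        for (Hit h : topK(scores, 2)) {
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

The heap never holds more than k + 1 entries, so memory stays proportional to k rather than to the number of matching documents.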

15 Getting the results
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("First_Name") + " " + d.get("Last_Name"));
    }
• Get the required fields from the documents

16 Adding/deleting Documents
    void addDocument(Document d);
    void addDocument(Document d, Analyzer a);
Important: ensure that the Analyzers used at indexing time are consistent with the Analyzers used at searching time.
    // Deletes docs containing term or matching query.
    // The term version is useful for deleting one document.
    void deleteDocuments(Term term);
    void deleteDocuments(Query query);

17 Index format
• Each Lucene index consists of one or more segments
  – A segment is a standalone index for a subset of documents
  – All segments are searched
  – A segment is created whenever IndexWriter flushes adds/deletes
• Periodically, IndexWriter will merge a set of segments into a single segment
  – Policy specified by a MergePolicy
  – Segments are grouped into levels
  – Segments within a group are roughly equal size (in log space)
  – Once a level has enough segments, they are merged into a segment at the next level up
• Explicitly invoke forceMerge(maxNumSegments) to merge segments (this method was called optimize() before Lucene 4.0)
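The level-based merging described above can be simulated with a toy model in plain Java. This is a deliberately simplified sketch, not Lucene's MergePolicy: it assumes every flush creates a segment of size 1 and that exactly mergeFactor equal-sized segments trigger a merge. The class MergeSketch and the parameter name mergeFactor are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a toy model of level-based segment merging.
final class MergeSketch {
    // Returns the segment sizes (oldest first) after the given number of flushes.
    static List<Integer> flushAll(int flushes, int mergeFactor) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < flushes; i++) {
            segments.add(1); // each flush creates a new smallest segment
            // Cascade: while the newest mergeFactor segments have equal size, merge them.
            while (segments.size() >= mergeFactor) {
                int n = segments.size();
                int size = segments.get(n - 1);
                boolean sameLevel = true;
                for (int j = n - mergeFactor; j < n; j++) {
                    if (segments.get(j) != size) sameLevel = false;
                }
                if (!sameLevel) break;
                for (int j = 0; j < mergeFactor; j++) {
                    segments.remove(segments.size() - 1);
                }
                segments.add(size * mergeFactor); // one segment at the next level up
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(flushAll(10, 3)); // [9, 1]
    }
}
```

In this model the number of segments grows only logarithmically with the number of flushes, which is why searching "all segments" stays cheap.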

18 Searching a changing index
    Directory dir = FSDirectory.open(...);
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
The reader above does not reflect changes to the index unless you reopen it. Reopening is more resource efficient than opening a new IndexReader. In Lucene 4.x, the older reader.reopen() call is replaced by DirectoryReader.openIfChanged(), which returns null when nothing has changed.
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
    if (newReader != null) {
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }

19 Near-real-time search
    IndexWriter writer = ...;
    // Open a reader directly from the writer (true: apply all pending deletes).
    // In older releases this was writer.getReader().
    DirectoryReader reader = DirectoryReader.open(writer, true);
    IndexSearcher searcher = new IndexSearcher(reader);
    // Now let us say there’s a change to the index using writer
    writer.addDocument(newDoc);
    // Opening or reopening a reader from the writer forces the writer to flush
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
    if (newReader != null) {
        reader.close();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }

20 Query Syntax
Query expression                      Document matches if…
java                                  Contains the term java in the default field
java junit                            Contains the term java or junit or both in the default field
  (or: java OR junit)                   (the default operator can be changed to AND)
+java +junit                          Contains both java and junit in the default field
  (or: java AND junit)
title:ant                             Contains the term ant in the title field
title:extreme -subject:sports         Contains extreme in the title and not sports in subject
(agile OR extreme) AND java           Boolean expression matches
title:"junit in action"               Phrase matches in title
title:"junit action"~5                Proximity matches (within 5) in title
java*                                 Wildcard matches
java~                                 Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]     Range matches
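The Boolean rows of the table boil down to set operations on the lists of documents that contain each term. The plain-Java sketch below evaluates two of the expressions that way. It does not use Lucene's query classes; the class BooleanMatchSketch, its helper methods, and the document ids are made up for illustration.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only: Boolean query operators as set operations.
final class BooleanMatchSketch {
    // Builds a postings set from document ids.
    static SortedSet<Integer> docs(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    // a OR b: documents containing either term.
    static SortedSet<Integer> or(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    // a AND b: documents containing both terms.
    static SortedSet<Integer> and(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    // a -b: documents containing a but not b.
    static SortedSet<Integer> andNot(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> out = new TreeSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Assumed postings for the demo.
        SortedSet<Integer> agile = docs(1, 4);
        SortedSet<Integer> extreme = docs(2, 4, 5);
        SortedSet<Integer> java = docs(2, 3, 4);
        SortedSet<Integer> sports = docs(2);

        System.out.println(and(or(agile, extreme), java)); // (agile OR extreme) AND java -> [2, 4]
        System.out.println(andNot(extreme, sports));       // extreme -sports -> [4, 5]
    }
}
```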

21 Programmatically constructed queries
• TermQuery, TermRangeQuery
• NumericRangeQuery
• PrefixQuery
• BooleanQuery
• PhraseQuery, WildcardQuery

22 Source and acknowledgements
• Slides by Manning and Nayak: http://web.stanford.edu/class/cs276/handouts/lecture-lucene.pptx
• The Lucene tutorial website: http://www.lucenetutorial.com
• Apache Lucene: http://lucene.apache.org

