Assignment 2: Full text search with Lucene Mathias Mosolf, Alexander Frenzel.


1 Assignment 2: Full text search with Lucene Mathias Mosolf, Alexander Frenzel

2 General
Problems?
Which documentation?
o Getting Started
o mainly examples
o Javadoc

3 Indexer
XML parsing with Digester
o http://commons.apache.org/digester/
Lucene
o org.apache.lucene.index.IndexWriter;
o org.apache.lucene.store.Directory;
o org.apache.lucene.analysis.Analyzer;
o org.apache.lucene.analysis.WhitespaceAnalyzer;
o org.apache.lucene.document.Document;
o org.apache.lucene.document.Field;

4
    Analyzer analyzer = new WhitespaceAnalyzer();
    boolean createFlag = true; // true: create a new index, overwriting an existing one
    writer = new IndexWriter(indexDir, analyzer, createFlag,
            IndexWriter.MaxFieldLength.UNLIMITED);
    [..]
    digester.parse(xml);
    [..]
    public void addMedlineDocument(MedlineDocument doc) throws IOException {
        this.counter++;
        String title = doc.getTitle().replaceAll("\\ ", " ").toLowerCase();
        String text = ((doc.getAbstract() != null) ? doc.getAbstract() : "")
                .replaceAll("\\ ", " ").toLowerCase();
        Document medlineDocument = new Document();
        // the PMID is stored verbatim as a single term, not tokenized
        medlineDocument.add(new Field("pmid", doc.getPmid(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        medlineDocument.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED));
        medlineDocument.add(new Field("abstract", text,
                Field.Store.YES, Field.Index.ANALYZED));
        // the separator keeps the last title word and the first abstract word
        // from fusing into one token
        medlineDocument.add(new Field("combined", title + " " + text,
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(medlineDocument);
    }
    [..]
    // optimize and close the index
    writer.optimize();
    writer.close();
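To make the effect of the analyzer choice concrete, here is a minimal plain-Java sketch of what lowercasing plus whitespace tokenization does to a text. It is an illustration only and does not call Lucene; the class name `WhitespaceTokenSketch` and the sample strings are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this is not Lucene's WhitespaceAnalyzer.
class WhitespaceTokenSketch {

    // Lowercase the text, then split it on runs of whitespace.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) { // a leading space yields one empty piece
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighbouring word.
        System.out.println(tokens("Operation, the patient presents"));
        // Joining two texts without a separator fuses the boundary words.
        System.out.println(tokens("Muscular dystrophy" + "We report"));
        System.out.println(tokens("Muscular dystrophy" + " " + "We report"));
    }
}
```

Two consequences are visible here: a query term has to carry the same punctuation as the indexed token (for example `operation,`), and two texts concatenated into one field need a separator between them.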

5 BoolSearch
Collecting the IDs in a Set
Lucene
o org.apache.lucene.search.IndexSearcher;
o org.apache.lucene.index.Term;
o org.apache.lucene.search.BooleanClause;
o org.apache.lucene.search.BooleanQuery;
o org.apache.lucene.search.ScoreDoc;
o org.apache.lucene.search.TermQuery;
o org.apache.lucene.search.TopDocs;

6
    IndexSearcher searcher = new IndexSearcher(indexDir);
    pmids = new HashSet<String>();

    public void search(String field, String[] keywords) throws IOException {
        BooleanQuery query = new BooleanQuery();
        // create BOOL query: every keyword must occur (AND)
        for (String word : keywords) {
            TermQuery tq = new TermQuery(new Term(field, word.toLowerCase()));
            query.add(tq, BooleanClause.Occur.MUST);
        }
        // extract PMIDs; maxDoc() as the limit returns every hit
        TopDocs docs = this.searcher.search(query, searcher.maxDoc());
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            pmids.add(doc.get("pmid")); // add to set
        }
    }
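The semantics of combining every keyword with `Occur.MUST` can be pictured as an intersection of per-term document sets. The following toy inverted index is a sketch under that assumption; it is plain Java, not Lucene's implementation, and the class name `BooleanAndSketch`, its methods, and the document IDs are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; illustrates MUST (AND) semantics, not Lucene internals.
class BooleanAndSketch {
    // term -> IDs of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index a document: lowercase, split on whitespace, record the ID per term.
    void add(String pmid, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(pmid);
            }
        }
    }

    // Every keyword must occur: intersect the document sets of all keywords.
    Set<String> searchAll(String... keywords) {
        Set<String> result = null;
        for (String word : keywords) {
            Set<String> docs = postings.getOrDefault(
                    word.toLowerCase(), Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs); // copy, so the index stays intact
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        BooleanAndSketch index = new BooleanAndSketch();
        index.add("1", "protein complex formation");
        index.add("2", "protein folding");
        System.out.println(index.searchAll("protein", "complex"));
    }
}
```

Collecting the results in a set means each document ID appears once, however many of its terms matched.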

7 PhraseSearch
PhraseQuery vs SpanQuery
Lucene
o org.apache.lucene.search.IndexSearcher;
o org.apache.lucene.index.Term;
o org.apache.lucene.search.spans.SpanNearQuery;
o org.apache.lucene.search.spans.SpanQuery;
o org.apache.lucene.search.spans.SpanTermQuery;
o org.apache.lucene.search.spans.Spans;

8
    IndexSearcher searcher = new IndexSearcher(indexDir);
    pmids = new HashSet<String>();

    public void search(String field, String[] phrase) throws IOException {
        // generate query: one SpanTermQuery per word
        // (terms are used as given; the indexed text was lowercased)
        int l = phrase.length;
        SpanQuery[] sq = new SpanQuery[l];
        for (int i = 0; i < l; i++) {
            sq[i] = new SpanTermQuery(new Term(field, phrase[i]));
        }
        // slop 0, inOrder true: words must be adjacent and in the given order
        SpanNearQuery query = new SpanNearQuery(sq, 0, true);
        // search
        Spans sp = query.getSpans(this.searcher.getIndexReader());
        int id = -1;
        Document doc;
        // runs through all occurrences of the phrase
        while (sp.next()) {
            this.occ++; // number of occurrences
            if (id != sp.doc()) { // next doc
                id = sp.doc(); // save current id
                // add pmid
                doc = searcher.doc(id);
                this.pmids.add(doc.get("pmid"));
            }
        }
    }
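The idea behind an ordered span query with slop 0 (the words must be adjacent and in the given order, and every occurrence is counted) can be sketched without Lucene as a sliding-window comparison over each document's token sequence. The class `PhraseSketch`, its fields, and the sample documents are assumptions made for this illustration; Lucene itself works on positional postings rather than scanning documents.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy phrase matcher; mirrors the idea of slop 0 with inOrder = true,
// not Lucene's Spans implementation.
class PhraseSketch {
    private final Map<String, String[]> docs = new LinkedHashMap<>(); // pmid -> tokens
    final Set<String> pmids = new LinkedHashSet<>(); // documents containing the phrase
    int occ = 0;                                     // total number of occurrences

    void add(String pmid, String text) {
        docs.put(pmid, text.toLowerCase().trim().split("\\s+"));
    }

    void search(String[] phrase) {
        if (phrase.length == 0) {
            return;
        }
        for (Map.Entry<String, String[]> e : docs.entrySet()) {
            String[] tokens = e.getValue();
            // try every start position at which the whole phrase still fits
            for (int start = 0; start + phrase.length <= tokens.length; start++) {
                boolean match = true;
                for (int i = 0; i < phrase.length && match; i++) {
                    match = tokens[start + i].equals(phrase[i]);
                }
                if (match) {
                    occ++;                 // count each occurrence
                    pmids.add(e.getKey()); // the set keeps each document once
                }
            }
        }
    }

    public static void main(String[] args) {
        PhraseSketch p = new PhraseSketch();
        p.add("10", "the protein complex binds the protein complex");
        p.search(new String[] {"protein", "complex"});
        System.out.println(p.occ + " occurrence(s) in " + p.pmids);
    }
}
```

As in the slide's loop, the occurrence count and the set of matching documents are tracked separately, since one document can contain the phrase several times.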

9 Tests/Examples
Examples
o "reduce the appeal of" => ~0.399s
o "duchenne's muscular" => ~0.398s
o dicyclocarbodimide => ~0.399s
o experiment => ~0.458s
o protein complex => ~0.446s
o duchenne's disease => ~0.447s
"Complex"
o and => ~1.446s
o and the => ~1.298s
o and to the => ~1.212s
o and to the you => ~0.449s
o "operation, the patient presents no signs of" => ~0.400s

