For ITCS 6265. Professor: Wensheng Wu. Presented by TA: Xu Fei.


1 For ITCS 6265. Professor: Wensheng Wu. Presented by TA: Xu Fei.

2 What is Lucene
“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”
- A high-performance, scalable Information Retrieval (IR) library
- A project of the Apache Software Foundation
- Mature, free, and open source
- Implemented in Java

3 Full-text indexing and searching
“In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”
“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”

4 Lucene is popular
- A number of ports or integrations to other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.
- 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster, …

5 Lucene is just a hammer!
- NOT a ready-to-use search application like Google
- A software library, a toolkit
- A single compact JAR file (less than 1 MB!)
- A number of full-featured search applications have been built on top of Lucene.

6 What Lucene can do for you
- Add search capabilities to your application
- Index and make searchable any data that you can extract text from
- Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- You can even index data stored in your databases, indirectly!

7 Search Application
Figure 1. Typical components of a search application; the shaded components show which parts Lucene handles. (Figure not shown; its component labels are listed here.)
- Components for indexing: Acquire Content, Build Document, Analyze Document, Index Document
- Components for searching: Search User Interface, Build Query, Search Query, Render Results
- Others: Administration Interface, Analytics Interface, Scaleout
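
The indexing and searching components named on this slide can be sketched as a chain of small functions. This is only an illustration in plain Python, not Lucene code: every function name, the two sample documents, and the AND-only matching rule are assumptions made for the example.

```python
import re
from collections import defaultdict

def acquire_content():
    # Stand-in for a crawler or file reader: returns raw (id, text) pairs.
    # The sample texts are invented for this sketch.
    return [("doc1", "Penn State football"),
            ("doc2", "Football players at State")]

def build_document(doc_id, raw_text):
    # A "document" here is just a dict of named fields.
    return {"id": doc_id, "body": raw_text}

def analyze(text):
    # Lowercase the text and split it on anything that is not a letter or digit.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def index_document(index, document):
    # Record, for each term, the set of documents that contain it.
    for term in analyze(document["body"]):
        index[term].add(document["id"])

def build_query(user_input):
    # Run the query through the same analysis as the documents.
    return analyze(user_input)

def search(index, query_terms):
    # AND semantics (an assumption): keep documents containing every term.
    if not query_terms:
        return set()
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings)

def render_results(hits):
    return ", ".join(sorted(hits)) or "(no results)"

# Indexing side: acquire -> build -> analyze -> index.
index = defaultdict(set)
for doc_id, raw in acquire_content():
    index_document(index, build_document(doc_id, raw))

# Searching side: build query -> search -> render.
print(render_results(search(index, build_query("state football"))))
print(render_results(search(index, build_query("penn"))))
```

The point of the sketch is that documents and queries pass through the same analysis step, so that terms match on both sides of the index.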

8 Ranking formula
score(Q,D) = coord(Q,D) · queryNorm(Q) · ∑_{t in Q} ( tf(t in D) · idf(t)² · t.getBoost() · norm(D) )
tf–idf weight (term frequency–inverse document frequency)
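
A small worked version of the ranking formula may help make the factors concrete. The sketch below is plain Python, not Lucene. The definitions of tf, idf, coord, queryNorm, and the length-based norm are assumptions that follow the default similarity as recalled from the Lucene 2.x scoring documentation listed in the references, and should be checked against it. The three sample documents are invented, and document and field boosts are taken to be 1.

```python
import math

# Factor definitions (assumed to match Lucene 2.x defaults; verify before relying on them).
def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def coord(overlap, max_overlap):
    return overlap / max_overlap

def query_norm(sum_of_squared_weights):
    return 1.0 / math.sqrt(sum_of_squared_weights)

def norm(doc_terms):
    # Length normalization only; boosts are assumed to be 1.
    return 1.0 / math.sqrt(len(doc_terms))

def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    boosts = boosts or {}
    overlap = sum(1 for t in query_terms if t in doc_terms)
    if overlap == 0:
        return 0.0
    sum_sq = sum((idf(doc_freqs.get(t, 0), num_docs) * boosts.get(t, 1.0)) ** 2
                 for t in query_terms)
    total = 0.0
    for t in query_terms:
        freq = doc_terms.count(t)
        weight = idf(doc_freqs.get(t, 0), num_docs)
        # One summand of the formula: tf * idf^2 * boost * norm.
        total += tf(freq) * weight ** 2 * boosts.get(t, 1.0) * norm(doc_terms)
    return coord(overlap, len(query_terms)) * query_norm(sum_sq) * total

# Invented three-document collection, already tokenized.
docs = {
    "d1": ["penn", "state", "football"],
    "d2": ["football", "players"],
    "d3": ["college", "admissions"],
}
doc_freqs = {}
for terms in docs.values():
    for t in set(terms):
        doc_freqs[t] = doc_freqs.get(t, 0) + 1

query = ["state", "football"]
for doc_id, terms in docs.items():
    print(doc_id, round(score(query, terms, doc_freqs, len(docs)), 4))
```

With these inputs, d1 matches both query terms and ranks first, d2 matches only one term and is further reduced by coord, and d3 matches nothing and scores zero.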

9 Key index files in Lucene
- Segments file
- Fields information file
- Text information file
- Frequency file
- Position file

10 Inverted Index Example
Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table
id  word      doc    offset
1   football  Doc 1  3, 67
              Doc 2  1
2   penn      Doc 1  1
3   players   Doc 2  2
4   state     Doc 1  2
              Doc 2  13
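
A posting table of this kind can be built in a few lines. The following is a minimal sketch in plain Python, not Lucene's on-disk format. Because the slide's documents are elided, the two document texts below are invented stand-ins, so the offsets differ from those in the table. Positions are counted from 1, as on the slide.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and keep runs of letters and digits as terms.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    # term -> {doc id -> [1-based word positions]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text), start=1):
            index[term][doc_id].append(position)
    return index

# Invented stand-ins for the slide's (elided) documents.
docs = {
    "Doc 1": "Penn State football fans love football",
    "Doc 2": "Football players visit Penn State",
}

index = build_positional_index(docs)
for term in sorted(index):
    for doc_id, positions in index[term].items():
        print(f"{term:<10}{doc_id:<8}{positions}")
```

Looking up a term is then a dictionary access rather than a scan of every document, which is what makes the index worth building.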

11 Demo
- How to install Lucene and run the demo
- Boolean retrieval example: apache -lucene, apache +lucene, apache lucene
- Luke: http://www.getopt.org/luke/
- An online demo (PHP + Lucene): http://tiny.cc/JCA9K
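
The three Boolean queries in the demo can be imitated with set operations over an inverted index. This is a hedged sketch in plain Python, not Lucene's query parser. It assumes the usual reading of the operators: `+term` is required, `-term` is prohibited, and a bare term is optional. The three sample documents and all function names are invented for the example.

```python
from collections import defaultdict

# Invented sample collection.
docs = {
    1: "apache lucene search library",
    2: "apache http server",
    3: "lucene in action",
}

# term -> set of ids of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def parse(query):
    # Split a query into required (+), optional (bare), and prohibited (-) terms.
    required, optional, prohibited = [], [], []
    for tok in query.lower().split():
        if tok.startswith("+") and len(tok) > 1:
            required.append(tok[1:])
        elif tok.startswith("-") and len(tok) > 1:
            prohibited.append(tok[1:])
        else:
            optional.append(tok)
    return required, optional, prohibited

def boolean_search(index, query):
    required, optional, prohibited = parse(query)
    if required:
        # Every required term must be present; optional terms would only
        # influence ranking, which this sketch does not model.
        hits = set.intersection(*(index.get(t, set()) for t in required))
    else:
        # With no required terms, any optional term is enough to match.
        hits = set()
        for t in optional:
            hits |= index.get(t, set())
    for t in prohibited:
        hits -= index.get(t, set())
    return hits

for q in ["apache -lucene", "apache +lucene", "apache lucene"]:
    print(q, "->", sorted(boolean_search(index, q)))
```

In this sketch a query made only of prohibited terms returns nothing, since there is no positive set to subtract from.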

12 Reference:
- Lucene: http://lucene.apache.org/
- Apache: http://www.apache.org/
- “Lucene in Action”, Chapter 1 and code: Link
- Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
- http://en.wikipedia.org/wiki/Full_text_search
- http://en.wikipedia.org/wiki/Index_%28search_engine%29
- http://en.wikipedia.org/wiki/Tf-idf

