Lemur Application Toolkit
Kanishka P Pathak
Bioinformatics CIS 595


Introduction
A language model (LM) is a probabilistic mechanism for generating text.
In the past several years, there has been significant interest in the use of language modeling for text and natural language processing tasks.
We now have text information retrieval (IR) based on statistical language modeling.

Previous work
The first statistical modeler was Claude Shannon.
He thought of the human language as a statistical source and …
He measured how well simple n-gram models did at predicting and compressing natural text.

For many years, language models were used in speech recognition.
However, basic language modeling ideas have been used in information retrieval for quite some time.
Some of the previous models are:
the naïve Bayes model
the Robertson and Sparck Jones (RSJ) model

Their limitations
Naïve Bayes: suffers from the “independence assumptions” it makes.
RSJ: relies on the distribution of query terms in “relevant” and “non-relevant” documents.

Turning the problem around
Ponte and Croft proposed a smoothed version of the document unigram model to assign a score to a query.
Berger and Lafferty built on this model.
Their approach: “predict the input (i.e. the query)”.
This opened up new ways to think about information retrieval.
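The idea of scoring a query with a smoothed document unigram model can be sketched in a few lines of Python. This is only an illustration, not the Lemur implementation or the exact estimator from Ponte and Croft: the smoothing used here is simple linear interpolation with a collection model (an assumption), and the function name, the weight lam, and the toy documents are all made up for the example.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log-probability of the query under a smoothed unigram model of doc.

    Assumption: smoothing is linear interpolation between the document
    model and the collection model, with weight lam on the document.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc) if doc else 0.0
        p_coll = coll_counts[term] / len(collection) if collection else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # The term never occurs in the collection, so it cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score

# Toy data, invented for this sketch.
docs = {
    "d1": "language models generate text".split(),
    "d2": "lemurs live in madagascar".split(),
}
collection = [term for tokens in docs.values() for term in tokens]
query = "language text".split()

# Rank documents by how well each one "predicts" the query.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
print(ranking)
```

Smoothing matters here because d2 contains neither query term: without the collection model its score would be minus infinity, whereas with interpolation it still receives a finite, lower score than d1.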

Lemur
A ‘lemur’ is a nocturnal, monkey-like African animal largely confined to the island of Madagascar.
The name was chosen partly because of its resemblance to LM/IR.
Second, because the LM community was an island to the IR community.

What is the Lemur project?
It is a research project being carried out by the Computer Science departments at the University of Massachusetts and Carnegie Mellon University.
It is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA).
It is designed to facilitate research in language modeling and information retrieval.
It is written in C/C++ and runs under Unix as well as Windows.

Components and their interaction

The toolkit
The Lemur toolkit is available on the site www-2.cs.cmu.edu/~lemur
To use the toolkit: download, compile, execute.

Examples of applications
Pre-processing: ParseQuery, ParseToFile
Building/Adding Index: PushIndexer, BuildBasicIndex
Retrieval/Evaluation: RetEval, StructQueryEval
Summarization: BasicSummApp, MMRSummApp

What do we need to run an application?
Text documents in a format accepted by Lemur (TREC format)
A parameter file

Document formats in Lemur
Lemur supports 5 document formats: TREC, WEB, CHINESE, CHINESECHAR, ARABIC

Example of a document format
Say we take the document format “web”:
<DOC> any_number_here any_number_here Text here </DOC>

Example of document format
<DOC>
251 251
Ballistic Cam Design
This paper presents a digital computer program for the rapid calculation of manufacturing data essential to the design of preproduction cams which are utilized in ballistic computers of tank fire control systems. The cam profile generated introduces the superelevation angle required by tank main armament for a particular type ammunition.
CACM November, 1961
Archambault, M.
CA611117 JB March 15, 1978 10:37 PM
</DOC>
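To make the <DOC> ... </DOC> framing concrete, here is a minimal Python sketch that splits a source file's contents into individual documents. It is not Lemur's parser: it looks only at the <DOC> and </DOC> delimiters that appear in the examples above, makes no assumption about any markup inside a document, and the function name and sample string are invented for illustration.

```python
import re

# Non-greedy match so that each <DOC>...</DOC> pair becomes its own document.
DOC_PATTERN = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def split_documents(raw):
    """Return the whitespace-trimmed body of every <DOC>...</DOC> block."""
    return [body.strip() for body in DOC_PATTERN.findall(raw)]

# Sample input assembled from the two examples above.
sample = (
    "<DOC> 251 251 Ballistic Cam Design </DOC>"
    "<DOC> any_number_here any_number_here Text here </DOC>"
)

bodies = split_documents(sample)
for body in bodies:
    print(body)
```

A real collection would normally be read from the file named by inputFile; the string literal above just keeps the sketch self-contained.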

Example of what a parameter file looks like
Say we are creating a parameter file for the application ‘BuildBasicIndex’. The parameter file needs to have the following contents:
1. inputFile: the path to the source file
2. outputPrefix: a prefix name for your index
3. maxDocuments: maximum number of documents to index (default 1000000)
4. maxMemory: maximum amount of memory to be used for indexing (default 128MB)

E.g.:
inputFile=/usr/mydata/source;
outputPrefix=/usr/mydata/index;
maxDocuments=200000;
C:\lemur>BuildBasicIndex c:\lemur\buildpa
The index file generated is: /usr/mydata/index.bsc
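A parameter file like the one above can be generated or checked with a small script. The Python sketch below assumes only the `name=value;` layout visible in the example; the toolkit's full parameter-file grammar may allow more than this, and both helper functions are hypothetical conveniences, not part of Lemur.

```python
def format_parameters(params):
    """Render a dict as one `name=value;` entry per line."""
    return "".join(f"{name}={value};\n" for name, value in params.items())

def parse_parameters(text):
    """Read `name=value;` entries back into a dict of strings."""
    params = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the final semicolon
        name, _, value = entry.partition("=")
        params[name.strip()] = value.strip()
    return params

# Values taken from the BuildBasicIndex example above.
build_params = {
    "inputFile": "/usr/mydata/source",
    "outputPrefix": "/usr/mydata/index",
    "maxDocuments": 200000,
}

text = format_parameters(build_params)
print(text, end="")
```

Writing `text` to a file and passing that file's path to the application is then the same step as in the command-line example; parse_parameters is handy for checking an existing file before a long indexing run.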

Contd.
Run the application with the parameter file as the only argument, or as the first argument if the application can take other parameters from the command line.

Example:
C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt
OR
C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt c:\lemur\source.txt
where
BuildBasicIndex is the application
parambasic.txt is a parameter file for BuildBasicIndex
source.txt is the file containing the source document

Lemur API
The Lemur API is intended to allow a programmer to use the toolkit for special-purpose applications that are not implemented in the toolkit itself.
The API interfaces are grouped at three different levels:
1. Utility level
2. Indexer level
3. Retrieval level

API levels
Utility level: includes common utilities such as memory management, a default exception handler, and a program argument handler.
Indexer level: converts the raw text into efficient data structures so that the information (i.e. word counts) may be accessed conveniently and efficiently later.
Retrieval level: most useful for users who want to build a prototype system or an evaluation system.
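What the indexer level does, turning raw text into a structure from which word counts can be looked up quickly, can be illustrated with a toy inverted index. The Python sketch below does not use the Lemur API or its on-disk index format; the function name, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: count of that term in the document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase and split on whitespace; a real indexer
        # would apply a proper tokenizer, stopword list and stemmer.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return dict(index)

# Toy documents, invented for this sketch.
docs = {
    "d1": "the lemur toolkit",
    "d2": "the lemur is an animal the",
}

index = build_index(docs)
print(index["the"])
print(index.get("toolkit", {}))
```

Once the index is built, a word count becomes a dictionary lookup rather than a scan of the text, which is the kind of convenient and efficient access a retrieval layer depends on.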

Future developments
Summarizing
Filtering
Question answering
Language generation

References
www-2.cs.cmu.edu/~lemur
Jay M. Ponte and W. Bruce Croft, “A Language Modeling Approach to Information Retrieval” (CS, UMass Amherst)

THANK YOU Any questions?
