 CIKM  Implementation of Smoothing techniques on the GPU  Re-running experiments using the wt2g collection  The Future


 Began my FYP by revamping my UROP report and submitting it to CIKM for publication
 Learnt the importance of succinct writing
 Importance of re-drawing images
 Importance of re-writing equations
 Comments by reviewers:
   The definition of the language modeling approach was not clear
   We should use a standard dataset
 Also noticed that we need to improve our smoothing model. This became the direction for my FYP

 The Good-Turing smoothing algorithm  The Kneser-Ney smoothing algorithm

 Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.
 The Good-Turing adjusted count is $c^* = (c+1)\,\frac{N_{c+1}}{N_c}$, where $N_c$ is the number of N-grams that occur c times.

N-gram    Doc 1  Doc 2
a shoe    1      1
a cat     0      2
foo bar   2      0
a dog     1      2

2 phases:
1) Calculate the N_c values
2) Smooth the counts

Doc1 counts:        1 0 2 1
Sort:               0 1 1 2
Positions:          0 1 2 3
Stream compaction:  distinct values 0 1 2, starting at positions 0 1 3
Doc1: N_0 = 1, N_1 = 2, N_2 = 1
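The sort, positions, and stream compaction steps for Doc1 can be sketched sequentially in Python. This is only an illustration of the data flow, not the GPU code from the project: the function name `count_of_counts` and the use of boundary flags plus an exclusive prefix sum to drive the compaction are assumptions about how such a step is commonly built.

```python
from itertools import accumulate

def count_of_counts(counts):
    """Return {c: N_c} via sort -> boundary flags -> prefix sum -> compaction."""
    s = sorted(counts)          # e.g. [1, 0, 2, 1] -> [0, 1, 1, 2]
    n = len(s)
    # flags[i] is 1 where a new run of equal values starts in the sorted array
    flags = [1 if i == 0 or s[i] != s[i - 1] else 0 for i in range(n)]
    # Exclusive prefix sum: the output slot for each run head
    slots = [0] + list(accumulate(flags))[:-1]
    num_runs = sum(flags)
    values = [0] * num_runs     # distinct count values
    starts = [0] * num_runs     # position where each run begins
    for i in range(n):          # iterations are independent, like GPU threads
        if flags[i]:
            values[slots[i]] = s[i]
            starts[slots[i]] = i
    # N_c is the length of each run: next start minus this start
    ends = starts[1:] + [n]
    return {v: e - b for v, b, e in zip(values, starts, ends)}

doc1 = [1, 0, 2, 1]
n_c = count_of_counts(doc1)
print(n_c)
```

For Doc1 the compacted arrays are the values 0 1 2 at start positions 0 1 3, and the differences between consecutive start positions give the N_c values listed on the slide.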

Doc1 counts:  1         0         2         1
              Thread 0  Thread 1  Thread 2  Thread 3
Let one thread compute the smoothed count for each N-gram.
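A minimal sketch of the one-thread-per-N-gram idea, emulated with a plain Python loop. The name `smooth_kernel` is hypothetical, and the fallback to the raw count when N_{c+1} is zero is an assumption made so the example runs; the slides do not say how that case was handled. Each call applies the Good-Turing adjusted count c* = (c+1) N_{c+1} / N_c to one element.

```python
def smooth_kernel(tid, counts, n_c, out):
    """Work done by one hypothetical thread: smooth the count at index tid."""
    c = counts[tid]
    n_this = n_c.get(c, 0)
    n_next = n_c.get(c + 1, 0)
    if n_this > 0 and n_next > 0:
        # Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c
        out[tid] = (c + 1) * n_next / n_this
    else:
        # Assumption: keep the raw count when N_{c+1} is zero
        out[tid] = float(c)

counts = [1, 0, 2, 1]            # Doc1 counts from the slide
n_c = {0: 1, 1: 2, 2: 1}         # N_c values for Doc1 from the slide
smoothed = [0.0] * len(counts)

# On a GPU these iterations would run concurrently, one per thread
for tid in range(len(counts)):
    smooth_kernel(tid, counts, n_c, smoothed)

print(smoothed)
```

Because each thread reads only shared inputs and writes only its own output slot, no synchronization between threads is needed in this phase.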

 CIKM  Implementation of Smoothing techniques on the GPU  Re-running experiments using the wt2g collection  The Future

 Provided by the University of Glasgow
 Cost: 350 pounds. Size: 2 GB
Pipeline: Webpage -> HTML parser -> Text -> Inverted index -> LM indexer -> Re-run experiments -> Results!
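To make the inverted index stage of the pipeline concrete, here is a small sketch that maps each term to the documents containing it, with term frequencies. The function name `build_inverted_index`, the toy documents, and the whitespace tokenization are all illustrative assumptions and are not taken from the project code.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Count how often each term occurs in this document
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return dict(index)

# Hypothetical toy collection standing in for the extracted page text
docs = {"d1": "foo bar foo", "d2": "bar baz"}
index = build_inverted_index(docs)
print(index)
```

Term frequencies per document are the raw counts that a language-model indexer would later smooth.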

 Both written in Python
 Used the lxml API for HTML parsing: from lxml import html
 Do not use the HTML parser built into Python: it does not handle broken HTML well when extracting text
 Beautiful Soup is also a good option
 Used the nltk library for stemming (nltk.stem.porter) and for tokenizing text during indexing (nltk.word_tokenize)
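A hedged sketch of the parse-then-stem step using the two libraries named above. It requires lxml and nltk to be installed. The helper name `html_to_stems` and the sample markup are hypothetical. Tokens are split on whitespace here as a simplifying assumption; the slides mention nltk.word_tokenize, which additionally needs NLTK's tokenizer data to be downloaded.

```python
from lxml import html
from nltk.stem.porter import PorterStemmer

def html_to_stems(markup):
    """Extract visible text from (possibly broken) HTML and stem each token."""
    tree = html.fromstring(markup)          # lxml recovers from unclosed tags
    text = " ".join(tree.itertext())        # join all text nodes with spaces
    stemmer = PorterStemmer()
    # Assumption: whitespace tokenization instead of nltk.word_tokenize
    return [stemmer.stem(tok) for tok in text.lower().split()]

# Deliberately broken markup: neither <p> element is closed
broken = "<div><p>Running shoes<p>sleeping cats</div>"
stems = html_to_stems(broken)
print(stems)
```

Joining the text nodes with a space keeps words from adjacent elements from running together when the markup has no whitespace between tags.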

 CIKM  Implementation of Smoothing techniques on the GPU  Re-running experiments using the wt2g collection  The Future  Implementing Ponte and Croft’s model  Re-running experiments using the TREC GOV2 collection

 Modify the code to implement Ponte and Croft’s model
 Re-run experiments using the TREC GOV2 collection

Source | Dataset
Using Graphics Processors for high performance IR query processing | TREC GOV2 dataset of 25.2 million pages
Optimized topK processing with global page scores | TREC GOV2 (426 GB)
On Efficient Posting List Intersection with Multicore Processors | Altavista query log (250 GB)
Improving Search Engines Performance on Multithreading Processors | TodoCL database, consisting of pages from Chile (1.5 GB)
Faster Top-k Document Retrieval Using Block-Max Indexes | TREC GOV2
Batch Query processing for web search engines | Subset of 10 million pages crawled by the PolyBot web crawler
A Language Modeling approach to Information Retrieval | TREC, topics 202-250 and topics 51-100
Improved Index Compression Techniques for Versioned Document Collections | Wikipedia dataset (8 million documents) + Internet Archive (1.06 million documents)
Ranking Web Pages Using Collective Knowledge | Wikipedia + TREC
Combining the Language Model and Inference Network Approaches to Retrieval | TREC 4, 6, 7, and 8 queries
