• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the WT2G collection
• The Future

Presentation transcript:


• Began my FYP by revamping my UROP report and submitting it for publication to CIKM
  • Learnt the importance of succinct writing
  • Importance of re-drawing images
  • Importance of re-writing equations
• Comments by reviewers
  • Definition of the language modeling approach not clear
  • We should use a standard dataset
• Also noticed that we need to improve our smoothing model: the direction for my FYP

• The Good-Turing smoothing algorithm
• The Kneser-Ney smoothing algorithm

• Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.
• In the above definition, N_c is the number of N-grams that occur c times.
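For reference, the standard Good-Turing re-estimated count in this notation is given here. It is assumed, not confirmed by the source, that this is the definition the slide refers to; c* is the smoothed count of an N-gram whose raw count is c.

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}
```

As a small worked case, with N_0 = 1, N_1 = 2 and N_2 = 1, a raw count of c = 0 becomes c* = 1 · 2/1 = 2, and a raw count of c = 1 becomes c* = 2 · 1/2 = 1.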

N-gram     Doc 1   Doc 2
a shoe     1       1
a cat      0       2
foo bar    2       0
a dog      1       2

2 phases:
1) Calculate the N_c values
2) Smooth the counts

Doc1: 1 0 2 1
Sort: 0 1 1 2
Positions: 0 1 2 3
Stream compaction
Doc1: N_0 = 1, N_1 = 2, N_2 = 1
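The sort-then-compact idea can be sketched sequentially in Python. This is a CPU illustration only, not the GPU implementation from the slides; the function name `count_of_counts` is hypothetical. It sorts the counts, keeps only the positions where a new value starts (the compaction step), and takes differences between those positions to get each N_c.

```python
def count_of_counts(counts):
    """Return {c: N_c} for one document's N-gram counts."""
    s = sorted(counts)
    n = len(s)
    # "Compaction": keep only positions where a run of equal values begins.
    starts = [i for i in range(n) if i == 0 or s[i] != s[i - 1]]
    # Each run ends where the next one starts; the last run ends at n.
    ends = starts[1:] + [n]
    return {s[a]: b - a for a, b in zip(starts, ends)}


doc1 = [1, 0, 2, 1]  # Doc 1 column of the example table
print(count_of_counts(doc1))  # {0: 1, 1: 2, 2: 1}
```

On a GPU, the sort and the compaction would each be parallel primitives; the run-length arithmetic stays the same.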

Doc1: 1 0 2 1 (Thread 0, Thread 1, Thread 2, Thread 3)
Let one thread compute the smoothed count for each N-gram.
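A minimal Python sketch of the smoothing phase follows, assuming the standard Good-Turing re-estimate c* = (c + 1) N_{c+1} / N_c. Each list element is computed independently of the others, which is what makes a one-thread-per-N-gram mapping possible. The function name and the fallback to the raw count when N_{c+1} = 0 are assumptions made for illustration, not taken from the slides.

```python
from collections import Counter


def good_turing_counts(counts):
    """Return the smoothed count for every N-gram in one document."""
    n_c = Counter(counts)  # N_c: how many N-grams occur exactly c times

    def smooth(c):
        n_next = n_c.get(c + 1, 0)
        if n_next == 0:
            # Assumption: leave the count unchanged when N_{c+1} is zero.
            return float(c)
        return (c + 1) * n_next / n_c[c]

    # Each element depends only on its own count and the shared N_c table.
    return [smooth(c) for c in counts]


print(good_turing_counts([1, 0, 2, 1]))  # [1.0, 2.0, 2.0, 1.0]
```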

• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the WT2G collection
• The Future

• Provided by University of Glasgow
• Cost: 350 pounds. Size = 2 GB

Webpage -> HTML parser -> Text -> Inverted Index -> LM indexer -> Re-run experiments -> Results!

• Both written in Python
• Used the lxml API for HTML parsing: from lxml import html
• Do not use the built-in HTML parser provided by Python. It cannot handle broken HTML very well while extracting text
• Beautiful Soup is also a good option
• Used the nltk library for stemming (nltk.stem.porter) and tokenizing (nltk.word_tokenize)
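The inverted index step of the pipeline can be illustrated with a small standard-library sketch. It is not the author's indexer: the regex tokenizer is a stand-in for the nltk tokenizing and stemming described above, and the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {"d1": "A shoe and a dog", "d2": "A cat, a cat and a dog"}
index = build_inverted_index(docs)
print(index["cat"])  # {'d2': 2}
print(index["dog"])  # {'d1': 1, 'd2': 1}
```

Storing per-document term frequencies in the postings is what a language-model indexer needs later to estimate term probabilities per document.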

• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the WT2G collection
• The Future
  • Implementing Ponte and Croft's model
  • Re-running experiments using the TREC GOV2 collection

• Modify code to implement Ponte and Croft's model
• Re-run experiments using the TREC GOV2 collection

Source | Dataset
Using Graphics Processors for high performance IR query processing | TREC GOV2 dataset of 25.2 million pages
Optimized topK processing with global page scores | TREC GOV GB
On Efficient Posting List Intersection with Multicore Processors | Altavista query log (250GB)
Improving Search Engines Performance on Multithreading Processors | TodoCL database. A database consisting of pages from Chile (1.5GB)
Faster Top-k Document Retrieval Using Block-Max Indexes | TREC GOV2
Batch Query processing for web search engines | Subset of 10 million pages crawled by PolyBot web crawler
A Language Modeling approach to Information Retrieval | TREC. Topics and topics
Improved Index Compression Techniques for Versioned Document Collections | Wikipedia dataset (8 million documents) + Internet Archive (1.06 million documents)
Ranking Web Pages Using Collective Knowledge | Wikipedia + TREC
Combining the Language Model and Inference Network Approaches to Retrieval | TREC 4, 6, 7, and 8 queries