Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel Review by Newton Alex 993940942.

Problem
Searching over collections of data that include many different crawls and versions of each page
– E.g. searching the Internet Archive and other web archives
Full-text search is not feasible because of the high cost of processing a query
– E.g. applying current indexing and query processing techniques to, say, 10 successive crawls of the same URL results in index sizes and query processing costs roughly 10 times those of a single crawl

Proposed Solution
A new and general framework that significantly reduces the size of the inverted index and the cost of query processing for webpage collections with redundancy.
Features
– Content-dependent partitioning techniques, in particular winnowing
– Non-redundant indexing, with two policies for sharing fragments: local sharing and global sharing
– A modified document-at-a-time query processing algorithm that takes advantage of the fragment-based indexes
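The first two features can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function and class names (`winnow_boundaries`, `partition`, `FragmentIndex`), the parameters `k` and `w`, the use of MD5, and the naive `search` method are all assumptions made for illustration. The sketch assumes winnowing places a fragment boundary at the minimum hash in each sliding window of `w` consecutive `k`-word hashes, and it assumes a single fragment dictionary shared across all pages, which is how global sharing is read here. The query step is a plain conjunctive lookup, not the modified document-at-a-time algorithm.

```python
import hashlib
from collections import defaultdict


def shingle_hash(tokens):
    """Hash a short run of words to a 64-bit integer (MD5 is an arbitrary choice)."""
    data = " ".join(tokens).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def winnow_boundaries(tokens, k=3, w=4):
    """Return sorted cut positions: the rightmost minimum hash of every window."""
    hashes = [shingle_hash(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Ties are broken towards the right end of the window.
        offset = w - 1 - window[::-1].index(smallest)
        cuts.add(start + offset)
    return sorted(cuts)


def partition(tokens, k=3, w=4):
    """Split a token list into fragments at the winnowing cut positions."""
    if not tokens:
        return []
    cuts = [c for c in winnow_boundaries(tokens, k, w) if 0 < c < len(tokens)]
    bounds = [0] + cuts + [len(tokens)]
    return [tuple(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


class FragmentIndex:
    """Hypothetical non-redundant index: each distinct fragment is indexed once."""

    def __init__(self):
        self.fragment_ids = {}            # fragment content -> fragment id
        self.postings = defaultdict(set)  # term -> ids of fragments containing it
        self.versions = {}                # (url, version) -> list of fragment ids

    def add_version(self, url, version, text):
        ids = []
        for fragment in partition(text.split()):
            fid = self.fragment_ids.get(fragment)
            if fid is None:
                # Only previously unseen fragments add postings to the index.
                fid = len(self.fragment_ids)
                self.fragment_ids[fragment] = fid
                for term in fragment:
                    self.postings[term].add(fid)
            ids.append(fid)
        self.versions[(url, version)] = ids

    def search(self, terms):
        """Return versions whose fragments together contain every query term."""
        hits = []
        for key, ids in self.versions.items():
            own = set(ids)
            if all(self.postings.get(term, set()) & own for term in terms):
                hits.append(key)
        return sorted(hits)
```

Because boundaries depend only on nearby content, indexing a second crawl of an unchanged page adds no new fragments or postings; only the per-version list of fragment ids grows.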

Critique
The paper does not describe the data structures used or the hardware setup in detail.
The framework supports deleting old, unused fragments. Why is deletion required when we are interested in versioned systems?
Since no duplicate fragments are maintained, deleting a fragment for one page might also remove content that other pages in the archive share.
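The deletion concern can be made concrete with a small sketch. Reference counting is one possible safeguard, not something the paper is stated to use; the class `RefCountedFragments` and its methods are hypothetical names introduced only for this example.

```python
class RefCountedFragments:
    """Hypothetical store that frees a fragment only when no version uses it."""

    def __init__(self):
        self.refcount = {}  # fragment id -> number of versions referencing it
        self.versions = {}  # version key -> list of fragment ids

    def add_version(self, key, fragment_ids):
        if key in self.versions:
            raise ValueError(f"version {key!r} already stored")
        self.versions[key] = list(fragment_ids)
        # Count each fragment once per version, even if it repeats inside it.
        for fid in set(fragment_ids):
            self.refcount[fid] = self.refcount.get(fid, 0) + 1

    def delete_version(self, key):
        """Remove a version and return the fragment ids that are now unused."""
        freed = []
        for fid in set(self.versions.pop(key)):
            self.refcount[fid] -= 1
            if self.refcount[fid] == 0:
                del self.refcount[fid]
                freed.append(fid)
        return sorted(freed)
```

With two versions sharing fragments 2 and 3, deleting the first version frees only the fragment that no other version references; the shared fragments stay until the second version is deleted as well.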

Relation to Course
This paper is similar to the Google News paper. However, it does not describe the data structures or the environment setup in detail.
It relates to concepts used in the search engine project, such as inverted indexes and query matching.
It proposes methods for building efficient indexes over redundant data.