The Simigle Image Search Engine Wei Dong 2010-09-23.

Presentation transcript:

The Simigle Image Search Engine Wei Dong

Challenges
Large dataset
– ~100 million images w/ single server
High confidence
– False positive rate <
High recall
– Recall ~ 80%
Online search
High throughput
– Still a long way to go

System Overview (diagram labels)
– Loosely coupled
– Search servers
– Easy to replicate
– Read-only database
– Images
– A cluster for crawling and indexing images
– Clients w/ various browsers
– JSON, JPEG, HTML
Software techniques: C++, boost, poco; Javascript, jquery; C++, java, hadoop

Search Server Architecture (diagram labels)
– Query
– Session Cache (by UUID)
– Retrieval Cache (by SHA1)
– Miss
– Search Process: Feature Extraction, Feature Search, Query Expansion
– Thumbnail Database
– Feature Index (four instances)
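The retrieval cache step can be sketched as follows. This is only an illustration under stated assumptions: the cache is taken to be keyed by the SHA-1 digest of the query image bytes, and a miss is taken to trigger the search process. The names `RetrievalCache`, `run_search` and `fake_search` are hypothetical and do not come from the slides.

```python
import hashlib


class RetrievalCache:
    """Hypothetical in-memory cache keyed by the SHA-1 of the query image bytes."""

    def __init__(self, run_search):
        self._run_search = run_search  # callable: image bytes -> list of results
        self._store = {}
        self.misses = 0

    def lookup(self, image_bytes):
        key = hashlib.sha1(image_bytes).hexdigest()
        if key not in self._store:
            # Cache miss: fall through to the (expensive) search process.
            self.misses += 1
            self._store[key] = self._run_search(image_bytes)
        return self._store[key]


def fake_search(image_bytes):
    # Stand-in for feature extraction, feature search and query expansion.
    return ["result-for-%d-bytes" % len(image_bytes)]


cache = RetrievalCache(fake_search)
first = cache.lookup(b"query image")
second = cache.lookup(b"query image")  # identical bytes, served from the cache
```

Keying on a content hash means two users who submit the same image share one cached result, independent of their session.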

Main Techniques
Entropy-filtered local image features
– High confidence
Graph-based query expansion
– High recall
Compact sketch representation
– Smaller database, faster search
Flexible bit-vector indexing
– Online search
Content-aware disk layout
– High throughput thumbnail retrieval

Entropy-Filtered Local Feature
Feature detection w/ Difference-of-Gaussian (DoG)
Entropy-based filtering for high confidence
– DoG detects more regions than needed.
– Some plain regions can cause false positives (like A, D).
– We only keep regions with high entropy (rich content, like B, C).
10x reduction of error rate
Fewer features have to be indexed
[Unpublished]
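A minimal sketch of entropy-based filtering, under assumptions: the slides do not say how entropy is measured, so this uses the Shannon entropy of a 16-bin intensity histogram, and the threshold of 2.0 bits is arbitrary. The region names reuse the slide's labels, but the pixel data is made up.

```python
import math
from collections import Counter


def shannon_entropy(pixels, bins=16):
    """Entropy in bits of the intensity histogram of a region (values 0..255)."""
    counts = Counter(p * bins // 256 for p in pixels)
    total = len(pixels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def keep_rich_regions(regions, threshold):
    """Return the names of regions whose entropy reaches the threshold."""
    return [name for name, pixels in regions.items()
            if shannon_entropy(pixels) >= threshold]


regions = {
    "A": [128] * 64,                  # plain region: a single intensity
    "B": [i * 4 for i in range(64)],  # rich region: intensities spread over 0..252
}
kept = keep_rich_regions(regions, threshold=2.0)
```

The plain region falls into one histogram bin and has zero entropy, so only the rich region survives the filter.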

Graph-Based Query Expansion
We can find more results if we use the initial results to search again
Keep searching until we find no more
Problem: this hits a lot of false positives
We use a graph-partitioning method [1] to cut off the expansion intelligently.
Recall improves from 43% to ~80% w/ the same false positive rate [2].
[1] Andersen, et al. Local graph partitioning using PageRank vectors. FOCS ’06.
[2] Unpublished.
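The basic expansion loop ("keep searching until we find no more") can be sketched as below. This shows only the uncontrolled expansion, which is the variant that suffers from false positives; the graph-partitioning cut-off from [1] is not implemented here. The `neighbors` callback and the toy graph are hypothetical stand-ins for a single round of similarity search.

```python
from collections import deque


def expand_query(seed, neighbors):
    """Re-search with every new result until no new images appear.

    `neighbors(image)` is a hypothetical single-round similarity search.
    """
    found = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for hit in neighbors(current):
            if hit not in found:
                found.add(hit)
                queue.append(hit)
    found.discard(seed)  # the query itself is not a result
    return found


# Toy similarity graph: "q" directly matches only "a", but "b" and "c"
# are reachable through repeated searches. "x" and "y" are unrelated.
graph = {"q": ["a"], "a": ["q", "b"], "b": ["a", "c"], "c": ["b"],
         "x": ["y"], "y": ["x"]}
initial = set(graph["q"])
results = expand_query("q", lambda img: graph.get(img, []))
```

In the toy graph the initial search returns one image while the expanded search returns three, which is the recall gain the slide describes.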

Compact Sketch Representation
Raw features are large, 5~10 KB/image
– About 80 features / image
– 128 bytes / feature (SIFT), or 64 bytes / feature (SURF) with lower quality
– Encodes all information about a region
We only need to tell if two features are extremely similar
128-bit sketch with random space partitioning techniques
Dong, et al. Asymmetric Distance Estimation with Sketches for Similarity Search in High-Dimensional Spaces. SIGIR ’08.
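One simple form of random space partitioning is sketched below as an assumption, not as the construction from the cited paper: each of the 128 bits compares one randomly chosen dimension against a randomly chosen threshold. The function names, the seeds and the test vectors are all hypothetical.

```python
import random


def make_sketcher(dim, bits=128, lo=0, hi=255, seed=0):
    """Build a function mapping a feature vector to a `bits`-bit integer sketch."""
    rng = random.Random(seed)
    # Each bit is defined by a random (axis, threshold) pair.
    cuts = [(rng.randrange(dim), rng.uniform(lo, hi)) for _ in range(bits)]

    def sketch(vector):
        value = 0
        for axis, threshold in cuts:
            value = (value << 1) | (vector[axis] > threshold)
        return value

    return sketch


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")


sketch = make_sketcher(dim=128)
rng = random.Random(1)
feature = [rng.randrange(256) for _ in range(128)]
near = [min(255, v + 1) for v in feature]        # tiny perturbation
far = [rng.randrange(256) for _ in range(128)]   # unrelated vector

d_near = hamming(sketch(feature), sketch(near))
d_far = hamming(sketch(feature), sketch(far))
```

A 128-bit sketch takes 16 bytes, so at about 80 features per image and one sketch per feature, the per-image cost would be roughly 80 × 16 = 1,280 bytes instead of 5~10 KB.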

Flexible Bit-Vector Indexing
Search for sketches w/ <= 3 bits different.
Divide the 128 bits into 4 blocks; with at most 3 differing bits, at least one block must be identical.
The state of the art [1] uses equal partitioning.
We find the optimal partitioning with dynamic programming [2]
– Faster
– More flexible
[1] Manku, et al. Detecting near-duplicates for web crawling. WWW ’07.
[2] Unpublished.
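The block argument can be sketched with the equal-partitioning baseline: four 32-bit blocks, one hash table per block, and a final Hamming check on the candidates. This is not the dynamic-programming partitioning from the slide, which is unpublished; `BlockIndex` and the sample values are hypothetical.

```python
from collections import defaultdict

BITS, BLOCKS, MAX_DIFF = 128, 4, 3
WIDTH = BITS // BLOCKS
MASK = (1 << WIDTH) - 1


def blocks(sketch):
    """Split a 128-bit sketch into four 32-bit blocks (lowest bits first)."""
    return [(sketch >> (i * WIDTH)) & MASK for i in range(BLOCKS)]


class BlockIndex:
    def __init__(self):
        # One table per block position: block value -> sketches with that block.
        self.tables = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, sketch):
        for table, block in zip(self.tables, blocks(sketch)):
            table[block].append(sketch)

    def search(self, query):
        # Any sketch within 3 bits shares at least one exact block with the query.
        candidates = set()
        for table, block in zip(self.tables, blocks(query)):
            candidates.update(table.get(block, ()))
        # Verify candidates with the full Hamming distance.
        return {s for s in candidates if bin(s ^ query).count("1") <= MAX_DIFF}


index = BlockIndex()
stored = (1 << 127) | 0xDEADBEEF
index.add(stored)
index.add(stored ^ ((1 << 128) - 1))  # every bit flipped: far away

# Three differing bits, falling in blocks 0, 1 and 3; block 2 stays identical.
query = stored ^ (1 << 0) ^ (1 << 40) ^ (1 << 100)
matches = index.search(query)
```

Because three differing bits can touch at most three of the four blocks, the untouched block always yields the stored sketch as a candidate.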

Content-Aware Disk Layout
Query results range from a few to 1000s
20~100 thumbnails / page
If thumbnails are randomly stored on disk, throughput will be limited by disk seeks
We store similar images together on disk and load a bunch with one disk seek
Results for a single query can be covered with a few disk seeks.
[Unpublished]
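The effect of the layout on seek counts can be illustrated with a simplified model, in which every distinct disk chunk touched costs one seek and one read fetches a fixed number of adjacent thumbnails. The image ids, the chunk size of 50 and both layouts are made-up values for illustration only.

```python
def count_seeks(result_ids, position_on_disk, thumbs_per_read):
    """Count distinct disk chunks touched; each chunk costs one seek."""
    return len({position_on_disk[i] // thumbs_per_read for i in result_ids})


n = 1000
# Hypothetical result page: 40 thumbnails of mutually similar images.
results = list(range(300, 340))

# Content-aware layout: similar images sit next to each other on disk.
clustered = {i: i for i in range(n)}
# Content-oblivious layout: similar images are spread over the disk.
scattered = {i: (i * 37) % n for i in range(n)}

seeks_clustered = count_seeks(results, clustered, thumbs_per_read=50)
seeks_scattered = count_seeks(results, scattered, thumbs_per_read=50)
```

In this toy setup the clustered layout serves the whole page from a single chunk, while the scattered layout touches many chunks for the same 40 thumbnails.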

Conclusion
We present a system for similar web image retrieval
– High capacity (~100 million images / server)
– High confidence (10^-6 error rate)
– High recall (~80% recall)
– Online search (searches return in seconds)
Future work: further improve responsiveness and throughput.