Index Structures for Querying the Deep Web. Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University); Misha Zatsman (Google).



Deep Web
- Surface web: static web pages, reached by keyword queries
- Deep web: databases behind sites such as Ebay, CNN, Cars.com, Amazon, …
- Deep web: [...] times the size of the surface web!

Deep Glue
- Structured queries in, query results out
- Deep web: Ebay database, CNN databases, Cars.com database, …, Amazon database
- Deep web: [...] times the size of the surface web!

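Deep Glue System
[Architecture diagram: a query such as "Find textbooks with price < $50" goes to the Query Engine, which uses the index structures to return a superset of the relevant data sources (half.com, …). The Indexer builds the index structures from databases on the Internet (half.com databases, …). The index structures are our focus.]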

Index structure for deep web: Challenges
- Deal with structured data
  - Underlying databases are structured
  - Surface web is typically unstructured
- Deal with large volumes
  - Orders of magnitude larger than the surface web

Our approach
- Understand the structure/typing of the data
  - Support equality and range queries
- Heavily compress the index
  - Achieve a factor of 10 compression
  - Tradeoff between compression factor and the number of false positives
  - Compression factor 10 with only ~10 false positives for 1000 data sources

Outline
- Query model
- Index structures
- Experimental evaluation
- Related work and conclusion

Assumptions
- Data sources are classified into domains
  - Online car dealers, online auctions, online travel agents, …
- Data sources in the same domain use the same logical relational schema
- Indexing attributes
  - Price, date, make, model, ISBN, …
  - Indexed by the Deep Glue system
- Indexing data can be obtained via
  - Crawling the deep web [Raghavan 01]
  - Previously agreed-upon protocols [Froogle]

Query Model
- Support equality and range queries
  - Currently on a single indexing attribute
- Schema: Car(Id, Make, Model, Year, Price)
- Queries:
  - Find all year 2003 cars: year = 2003
  - Find all cars that cost less than $1000: price < 1000
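
The two query forms on this slide can be sketched as a predicate on one indexing attribute. This is only an illustration: the `Query` class, the `OPS` table, and the sample car rows are hypothetical and do not come from the talk.

```python
import operator
from dataclasses import dataclass

# Comparison operators allowed in a single-attribute predicate (illustrative).
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Query:
    """An equality or range predicate on a single indexing attribute."""
    attribute: str
    op: str
    constant: float

    def matches(self, row: dict) -> bool:
        return OPS[self.op](row[self.attribute], self.constant)


# Made-up rows following the schema Car(Id, Make, Model, Year, Price).
cars = [
    {"Id": 1, "Make": "Honda", "Model": "Civic", "Year": 2003, "Price": 900},
    {"Id": 2, "Make": "Ford", "Model": "Focus", "Year": 2001, "Price": 1500},
    {"Id": 3, "Make": "Toyota", "Model": "Corolla", "Year": 2003, "Price": 4200},
]

year_2003 = Query("Year", "=", 2003)   # year = 2003
cheap = Query("Price", "<", 1000)      # price < 1000

ids_2003 = [c["Id"] for c in cars if year_2003.matches(c)]
ids_cheap = [c["Id"] for c in cars if cheap.matches(c)]
```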


Overview
- Uncompressed Index
- Compressed indexes, which still support equality and range queries
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

Uncompressed Index (UI)
- For each distinct value v of an indexing attribute, stores the list of data sources
- [Figure: matrix with values 1 to 6 as rows and data sources d1 to d8 as columns; an X marks a (value, data source) pair. d1: ebay.com, d2: amazon.com, …]
- UI: B+tree
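
A minimal sketch of the uncompressed index, assuming an in-memory sorted key list as a stand-in for the B+tree named on the slide. The class name, method names, and sample pairs are hypothetical.

```python
from bisect import bisect_left, bisect_right


class UncompressedIndex:
    """Maps each distinct value to the set of data sources that hold it.

    A sorted list of keys stands in for the B+tree so that range lookups
    can locate the first and last matching keys by binary search.
    """

    def __init__(self, pairs):
        postings = {}
        for value, source in pairs:
            postings.setdefault(value, set()).add(source)
        self._postings = postings
        self._keys = sorted(postings)

    def equality(self, value):
        """Data sources that contain exactly this value."""
        return set(self._postings.get(value, ()))

    def range(self, low, high):
        """Data sources that contain at least one value in [low, high]."""
        start = bisect_left(self._keys, low)
        stop = bisect_right(self._keys, high)
        result = set()
        for key in self._keys[start:stop]:
            result |= self._postings[key]
        return result


# Made-up (value, data source) pairs for illustration.
pairs = [(1, "d1"), (1, "d4"), (2, "d2"), (3, "d2"), (3, "d5"), (6, "d1")]
ui = UncompressedIndex(pairs)
```

Because every distinct value keeps its own posting list, lookups are exact, at the cost of space that grows with the number of (value, data source) pairs.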

Problems
- A huge number of values and data sources in the deep web!
- Indexing every indexing attribute requires space
- Need to compress UI!
- Use gzip?
  - Have to uncompress the index → index lookup too expensive!
- Need new compression techniques


Value Clustered Index (VCI)
- Intuition:
  - “Closely related” values are stored in “closely related” data sources
  - E.g. ISBN numbers of antique books appear in the online book retailers specializing in antique books
- Cluster “closely related” values
- Store the list of data sources only for each cluster

VCI Example
- [Figure: the value × data source matrix (values 1 to 6, data sources d1 to d8), with rows grouped into clusters]
- Cluster 1: {1, 6}; Cluster 2: {2, 5}; Cluster 3: {3, 4}
- False positives
  - value 1 → data source d1
- Tradeoff between space and accuracy
  - Mapping all values into one cluster
  - Mapping each distinct value into a separate cluster
- VCI structures: Union B+tree
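
The example can be sketched as follows, assuming that a cluster stores the union of its values' data sources and that a lookup returns that whole union. The clusters are the ones on the slide. The X positions in `postings` are made up, because the matrix layout did not survive in the transcript; they are chosen so that d1 is a false positive for value 1, which is one reading of the slide's note.

```python
class ValueClusteredIndex:
    """Stores one data-source list per cluster of values.

    `postings` maps value -> set of data sources (the uncompressed view);
    `clusters` is a list of value sets, assumed to partition the values.
    """

    def __init__(self, postings, clusters):
        self._cluster_of = {}
        self._sources = []
        for cluster_id, values in enumerate(clusters):
            union = set()
            for value in values:
                self._cluster_of[value] = cluster_id
                union |= postings[value]
            self._sources.append(union)

    def equality(self, value):
        """Superset of the data sources holding `value`."""
        cluster_id = self._cluster_of.get(value)
        if cluster_id is None:
            return set()
        return set(self._sources[cluster_id])


# Hypothetical matrix; only the per-row counts follow the slide.
postings = {
    1: {"d2", "d3", "d6"},
    2: {"d4", "d5", "d7", "d8"},
    3: {"d4"},
    4: {"d3", "d4", "d5"},
    5: {"d5", "d7", "d8"},
    6: {"d1", "d2", "d3", "d6"},
}
vci = ValueClusteredIndex(postings, [{1, 6}, {2, 5}, {3, 4}])

# Sources returned for value 1 that do not actually hold it.
false_positives_for_1 = vci.equality(1) - postings[1]
```

Putting every value in one cluster would give the smallest index and the most false positives; one cluster per value degenerates to the uncompressed index.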

VCI Implementation
- Use an existing scalable clustering algorithm
  - Scales to large data sets: BIRCH framework [Zhang96]
- Minimize the number of false positives
- Specify the parameters for BIRCH
  - Centroid: the mid-point of a cluster
  - Radius: a measure of quality for a cluster
  - Distance between clusters
- [Figure: two clusters, cluster1 and cluster2, annotated with centroid, radius, and the distance between them]

VCI formulae
- For a cluster having the set of values V, with ds(v) the set of data sources for value v:
  - centroid(V): the data sources associated with the cluster
  - radius(V): the sum of the number of false positives
  - distance(V1, V2): the additional number of false positives when merging two clusters
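
The formula images are not in the transcript, so the following is only one reading of the verbal definitions: the centroid as the union of ds(v) over the cluster, the radius as the total count of returned sources that do not hold the looked-up value, and the distance as the increase in that count after a merge. Function names and sample data are illustrative.

```python
def centroid(values, ds):
    """Data sources associated with the cluster: union of ds[v] for v in values."""
    sources = set()
    for v in values:
        sources |= ds[v]
    return sources


def radius(values, ds):
    """Sum, over the cluster's values, of the number of false positives.

    A false positive for v is a source in the centroid that does not hold v.
    """
    c = centroid(values, ds)
    return sum(len(c - ds[v]) for v in values)


def distance(v1, v2, ds):
    """Additional false positives introduced by merging clusters v1 and v2."""
    return radius(v1 | v2, ds) - radius(v1, ds) - radius(v2, ds)


# Made-up ds(v) sets for three values.
ds = {
    1: {"d2", "d3", "d6"},
    3: {"d4"},
    6: {"d1", "d2", "d3", "d6"},
}

near = distance({1}, {6}, ds)  # values with overlapping sources
far = distance({1}, {3}, ds)   # values with disjoint sources
```

Under this reading, values that share most of their data sources are close, so a clustering algorithm that merges nearby clusters keeps the false-positive count low.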


DataSource Clustered Index (DCI)
- Intuition: “closely related” data sources may have “closely related” sets of values
  - Amazon and B&N have similar sets of ISBN numbers
- In the data graph, VCI clusters rows and DCI clusters columns
- [Figure: the value × data source matrix (values 1 to 6, data sources d1 to d8), with columns grouped into clusters]
- Cluster 1: {d2, d3, d6}; Cluster 2: {d4, d5}; Cluster 3: {d1, d7, d8}
- Table structures are similar to VCI
- See paper for other details
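
A sketch of the column-wise counterpart, assuming each cluster of data sources is treated as a single unit that holds the union of its members' values. The clusters come from the slide; the `values_of` sets are made up, and the linear scan over clusters stands in for whatever lookup structure the paper uses.

```python
class DataSourceClusteredIndex:
    """Groups data sources into clusters and indexes each cluster as a unit.

    `values_of` maps data source -> set of values it holds;
    `clusters` is a list of data-source sets.
    """

    def __init__(self, values_of, clusters):
        self._clusters = [frozenset(c) for c in clusters]
        self._values = []
        for cluster in self._clusters:
            union = set()
            for source in cluster:
                union |= values_of[source]
            self._values.append(union)

    def equality(self, value):
        """All sources in every cluster where some member holds `value`."""
        result = set()
        for sources, values in zip(self._clusters, self._values):
            if value in values:
                result |= sources
        return result


# Hypothetical contents of the eight data sources.
values_of = {
    "d1": {5, 6}, "d2": {1, 2, 6}, "d3": {1, 3}, "d4": {3, 4},
    "d5": {4}, "d6": {1, 2}, "d7": {5}, "d8": {5, 6},
}
clusters = [{"d2", "d3", "d6"}, {"d4", "d5"}, {"d1", "d7", "d8"}]
dci = DataSourceClusteredIndex(values_of, clusters)

hits_for_3 = dci.equality(3)
```

Here value 3 is held only by d3 and d4, but the lookup returns both of their clusters in full, so d2, d5, and d6 are false positives.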


Value-DataSource Clustered Index (VDCI)
- VCI, DCI: cluster in 1 dimension
- VDCI: clusters in 2 dimensions, generalizes VCI/DCI
- Cluster: a set of values and a set of data sources
- [Figure: the value × data source matrix (values 1 to 6, data sources d1 to d8), with rectangular blocks grouped into clusters]
- Cluster 1: { {2, 3}, {d2, d3, d4} }
- Cluster 2: { {4, 5}, {d4, d5, d6} }
- Cluster 3: { {1, 2}, {d6, d7, d8} }
- Data source d4 is in two clusters
- Value 2 is in two clusters
- Table structures are similar to VCI
- See paper for other details
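
A sketch of lookups over two-dimensional clusters, using the three clusters listed on the slide. It assumes a lookup returns the union of the data-source sets of every cluster whose value set contains the value. How entries outside any cluster are stored is not stated in the transcript, so the sketch simply returns nothing for them. Names are illustrative.

```python
from collections import defaultdict


class ValueDataSourceClusteredIndex:
    """Each cluster pairs a set of values with a set of data sources."""

    def __init__(self, clusters):
        self._sources = [frozenset(sources) for _, sources in clusters]
        self._clusters_of = defaultdict(list)  # value -> ids of its clusters
        for cluster_id, (values, _) in enumerate(clusters):
            for value in values:
                self._clusters_of[value].append(cluster_id)

    def equality(self, value):
        """Union of the source sets of all clusters that contain `value`."""
        result = set()
        for cluster_id in self._clusters_of.get(value, ()):
            result |= self._sources[cluster_id]
        return result


# Clusters as listed on the slide: (values, data sources).
clusters = [
    ({2, 3}, {"d2", "d3", "d4"}),
    ({4, 5}, {"d4", "d5", "d6"}),
    ({1, 2}, {"d6", "d7", "d8"}),
]
vdci = ValueDataSourceClusteredIndex(clusters)

hits_for_2 = vdci.equality(2)  # value 2 belongs to clusters 1 and 3
```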


Histogram Based Index (HBI)
- VCI/VDCI don’t consider the ordering among values
  - Range queries imply this need
- HBI groups adjacent values into the same cluster
- Also need to ensure accuracy
  - Use a threshold to determine the boundary of a cluster
  - Threshold: average number of false positives in a cluster

HBI Example
- [Figure: the value × data source matrix (values 1 to 6, data sources d1 to d8)]
- Threshold: 2
- Cluster adjacent values
  - Cluster 1: {1}; Cluster 2: {2, 3, 4}; Cluster 3: {5, 6}
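
One plausible way to form such buckets is a greedy left-to-right scan that closes the current bucket when adding the next value would push the average number of false positives per value above the threshold. Both the greedy scan and the per-value average are assumptions; the slide only states the threshold. The matrix below is made up (only the per-row counts follow the slide) and is chosen so that the scan yields the slide's three clusters.

```python
def avg_false_positives(values, postings):
    """Average, per value, of bucket sources that do not hold that value."""
    union = set().union(*(postings[v] for v in values))
    return sum(len(union - postings[v]) for v in values) / len(values)


def build_buckets(postings, threshold):
    """Greedily group adjacent values while the average stays <= threshold."""
    buckets, current = [], []
    for value in sorted(postings):
        candidate = current + [value]
        if current and avg_false_positives(candidate, postings) > threshold:
            buckets.append(current)
            current = [value]
        else:
            current = candidate
    if current:
        buckets.append(current)
    return buckets


def build_index(postings, threshold):
    """One (low, high, sources) entry per bucket of adjacent values."""
    index = []
    for bucket in build_buckets(postings, threshold):
        sources = set().union(*(postings[v] for v in bucket))
        index.append((bucket[0], bucket[-1], sources))
    return index


def range_lookup(index, low, high):
    """Sources of every bucket that overlaps [low, high]."""
    result = set()
    for bucket_low, bucket_high, sources in index:
        if bucket_low <= high and bucket_high >= low:
            result |= sources
    return result


# Hypothetical matrix for values 1 to 6.
postings = {
    1: {"d1", "d2", "d3"},
    2: {"d4", "d5", "d6", "d7"},
    3: {"d5"},
    4: {"d4", "d5", "d6"},
    5: {"d1", "d2", "d8"},
    6: {"d1", "d2", "d3"},
}
buckets = build_buckets(postings, threshold=2)
hbi = build_index(postings, threshold=2)
```

Because each bucket covers a contiguous value range, a range query only needs to touch the buckets that overlap the range.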


Experimental setup
- Synthetic data
  - 1000 data sources, 100,000 values, 4,000,000 (value, data source) pairs
  - Other parameters are in the paper
- Metrics
  - Index creation time
  - Compression factor
  - False positives
- Setup
  - 2.8GHz Pentium IV, 1GB memory, 80GB disk
  - C++

Index creation time

Index structure   Time (min)
UI                0.25
VCI               15
DCI               3
VDCI              180
HBI               2.5

Equality queries (1000 data sources)

Range Queries (1000 data sources)


Related work
- Distributed database & information integration
  - Niagara system [Naughton01]
  - GlOSS [Gravano99]
  - …
- Database/inverted list compression
  - Query Optimization in Compressed Databases [Chen 01]
  - Compressing the Relations and Index [Goldstein 98]
  - Improved Query Performance with Variant Indices [O’Neill 97]
  - Implementation and Performance of Compressed Databases [Westmann 00]
  - Size Reduction of Inverted Files [Weiss 90]
  - …

Conclusion
- Space-efficient index structures for querying the deep web
  - Support equality and range queries
  - A factor of 10 compression with a little loss in precision
- Future work
  - Combine cluster-based and histogram-based indexes
  - Multiple-attribute queries
  - Joins
  - Incremental index maintenance

Questions?

Experimental setup: other parameters
- Number of groups
  - The data sources in the same group use the same distribution to generate the values
  - Default 20
- Group mode
  - How many groups a data source belongs to
  - Default 1
- Value correlation
  - How the order in the value space maps to the value ordering over which the Gaussian distribution is used
  - Default 0.2