Index Structures for Querying the Deep Web. Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University); Misha Zatsman (Google).

1 Index Structures for Querying the Deep Web
Jian Qiu, Feng Shao, Jayavel Shanmugasundaram (Cornell University); Misha Zatsman (Google)

2 Deep Web
[Diagram: keyword queries over static web pages, the surface web]

3 Deep Web
[Diagram: keyword queries reach static web pages (the surface web); beneath a site such as www.ebay.com lie the Ebay, CNN, Cars.com, Amazon, … databases, which form the deep web]
- The deep web is 400-500 times the size of the surface web!

4 Deep Glue
[Diagram: structured queries and query results flow between Deep Glue and the Ebay, CNN, Cars.com, Amazon, … databases of the deep web]
- The deep web is 400-500 times the size of the surface web!

5 Deep Glue System
[Diagram: a query such as "Find textbooks with price<$50" goes to the Query Engine, which consults the index structures built by the Indexer and returns a superset of the relevant data sources (e.g. Database Concepts @ half.com …); the data sources (Half.com databases, …) sit across the Internet]
- Our focus: the index structures

6 Index structure for deep web: Challenges
- Deal with structured data
  - Underlying databases are structured
  - Surface web is typically unstructured
- Deal with large volumes
  - Orders of magnitude larger than the surface web

7 Our approach
- Understand the structure/typing of the data
  - Support equality and range queries
- Heavily compress the index
  - Achieve a factor of 10 compression
  - Tradeoff between compression factor and the number of false positives
  - Compression factor of 10 with only ~10 false positives for 1000 data sources

8 Outline
- Query model
- Index structures
- Experimental evaluation
- Related work and conclusion

9 Assumptions
- Data sources are classified into domains
  - Online car dealers, online auctions, online travel agents, …
- Data sources in the same domain use the same logical relational schema
- Indexing attributes
  - Price, date, make, model, ISBN, …
  - Indexed by the Deep Glue system
- Indexing data can be obtained via
  - Crawling the deep web [Raghavan 01]
  - Previously agreed-upon protocols [Froogle]

10 Query Model
- Support equality and range queries
  - Currently on a single indexing attribute
- Schema: Car(Id, Make, Model, Year, Price)
- Queries:
  - Find all year 2003 cars: year = 2003
  - Find all cars that cost less than $1000: price < 1000
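The query model can be illustrated with a minimal sketch. It assumes a data source is relevant to a query when at least one of its tuples satisfies the single-attribute predicate; the source names, rows, and the `relevant_sources` helper are hypothetical and only serve the example.

```python
from typing import Callable, Dict, List, NamedTuple, Set


class Car(NamedTuple):
    """Logical schema from the slide: Car(Id, Make, Model, Year, Price)."""
    id: int
    make: str
    model: str
    year: int
    price: float


def relevant_sources(sources: Dict[str, List[Car]],
                     predicate: Callable[[Car], bool]) -> Set[str]:
    """Return the data sources holding at least one matching tuple."""
    return {name for name, rows in sources.items()
            if any(predicate(row) for row in rows)}


# Hypothetical data sources and rows, invented for illustration only.
sources = {
    "d1": [Car(1, "Honda", "Civic", 2003, 8500.0)],
    "d2": [Car(2, "Ford", "Focus", 1998, 900.0)],
    "d3": [Car(3, "Toyota", "Corolla", 2003, 7200.0),
           Car(4, "Ford", "Escort", 1995, 600.0)],
}

# Equality query: year = 2003
year_2003 = relevant_sources(sources, lambda car: car.year == 2003)
# Range query: price < 1000
under_1000 = relevant_sources(sources, lambda car: car.price < 1000)
```

An index over the deep web answers the same question without scanning the sources: it maps attribute values to the data sources that may hold them.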

11 Outline
- Query model
- Index structures
- Experimental evaluation
- Related work and conclusion

12 Overview
- Uncompressed Index
- Compressed Index, still supporting equality and range queries
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

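13 Uncompressed Index (UI)
- For each distinct value v of an indexing attribute, stores the list of data sources
- Data sources: d1: ebay.com, d2: amazon.com, …
- UI: B+tree
[Table: rows are values 1-6, columns are data sources d1-d8; an X marks a data source that holds the value. X marks per row: 1: 3, 2: 4, 3: 1, 4: 3, 5: 3, 6: 3; the column positions are not preserved in this transcript.]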
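A minimal sketch of the uncompressed index, assuming a sorted key list searched with `bisect` as a stand-in for the B+tree named on the slide. The class name, method names, and sample pairs are hypothetical.

```python
import bisect


class UncompressedIndex:
    """Maps each distinct value to the set of data sources holding it."""

    def __init__(self, pairs):
        postings = {}
        for value, source in pairs:
            postings.setdefault(value, set()).add(source)
        self._postings = postings
        # Sorted keys play the role of the B+tree leaf level.
        self._keys = sorted(postings)

    def equality(self, value):
        """Data sources that hold exactly this value."""
        return set(self._postings.get(value, ()))

    def range(self, low, high):
        """Data sources holding any value v with low <= v <= high."""
        start = bisect.bisect_left(self._keys, low)
        stop = bisect.bisect_right(self._keys, high)
        result = set()
        for key in self._keys[start:stop]:
            result |= self._postings[key]
        return result


# Hypothetical (value, data source) pairs.
pairs = [(1, "d1"), (1, "d2"), (2, "d2"), (3, "d5"), (6, "d8")]
ui = UncompressedIndex(pairs)
```

This structure returns exact answers, with no false positives, at the cost of storing one entry per (value, data source) pair.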

14 Problems
- A huge number of values and data sources in the deep web!
  - Indexing every indexing attribute requires space
- Need to compress the UI!
  - Use gzip?
    - Have to uncompress the index → index lookup too expensive!
  - Need new compression techniques

15 Overview
- Uncompressed Index
- Compressed Index, still supporting equality and range queries
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

16 Value Clustered Index (VCI)
- Intuition:
  - “Closely related” values are stored in “closely related” data sources
  - E.g. ISBN numbers of antique books are found in the online book retailers specializing in antique books
- Cluster “closely related” values
- Store the list of data sources only for each cluster

17 VCI Example
[Table: rows are values 1-6, columns are data sources d1-d8. X marks per row: 1: 3, 2: 4, 3: 1, 4: 3, 5: 3, 6: 4; the column positions are not preserved in this transcript.]
- Cluster 1: {1, 6}
- Cluster 2: {2, 5}
- Cluster 3: {3, 4}
- False positives
  - value 1 → data source d1
- Tradeoff between space and accuracy
  - Mapping all values into one cluster
  - Mapping each distinct value into a separate cluster
- VCI structures: Union B+tree
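The clustering idea can be sketched as follows. The three clusters are the ones listed on the slide. The transcript does not preserve which data sources hold which values, so the `ds` table below is an assumed placement chosen to match the per-row counts and to reproduce the false positive "value 1 → d1". The function names are hypothetical.

```python
def build_vci(ds, clusters):
    """Store one data-source list per cluster instead of one per value."""
    value_to_cluster = {}
    cluster_sources = []
    for cluster_id, values in enumerate(clusters):
        union = set()
        for value in values:
            value_to_cluster[value] = cluster_id
            union |= ds[value]
        cluster_sources.append(union)
    return value_to_cluster, cluster_sources


def vci_lookup(vci, value):
    """Return a superset of the data sources that hold the value."""
    value_to_cluster, cluster_sources = vci
    cluster_id = value_to_cluster.get(value)
    if cluster_id is None:
        return set()
    return set(cluster_sources[cluster_id])


# Assumed placement of the X marks (not recoverable from the transcript).
ds = {
    1: {"d2", "d3", "d4"},
    6: {"d1", "d2", "d3", "d4"},
    2: {"d5", "d6", "d7", "d8"},
    5: {"d5", "d6", "d7"},
    3: {"d1"},
    4: {"d1", "d7", "d8"},
}
clusters = [{1, 6}, {2, 5}, {3, 4}]  # clusters from the slide

vci = build_vci(ds, clusters)
false_positives = vci_lookup(vci, 1) - ds[1]
```

Because a lookup returns the union over the whole cluster, the answer is always a superset of the true data sources; larger clusters save space and add false positives.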

18 VCI Implementation
- Use an existing scalable algorithm
  - Scales to large data sets: Birch framework [Zhang96]
- Minimize the number of false positives
- Specify the parameters for Birch
  - Centroid: the mid-point of a cluster
  - Radius: a measure of quality for a cluster
  - Distance between clusters
[Diagram: two clusters, cluster1 and cluster2, annotated with centroid, radius and distance]

19 VCI formulae
- For a cluster having the set of values V, with ds(v) the set of data sources for value v:
  - centroid(V): the data sources associated with the cluster
  - radius(V): the sum of the number of false positives
  - distance(V1, V2): the additional number of false positives when merging two clusters
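The formulas themselves did not survive in this transcript, so the sketch below is only one reading of the verbal descriptions: the centroid as the union of ds(v) over the cluster, the radius as the number of data sources each value would wrongly return summed over the cluster, and the distance as the increase in radius caused by a merge. These definitions are assumptions and may differ from the paper.

```python
def centroid(values, ds):
    """Assumed: data sources associated with the cluster (union of ds(v))."""
    result = set()
    for value in values:
        result |= ds[value]
    return result


def radius(values, ds):
    """Assumed: total false positives, summed over the cluster's values."""
    center = centroid(values, ds)
    return sum(len(center - ds[value]) for value in values)


def distance(values_1, values_2, ds):
    """Assumed: additional false positives caused by merging two clusters."""
    merged = values_1 | values_2
    return radius(merged, ds) - radius(values_1, ds) - radius(values_2, ds)


# Hypothetical data: values 1 and 2 share sources, value 3 does not.
ds = {1: {"d1", "d2"}, 2: {"d1", "d2", "d3"}, 3: {"d7"}}

near = distance({1}, {2}, ds)    # merging similar values is cheap
far = distance({1, 2}, {3}, ds)  # merging dissimilar values is costly
```

Under this reading, a clustering algorithm that merges the closest clusters first keeps the number of false positives low.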

20 Overview
- Uncompressed Index
- Compressed Index:
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

21 DataSource Clustered Index (DCI)
- Intuition: “closely related” data sources may have “closely related” sets of values
  - Amazon and b&n have similar sets of ISBN numbers
- In the data graph, VCI clusters rows and DCI clusters columns
[Table: rows are values 1-6, columns are data sources d1-d8. X marks per row: 1: 3, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3; the column positions are not preserved in this transcript.]
- Cluster 1: {d2, d3, d6}
- Cluster 2: {d4, d5}
- Cluster 3: {d1, d7, d8}
- Table structures are similar to VCI
- See paper for other details
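A minimal sketch of clustering columns instead of rows: each value stores cluster identifiers rather than individual data sources, and a lookup expands every cluster to all of its members. The data-source clusters are the ones on the slide; the `ds` table and function names are hypothetical, and the paper's actual table layout is not shown here.

```python
def build_dci(ds, source_clusters):
    """For each value, store the ids of the clusters its sources fall in."""
    source_to_cluster = {
        source: cluster_id
        for cluster_id, members in enumerate(source_clusters)
        for source in members
    }
    return {
        value: {source_to_cluster[source] for source in sources}
        for value, sources in ds.items()
    }


def dci_lookup(dci, source_clusters, value):
    """Expand the stored cluster ids back into data sources (a superset)."""
    result = set()
    for cluster_id in dci.get(value, ()):
        result |= source_clusters[cluster_id]
    return result


# Data-source clusters from the slide.
source_clusters = [{"d2", "d3", "d6"}, {"d4", "d5"}, {"d1", "d7", "d8"}]

# Hypothetical value-to-source table.
ds = {
    1: {"d2", "d3", "d6"},
    2: {"d2", "d4", "d5", "d6"},
    3: {"d1", "d7"},
}

dci = build_dci(ds, source_clusters)
extra_for_2 = dci_lookup(dci, source_clusters, 2) - ds[2]
```

Value 1 matches a cluster exactly and incurs no false positives, while value 2 touches two clusters and picks up d3 as a false positive.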

22 Overview
- Uncompressed Index
- Compressed Index:
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

23 Value-DataSource Clustered Index (VDCI)
- VCI, DCI: cluster in 1 dimension
- VDCI: clusters in 2 dimensions, generalizes VCI/DCI
- Cluster: a set of values and a set of data sources
[Table: rows are values 1-6, columns are data sources d1-d8. X marks per row: 1: 3, 2: 6, 3: 4, 4: 4, 5: 5, 6: 3; the column positions are not preserved in this transcript.]
- Cluster 1: { {2, 3}, {d2, d3, d4} }
- Cluster 2: { {4, 5}, {d4, d5, d6} }
- Cluster 3: { {1, 2}, {d6, d7, d8} }
- Data source d4 is in two clusters
- Value 2 is in two clusters
- Table structures are similar to VCI
- See paper for other details
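A rough sketch of a two-dimensional lookup, using the three clusters from the slide. A value may belong to several clusters, so the lookup unions the data-source sets of every cluster containing it. The `residual` map for (value, data source) pairs that fall outside every cluster is an assumption made for this example; the slide defers such details to the paper.

```python
def vdci_lookup(clusters, value, residual=None):
    """Union the source sets of all clusters whose value set has the value."""
    result = set()
    for values, sources in clusters:
        if value in values:
            result |= sources
    if residual:
        # Assumed side structure for pairs not covered by any cluster.
        result |= residual.get(value, set())
    return result


# Clusters from the slide: (set of values, set of data sources).
clusters = [
    ({2, 3}, {"d2", "d3", "d4"}),
    ({4, 5}, {"d4", "d5", "d6"}),
    ({1, 2}, {"d6", "d7", "d8"}),
]

# Hypothetical leftover pairs for value 6, which is in no cluster.
residual = {6: {"d1", "d2", "d3"}}

sources_for_2 = vdci_lookup(clusters, 2)
clusters_with_d4 = sum(1 for _, sources in clusters if "d4" in sources)
```

Value 2 sits in clusters 1 and 3, so its answer combines both source sets, which agrees with the six marks in row 2 of the slide's table.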

24 Overview
- Uncompressed Index
- Compressed Index:
  - Value Clustered Index (VCI)
  - DataSource Clustered Index (DCI)
  - Value DataSource Clustered Index (VDCI)
  - Histogram Based Index (HBI)

25 Histogram Based Index (HBI)
- VCI/VDCI don’t consider the ordering among values
  - Range queries imply this need
- HBI groups adjacent values in the same cluster
- Also need to ensure accuracy
  - Use a threshold to determine the boundary of a cluster
  - Threshold: average number of false positives in a cluster

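26 HBI Example
[Table: rows are values 1-6, columns are data sources d1-d8. X marks per row: 1: 3, 2: 4, 3: 1, 4: 3, 5: 3, 6: 3; the column positions are not preserved in this transcript.]
- Threshold: 2
- Cluster adjacent values
  - Cluster 1: {1}
  - Cluster 2: {2, 3, 4}
  - Cluster 3: {5, 6}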
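One way to sketch the histogram construction is a greedy left-to-right scan: keep extending the current bucket while the average number of false positives per value stays at or below the threshold. The greedy rule is an assumption, since the slides state only the threshold criterion. The `ds` table is an assumed placement of the X marks, chosen so that a threshold of 2 yields the slide's clusters {1}, {2,3,4}, {5,6}.

```python
def _close(values, ds):
    """Turn a run of adjacent values into (low, high, data sources)."""
    return values[0], values[-1], set().union(*(ds[v] for v in values))


def build_hbi(ds, threshold):
    """Greedily group adjacent values into buckets (assumed strategy)."""
    buckets = []
    current = []
    for value in sorted(ds):
        candidate = current + [value]
        union = set().union(*(ds[v] for v in candidate))
        false_pos = sum(len(union - ds[v]) for v in candidate)
        if current and false_pos / len(candidate) > threshold:
            buckets.append(_close(current, ds))
            current = [value]
        else:
            current = candidate
    if current:
        buckets.append(_close(current, ds))
    return buckets


def hbi_range(buckets, low, high):
    """Union the sources of every bucket overlapping [low, high]."""
    result = set()
    for bucket_low, bucket_high, sources in buckets:
        if bucket_low <= high and low <= bucket_high:
            result |= sources
    return result


# Assumed placement of the X marks (not recoverable from the transcript).
ds = {
    1: {"d1", "d2", "d3"},
    2: {"d5", "d6", "d7", "d8"},
    3: {"d5"},
    4: {"d5", "d6", "d7"},
    5: {"d1", "d2", "d4"},
    6: {"d1", "d2", "d3"},
}
buckets = build_hbi(ds, threshold=2)
```

Because each bucket covers a contiguous value range, a range query only needs the buckets that overlap the queried interval.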

27 Outline
- Query model
- Index structures
- Experimental evaluation
- Related work and conclusion

28 Experimental setup
- Synthetic data
  - 1000 data sources, 100,000 values, 4,000,000 (value, data source) pairs
  - Other parameters are in the paper
- Metrics
  - Index creation time
  - Compression factor
  - False positives
- Setup
  - 2.8GHz Pentium IV, 1GB memory, 80GB disk
  - C++
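The two quality metrics can be sketched as below. The slides do not define them precisely, so this assumes the compression factor is the ratio of stored data-source entries (uncompressed over compressed) and that false positives are averaged per query. All names and the toy data are hypothetical.

```python
def compression_factor(uncompressed_entries, compressed_entries):
    """Assumed definition: ratio of stored data-source entries."""
    return uncompressed_entries / compressed_entries


def average_false_positives(queries, exact, approximate):
    """Mean number of returned data sources that hold no matching value."""
    total = sum(len(approximate(q) - exact(q)) for q in queries)
    return total / len(queries)


# Toy example: two values merged into a single cluster.
ds = {1: {"a"}, 2: {"a", "b"}}
cluster_sources = ds[1] | ds[2]

uncompressed = sum(len(sources) for sources in ds.values())  # 3 entries
compressed = len(cluster_sources)                            # 2 entries

factor = compression_factor(uncompressed, compressed)
avg_fp = average_false_positives(
    queries=[1, 2],
    exact=lambda value: ds[value],
    approximate=lambda value: cluster_sources,
)
```

The toy numbers show the tradeoff from the earlier slides: merging the two values shrinks the index and introduces one false positive for value 1.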

29 Index creation time
Index structure | Time (min)
UI              | 0.25
VCI             | 15
DCI             | 3
VDCI            | 180
HBI             | 2.5

30 Equality queries (1000 data sources)

31 Range Queries (1000 data sources)

32 Outline
- Query model
- Index structures
- Experimental evaluation
- Related work and conclusion

33 Related work
- Distributed database & information integration
  - Niagara system [Naughton01]
  - GlOSS [Gravano99]
  - …
- Database/inverted list compression
  - Query Optimization in Compressed Databases [Chen 01]
  - Compressing the Relations and Index [Goldstein 98]
  - Improved Query Performance with Variant Indices [O’Neill 97]
  - Implementation and Performance of Compressed Databases [Westmann 00]
  - Size Reduction of Inverted Files [Weiss 90]
  - …

34 Conclusion
- Space-efficient index structures for querying the deep web
  - Support equality and range queries
  - A factor of 10 compression with a little loss in precision
- Future work
  - Combine cluster-based and histogram-based indexes
  - Multiple-attribute queries
  - Joins
  - Incremental index maintenance

35 Questions?

36 Experimental setup: other parameters
- Number of groups
  - The data sources in the same group use the same distribution to generate the values
  - Default: 20
- Group mode
  - How many groups a data source belongs to
  - Default: 1
- Value correlation
  - How the order in the value space maps to the value ordering over which the Gaussian distribution is used
  - Default: 0.2

