
1 Accessing the Deep Web
Bin He, IBM Almaden Research Center, San Jose, CA
Mitesh Patel, Microsoft Corporation
Zhen Zhang, Computer Science, University of Illinois at Urbana-Champaign
Kevin Chen-Chuan Chang, Computer Science, University of Illinois at Urbana-Champaign
COMMUNICATIONS OF THE ACM, May 2007/Vol. 50, No. 5

2 Introduction
The Web has been rapidly “deepened” by massive databases online
  ▫ current search engines do not reach most of the data on the Internet
Surface Web
  ▫ static HTML pages connected by links
A far more significant amount of information is believed to be “hidden” in the deep Web
  ▫ behind the query forms of searchable databases

3 Conceptual View of the Deep Web

4 Introduction (Cont.)
This article reports a survey of the deep Web, covering its
  ▫ scale
  ▫ subject distribution
  ▫ search-engine coverage
  ▫ other access characteristics of online databases

5 Related Work
BrightPlanet.com, 2000
  ▫ established interest in this area
  ▫ focused only on the scale aspect
  ▫ used overlap analysis
    ▫ 43,000–96,000 “deep Web sites”
    ▫ informal estimate that 7,500 terabytes of data exist
    ▫ 500 times larger than the surface Web
  ▫ tends to underestimate
    ▫ overlap analysis assumes two search engines obtain data randomly and independently
    ▫ in practice, their coverage of deep Web data is highly correlated
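The slide does not give BrightPlanet's formula. As a minimal sketch, assuming overlap analysis takes its common two-sample (capture-recapture) form, the snippet below shows why correlated coverage pushes the estimate down: a larger shared set shrinks the result. The function name and all figures are hypothetical.

```python
def overlap_estimate(n_a: int, n_b: int, n_both: int) -> float:
    """Two-sample overlap estimate of a population size.

    n_a, n_b: items covered by engine A and engine B.
    n_both:   items covered by both engines.
    Assumes the two engines sample randomly and independently.
    """
    if n_both <= 0:
        raise ValueError("the two samples must share at least one item")
    return n_a * n_b / n_both


# Hypothetical figures, chosen only to show the direction of the bias.
independent = overlap_estimate(1000, 1000, 10)   # small overlap
correlated = overlap_estimate(1000, 1000, 100)   # engines cover similar data
```

With identical sample sizes, the correlated case yields an estimate ten times smaller, which is the underestimation the slide points to.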

6 Global Scale Estimation
IP sampling approach
  ▫ randomly sampled 1,000,000 IPs
  ▫ from the entire space of 2,230,124,544 valid IP addresses
For each IP
  ▫ downloaded HTML pages
  ▫ identified and analyzed Web databases in this sample
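A rough sketch of drawing unique random IPv4 addresses for such a sample. The slides do not say how the "valid" space was defined, so the filter here (globally routable, not multicast, not reserved, as judged by Python's `ipaddress` module) is an assumption and is not claimed to reproduce the 2,230,124,544 figure. `sample_ips` is a hypothetical helper name.

```python
import ipaddress
import random


def sample_ips(k, seed=None):
    """Return k distinct random IPv4 addresses that pass the validity filter."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        # Draw uniformly from the full 32-bit space, then reject
        # addresses the (assumed) filter considers unusable.
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global and not addr.is_multicast and not addr.is_reserved:
            chosen.add(addr)
    return sorted(chosen)


ips = sample_ips(50, seed=1)
```

Rejection sampling keeps the draw uniform over whatever address space the filter accepts.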

7 Site, Databases, and Interface
Distinguish three related notions for accessing the deep Web
  ▫ site, database, and interface
A deep Web site
  ▫ a Web server that provides information maintained in one or more back-end Web databases
Each database is searchable through one or more HTML forms, which serve as its query interfaces

8 Site, Databases, and Interface (Cont.)
Counting order: first find the number of query interfaces for each Web site, then the number of Web databases, and finally the number of deep Web sites

9 Query Interface
Exclude non-query HTML forms (which do not access back-end databases) from query interfaces
Exclude login, subscription, registration, polling, and message posting forms
Exclude “site search”
  ▫ many Web sites now provide it for searching the HTML pages on their sites
Remove duplicates

10 Web Database
Based on the discovered query interfaces
Compute the number of Web databases by finding the set of query interfaces (within a site) that refer to the same database
If the objects from one interface can always be found in the other one
  ▫ the two interfaces are searching the same database
  ▫ tested with five randomly chosen objects
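The same-database test above can be sketched as follows. This is an illustration only: `same_database`, `objects_a`, and `search_b` are hypothetical names, and the check runs in one direction (objects from interface A looked up through interface B).

```python
import random
from typing import Callable, Hashable, Optional, Sequence


def same_database(
    objects_a: Sequence[Hashable],
    search_b: Callable[[Hashable], bool],
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> bool:
    """Probe interface B with up to k random objects taken from interface A."""
    rng = rng or random.Random()
    probes = rng.sample(list(objects_a), min(k, len(objects_a)))
    if not probes:
        return False  # nothing to test, so no evidence of a shared database
    return all(search_b(obj) for obj in probes)


# Hypothetical catalog reachable through two different forms.
catalog = {"book-1", "book-2", "book-3", "book-4", "book-5", "book-6"}
shared = same_database(sorted(catalog), catalog.__contains__)
different = same_database(["cd-1", "cd-2"], catalog.__contains__)
```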

11 Deep Web Site
Recognizing a deep Web site is rather simple
A Web site is a deep Web site if it has at least one query interface

12 (Q1) Where to find “entrances” to databases?
To access a Web database, we must first find its entrances: the query interfaces
Depth of a query interface
  ▫ the minimum number of hops from the root page of the site to the interface page
Because this requires deep crawling, only 1/10 of the total IP samples were analyzed
  ▫ 100,000 IPs
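The depth definition above is a shortest-path distance, so a breadth-first crawl yields it directly. A minimal sketch over an in-memory link graph; `interface_depths`, the `links` mapping, and the `has_query_interface` predicate are hypothetical stand-ins for a real crawler and form detector, and the default `max_depth=10` mirrors the crawl depth reported in the Q1 results.

```python
from collections import deque


def interface_depths(links, root, has_query_interface, max_depth=10):
    """Map each page holding a query interface to its minimum hop count from root."""
    depths = {root: 0}
    queue = deque([root])
    found = {}
    while queue:
        page = queue.popleft()
        depth = depths[page]
        if has_query_interface(page):
            found[page] = depth
        if depth == max_depth:
            continue  # do not follow links beyond the crawl limit
        for nxt in links.get(page, ()):
            if nxt not in depths:  # first visit in BFS is the shortest path
                depths[nxt] = depth + 1
                queue.append(nxt)
    return found


# Hypothetical site: two pages carry query forms.
links = {"/": ["/about", "/search"], "/about": ["/search", "/advanced"]}
forms = {"/search", "/advanced"}
result = interface_depths(links, "/", forms.__contains__)
# result == {"/search": 1, "/advanced": 2}
```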

13 Results of Q1
Found 281 Web servers
Exhaustively crawling these servers to depth 10, we found that 24 of them are deep Web sites
  ▫ they contained a total of 129 query interfaces representing 34 Web databases
Query interfaces tend to be located at shallow depths in their sites
  ▫ none of the 129 query interfaces had a depth greater than 5
  ▫ 72% (93 out of 129) of the interfaces were found within depth 3

14 Depth of Web Database
A Web database may be accessed through multiple interfaces
  ▫ its depth is measured as the minimum depth of all its interfaces
  ▫ 94% (32 out of 34) of Web databases appeared within depth 3
  ▫ 91.7% (22 out of 24) of deep Web sites had their databases within depth 3

15 (Q2) What is the scale of the deep Web?
Tested and analyzed all of the 1,000,000 IP samples to estimate the scale of the deep Web
High depth-three coverage
  ▫ almost all Web databases can be identified within depth 3
Crawled to depth 3 for these one million IPs

16 Results of Q2
Found 2,256 Web servers
  ▫ 126 deep Web sites
  ▫ 406 query interfaces
  ▫ 190 Web databases
Extrapolated from s = 1,000,000 unique IP samples to the entire IP space of t = 2,230,124,544 IPs to estimate
  ▫ the number of deep Web sites
  ▫ the number of Web databases
  ▫ the number of query interfaces
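The slide's own formulas were not preserved in this transcript. The sketch below assumes simple proportional scaling (sample count × t / s) and, as a further assumption, divides by the depth-3 coverage fractions reported in the Q1 results to correct for items deeper than the crawl. `extrapolate` is a hypothetical helper.

```python
S = 1_000_000          # unique IP samples (s on the slide)
T = 2_230_124_544      # entire valid IP space (t on the slide)


def extrapolate(sample_count, coverage=1.0):
    """Scale a count observed in the IP sample up to the whole IP space.

    coverage: assumed fraction of items reachable within the crawl depth;
    1.0 means no correction.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return sample_count * T / S / coverage


# Depth-3 coverage fractions taken from the Q1 results.
sites = extrapolate(126, coverage=22 / 24)
databases = extrapolate(190, coverage=32 / 34)
interfaces = extrapolate(406, coverage=93 / 129)
```

Under these assumptions the three values come to roughly 307,000 sites, 450,000 databases, and 1,256,000 interfaces. As a consistency check, applying the same function with the database coverage to the 43 unstructured and 147 structured databases from Q3 gives about 102,000 and 348,000, the totals quoted on the Results of Q3 slide.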

17 (Q3) How “structured” is the deep Web?
Classified Web databases into two types
  ▫ unstructured databases
    ▫ provide data objects as unstructured media (text, images, audio, and video)
  ▫ structured databases
    ▫ provide data objects as structured “relational” records with attribute-value pairs

18 Results of Q3
Manual querying and inspection of the 190 sampled Web databases
  ▫ found 43 unstructured and 147 structured
  ▫ their total numbers are similarly estimated to be 102,000 and 348,000
The deep Web features mostly structured data sources
  ▫ structured to unstructured ratio of 3.4:1

19 (Q4) What is the subject distribution of Web databases?
Used the top-level categories of the yahoo.com directory as the taxonomy
Manually categorized the 190 sampled Web databases
The distribution indicates great subject diversity among Web databases
Non-commerce categories account for 51% (97 out of 190 databases)

20 (Q5) How do search engines cover the deep Web?
Randomly selected 20 Web databases from the 190
For each database, first manually sampled five objects (result pages) as test data
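One way to turn such test data into a coverage figure is the fraction of sampled objects that a search engine has indexed. This is an illustrative sketch, not the authors' procedure: `coverage`, the `sampled` mapping, and the `is_indexed` lookup are hypothetical, and the data is made up.

```python
def coverage(sampled, is_indexed):
    """Fraction of all sampled objects that the engine reports as indexed.

    sampled:    mapping of database name -> list of sampled object identifiers.
    is_indexed: callable returning True if the engine has indexed the object.
    """
    total = 0
    hits = 0
    for objects in sampled.values():
        for obj in objects:
            total += 1
            hits += bool(is_indexed(obj))
    return hits / total if total else 0.0


# Made-up sample: two databases, five objects each.
sampled = {
    "db1": ["a1", "a2", "a3", "a4", "a5"],
    "db2": ["b1", "b2", "b3", "b4", "b5"],
}
indexed_by_engine = {"a1", "a2", "b4"}
rate = coverage(sampled, indexed_by_engine.__contains__)  # 3 of 10 objects
```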

21 Coverage of Search Engines
The search engines index almost the same objects
  ▫ the coverage of one is entirely a subset of Yahoo's
This contrasts with the surface Web
  ▫ there, overlap between engines is low, so combining them greatly improves coverage

22 (Q6) What is the coverage of deep Web directories?
Services providing deep Web directories classify Web databases into taxonomies
For each directory, recorded the number of databases it claimed to have indexed
Result: low coverage
  ▫ manual classification of Web databases (directory-based indexing services) hardly scales for the deep Web

23 Conclusion

24 Conclusion (Cont.)
Poor coverage of both its data and databases
  ▫ access to the deep Web is not adequately supported
Deep Web vs. surface Web
  ▫ same
    ▫ large, fast-growing, and diverse
  ▫ different
    ▫ more diversely distributed, mostly structured, and subject to an inherent limitation of crawling
Crawl-and-index techniques
  ▫ face the “limit of coverage” and “structural heterogeneity”

25 Future Work
A database-centered, discover-and-forward access model
Automatically discover databases on the Web by crawling and indexing their query interfaces
When a user queries, forward the user to the actual search of the data
  ▫ use the databases' data-specific interfaces
  ▫ fully leverage their structures
Recent projects
  ▫ MetaQuerier and WISE-Integrator

