
1 Downloading Textual Hidden-Web Content Through Keyword Queries. Alexandros Ntoulas, Petros Zerfos, Junghoo Cho. University of California, Los Angeles, Computer Science Department. {ntoulas, pzerfos, JCDL, June 8th, 2005

2 Motivation. I would like to buy a used '98 Ford Taurus. Technical specs? Reviews? Classifieds? Vehicle history? Google?

3 Why can't we use a search engine? Search engines today employ crawlers that find pages by following links around. Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …). Search engines cannot reach such pages: there are no links to them (the Hidden Web). In this talk: how can we download Hidden-Web content?

4 Outline. Interacting with Hidden-Web sites. Algorithms for selecting queries for the Hidden-Web sites. Experimental evaluation of our algorithms.

5 Interacting with Hidden-Web pages (1). 1. The user issues a query through a query interface (e.g. "liver").

6 Interacting with Hidden-Web pages (2). 1. The user issues a query through a query interface. 2. A result list is presented to the user.

7 Interacting with Hidden-Web pages (3). 1. The user issues a query through a query interface. 2. A result list is presented to the user. 3. The user selects and views the interesting results.

8 Querying a Hidden-Web site. Procedure: while ( there are available resources ) do (1) select a query to send to the site; (2) send query and acquire result list; (3) download the pages; done
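A minimal sketch of this loop in Python, run against a toy in-memory "site". The site dictionary, the FIFO selection policy, and the unit-cost budget are illustrative assumptions, not the paper's implementation:

```python
def crawl(site, queries, budget):
    """Loop from the slide: while resources remain, (1) select a query,
    (2) send it and acquire the result list, (3) download the pages.
    Each query costs one unit of budget in this toy model."""
    downloaded = {}                        # doc_id -> content
    pending = list(queries)
    while budget > 0 and pending:
        q = pending.pop(0)                 # (1) select a query (here: FIFO)
        results = site.get(q, [])          # (2) send query, acquire result list
        for doc_id, text in results:       # (3) download the pages
            downloaded[doc_id] = text
        budget -= 1
    return downloaded

# Toy "Hidden-Web site": query -> list of (doc_id, content) results.
site = {"liver": [(1, "doc about liver"), (2, "doc about liver and kidney")],
        "kidney": [(2, "doc about liver and kidney"), (3, "doc about kidney")]}
docs = crawl(site, ["liver", "kidney"], budget=2)
```

Note that document 2 is returned by both queries but downloaded once; the interesting work, covered next, is choosing queries so that such overlap stays small.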

9 How should we select the queries? (1) S: set of pages in the Web site (pages as points). qi: set of pages returned if we issue query qi (queries as circles).

10 How should we select the queries? (2) Find the queries (circles) that cover the maximum number of pages (points). Equivalent to the set-covering problem.
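If the result sets were known in advance, the standard greedy approximation for set cover would apply directly: repeatedly pick the query that covers the most still-uncovered pages. A sketch with made-up query/page sets (the data and function name are illustrative):

```python
def greedy_cover(query_sets, max_queries):
    """query_sets: query -> set of page ids it returns (assumed known).
    Greedily choose queries by marginal coverage, the classic
    ln(n)-approximation to the (NP-hard) set-cover problem."""
    covered, chosen = set(), []
    for _ in range(max_queries):
        # pick the query that adds the most uncovered pages
        q = max(query_sets, key=lambda q: len(query_sets[q] - covered))
        if not query_sets[q] - covered:
            break                          # no candidate adds anything new
        chosen.append(q)
        covered |= query_sets[q]
    return chosen, covered

queries = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}
chosen, covered = greedy_cover(queries, max_queries=3)
```

Here the greedy pass picks "a" and then "c", skipping "b" entirely because its pages are covered as a side effect. The crawler's real difficulty, addressed on the next slide, is that these sets are unknown in advance.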

11 Challenges during query selection. In practice we don't know which pages will be returned by which queries (the qi are unknown). Even if we did know the qi, the set-covering problem is NP-hard. We will present approximation algorithms for the query selection problem. We will assume single-keyword queries.

12 Outline. Interacting with Hidden-Web sites. Algorithms for selecting queries for the Hidden-Web sites. Experimental evaluation of our algorithms.

13 Some background (1). Assumption: when we issue query qi to a Web site, all pages containing qi are returned. P(qi): fraction of pages from the site we get back after issuing qi. Example: q = liver. No. of docs in DB: 10,000. No. of docs containing liver: 3,000. P(liver) = 3,000 / 10,000 = 0.3.

14 Some background (2). P(q1 ∧ q2): fraction of pages containing both q1 and q2 (intersection of q1 and q2). P(q1 ∨ q2): fraction of pages containing either q1 or q2 (union of q1 and q2). Cost and benefit: how much benefit do we get out of a query? How costly is it to issue a query?

15 Cost function. The cost to issue a query and download the Hidden-Web pages: cq: query cost; cr: cost for retrieving a result item; cd: cost for downloading a document. Cost(qi) = cq + cr P(qi) + cd P(qi), i.e. (1) the cost for issuing a query, plus (2) the cost for retrieving a result item times the no. of results, plus (3) the cost for downloading a doc times the no. of docs.
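In code the cost model is a one-liner. The default unit costs below (cq = 100, cr = 100, cd = 10,000) are the example values used later in the talk; the function name is mine:

```python
def query_cost(p_q, c_q=100, c_r=100, c_d=10_000):
    """Cost(qi) = c_q + c_r * P(qi) + c_d * P(qi): a fixed cost per
    query, plus per-result-item and per-document costs, both
    proportional to the fraction P(qi) of the site's pages matched."""
    return c_q + c_r * p_q + c_d * p_q

# A query matching 30% of the site's pages (P(qi) = 0.3):
cost = query_cost(0.3)   # 100 + 100*0.3 + 10000*0.3 = 3130
```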

16 Problem formalization. Find the set of queries q1,…,qn which maximizes P(q1 ∨ … ∨ qn), under the constraint that the total cost stays within the available resources t: Cost(q1) + … + Cost(qn) ≤ t.

17 Query selection algorithms. Random: select a query randomly from a precompiled list (e.g. a dictionary). Frequency-based: select a query from a precompiled list based on frequency of occurrence (e.g. in a corpus previously downloaded from the Web). Adaptive: analyze previously downloaded pages to determine promising future queries.

18 Adaptive query selection. Assume we have issued q1,…,qi−1. To find a promising query qi we need to estimate P(q1 ∨ … ∨ qi−1 ∨ qi): P(q1 ∨ … ∨ qi−1 ∨ qi) = P(q1 ∨ … ∨ qi−1) + P(qi) − P(q1 ∨ … ∨ qi−1) P(qi | q1 ∨ … ∨ qi−1). The first term is known (by counting), since we have issued q1,…,qi−1. P(qi | q1 ∨ … ∨ qi−1) can be measured by counting the occurrences of qi within the pages downloaded so far. What about P(qi)?
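The recurrence above is directly computable once the two measurable quantities are in hand; a small sketch (variable names are mine):

```python
def updated_union(p_prev, p_qi, p_qi_given_prev):
    """P(q1 v ... v q_{i-1} v qi)
         = P(q1 v ... v q_{i-1}) + P(qi)
           - P(q1 v ... v q_{i-1}) * P(qi | q1 v ... v q_{i-1}).
    p_prev is known by counting the pages downloaded so far;
    p_qi_given_prev is measured by counting qi within them;
    p_qi itself must be estimated."""
    return p_prev + p_qi - p_prev * p_qi_given_prev

# e.g. 50% of the site already covered, qi matches 30% of the site,
# and 40% of the covered pages contain qi: coverage would grow to 0.6.
new_cov = updated_union(0.5, 0.3, 0.4)
```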

19 Estimating P(qi). Independence estimator: P(qi) ≈ P(qi | q1 ∨ … ∨ qi−1). Zipf estimator [IG02]: rank queries based on frequency of occurrence and fit a power-law distribution; use the fitted distribution to estimate P(qi).
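A sketch of the Zipf-estimator idea: fit freq(rank) ≈ C · rank^(−α) by least squares in log-log space, then read an unseen query's frequency off the fitted curve. The fitting routine and example data are illustrative, not the exact procedure of [IG02]:

```python
import math

def fit_power_law(freqs):
    """freqs: keyword document frequencies sorted in descending order
    (so position r is rank r).  Returns (C, alpha) such that
    freq(rank) ~= C * rank ** -alpha, via linear regression on logs."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    # log f = log C - alpha * log r, so alpha = -slope, log C = intercept
    return math.exp(my - slope * mx), -slope

# Frequencies drawn from an exact power law f(r) = 1000 / r are
# recovered exactly (up to floating-point error):
C, alpha = fit_power_law([1000.0, 500.0, 1000.0 / 3, 250.0])
```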

20 Query selection algorithm. foreach qi in [potential queries] do: estimate Pnew(qi) = P(q1 ∨ … ∨ qi−1 ∨ qi) − P(q1 ∨ … ∨ qi−1) and Efficiency(qi) = Pnew(qi) / Cost(qi); done. Return the qi with maximum Efficiency(qi).

21 Other practical issues. Efficient calculation of P(qi | q1 ∨ … ∨ qi−1). Selection of the initial query. Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results). Please refer to our paper for the details.

22 Outline. Interacting with Hidden-Web sites. Algorithms for selecting queries for the Hidden-Web sites. Experimental evaluation of our algorithms.

23 Experimental evaluation. Applied our algorithms to 4 different sites:

Hidden-Web site               | No. of documents | Limit in the no. of results
PubMed medical library        | ~13 million      | no limit
Books section of Amazon       | ~4.2 million     | 32,000
DMOZ: Open Directory Project  | ~3.8 million     | 10,000
Arts section of DMOZ          | ~429,000         | 10,000

24 Policies. Random-16K: pick query randomly from the 16,000 most popular terms. Random-1M: pick query randomly from the 1,000,000 most popular terms. Frequency-based: pick query based on frequency of occurrence. Adaptive.

25 Coverage of policies. What fraction of the Web sites can we download by issuing queries? Study P(q1 ∨ … ∨ qi) as i increases.

26 Coverage of policies for PubMed. Adaptive reaches ~80% coverage with ~83 queries; Frequency-based needs 103 queries for the same coverage.

27 Coverage of policies for DMOZ (whole). Adaptive outperforms the other policies.

28 Coverage of policies for DMOZ (arts). Adaptive performs best on topic-specific text as well.

29 Other experiments. Impact of the initial query. Impact of the various parameters of the cost function. Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results). Please refer to our paper for the details.

30 Related work. Issuing queries to databases: acquiring a language model [CCD99]; estimating the fraction of the Web that is indexed [LG98]; estimating the relative size and overlap of indexes [BB98]; building multi-keyword queries that can return a large number of documents [BF04]. Harvesting approaches / cooperative databases (OAI [LS01], DP9 [LMZN02]).

31 Conclusion. An adaptive algorithm for issuing queries to Hidden-Web sites. Our algorithm is highly efficient (it downloaded >90% of a site with ~100 queries). Allows users to tap into unexplored information on the Web. Allows the research community to download, mine, study, and understand the Hidden-Web.

32 References. [IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002. [CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999. [LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998. [BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998. [BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces. 2004. [LS01] C. Lagoze, H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001. [LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.

33 Thank you! Questions?

34 Impact of the initial query. Does it matter what the first query is? Crawled PubMed starting with the queries: data (1,344,999 results); information (308,474 results); return (29,707 results); pubmed (695 results).

35 Impact of the initial query. The algorithm converges regardless of the initial query.

36 Incorporating the document download cost. Cost(qi) = cq + cr P(qi) + cd Pnew(qi). Crawled PubMed with cq = 100, cr = 100, cd = 10,000.

37 Incorporating the document download cost. Adaptive uses resources more efficiently. The document download cost is a significant portion of the total cost.

38 Can we get all the results back? …

39 Downloading from sites limiting the number of results (1). The site returns only a subset qi′ of qi (up to the result limit). For qi+1 we need to estimate P(qi+1 | q1 ∨ … ∨ qi).

40 Downloading from sites limiting the number of results (2). Assume that qi′ is a random sample of qi.
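Under that random-sample assumption, counts measured in the downloaded sample scale up to the full (partly unseen) result set. A hypothetical sketch (names and data are illustrative):

```python
def estimate_matches(sample_docs, reported_total, word):
    """sample_docs: the <= k pages actually returned for a capped query,
    each represented as a set of words; reported_total: the site's
    reported total number of matches.  If the returned pages are a
    uniform random sample of the full result set, the fraction of the
    sample containing `word` estimates the fraction of all matches
    containing it, so the count scales linearly."""
    if not sample_docs:
        return 0.0
    frac = sum(word in doc for doc in sample_docs) / len(sample_docs)
    return frac * reported_total

# 2 of 4 sampled pages contain "liver"; the site reported 10,000 matches:
est = estimate_matches([{"liver"}, {"liver", "kidney"},
                        {"kidney"}, {"heart"}], 10_000, "liver")
```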

41 Impact of the limit of results. How does the limit on results affect our algorithms? Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000.

42 DMOZ with a result cap at 1,000. Adaptive still outperforms frequency-based.

