1
Searching the Web II
2
The Web
Why is it important:
–“Free” ubiquitous information resource
–Broad coverage of topics and perspectives
–Becoming dominant information collection
–Growth and jobs
Web access methods
–Search (e.g. Google)
–Directories (e.g. Yahoo!)
–Other …
3
Web Characteristics
Distributed data
High volatility
Large volume
Unstructured data
Quality of data
Heterogeneous data
4
Web Tasks
Precision is the key
–Goal: first 10-100 results should satisfy user
–Requires ranking that matches user’s need
–Recall is not important
    Completeness of index is not important
    Comprehensive crawling is not important
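The emphasis on the first 10-100 results corresponds to measuring precision over only the top k ranked results. The sketch below is illustrative: the function name `precision_at_k` and the toy document IDs are assumptions, not part of the slides.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    # Count how many of the first k results the user would consider relevant.
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Toy data (invented for illustration only).
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}  # "z" was never retrieved, which hurts recall only
print(precision_at_k(ranked, relevant, k=4))
```

Note that the unretrieved relevant document lowers recall but leaves precision at k untouched, which is why index completeness matters less for this task.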
5
Browsing
Web directories
–Human-organized taxonomies of Web sites
–Small portion (less than 1%) of Web pages
    Remember that recall (completeness) is not important
Directories point to logical web sites rather than pages
–Directory search returns both categories and sites
–People generally browse rather than search once they identify categories of interest
6
Metasearch
Search a number of search engines
Advantages
–No need to build their own crawler and index
–Cover more of the Web than any of their component search engines
Difficulties
–Need to translate the query into each engine's query language
–Need to merge results into a meaningful ranking
7
Metasearch II
Merging Results
–Voting scheme based on component search engines
    No model of component ranking schemes needed
–Model-based merging
    Need understanding of relative ranking, potentially by query type
Why they are not used for the Web
–Bias towards coverage (i.e. recall), which is not important for most Web queries
–Merging results is largely ad hoc, so search engines tend to do better
Big application: the Dark Web
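A voting scheme can be sketched with a Borda-style count, in which each engine awards a URL more points the higher it ranks that URL. The slides do not say which voting rule was meant, so Borda is an assumption here, as are the function name `borda_merge` and the toy URLs.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked URL lists from several engines with a Borda-style vote.

    Each engine gives len(ranking) points to its first result, one point
    fewer to the next, and so on. URLs an engine did not return get nothing.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, url in enumerate(ranking):
            scores[url] += n - position
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists from three component engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u1", "u4"]
engine_c = ["u2", "u3", "u1"]
print(borda_merge([engine_a, engine_b, engine_c]))
```

Only rank positions are used, so no model of how each engine scores its results is needed. Model-based merging would instead weight each engine according to how reliable its rankings are.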
8
Using Structure in Search
Languages to search content and structure
–Query languages over labeled graphs
    PHIQL: Used in Microplis and PHIDIAS hypertext systems
    Web-oriented: W3QL, WebSQL, WebLog, WQL
9
Using Structure in Search
Other use of structure in search
–Relevant pages have neighbors that also tend to be relevant
–Search approaches that collect (and filter) neighbors to returned pages
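Collecting and filtering neighbors can be sketched as follows, assuming the link graph is available as a dictionary mapping each page to the pages it links to. The function name `expand_with_neighbors`, the `keep` filter, and the toy graph are all hypothetical; a real system would filter on relevance rather than on the URL alone.

```python
def expand_with_neighbors(results, out_links, keep=lambda url: True):
    """Append in- and out-link neighbors of the returned pages.

    results   -- ranked list of URLs returned by the search
    out_links -- dict: URL -> iterable of URLs it links to (assumed input)
    keep      -- filter predicate applied to each candidate neighbor
    """
    # Invert the graph so pages linking *to* a result can be found as well.
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)

    expanded = list(results)
    seen = set(results)
    for url in results:
        neighbors = set(out_links.get(url, ())) | in_links.get(url, set())
        for candidate in sorted(neighbors):  # sorted for a deterministic order
            if candidate not in seen and keep(candidate):
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Toy link graph (invented): "a" links to "b" and "c"; "d" links to "a".
graph = {"a": ["b", "c"], "d": ["a"], "c": ["e"]}
print(expand_with_neighbors(["a"], graph))
```

Original results keep their positions and neighbors are appended after them. Re-ranking the expanded set is a separate step not shown here.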
10
Web Query Characteristics
Few terms and operators
–Average 2.35 terms per query
    25% of queries have a single term
–Average 0.41 operators per query
Queries get repeated
–Average 3.97 instances of each query
–This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)
Query sessions are short
–Average 2.02 queries per session
–Average of 1.39 pages of results examined
Data from 1998 study
–How different today?
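To revisit these figures on a current query log, the same averages could be computed roughly as below. This is a sketch that assumes the log is a list of (session ID, query string) pairs and that terms are separated by whitespace. The function name `query_log_stats` and the sample log are invented, and the 1998 study may have counted terms and sessions differently.

```python
from collections import Counter


def query_log_stats(log):
    """Compute simple per-query and per-session averages.

    log -- list of (session_id, query_string) pairs (assumed format)
    """
    if not log:
        raise ValueError("log must not be empty")
    queries = [query for _, query in log]
    term_counts = [len(query.split()) for query in queries]
    distinct = Counter(query.lower() for query in queries)
    sessions = Counter(session_id for session_id, _ in log)
    total = len(queries)
    return {
        "avg_terms_per_query": sum(term_counts) / total,
        "single_term_fraction": sum(1 for n in term_counts if n == 1) / total,
        "avg_instances_per_query": total / len(distinct),
        "avg_queries_per_session": total / len(sessions),
    }


# Invented sample log, for illustration only.
sample_log = [
    ("s1", "britney spears"),
    ("s1", "britney spears lyrics"),
    ("s2", "weather"),
    ("s3", "britney spears"),
]
for name, value in query_log_stats(sample_log).items():
    print(f"{name}: {value:.2f}")
```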