The “Deep Web” ISC 110 Final Project Kaila Ryan - 12/12/2013
What is the “Deep Web”? Web content that is hidden behind an HTML form and generally cannot be indexed by search engines (Madhavan et al., 2009). Largely made up of web-connected databases (Wright, 2009): shopping catalogs, scientific research data, public transport information, etc. Requires “valid input values” to access (Madhavan et al., 2009); in other words, a query or a similar form of typed input. Web crawlers are not yet sophisticated enough to automate the formulation of relevant queries, so they cannot reach this data.
A bit about search engines... Most modern search engines use automated “web crawler” programs to index websites. Crawlers follow a “trail” of links from webpage to webpage, indexing each new page they find so that it becomes searchable, part of the “surface web” (Wright, 2009). Because of how they function, traditional crawling methods fail to index some documents, such as databases, which require specific queries to access the information they contain. It is impossible (or at least inefficient and impractical) to use every possible query on every database found. Figuring out how to narrow the possible queries down to relevant terminology has been challenging.
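As a rough illustration of the link-following behavior described on this slide, here is a minimal sketch of a breadth-first crawler over a toy, in-memory site. The page names, the `LINKS` map, and the `FORM_RESULTS` entry are all hypothetical; the point is only that a page reachable solely through a form query never enters the crawler's index.

```python
from collections import deque

# Hypothetical toy site: each page maps to the links found on it.
LINKS = {
    "/home": ["/about", "/search"],
    "/about": ["/home"],
    "/search": [],  # holds only a search form, no outgoing links
}

# Hypothetical page that exists only as the response to a form query.
FORM_RESULTS = {"/search?q=bus+42": "Route 42 timetable"}


def crawl(start):
    """Follow links breadth-first and return the set of indexed pages."""
    indexed = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in indexed:
            continue  # already visited, skip to avoid loops
        indexed.add(page)
        queue.extend(LINKS.get(page, []))
    return indexed


surface = crawl("/home")
deep = set(FORM_RESULTS) - surface  # content the crawler never reached

print(sorted(surface))
print(sorted(deep))
```

The crawler indexes the form page itself, but because no link points at the query result, that result stays outside the “surface web” in this toy model.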
Finding the Deep Web: No single, exhaustive method of locating this data is available yet. Many competing theories and projects are working toward functioning Deep Web crawlers and search engines. Primary methods of locating Deep Web content at present: directories, like “The Hidden Wiki” (requires Tor browser), and referral by current users of a particular site, service, or database. Many in the field of Information Science are focused on developing technology capable of “surfacing” Deep Web content through new methods of locating and querying databases and indexing the results of these queries. Google has a team dedicated specifically to this task.
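The idea of “surfacing” can be sketched in a few lines, under heavy simplification. The example below assumes a hypothetical form with two inputs (`city` and `mode`) in front of a small in-memory database, and a hand-picked list of candidate values; it is not a description of how Google or any real system chooses its queries. It submits each candidate combination and indexes only the queries that return results.

```python
from itertools import product

# Hypothetical database sitting behind a two-field search form.
DATABASE = [
    {"city": "Boston", "mode": "bus", "route": "Route 1"},
    {"city": "Boston", "mode": "train", "route": "Red Line"},
    {"city": "Albany", "mode": "bus", "route": "Route 10"},
]


def submit_form(city, mode):
    """Stand-in for submitting the form; returns matching route names."""
    return [r["route"] for r in DATABASE
            if r["city"] == city and r["mode"] == mode]


def surface_form(candidates):
    """Try every combination of candidate inputs; index non-empty results."""
    index = {}
    tried = 0
    for city, mode in product(candidates["city"], candidates["mode"]):
        tried += 1
        results = submit_form(city, mode)
        if results:
            index[(city, mode)] = results
    return index, tried


# Assumed candidate values; choosing these well is the hard part in practice.
candidates = {"city": ["Boston", "Albany", "Paris"], "mode": ["bus", "train"]}
index, tried = surface_form(candidates)

print(tried, "queries tried,", len(index), "returned results")
```

Even in this toy case half of the queries come back empty, and the number of combinations grows multiplicatively with each extra form field, which is why narrowing the candidate values matters.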
The Deep Web's value: You may be asking yourself, “Why should we bother surfacing the 'Deep Web'? What is it worth to us?” The ability to automate database querying and indexing opens up the potential for automated cross-referencing of otherwise unconnected databases. This would be invaluable to medical and scientific research, and is an important step toward a semantic web. It could potentially be used to answer complex questions for which all of the information is available but is either not unified or not easily accessible (“What is the cheapest way to get from X to Y at 9am on a Sunday?”). In general, it offers the ability to discover a wealth of knowledge that is already freely available but hidden: up to 96% of the Web may be considered the Deep Web.
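To make the cross-referencing example concrete, here is a small sketch that answers the “cheapest way from X to Y at 9am on a Sunday” question from two unconnected sources. Both databases, their differing field names, and all fares are invented for illustration; the sketch only shows the step of normalizing separate records into one shape before comparing them.

```python
# Two hypothetical, unconnected databases with different field names.
BUS_DB = [
    {"from": "X", "to": "Y", "day": "Sun", "departs": "09:00", "fare": 4.50},
    {"from": "X", "to": "Y", "day": "Mon", "departs": "09:00", "fare": 3.00},
]
TRAIN_DB = [
    {"origin": "X", "destination": "Y", "weekday": "Sun",
     "time": "09:00", "price": 7.25},
]


def normalized_options():
    """Map both sources onto one common record shape."""
    for r in BUS_DB:
        yield {"provider": "bus", "origin": r["from"], "dest": r["to"],
               "day": r["day"], "time": r["departs"], "fare": r["fare"]}
    for r in TRAIN_DB:
        yield {"provider": "train", "origin": r["origin"],
               "dest": r["destination"], "day": r["weekday"],
               "time": r["time"], "fare": r["price"]}


def cheapest(origin, dest, day, time):
    """Return the lowest-fare matching option, or None if nothing matches."""
    matches = [o for o in normalized_options()
               if (o["origin"], o["dest"], o["day"], o["time"])
               == (origin, dest, day, time)]
    return min(matches, key=lambda o: o["fare"]) if matches else None


best = cheapest("X", "Y", "Sun", "09:00")
print(best["provider"], best["fare"])
```

Neither database alone can answer the question; the answer only appears once both have been queried and their results placed side by side.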
Sources
Bergman, M. K. (2001, Sept 24). The deep web: Surfacing hidden value. Deep Content. Retrieved from http://grids.ucs.indiana.edu/courses/xinformatics/searchindik/deepwebwhitepaper.pdf
He, B., Patel, M., Zhang, Z., & Chang, K. C.-C. (2007, May). Accessing the deep web. Communications of the ACM, 50(5), 94-101. http://doi.acm.org/10.1145/1230819.1241670
Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., & Halevy, A. (2008, August). Google's Deep Web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
Wright, A. (2009, Feb 23). Exploring a 'deep web' that Google can't grasp. The New York Times. Retrieved from http://cob.jmu.edu/williamson/mktg470/reading/search/2009/Exploring a ‘Deep Web’ That Google Can’t Grasp.pdf