Slide 1: Deep (Invisible) Web
CSE 8337 – Spring 2007, Project 2 – Deep Web
Manoj Ravuru, Student ID 22508269

Slide 2: Outline
- Web and Search Engines
- Types of Web
- What is the Deep Web? How big is it? Is it important?
- What makes it deep and what is in it?
- Deep Web content classification and categories
- Crawling and Indexing the Deep Web
- Deep Web Statistics

Slide 3: Outline (contd.)
- Deep Web Quality
- How to find and use the Deep Web
- Deep Web Gateways
- Deep Web Issues
- Summary
- References

Slide 4: Web and Search Engines
- In 1991, the Web was created by Tim Berners-Lee, a researcher at the CERN high-energy physics laboratory in Switzerland.
- Berners-Lee designed the Web to be platform-independent.
- To enable this cross-platform capability, Berners-Lee created HTML (Hypertext Markup Language), a simplified version of SGML (Standard Generalized Markup Language).
- The simplicity of the markup format made it practical to build search engines that users can employ to find and retrieve HTML documents of interest on the Web.
- This Shallow Web, also known as the Surface Web or Static Web, is the collection of Web sites indexed by automated search engines.
- A search engine's Web crawler follows URL links across the Web, indexes every word on every HTML page it reaches, and stores the results in huge databases that can be searched on demand; a minimal sketch of this loop follows below.
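The crawl-and-index loop in the last bullet can be sketched in a few lines of Python. This is an illustrative toy, not any particular engine's implementation; the seed URL, the page limit, and the in-memory index are assumptions made for the sketch.

# Minimal sketch of the Surface Web crawl-and-index loop described above.
# The seed URL, page limit, and in-memory index are illustrative assumptions.
import re
import urllib.request
from collections import defaultdict, deque
from urllib.parse import urljoin

def crawl_and_index(seed, max_pages=10):
    index = defaultdict(set)              # word -> URLs whose pages contain it
    queue, seen = deque([seed]), {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                      # pages the spider cannot fetch are skipped
        fetched += 1
        text = re.sub(r"<[^>]+>", " ", html)           # strip tags, keep visible text
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)          # "index every word on every HTML page"
        for link in re.findall(r'href="([^"#]+)"', html):
            absolute = urljoin(url, link) # follow URL links found on the page
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

# Usage (illustrative): crawl_and_index("http://example.com/")["domain"]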

Slide 5: Types of Web
- Static Web
- Dynamic Web
- Opaque Web
- Private Web
- Proprietary Web
- Pay-per-click Web

Slide 6: What is the Deep Web?
- Web pages that front vast information repositories which search engines cannot or will not index.
- Mainly refers to rich content that search engines have no direct access to, such as databases.
- Deep Web pages are dynamically created as the result of a specific search.
- The Deep Web is also called the Invisible Web.
- The term "invisible" in "Invisible Web" is actually a misnomer.
- Deep Web information is available via the Web but is not accessible to search engines.

Slide 7: How Big is the Invisible Web?
- Its size cannot be determined accurately.
- In a word, it is humongous.
- The Deep Web is estimated to be approximately 500 times bigger than the searchable or Surface Web, and may be bigger still.
- Considering that Google alone covers around 8 billion pages, that is mind-boggling.
- If the major search engines together index only 20% of the Web, then they miss 80% of the content.
- The Deep Web includes images, sounds, presentations, and many other types of media not visible to search engines.

Slide 8: Is the Deep Web Important?
- Think of the Web as a vast library: it requires more digging to find what is needed.
- Because search engines cover only a very small portion of the Web, the Invisible Web is a very tempting resource; there is far more information out there than one could ever imagine.
- A significant share of Deep Web content is quality content held in documents within searchable databases on the Web, which conventional (well-known and most widely used) search engines cannot access.
- As a result, businesses, researchers, consumers, and others may not get the quality information they need.
- Search engines themselves have trouble returning relevant content, at least for somewhat complicated or obscure queries.

Slide 9: Why the Name "Invisible"?
- When spiders crawling the Web run into a page from the Invisible Web, they do not quite know what to do with it.
- A spider can record the address of a page it could not access, but it cannot tell what information the page contains.
- The main factors are technical barriers, e.g., databases, password-protected pages, and script-based pages; the sketch below illustrates them.
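A toy illustration of those barriers: when the spider detects a login form, a query form, or script-only navigation on a fetched page, all it can do is log the URL together with the reason it could not index the content. The heuristics and the sample pages here are assumptions invented for this sketch.

# Toy illustration of the technical barriers above: the spider can log the
# address of a Deep Web page but not its content. The heuristics and the
# sample pages are assumptions made up for this sketch.
def barrier_reason(html):
    """Return why a page's content stays invisible to the spider, or None."""
    h = html.lower()
    if 'type="password"' in h:
        return "password-protected page"
    if "<form" in h:
        return "content sits behind a query form (database front end)"
    if "<a " not in h and "<script" in h:
        return "navigation is generated by scripts, so there are no links to follow"
    return None

pages = {  # hypothetical pages, purely for illustration
    "http://example.org/login": '<form><input type="password"></form>',
    "http://example.org/catalog": '<form action="/search"><input name="q"></form>',
    "http://example.org/app": "<script>render()</script>",
    "http://example.org/about": '<p>About us</p><a href="/">home</a>',
}

invisible_log = []
for url, html in pages.items():
    reason = barrier_reason(html)
    if reason:
        invisible_log.append((url, reason))   # record the address, not the content

print(invisible_log)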

Slide 10: What Makes It Deep?
- Proprietary sites
- Sites requiring registration
- Sites with scripts
- Dynamic sites
- Ephemeral sites
- Sites blocked by local webmasters
- Sites blocked by search engine policy
- Sites with special formats
- Searchable databases

Slide 11: Other Factors
- Pages excluded by search engine policy.
- Spiders and crawlers do not report what they cannot index.
- The sheer task of actually finding all the pages on the Web.

Slide 12: Deep Web Resource Classification
- Dynamic content: dynamic pages returned in response to a submitted query.
- Unlinked content: pages that are not linked to by other pages.
- Limited-access content: sites that require registration or otherwise limit access to their pages.
- Scripted content: pages that are only reachable through links produced by JavaScript or Flash, which require special handling.
- Non-text content: multimedia (image) files, Usenet archives, and documents in non-HTML file formats such as PDF and DOC.

Slide 13: Deep Web Content Categories

Slide 14: Crawling and Indexing the Deep Web
Major search engines such as Google, AltaVista, and Inktomi do index some dynamic content through the following programs:
- Paid partnership programs
- Trusted feed services
- Premium inclusion programs
Quigo's QUIBOT remotely crawls pages from the Deep Web, enabling it to index a large portion of the Deep Web and making this content available to users searching on Quigo and partner portals. Quigo's DeepWebGateway enables search engines to index Deep Web content that they cannot access directly. This technology also addresses other problems related to Deep Web crawling and indexing, such as spider traps and personalization. The general idea behind this kind of "surfacing" is sketched below.
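One generic way to surface database-backed content, loosely in the spirit of remote Deep Web crawling, is to submit a list of probe queries to a site's search form and hand the resulting dynamic pages to the ordinary indexer. This is only a sketch of that idea, not Quigo's actual QUIBOT or DeepWebGateway; the endpoint URL, the "q" parameter, and the probe terms are invented for illustration.

# Generic "surfacing" sketch: probe a site's search form with seed terms and
# collect the dynamic result pages for the ordinary indexer. Not Quigo's
# actual technology; the endpoint URL, the "q" parameter, and the probe
# terms are assumptions.
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "http://db.example.org/search"   # hypothetical form target
PROBE_TERMS = ["toxicology", "genealogy", "patents"]

def surface_deep_pages(endpoint, terms):
    surfaced = {}                      # query URL -> dynamic HTML it produced
    for term in terms:
        url = endpoint + "?" + urllib.parse.urlencode({"q": term})
        try:
            surfaced[url] = urllib.request.urlopen(url, timeout=5).read()
        except Exception:
            continue                   # skip probes the site rejects or times out on
    return surfaced                    # these pages can be fed to the normal indexer

# Usage (illustrative): surface_deep_pages(SEARCH_ENDPOINT, PROBE_TERMS)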

Slide 15: Deep Web Statistics
- Public information on the Deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
- The Deep Web contains 7,500 terabytes of information, compared to 29 terabytes in the Surface Web.
- The Deep Web contains nearly 550 billion individual documents, compared to the Surface Web's 2.5 billion.
- Ninety-five percent of the Deep Web is publicly accessible information not subject to fees or subscriptions.
- More than 200,000 Deep Web sites presently exist.
- The 60 largest Deep Web sites collectively contain about 750 terabytes of information, sufficient by themselves to exceed the size of the Surface Web 40 times over.
- On average, Deep Web sites receive 50% greater monthly traffic than surface sites and are more highly linked to than surface sites.

Slide 16: Deep Web Statistics (contd.)
- Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
- The total quality content of the Deep Web is 1,000 to 2,000 times greater than that of the Surface Web.
- More than half of the Deep Web content resides in topic-specific databases.
- Eighty-five percent of Web users use search engines to find needed information, yet nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.
- More than 95% of Deep Web information is publicly available without restriction.
- International Data Corporation predicts that the number of Surface Web documents will grow from the current two billion or so to 13 billion within three years, an increase of 6.5 times. Deep Web growth should exceed this rate, perhaps increasing about nine-fold over the same period.

Slide 17: Deep Web Quality
- There is roughly a three-fold improvement in the likelihood of obtaining quality results from the Deep Web compared to the Surface Web.
- Overall precision and recall would be higher because highly relevant information is available for each subject area.
- The degree of content overlap between Deep Web sites is expected to be much less than between Surface Web sites.
- Observations from working with Deep Web sources and data suggest there are important information categories where duplication does exist. Prominent among these are yellow/white pages, genealogical records, and public records with commercial potential such as SEC filings. On the other hand, there are entire categories of Deep Web sites whose content appears uniquely valuable; these mostly fall within topical databases, publications, and internal site indices, which together account for about 80% of Deep Web sites.
- Overall, duplication will be lower within the Deep Web.

Slide 18: Finding the Deep Web
- General Web directories: www.completeplanet.com, www.thebighub.com
- Deep Web search engines that send a single query to dozens of databases simultaneously: www.alltheweb.com, www.brightplanet.com
- Specialized databases: www.nsdl.org, http://catalog.loc.gov
- Use Google and other search engines to locate searchable databases. Example queries for Google and Yahoo: "languages database" or "toxic chemicals database".

Slide 19: Deep Web Search Strategies
- Be aware that the Deep Web exists.
- Use a general search engine for broad topic searching.
- Use a searchable database for focused searches.
- Register on special sites and use their archives.
- Call the reference desk at a local college if you need a proprietary Web site; many college libraries subscribe to these services and provide free on-site searching.
- Many libraries offer free remote online access to commercial and research databases for anyone with a library card.

Slide 20: Deep Web Gateways – Web Directories
- Infomine [http://infomine.ucr.edu/] is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level.
- It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
- Infomine is librarian-built: librarians from the University of California, Wake Forest University, California State University, the University of Detroit Mercy, and other universities and colleges have contributed to building it.

Slide 21: Infomine Web Directory

Slide 22: Deep Web Gateways – Web Directories
- Digital Librarian [http://www.digital-librarian.com/] is a librarian's choice of the best of the Web.

Slide 23: Deep Web Gateways – Search Engines
- Turbo10 is a meta search engine that provides a universal interface to Deep Web search engines.
- Turbo10 is designed to help users search deeper and browse faster.
- Turbo10 has been developing search technology since 2001; it connects Internet searchers to Deep Web search engines. The fan-out idea behind such a meta search is sketched below.
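The meta-search pattern can be illustrated as a concurrent fan-out: one user query is sent to many Deep Web search endpoints at once and the answers are merged. The endpoint URLs and the "q" parameter below are hypothetical, not Turbo10's actual API; this is a sketch of the pattern, not the product.

# Sketch of the meta-search fan-out used by engines like Turbo10: one user
# query sent concurrently to many Deep Web search endpoints, answers merged.
# The endpoint URLs and the "q" parameter are hypothetical assumptions.
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINTS = [                           # hypothetical database front ends
    "http://db-one.example.org/search",
    "http://db-two.example.org/search",
    "http://db-three.example.org/search",
]

def query_endpoint(endpoint, query):
    url = endpoint + "?" + urllib.parse.urlencode({"q": query})
    try:
        return endpoint, urllib.request.urlopen(url, timeout=5).read()
    except Exception:
        return endpoint, None           # a slow or failing source is simply dropped

def meta_search(query):
    # Fan the single query out to every endpoint in parallel, then merge.
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        results = pool.map(lambda e: query_endpoint(e, query), ENDPOINTS)
    return {endpoint: body for endpoint, body in results if body is not None}

# Usage (illustrative): meta_search("toxic chemicals database")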

Slide 24: Turbo10 Deep Web Search Engine

Slide 25: Deep Web Gateways – Search Engines
- AlltheWeb [http://www.alltheweb.com/] combines one of the largest and freshest indices with powerful search features that help users find content quickly.
- AlltheWeb's index (provided by Yahoo!) includes billions of Web pages, as well as tens of millions of PDF and MS Word files. Yahoo! frequently scans the entire Web to keep the content fresh and to eliminate broken links.
- AlltheWeb offers a variety of specialized search tools and advanced search features, and supports searching in 36 different languages.
- Its image, audio, and video searches include hundreds of millions of multimedia files.
- AlltheWeb provides the controls necessary to find the most relevant content through some of the most sophisticated advanced search features available.

Slide 26: AlltheWeb – Deep Web Search Engine

Slide 27: Deep Web Gateways – Specialized Databases
- NSDL (National Science Digital Library, http://nsdl.org/) was established as an online library that directs users to exemplary resources for science, technology, engineering, and mathematics (STEM) education and research.
- NSDL provides an organized point of access to STEM content aggregated from a variety of other digital libraries, NSF-funded projects, and NSDL-reviewed web sites.
- NSDL also provides access to services and tools that enhance the use of this content in a variety of contexts.

Slide 28: NSDL – Specialized Database

Slide 29: Other Notable Deep Web Resources
- Deep Query Manager (DQM), BrightPlanet's search tool designed to retrieve information from thousands of Deep Web databases and search engines at one time.
- AlphaSearch, an extremely useful directory of "gateway" sites that collect and organize Web sites focused on a particular subject.
- The many databases that make up GPO Access: http://www.access.gpo.gov/
- Telephone directory databases such as AnyWho: http://www.anywho.com/

Slide 30: Deep Web Issues
- Complete indexing of the Deep Web is impossible.
- Deep Web content is dynamic and can change faster than content on the static/Surface Web.
- There is no bright line that separates content sources on the Web; users need to choose the database (Deep Web resource) of interest on their own.
- The Deep Web phenomenon is not well known to the Internet-searching public.
- The value of Deep Web content is incalculable.

Slide 31: Summary
The World Wide Web divides into two parts:
- Visible/Surface Web
  - Search directories (examples: Librarians Index to the Internet, Yahoo)
  - Search engines (examples: Google, Yahoo, AltaVista)
- Invisible/Deep Web
  - Specialized, searchable databases, both free and fee-based (examples: library catalogs, digital library archives, dictionaries, encyclopedias, article databases)

Slide 32: Summary
- Deep Web content is highly relevant to every information need, market, and domain.
- The Deep Web is the fastest-growing category of new information on the Internet.
- Serious information seekers can no longer avoid the importance or quality of Deep Web information.
- Deep Web information is only a component of the total information available; searching must evolve to encompass the complete Web.
- Directed query technology is the only means to integrate Deep and Surface Web information.

Slide 33: Summary (contd.)
- Specific vertical market services are already evolving to partially address the Deep Web challenges. These will likely need to be supplemented with a persistent query system, customizable by the user, that sets the queries, search sites, filters, and schedules for repeated queries.
- Use search directories that offer hand-picked information chosen from the Surface Web to meet popular search needs.
- Use search engines for more robust surface-level searches, and content-aggregation vertical "infohubs" for Deep Web information, to provide answers where comprehensiveness and quality are imperative.

Slide 34: References
1. Wikipedia, "Deep Web", 24 April 2007. http://en.wikipedia.org/wiki/Hidden_web
2. Wendy Boswell, "The Invisible Web", 21 April 2007. http://websearch.about.com/od/invisibleweb/a/invisible_web.htm
3. Chris Sherman, "The Invisible Web", 20 April 2007. http://www.freepint.co.uk/issues/080600.htm#feature
4. Joe Barker, "Invisible or Deep Web", 9 March 2007. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
5. Michael K. Bergman, "The Deep Web: Surfacing Hidden Value", 24 September 2001. http://www.press.umich.edu/jep/07-01/bergman.html
6. Laura Cohen, "The Deep Web", 22 November 2006. http://www.internettutorials.net/deepweb.html
7. Marcus P. Zillman, "Deep Web Research", 23 April 2007. http://deepwebresearch.blogspot.com/
8. Paul Bruemmer, "Indexing Deep Web Content", 27 March 2002. http://www.searchengineguide.com/wi/2002/0327_wi2.html

Slide 35: References (contd.)
9. Danny Sullivan, "Invisible Web Gets Deeper", 2 August 2000. http://searchenginewatch.com/showPage.html?page=2162871
10. Chris Sherman, "Search for the Invisible Web", 6 September 2001. http://technology.guardian.co.uk/online/story/0,3605,547140,00.html
11. Greg Linden, "Deep Web Strategy", March 2007. http://www.semantic-web.at/10.57.1089.press.greg-linden-on-google-s-deep-web-strategy.htm
12. Alex Wright, "In Search of the Deep Web", 9 March 2004. http://archive.salon.com/tech/feature/2004/03/09/deep_web/index_np.html
13. Danny Sullivan, "'Invisible Web' Revealed", 11 June 1999. http://searchenginewatch.com/showPage.html?page=2167321
14. Michael Cross, "The Hidden Potential of the Web", 21 April 2004. http://society.guardian.co.uk/e-public/story/0,13927,1195901,00.html

Slide 36: Thank You!
Manoj Ravuru (mravuru@mail.smu.edu)

