
1 World Wide Web
  Hypertext documents
    Text
    Links
  Web
    billions of documents
    authored by millions of diverse people
    edited by no one in particular
    distributed over millions of computers, connected by a variety of media

2 Mining the Web (Chakrabarti and Ramakrishnan): History of Hypertext
  Citation, hyperlinking
  Ramayana, Mahabharata, Talmud
    branching, non-linear discourse, nested commentary
  Dictionary, encyclopedia
    self-contained networks of textual nodes
    joined by referential links

3 Hypertext systems
  Memex [Vannevar Bush]
    stands for "memory extension"
    photoelectrical-mechanical storage and computing device
    Aim: to create and help follow hyperlinks across documents
  Hypertext
    Coined by Ted Nelson
    Xanadu hypertext: system with robust two-way hyperlinks, version management, controversy management, annotation and copyright management

4 World-wide Web
  Initiated at CERN (the European Organization for Nuclear Research)
    By Tim Berners-Lee
  GUIs
    Berners-Lee (1990)
    Erwise and Viola (1992), Midas (1993)
    Mosaic (1993)
      a hypertext GUI for the X Window System
  HTML: markup language for rendering hypertext
  HTTP: Hypertext Transfer Protocol, for sending HTML and other data over the Internet
  CERN HTTPD: server of hypertext documents

5 The early days of the Web: CERN HTTP traffic grows by a factor of 1000 between 1991 and 1994 (image courtesy W3C)

6 The early days of the Web: The number of servers grows from a few hundred to a million between 1991 and 1997 (image courtesy Nielsen)

7 1994: the landmark year
  Foundation of the "Mosaic Communications Corporation"
  First World-wide Web conference
  MIT and CERN agreed to set up the World-wide Web Consortium (W3C)

8 Web: A populist, participatory medium
  number of writers ≈ number of readers
  the evolution of MEMES
    ideas, theories etc. that spread from person to person by imitation
    Now they have constructed the Internet!
    E.g.: "Free speech online", chain letters, and email viruses

9 Abundance and authority crisis
  liberal and informal culture of content generation and dissemination
  Very little uniform civil code
  redundancy and non-standard form and content
  millions of qualifying pages for most broad queries
    Example: java or kayaking
  no authoritative information about the reliability of a site

10 Problems due to uniform accessibility
  little support for adapting to the background of specific users
  commercial interests routinely influence the operation of Web search
    "Search Engine Optimization"!

11 Hypertext data
  Semi-structured or unstructured
  No schema
  Large number of attributes
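
One way to picture "no schema" and "large number of attributes" is the bag-of-words view, where every distinct term becomes an attribute. The sketch below is only an illustration under that assumption; the tokenizer, the `term_counts` helper, and the two sample documents are hypothetical and do not come from the slides.

```python
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on non-letters, and count each term."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical sample documents; no schema is declared anywhere.
docs = {
    "d1": "Java is an island and a programming language",
    "d2": "Kayaking on the river",
}

# Each document becomes a sparse vector: only the terms it contains are stored.
vectors = {doc_id: term_counts(text) for doc_id, text in docs.items()}

# The attribute set is the union of all terms seen, so it grows with the corpus.
vocabulary = sorted(set().union(*vectors.values()))

print(len(vocabulary), "attributes for", len(docs), "documents")
```

Even two short sentences produce a dozen attributes, and most entries of each vector are zero, which is why sparse representations are the usual choice for text.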

12 Crawling and indexing
  Purpose of crawling and indexing
    quick fetching of a large number of Web pages into a local repository
    indexing based on keywords
    ordering responses to maximize the chances that the first few responses satisfy the user's information need
  Earliest search engine: Lycos (Jan 1994)
  Followed by:
    Alta Vista (1995), HotBot and Inktomi, Excite
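
The keyword-indexing and response-ordering steps above can be sketched with a toy inverted index. This is a minimal illustration, not how any of the named engines worked: the pages are an in-memory dictionary standing in for a crawled repository, the URLs are made up, and ranking by raw term count is an assumption chosen for brevity.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index


def search(index, query):
    """Return URLs ordered by total query-term count, highest first."""
    scores = Counter()
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Ties are broken alphabetically so the ordering is deterministic.
    return [url for url, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical local repository, as if already fetched by a crawler.
pages = {
    "a.example/java": "java java coffee",
    "b.example/kayak": "kayaking river java",
}
index = build_index(pages)
results = search(index, "java")
print(results)
```

A real system would replace the raw count with a better relevance score, but the structure (fetch, index by keyword, order the responses) stays the same.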

13 Topic directories
  Yahoo! directory
    to locate useful Web sites
  Efforts for organizing knowledge into ontologies
    Centralized: Yahoo!
    Decentralized: About.COM and the Open Directory

14 Clustering and classification
  Clustering
    discover groups in the set of documents such that documents within a group are more similar than documents across groups
    Subjective disagreements due to
      different similarity measures
      large feature sets
  Classification
    For assisting human efforts in maintaining taxonomies
    E.g.: IBM's Lotus Notes text processing system and Universal Database text extenders
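
The point about different similarity measures can be made concrete with a small comparison. The sketch below assumes bag-of-words documents and picks cosine and Jaccard similarity as two common measures; neither is named in the slides, and the three sample documents are invented so that the two measures disagree about which document is closer to `x`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard similarity of the two term sets (counts are ignored)."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


# Hypothetical documents as term-count vectors.
x = Counter("java java java coffee".split())
y = Counter("java island".split())
z = Counter("java coffee tea milk".split())

print("cosine :", cosine(x, y), cosine(x, z))
print("jaccard:", jaccard(x, y), jaccard(x, z))
```

Cosine weights the repeated term heavily and ranks `y` closer to `x`, while Jaccard only looks at shared vocabulary and ranks `z` closer, so a clustering built on one measure can group the documents differently from one built on the other.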

15 Hyperlink analysis
  Take advantage of the structure of the Web graph
  Indicators of prestige of a page (e.g. citations)
    HITS and PageRank
  Bibliometry
    bibliographic citation graph of academic papers
  Topic distillation
    Adapting to idioms of Web authorship and linking styles
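
As a rough illustration of link-based prestige, here is a simplified power-iteration sketch in the spirit of PageRank. It is a textbook-style simplification rather than the production algorithm: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the three-page graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share (the random-jump term).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical Web graph: a and b both cite c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page `c` ends up with the highest score because two pages link to it, and `a` outranks `b` because it is endorsed by the prestigious `c`, which mirrors how citations confer prestige in a bibliographic graph.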

16 Resource discovery and vertical portals
  Federations of crawling and search services
    each specializing in specific topical areas
  Goal-driven Web resource discovery
    language analysis does not scale to billions of documents
    counter by throwing more hardware at the problem

17 Structured vs. Web data mining
  Traditional data mining
    data is structured and relational
    well-defined tables, columns, rows, keys, and constraints
  Web data
    readily available data rich in features and patterns
    spontaneous formation and evolution of
      topic-induced graph clusters
      hyperlink-induced communities
  Goal of book: discovering patterns which are spontaneously driven by semantics,

