1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.

Presentation transcript:

1 Web Search Introduction

2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet. Combined idea of documents available by FTP with the idea of hypertext to link documents. Developed initial HTTP network protocol, URLs, HTML, and first “web server.”

3 Pre-Web History Ted Nelson developed the idea of hypertext. Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960’s at SRI. ARPANET was developed in the late 1960’s and early 1970’s. The basic technology was in place in the 1970’s, but it took the PC revolution and widespread networking to inspire the web and make it practical.

4 Web Browser History Early browsers were developed in 1992 (Erwise, ViolaWWW). In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed the Mosaic browser and distributed it widely. Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC). Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.

5 Search Engine Early History By the late 1980’s many files were available by anonymous FTP. In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”):
– Assembled lists of files available on many FTP servers.
– Allowed regex search of these file names.
In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

6 Web Search History In 1993, early web robots (spiders) were built to collect URL’s:
– Wanderer
– ALIWEB (Archie-Like Index of the WEB)
– WWW Worm (indexed URL’s and titles for regex search)
In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

7 Web Search History (cont) In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (eventually became part of Excite and AOL). A few months later, Michael “Fuzzy” Mauldin at CMU developed Lycos. It was the first to use a standard IR system, as developed for the DARPA Tipster project, and the first to index a large set of pages. In late 1995, DEC developed AltaVista. It used a large farm of Alpha machines to quickly process large numbers of queries, and supported boolean operators, phrases, and “reverse pointer” queries.

8 Web Search Recent History In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. Main advance is use of link analysis to rank results partially based on authority.
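To make "link analysis" concrete, here is a minimal sketch of a PageRank-style score computed by power iteration. The slide does not say which algorithm is meant, so this is one illustrative form of link-based authority, not a description of any production ranker. The function name `link_scores`, the toy graph, and the damping value 0.85 are all assumptions chosen for the example.

```python
from typing import Dict, List


def link_scores(links: Dict[str, List[str]],
                damping: float = 0.85,
                iterations: int = 50) -> Dict[str, float]:
    """Iteratively score pages so that links from high-scoring pages count more."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline score each round.
        new_scores = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its out-links.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no out-links spreads its score over all pages.
                share = damping * scores[page] / n
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(toy_web)
best = max(scores, key=scores.get)
print(best)  # prints: c
```

Page "c" scores highest because it has the most in-links, and "a" comes second because its single in-link comes from the highest-scoring page.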

9 Web Challenges for IR
– Distributed Data: Documents spread over millions of different web servers.
– Volatile Data: Many documents change or disappear rapidly (e.g. dead links).
– Large Volume: Billions of separate documents.
– Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents.
– Quality of Data: No editorial control, false information, poor quality writing, typos, etc.
– Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.

10 Number of Web Servers

11 Number of Web Pages

12 Number of Web Pages Indexed Assuming about 20KB per page, 1 billion pages is about 20 terabytes of data. SearchEngineWatch, Aug. 15, 2001
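The size estimate above is simple arithmetic, sketched here so the assumptions are explicit. The 20KB average page size comes from the slide; the use of decimal units (1 KB = 1,000 bytes) and the helper name `corpus_size_tb` are assumptions made for this example.

```python
# Decimal units are assumed: 1 KB = 1,000 bytes, 1 TB = 10**12 bytes.
BYTES_PER_KB = 1_000
BYTES_PER_TB = 1_000_000_000_000


def corpus_size_tb(num_pages: int, avg_page_kb: float = 20.0) -> float:
    """Return the raw size of num_pages pages in terabytes."""
    total_bytes = num_pages * avg_page_kb * BYTES_PER_KB
    return total_bytes / BYTES_PER_TB


print(corpus_size_tb(1_000_000_000))  # prints: 20.0
```

This counts raw page bytes only; any index built over the pages would add to the total.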

13 Growth of Web Pages Indexed Google lists the current number of pages searched. SearchEngineWatch, Jan. 28, 2005

14 Some Recent Web Statistics
– As of January 2006, there are an estimated 440 million hosts on the Internet.
– As of August 2006, there are an estimated 96 million Web servers on the Internet.
– As of September 2005: yahoo.com, 20 billion items; google.com, 8.1 billion web pages; search.msn.com, 5 billion web pages; alltheweb.com, over 3 billion web pages (August 2003).

15 Graph Structure in the Web

16 Zipf’s Law on the Web Number of in-links/out-links to/from a page has a Zipfian distribution. Length of web pages has a Zipfian distribution. Number of hits to a web page has a Zipfian distribution.

17 Zipf’s Law An empirical rule that describes the relation between an item's frequency rank and its frequency of appearance. Example -- text words: the i-th most frequent word appears as many times as the most frequent one divided by i^θ, for some θ ≥ 1. The same can be applied to the in-links/out-links of a web page, the length of a web page, and the number of hits to a web page, among others.
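A short sketch of the rule for text words: given the count of the most frequent word, the expected count at rank i is that count divided by i raised to an exponent. The exponent is called `theta` here and defaults to 1.0; the function name, parameter names, and sample numbers are assumptions for illustration only.

```python
from typing import List


def zipf_expected_counts(top_count: float, num_ranks: int,
                         theta: float = 1.0) -> List[float]:
    """Expected count at each rank i = 1..num_ranks: top_count / i**theta."""
    return [top_count / (i ** theta) for i in range(1, num_ranks + 1)]


# Hypothetical corpus in which the most frequent word occurs 1000 times.
counts = zipf_expected_counts(top_count=1000, num_ranks=5)
for rank, count in enumerate(counts, start=1):
    print(rank, round(count, 1))
```

With `theta` at 1.0, the second-ranked word is expected about half as often as the first, and the fourth about a quarter as often. Larger values of `theta` make the drop-off steeper.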

18 Manual Hierarchical Web Taxonomies
– Yahoo approach of using human editors to assemble a large hierarchically structured directory of web pages.
– Open Directory Project is a similar approach based on the distributed labor of volunteer editors (“net-citizens provide the collective brain”). Used by most other search engines. Started by Netscape.

19 Automatic Document Classification Manual classification into a given hierarchy is labor intensive, subjective, and error-prone. Text categorization methods provide a way to automatically classify documents. Best methods based on training a machine learning (pattern recognition) system on a labeled set of examples (supervised learning). Text categorization is a topic we will discuss later in the course.
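As a rough illustration of supervised text categorization, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a handful of labeled examples. The slide does not name a specific method, so Naive Bayes is an assumed stand-in, and the function names, labels, and training texts are all made up for the example.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_naive_bayes(examples: List[Tuple[str, str]]):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(text: str, model) -> str:
    """Return the class with the highest log-probability for the text."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training set.
training = [
    ("cheap flights and hotel deals", "travel"),
    ("book a hotel near the beach", "travel"),
    ("python tutorial for beginners", "programming"),
    ("debugging a python program", "programming"),
]
model = train_naive_bayes(training)
print(classify("python program tutorial", model))  # prints: programming
```

A real system would need far more training data, along with proper tokenization and evaluation on held-out examples.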

20 Automatic Document Hierarchies Manual hierarchy development is labor intensive, subjective, and error-prone. It would be nice to automatically construct a meaningful hierarchical taxonomy from a corpus of documents. This is possible with hierarchical text clustering (unsupervised learning). –Hierarchical Agglomerative Clustering (HAC) Text clustering is another topic we will discuss later in the course.
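The idea behind HAC can be sketched in a few lines: start with each document in its own cluster, then repeatedly merge the two most similar clusters until one tree remains. This version assumes single-link merging and Jaccard similarity over word sets. Both are choices made for the example rather than anything the slide specifies, and the function names and documents are hypothetical.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hac(docs: List[str]):
    """Return a nested tuple of document indices describing the merge tree."""
    if not docs:
        raise ValueError("need at least one document")
    word_sets = [set(d.lower().split()) for d in docs]
    # Each cluster is (subtree, list of member document indices).
    clusters = [(i, [i]) for i in range(len(docs))]
    while len(clusters) > 1:
        best = None  # (similarity, position of cluster x, position of cluster y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(word_sets[i], word_sets[j])
                          for i in clusters[x][1] for j in clusters[y][1])
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        _, x, y = best
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


docs = [
    "web search engine ranking",
    "search engine index",
    "hierarchical text clustering",
    "text clustering methods",
]
tree = hac(docs)
print(tree)  # prints: ((2, 3), (0, 1))
```

The two clustering documents merge first, then the two search documents, and the final merge joins the two topics at the root. This naive loop recomputes every pairwise similarity on each pass, so it only suits tiny collections.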