Seek and Ye shall Find COS 116: 2/21/2008 Sanjeev Arora The continuum of computer “intelligence”



Recap: Binary Representation. Powers of 2: 2^10 = 1024 ≈ 10^3. Fact: Every positive integer can be uniquely represented as a sum of distinct powers of 2. Ex: 25 = 16 + 8 + 1 = 1 x 2^4 + 1 x 2^3 + 0 x 2^2 + 0 x 2^1 + 1 x 2^0, so [25]_2 = 11001.
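The recap's example (25 written in binary as 11001) can be checked with a short sketch. The function name `powers_of_two` is made up for illustration and is not from the lecture.

```python
def powers_of_two(n: int) -> list[int]:
    """Return the distinct powers of 2 that sum to n, largest first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    powers = []
    bit = 0
    while n > 0:
        if n & 1:                 # lowest binary digit is 1
            powers.append(1 << bit)
        n >>= 1                   # drop the lowest binary digit
        bit += 1
    return powers[::-1]


print(powers_of_two(25))   # [16, 8, 1]
print(format(25, "b"))     # 11001
```

Each 1 in the binary string marks a power of 2 that is used, and each 0 marks one that is skipped.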

Misconceptions about Computers: “Just a calculator on steroids.” “Just maintains large amount of data.” “Just does what the programmer tells it.” Yes, but … Examples: Weather Forecast, Airline Reservation System.

Various meanings of “Search”: Look up “Shirley Tilghman” in online phonebook. In consumer database, find “credit-worthy” consumers. Find web pages relevant to “computer music.” Among all cell phone conversations originating in Country X, identify suspicious ones. Search all religion and philosophy books of the world for meaning of life. These range from “Web Search” to “Data Mining.”

These are major scientific problems with many components: Engineering; Algorithms; Statistical Modeling; Ethics, Policy, Society; Linguistics.

Discussion Time: How do you solve this task? Given a sorted array of n numbers, find if it contains 58780. Binary search! First thing to check: “Is A[n/2] < 58780?” (Whatever the answer, you halve the range.) Question: What if the array of numbers is not sorted??
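A minimal sketch of the binary search described on this slide, assuming the array is a Python list sorted in increasing order. The function name `contains` and the sample numbers are illustrative only.

```python
def contains(sorted_nums: list[int], target: int) -> bool:
    """Binary search: halve the candidate range on every comparison."""
    lo, hi = 0, len(sorted_nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_nums[mid] == target:
            return True
        if sorted_nums[mid] < target:
            lo = mid + 1          # target can only be in the upper half
        else:
            hi = mid - 1          # target can only be in the lower half
    return False


numbers = [12, 407, 3321, 58780, 60001, 99999]   # made-up sorted data
print(contains(numbers, 58780))   # True
print(contains(numbers, 58781))   # False
```

Because the range halves each time, about log2(n) comparisons suffice. An unsorted array offers no such shortcut, so in general every element may have to be inspected.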

Looking up “Shirley Tilghman” in Electronic Phonebook: Ideas?? ASCII: agreed-upon convention for representing letters with numbers. Example: Tilghman, … Sorted phonebook = sorted array of numbers. Use binary search (prev. slide).
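To make the phonebook idea concrete, here is a sketch that shows the ASCII codes behind a name and then looks the name up with binary search from Python's standard `bisect` module. The phonebook entries and phone numbers are invented placeholders, and `ascii_codes` and `lookup` are hypothetical helper names.

```python
from bisect import bisect_left


def ascii_codes(text: str) -> list[int]:
    """Return the number that represents each character of text."""
    return [ord(ch) for ch in text]


# Invented entries; sorting strings compares their character codes in order.
phonebook = sorted([
    ("Tilghman, Shirley", "555-0101"),
    ("Arora, Sanjeev", "555-0102"),
    ("Finkelstein, Adam", "555-0103"),
])
names = [name for name, _ in phonebook]


def lookup(name: str):
    """Binary-search the sorted names; return the number or None."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return phonebook[i][1]
    return None


print(ascii_codes("Til"))            # [84, 105, 108]
print(lookup("Tilghman, Shirley"))   # 555-0101
```

Since names reduce to sequences of numbers that can be compared, a sorted phonebook behaves like the sorted array of the previous slide.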

Rest of the lecture: Web Search

Future lecture: Internet (physical infrastructure underlying Web) Routers, gateways, DNS,... (any computer can send a msg to any other)

What is the World Wide Web? Files residing on “servers” that are connected to the Internet. Example: a file “index.html” in the “public_html” directory on some server belonging to PU. URL (uniform resource locator): basically an “address.” “Hyperlinks”: URLs of other files; these could be on another server.

Logical Structure of the Web: a “directed graph.” Nodes = web pages; “edges” = link from one node to another. Important: This logical structure is created by independent actions of 100s of millions of users.

1st step for search engines: create snapshot of the web.
Webcrawler: “browser on autopilot”
- Maintains array of web pages it has seen
- 2 types of pages: “visited”, “fully explored”
- Do forever {
    Pick any webpage marked “visited” from array.
    Mark it “fully explored.”
    Open all its linked pages in browser.
    Save them in array and mark them “visited.”
  }
Better: save and mark only the linked pages that are not “fully explored” yet, so no page is processed twice.
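The crawler loop above can be sketched on a toy in-memory web. This is only an illustration under stated assumptions: the dictionary `toy_web` stands in for real pages and their hyperlinks, no network access happens, and the function name `crawl` is made up.

```python
from collections import deque

# Hypothetical miniature web: page -> pages it links to.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "a.html"],
    "c.html": ["d.html"],
    "d.html": [],
}


def crawl(start: str, links: dict[str, list[str]]) -> list[str]:
    """Explore every page reachable from start, each exactly once."""
    visited = deque([start])      # seen, but links not yet followed
    seen = {start}                # everything ever placed in the array
    fully_explored = []
    while visited:
        page = visited.popleft()
        fully_explored.append(page)
        for linked in links.get(page, []):
            if linked not in seen:        # skip pages already recorded
                seen.add(linked)
                visited.append(linked)
    return fully_explored


print(crawl("a.html", toy_web))   # ['a.html', 'b.html', 'c.html', 'd.html']
```

Unlike the slide's "do forever", this sketch stops once no "visited" pages remain. A real crawler would keep running because the web keeps changing.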

First Web Crawler From: (Brian Pinkerton) Newsgroups: comp.infosystems.announce Subject: The WebCrawler Index: A content-based Web index Date: 11 June :33:42 GMT Organization: University of Washington The WebCrawler Index is now available for searching! The index is broad: it contains information from as many different servers as possible. It's a great tool for locating several different starting points for exploring by hand. The current index is based on the contents of documents located on nearly 4000 servers, world-wide. Check it out at: Other information is available from there, including a description of the WebCrawler (the robot itself), and a list of the 25 most frequently referenced sites on the Web. Brian Pinkerton Dept of Computer Science and Engineering University of Washington

Still Feasible Today? About 15 billion web pages today (could be off by 2x). Say 10 KB (10,000 bytes) of data per page: 15 x 10^13 bytes to store the web ≈ 150,000 GB ≈ 500 hard disks ≈ $50,000 in ’07.
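The back-of-the-envelope estimate can be reproduced step by step. The inputs are the slide's own figures; the per-disk capacity and price are not stated on the slide and are only what its totals imply.

```python
pages = 15 * 10**9            # about 15 billion web pages
bytes_per_page = 10_000       # about 10 KB per page

total_bytes = pages * bytes_per_page
total_gb = total_bytes // 10**9          # 1 GB taken as 10^9 bytes

disks = 500                   # slide's disk count
total_cost = 50_000           # slide's dollar estimate

print(total_bytes)            # 150000000000000, i.e. 15 x 10^13 bytes
print(total_gb)               # 150000
print(total_gb / disks)       # 300.0 GB per disk (implied)
print(total_cost / disks)     # 100.0 dollars per disk (implied)
```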

Searching for “computer music” Ideas? Identify all pages that contain “computer music”. Sort according to number of occurrences of “computer music” in the page. Human staff computes answers to all possible questions.
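The first idea (find the pages containing the phrase, then sort by number of occurrences) might look like the sketch below. The page addresses and texts are invented, and `rank_by_occurrences` is a hypothetical name.

```python
def rank_by_occurrences(pages: dict[str, str], phrase: str) -> list[tuple[str, int]]:
    """Return (page, count) pairs for pages containing phrase, most hits first."""
    phrase = phrase.lower()
    hits = []
    for url, text in pages.items():
        count = text.lower().count(phrase)
        if count > 0:
            hits.append((url, count))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Invented snapshot of four pages.
snapshot = {
    "a.example": "Computer music research and computer music concerts.",
    "b.example": "A short history of computer music.",
    "c.example": "Jaguar cars for sale.",
    "spam.example": "computer music " * 50,
}

for url, count in rank_by_occurrences(snapshot, "computer music"):
    print(url, count)
```

In this toy snapshot the page that merely repeats the phrase 50 times comes out on top, which shows why raw occurrence counts make a fragile ranking signal.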

Some pitfalls “Spamming” by unscrupulous websites Synonymy (car, auto, vehicle …) Polysemy (jaguar: car or cat?)

Solution: IBM’s CLEVER (1996) and Google’s PAGERANK (1997). Both take advantage of the link structure of the web: a web link confers “approval.”

CLEVER: Typically hubs point to authorities, and authorities are pointed to by hubs. Hubs: clearinghouses of information (e.g. “My favorite computer music links”). Authorities: sites that are viewed “with respect” by many (e.g. New York Times, International Computer Music Association). Circular Definition? Circular Definition – see Definition, Circular

Breaking Circularity: Iterative algorithm. Start with: pages containing “Computer music,” plus all pages they point to. At every step each page has a “Hub Score” and an “Authority Score”; initially all are 1.

Score Calculation: Do forever { Next Hub Score for page = sum of current Authority Scores of pages that it links to. Next Authority Score for page = sum of current Hub Scores of pages that link to it. } Fact: The scores converge once they are rescaled at each step. (Proof uses Linear Algebra, Eigenvalues)
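A minimal sketch of the iterative score calculation, assuming the standard hubs-and-authorities formulation: a page's hub score sums the authority scores of the pages it links to, and its authority score sums the hub scores of the pages linking to it. Rescaling the scores to sum to 1 after each step is an added assumption so that the numbers stay bounded. The link graph and the name `hub_authority_scores` are invented.

```python
def hub_authority_scores(links: dict[str, list[str]], iterations: int = 50):
    """Iterate hub/authority updates on a link graph; return (hub, auth) dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}      # initially all 1
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Both updates use the *current* scores, as on the slide.
        new_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]

        # Rescale so each set of scores sums to 1 (assumed normalization).
        hub_total = sum(new_hub.values()) or 1.0
        auth_total = sum(new_auth.values()) or 1.0
        hub = {p: s / hub_total for p, s in new_hub.items()}
        auth = {p: s / auth_total for p, s in new_auth.items()}
    return hub, auth


# Invented graph: two link-list pages pointing at two content sites.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hub_authority_scores(toy_links)
print(sorted(auth, key=auth.get, reverse=True)[:2])   # ['a1', 'a2']
print(max(hub, key=hub.get))                          # h1
```

In this toy graph the site linked from both lists gets the highest authority score, and the list that links to more authorities gets the highest hub score.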

Computer models and jurisprudence Aug 25th 2005 [Fowler and Jeon, ’05]

- By-product of the CLEVER algorithm: it reveals clusters. Example: for “Abortion,” Pro-Choice and Pro-Life clusters. - Data Mining: process of finding answers that are not in the data and must be inferred. Example: “How is a person who shops at Whole Foods & REI likely to vote?”

Concerns From users: - Privacy From Computer scientists: - Formalize privacy - How to safeguard privacy while allowing legitimate computations

“Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences” (top prize: $1M)

Trends in web search: Algorithms to “guess” what the user generating the query had in mind (using AI, Psychology, User History, News tracking). Seamless integration with e-commerce, and click-based revenue harvesting (interesting meeting point of economics and computer science). “Semantic web”: allow users to attach “meaning” to web-based documents, so that search engines can make sense of them.

Shape of things to come: [ ]

Next Time… Digital Audio / Music