Search and Discovery: Searching the Web

Presentation transcript:

Search and Discovery: Searching the Web

Stages of a transaction
– Discovery
  – Find what you’re interested in
    – Locate sellers
    – Locate buyers
    – Compare products
– Negotiation
– Exchange

Discovery
– Encompasses:
  – Search engines
  – Recommender systems
  – Price comparison/shopping agents
  – Description languages
  – Data sources
    – Generic sources: portals, web directories
    – Domain-specific sources: catalogs, guides, etc.
  – Advertising

Discovery
– More than just finding a resource
  – Need to be able to estimate value, likelihood of successful negotiation
  – An evaluative infrastructure is required
– Least formalized of e-commerce subareas.
– Unlikely to have a general-purpose solution soon
  – Too complex

A Brief History of the Web
– Prehistory:
  – Hypertext as an idea has been around since the 40s.
    – Vannevar Bush: Memex
    – Engelbart: 60s
  – 1987: HyperCard
    – Graphical tool allowing users to create hyperlinked documents.
  – Late 80s/early 90s: WAIS, Gopher

A Brief History of the Web
– 1989/90: Tim Berners-Lee proposes the WWW at CERN
  – A new global information retrieval system
  – Develops HTML, a simple markup language
– 1993: Mosaic developed at NCSA
  – Marc Andreessen then founds Netscape
– 1993/94: NCSA httpd released
  – Open-source web server, supported CGI
  – Precursor to Apache

A Brief History of the Web
– 1994: Banner ads appear on HotWired
  – Beginning of the commercial web
– 1994: Yahoo founded
  – Appearance of the portal, search engine
– 1995: NSF backbone privatized
  – AT&T, Sprint, etc. take over traffic
  – Network Solutions given a monopoly on domain names
– 1995: Microsoft releases Internet Explorer
  – In 7 years, Netscape goes from 100% market share to 20% (2001).

A Brief History of the Web
– 1995: AltaVista started
  – Full-text Web search
– 1995: Andreessen first WWW billionaire
– 1995: Sun introduces Java
  – Able to ship code and text across networks
– 1995: eBay founded
  – First online auction
– : Explosive growth
  – Many new formats, applications, companies
– 1998: Akamai founded (web caching)

A Brief History of the Web
– 1998: ICANN governs names & addresses
– 1998: MP3 format popularized
  – WinAmp released
  – Small enough to make audio distribution practical
– 1998: Google founded.
– 2000: Napster appears
  – Beginnings of peer-to-peer technology, file sharing
– 2000(ish): End of the boom
  – Consolidation, reduction in growth

Lessons from Radio
– Radio was popularized in the 1920s
  – Originally intended as a one-to-one messaging system.
  – Fee-for-use pay structure.
– 1922: Explosive growth begins
  – RCA’s revenues from sales of receivers doubled each year
  – Broadcast model becomes prevalent
  – Thousands of broadcasters emerge

Lessons From Radio
– : Transition
  – How to make money broadcasting?
    – Support sale of receivers
    – Goodwill (sponsors)
    – Public good: supported as a non-profit
    – Advertising
    – Tube tax/set tax (a la BBC)
  – By 1924, stations are failing as quickly as they start.

Lessons From Radio
– Affordable content driven by audience size
– “Rich-get-richer” for large stations
– 1926: RCA launches NBC
  – First nationwide broadcast
  – Creates the network system
    – National content, local broadcasting
  – Advertising the dominant revenue generator
– WWW questions:
  – Who will be NBC?
  – What will the revenue model be?
    – Advertising? Competition with TV, radio for this revenue.
    – Micropayments?
    – Subscriptions?
    – Content aggregation?

Searching the Web
– Web growth estimated at 1000% in late 90s.
– Can search engines keep up with this growth?
– How to deal with the dynamic nature of the web?
  – Page contents change
  – Pages appear, disappear, move
  – Link structure changes

Search Engines
– Most common form of discovery
– Crawl the web to collect pages
– Stored and indexed for easy retrieval
– Query languages simple
– Goals:
  – Fast retrieval (Google gets 150 million queries per day)
  – Accurate (no dead links)
  – Precise (pages match user’s needs)

Terminology
– Outward link
  – Object that a page links to
  – Outdegree: number of outward links
– Inward link
  – Pages that link to an object
  – Indegree: number of inward links
– Path
  – Series of outward links from A to B

The Web as a Directed Graph
– We can represent the web as a directed graph.
  – Sites are nodes
  – Links are edges.
– Outward link
  – Object that a page links to
– Inward link
  – Pages that link to an object
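A minimal sketch of the terminology above, assuming a toy web stored as a dictionary from each page to the pages it links to. The page names and helper functions are made up for illustration.

```python
from collections import deque

# Hypothetical toy web: each page maps to its outward links.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def outdegree(graph, page):
    """Number of outward links from a page."""
    return len(graph.get(page, []))

def indegree(graph, page):
    """Number of pages that link to a page."""
    return sum(page in targets for targets in graph.values())

def has_path(graph, start, goal):
    """Breadth-first search that follows outward links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == goal:
            return True
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Here D can reach B through C and A, but nothing links to D, so no path leads back to it.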

The Web as a Directed Graph

Adjacency Matrix
– We can also represent the Web as a very large adjacency matrix.
– The eigenvector of this matrix illustrates the clusteredness of the Web
  – Distribution of in-degree and out-degree
  – Connectedness
  – Some ranking algorithms (HITS) use this measure.
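As a small illustration, the adjacency matrix of a four-page toy web can be built with plain lists. The pages and links are assumptions chosen for the example; a real Web matrix would be far too large and sparse for this dense form.

```python
# Hypothetical toy web; matrix[i][j] == 1 when page i links to page j.
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
index = {page: i for i, page in enumerate(pages)}

matrix = [[0] * len(pages) for _ in pages]
for source, targets in links.items():
    for target in targets:
        matrix[index[source]][index[target]] = 1

# Row sums give out-degrees; column sums give in-degrees.
out_degrees = [sum(row) for row in matrix]
in_degrees = [sum(col) for col in zip(*matrix)]
```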

Web structure
– Web can be broken into four areas (Kleinberg/Lawrence)
  – Core: Path between any two pages
  – Upstream: Can reach the core, but no path from core.
  – Downstream: Can be reached from core, but cannot reach core.
  – Tendrils/islands: disconnected from the core.
– Areas (allegedly) have roughly equal size.
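One way to make the four areas concrete is to classify pages by reachability from a page known to be in the core. The sketch below assumes a toy graph and a hypothetical `classify` helper; it is not the method used in the cited study.

```python
def reachable(graph, start):
    """All pages reachable from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Same pages, with every link pointing the other way."""
    rev = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(source)
    return rev

def classify(graph, core_page):
    forward = reachable(graph, core_page)             # pages the core can reach
    backward = reachable(reverse(graph), core_page)   # pages that can reach the core
    pages = set(graph) | {t for ts in graph.values() for t in ts}
    areas = {}
    for page in pages:
        if page in forward and page in backward:
            areas[page] = "core"
        elif page in backward:
            areas[page] = "upstream"
        elif page in forward:
            areas[page] = "downstream"
        else:
            areas[page] = "tendril/island"
    return areas

# Hypothetical toy web.
web = {
    "a": ["b"], "b": ["c"], "c": ["a", "out"],
    "in": ["a"], "out": [], "island": ["island2"], "island2": [],
}
areas = classify(web, "a")
```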

Coverage
– Search engines claim they index a large fraction of the web.
– How to verify this?
  1. Run queries on many engines and compare number of hits.
    – May return irrelevant documents
    – Documents may no longer exist
    – Documents may have changed

Coverage
– NEC (1998): Estimate size of web, coverage for major search engines.
  – Query each engine, retrieve and compare all results (only exact matches).
– Coverage estimates:
  – HotBot: 57%, AltaVista: 46%
  – NorthernLight: 33%, Excite: 23%
  – Infoseek: 16%, Lycos: 4%

Estimating the size of the indexable web
– Overlap in coverage was used to estimate size.
– [Figure: two overlapping sets A and B, with overlap U]
– U/B serves as an estimate of A/N, where N is the size of the Web.
– 1998: Altavista/Hotbot estimate: 320 million pages.

Using size to refine coverage estimates (1997)
– This value can then be used to determine a coverage estimate for each engine.
– For each pair, solve for N.
– Assume real N is largest found.
– Updated:
  – HotBot: 34%, AltaVista: 28%
  – NorthernLight: 20%, Excite: 14%
  – Infoseek: 10%, Lycos: 3%
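The overlap estimate can be sketched as follows, assuming the relation U/B ≈ A/N rearranges to N ≈ A·B/U, where U is the number of pages two engines have in common. The three engines and their page sets are invented for the example.

```python
from itertools import combinations

def estimate_size(a_pages, b_pages):
    """Estimate N from two engines' page sets: N ~ |A| * |B| / |overlap|."""
    overlap = len(a_pages & b_pages)
    if overlap == 0:
        return None  # no shared pages, so this pair gives no estimate
    return len(a_pages) * len(b_pages) / overlap

# Hypothetical engines, each indexing a set of page ids.
engines = {
    "X": set(range(0, 60)),
    "Y": set(range(40, 80)),
    "Z": set(range(50, 100)),
}

# Solve for N with every pair, then assume the real N is the largest found.
estimates = [
    estimate_size(engines[p], engines[q]) for p, q in combinations(engines, 2)
]
n = max(e for e in estimates if e is not None)

# Coverage of each engine relative to the estimated Web size.
coverage = {name: len(pages) / n for name, pages in engines.items()}
```

Taking the largest estimate is the rule stated on the slide; the pair with the least overlap relative to index size drives the result.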

Updates (1999)
– Web growth ahead of indexing
  – No search engine covers more than 16% of the Web.
  – Union of all engines: ~50% coverage
  – Estimated size: 800 million pages
  – Search engines more likely to link to authorities
  – More likely to link to US, commercial sites.

Updates (12/2001)
– Self-reported number of pages indexed:
  – Google: 2 billion (3 billion+ today)
  – FAST (AllTheWeb.com): 625 million (claimed 2.1 billion in 2002)
  – Altavista: 550 million
  – Inktomi: 500 million
  – NorthernLight: 390 million

Indexing the web
– Spiders are used to crawl the web and collect pages.
  – A page is downloaded and its outward links are found.
  – Each outward link is then downloaded.
  – Exceptions:
    – Links from CGI interfaces
    – Robot Exclusion Standard
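The crawl loop described above can be sketched without touching the network. This example assumes an in-memory dictionary standing in for the Web, and a hard-coded list of excluded path prefixes standing in for CGI links and robot exclusion rules; a real spider would fetch pages over HTTP and read each site's robots.txt.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical in-memory "web": URL path -> HTML.
PAGES = {
    "/index": '<a href="/a">A</a> <a href="/private/x">X</a>',
    "/a": '<a href="/b">B</a> <a href="/cgi-bin/q?x=1">query</a>',
    "/b": '<a href="/index">home</a>',
    "/private/x": "secret",
    "/cgi-bin/q?x=1": "generated",
}
# Stand-in for the exceptions: CGI links and excluded paths.
EXCLUDED_PREFIXES = ("/private/", "/cgi-bin/")

def crawl(start, fetch=PAGES.get):
    seen = {start}
    queue = deque([start])
    collected = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link
        collected[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link.startswith(EXCLUDED_PREFIXES):
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected

crawled = crawl("/index")
```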

Indexing the Web
– “Stop words” stripped from page
– Forward index created
  – Bundles words
  – Maps each document to the words it contains.
– Can use TFIDF to only map “significant” keywords
  – Term Frequency * Inverse Document Frequency
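A rough sketch of stop-word stripping and TF-IDF scoring follows. It assumes one common variant (term frequency normalized by document length, times the log of inverse document frequency); the slides do not say which variant is meant, and the stop-word list and documents are invented.

```python
import math
from collections import Counter

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def tokenize(text):
    """Lower-case, split on whitespace, and strip stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    n_docs = len(docs)
    scores = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        scores[doc_id] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

docs = {
    "d1": "the web is a graph of pages",
    "d2": "search engines crawl the web",
    "d3": "pages link to pages",
}
scores = tf_idf(docs)
```

In `d1`, "graph" scores higher than "web" because "web" also appears in `d2`; keeping only the top-scoring words per document is one way to index "significant" keywords.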

Indexing the web
– An inverted index is created
  – Forward index sorted according to word
  – Maps keywords to URLs
– Some wrinkles:
  – Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding
  – Semantic similarity
    – Words with similar meanings share an index.
    – Issue: trading coverage (number of hits) for precision (how closely hits match request)
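Inverting a forward index might look like the sketch below. The URLs and word lists are placeholders, and only case folding is applied; stemming and semantic grouping are left out.

```python
from collections import defaultdict

# Hypothetical forward index: URL -> words found on that page.
forward_index = {
    "http://example.com/a": ["Web", "graph", "pages"],
    "http://example.com/b": ["search", "web"],
}

def invert(forward):
    """Build keyword -> sorted list of URLs, with keywords in sorted order."""
    inverted = defaultdict(set)
    for url, words in forward.items():
        for word in words:
            inverted[word.lower()].add(url)  # case folding
    return {word: sorted(urls) for word, urls in sorted(inverted.items())}

inverted_index = invert(forward_index)
```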

Indexing Issues
– Indexing techniques were designed for static collections
– How to deal with pages that change?
  – Periodic crawls, rebuild index.
  – Varied frequency crawls
  – Records need a way to be “purged”
  – Hash of page stored
– Can use the text of a link to a page to help label that page.
  – Helps eliminate the addition of spurious keywords.
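Storing a hash of each page gives a cheap way to tell whether a page changed between crawls. The sketch below assumes SHA-256 and an in-memory dictionary; the slides do not specify a hash function or storage scheme.

```python
import hashlib

def page_hash(content):
    """Hex digest of the page text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

# Hypothetical store: URL -> hash recorded at the last crawl.
stored_hashes = {}

def needs_reindex(url, content):
    """True if the page is new or its content changed since the last crawl."""
    digest = page_hash(content)
    changed = stored_hashes.get(url) != digest
    stored_hashes[url] = digest
    return changed

first = needs_reindex("http://example.com/", "hello")    # new page
second = needs_reindex("http://example.com/", "hello")   # unchanged
third = needs_reindex("http://example.com/", "hello!")   # content changed
```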

Indexing Issues
– Availability and speed
  – Most search engines will cache the page being referenced.
– Multiple search terms
  – OR: separate searches concatenated
  – AND: intersection of searches computed.
  – Regular expressions not typically handled.
– Parsing
  – Must be able to handle malformed HTML, partial documents
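Multi-term queries reduce to set operations over an inverted index. This sketch assumes each keyword maps to a set of document ids; the index contents are invented.

```python
# Hypothetical inverted index: keyword -> set of document ids.
index = {
    "web": {"d1", "d2"},
    "search": {"d2", "d3"},
    "graph": {"d1"},
}

def search_or(index, terms):
    """Union of the results for each term."""
    results = set()
    for term in terms:
        results |= index.get(term, set())
    return results

def search_and(index, terms):
    """Intersection of the results for each term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)
```

For example, `search_and(index, ["web", "search"])` keeps only documents containing both words, while `search_or` returns documents containing either.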

PageRank
– Google uses PageRank to determine relevance.
– Based on the “quality” of a page’s inward links.
– Sum the PageRanks of the pages that point to a given page, each divided by its outdegree.
– Let p be a page, with pages T_1, ..., T_n linking to p:
  $PR(p) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{out_i}$
– d is a “damping” factor; out_i is the outdegree of T_i.
– PR “propagates” through a graph.
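The formula can be applied repeatedly until the scores settle. The sketch below is a plain iterative version on a three-page toy graph; the damping value 0.85, the iteration count, and the starting score of 1.0 are assumptions, not values given in the slides.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(T_i) / out_i) over all pages."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page passes on its score split across its outward links.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical toy web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

C ends up ranked highest because both A and B link to it, and A comes next because it receives all of C's score.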

PageRank
– Justification:
  – Imagine a random surfer who keeps clicking through links. She follows a link with probability d and starts a new search with probability 1 - d.
  – Or …
  – A page has a high ranking if highly ranked pages point to it.
– Pros: difficult to game the system
– Cons: Creates a “rich get richer” web structure where highly popular sites grow in popularity.

HITS
– HITS is also commonly used for document ranking.
– Gives each page a hub score and an authority score
  – A good authority is pointed to by many good hubs.
  – A good hub points to many good authorities.
  – Users want good authorities.
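The mutual definition of hubs and authorities suggests an iterative computation. This is a minimal sketch on an invented graph, assuming scores are normalized to unit length after each step and a fixed number of iterations.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(links, iterations=50):
    pages = set(links) | {t for ts in links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        auth = normalize(
            {p: sum(hub[s] for s, ts in links.items() if p in ts) for p in pages}
        )
        # Hub score: sum of authority scores of the pages it points to.
        hub = normalize(
            {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        )
    return hub, auth

# Hypothetical toy web: three hubs pointing at two authorities.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "hub3": ["auth1"],
}
hub, auth = hits(links)
```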

Issues with Ranking Algorithms
– Spurious keywords and META tags
– Users reinforcing each other
  – Increases “authority” measure
– Topic drift
  – Many hubs link to more than one topic

Web structure
– Structure is important for:
  – Predicting traffic patterns
    – Who will visit a site?
    – Where will visitors arrive from?
    – How many visitors can you expect?
  – Estimating coverage
    – Is a site likely to be indexed?

Core
– Compact
  – Short paths between sites
  – “Small world” phenomenon
    – Distances are small relative to average path length
  – Number of inward and outward links follows a power law.
– Mechanism: preferential attachment
  – As new sites arrive, the probability of gaining an inward link is proportional to in-degree.
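Preferential attachment is easy to simulate. In the sketch below each new site adds one link to an existing site, chosen with probability proportional to in-degree plus one; the "plus one" is an assumption added so that sites with no inward links can still be chosen. The function name and parameters are made up.

```python
import random
from collections import Counter

def grow_web(n_sites, seed=0):
    """Simulate n_sites arriving one at a time, each linking to one existing site."""
    rng = random.Random(seed)
    indegree = Counter({0: 0})
    # Each site appears in the pool once, plus once per inward link, so a
    # uniform draw picks a site with probability proportional to indegree + 1.
    pool = [0]
    for new_site in range(1, n_sites):
        chosen = rng.choice(pool)
        indegree[chosen] += 1
        pool.append(chosen)
        indegree[new_site] = 0
        pool.append(new_site)
    return indegree

degrees = grow_web(1000)
# Number of sites at each in-degree.
distribution = Counter(degrees.values())
```

Inspecting `distribution` on a run like this typically shows many sites with few inward links and a few sites with many, the heavy-tailed shape the slide describes.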

Power laws and small worlds
– Power laws occur everywhere in nature
  – Distribution of site sizes, city sizes, incomes, word frequencies
  – Random networks tend to evolve according to a power law.
– Small-world phenomenon
  – “Neighborhoods” will be joined by a common member
  – Hubs serve to connect neighborhoods
  – Linkage is closer than one might expect
  – Six Degrees of Separation, Kevin Bacon

Local structure
– More diverse than a power law
– Pages with similar topics self-organize into communities
  – Short average path length
  – High link density
  – Webrings
  – Inverse: Does a high link density imply the existence of a community?
  – Can this be used to study the emergence and growth of web communities?

Hubs and Authorities
– Common community structure
  – Hubs
    – Many outward links
    – Lists of resources
  – Authorities
    – Many inward links
    – Provide resources, content

Hubs and Authorities
– [Figure: hubs linking to authorities]
– Link structure estimates over 100,000 Web communities
– Often not categorized by portals

Web Communities
– Alternate definition
  – Each member has more links to community members than non-community members.
  – Extension of a clique.
  – Can be discovered with network flow algorithms.
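The alternate definition can be checked directly for a candidate set of pages. This sketch only tests the definition; it does not discover communities, which the slide attributes to network flow algorithms. It assumes links are treated as undirected and counts distinct linked pages; the graph is invented.

```python
def is_community(links, members):
    """True if every member links to more members than non-members."""
    members = set(members)
    # Treat links as undirected: record each page's neighbours in both directions.
    neighbours = {}
    for src, targets in links.items():
        for tgt in targets:
            neighbours.setdefault(src, set()).add(tgt)
            neighbours.setdefault(tgt, set()).add(src)
    for page in members:
        linked = neighbours.get(page, set())
        inside = len(linked & members)
        outside = len(linked - members)
        if inside <= outside:
            return False
    return True

# Hypothetical toy web: a tight triangle a-b-c with one link out to x.
links = {
    "a": ["b", "c"], "b": ["c", "a"], "c": ["a", "x"],
    "x": ["y"], "y": ["x"],
}
```

Here `{"a", "b", "c"}` qualifies, while `{"c", "x"}` does not because c has more neighbours outside that set than inside it.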

Weaknesses of search engines