Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.

Slides:



Advertisements
Similar presentations
PHP Meetup - SEO 2/12/2009. Where to Focus? Ensuring the findability of content Ensuring content is well understood by search engines Maximizing the importance.
Advertisements

Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Chapter 7 Retrieval Models.
Search Engines and Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Presented by Li-Tal Mashiach Learning to Rank: A Machine Learning Approach to Static Ranking Algorithms for Large Data Sets Student Symposium.
The PageRank Citation Ranking “Bringing Order to the Web”
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
Searching The Web Search Engines are computer programs (variously called robots, crawlers, spiders, worms) that automatically visit Web sites and, starting.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Link Structure and Web Mining Shuying Wang
Information Retrieval
Search Engine Optimization Strategies The Basics of SEO Offered to you by Mary Anne Donovan Vice President, SEO Literacy Consultants.
Overview of Search Engines
WEB SCIENCE: SEARCHING THE WEB. Basic Terms Search engine Software that finds information on the Internet or World Wide Web Web crawler An automated program.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
SEO Lunch How to Grow A Business in 3 Bites Akiva Ben-Ezra
Increasing Website ROI through SEO and Analytics Dan Belhassen greatBIGnews.com Modern Earth Inc.
Web Intelligence Search and Ranking. Today The anatomy of search engines (read it yourself) The key design goal(s) for search engines Why google is good:
Search Optimization Techniques Dan Belhassen greatBIGnews.com Modern Earth Inc.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Search Engine Optimization: Understanding the Engines & Building Successful Sites Zohaib Ahmed Google Analytics Individual Qualified March 2012.
SEO. Self Exploding Organs SEO Search Engine Optimisation By Joey Cannon.
Search Engines and Information Retrieval Chapter 1.
The Technology Behind. The World Wide Web In July 2008, Google announced that they found 1 trillion unique webpages! Billions of new web pages appear.
Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Web Searching. How does a search engine work? It does NOT search the Web (when you make a query) It contains a database with info on numerous Web sites.
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Autumn Web Information retrieval (Web IR) Handout #1:Web characteristics Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
Meet the web: First impressions How big is the web and how do you measure it? How many people use the web? How many use search engines? What is the shape.
Ranking CSCI 572: Information Retrieval and Search Engines Summer 2010.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Search Engines By: Faruq Hasan.
Information Retrieval Part 2 Sissi 11/17/2008. Information Retrieval cont..  Web-Based Document Search  Page Rank  Anchor Text  Document Matching.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Week 1 Introduction to Search Engine Optimization.
Week-6 (Lecture-1) Publishing and Browsing the Web: Publishing: 1. upload the following items on the web Google documents Spreadsheets Presentations drawings.
CS 440 Database Management Systems Web Data Management 1.
Traffic Source Tell a Friend Send SMS Social Network Group chat Banners Advertisement.
CPS 49S Google: The Computer Science Within and its Impact on Society Shivnath Babu Spring 2007.
Presented By: Carlton Northern and Jeffrey Shipman The Anatomy of a Large-Scale Hyper-Textural Web Search Engine By Lawrence Page and Sergey Brin (1998)
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
How do Web Applications Work?
SEO Search Engine Optimization
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Objective % Explain concepts used to create websites.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Search Search Engines Search Engine Optimization Search Interfaces
Search Search Engines Search Engine Optimization Search Interfaces
Information Retrieval
CS 440 Database Management Systems
Introduction to Information Retrieval
Web Search Engines.
Query Type Classification for Web Document Retrieval
Presentation transcript:

Web Search

Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The figure shows a partial map of the Internet (the physical network over which the Web is built), which traces the paths of packets through the Internet from a host 2

3 Web Search vs. Data(base) Retrieval Data RetrievalWeb Search Matching Classification Query language Query specification Items wanted Error response Data representation Exact Match Partial (Best) Match Deterministic Probabilistic Model Monotonic Polytechnic Artificial Natural Complete Incomplete Matching Relevant Sensitive Insensitive Schema (Mostly) Index terms

Search Taxonomy (Types) n Informational (Topical) Search  Finding information about some topic which may be on one or more web pages, such as “sports cars” or “London” n Navigational Search  Finding a particular web page, e.g., BYU CS Department homepage, that the user has seen before or is assumed to exist n Transactional Search  Finding a site where a task, such as shopping/downloading music/online library browsing, can be performed n Each type of search is associated w/ different info. need 4

Search Taxonomy (Types) n Navigational Search  The result of a navigational query is the homepage of an institution or organization most relevant to the query  When a navigational query is accurately specified, a feature, such as Google’s “I’m Feeling Lucky”, will bring the user directly to the required homepage 5

6 Web Search vs. Traditional IR Information Retrieval Web Search Type of Information Rate of Change Replication Quality of Pages Range of Topics Type of Queries User Expertise Level Mainly Text-Based Multimedia Relatively Small Billions/Trillions of Pages Archived Data Size Static Dynamic Few Enormous (  30%) High Vary Dramatically Closed Collection Wide Open More Complete Short Similar Vary Significantly

Search Engine Optimization (SEO) n SEO: understanding the relative importance of features used in search and how they can be manipulated to obtain better search rankings for a web page, e.g.,  Improve the text used in the title tag  Improve the text in heading tags  Make sure that the domain name & URL contain important keywords  Improve the anchor text & link structure to the page 7

Anchor Text n Anchor text tends to be short, descriptive, and similar to query text n Used as a description of the content of the destination page  i.e., collection of anchor text in all links pointing to a page used as an additional text field n Incorporating the collection of all the anchor text (as extra text) in links to a page into the ranking algorithm n A simple search/ranking algorithm: search through all links in a collection, looking for anchor text that is an exact match for the user’s query n Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries, i.e., more than PageRank 8

Web Search n Given the size of the Web, many pages will contain all query terms  Ranking algorithm focuses on discriminating between these pages based on the occurrence of query terms  Word/term proximity (i.e., close proximity) is important The dependency model: query terms are assumed to appear in close proximity to each other in relevant documents,  e.g., given the query “Green party political views”, relevant docs should contain the phrases “green party” & “political views” in relatively close proximity N-grams are commonly used in commercial web search 9

Link Analysis n Links are a key component of the Web n Important for navigation, but also for web search  E.g., Example website  “Example website” is the anchor text  “ is the destination link  Both are used by web search engines 10

PageRank n Billions of web pages, some are more informative than the others n Links can be viewed as information about the popularity (authority?) of a web page, e.g., eBay  Can be used by ranking algorithm n Inlink count could be used as simple measure n Link analysis algorithms like PageRank provide more reliable ratings  Less susceptible to link spam 11

PageRank: Random Surfer Model n Browse the Web using the following algorithm:  Choose a random number r between 0 and 1  If r < λ: Go to a random page  If r ≥ λ: Click a link at random on the current page  Start again n PageRank of a page is the probability that the “random surfer” will be looking at that page  Links from popular pages will increase PageRank of pages they point to 12

Dangling Links n Random jump prevents getting stuck on pages that  Do not have links  Contains only links that no longer point to other pages  Have links forming a loop n Links that point to the first two types of pages are called dangling links  May also be links to pages that have not yet been crawled 13

PageRank n PageRank (PR) of page C = PR(A)/2 + PR(B)/1 n More generally,  where B u is the set of pages that point to u, and L v is the number of outgoing links from page v (not counting duplicate links) 14

PageRank n Don’t know PageRank values at start n Assume equal values (1/3 in this case), then iterate:  First iteration: PR(C) = 0.33/ = 0.5, PR(A) = 0.33, and PR(B) = 0.17  Second: PR(C) = 0.33/ = 0.33, PR(A) = 0.5, PR(B) = 0.17  Third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25 n Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) =

PageRank n Taking random page jump into account, 1/3 chance of going to any page when r < λ n PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1) n More generally,  where N is the number of pages, λ typically

17