Searching the Web CS3352

Searching the Web
Three forms of searching:
– Specific queries → encyclopaedias, libraries; exploit hyperlink structure
– Broad queries → web directories, which classify web documents by subject
– Vague queries → search engines, which index portions of the web

Problems with the data
– Distributed data
– High percentage of volatile data: much is dynamically generated
– Large volume: in June 2000 Google's full-text index covered 560 million URLs
– Unstructured data: GIFs, PDFs, etc.
– Redundant data: mirrors (around 30% of pages are near-duplicates)
– Quality of data: false, poorly written, invalid, misspelt
– Heterogeneous data: media, formats, languages, alphabets

Users and the Web
– How to specify a query? How to interpret answers?
  – Ranking, relevance selection, summary presentation, large-document presentation
– Main purposes: research, leisure, business, education
– Observed behaviour:
  – 80% do not modify their query
  – 85% look at the first screen of results only
  – 64% of queries are unique
  – 25% of users use single keywords, a problem for polysemous words and synonyms

Web search
– All queries are answered without accessing the texts, by indices alone
  – Local copies of web pages are expensive (Google cache)
  – Remote page access is unrealistic
– Links: link topology, link popularity, link quality, who links to whom
– Page structure: words in headings weigh more than words in body text, etc.
– Sites: sub-collections of documents, mirror-site detection
– Names, presenting summaries, community identification
– Indexing: refresh rate
– Similarity engine, ranking scheme, caching and popularity measures

Spamming
Most search engines have rules against:
– invisible text
– meta-tag abuse
– heavy repetition
– "domain spam": overt submission of "mirror" sites in an attempt to dominate the listings for particular terms

Excite Spamming
Excite screens out spamming before adding a page to its web-page index.
– If it finds a string of words such as:
  money money money money money money money
– it replaces the excess repetition, so that the string essentially becomes:
  money xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx
– The more unusual repetition Excite detects, the more heavily it penalises a page.
– Excite does not penalise the use of hidden text, but penalties apply if hidden text is used to disguise spam content.
– Excite also penalises "domain spam".
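A minimal sketch (in Python, for illustration; the function name and threshold are mine, not Excite's actual code) of the kind of repetition filter this slide describes:

    def squash_repetition(text: str, max_repeats: int = 1) -> str:
        """Replace excess consecutive repetitions of a word with 'xxxxx',
        in the spirit of the Excite penalty described above."""
        out = []
        run_word, run_len = None, 0
        for w in text.split():
            if w.lower() == run_word:
                run_len += 1
            else:
                run_word, run_len = w.lower(), 1
            out.append(w if run_len <= max_repeats else "xxxxx")
        return " ".join(out)

    print(squash_repetition("money money money money money money money"))
    # money xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx  (matches the slide's example)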

Centralised architecture
– Crawler-indexer design (used by most search engines)
– Crawler (also called robot, spider, wanderer, walker, knowbot):
  – a program that traverses the web to send new or updated pages to a main server, where they are indexed
  – runs on a local server and sends requests to remote servers
– Centralised use of the index to answer queries
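A toy sketch of the crawler half of this architecture, using only Python's standard library. A real crawler would also honour robots.txt, throttle requests per host, and stream pages to the central indexer rather than keep them in memory:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed: str, limit: int = 50):
        """Breadth-first crawl: fetch a page, record it (a real system
        would send it to the indexer), then queue its out-links."""
        seen, queue, pages = {seed}, deque([seed]), {}
        while queue and len(pages) < limit:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable or malformed page: skip it
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages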

Example (AltaVista)
Components: interface, query engine, index, indexer, crawler.
In 1998: 20 multi-processor machines, 130 GB of RAM, 500 GB of disk space.

Distributed architecture
Harvest (harvest.transarc.com)
– Gatherers: collect and extract indexing information from one or more web servers at periodic intervals
– Brokers:
  – provide the indexing mechanism and query interface to the gathered data
  – retrieve information from gatherers or other brokers, incrementally updating their indices

Harvest architecture
[Architecture diagram: users query brokers; brokers draw on gatherers, other brokers, a replication manager and an object cache, which sit between the web sites and the user.]

Ranking algorithms
– Variations of the Boolean and vector-space models: TF × IDF, plus
– Hyperlinks between pages:
  – pages pointed to by a retrieved page
  – pages that point to a retrieved page
  – popularity: number of hyperlinks to a page
  – relatedness: number of hyperlinks shared between pages, or pages referenced by the same pages
– Examples: WebQuery, PageRank (Google), HITS (Clever); a small scoring sketch follows below.
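As a concrete starting point, a hedged Python sketch of the first two ideas on this slide: plain TF × IDF scoring blended with a simple link-popularity signal. The blending weight `alpha` and the log-damped in-link count are illustrative choices, not any particular engine's formula:

    import math
    from collections import Counter

    def tf_idf_scores(docs: dict, query: list) -> dict:
        """Plain TF x IDF over a tiny corpus; docs maps doc id -> text."""
        n = len(docs)
        tokenized = {d: text.lower().split() for d, text in docs.items()}
        df = Counter(t for terms in tokenized.values() for t in set(terms))
        scores = {}
        for d, terms in tokenized.items():
            tf = Counter(terms)
            scores[d] = sum(tf[q] * math.log(n / df[q])
                            for q in query if df[q] > 0)
        return scores

    def link_boosted(scores: dict, inlinks: dict, alpha: float = 0.5) -> dict:
        """Blend the text score with a popularity signal (in-link count)."""
        return {d: (1 - alpha) * s + alpha * math.log(1 + inlinks.get(d, 0))
                for d, s in scores.items()}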

Let's Use Links!

Metasearch
A web server that sends the query to:
– several search engines
– web directories
– databases
then collects the results and unifies them (data fusion).
Aim: better coverage.
Issues:
– translation of the query
– uniform results (fusion of rankings, e.g. pages retrieved by several engines; see the sketch below)
– wrappers
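A small sketch of one possible fusion ranking, assuming a CombSUM-style scheme in which each engine contributes a score that decays with rank, so pages retrieved by several engines rise to the top. This is one of many fusion schemes, not a standard the slide prescribes:

    def fuse(rankings: list) -> list:
        """Merge per-engine ranked lists: each engine adds 1/(rank+1)
        to a page's score; sort by the summed score."""
        scores = {}
        for ranking in rankings:
            for rank, url in enumerate(ranking):
                scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    engines = [
        ["a.com", "b.com", "c.com"],   # results from engine 1
        ["b.com", "a.com", "d.com"],   # results from engine 2
    ]
    print(fuse(engines))  # a.com and b.com, retrieved by both, rank first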

Google
The best web engine: comprehensive and relevant results.
– Biggest index: 580 million pages visited and recorded; link data is used to reach another 500 million pages.
– Different kinds of index: smaller indexes contain a higher proportion of the web's most popular pages, as determined by Google's link-analysis system.
– Index refresh: updated monthly/weekly; daily for popular pages.
– Serves queries from three data centres: two on the West Coast of the US, one on the East Coast.

Google: let this inspire you…
Larry Page, co-founder & Chief Executive Officer; Sergey Brin, co-founder & President. Both were PhD students at Stanford.

Google Overview
– Crawls the web to create its listings.
– Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
– Other services also use link popularity, but none to the extent that Google does.

Citation Importance Ranking

Google links
Submission:
– Add-URL page (no need to do a "deep" submit).
– The best way to ensure that your site is indexed is to build links: the more other sites point at you, the more likely you are to be crawled and ranked well.
Crawling and index depth:
– Google aims to refresh its index on a monthly basis.
– Even if Google doesn't actually index a page, it may still return it in a search, because it makes extensive use of the text within hyperlinks.
– This anchor text is associated with the pages the links point at, making it possible for Google to find matching pages even when those pages cannot themselves be indexed (see the sketch below).
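A sketch of the anchor-text trick just described: index each link's words under the *target* URL, so even an unfetched page can match a query. The data layout here is hypothetical, purely for illustration:

    def index_anchor_text(pages: dict) -> dict:
        """pages maps a source URL to its out-links as
        (target_url, anchor_text) pairs; build term -> {target URLs}."""
        index = {}
        for source, outlinks in pages.items():
            for target, anchor in outlinks:
                for term in anchor.lower().split():
                    index.setdefault(term, set()).add(target)
        return index

    pages = {"x.com": [("y.com", "cheap travel deals")]}
    print(index_anchor_text(pages)["travel"])
    # {'y.com'} -- matched even though y.com itself was never fetched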

Google Relevancy (1)
Google ranks web pages based on the number, quality and content of links pointing at them (citations).
– Number of links: all things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
– Link quality: numbers aren't everything; a single link from an important site might be worth more than many links from relatively unknown sites.

Google Relevancy (2)
– Link content: the text in and around links relates to the page they point at. For a page to rank well for "travel", it would need many links that use the word travel in or near them on the page. It also helps if the page itself is textually relevant to travel.
– Ranking boosts from text styles: the appearance of terms in bold text, in header text, or in a large font size is taken into account. None of these is a dominant factor, but they do figure into the overall equation.

PageRank
Usage simulation and citation-importance ranking:
– Based on a model of a web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
– The user navigates randomly:
  – jumps to a random page with probability p
  – follows a random hyperlink from the current page with probability 1 − p
  – never goes back to a previously visited page by following a previously traversed link backwards
– Google finds a single type of universally important page: intuitively, locations that are heavily visited in a random traversal of the web's link structure.

PageRank
The process is modelled by a Markov chain; the probability of being at each page is computed, with p set by the system.
– W_j = PageRank of page j
– n_i = number of outgoing links on page i
– m = number of nodes in G, i.e. the number of web pages in the collection
– Each page's contribution is normalised by its number of outgoing links.
– Computed using an iterative method:

  W_j = p/m + (1 − p) · Σ_{(i,j) ∈ G} W_i / n_i
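A direct, if naive, Python implementation of the iterative method above. The handling of dangling pages (spreading their rank evenly) is a common convention I have added; the slide does not specify it:

    def pagerank(links: dict, p: float = 0.15, iters: int = 50) -> dict:
        """Iterate W_j = p/m + (1 - p) * sum of W_i / n_i over pages i
        linking to j; p is the random-jump probability, m the page count.
        links maps each page to the list of pages it links to."""
        pages = list(links)
        m = len(pages)
        w = {page: 1.0 / m for page in pages}
        for _ in range(iters):
            new = {page: p / m for page in pages}
            for i, outs in links.items():
                if not outs:                     # dangling page
                    for page in pages:
                        new[page] += (1 - p) * w[i] / m
                else:
                    for j in outs:
                        new[j] += (1 - p) * w[i] / len(outs)
            w = new
        return w

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))  # c accumulates the most rank in this tiny graph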

PageRank
[Diagram: pages i1, i2, i3 link to page j, each contributing (1 − p) · W_i/n_i to W_j alongside the p/m jump term; j in turn links out to pages k1 and k2.]

Google Content
– Performs a full-text index of the pages it visits, gathering all visible text.
– Does not read the meta keywords or description tags.
– Descriptions are formed automatically by extracting the most relevant portions of pages.
– If a page has no description, it is probably because Google has never actually visited it.

Google Spamming
– The link-popularity ranking system leaves Google relatively immune to traditional spamming techniques, because it goes beyond the text on pages to decide how good they are. No links, low rank.
– Common spam idea: create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading these pages across a network of sites.
– This is unlikely to work; do real link building instead, with non-competitive sites that are related to yours.

Site identification

AltaVista

HITS: Hypertext Induced Topic Search
– The ranking scheme depends on the query.
– Considers the set S of pages that point to, or are pointed at by, pages in the answer.
– Implemented in IBM's Clever prototype.
– Scientific American article: /0699raghavan.html

HITS (2)
– Authorities: pages in S that have many links pointing to them.
– Hubs: pages that have many outgoing links.
– Positive two-way feedback:
  – better authority pages have incoming edges from good hubs
  – better hub pages have outgoing edges to good authorities

Authorities and Hubs
[Diagram: authorities (blue) receive links from hubs (red).]

HITS
A two-step iterative process; assign initial scores to candidate hubs and authorities on a particular topic in the set of pages S:
1. Use the current guesses about the authorities to improve the estimates of hubs: locate all the best authorities.
2. Use the updated hub information to refine the guesses about the authorities: determine where the best hubs point most heavily and call these the good authorities.
Repeat until the scores converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs:

  H(p) = Σ_{u ∈ S | p → u} A(u)
  A(p) = Σ_{v ∈ S | v → p} H(v)
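The two-step update written out as a small Python sketch. The per-pass Euclidean normalisation is the usual convention for making the scores converge to the principal eigenvector; the slide does not spell it out:

    import math

    def hits(links: dict, iters: int = 50):
        """links maps each page in S to the pages it points to;
        returns (hub, authority) score dictionaries."""
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            # step 1: authorities from the current hub estimates
            auth = {p: sum(hub[v] for v in pages if p in links[v])
                    for p in pages}
            # step 2: hubs from the updated authority estimates
            hub = {p: sum(auth[u] for u in links[p]) for p in pages}
            for vec in (auth, hub):  # normalise so the scores converge
                norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
                for p in vec:
                    vec[p] /= norm
        return hub, auth

    graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    print(hits(graph))  # h1 is the best hub, a1 the best authority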

HITS issues (3)
– Restrict the set of pages to a maximum number.
– Doesn't work with non-existent, repeated or automatically generated links.
– Weight links based on their surrounding content.
– Diffusion of the topic:
  – a more general topic comes to contain the original answer
  – analyse the content of each page and score it, combining link weight with page score
– Sub-grouping of links.
– HITS is also used for web-community identification.

Cybercommunities

Google vs Clever
Google:
1. assigns initial rankings and retains them independently of any queries, enabling faster response;
2. looks only in the forward direction, from link to link.
Clever:
1. assembles a different root set for each search term and then prioritises those pages in the context of that particular query;
2. also looks backward from an authoritative page to see what locations are pointing there.
Humans are innately motivated to create hub-like content expressing their expertise on specific topics.

Autonomy
– High-performance pattern matching based on Bayesian inference networks.
– Identifies patterns in text, based on usage and term frequency, that correspond to concepts: there is an X% probability that a document is about a subject.
– Encode the signature, categorise it, and link it to related documents with the same signature.
– Not just a search-engine tool.
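Autonomy's engine is proprietary, so the following is only a loose illustration of Bayesian text categorisation of the general kind the slide gestures at, with add-one smoothing; none of it reflects Autonomy's actual signatures or models:

    import math
    from collections import Counter

    def train(labelled_docs: list) -> dict:
        """labelled_docs is a list of (category, text) pairs;
        per-category term counts stand in for P(term | category)."""
        model = {}
        for cat, text in labelled_docs:
            model.setdefault(cat, Counter()).update(text.lower().split())
        return model

    def log_probability(model: dict, cat: str, text: str) -> float:
        """Log P(category | document) up to a constant, add-one smoothed."""
        counts = model[cat]
        total = sum(counts.values())
        vocab = len({t for c in model.values() for t in c})
        return sum(math.log((counts[t] + 1) / (total + vocab))
                   for t in text.lower().split())

    model = train([("sport", "football match goal"),
                   ("finance", "stock market price")])
    print(log_probability(model, "sport", "goal in the match"))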

Human-powered searching
– Ask Jeeves: an answer service using human editors to build the knowledge base. Editors proactively suggest questions and watch what people are actually searching for.
– Yahoo: uses humans to organise the web. Human editors find sites or review submissions, then place those sites in one or more "categories" relevant to them.

Research Issues
– Modelling
– Querying
– Distributed architecture
– Ranking
– Indexing
– Dynamic pages
– Browsing
– User interface
– Duplicated data
– Multimedia

Further reading