Web Search

Crawling
Start from some root site, e.g., the Yahoo directories. Traverse the HREF links.
Consequence: if there isn’t an HREF path from some Yahoo-like directory to your page, then your page probably isn’t indexed by any search engine.
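A minimal breadth-first crawler sketch in Python (standard library only), following the slide’s model of traversing HREF links from a root site; the page limit and timeout are illustrative additions, not part of the slides:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the HREF targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(root_url, max_pages=100):
    """Breadth-first traversal of HREF links starting from a root site."""
    fringe = deque([root_url])                  # start from the root
    visited = set()
    while fringe and len(visited) < max_pages:
        url = fringe.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                            # unreachable page: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            fringe.append(urljoin(url, link))   # resolve relative HREFs
    return visited
```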

Information Retrieval (IR)
Conceptually, IR is the study of finding needed information.
– IR helps users find information that matches their information needs, expressed as queries.
Historically, IR is about document retrieval, emphasizing the document as the basic unit.
– Finding documents relevant to user queries.
Web search also has its roots in IR.
From: Bing Liu. Web Data Mining. 2007

IR architecture
[Diagram of a typical IR system architecture.]
From: Bing Liu. Web Data Mining. 2007

IR queries
– Keyword queries
– Boolean queries (using AND, OR, NOT)
– Phrase queries
– Proximity queries
– Full document queries
– Natural language questions
From: Bing Liu. Web Data Mining. 2007

Information retrieval models
An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined. Main models:
– Boolean model
– Vector space model
– Statistical language model
– etc.
From: Bing Liu. Web Data Mining. 2007

Boolean model
Each document or query is treated as a “bag” of words or terms. Word sequence is not considered.
Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinctive words/terms in the collection. V is called the vocabulary.
A weight w_ij > 0 is associated with each term t_i of a document d_j in D:
d_j = (w_1j, w_2j, ..., w_|V|j)
For a term that does not appear in document d_j, w_ij = 0.
From: Bing Liu. Web Data Mining. 2007

Boolean model (cont’d)
Query terms are combined logically using the Boolean operators AND, OR, and NOT.
– E.g., ((data AND mining) AND (NOT text))
Retrieval
– Given a Boolean query, the system retrieves every document that makes the query logically true.
– Called exact match.
The retrieval results are usually quite poor because term frequency is not considered.
From: Bing Liu. Web Data Mining. 2007

Vector space model
Documents are also treated as a “bag” of words or terms, and each document is represented as a vector. However, the term weights are no longer 0 or 1: each term weight is computed based on some variation of the TF or TF-IDF scheme.
Term Frequency (TF) scheme: the weight of a term t_i in document d_j is the number of times that t_i appears in d_j, denoted by f_ij. Normalization may also be applied.
The shortcoming of the TF scheme is that it doesn’t consider the situation where a term appears in many documents of the collection; such a term may not be discriminative.
From: Bing Liu. Web Data Mining. 2007

TF-IDF term weighting scheme
The best-known weighting scheme:
– TF: (normalized) term frequency
– IDF: inverse document frequency
N: total number of docs; df_i: the number of docs in which t_i appears.
tf_ij = f_ij / max_k f_kj,  idf_i = log(N / df_i)
The final TF-IDF term weight is: w_ij = tf_ij × idf_i
From: Bing Liu. Web Data Mining. 2007
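A small Python sketch of this weighting; since the transcript’s formula images are missing, it assumes the variant written above (term frequency normalized by the document’s most frequent term, natural-log IDF):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns a {term: weight} dict per document."""
    N = len(docs)
    # df_i: number of documents in which term t_i appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        f = Counter(doc)                  # raw frequencies f_ij
        max_f = max(f.values())           # normalize by the most frequent term
        weights.append({t: (c / max_f) * math.log(N / df[t])
                        for t, c in f.items()})
    return weights

docs = [["web", "search", "web"], ["search", "engine"], ["web", "mining"]]
print(tf_idf(docs)[0])   # "web" outweighs "search" in the first document
```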

Retrieval in vector space model
Query q is represented in the same way as a document. The weight w_iq of each term t_i in q can be computed in the same way as in a normal document.
Relevance of d_j to q: compare the similarity of query q and document d_j. For this, use cosine similarity (the cosine of the angle between the two vectors):
cosine(d_j, q) = (d_j · q) / (||d_j|| × ||q||)
From: Bing Liu. Web Data Mining. 2007

An Example
Suppose a document space is defined by three terms (words):
– hardware, software, users (the vocabulary)
A set of documents is defined as:
– A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
– A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
– A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
If the query is “hardware and software”, what documents should be retrieved?
From: Bing Liu. Web Data Mining. 2007

An Example (cont.)
In Boolean query matching:
– documents A4, A7 will be retrieved (“AND”)
– retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
In similarity matching (cosine):
– q = (1, 1, 0)
– S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
– S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
– S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
– Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
From: Bing Liu. Web Data Mining. 2007
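A Python sketch that reproduces the numbers on this slide, using the plain 0/1 vectors for the documents and query:

```python
import math

docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)   # query "hardware and software" over (hardware, software, users)

# Boolean AND: every query term must occur -> ['A4', 'A7']
print([d for d, v in docs.items() if all(v[i] for i in range(3) if q[i])])

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Cosine ranking: A4 = 1.0, A7 = 0.82, A1 = A2 = 0.71,
# A5 = A6 = A8 = A9 = 0.5, and A3 = 0 (excluded from the retrieved set)
for d in sorted(docs, key=lambda d: -cosine(q, docs[d])):
    print(d, round(cosine(q, docs[d]), 2))
```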

Text pre-processing
– Word (term) extraction: easy
– Stopwords removal
– Stemming
– Frequency counts and computing TF-IDF term weights
From: Bing Liu. Web Data Mining. 2007

Stopwords removal
Some of the most frequently used words aren’t useful in IR and text mining; these words are called stop words.
– the, of, and, to, ...
– Typically about 400 to 500 such words
– For an application, an additional domain-specific stopword list may be constructed
Why do we need to remove stopwords?
– To reduce the indexing (or data) file size: stopwords account for 20-30% of total word counts.
– To improve efficiency and effectiveness: stopwords are not useful for searching or text mining, and they may also confuse the retrieval system.
From: Bing Liu. Web Data Mining. 2007

Stemming
Techniques used to find the root/stem of a word. E.g.,
– user, users, used, using --> stem: use
– engineering, engineered, engineer --> stem: engineer
Usefulness: improving the effectiveness of IR and text mining
– matching similar words; mainly improves recall
– reducing indexing size: combining words with the same root may reduce indexing size by as much as 40-50%.
From: Bing Liu. Web Data Mining. 2007

Basic stemming methods
Using a set of rules. E.g.,
Remove endings:
– if a word ends with a consonant other than s, followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists of only one letter or is th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– ...
Transform words:
– if a word ends with “ies” but not “eies” or “aies”, then “ies” --> “y”.
From: Bing Liu. Web Data Mining. 2007
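A rough Python sketch of these rules, applied in the listed order (with the “ies” transform checked first so it isn’t shadowed by the “es” rule). Note how crude such rules are: “using” stems to “us”, not “use”, which is why production stemmers such as Porter’s carry many more rules:

```python
VOWELS = set("aeiou")

def stem(word):
    """Apply the suffix-stripping rules above, in order; a rough sketch only."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"                              # studies -> study
    if w.endswith("es"):
        return w[:-1]                                    # engines -> engine
    if w.endswith("s") and len(w) > 1 and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]                                    # users -> user
    if w.endswith("ing") and len(w) > 4 and w[:-3] != "th":
        return w[:-3]                                    # engineering -> engineer
    if w.endswith("ed") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]                                    # engineered -> engineer
    return w

print([stem(w) for w in ["users", "engineering", "engineered", "studies"]])
# -> ['user', 'engineer', 'engineer', 'study']
```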

Precision and Recall
In the information retrieval (search engine) community, system evaluation revolves around the notion of relevant and not-relevant documents.
– Precision is the fraction of retrieved documents that are relevant.
– Recall is the fraction of relevant documents that are retrieved.

In terms of the confusion matrix:

                   Relevant              Not relevant
  Retrieved        true positives (TP)   false positives (FP)
  Not retrieved    false negatives (FN)  true negatives (TN)

Precision = TP / (TP + FP); Recall = TP / (TP + FN)

Why have two numbers?
The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
Typical web surfers:
– would like every result on the first page to be relevant (high precision), but have not the slightest interest in knowing, let alone looking at, every document that is relevant.
Professional searchers such as paralegals and intelligence analysts:
– are very concerned with trying to get as high a recall as possible, and will tolerate fairly low-precision results in order to get it.

What about a single number?
The combined measure which is standardly used is called the F measure, the weighted harmonic mean of precision and recall:
F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α
The default is to equally weight precision and recall, giving a balanced F measure.
– This corresponds to making α = 1/2 or β = 1.
– Commonly written as F_1, which is short for F_{β=1}: F_1 = 2PR / (P + R)
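The three measures in Python; the counts in the example run are hypothetical:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives F1."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

# Hypothetical run: 40 docs retrieved, 30 of them relevant; 60 relevant overall.
p, r = precision(tp=30, fp=10), recall(tp=30, fn=30)
print(p, r, f_measure(p, r))   # 0.75 0.5 0.6
```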

Precision at k
The measures above average over all recall levels. What often matters instead is how many good results there are on the first page or the first three pages.
This leads to measuring precision at fixed low numbers of retrieved results, such as 10 or 30 documents. This is referred to as “precision at k”, for example “precision at 10”.

Web Search as a huge IR system
A Web crawler (robot) crawls the Web to collect all the pages.
Servers build a huge inverted index database and other indexing databases.
At query (search) time, search engines conduct different types of vector query matching.
– An information retrieval score comes out of this.
The documents also have HREF links, which are used to compute a reputation score.
The two scores are combined to produce a ranking of the returned documents.

Inverted Indexes
The idea behind an inverted index is simple: start with a set of documents containing words, and “invert” that to get a collection of words, each of which lists all the documents that contain it.
Usually used with “buckets.”

Additional Information in Buckets
We can extend a bucket entry to include additional information, e.g., the role (type) and position of the word within the document.
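A minimal inverted index with positional buckets in Python; the two sample documents are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: token list}. Each bucket holds (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].append((doc_id, pos))
    return index

index = build_index({
    "d1": ["web", "search", "engines"],
    "d2": ["search", "the", "web"],
})
print(index["web"])      # [('d1', 0), ('d2', 2)]
print(index["search"])   # [('d1', 1), ('d2', 0)]
```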

Google Page Ranking “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page

Outline
1. Page rank, for discovering the most “important” pages on the Web, as used in Google.
2. Hubs and authorities, a more detailed evaluation of the importance of Web pages, using a variant of the eigenvector calculation used for Page rank.

Page Rank (PR)
Intuitively, we solve the recursive definition of “importance”: a page is important if important pages link to it. Page rank is the estimated page importance.
In short, PageRank is a “vote” by all the other pages on the Web about how important a page is.
– A link to a page counts as a vote of support.
– If there’s no link, there’s no support (but it’s an abstention from voting rather than a vote against the page).

Page Rank Formula
PR(A) = PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n)
1. PR(T_i): each page has a notion of its own self-importance, say 1 initially.
2. C(T_i): count of outgoing links from page T_i. Each page spreads its vote out evenly among all of its outgoing links.
3. PR(T_i)/C(T_i): so if our page (say page A) has a backlink from page T_i, the share of the vote page A will get from T_i is PR(T_i)/C(T_i).

How is Page Rank Calculated?
This is where it gets tricky. The page rank (PR) of each page depends on the PR of the pages pointing to it.
– But we won’t know what PR those pages have until the pages pointing to them have their PR calculated, and so on...
– And when you consider that page links can form cycles, it seems impossible to do this calculation!
But actually it’s not that bad. The Google paper says:
– “PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the link matrix of the web.”
Just go ahead and calculate a page’s PR without knowing the final value of the PR of the other pages.
– Each time we run the calculation we get a closer estimate of the final value.
– Repeat the calculations many times until the numbers stop changing much.

Web Matrix
Capture the formula by the web matrix WebM:
1. Each page i corresponds to row i and column i of the matrix.
2. If page j has n successors (links), then the ij-th entry is 1/n if page i is one of these n successors of page j, and 0 otherwise.
Then the importance vector containing the rank of each page is calculated by:
Rank_new = WebM × Rank_old

Example
In 1839, the Web consisted of only three pages: Netscape, Microsoft, and Amazon. The web matrix (rows and columns ordered n, m, a) is:

         n    m    a
  n  [ 1/2    0   1/2 ]
  m  [  0     0   1/2 ]
  a  [ 1/2    1    0  ]

For example, the first column of the Web matrix reflects the fact that Netscape divides its importance between itself and Amazon. The second column indicates that Microsoft gives all its importance to Amazon.
Start with n = m = a = 1, then do rounds of improvements.
Based on Jeff Ullman’s notes

Example (cont.)
The first four iterations give the following estimates:
n = 1   1     5/4   9/8   5/4
m = 1   1/2   3/4   1/2   11/16
a = 1   3/2   1     11/8  17/16
In the limit, the solution is n = a = 6/5; m = 3/5. That is, Netscape and Amazon each have the same importance, and twice the importance of Microsoft (well, this was 1839).
Based on Jeff Ullman’s notes
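A Python sketch of the iteration Rank_new = WebM × Rank_old on this three-page Web; fifty rounds is an arbitrary but ample choice:

```python
# Web matrix for the three-page Web above (rows/columns ordered n, m, a).
WebM = [
    [1/2, 0, 1/2],   # row n: Netscape is a successor of Netscape and Amazon
    [0,   0, 1/2],   # row m: Microsoft is a successor of Amazon only
    [1/2, 1, 0],     # row a: Amazon is a successor of Netscape and Microsoft
]

rank = [1.0, 1.0, 1.0]          # start with n = m = a = 1
for _ in range(50):             # Rank_new = WebM * Rank_old, repeated
    rank = [sum(WebM[i][j] * rank[j] for j in range(3)) for i in range(3)]
print(rank)                     # -> [1.2, 0.6, 1.2], i.e., n = a = 6/5, m = 3/5
```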

Problems With Real Web Graphs
Dead ends: a page that has no successors has nowhere to send its importance. Eventually, all importance will “leak out of” the Web.
Example: suppose Microsoft tries to claim that it is not a monopoly by removing all links from its site. The new Web, and the rank vectors for the first 4 iterations, are:
n = 1   1     3/4   5/8   1/2
m = 1   1/2   1/4   1/4   3/16
a = 1   1/2   1/2   3/8   5/16
Eventually each of n, m, and a becomes 0; i.e., all the importance has leaked out.
Based on Jeff Ullman’s notes

Problems With Real Web Graphs
Spider traps: a group of one or more pages that have no links out of the group will eventually accumulate all the importance of the Web.
Example: angered by the decision, Microsoft decides it will link only to itself from now on. Now Microsoft has become a spider trap. The new Web, and the rank vectors for the first 4 iterations, are:
n = 1   1     3/4   5/8   1/2
m = 1   3/2   7/4   2     35/16
a = 1   1/2   1/2   3/8   5/16
Now m converges to 3, and n = a = 0.
Based on Jeff Ullman’s notes

Google Solution to Dead Ends and Spider Traps
Stop other pages having too much influence: the total vote is “damped down” by multiplying it by a factor.
Example: if we use a 20% damp-down, the equation of the previous example becomes:
  [n, m, a]_new = 0.8 × WebM × [n, m, a]_old + [0.2, 0.2, 0.2]
The solution to this equation is n = 7/11; m = 21/11; a = 5/11.
Based on Jeff Ullman’s notes
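A sketch of the damped iteration in Python, using the spider-trap Web from the previous slide; it converges to the stated solution:

```python
# Spider-trap Web: Netscape -> {n, a}, Microsoft -> {m}, Amazon -> {n, m}.
M = [
    [1/2, 0, 1/2],
    [0,   1, 1/2],
    [1/2, 0, 0],
]

rank = [1.0, 1.0, 1.0]
for _ in range(200):            # 20% damp-down: rank = 0.8*M*rank + 0.2
    rank = [0.8 * sum(M[i][j] * rank[j] for j in range(3)) + 0.2
            for i in range(3)]
print(rank)   # -> [0.636..., 1.909..., 0.454...] = [7/11, 21/11, 5/11]
```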

Hubs and Authorities
Intuitively, we define “hub” and “authority” in a mutually recursive way:
– a hub links to many authorities, and
– an authority is linked to by many hubs.
Authorities turn out to be pages that offer information about a topic; hubs are pages that don’t provide the information themselves, but tell you where to find it.

Matrix formulation
Use a matrix formulation similar to that of PageRank, but without the stochastic restriction: we count each link as 1, regardless of how many successors or predecessors a page has.
Namely, define a matrix A whose rows and columns correspond to Web pages, with entry A_ij = 1 if page i links to page j, and 0 if not.
– Notice that A^T, the transpose of A, looks like the matrix used for computing Page rank, but A^T has 1’s where the Page-rank matrix has fractions.

Authority and Hubbiness Vectors
Let a and h be vectors whose i-th components correspond to the degrees of authority and hubbiness of the i-th page. Let λ and μ be suitable scaling factors. Then we can state:
1. h = λ A a. That is, the hubbiness of each page is the sum of the authorities of all the pages it links to, scaled by λ.
2. a = μ A^T h. That is, the authority of each page is the sum of the hubbiness of all the pages that link to it, scaled by μ.
Based on Jeff Ullman’s notes

Simple substitutions
From (1) and (2), using simple substitution, we can derive two equations that relate the vectors a and h only to themselves:
1. a = λμ A^T A a
2. h = λμ A A^T h
As a result, we can compute h and a by relaxation, giving us the principal eigenvectors of the matrices A A^T and A^T A, respectively.
Based on Jeff Ullman’s notes

Example
If we use λ = μ = 1 and start with the vectors h = [h_n, h_m, h_a] = [1, 1, 1] and a = [a_n, a_m, a_a] = [1, 1, 1], we can run the first three iterations of the equations for a and h; the sketch below computes them.
Based on Jeff Ullman’s notes
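The transcript omits the slide’s table of iteration results, so here is a Python sketch that computes them, assuming the original three-page link structure (n links to {n, a}, m to {a}, a to {n, m}). Without rescaling the values grow each round; only their relative sizes matter, and they converge in direction to the principal eigenvectors:

```python
# A[i][j] = 1 if page i links to page j; pages ordered (n, m, a).
A = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]
AT = [[A[j][i] for j in range(3)] for i in range(3)]    # transpose of A

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

h = [1, 1, 1]                       # initial hubbiness
for step in range(1, 4):            # lambda = mu = 1, no rescaling
    a = matvec(AT, h)               # authority: sum hubbiness of predecessors
    h = matvec(A, a)                # hubbiness: sum authority of successors
    print(step, "a =", a, "h =", h)
# 1 a = [2, 1, 2]    h = [4, 2, 3]
# 2 a = [7, 3, 6]    h = [13, 6, 10]
# 3 a = [23, 10, 19] h = [42, 19, 33]
```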

Web Spam: raison d’être
E-commerce is rapidly growing.
– Projected to reach $329 billion by 2010.
More traffic means more money, and a large fraction of traffic comes from search engines. To increase search engine referrals:
– Place ads
– Provide genuinely better content
– Create Web spam...
From: Ntoulas, Najork, Manasse, Fetterly. Detecting Spam Web Pages through Content Analysis. 2006

Web Spam (you know it when you see it)

Defining Web Spam
Spam Web page: a page created for the sole purpose of attracting search engine referrals (to this page or some other “target” page).
Ultimately a judgment call.
– Some Web pages are borderline cases.
From: Ntoulas, Najork, Manasse, Fetterly. Detecting Spam Web Pages through Content Analysis. 2006

Why Web Spam is Bad
Bad for users:
– Makes it harder to satisfy an information need
– Leads to a frustrating search experience
Bad for search engines:
– Wastes bandwidth, CPU cycles, and storage space
– Pollutes the corpus (there is an infinite number of spam pages!)
– Distorts the ranking of results
From: Ntoulas, Najork, Manasse, Fetterly. Detecting Spam Web Pages through Content Analysis. 2006

How pervasive is Web Spam?
Real-Web data from the MSNBot crawler:
– Collected during August 2004
– Processed only MIME types text/html and text/plain
– 105,484,446 Web pages in total
From: Ntoulas, Najork, Manasse, Fetterly. Detecting Spam Web Pages through Content Analysis. 2006

Spam per Top-level Domain
[Chart of spam fraction by top-level domain; error bars show 95% confidence intervals.]
From: Ntoulas, Najork, Manasse, Fetterly. Detecting Spam Web Pages through Content Analysis. 2006

Spam per Language
[Chart of spam fraction by page language; error bars show 95% confidence intervals.]
From: Ntoulas, Najork, Manasse, Fetterly. Detecting Spam Web Pages through Content Analysis. 2006

Content Spamming
Most search engines use variations of TF-IDF-based measures to assess the relevance of a page to a user query. Content-based spamming methods tailor the contents of the text fields in HTML pages to make spam pages more relevant to some queries.
Content spamming can be placed in any text field:
– Title: since search engines usually give higher weights to terms in the title of a page, due to the importance of the title to a page, it is common to spam the title.
– Body
– Anchor text: the anchor text of a hyperlink is considered very important by search engines. It is indexed for the page containing it and also for the page that it points to, so anchor-text spam affects the ranking of both pages.

Link Spamming
Hyperlinks play an important role in determining the reputation score of a page, so spammers also spam on hyperlinks.
Out-link spamming:
– It is quite easy to add out-links in one’s pages pointing to some authoritative pages, to boost the hub scores of one’s pages. (A page is a hub page if it points to many authoritative, or quality, pages.)
– To create massive out-links, spammers may use a technique called directory cloning: there are many directories on the Web, e.g., Yahoo!, DMOZ Open Directory, and spammers simply replicate a large portion of a directory in the spam page to create a massive out-link structure quickly.

Link Spamming
In-link spamming is harder to achieve because it is not easy to add hyperlinks to the Web pages of others. Spammers typically use one or more of the following techniques.
Creating a honey pot: if a page wants to have a high reputation/quality score, it needs quality pages pointing to it.
– This method creates some important pages that contain links to target spam pages.
– For example, the spammer can create a set of pages that contain some very useful information, e.g., a glossary of Web mining terms, or Java FAQ and help pages.
Posting links to user-generated content (reviews, forum discussions, blogs, etc.):
– There are numerous sites on the Web that allow users to freely post messages; these are called user-generated content.
– Spammers can add links pointing to their pages to the seemingly innocent messages that they post.
Participating in link exchange:
– Many spammers form a group and set up a link exchange scheme so that their sites point to each other, in order to promote the pages of all the sites.

Hiding Techniques
Content hiding: spam items are made invisible. One simple method is to make the spam terms the same color as the background color.
– For example, one may set the font color of the spam items to the page’s background color (e.g., white text on a white background).
Cloaking: spam Web servers return one HTML document to the user and a different document to a Web crawler. Spam Web servers can identify Web crawlers in one of two ways:
– Maintain a list of IP addresses of search engines, and identify search engine crawlers by matching IP addresses.
– Identify Web browsers based on the user-agent field in the HTTP request message. For instance, the user-agent name in the following HTTP request message is the one used by Microsoft Internet Explorer:
  GET /pub/WWW/TheProject.html HTTP/1.1
  Host:
  User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Search engine crawlers usually identify themselves by names distinct from those of normal Web browsers.

Combating Spam
Combating spam can be seen as a classification problem: predicting whether a page is a spam page or not. One can use any classification algorithm to train a spam classifier. The key issue is to design the features used in learning. The following are some example features used in [Ntoulas, Najork, Manasse, Fetterly, 2006] to detect spam (a feature-extraction sketch follows below):
– Number of words in the page: a spam page tends to contain more words than a non-spam page, so as to cover a large number of popular words.
– Average word length: the mean word length for English prose is about 5 letters; the average word length of synthetic content is often different.
– Number of words in the page title: since search engines usually give extra weight to terms appearing in page titles, spammers often put many keywords in the titles of spam pages.
– Fraction of visible content: spam pages often hide spam terms by making them invisible to the user.
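A Python sketch of extracting a few of these features from a page. The function name and inputs are hypothetical, and the fraction-of-visible-content definition here (visible characters over total HTML characters) is one plausible reading of the paper’s feature, not its exact definition:

```python
import statistics

def spam_features(title, page_html, visible_text):
    """Compute a few of the content features listed above (illustrative only)."""
    words = visible_text.split()
    return {
        "num_words": len(words),
        "avg_word_length": (statistics.mean(len(w) for w in words)
                            if words else 0.0),
        "num_title_words": len(title.split()),
        "visible_fraction": len(visible_text) / max(len(page_html), 1),
    }

# Feature vectors like these, computed over pages labeled spam / non-spam,
# can then be fed to any off-the-shelf classifier (decision tree, naive
# Bayes, etc.) to train a spam detector.
print(spam_features("cheap pills cheap pills",
                    "<html><body>buy cheap pills now</body></html>",
                    "buy cheap pills now"))
```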