Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.

Presentation transcript:

Graph-based Algorithms in Large Scale Information Retrieval
Fatemeh Kaveh-Yazdy
Computer Engineering Department, School of Electrical and Computer Engineering, Yazd University
Graduate Research Web Lab. & IPKD Lab., YU
Senior Research Parsijoo
External Research Member of MSC Lab., DUT

Slide 2 Outline
– Introduction to Information Retrieval
  – Information Retrieval Systems: Search Engines
– Graphs in Information Retrieval
  – Connection-based Ranking
  – Spamming
  – Spam Detection
– A Real World Case

Slide 3 Introduction to IR
Section: Introduction to Information Retrieval (Search Engines, Trust in Web)
– Enterprise Document Retrieval
– Web Information Retrieval Systems: Search Engines
– Web Retrieval vs. Document Retrieval
  – Structure of documents
  – Scale
  – Domain
  – Users
  – Query Specificity
  – Determination

Slide 4 Architecture of Search Engines
[Figure: search engine architecture. Labeled components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Queries, Client]

Slide 5 Cont.
– Web Structure
  – Meta Data
  – Linkage
– Applications of Web Structure
  – Crawling
  – Indexing
  – Ranking
[Figure labels: math.sharif.edu, Math Dept.]

Slide 6 Trust in Web Structure
– Cite / Link
  – Use / Quote / Express favoring
  – Trust / Applicability
– Assumption
  – A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
– Recursion: the quality of a page is related to
  – its in-degree,
  – the quality of the pages linking to it
[Figure: page A linking to page B]
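The first ingredient of the recursion, the in-degree, can be computed directly from a list of links. The sketch below is only an illustration: the edge list and the helper name `in_degrees` are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical edge list: (A, B) means page A links to page B,
# i.e. B is a successor of A.
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d"), ("d", "a")]

def in_degrees(edges):
    """Count how many links point at each page."""
    counts = Counter(dst for _, dst in edges)
    # Pages that only appear as link sources get an explicit 0.
    for src, _ in edges:
        counts.setdefault(src, 0)
    return dict(counts)

degrees = in_degrees(edges)
```

In-degree alone treats every recommendation as equally valuable; the second ingredient of the recursion (the quality of the linking pages) is what the ranking slides that follow address.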

Slide 7 Random Surfer on the Web
Section: Graphs in Information Retrieval (Random Surfer Model, PageRank, PageRank with Teleport, Spamming)
– Page and Brin [1] introduce the random surfer model
– Definition
  – The random surfer starts from a random page
  – The surfer proceeds to a randomly chosen successor of the current page (each successor is chosen with probability 1/outdegree)
[Figure: web graph with the surfer marked s]
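The definition above can be simulated in a few lines. This is a minimal sketch under stated assumptions: the three-page graph, the function name `random_surf`, and the fixed seed are illustrative choices, and every page in the toy graph has at least one successor so the walk never gets stuck.

```python
import random
from collections import Counter

# Hypothetical web graph: page -> list of successors (pages it links to).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def random_surf(graph, steps, seed=0):
    """Simulate one random surfer and return the visit frequency of each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(graph))  # the surfer starts from a random page
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        # Each successor is chosen with probability 1/outdegree.
        page = rng.choice(graph[page])
    return {p: visits[p] / steps for p in graph}

freq = random_surf(graph, steps=100_000)
```

For this toy graph the long-run visit frequencies approach 0.4 for `a` and `c` and 0.2 for `b`; pages the surfer visits more often are the ones the model treats as more important.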

Slides 8 to 16 Random Surfer on the Web (II), (III)
[Animation frames: positions of the surfer(s), marked s, on the web graph]

Slide 17 PageRank
– Each page inherits its rank from its ancestors (the pages that link to it).
– Issues
  – The web graph is not strongly connected
  – Convergence of PageRank is not guaranteed
  – Effects of sink nodes
    – Pages without outgoing links
    – Trapping pages
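The effect of sink nodes shows up on a toy example. The sketch below assumes the usual update in which each page splits its current rank equally among its successors; the two graphs and the function names are hypothetical, and "trapping page" is modeled here as a page that links only to itself.

```python
def pagerank_step(graph, rank):
    """One update: each page splits its rank equally among its successors."""
    new_rank = {page: 0.0 for page in graph}
    for page, successors in graph.items():
        for succ in successors:
            new_rank[succ] += rank[page] / len(successors)
    return new_rank

def iterate(graph, steps):
    """Start from a uniform rank vector and apply the update `steps` times."""
    rank = {page: 1.0 / len(graph) for page in graph}
    for _ in range(steps):
        rank = pagerank_step(graph, rank)
    return rank

# Hypothetical graphs. In `with_sink`, page "d" has no outgoing links.
with_sink = {"a": ["b", "d"], "b": ["c"], "c": ["a"], "d": []}
# In `with_trap`, page "d" links only to itself.
with_trap = {"a": ["b", "d"], "b": ["c"], "c": ["a"], "d": ["d"]}

leaked = iterate(with_sink, 50)    # total rank drains away through "d"
trapped = iterate(with_trap, 50)   # almost all rank ends up on "d"
```

In the first graph the total rank shrinks toward zero because whatever reaches `d` is never passed on; in the second the total stays at 1 but nearly all of it accumulates on `d`.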

Slide 18 Cont.
[Figure: web graph with surfers (s) and a sink node]

Slide 19 PageRank with Teleport
– Teleport
  – The random surfer jumps from a node to any other node
  – The destination is chosen uniformly from all nodes: the probability of selecting each node is 1/n
  – At each node, the surfer has the option of jumping: the probability of jumping is α (0 ≤ α ≤ 1)
  – Damping factor: d = 1 - α
[Figure: web graph with the surfer marked s]
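A minimal sketch of the teleporting surfer as an iterative computation follows. Two points are assumptions rather than content of the slide: rank held by pages without successors is spread uniformly over all pages, and the default `alpha=0.15`, the tolerance, and the example graph are illustrative values.

```python
def pagerank_teleport(graph, alpha=0.15, tol=1e-10, max_iter=1000):
    """PageRank where the surfer jumps to a uniform random page with prob. alpha."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(max_iter):
        # Assumption: rank sitting on pages without successors is
        # redistributed uniformly, as if the surfer always jumps from them.
        dangling = sum(rank[p] for p, succ in graph.items() if not succ)
        new_rank = {p: alpha / n + (1 - alpha) * dangling / n for p in graph}
        for page, successors in graph.items():
            for succ in successors:
                # With probability d = 1 - alpha the surfer follows a link.
                new_rank[succ] += (1 - alpha) * rank[page] / len(successors)
        change = sum(abs(new_rank[p] - rank[p]) for p in graph)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical graph in which page "d" has no outgoing links.
graph = {"a": ["b", "d"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank_teleport(graph)
uniform = pagerank_teleport(graph, alpha=1.0)
```

With teleport the ranks keep summing to 1 and every page receives at least α/n. Setting α = 1 removes the influence of links entirely and yields the uniform distribution.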

Slide 20 Spamming
– Spam
  – The manipulation of web page content for the purpose of appearing high up in search results.
– Spamming Techniques
  – Text content manipulation (tags, comments, invisible text blocks)
  – Structural content manipulation (mimicking important websites)

Slide 21 Spam Detection
– Spam Detection Methods
  – Text Spam
    – Comparing word probability
  – Link-farm Spam
    – Trust/Anti-trust Rank
    – Community Detection

Slide 22 Link-farm Spam Detection
– Link-farm Spam
  – Trust Rank
  – Anti-trust
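The slide names Trust Rank without giving details. A common formulation, assumed here, is a PageRank variant in which the jump always lands on a hand-picked set of trusted seed pages, so trust flows only along links that start from those seeds. The graph, the seed set, and the parameter values below are hypothetical.

```python
def trust_rank(graph, trusted, alpha=0.15, iterations=50):
    """Propagate trust from seed pages along outgoing links (sketch)."""
    # Jumps land only on trusted seeds, uniformly among them.
    seed = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in graph}
    trust = dict(seed)
    for _ in range(iterations):
        new_trust = {p: alpha * seed[p] for p in graph}
        for page, successors in graph.items():
            for succ in successors:
                new_trust[succ] += (1 - alpha) * trust[page] / len(successors)
        # Simplification: trust on pages without successors is dropped.
        trust = new_trust
    return trust

# Hypothetical graph: "a", "b", "c" are regular pages; "x", "y", "z" form a
# link farm that links to "a", but no regular page links back to the farm.
graph = {
    "a": ["b", "c"], "b": ["c"], "c": ["a"],
    "x": ["y", "a"], "y": ["x"], "z": ["x"],
}
trust = trust_rank(graph, trusted={"a"})
```

Because no trusted or trust-reachable page links into the farm, `x`, `y`, and `z` receive zero trust no matter how densely they link to each other. Anti-trust is usually described as the mirror image, propagating distrust backwards from known spam pages; it is not sketched here.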

Slide 23 Parsijoo
Section: A Real World Case… (Parsijoo)

Slide 24 A Real World Case…
– Parsijoo Facts
  – Crawled Pages: (1 x 10^9 /month) rem. 500 x 10^6
    Crawling rate: 2000 pages/sec
  – Cached URLs: 10 x ,000 URL /sec
    10 x 10^6 Unique Hosts (each host needs one queue)
  – Unique URLs: 800 x 10^6
  – Unique Words: 80 x 10^6
  – Unique Requests: 200 x 10^3 /day

Slide 25 A Real World Case…
– Parsijoo Facts
  – Requests (per day)
    Web: 100 K
    Image: 35 K
    News: 10 K
    Music: 10 K
    Scholar: 1 K
    Video: 5 K
    SADANA, etc.: 35 K
  – Unique Requests: 200 x 10^3 /day
