Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.

1 Graph-based Algorithms in Large Scale Information Retrieval
Fatemeh Kaveh-Yazdy
Computer Engineering Department, School of Electrical and Computer Engineering, Yazd University
– Graduate Research Assistant @ Web Lab. & IPKD Lab., YU
– Senior Research Fellow @ Parsijoo
– External Research Member of MSC Lab., DUT

2 Outline
– Information Retrieval Systems: Search Engines
– Graphs in Information Retrieval
  – Connection-based Ranking
  – Spamming
  – Spam Detection
– A Real World Case
Sections: Introduction to Information Retrieval · Graphs in Information Retrieval · A Real World Case…

3 Introduction to IR
– Enterprise Document Retrieval
– Web Information Retrieval Systems: Search Engines
Web Retrieval vs. Document Retrieval
– Structure of documents
– Scale
– Domain
– Users
– Query Specificity
– Determination
Section topics: Search Engines · Trust in Web

4 Architecture of Search Engines
[Figure: architecture diagram with components labeled Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Queries, Client]

5 Cont.
Web Structure
– Meta Data
– Linkage
Applications of Web Structure
– Crawling
– Indexing
– Ranking
[Figure labels: www.sharif.edu, math.sharif.edu, Math Dept.]

6 Trust in Web Structure
Cite / Link
– Use / Quote / Express favor
– Trust / Applicability
Assumption
– A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
Recursion: the quality of a page is related to
– its in-degree,
– the quality of the pages linking to it
[Figure: pages A and B]

7 Random Surfer on the Web
Page and Brin [1] introduce the random surfer model
Definition
– The random surfer starts from a random page
– The surfer proceeds to a randomly chosen successor of the current page (each with probability 1/outdegree)
Section topics: Random Surfer Model · PageRank · PageRank with Teleport · Spamming
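The definition above can be tried out with a short simulation. This is only a minimal sketch: the three-page graph, the function name `random_surf`, and the choice to stop at a page without successors are illustrative assumptions, not something taken from the slides.

```python
import random
from collections import Counter

def random_surf(graph, steps, seed=0):
    """Simulate one random surfer and return the fraction of visits per page."""
    rng = random.Random(seed)
    pages = sorted(graph)
    current = rng.choice(pages)            # the surfer starts from a random page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        successors = graph[current]
        if not successors:                 # assumption: stop at a page with no out-links
            break
        current = rng.choice(successors)   # each successor has probability 1/outdegree
    total = sum(visits.values())
    return {page: visits[page] / total for page in pages}

# Hypothetical toy graph: A -> B, A -> C, B -> C, C -> A
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = random_surf(toy_graph, steps=100_000)
```

On this toy graph the long-run visit fractions approach 0.4 for A, 0.2 for B, and 0.4 for C, which is the stationary distribution of the walk.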

8–16 Random Surfer on the Web (II), (III)
[Figure sequence: animation frames marking the surfer's position ("s") on an example web graph; images not preserved in the transcript]

17 PageRank
Each page inherits its rank from its ancestors (the pages that link to it).
Issues
– The web graph is not strongly connected
– Convergence of PageRank is not guaranteed
– Effects of sink nodes
  – Pages without out-links
  – Trapping pages
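The effect of a sink node can be seen in a few lines. The sketch below assumes the simplest update rule (each page splits its current rank equally among its successors); the graph and the name `pagerank_step` are hypothetical and chosen only for illustration.

```python
def pagerank_step(graph, rank):
    """One update: every page splits its rank equally among its successors."""
    new_rank = {page: 0.0 for page in graph}
    for page, successors in graph.items():
        for succ in successors:
            new_rank[succ] += rank[page] / len(successors)
    return new_rank

# Hypothetical graph with a sink: page C has no out-links
sink_graph = {"A": ["B", "C"], "B": ["C"], "C": []}
rank = {page: 1.0 / len(sink_graph) for page in sink_graph}
for _ in range(5):
    rank = pagerank_step(sink_graph, rank)
total_rank = sum(rank.values())   # rank that entered C was never passed on
```

Because C never forwards the rank it receives, the total rank in this toy graph drains to zero within a few updates instead of settling on a meaningful distribution.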

18 Cont.
[Figure: example web graph with surfer positions ("s") and one node labeled "Sink"; image not preserved in the transcript]

19 PageRank with Teleport
Teleport
– The random surfer jumps from a node to any other node
– The destination is chosen uniformly from all nodes
  Prob. of selecting each node is 1/n
– In each node, the surfer has the option of jumping
  Prob. of jumping is α (0 ≤ α ≤ 1)
  Damping factor: d = 1 − α
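A minimal sketch of the teleport rule as an iterative computation follows. It uses the slide's symbols (jump probability α, uniform destination with probability 1/n); the default α = 0.15, the iteration count, the toy graph, and the handling of sink pages (treated as always jumping) are assumptions added for the example.

```python
def pagerank_teleport(graph, alpha=0.15, iterations=100):
    """Iterate PageRank where the surfer jumps with probability alpha."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the teleport share alpha * (1/n)
        new_rank = {page: alpha / n for page in pages}
        for page, successors in graph.items():
            if successors:
                # With probability d = 1 - alpha the surfer follows a link
                share = (1 - alpha) * rank[page] / len(successors)
                for succ in successors:
                    new_rank[succ] += share
            else:
                # Assumption: a sink page always jumps to a uniformly chosen page
                share = (1 - alpha) * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph with a sink: page C has no out-links
sink_graph = {"A": ["B", "C"], "B": ["C"], "C": []}
ranks = pagerank_teleport(sink_graph)
```

With this rule the ranks keep summing to 1 even though C is a sink, and on this toy graph C ends up ranked above B, which ranks above A.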

20 Spamming
Spam
– The manipulation of web page content for the purpose of appearing high up in search results.
Spamming Techniques
– Text content manipulation (tags, comments, invisible text blocks)
– Structural content manipulation (mimicking important websites)

21 Spam Detection
Spam Detection Methods
– Text Spam
  – Comparing word probability
– Link-farm Spam
  – Trust/Anti-trust Rank
  – Community Detection

22 Link-farm Spam Detection
Link-farm Spam
– Trust Rank
– Anti-trust
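The slide gives no details of either method. As a rough illustration, one common formulation of TrustRank is a PageRank-style iteration whose teleport returns only to a hand-picked set of trusted seed pages, so trust flows along links away from the seeds. The sketch below assumes that formulation; the graph, page names, function name, and parameter values are all hypothetical.

```python
def trust_scores(graph, trusted_seeds, alpha=0.15, iterations=50):
    """Propagate trust from seed pages along out-links (TrustRank-style sketch)."""
    pages = list(graph)
    # Teleport mass is spread over the trusted seeds only, not over all pages
    seed_mass = {
        page: (1.0 / len(trusted_seeds) if page in trusted_seeds else 0.0)
        for page in pages
    }
    trust = dict(seed_mass)
    for _ in range(iterations):
        new_trust = {page: alpha * seed_mass[page] for page in pages}
        for page, successors in graph.items():
            for succ in successors:
                new_trust[succ] += (1 - alpha) * trust[page] / len(successors)
        trust = new_trust
    return trust

# Hypothetical graph: two reputable pages, plus a link farm boosting "spam"
web = {
    "good": ["ok"],
    "ok": ["good"],
    "farm1": ["farm2", "spam"],
    "farm2": ["farm1", "spam"],
    "spam": ["farm1"],
}
scores = trust_scores(web, trusted_seeds={"good"})
```

Since no trusted page links into the farm, the farm pages and "spam" receive no trust in this toy graph, however densely they link to each other. An anti-trust variant would, under the same assumptions, run the propagation over reversed links starting from known spam pages.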

23 Parsijoo

24 A Real World Case…
Parsijoo Facts
– Crawled Pages: (1 × 10^9 /month) rem. 500 × 10^6
  Crawling rate: 2000 pages/sec
– Cached URLs: 10 × 10^9
  80,000 URLs/sec
  10 × 10^6 unique hosts (each host needs one queue)
– Unique URLs: 800 × 10^6
– Unique Words: 80 × 10^6
– Unique Requests: 200 × 10^3 /day

25 A Real World Case…
Parsijoo Facts
– Requests (per day)
  Web: 100 K
  Image: 35 K
  News: 10 K
  Music: 10 K
  Scholar: 1 K
  Video: 5 K
  SADANA, etc.: 35 K
– Unique Requests: 200 × 10^3 /day
