Saeed Rahmani, Dr. Mohammd Hadi Sadroddini Shiraz University


1 Utilizing Distributed Environment for Clustering Web Page Streams in Search Engines
Saeed Rahmani, Dr. Mohammd Hadi Sadroddini, Shiraz University

2 Overview
- Introduction
- Proposed Model
- FICA as Crawler Algorithm
- Incremental Clustering
- Distributed Environment
- MapReduce
- PowerGraph
- Reference

3 The Age of Big Data
Sources: Social Media, Science, Advertising, Web
- YouTube: 72 hours of video each minute
- Wikipedia: 28 million pages
- Facebook: 1 billion users
- Flickr: 6 billion photos
Goals:
- Identify influential people and information
- Find communities
- Target ads and products
- Model complex data dependencies

4 Powerful tools for tackling large-data problems
Application areas:
- Bioinformatics: DNA sequence assembly, protein-protein interaction networks
- Recommendation systems
- Search engines
- Text processing
- Machine translation (example input: "How are you?")
Tools: MapReduce, PowerGraph, Spark, Storm, …

5 Proposed Model
[Diagram: pipeline of the proposed model. Components, in the order they appear:]
- Web Crawler: FICA [1], PowerGraph [2], Web Graph (URLs and links)
- Temp Repository
- Web Page Pre-processing [3]
- Important N-gram Detection Unit [4]
- Incremental Clustering [6]
- MapReduce [5]
- Clustered Web Page Repository
- Applications: Search Engine, News Analysis, …

6 Fast Intelligent Crawling Algorithm (FICA)
[Figure: Throughput of crawling algorithms where the benchmark ranking is PageRank [1].]
[Figure: Logarithmic distance in FICA [1].]

7 Incremental clustering [6]
[Figure: documents D1, D2, D3, D4, D5, D6.]

8 Main Concept
- Document Representation
- Document Similarity
- Nearest Neighbor
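The three concepts above can be combined into a very small incremental clusterer. The sketch below is only an illustration, not the method used in the presentation: it assumes bag-of-words term-frequency vectors for document representation, cosine similarity for document similarity, and a nearest-cluster rule with a hypothetical `threshold` parameter. The function names and the choice of summed term vectors as cluster centroids are assumptions made for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(documents, threshold=0.3):
    """Assign each arriving document to its nearest cluster, or open a new one.

    `threshold` is an assumed tuning parameter, not a value from the slides.
    """
    centroids = []  # one summed term vector per cluster
    labels = []
    for text in documents:
        vec = vectorize(text)
        sims = [cosine(vec, c) for c in centroids]
        # Nearest neighbor among existing clusters (None if there are none).
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)  # Counter.update adds the counts
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


stream = [
    "iran football team wins match",
    "football match ends in draw",
    "new software download released",
    "download the new software update",
]
print(cluster_stream(stream))
```

Each document is seen once and never revisited, which is what makes the scheme suitable for a stream; the price is that the result depends on arrival order and on the threshold.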

9 N-gram
N-grams are sequences of tokens. The N stands for how many terms are used:
- Unigram: 1 term
- Bigram: 2 terms
- Trigram: 3 terms

List of Persian n-grams
Persian n-gram | Size | English equivalent
ایران | 1 | Iran
سایت | 1 | site
دانلود | 1 | download
جدید | 1 | new
انجمن | 1 | forum
صفحه اصلی | 2 | home page
ثبت نام | 2 | registration
درباره ما | 2 | about us
نرم افزار | 2 | software
وب سایت | 2 | web site
تماس با ما | 3 | contact us
پرسش و پاسخ | 3 | question and answer
جمهوری اسلامی ایران | 3 | Islamic republic of Iran
اس ام اس | 3 | SMS
دانلود نرم افزار | 3 | software download
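As a minimal sketch of the definition above, the following code extracts unigrams, bigrams and trigrams from whitespace-separated tokens and counts them. The function names are hypothetical, and whitespace tokenization is an assumption; real Persian text would need a proper tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n=3):
    """Count all n-grams of size 1..max_n in whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


counts = count_ngrams("islamic republic of iran news from iran")
print(counts[("iran",)])                # frequency of a unigram
print(counts[("islamic", "republic")])  # frequency of a bigram
```

Counting frequencies like this is one simple way to rank candidate n-grams; which n-grams count as "important" is a separate decision not covered here.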

10 MapReduce Framework [5]
Word Count example
Input:
- the quick brown fox
- the fox ate the mouse
- how now brown cow
Map output (one line per input split, with local combining):
- the, 1; quick, 1; brown, 1; fox, 1
- the, 2; fox, 1; ate, 1; mouse, 1
- how, 1; now, 1; brown, 1; cow, 1
Shuffle & Sort, then Reduce output (one line per reducer):
- brown, 2; fox, 2; how, 1; now, 1; the, 3
- ate, 1; cow, 1; mouse, 1; quick, 1

Combiner:
- Local reduce function for repeated keys produced by the same map
- For associative operations like sum, count, max
- Decreases the amount of intermediate data
- Example, local counting for Word Count: def combiner(key, values): output(key, sum(values))

When maps produce many repeated keys:
- It is often useful to do a local aggregation following the map
- Done by specifying a Combiner
- Goal is to decrease the size of the transient data
- Combiners have the same interface as Reducers, and often are the same class
- Combiners must not have side effects, because they run an indeterminate number of times
- In WordCount: conf.setCombinerClass(Reduce.class);
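The Word Count flow on this slide can be simulated in a single process. The sketch below is not the Hadoop API; the function names `map_phase`, `combine`, `shuffle` and `reduce_phase` are made up for illustration, and the three input lines are the ones from the slide. It shows how a combiner shrinks the intermediate data before the shuffle.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts of repeated keys within one map's output."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(all_pairs):
    """Shuffle and sort: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for word, count in all_pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
intermediate = []
for split in splits:
    # The combiner runs on each map's output before anything is shuffled.
    intermediate.extend(combine(map_phase(split)))
word_counts = reduce_phase(shuffle(intermediate))
print(word_counts)
```

Because summation is associative and commutative, applying `combine` zero, one or several times leaves the final counts unchanged, which is why the framework is free to run it an indeterminate number of times.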

11 Implementing FICA using Map-Reduce
DataSet: "uk-2002"
- Nodes: 18,520,486
- Edges: 298,113,762
[Figure: Execution time (sec) per iteration of Map-Reduce iterations using 100 random initial seeds.]
[Figure: Execution time (sec) per iteration using active and in-active nodes of Map-Reduce iterations using 100 random initial seeds.]

Speedup of improved Map-Reduce execution using active and in-active nodes
Phase | Speedup
Map | 2.23
Shuffle | 1.4
Sort | 3.24
Reduce | 1.15
All Time | 2.87

12 Natural Graphs
- More than 10^8 vertices have one neighbor.
- Top 1% of vertices are adjacent to 50% of the edges!
[Image from WikiCommons]

13 Power-Law Graphs are Difficult to Partition
- Power-law graphs do not have low-cost balanced cuts.
- Traditional graph-partitioning algorithms perform poorly on power-law graphs.
[Figure: a graph partitioned between CPU 1 and CPU 2.]

14 Curse of the Slow Job
[Figure: data processed by CPU 1, CPU 2 and CPU 3 over successive iterations, with a barrier after each iteration.]

15 PowerGraph [2]
[Figure: "Program For This" and "Run on This", with the graph placed on Machine 1 and Machine 2.]
- Split high-degree vertices
- New abstraction -> equivalence on split vertices

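16 Distributed Execution of a PowerGraph Vertex-Program
[Figure: vertex Y is split across Machine 1 to Machine 4, with one master copy and three mirrors.]
- Gather: each machine computes a partial sum (Σ1, Σ2, Σ3, Σ4); the partial sums are combined into Σ at the master.
- Apply: the master uses Σ to compute the new vertex value Y'.
- Scatter: Y' is sent to the mirrors, which update their copies.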
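To make the gather, apply and scatter phases concrete, here is a single-process sketch of PageRank written in that style. It is not the PowerGraph API (which is a C++ library); the function name `pagerank_gas`, the damping value and the unnormalized update rule are assumptions chosen for illustration, and the comments only indicate where the distributed steps from the slide would take place.

```python
def pagerank_gas(edges, num_iters=20, damping=0.85):
    """Synchronous PageRank expressed as gather, apply and scatter steps."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    in_neighbors = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbors[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    for _ in range(num_iters):
        # Gather: sum the contributions arriving over in-edges. In a
        # distributed run each machine would sum over its local edges only,
        # and the partial sums would then be added together at the master.
        gathered = {
            v: sum(rank[u] / out_degree[u] for u in in_neighbors[v])
            for v in vertices
        }
        # Apply: compute the new vertex value from the gathered total.
        new_rank = {v: (1 - damping) + damping * gathered[v] for v in vertices}
        # Scatter: publish the new values so neighbors (and, in a distributed
        # run, the mirrors) see them in the next iteration.
        rank = new_rank
    return rank


print(pagerank_gas([("a", "b"), ("b", "c"), ("c", "a")]))
```

The gather step works as a sum of partial sums precisely because addition is associative and commutative, which is what allows a high-degree vertex to be split across machines without changing the result.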

17 PageRank on the Twitter Follower Graph
Natural graph with 40M users, 1.4 billion links
[Figure: Communication, total network traffic (GB). Annotation: "Reduces Communication".]
[Figure: Runtime (seconds). Annotation: "Runs Faster".]
Setup: 32 nodes x 8 cores (EC2 HPC cc1.4x)

18 Reference
[1] Zareh Bidoki, Ali Mohammad, Nasser Yazdani, and Pedram Ghodsnia. "FICA: a fast intelligent crawling algorithm." Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 2007.
[2] Gonzalez, Joseph E., et al. "PowerGraph: Distributed graph-parallel computation on natural graphs." Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
[3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
[4] Balkir, Atilla Soner, Ian Foster, and Andrey Rzhetsky. "A distributed look-up architecture for text mining applications using MapReduce." High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for. IEEE, 2011.
[5] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008).
[6] Khalilian, M., and N. Mustapha. "Data stream clustering: Challenges and issues." arXiv preprint, 2010.
[7] Chung, Seokkyung, Dennis McLeod, and Jongeun Jun. "Incremental Mining from News Streams." (2009).
[8] Power, R., and Li, J. "Piccolo: building fast, distributed programs with partitioned tables." In OSDI (2010).
[9] Low, Yucheng, et al. "GraphLab: A new framework for parallel machine learning." arXiv preprint (2010).

19 ?

