Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce

Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Overview of MapReduce and Hadoop
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Standard architecture emerging: – Cluster of commodity.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
MapReduce Simplified Data Processing On large Clusters Jeffery Dean and Sanjay Ghemawat.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Jeffrey D. Ullman Stanford University.  Mining of Massive Datasets, J. Leskovec, A. Rajaraman, J. D. Ullman.  Available for free download at i.stanford.edu/~ullman/mmds.html.
Hadoop: The Definitive Guide Chap. 8 MapReduce Features
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Google MapReduce Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Presented by Conroy Whitney 4 th year CS – Web Development.
MapReduce: Simplified Data Processing on Large Clusters. Department of Computer Science, 김정수.
Süleyman Fatih GİRİŞ CONTENT 1. Introduction 2. Programming Model 2.1 Example 2.2 More Examples 3. Implementation 3.1 ExecutionOverview 3.2.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Zois Vasileios Α. Μ :4183 University of Patras Department of Computer Engineering & Informatics Diploma Thesis.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Facts: Data-intensive applications with petabytes of data. Web pages: 20+ billion web pages x 20KB = 400+ terabytes. One computer can read
MAP REDUCE : SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS Presented by: Simarpreet Gill.
MapReduce How to painlessly process terabytes of data.
Mining High Utility Itemset in Big Data
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Bi-Hadoop: Extending Hadoop To Improve Support For Binary-Input Applications Xiao Yu and Bo Hong School of Electrical and Computer Engineering Georgia.
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
MapReduce Algorithm Design Based on Jimmy Lin’s slides
MapReduce and the New Software Stack CHAPTER 2 1.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
MapReduce Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University.
MapReduce Basics Chapter 2 Lin and Dyer & /tutorial/
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Jimmy Lin and Michael Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Michele Iovino Facoltà di Ingegneria dell’Informazione, Informatica.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
Hadoop MapReduce Framework
Introduction to MapReduce and Hadoop
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Ministry of Higher Education
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
Hunan University, School of Information Science and Engineering, Department of Computer Science
Hadoop Basics.
Cse 344 May 4th – Map/Reduce.
CS639: Data Management for Data Science
COS 518: Distributed Systems Lecture 11 Mike Freedman
MapReduce: Simplified Data Processing on Large Clusters
Map Reduce, Types, Formats and Features
Presentation transcript:

Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis, Panayiotis Bozanis Department of Computer & Communication Engineering, University of Thessaly, Greece 13th International Conference on Web Information System Engineering WISE 2012, November 28-30, Paphos, Cyprus

Scientometrics Metrics evaluating the research work of a scientist by assigning impact scores to his/her articles. Usually expressed as definitions of the form: a scientist a is of value V if at least V of his/her articles each receive a score of at least f(V). Hence, a researcher must author numerous high-quality articles.

h-index The first and most popular of these metrics is the h-index (Hirsch, 2005). A researcher a has h-index h if h of his/her P_a articles have received at least h citations each. This metric indicates the impact of an author: he/she must not only produce numerous papers; his/her papers must also be cited frequently.
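For example, an author whose five papers have received 10, 8, 5, 4, and 2 citations has h-index 4: four papers have at least 4 citations each, but there are not five papers with at least 5. A minimal sketch of the computation in plain Python (an illustration, not code from the talk):

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h

assert h_index([10, 8, 5, 4, 2]) == 4
```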

Time-Aware h-index Variants The contemporary and trend h-indices (Sidiropoulos, 2006) introduced temporal aspects into the evaluation of a scientist's work. They assign each article of an author a time-decaying score: Contemporary score: S_c(p_i) = γ · ((ΔY)_i + 1)^(−δ) · |C(p_i)|. Trend score: S_t(p_i) = γ · Σ_{p_j ∈ C(p_i)} ((ΔY)_j + 1)^(−δ).

Contemporary and Trend h-indices (ΔY)_i: the time (in years) elapsed since the publication of article p_i. |C(p_i)|: the number of papers citing p_i. Contemporary score: the value of an article decays over time. Trend score: an article is important if it continues to be cited in the present. A scientist a has contemporary h-index h_c if at least h_c of his/her articles receive a score S_c ≥ h_c.
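A hedged Python sketch of the two scores, following the definitions above; γ = 4 and δ = 1 are the parameter values reported in the literature for these indices and serve here only as placeholder defaults:

```python
def contemporary_score(pub_year, n_citations, now, gamma=4.0, delta=1.0):
    """S_c: the article's value decays with its own age (ΔY)_i."""
    age = now - pub_year                      # (ΔY)_i
    return gamma * (age + 1) ** (-delta) * n_citations

def trend_score(citing_years, now, gamma=4.0, delta=1.0):
    """S_t: each citation decays with the age of the *citing* paper."""
    return gamma * sum((now - y + 1) ** (-delta) for y in citing_years)

# A 2-year-old paper with 10 citations scores 4 * (2+1)^-1 * 10 ~= 13.3;
# its trend score stays high only if the citing papers are recent.
```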

Scientometrics Computation Easy for small datasets: for the h-index we just need to identify the articles of each researcher and enumerate all their incoming citations. However, for large datasets the computation becomes more complex: there are tens of millions of articles and authors, and the data (authors, citations, and metadata) do not fit in main memory.

Academic Search Engines

MapReduce (1) MapReduce: a fault-tolerant framework for distributing problems to large clusters. The input data is split into chunks; each chunk is processed by a single worker process (Map). Mapper outputs are written to intermediate files. Subsequently, the Reducers process and merge the Mappers' outputs. All data is formatted as key-value pairs.

MapReduce (2) MapReduce algorithms are expressed by writing only two functions: Map: (k1, v1) → list(k2, v2). Reduce: (k2, list(v2)) → list(v3). MapReduce jobs are deployed on top of a distributed file system which serves files, replicates data, and transparently handles hardware failures.
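As an illustration of these signatures, the classic word-count job, sketched as a local single-process Python simulation (it assumes nothing about an actual Hadoop deployment):

```python
from collections import defaultdict

def map_fn(doc_id, text):                 # (k1, v1) -> list of (k2, v2)
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):              # (k2, list of v2) -> list of v3
    yield word, sum(counts)

def run_local(records):
    groups = defaultdict(list)            # stands in for the shuffle/sort phase
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [kv for k, vs in sorted(groups.items()) for kv in reduce_fn(k, vs)]

print(run_local([(1, "to be or not to be")]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```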

Combiners An optional component standing between the Mappers and the Reducers. A Combiner serves as a mini-Reducer: it merges the values of the same key for each Map task. It reduces the data exchanged among the system's nodes, thus saving bandwidth. Implementations: Explicit (declare a Combine function) and In-Mapper (merge the values of the same key within map). The in-mapper variant has guaranteed execution!
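A sketch of the in-mapper variant on the word-count example (plain Python, illustrative only): aggregation happens inside the Mapper itself, so it always runs, whereas Hadoop treats an explicit Combiner as an optimization it may invoke zero or more times:

```python
from collections import defaultdict

class InMapperCombiningWordCount:
    def __init__(self):
        self.buffer = defaultdict(int)    # per-task partial aggregates

    def map(self, doc_id, text):
        for word in text.split():
            self.buffer[word] += 1        # combine locally, emit nothing yet

    def close(self):                      # called once, at the end of the task
        for word, count in self.buffer.items():
            yield word, count             # one record per distinct key
```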

Parallelizing the Problem Goal: compute scientometrics in parallel. Input: the set of articles, each with its authors and list of references. Output: the value of the desired metric for each author. To reach our goal, we have to construct, for each author, a list of his/her articles sorted by decreasing score: then we just iterate through the list and compute the desired metric value.

Methods 1, 2 – Map Phase Method 1 emits (author, ⟨paper, score⟩) pairs; Method 2 emits (⟨author, paper⟩, score) pairs. Notice: we emit the referenced papers' authors, not the citing paper's authors.
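A sketch of the two Map functions in Python. The record layout (a paper carrying its list of references, each with its own id and authors) is a hypothetical schema for illustration, not the talk's actual input format; partial_score stands for whichever metric-specific score is being accumulated (1.0 for the plain h-index):

```python
def partial_score(ref, citing_record):
    return 1.0                            # plain citation count; time-aware
                                          # metrics would apply a decay here

def map_method1(paper_id, record):
    for ref in record["references"]:      # credit goes to the *cited* papers
        s = partial_score(ref, record)
        for author in ref["authors"]:     # the referenced paper's authors
            yield author, (ref["id"], s)  # key: author, value: <paper, score>

def map_method2(paper_id, record):
    for ref in record["references"]:
        s = partial_score(ref, record)
        for author in ref["authors"]:
            yield (author, ref["id"]), s  # key: <author, paper>, value: score
```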

Method 1, Reduce Phase Reducer input: (author, ⟨paper, score⟩ pairs). Create an associative array which stores, for each author, a list of ⟨paper, score⟩ pairs, summing the partial scores of each paper. Sort the array by descending score and compute the metric.
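A corresponding Reduce sketch (Python, same hypothetical schema as above): sum the partial scores per paper, sort, and sweep the ranked list h-index style:

```python
from collections import defaultdict

def reduce_method1(author, paper_score_pairs):
    totals = defaultdict(float)
    for paper_id, score in paper_score_pairs:
        totals[paper_id] += score         # sum partial paper scores
    ranked = sorted(totals.values(), reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:                 # h-index-style sweep over the list
            h = rank
    yield author, h
```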

Method 2, Reduce Phase Reducer input: (⟨author, paper⟩, score). Keys are sorted by author and then by paper (secondary sort). We create an associative array which stores, for each author, a list of ⟨paper, score⟩ pairs, and compute the metric.
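Method 2's Reducer sees one score per record and relies on the sort order. A sketch assuming the records arrive already sorted by (author, paper), as a secondary sort would deliver them:

```python
from itertools import groupby

def reduce_method2(sorted_records):
    """sorted_records: ((author, paper), score) tuples, sorted by key."""
    for author, group in groupby(sorted_records, key=lambda kv: kv[0][0]):
        scores = []
        for paper, per_paper in groupby(group, key=lambda kv: kv[0][1]):
            scores.append(sum(s for _, s in per_paper))   # total per paper
        scores.sort(reverse=True)
        yield author, scores   # feed this list to the metric (e.g. h_index)
```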

Method 1-C, Map-Reduce In-Mapper Combiner (unique keys): map emits each unique author as key and a list of ⟨paper, score⟩ pairs as value, merging the pairs of the same author locally. The Reducer merges the lists associated with the same key; the merged list is sorted by score.
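A sketch of the in-mapper combining version of Method 1 (same hypothetical schema): each map task accumulates per-author, per-paper partial scores and emits a single (author, list) record per distinct author when it finishes:

```python
from collections import defaultdict

class Method1CMapper:
    def __init__(self):
        # author -> paper -> accumulated partial score, for this map task
        self.acc = defaultdict(lambda: defaultdict(float))

    def map(self, paper_id, record):
        for ref in record["references"]:
            for author in ref["authors"]:
                self.acc[author][ref["id"]] += 1.0   # partial_score = 1.0

    def close(self):
        for author, papers in self.acc.items():      # unique authors only
            pairs = sorted(papers.items(), key=lambda kv: -kv[1])
            yield author, pairs      # one <paper, score> list per author
```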

Method 2-C, Map-Reduce In-Mapper Combiner (unique keys): map emits unique ⟨author, paper⟩ pairs as keys and the locally accumulated score as value. The Reducer merges the values associated with the same key; the resulting list is sorted by score.

Experiments We applied our algorithms to the CiteSeerX dataset, an open repository comprising 1.8 million research articles. We used the XML version of the dataset. Total input size: ~28 GB. Small, but the largest publicly available.

MapReduce I/O Sizes The methods which employ Combiners perform noticeably better. Method 1-C: the Mappers produce 21.7 million key-value records (a gain of ~41%); total output size = 600 MB (~13% less bandwidth). Method 2-C: 34.2 million records and 643 MB (a gain of ~7%).

Running Times We also measured the running times of the four methods on two clusters: a local, small cluster comprised of 8 Core i7 processing cores, and a commercial Web cloud infrastructure with up to 40 processing cores. On the first cluster we replicated the input data across all nodes; in the second case we are not aware of the data's physical location.

Running Times All four methods have the same computational complexity, so we expect timings proportional to the size of the data exchanged among the Mappers and the Reducers. This favors Methods 1-C and 2-C, which limit data transfers via the Combiners.

Running Times – Local Cluster All methods scale well with the size of the cluster. Method 1-C is the fastest: it outperforms Method 2-C by ~18% and Method 1 by 30-35%.

Running Times – Web Cluster We repeated the experiment on a Web cloud infrastructure. Running times between the two clusters are not comparable (different hardware and architecture). Method 1-C is still the fastest.

Conclusions We studied the problem of computing author evaluation metrics (scientometrics) in large academic search engines with MapReduce. We introduced four approaches and showed that the most efficient strategy is to create one list of ⟨paper, score⟩ pairs for each unique author during the map phase. In this way we achieve at least 20% lower running times and save ~13% bandwidth.

Thank you! Any Questions?