Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC (High Performance Distributed Computing), June SNU IDB Lab. Lee, Inhoe
Outline Introduction Conventional Approach MapReduce Data Compression MapReduce Data Decompression Evaluation Conclusions
Introduction Semantic Web – An extension of the current World Wide Web – Information = a set of statements – Each statement = three terms: subject, predicate, and object
Introduction The terms consist of long strings – Most Semantic Web applications compress the statements to save space and increase performance – The technique used to compress the data is dictionary encoding
Motivation The amount of Semantic Web data is steadily growing – Compressing many billions of statements becomes more and more time-consuming – A fast and scalable compression technique is crucial Our approach: compress and decompress Semantic Web statements using the MapReduce programming model – This allowed us to reason directly on the compressed statements, with a consequent increase in performance [1, 2]
Outline Introduction Conventional Approach MapReduce Data Compression MapReduce Data Decompression Evaluation Conclusions
Conventional Approach Dictionary encoding – Compress data – Decompress data
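A minimal Python sketch of dictionary encoding, not the authors' implementation: every distinct term is given a numerical ID stored in a dictionary table, statements become triples of IDs, and decompression is the reverse lookup. Names and the example data are illustrative.

# Minimal sketch of dictionary encoding for RDF-like statements (illustrative only).
def compress(statements):
    """Replace every term with a numerical ID; return the IDs and the dictionary table."""
    dictionary = {}                                   # term -> numerical ID
    encoded = []
    for subj, pred, obj in statements:
        triple = []
        for term in (subj, pred, obj):
            if term not in dictionary:
                dictionary[term] = len(dictionary)    # assign the next free ID
            triple.append(dictionary[term])
        encoded.append(tuple(triple))
    return encoded, dictionary

def decompress(encoded, dictionary):
    """Invert the dictionary and map every ID back to its original term."""
    inverse = {i: t for t, i in dictionary.items()}
    return [tuple(inverse[i] for i in triple) for triple in encoded]

statements = [
    ("<http://example.org/alice>", "<http://xmlns.com/foaf/0.1/knows>", "<http://example.org/bob>"),
    ("<http://example.org/bob>", "<http://xmlns.com/foaf/0.1/knows>", "<http://example.org/carol>"),
]
encoded, table = compress(statements)
assert decompress(encoded, table) == statements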
Outline Introduction Conventional Approach MapReduce Data Compression MapReduce Data Decompression Evaluation Conclusions
MapReduce Data Compression – Job 1: identifies the popular terms and assigns them a numerical ID – Job 2: deconstructs the statements, builds the dictionary table, and replaces all terms with their corresponding numerical IDs – Job 3: reads the numerical terms and reconstructs the statements in their compressed form
Job 1: caching of popular terms Identify the most popular terms and assign each a numerical ID – Randomly sample the input – Count the occurrences of the terms in the sample – Select the subset of the most popular ones
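A plain-Python simulation of Job 1, meant only as a sketch (the real system runs this as a Hadoop job): the map samples the input and emits the terms of the sampled statements, the reduce counts occurrences, and the most frequent terms become the popular-terms cache, with the first numerical IDs reserved for them. The sampling rate and cache size below are assumptions for the example.

import random

SAMPLE_RATE = 0.1      # illustrative: fraction of statements that are sampled
CACHE_SIZE = 100       # illustrative: number of popular terms to cache

def map_sample(statement):
    """Emit each term of a randomly sampled statement with count 1."""
    if random.random() < SAMPLE_RATE:
        for term in statement:
            yield term, 1

def reduce_count(term, counts):
    """Sum the occurrences of one term in the sample."""
    return term, sum(counts)

def job1_popular_terms(statements):
    grouped = {}                                   # simulate the shuffle: group by term
    for stmt in statements:
        for term, one in map_sample(stmt):
            grouped.setdefault(term, []).append(one)
    counted = [reduce_count(t, c) for t, c in grouped.items()]
    counted.sort(key=lambda tc: tc[1], reverse=True)
    # Popular terms get the lowest numerical IDs, reserved for the cache.
    return {term: i for i, (term, _) in enumerate(counted[:CACHE_SIZE])}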
Job 2: deconstruct statements Deconstruct the statements and compress the terms with numerical IDs – Before the map phase starts, the popular terms are loaded into main memory – The map function reads the statements and assigns each of them a numerical ID – Since the map tasks are executed in parallel, we partition the numerical range of the IDs so that each task is allowed to assign only a specific range of numbers
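One way Job 2 could be realized, again as a plain-Python sketch with made-up helper names: each map task tags its statements with IDs taken from its own partitioned numerical range, the shuffle groups the records by term, and the reduce gives every distinct term a single ID (reusing the cached ID for popular terms) while writing the dictionary table.

def map_deconstruct(task_id, statements, popular_cache):
    """Assign each statement an ID from this task's own numerical range and emit
    one record per term: key = term, value = (statement ID, position)."""
    stmt_id = task_id * 10**9                      # illustrative range partitioning
    for stmt in statements:
        for position, term in enumerate(stmt):
            yield term, (stmt_id, position)
        stmt_id += 1

def reduce_assign_ids(term, occurrences, popular_cache, next_id, dictionary):
    """Give the term a single numerical ID, record the dictionary entry, and
    re-emit its occurrences keyed by statement ID for Job 3."""
    term_id = popular_cache.get(term)
    if term_id is None:
        term_id = next_id()                        # fresh ID for a non-popular term
    dictionary[term_id] = term
    for stmt_id, position in occurrences:
        yield stmt_id, (position, term_id)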
Job 3: reconstruct statements Read the previous job's output and reconstruct the statements using the numerical IDs
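Continuing the same sketch, Job 3 only has to group Job 2's output by statement ID and sort by position to obtain the compressed triples:

def reduce_reconstruct_compressed(stmt_id, parts):
    """parts holds the (position, term ID) pairs emitted for this statement;
    sorting by position yields the compressed triple of numerical IDs."""
    return tuple(term_id for _, term_id in sorted(parts))

# e.g. reduce_reconstruct_compressed(7, [(2, 113), (0, 20), (1, 21)]) == (20, 21, 113)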
Outline Introduction Conventional Approach MapReduce Data Compression MapReduce Data Decompression Evaluation Conclusions
MapReduce Data Decompression A join between the compressed statements and the dictionary table – Job 1: identifies the popular terms – Job 2: performs the join between the popular terms and the dictionary table – Job 3: deconstructs the statements and decompresses the terms by performing a join on the input – Job 4: reconstructs the statements in the original format
Job 1: identify popular terms
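As with compression, the popular term IDs can be found by sampling the (compressed) input; a small sketch, with the sampling rate and cache size again illustrative:

import random
from collections import Counter

def sample_popular_ids(compressed_statements, sample_rate=0.1, cache_size=100):
    """Count term-ID occurrences in a random sample of the compressed input
    and return the most frequent IDs."""
    counts = Counter()
    for triple in compressed_statements:
        if random.random() < sample_rate:
            counts.update(triple)
    return {term_id for term_id, _ in counts.most_common(cache_size)}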
Job 2: join with dictionary table
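A sketch of the Job 2 join: the dictionary table is scanned and only the entries whose ID is popular are kept, producing a small table that fits in memory for the next job.

def join_popular_with_dictionary(dictionary_entries, popular_ids):
    """dictionary_entries is an iterable of (term ID, term) pairs;
    keep only the popular ones as an in-memory cache."""
    return {term_id: term
            for term_id, term in dictionary_entries
            if term_id in popular_ids}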
Job 3: join with compressed input
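A sketch of Job 3 in the same style: the map deconstructs the compressed statements, resolves popular IDs from the in-memory cache, and the reduce joins the remaining IDs with their dictionary entries (how the dictionary side of the join reaches the reducers is glossed over here).

def map_deconstruct_compressed(task_id, compressed_statements, popular_cache):
    """Tag each compressed statement with a task-local ID and emit one record per
    term ID; IDs found in the popular cache already carry their term text."""
    stmt_id = task_id * 10**9                      # illustrative range partitioning
    for triple in compressed_statements:
        for position, term_id in enumerate(triple):
            term = popular_cache.get(term_id)      # None if the ID is not popular
            yield term_id, (stmt_id, position, term)
        stmt_id += 1

def reduce_join_dictionary(term_id, occurrences, dictionary):
    """Join the occurrences of a term ID with its dictionary entry and re-emit
    them keyed by statement ID for Job 4."""
    for stmt_id, position, term in occurrences:
        if term is None:
            term = dictionary[term_id]             # the actual join
        yield stmt_id, (position, term)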
Job 4: reconstruct statements
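Job 4 mirrors Job 3 of the compression pipeline: group by statement ID and sort by position to restore the original subject, predicate, and object.

def reduce_reconstruct_original(stmt_id, parts):
    """parts holds the (position, term) pairs from Job 3; sorting by position
    restores the original statement."""
    return tuple(term for _, term in sorted(parts))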
Outline Introduction Conventional Approach MapReduce Data Compression MapReduce Data Decompression Evaluation Conclusions
Evaluation Environment – 32 nodes of the DAS-3 cluster running our Hadoop framework Each node: – two dual-core 2.4 GHz AMD Opteron CPUs – 4 GB main memory – 250 GB storage
Results The throughput of the compression algorithm is higher for larger datasets than for smaller ones – Our technique is more efficient on larger inputs, where the computation is not dominated by the platform overhead – Decompression is slower than compression
Results The beneficial effects of the popular-terms cache
Results Scalability – Varying the input size – Varying the number of nodes
Outline Introduction Conventional Approach MapReduce Data Compression MapReduce Data Decompression Evaluation Conclusions
Conclusions Proposed a technique to compress and decompress Semantic Web statements using the MapReduce programming model Evaluated the performance by measuring the runtime – More efficient for larger inputs Tested the scalability – The compression algorithm scales more efficiently than decompression A major contribution toward solving this crucial problem in the Semantic Web
References [1] J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. Bal. OWL reasoning with MapReduce: calculating the closure of 100 billion triples. Currently under submission. [2] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. Scalable distributed reasoning using MapReduce. In Proceedings of ISWC '09, 2009.
Conventional Approach Dictionary encoding example – Input: ABABBABCABABBA – Output: the corresponding encoded sequence and dictionary table