Parallelization of a semantic discovery tool for a search engine
Kalle Happonen, Wray Buntine

MPCA

MPCA is a discrete version of PCA (principal components: the top K eigenvectors of a matrix) for sparse integer data. Put simply, it:
● goes through the documents to build a sparse document–word matrix
● creates different categories based on word occurrence
● assigns each document a membership level in the categories based on relevance

Uses include:
● building categories of documents; e.g., the Open Directory Project (DMOZ) has hierarchical categories
● partitioning a subset of the web into strongly connected regions and identifying hubs (pages with good outgoing links) and authorities (pages that are commonly linked to)
● genotype discovery in the genome project, although not with this particular software
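The first step above, building a sparse document–word matrix, can be sketched as follows. This is a minimal illustration with a hypothetical whitespace tokenizer and toy documents, not the tool's actual code:

```python
from collections import Counter

def build_sparse_matrix(documents):
    """Build a sparse document-word count matrix as a list of Counters.

    Each row maps a word index to its occurrence count in that document;
    words absent from a document are simply not stored, which is what
    keeps the matrix sparse.
    """
    vocabulary = {}   # word -> column index, grown as new words appear
    rows = []
    for text in documents:
        counts = Counter()
        for word in text.lower().split():
            index = vocabulary.setdefault(word, len(vocabulary))
            counts[index] += 1
        rows.append(counts)
    return rows, vocabulary

docs = ["the web is large", "the web grows"]
rows, vocab = build_sparse_matrix(docs)
# Each row stores only the words that actually occur in that document.
```

Real web-scale input would of course stream documents from disk rather than hold them in a list, but the row-of-sparse-counts shape is the same.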

The matrices

[Diagram: the document–word input matrix and the document–component and component–word output matrices, each split between node 1 and node 2]

How it works

[Diagram: the master (node 1) and slave 1 (node 2) each hold a document–word slice and compute document–component and component–word updates; the component–word matrix is exchanged between master and slave]

The parallel version

Relevant facts
● The task is divided by splitting up the documents
● The component–word matrix is large, so there is a lot of data to transfer
● The total amount of data transferred between loops depends linearly on the number of nodes
● The component–word matrix is always held in memory, so the available memory limits the number of components for a problem

What we have done
● We use MPI for parallelization
● We have a working, tested parallel version, with room for improvement
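The division of work above can be sketched without MPI specifics: each node updates a component–word matrix from its own slice of the documents, and the per-node matrices are then summed, which is the role an MPI reduction (e.g. MPI_Allreduce) plays in an MPI implementation. All names and the uniform stand-in update below are illustrative assumptions, not the tool's real update rule:

```python
def split_documents(documents, num_nodes):
    """Split the document list into near-equal contiguous slices, one per node."""
    per_node = (len(documents) + num_nodes - 1) // num_nodes
    return [documents[i * per_node:(i + 1) * per_node] for i in range(num_nodes)]

def local_update(doc_slice, num_components, num_words):
    """Stand-in for one node's pass: accumulate a component-word matrix
    from its slice of documents (here just counting words uniformly
    into every component, in place of the real MPCA update)."""
    matrix = [[0] * num_words for _ in range(num_components)]
    for doc in doc_slice:              # doc is a list of word ids
        for word_id in doc:
            for k in range(num_components):
                matrix[k][word_id] += 1
    return matrix

def reduce_matrices(matrices):
    """Element-wise sum of the per-node matrices -- what the MPI
    reduction step would do across the cluster."""
    total = [[0] * len(matrices[0][0]) for _ in range(len(matrices[0]))]
    for m in matrices:
        for k, row in enumerate(m):
            for w, value in enumerate(row):
                total[k][w] += value
    return total
```

Because every node needs the summed component–word matrix before the next loop, this reduction is exactly the per-iteration data transfer the slide describes.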

Complexity

Some performance data

Small test run on a 4-node cluster
● 2 Athlon64 CPUs per node
● 2 GB memory per node
● gigabit Ethernet

Scalability

Finnish web: 3 M documents, 10 components
Reuters news archive: 800 k documents, 100 components
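The communication cost behind these scalability limits can be estimated directly: each node exchanges the full component–word matrix once per loop, so total traffic grows linearly with the node count. The vocabulary size and bytes-per-entry below are illustrative assumptions:

```python
def transfer_per_iteration(num_components, vocabulary_size, num_nodes,
                           bytes_per_entry=8):
    """Total bytes moved per iteration if every node exchanges the full
    component-word matrix -- a simple linear-in-nodes cost model."""
    matrix_bytes = num_components * vocabulary_size * bytes_per_entry
    return matrix_bytes * num_nodes

# E.g. 100 components over an assumed 1M-word vocabulary on 8 nodes:
# 100 * 1_000_000 * 8 bytes = 800 MB per node, 6.4 GB total per iteration.
size = transfer_per_iteration(100, 1_000_000, 8)
```

Doubling the node count doubles the per-iteration traffic in this model, which is why the network, not the CPUs, becomes the bottleneck.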

Problems

Data transfer is the limiting factor. Possible solutions:
● Use a faster network
● Compress data before transfer
● Change the communication model, which may require breaking up the algorithm
● Run a reduced problem to convergence, then run the full problem

Memory requirements
● Larger jobs mean more words, so fewer components fit in the same amount of memory
● Run large jobs on nodes with more memory?
● Compress data in memory?
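One of the proposed fixes, compressing data before transfer, is cheap to prototype: sparse count matrices compress well because their rows contain long runs of zeros. The zlib/struct encoding here is an illustrative wire format, not the tool's actual one:

```python
import struct
import zlib

def pack_row(values):
    """Serialize a row of integer counts and deflate it before sending."""
    raw = struct.pack(f"{len(values)}i", *values)
    return zlib.compress(raw)

def unpack_row(payload):
    """Inverse of pack_row: inflate and decode back to a list of counts."""
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"{len(raw) // 4}i", raw))

row = [0] * 990 + [3] * 10          # mostly-zero row, typical of sparse data
wire = pack_row(row)
# The compressed payload is far smaller than the 3960-byte raw encoding.
```

The trade-off is extra CPU time for compression on every exchange, which only pays off while the network remains the bottleneck.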

The present and the future

The present
● A working parallelization with MPI
● Successful runs on clusters
● Has been used on the whole Finnish web (3 M docs), Wikipedia (600 k docs) and DMOZ (6 M docs)
● Problems with memory requirements and communication

The future
● Run on much larger document sets
● Improve scalability
● Run on a grid