Hadoop on Palmetto HPC
Pengfei Xuan
School of Computing, Clemson University
Jan 8, 2015

Outline
Introduction
Hadoop over the Palmetto HPC Cluster
Hands-on Practice

HPC Cluster vs. Hadoop Cluster
[Diagram: an HPC cluster of compute nodes reaching shared storage over the network, contrasted with a Hadoop cluster of data nodes that keep HDD, SSD, and RAM local to each node.]

HPC Clusters: Forge (NVIDIA GPU Cluster), National Center for Supercomputing Applications
44 GPU nodes, with 6 or 8 NVIDIA Fermi M2070 GPUs per node
6 GB graphics memory per GPU
600 TB GPFS file system
40 Gb/sec InfiniBand QDR per node (point-to-point unidirectional speed)
[Diagram: each node pairs a 40 Gb InfiniBand adapter with 8 NVIDIA Fermi M2070 GPUs, connected through an InfiniBand switch.]

Hadoop Clusters

First, let's take a look at the system architecture of our Hadoop experimental platform. At the bottom, the Palmetto cluster provides the hardware infrastructure. On top of this infrastructure, we create a KVM virtual machine using only user-level permissions. Inside the virtual machine we have root permission, which gives us a fully controlled running environment in which to deploy all the components of a Hadoop system. The front end of the Hadoop ecosystem is a set of high-level APIs and programming languages for the Hadoop framework. For example, SQL and R are two high-level languages that can act as front ends; in the Hadoop ecosystem they correspond to the Hive and rmr components. With them, a developer can easily write SQL or R scripts to query massive data sets stored in HBase or the HDFS file system (a small Hive example follows below). In this class we will have the opportunity to practice handling big data with many of the tools in the Hadoop ecosystem; we will talk about these tools later.

To summarize, using a Hadoop virtual machine on the Palmetto cluster offers several advantages. It greatly reduces the complexity of deployment: installing the whole system step by step can take one to three weeks, depending on your experience. Worse, if you add an incompatible component to an existing clean system, you sometimes have to reinstall the whole system from scratch, because new components may overwrite your original configuration files in many different places, and such changes are difficult to trace and to roll back. Performance is another advantage: compared with running the virtual machine on your laptop, a Palmetto compute node has more memory, larger storage, faster CPUs, and a faster network. Finally, a virtual-machine-based experiment and development environment is convenient; for example, we can track Hadoop updates for version control by keeping multiple backups of the virtual machine's image file.
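
As a small example of the SQL front end, a word-frequency query through Hive might look like the following sketch. The table name words and its single STRING column word are hypothetical, assumed to be already loaded from HDFS; this is an illustration, not a command from the slides:

$> hive -e "SELECT word, COUNT(*) AS freq FROM words GROUP BY word ORDER BY freq DESC LIMIT 10"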

History of Hadoop

1998: Google founded
2003: GFS paper
2004: MapReduce paper; Nutch DFS implementation
2005: Nutch MapReduce implementation
2006: BigTable paper; Hadoop project
2008: World's largest Hadoop deployment
2010: Facebook, 21 PB of data
2011: Microsoft, IBM, Oracle, Twitter, Amazon
Now: Everywhere, including our class!

To tell the history of Hadoop, we have to mention two important people: Jeffrey Dean and Doug Cutting. Jeffrey Dean, a Google Fellow, is one of the most important contributors to Google's system infrastructure, which provided the primary ideas behind Hadoop. Doug Cutting followed Google's ideas and eventually created the Hadoop project.

Google is a pioneer in the data-intensive computing area. Since its founding in 1998, data from all over the Internet has continuously filled the servers in Google's data centers, and the total accumulated data has grown rapidly. Google published the GFS paper in 2003, the MapReduce paper in 2004, and the BigTable paper in 2006. These three papers respectively address three major problems of real-world big data: data storage, the data computation model, and structured data management. These three ideas led to a revolution in distributed systems and had a great impact on the current design of system architectures for data-intensive computing. In industry, Hadoop, as an implementation of this architecture, is widely used by many big companies, such as Yahoo!, Facebook, LinkedIn, and Twitter.

Doug Cutting is the creator of Hadoop, and before Hadoop he also created Lucene and Nutch. (By the way, does anyone know the Lucene and Nutch projects? Together they form a search engine stack: if you want to develop your own information retrieval system or web search service, you can use Lucene and Nutch. They are very powerful frameworks, and I believe they can satisfy most of your requirements.) Doug Cutting is an expert at implementing all kinds of complex distributed systems. In 2003 Google released its first paper, on GFS; Doug Cutting quickly followed this idea and in 2004 created another implementation called the Nutch Distributed File System (NDFS), which is the ancestor of today's HDFS. In 2004 Google released its second paper, on MapReduce; Doug Cutting did the same thing, and the MapReduce version of Nutch was implemented in 2005. In 2006 Yahoo! hired Doug Cutting, and the Hadoop project was split out of Nutch. In 2008 Yahoo! announced the world's largest Hadoop production application, running on a Linux cluster with more than 10,000 cores. In 2010 Facebook claimed the largest Hadoop cluster in the world, with 21 PB of storage.

In industry, most big-data problems can now be addressed with Hadoop. The Hadoop system is used by everyone from Twitter to Facebook to Yahoo!, and many traditional companies, such as Microsoft, IBM, and Oracle, have also joined the Hadoop community. Finally, in our data-intensive computing class, we too will study Hadoop and apply it to our research and projects.

Google vs. Hadoop Infrastructures

[Slide: side-by-side stacks of real-world data platforms. Google's stack (Chubby, MapReduce/GFS, Bigtable, Sawzall, Dremel, Evenflow, MySQL Gateway) appears alongside Hadoop-based stacks built from Zookeeper, Hadoop/HDFS, HBase, Hive/Pig, Oozie, Scribe, Data Highway, HiPal, Databee, Azkaban, Kafka, Voldemort, Sqoop, Flume, Hue, and Crunch. The stacks are organized into four layers: IV. Data Analysis Layer; III. Data Flow Layer; II. Data Storage and Computing Layer; I. Data Coordination Layer.]

This slide shows several similar system architectures for data-intensive computing in the real world. It may look a bit complex, so to make the architectures easier to understand, I split them into four layers according to the level at which each operates on the data. From bottom to top: the data coordination layer, the data storage and computing layer, the data flow layer, and the data analysis layer.

The first layer, data coordination, provides high-availability and fault-tolerance services to the whole platform by synchronizing the various system services. In Google's architecture this layer is implemented by Chubby; in Hadoop it is Zookeeper. The second layer provides the fundamental data storage and computing services: GFS and HDFS offer large-scale file systems built on commodity hardware, Bigtable and HBase manage structured data, and MapReduce provides a data-parallel computing model for processing massive data (a short HDFS example follows below). The third layer, data flow, provides data pipeline and workflow services for designing and scheduling data analysis jobs on top of the storage and computing layer. The fourth layer, data analysis, gives users a high-level analysis and query interface over the underlying shared clusters of commodity machines; users can perform interactive analysis, such as SQL-like queries, at a scale that demands a high degree of parallelism.
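
As a concrete taste of the storage and computing layer, basic interaction with HDFS from the shell looks like the following sketch. The paths are illustrative, not from the slides:

$> hdfs dfs -mkdir -p /user/your_cu_username/data     # create a directory in HDFS
$> hdfs dfs -put local_file.txt /user/your_cu_username/data/   # copy a local file in
$> hdfs dfs -ls /user/your_cu_username/data           # list the HDFS directory
$> hdfs dfs -cat /user/your_cu_username/data/local_file.txt    # read it back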

MapReduce Word Count Example

cat * | grep | sort | uniq -c | cat > file

This classic Unix pipeline is a one-machine analogue of a MapReduce word count: cat reads the input, grep plays the role of the map phase, sort corresponds to the framework's shuffle and sort, uniq -c is the reduce phase, and cat > file writes the output. A Hadoop Streaming sketch of the same idea follows below.
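
The same computation can run on a cluster as a Hadoop Streaming job, where ordinary Unix tools act as the mapper and reducer. This is a minimal sketch, not from the original slides; the streaming jar path and the wordcount-input/wordcount-output directory names are assumptions:

# mapper splits each line into one word per line; the framework's shuffle
# sorts the words; 'uniq -c' then counts each run of identical words
$> hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
       -input wordcount-input -output wordcount-output \
       -mapper 'tr -s "[:space:]" "\n"' -reducer 'uniq -c'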

Run Hadoop over the Palmetto Cluster

1. Set up the Hadoop configuration files
2. Start the Hadoop services
3. Copy input files to HDFS (stage-in)
4. Run the Hadoop job (MapReduce WordCount)
5. Copy output files from HDFS to your home directory (stage-out)
6. Stop the Hadoop services
7. Clean up

A sketch of the commands behind steps 3-5 appears below.
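
For reference, here is a minimal sketch of what steps 3-5 could look like as individual commands once the Hadoop services are up. The examples jar path and the wordcount-input/wordcount-output directory names are assumptions for illustration; on Palmetto these steps are handled inside the runHadoop.pbs script:

$> hdfs dfs -put input/ wordcount-input     # stage-in: local files into HDFS
$> hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
       wordcount wordcount-input wordcount-output   # run the bundled WordCount job
$> hdfs dfs -get wordcount-output $HOME/    # stage-out: results back to your home directory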

Commands

Create job directory:
$> mkdir myHadoopJob1
$> cd myHadoopJob1

Get Hadoop PBS script:
$> cp /tmp/runHadoop.pbs .
or
$> wget https://raw.githubusercontent.com/pfxuan/myhadoop/master/examples/runHadoop.pbs

Submit job to Palmetto cluster:
$> qsub runHadoop.pbs

Check status of your job:
$> qstat -anu your_cu_username

Verify the correctness of your result:
$> grep Hadoop wordcount-output/* | grep 51