School of Computing Clemson University

Presentation on theme: "School of Computing Clemson University"— Presentation transcript:

1 Hadoop on Palmetto HPC
Pengfei Xuan, Jan 8, 2015
School of Computing, Clemson University

2 Outline
Introduction
Hadoop over Palmetto HPC Cluster
Hands-on Practice

3 HPC Cluster vs. Hadoop Cluster
[Diagram: HPC compute nodes connected by networking to shared storage (HDD/SSD/RAM), vs. Hadoop data nodes with local storage]

4 HPC Clusters
Example: Forge (NVIDIA GPU cluster) at the National Center for Supercomputing Applications
44 GPU nodes, 6 or 8 NVIDIA Fermi M2070 GPUs per node
6 GB graphics memory per GPU
600 TB GPFS file system
40 Gb/sec InfiniBand QDR per node (point-to-point unidirectional speed)
[Diagram: InfiniBand switch + 40 Gb InfiniBand adapter + 8 NVIDIA Fermi M2070 GPUs]

5 Hadoop Clusters
First, let's look at the system architecture of our Hadoop experimental platform. At the bottom, the Palmetto Cluster provides the hardware infrastructure. On top of it, we create a KVM virtual machine using user-level permissions. Inside the virtual machine we have root permission, which gives us a fully controlled environment in which to deploy all the components of a Hadoop system.
The front end of the Hadoop ecosystem consists of high-level APIs and programming languages for the Hadoop framework. For example, SQL and R are two high-level languages that can act as front ends; in the Hadoop ecosystem they correspond to the Hive and rmr components. A developer can write SQL or R scripts to query massive data sets stored in Bigtable or the HDFS file system. In this class we will have opportunities to practice handling big data with many of the tools in the Hadoop ecosystem; we will talk about these tools later.
To summarize, running a Hadoop virtual machine on the Palmetto Cluster offers several advantages. It greatly reduces deployment complexity: installing the whole system step by step would take at least one to three weeks, depending on your experience. Moreover, porting an incompatible component into an existing clean system can force you to reinstall the whole system from the beginning, because new components may overwrite your original configuration files in many different places, making the changes difficult to trace and to roll back. Performance is another advantage: compared with running a virtual machine on your laptop, a Palmetto compute node has more memory, larger storage, faster CPUs, and a faster network. Finally, a VM-based experiment or development environment is convenient: for example, we can do version control by keeping multiple backups of the virtual machine's image file, which lets us trace updates to Hadoop.

6 History of Hadoop
Timeline: 1998 Google founded; 2003 GFS paper; 2004 MapReduce paper; Nutch DFS implementation; Nutch MapReduce implementation; 2006 BigTable paper; Hadoop project; 2008 world's largest Hadoop deployment; 2010 Facebook, 21 PB of data; Microsoft, IBM, Oracle, Twitter, Amazon; now everywhere, including our class!
To tell the history of Hadoop, we have to mention two important people: Jeffrey Dean and Doug Cutting. Jeffrey Dean, a Google Fellow, is one of the most important contributors to Google's system infrastructure, which provided the primary ideas behind Hadoop. Doug Cutting followed Google's ideas and ultimately created the Hadoop project.
Google is a pioneer in the data-intensive computing area. Since its founding in 1998, all kinds of Internet data have continuously poured into the servers in Google's data centers, and the total accumulated data has grown explosively. Google published the GFS paper in 2003, the MapReduce paper in 2004, and the BigTable paper in 2006. These three papers respectively address three major problems of real-world big data: data storage, the data computational model, and data management. These three concepts led a revolution in the distributed systems area and had a great impact on the design of system architectures for data-intensive computing. In industry, Hadoop, as an implementation of this architecture, is widely used by many big companies, such as Yahoo!, Facebook, LinkedIn, and Twitter.
Doug Cutting is the creator of Hadoop; before Hadoop, he also created Lucene and Nutch. (By the way, does anyone know the Lucene and Nutch projects? Together they form a vertical search engine. If you want to develop your own information retrieval system or web search service, you can use Lucene and Nutch; they are very powerful frameworks, and I believe they can satisfy most of your requirements.) Doug Cutting is an expert at implementing all kinds of complex distributed systems. In 2003, Google released its first paper, on GFS. Doug Cutting quickly followed this idea and created another implementation of GFS called the Nutch Distributed File System (NDFS), which is the ancestor of today's HDFS. In 2004, Google released its second paper, on MapReduce; Doug Cutting did the same thing, and a MapReduce implementation was added to Nutch. In 2006, Yahoo! hired Doug Cutting, and the Hadoop project split out of Nutch. In 2008, Yahoo! announced the world's largest Hadoop production application, running on a Linux cluster of more than 10,000 cores. In 2010, Facebook claimed the largest Hadoop cluster in the world, with 21 PB of storage. In industry, most big-data problems can be resolved with Hadoop, and the system is now widely used by everyone from Twitter to Facebook to Yahoo!. Many traditional companies, such as Microsoft, IBM, and Oracle, have also joined the Hadoop community. Finally, in our data-intensive computing class, we will also study Hadoop and apply it to our research and projects.
Jeffrey Dean, Doug Cutting

7 Google vs. Hadoop Infrastructures
[Comparison chart: Google's stack (Dremel, Evenflow, Sawzall, MySQL Gateway, Bigtable, MapReduce / GFS, Chubby) beside several Hadoop-based stacks built from HiPal, Databee, Hive, Scribe, Data Highway, Hue, Crunch, Oozie, Azkaban, Sqoop, Kafka, Flume, Pig, Voldemort, HBase, Hadoop, and Zookeeper, organized into four layers: IV. Data Analysis Layer; III. Data Flow Layer; II. Data Storage and Computing Layer; I. Data Coordination Layer]
This slide shows several similar real-world system architectures for data-intensive computing. They look a bit complex, so to make them easier to understand I split them into four layers based on the level at which they operate on data. From bottom to top: the data coordination layer, the data storage and computing layer, the data flow layer, and the data analysis layer.
The first layer, the data coordination layer, provides high-availability and fault-tolerance services to the whole platform by synchronizing the various system services. In Google's architecture this layer is implemented by Chubby; in Hadoop, it is Zookeeper. The second layer provides the fundamental data storage and computing services: GFS and HDFS offer large-scale file systems built on commodity hardware, BigTable and HBase provide management of structured data, and MapReduce provides a data-parallel computing model for processing massive data. The third layer, the data flow layer, provides pipeline and workflow services to design and schedule data analysis jobs on the storage and computing layer. The fourth layer, the data analysis layer, provides high-level analysis and query interfaces to users over the underlying shared clusters of commodity machines; users can perform interactive analysis, such as SQL-like queries, at a scale that demands a high degree of parallelism.

8 MapReduce Word Count Example
cat * | grep | sort | uniq -c | cat > file
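The pipeline on this slide is the classic Unix analogy for MapReduce: cat/grep play the role of the map phase (emit one record per word), sort is the shuffle (group identical keys together), and uniq -c is the reduce (aggregate each group into a count). A minimal runnable version of the analogy, using an illustrative input file (the filenames and sample text are assumptions, not from the slide):

```shell
# Create a small sample input (illustrative data).
printf 'hadoop runs on palmetto\nhadoop stores data in hdfs\n' > sample.txt

# "Map": split lines into one word per line; "shuffle": sort groups equal
# words together; "reduce": uniq -c counts each group. sort -rn puts the
# most frequent word first; results land in counts.txt.
tr -s ' ' '\n' < sample.txt | sort | uniq -c | sort -rn > counts.txt
cat counts.txt
```

Because "hadoop" appears twice in the sample, the first line of counts.txt reads "2 hadoop" (with leading spaces from uniq -c), just as a WordCount reducer would emit the word with its total.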

9 Run Hadoop over Palmetto Cluster
Set up Hadoop configuration files
Start Hadoop services
Copy input files to HDFS (stage-in)
Run Hadoop job (MapReduce WordCount)
Copy output files from HDFS to your home directory (stage-out)
Stop Hadoop services
Clean up
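The stage-in / run / stage-out steps above can be sketched as a small shell script. The HDFS paths, local directory names, and jar location below are assumptions for illustration, not the actual Palmetto setup (which the provided PBS script handles); hdfs dfs and hadoop jar are the standard Hadoop CLI entry points.

```shell
# Write a sketch of the workflow to a script file for inspection.
# Paths and the examples jar name are assumptions; adapt to your install.
cat > runWordCount.sh <<'EOF'
#!/bin/bash
set -e
# Stage-in: copy local input files into HDFS.
hdfs dfs -mkdir -p /user/$USER/wordcount-input
hdfs dfs -put input/*.txt /user/$USER/wordcount-input/

# Run the bundled WordCount example job.
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/$USER/wordcount-input /user/$USER/wordcount-output

# Stage-out: copy results from HDFS back to the home directory.
hdfs dfs -get /user/$USER/wordcount-output ./wordcount-output
EOF
chmod +x runWordCount.sh
```

On Palmetto, commands like these are wrapped inside the PBS job script so that they run on the allocated compute nodes rather than on the login node.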

10 Commands
Create job directory:
$> mkdir myHadoopJob1
$> cd myHadoopJob1
Get Hadoop PBS script:
$> cp /tmp/runHadoop.pbs .
Or,
$> wget
Submit job to Palmetto cluster:
$> qsub runHadoop.pbs
Check status of your job:
$> qstat -anu your_cu_username
Verify the correctness of your result:
$> grep Hadoop wordcount-output/* | grep 51
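The final grep check works because each line of WordCount output is a word, a tab, and its count; the slide's check assumes the word "Hadoop" occurs 51 times in the class input. A simulated output file (its contents are illustrative, not the real job output) shows the mechanics of the check:

```shell
# WordCount writes files like wordcount-output/part-r-00000 containing
# "word<TAB>count" lines. Simulate one such file (counts are illustrative).
mkdir -p wordcount-output
printf 'Hadoop\t51\nPalmetto\t12\n' > wordcount-output/part-r-00000

# Same check as on the slide: find the "Hadoop" line, then confirm it
# contains 51. A match means the job produced the expected count.
grep Hadoop wordcount-output/* | grep 51
```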

