ZhangGang 2012.12.25


ZhangGang

Since the Hadoop farm at CC has not been successfully configured, I cannot run tests with HBase there. Instead, I used the machine named hadoop01, which belongs to that farm, to select records from MySQL, ran the same SELECT on my PC, and compared the processing times. The table below shows the comparison:

Comparison of query times (columns: Plot, PC (s), Farm (s), LHCb web portal (s); only one timing per row survived the transcript):
Diskspace by Site: about 16
Diskspace by Jobtype: about 7
CPUTime by Jobtype: about 7
CPUTime by Site (10/06/20 to 12/06/20): about 9
CPUTime by Site (08/06/20 to 10/06/20): about 7

Install and configure Hadoop and HBase on my PC

The environment has not been successfully configured at CC; besides, some parts of the references confused me, and I do not fully understand their meaning. So I tried to set up a pseudo-distributed mode on my own computer. As I learned from the references, we need several services to deal with our problem:

Hadoop (HDFS and MapReduce): a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is the basic part; the other components are built on it.
HBase: a scalable, distributed database that supports structured data storage for large tables. We will import data from MySQL into HBase in one format.

Sqoop: a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. We use it to transfer our data from MySQL to HBase.
Thrift: the Apache Thrift software framework, for scalable cross-language services development. Because we want to use Python, Thrift is needed.
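As a sketch, a Sqoop import from MySQL into HBase might look like the following; the database name, table name, and credentials here are hypothetical, not taken from the slides:

```shell
# Hypothetical example: import an accounting table from MySQL into HBase.
# --hbase-table and --column-family name the target layout;
# --hbase-row-key picks the MySQL column used as the HBase row key.
sqoop import \
  --connect jdbc:mysql://localhost/dirac_accounting \
  --username dirac --password '***' \
  --table accounting \
  --hbase-table diracAccounting \
  --column-family generate \
  --hbase-row-key starttime
```

Sqoop imports into a single column family per run, so a table with two families would need two imports.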

Setting up a Hadoop environment is much more complicated than I expected. I met many unknown errors. So far I have successfully installed and configured Hadoop, HBase and Thrift; Sqoop still has some errors.

Hadoop:
1. Create a user account named hadoop.
2. Install SSH.
3. Install Java.
4. Install Hadoop and configure hadoop-env.sh; then the standalone mode works.
There are three *.xml files; in standalone mode they are empty. If we want a pseudo-distributed mode, we must configure them.

core-site.xml: Hadoop core configuration items, like I/O settings.
hdfs-site.xml: HDFS daemon configuration items, for the namenode and datanodes.
mapred-site.xml: MapReduce daemon configuration items, for the jobtracker and tasktrackers.
Start Hadoop: format the HDFS, then start all the daemon processes.
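For the Hadoop 1.x line in use here, a minimal pseudo-distributed configuration of those three files is conventionally along these lines (the localhost ports are the usual defaults, not values taken from these slides):

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: one machine, so keep a single replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

With these in place, `hadoop namenode -format` formats the HDFS and `start-all.sh` starts all the daemon processes.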

Then Hadoop is started.

HBase:
1. Install Java (already installed).
2. Install HBase.
3. Configure hbase-site.xml: set hbase.rootdir.
4. Start HBase.
5. Use the HBase shell and create a table named test; it has two column families: zhang and gang.
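The table-creation step above can be sketched in the HBase shell like this (standard shell commands; the sample row and value are made up for illustration):

```shell
# In the HBase shell: create the table with its two column families,
# then write and read back one test cell.
create 'test', 'zhang', 'gang'
put 'test', 'row1', 'zhang:c1', 'hello'
scan 'test'
```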

Thrift (I found it complicated):
1. Install all the required tools and libraries needed to build and install the Apache Thrift compiler.
2. From the top directory, run ./configure.
3. Once configure has run, run make and make test.
4. From the top directory, become superuser and run make install.
If there is no error (I met many), Thrift is successfully installed. Then generate the Python client and move it to ~/python2.7/site-packages:
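In shell form, the build and client-generation steps above are roughly the following; the path to HBase's Hbase.thrift interface file varies by HBase version, so it is shown as a placeholder:

```shell
# Build and install the Thrift compiler from the source tree.
./configure
make
make test        # optional self-checks
sudo make install

# Generate the Python client bindings from HBase's Thrift definition.
thrift --gen py /path/to/Hbase.thrift
```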

After generating the Python client, we can use Python to access HBase. The next part is about a script I wrote to interact with HBase.

As a test, I create a table named diracAccounting; it has two column families, 'groupby' and 'generate', and each family has one column: 'groupby:Site' and 'generate:CPUTime'. The row key is the 'starttime' from the MySQL tables. The whole code is pushed to GitHub: m/Hadoop/HbasePy/hbaseplot.py

def put(self): '''put some records to hbase table'''
Select 'Site', 'CPUTime' and 'Starttime' from the MySQL database and put them into the table in HBase, with 'starttime' as the row key.
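A minimal, self-contained sketch of the mapping that put() performs, assuming hypothetical rows already fetched from MySQL (the real script writes through the Thrift client to a live HBase; here a plain dict stands in for the diracAccounting table):

```python
def rows_to_cells(rows):
    """Map (Site, CPUTime, Starttime) rows to HBase-style cells,
    keyed by starttime as in the diracAccounting table."""
    table = {}
    for site, cputime, starttime in rows:
        table[str(starttime)] = {
            "groupby:Site": site,
            "generate:CPUTime": str(cputime),
        }
    return table

# Hypothetical sample rows, as they might come back from MySQL:
rows = [("CERN", 12.5, 1340150400), ("IN2P3", 7.0, 1340154000)]
cells = rows_to_cells(rows)
```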

def generatePlot(self, groupbyName, generateName): '''use records to generate a plot'''
In this function, I scan the records and generate a plot.
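The core of generatePlot is grouping the scanned cells by the groupby column and summing the generated quantity; a sketch of that aggregation step, with hypothetical scanned data and without the actual plotting call:

```python
def aggregate(cells, groupby_col="groupby:Site", generate_col="generate:CPUTime"):
    """Sum the generated value per group, as generatePlot would
    before handing the totals to a plotting library."""
    totals = {}
    for row_key, columns in cells.items():
        group = columns[groupby_col]
        totals[group] = totals.get(group, 0.0) + float(columns[generate_col])
    return totals

# Hypothetical cells, as a scan of diracAccounting might return them:
cells = {
    "1340150400": {"groupby:Site": "CERN", "generate:CPUTime": "12.5"},
    "1340154000": {"groupby:Site": "CERN", "generate:CPUTime": "2.5"},
    "1340157600": {"groupby:Site": "IN2P3", "generate:CPUTime": "7.0"},
}
totals = aggregate(cells)
```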

end