
1 ZhangGang 2012.12.25

2 Since the Hadoop farm has not been successfully configured at CC, I cannot do tests with HBase yet. I just use the machine named hadoop01, which belongs to that farm, to select records from MySQL, and at the same time run the same selects on my PC and compare the processing times. Below is a table of the comparison:

3
Plot                                  PC (s)   Farm (s)   LHCb web portal (s)
Diskspace by Site                     97.39    36.93      about 16
Diskspace by Jobtype                  97.47    40.45      about 7
CPUTime by Jobtype                    90.66    40.08      about 7
CPUTime by Site (10/06/20-12/06/20)   97.69    39.54      about 9
CPUTime by Site (08/06/20-10/06/20)   86.64    32.31      about 7
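For context, the timings were presumably taken around the SELECT itself; a minimal sketch of such a measurement in Python (host, credentials and table name here are placeholders, not the real ones):

    # time a representative SELECT against the accounting database
    import time
    import MySQLdb

    db = MySQLdb.connect(host='localhost', user='user', passwd='pass', db='dirac')
    cur = db.cursor()
    start = time.time()
    cur.execute("SELECT Site, CPUTime, Starttime FROM accounting")  # assumed table
    rows = cur.fetchall()
    print('%d rows in %.2f s' % (len(rows), time.time() - start))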

4 Install and configure Hadoop and HBase on my PC

5 The environment has not been successfully configured at CC; besides, some parts of the references confused me, and I do not fully understand what they mean. So I tried to set up a pseudo-distributed mode on my own computer. As I learned from some references, we need several services to deal with our problem:

6 Hadoop: HDFS and MapReduce. A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is the basic part; the other parts are built on it.
HBase: A scalable, distributed database that supports structured data storage for large tables. We will load data from MySQL into HBase in one format.

7 Sqoop: Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. We use it to transfer our data from MySQL to HBase.
Thrift: The Apache Thrift software framework, for scalable cross-language services development. Because we want to use Python, Thrift is needed.
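For reference, once Sqoop works, a MySQL-to-HBase import would look roughly like this; the connect string, credentials and MySQL table name are assumptions, while the HBase table, column family and row key follow slide 15:

    sqoop import \
      --connect jdbc:mysql://localhost/dirac \
      --username user --password pass \
      --table accounting \
      --hbase-table diracAccounting \
      --column-family groupby \
      --hbase-row-key Starttime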

8 Setting up a Hadoop environment is much more complicated than I expected. I met many unknown errors. Till now I have only successfully installed and configured Hadoop, HBase and Thrift; Sqoop still has some errors.

9 Hadoop:
1. Create a user account named hadoop.
2. Install SSH.
3. Install Java.
4. Install Hadoop and configure hadoop-env.sh (a sketch follows); then the standalone mode works.
There are three *.xml files; in standalone mode they are empty. If we want a pseudo-distributed mode, we must configure them.
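The hadoop-env.sh change needed at step 4 is usually just the Java location; the path below is an example, not necessarily the one I used:

    # hadoop-env.sh: point Hadoop at the local JDK (example path)
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk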

10 core-site.xml: Hadoop core configuration items, like I/O configuration.
hdfs-site.xml: Hadoop daemon process configuration items, like namenode and datanode.
mapred-site.xml: MapReduce daemon process configuration items, like jobtracker and tasktracker.
Start Hadoop: format the HDFS, then start all the daemon processes (sketch below).
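A minimal pseudo-distributed configuration, assuming everything runs on localhost with the usual Hadoop 1.x ports, puts one property in each file and then formats and starts HDFS:

    <!-- core-site.xml -->
    <configuration>
      <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
    </configuration>

    <!-- hdfs-site.xml: replication 1, since there is only one node -->
    <configuration>
      <property><name>dfs.replication</name><value>1</value></property>
    </configuration>

    <!-- mapred-site.xml -->
    <configuration>
      <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
    </configuration>

    # from the Hadoop directory: format the HDFS, then start all daemons
    bin/hadoop namenode -format
    bin/start-all.sh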

11 Then Hadoop is started.

12 HBase:
Install Java (already installed).
Install HBase.
Configure hbase-site.xml: set the hbase.rootdir.
Start HBase.
Use the HBase shell and create a table named test; it has two column families, zhang and gang (sketch below).
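A sketch of those steps, assuming the HDFS from slide 10 is at hdfs://localhost:9000 (the /hbase path is the conventional choice, an assumption here):

    <!-- hbase-site.xml -->
    <configuration>
      <property><name>hbase.rootdir</name><value>hdfs://localhost:9000/hbase</value></property>
    </configuration>

    # start HBase, then create and inspect the test table in the shell
    bin/start-hbase.sh
    bin/hbase shell
    hbase> create 'test', 'zhang', 'gang'
    hbase> put 'test', 'row1', 'zhang:a', 'value1'
    hbase> scan 'test'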

13 Thrift: (I found it complicated)
1. Install all the required tools and libraries to build and install the Apache Thrift compiler.
2. From the top directory, do: ./configure
3. Once configure has run: make and make test
4. From the top directory, become superuser and do: make install
If there is no error (I met many), Thrift is successfully installed. Then generate the Python client and move it to ~/python2.7/site-packages, as sketched below:
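The slide does not show the generation command; a plausible version, assuming the Hbase.thrift interface file taken from the HBase source tree (its location varies by version):

    # generate the Python bindings from HBase's Thrift interface file
    thrift --gen py Hbase.thrift
    # copy the generated package into site-packages so Python can import it
    cp -r gen-py/hbase ~/python2.7/site-packages/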

14 After generating the Python client, we can use Python to access HBase. The next part is about a script I wrote to interact with HBase; a connection sketch follows.
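Connecting looks roughly like this; 9090 is the Thrift server's default port, and the hbase package name is what the generator produces from Hbase.thrift (both assumptions here):

    # minimal sketch: connect to the HBase Thrift server with the generated client
    # (requires the Thrift server to be running: bin/hbase thrift start)
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()
    print(client.getTableNames())  # sanity check: list the existing tables
    transport.close()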

15 As a test, I create a table named diracAccounting; it has two column families, 'groupby' and 'generate', and each family has one column: 'groupby:Site' and 'generate:CPUTime'. The row key is the 'starttime' in the MySQL tables. The whole code is pushed to GitHub: https://github.com/zhangg/LearningCode/blob/master/Program/Hadoop/HbasePy/hbaseplot.py
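Creating that table through the Thrift client is a short sketch (client is the connection from the previous slide; the trailing colons on the family names follow the Thrift1 convention):

    # create 'diracAccounting' with the two column families from this slide
    from hbase.ttypes import ColumnDescriptor
    client.createTable('diracAccounting',
                       [ColumnDescriptor(name='groupby:'),
                        ColumnDescriptor(name='generate:')])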

16 def put(self): '''put some records to hbase table''' Select 'Site', 'CPUTime', 'Starttime' from the MySQL database and put them into the table in HBase, setting 'starttime' as the row key.
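A sketch of what put does, assuming the class keeps the opened Thrift client as self.client and that the MySQL connection details shown are placeholders; note that some HBase versions give mutateRow a fourth attributes argument:

    # sketch of put(): copy (Site, CPUTime, Starttime) rows from MySQL into
    # HBase, using Starttime as the HBase row key
    import MySQLdb
    from hbase.ttypes import Mutation

    def put(self):
        '''put some records to hbase table'''
        db = MySQLdb.connect(host='localhost', user='user',
                             passwd='pass', db='dirac')  # assumed credentials
        cur = db.cursor()
        cur.execute("SELECT Site, CPUTime, Starttime FROM accounting")  # assumed table
        for site, cputime, starttime in cur.fetchall():
            mutations = [Mutation(column='groupby:Site', value=str(site)),
                         Mutation(column='generate:CPUTime', value=str(cputime))]
            self.client.mutateRow('diracAccounting', str(starttime), mutations)
        db.close()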

17 def generatePlot(self, groupbyName, generateName): '''use records to generate a plot''' In this function, I scan the records and generate a plot.
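A sketch of generatePlot under the same assumptions, scanning with the Thrift1 scanner calls (again, some HBase versions add an attributes argument) and using matplotlib, an assumption, to sum the generate column per groupby value:

    # sketch of generatePlot(): scan the HBase table, aggregate, and plot
    from collections import defaultdict
    import matplotlib.pyplot as plt

    def generatePlot(self, groupbyName, generateName):
        '''use records to generate a plot'''
        gcol = 'groupby:' + groupbyName      # e.g. 'groupby:Site'
        vcol = 'generate:' + generateName    # e.g. 'generate:CPUTime'
        totals = defaultdict(float)
        scanner = self.client.scannerOpen('diracAccounting', '', [gcol, vcol])
        while True:
            rows = self.client.scannerGetList(scanner, 100)  # fetch in batches
            if not rows:
                break
            for r in rows:
                totals[r.columns[gcol].value] += float(r.columns[vcol].value)
        self.client.scannerClose(scanner)
        plt.bar(range(len(totals)), list(totals.values()))
        plt.xticks(range(len(totals)), list(totals.keys()), rotation=45)
        plt.ylabel(generateName)
        plt.savefig(generateName + '_by_' + groupbyName + '.png')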

18 end

