Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850.

1 Overview of Hadoop for Data Mining
Federal Big Data Group confidential
Mark Silverman
Treeminer, Inc.
155 Gibbs Street, Suite 514
Rockville, Maryland 20850
(240) 389-0750
msilverman@treeminer.com

2 TREEMINER, INC. CONFIDENTIAL
Agenda
- Introduction to Hadoop
- Developing and testing a Map/Reduce application
- Auto-Clustering in Hadoop and Interworking with Apache Storm

3 Introduction to Hadoop
Hadoop consists of:
- Clustered, distributed, highly available file system (HDFS)
- Execution framework (Map/Reduce)

4 Hadoop File System
- “Rack” aware
- Local storage
- Distributed copies (generally 3)
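
The rack awareness and three distributed copies listed above can be illustrated with a small simulation. This is only a simplified sketch of the placement HDFS documents for a replication factor of 3 (first copy on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The function name `place_replicas`, the `cluster` dictionary layout, and the node and rack names are hypothetical and are not part of any Hadoop API; the real NameNode also weighs factors such as free space and load.

```python
import random


def place_replicas(cluster, writer_node, replication=3, rng=random):
    """Pick nodes for one block, loosely following HDFS's default policy.

    cluster: dict mapping rack name -> list of node names (hypothetical layout).
    writer_node: the node writing the block; it keeps the first copy locally.
    """
    rack_of = {node: rack for rack, nodes in cluster.items() for node in nodes}

    # Replica 1: local storage on the node that writes the block.
    chosen = [writer_node]
    local_rack = rack_of[writer_node]

    # Replica 2: a node on a different rack, so losing one rack cannot
    # destroy every copy of the block.
    remote_racks = [rack for rack in cluster if rack != local_rack]
    if replication >= 2 and remote_racks:
        remote_rack = rng.choice(remote_racks)
        chosen.append(rng.choice(cluster[remote_rack]))

        # Replica 3: a different node on that same remote rack, which keeps
        # cross-rack traffic lower than using a third rack.
        if replication >= 3:
            peers = [n for n in cluster[remote_rack] if n not in chosen]
            if peers:
                chosen.append(rng.choice(peers))

    # Fallback (single rack, small racks, or more than 3 copies):
    # fill the remaining slots with random unused nodes.
    unused = [n for n in rack_of if n not in chosen]
    rng.shuffle(unused)
    while len(chosen) < replication and unused:
        chosen.append(unused.pop())
    return chosen


if __name__ == "__main__":
    demo_cluster = {
        "rack1": ["node1", "node2", "node3"],
        "rack2": ["node4", "node5", "node6"],
    }
    print(place_replicas(demo_cluster, "node1"))
```

With two racks, the result always starts with the writer's node and places the other two copies on distinct nodes of the other rack; which nodes are picked varies from run to run.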

5 Sample Hadoop File System

6 Hadoop “Eco-System”
- Hive: allows SQL-like querying of data in HDFS
- Pig: basic scripting language for Hadoop
- Databases: HBase, Accumulo, Cassandra, Neo4j

7 Map / Reduce Parallel Execution Framework

8 Map / Reduce Parallel Execution Framework

9 WordCount Example
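
The slide's code did not survive in this transcript, so the following is a stand-in rather than the presenter's example. It is a minimal pure-Python sketch of how a word count flows through the map, shuffle, and reduce phases. It runs in a single process and does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative only. In a real job the framework performs the shuffle and distributes the map and reduce calls across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Each mapper call sees only one line and each reducer call sees only one key, which is what lets the framework run many of them in parallel on different machines.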

10 Getting Started
Cloudera and Hortonworks offer sandboxes that are easy to download and are fully contained implementations in a VM. Hadoop can also be downloaded from Apache.
- http://hortonworks.com/products/hortonworks-sandbox/
- http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
- http://hadoop.apache.org/releases.html

11 Developing In Map / Reduce
- Standalone Mode – Hadoop runs as a single process, best for debugging
- Pseudo-Distributed – Separate processes on the same server
- Fully Distributed – Full-blown cluster

12 Eclipse Framework
Write code in Eclipse on a PC or Linux
Options:
- Run Hadoop on Windows
- Run Eclipse in Linux with Plugin
- Run Eclipse in Windows, remote debug and profiling
Profiling: YourKit

13 WordCount
- Create a project in Eclipse
- Load the WordCount code (widely available and in sandbox downloads)
- Compile the jar file
- Execute on Hadoop in standalone mode:
    $ hadoop jar path/to/file.jar input output

14 Monitoring Hadoop Jobs

15 Monitoring Hadoop Jobs

16 Resources
- http://www.cloudera.com
- http://www.hortonworks.com
- hadoop.apache.org
- http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf
- Hadoop: The Definitive Guide by Tom White

17 Example: Document AutoClustering using Hadoop and Storm
https://www.youtube.com/watch?v=5X65WV0n4rU

