How to Build Big Data Pipelines for Hadoop

Name: How to Build Big Data Pipelines for Hadoop
Uploaded: 2017-07-11T06:38:24+00:00
Duration: PT34M59S
Channel: Jaden Gallagher
Description: How to Build Big Data Pipelines for Hadoop

How to Build Big Data Pipelines for Hadoop
Dr. Mark Pollack

Big Data
“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
A subjective and moving target.
Big data in many sectors today ranges from tens of TB to multiple PB.

Enterprise Data Trends

The Data Access Landscape - The Value of Data
Value from data exceeds hardware and software costs.
Value lies in connecting data sets, for example grouping e-commerce users by user agent:
Orbitz shows more expensive hotels to Mac users.
See: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3c
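The grouping idea on this slide can be sketched in a few lines of plain Java. This is only an illustration: the class name, the `platformOf` rules, and the sample user-agent strings (other than the Macintosh one, shortened from the slide) are assumptions, and real user-agent parsing needs far more rules than a substring check.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class UserAgentGrouping {

    // Hypothetical, deliberately crude classifier based on substrings of the user agent.
    static String platformOf(String userAgent) {
        if (userAgent.contains("Macintosh")) {
            return "mac";
        }
        if (userAgent.contains("Windows")) {
            return "windows";
        }
        return "other";
    }

    // Count requests per platform; a TreeMap keeps the keys sorted for stable printing.
    static Map<String, Long> countByPlatform(List<String> userAgents) {
        return userAgents.stream()
                .collect(Collectors.groupingBy(UserAgentGrouping::platformOf, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> agents = List.of(
                "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)",
                "curl/7.29.0");
        System.out.println(countByPlatform(agents)); // prints {mac=1, other=1, windows=1}
    }
}
```

At big-data scale the same grouping would run as a distributed job rather than an in-memory stream, but the shape of the computation (classify, group, count) stays the same.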

Spring and Data Access
Spring has always provided excellent data access support:
  Transaction management
  Portable data access exception hierarchy
  JDBC – JdbcTemplate
  ORM – Hibernate, JPA, JDO, iBATIS support
  Cache support (Spring 3.1)
The Spring Data project started in 2010.
Its goal is to “refresh” Spring’s data access support in light of the new data access landscape.

Spring Data Mission Statement
“Provide a familiar and consistent Spring-based programming model for Big Data, NoSQL, and relational stores while retaining store-specific features and capabilities.”

Spring Data – Supported Technologies
Relational: JPA, JDBC Extensions
NoSQL: Redis, HBase, Mongo, Neo4j, Lucene, Gemfire
Big Data: Hadoop (HDFS and M/R, Hive, Pig, Cascading), Splunk
Access: Repositories, QueryDSL, REST

A View of a Big Data System
[Diagram: where Spring projects can be used to provide a solution. Labeled components: Data Streams (Log Files, Sensors, Mobile), SaaS, Social, Ingestion Engine, Unstructured Data Store, Stream Processing, Batch Analysis, Interactive Processing (Structured DB), Real Time Analytics, Distribution Engine, Integration Apps, Analytical Apps, Monitoring / Deployment]

Big Data Problems are Integration Problems
Real-world big data solutions require workflow across systems.
They share core components of a classic integration workflow.
Big data solutions need to integrate with existing data and apps.
Event-driven processing.
Batch workflows.

Spring projects offer substantial integration functionality
Spring Integration: for building and configuring message-based integration flows using input and output adapters, channels, and processors.
Spring Batch: for building and operating batch workflows and manipulating data in files and ETL. Basis for JSR 352 in EE7...

Spring projects offer substantial integration functionality
Spring Data: for manipulating data in relational DBs as well as a variety of NoSQL databases and data grids (inside Gemfire 7.0).
Spring for Apache Hadoop: for orchestrating Hadoop and non-Hadoop workflows in conjunction with Batch and Integration processing (inside GPHD 1.2).

Integration is an essential part of Big Data

Some Existing Big Data Integration tools

Hadoop as a Big Data Platform

Spring for Hadoop - Goals
Hadoop has a poor out-of-the-box programming model.
Applications are generally a collection of scripts calling command-line apps.
Spring simplifies developing Hadoop applications by providing a familiar and consistent programming and configuration model across a wide range of use cases:
  HDFS usage
  Data analysis (MR/Pig/Hive/Cascading)
  Workflow
  Event streams
  Integration
Allowing you to start small and grow.

Relationship with other Spring projects

Spring Hadoop – Core Functionality

Capabilities: Spring + Hadoop
Declarative configuration
  Create, configure, and parameterize Hadoop connectivity and all job types
  Environment profiles: easily move from dev to QA to prod
Developer productivity
  Create well-formed applications, not spaghetti script applications
  Simplify HDFS and FsShell API with support for JVM scripting
  Runner classes for MR/Pig/Hive/Cascading for small workflows
  Helper “Template” classes for Pig/Hive/HBase

Core Map Reduce idea

Counting Words – Configuring M/R
Standard Hadoop APIs:
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setInputFormatClass(TextInputFormat.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.waitForCompletion(true);

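Counting Words – M/R Code
Standard Hadoop API - Mapper:
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    // emit (word, 1) for every whitespace-separated token in the input line
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }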

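Counting Words – M/R Code
Standard Hadoop API - Reducer:
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    // sum all the counts emitted for one word and write (word, total)
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }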
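To make the core MapReduce idea behind the mapper and reducer concrete without a cluster, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class `WordCountSketch` and its `map`, `shuffle`, and `reduce` methods are hypothetical names that only imitate the three phases (emit pairs, group by key, aggregate per key) on a single JVM.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSketch {

    // "Map" phase: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // "Shuffle" phase: group all emitted values by key (Hadoop does this between map and reduce).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or", "not to be"))); // prints {be=2, not=1, or=1, to=2}
    }
}
```

In a real job the map calls run on many nodes in parallel and the framework performs the shuffle over the network; only the `map` and `reduce` bodies are written by the developer.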

Running Hadoop Example Jars
Standard Hadoop:
  bin/hadoop jar hadoop-examples.jar wordcount /wc/input /wc/output
SHDP (Spring Hadoop):
  <hdp:configuration />
  <hdp:jar-runner id="wordcount" jar="hadoop-examples.jar">
    <hdp:arg value="wordcount" />
    <hdp:arg value="/wc/input" />
    <hdp:arg value="/wc/output"/>
  </hdp:jar-runner>

Running Hadoop Tools
Standard Hadoop:
  bin/hadoop jar -conf myhadoop-site.xml -D ignoreCase=true wordcount.jar org.myorg.WordCount /wc/input /wc/output
SHDP:
  <hdp:configuration resources="myhadoop-site.xml"/>
  <hdp:tool-runner id="wc" jar="wordcount.jar">
    <hdp:arg value="/wc/input" />
    <hdp:arg value="/wc/output"/>
    ignoreCase=true
  </hdp:tool-runner>

Configuring Hadoop
applicationContext.xml:
  <context:property-placeholder location="hadoop-dev.properties"/>
  <hdp:configuration>
    fs.default.name=${hd.fs}
  </hdp:configuration>
  <hdp:job id="word-count-job"
      input-path="${input.path}" output-path="${output.path}"
      jar="myjob.jar"
      mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
      reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
  <hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true" />
hadoop-dev.properties:
  input.path=/wc/input/
  output.path=/wc/word/
  hd.fs=hdfs://localhost:9000

HDFS and Hadoop Shell as APIs
Access all “bin/hadoop fs” commands through FsShell (mkdir, chmod, test):
  class MyScript {
    @Autowired FsShell fsh;
    @PostConstruct
    void init() {
      String outputDir = "/data/output";
      // remove output left over from a previous run
      if (fsh.test(outputDir)) {
        fsh.rmr(outputDir);
      }
    }
  }

HDFS and FsShell as APIs
FsShell is designed to support JVM scripting languages.
copy-files.groovy:
  // use the shell (made available under variable fsh)
  if (!fsh.test(inputDir)) {
    fsh.mkdir(inputDir);
    fsh.copyFromLocal(sourceFile, inputDir);
    fsh.chmod(700, inputDir)
  }
  if (fsh.test(outputDir)) {
    fsh.rmr(outputDir)
  }

appCtx.xml:
  <hdp:script id="setupScript" language="groovy">
    <hdp:property name="inputDir" value="${input}"/>
    <hdp:property name="outputDir" value="${output}"/>
    <hdp:property name="sourceFile" value="${source}"/>
    // use the shell (made available under variable fsh)
    if (!fsh.test(inputDir)) {
      fsh.mkdir(inputDir);
      fsh.copyFromLocal(sourceFile, inputDir);
      fsh.chmod(700, inputDir)
    }
    if (fsh.test(outputDir)) {
      fsh.rmr(outputDir)
    }
  </hdp:script>

Externalize Script
appCtx.xml:
  <script id="setupScript" location="copy-files.groovy">
    <property name="inputDir" value="${wordcount.input.path}"/>
    <property name="outputDir" value="${wordcount.output.path}"/>
    <property name="sourceFile" value="${localSourceFile}"/>
  </script>

$> demo

Streaming Jobs and Environment Configuration
Standard Hadoop:
  bin/hadoop jar hadoop-streaming.jar \
    -input /wc/input -output /wc/output \
    -mapper /bin/cat -reducer /bin/wc \
    -files stopwords.txt
Spring Hadoop:
  <context:property-placeholder location="hadoop-${env}.properties"/>
  <hdp:streaming id="wc" input-path="${input}" output-path="${output}"
      mapper="${cat}" reducer="${wc}"
      files="classpath:stopwords.txt">
  </hdp:streaming>
hadoop-dev.properties:
  input.path=/wc/input/
  output.path=/wc/word/
  hd.fs=hdfs://localhost:9000
Launch with the dev environment:
  env=dev java -jar SpringLauncher.jar applicationContext.xml

Streaming Jobs and Environment Configuration
Standard Hadoop:
  bin/hadoop jar hadoop-streaming.jar \
    -input /wc/input -output /wc/output \
    -mapper /bin/cat -reducer /bin/wc \
    -files stopwords.txt
Spring Hadoop:
  <context:property-placeholder location="hadoop-${env}.properties"/>
  <hdp:streaming id="wc" input-path="${input}" output-path="${output}"
      mapper="${cat}" reducer="${wc}"
      files="classpath:stopwords.txt">
  </hdp:streaming>
hadoop-qa.properties:
  input.path=/gutenberg/input/
  output.path=/gutenberg/word/
  hd.fs=hdfs://darwin:9000
Launch with the QA environment:
  env=qa java -jar SpringLauncher.jar applicationContext.xml
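The environment switch on the two slides above relies on `${...}` placeholders being resolved twice: once to pick the properties file (`hadoop-${env}.properties`) and once to fill values from that file into the configuration. The following plain-Java sketch imitates that mechanism under stated assumptions: `PlaceholderSketch` and `resolve` are hypothetical names, this is not Spring's own resolver, and the property values are set inline instead of being loaded from a file so the example stays self-contained.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSketch {

    // Matches ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace every ${name} in the template with the matching property value.
    static String resolve(String template, Properties props) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer resolved = new StringBuffer();
        while (matcher.find()) {
            String value = props.getProperty(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unresolved placeholder: " + matcher.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being treated as group references
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        Properties launch = new Properties();
        launch.setProperty("env", "dev"); // stands in for "env=dev" on the command line
        System.out.println(resolve("hadoop-${env}.properties", launch)); // prints hadoop-dev.properties

        Properties dev = new Properties();
        dev.setProperty("hd.fs", "hdfs://localhost:9000"); // value taken from the dev slide
        System.out.println(resolve("fs.default.name=${hd.fs}", dev)); // prints fs.default.name=hdfs://localhost:9000
    }
}
```

Switching `env` to `qa` changes only which file is selected; the application context itself stays untouched, which is the point the slides are making.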

Word Count – Injecting Jobs
Use dependency injection to obtain a reference to the Hadoop Job.
Perform additional runtime configuration and submit:
  public class WordService {
    @Inject
    private Job mapReduceJob;
    // Job.submit() throws checked exceptions, so they are declared here
    public void processWords() throws Exception {
      mapReduceJob.submit();
    }
  }

What is Pig?
An alternative to writing MapReduce applications that improves productivity.
Pig applications are written in the Pig Latin language.
Pig Latin is a high-level data processing language, in the spirit of sed and awk, not SQL.
Pig Latin describes a sequence of steps; each step performs a transformation on the items of data in a collection.
Extensible with user-defined functions (UDFs).
A PigServer is responsible for translating Pig Latin to MR.

Counting Words – PigLatin Script
  input_lines = LOAD '/tmp/books' AS (line:chararray);
  -- Extract words from each line and put them into a pig bag
  words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  -- filter out any words that are just white spaces
  filtered_words = FILTER words BY word MATCHES '\\w+';
  -- create a group for each word
  word_groups = GROUP filtered_words BY word;
  -- count the entries in each group
  word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
  ordered_word_count = ORDER word_count BY count DESC;
  STORE ordered_word_count INTO '/tmp/number-of-words';
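Because Pig Latin describes a sequence of transformations on a collection, the script above maps naturally onto a stream pipeline. The sketch below mirrors each step in plain Java for readers who know Java better than Pig. It is an approximation under stated assumptions: the class and method names are hypothetical, tokens are split on whitespace only (Pig's TOKENIZE also splits on a few other delimiters), and ties in the count are broken alphabetically, which the Pig script does not specify.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PigStepsSketch {

    static List<Map.Entry<String, Long>> orderedWordCount(List<String> inputLines) {
        Map<String, Long> wordCount = inputLines.stream()
                // words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line))
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // filtered_words = FILTER words BY word MATCHES '\\w+'
                .filter(word -> word.matches("\\w+"))
                // word_groups = GROUP ... BY word; word_count = ... COUNT(filtered_words)
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // ordered_word_count = ORDER word_count BY count DESC (ties by word: an assumption)
        return wordCount.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(orderedWordCount(List.of("a b a", "b a -"))); // prints [a=3, b=2]
    }
}
```

The difference that matters in practice is execution: the PigServer compiles the script into MapReduce jobs over HDFS data, whereas this stream runs in one JVM over an in-memory list.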

Using Pig
Standard Pig:
  pig -x mapreduce wordcount.pig
  pig wordcount.pig -P pig.properties -p pig.exec.nocombiner=true
Spring Hadoop creates a PigServer, with optional execution of scripts on application startup:
  <pig-factory job-name="wc" properties-location="pig.properties">
    pig.exec.nocombiner=true
    <script location="wordcount.pig">
      <arguments>ignoreCase=TRUE</arguments>
    </script>
  </pig-factory>

Spring’s PigRunner
Execute a small Pig workflow (HDFS, PigLatin, HDFS):
  <pig-factory job-name="analysis" properties-location="pig-server.properties"/>
  <script id="hdfsScript" location="copy-files.groovy">
    <property name="sourceFile" value="${localSourceFile}"/>
    <property name="inputDir" value="${inputDir}"/>
    <property name="outputDir" value="${outputDir}"/>
  </script>
  <pig-runner id="pigRunner" pre-action="hdfsScript" run-at-startup="true">
    <script location="wordCount.pig">
      <arguments>
        inputDir=${inputDir}
        outputDir=${outputDir}
      </arguments>
    </script>
  </pig-runner>

Schedule a Pig job
PigRunner implements Callable.
Use Spring’s scheduling support:
  @Scheduled(cron = " * * ?")
  public void process() {
    pigRunner.call();
  }

PigTemplate
Simplifies the programmatic use of Pig; common tasks are ‘one-liners’.
  <configuration>
    fs.default.name=${hd.fs}
    mapred.job.tracker=${mapred.job.tracker}
  </configuration>
  <pig-factory id="pigFactory" properties-location="pig-server.properties"/>
  <pig-template pig-factory-ref="pigFactory"/>
  <beans:bean id="passwordRepository"
      class="com.oreilly.springdata.hadoop.pig.PigPasswordRepository"
      c:template-ref="pigTemplate"
      p:baseOutputDir="/data/password-repo/output"/>

PigTemplate - Programmatic Use
  public class PigPasswordRepository implements PasswordRepository {
    private PigTemplate pigTemplate;
    private String pigScript = "classpath:password-analysis.pig";
    public void processPasswordFile(String inputFile) {
      // each run writes to its own numbered output directory
      String outputDir = baseOutputDir + File.separator + counter.incrementAndGet();
      Properties scriptParameters = new Properties();
      scriptParameters.put("inputDir", inputFile);
      scriptParameters.put("outputDir", outputDir);
      pigTemplate.executeScript(pigScript, scriptParameters);
    }
    //...
  }

What is Hive?
An alternative to writing MapReduce applications that improves productivity.
Hive applications are written using HiveQL.
HiveQL is in the spirit of SQL.
A HiveServer is responsible for translating HiveQL to MR.
Access via JDBC, ODBC, or Thrift RPC.

Counting Words - HiveQL
  -- import the file as lines
  CREATE EXTERNAL TABLE lines(line string);
  LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;
  -- create a virtual view that splits the lines
  SELECT word, count(*) FROM lines
    LATERAL VIEW explode(split(line, ' ')) lTable AS word
    GROUP BY word;

Using Hive
Command-line:
  $HIVE_HOME/bin/hive -f wordcount.sql -d ignoreCase=TRUE -h hive-server.host
JDBC based:
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  Connection con = DriverManager.getConnection("jdbc:hive://server:port/default", "", "");
  try {
    Statement stmt = con.createStatement();
    ResultSet res = stmt.executeQuery("…");
    ...
    while (res.next()) {…}
  } catch (SQLException ex) {
  } finally {
    try { con.close(); } catch (Exception ex) {}
  }

Using Hive with Spring Hadoop
Access Hive using the JDBC client and use JdbcTemplate:
  <bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver"/>
  <bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
      c:driver-ref="hive-driver" c:url="${hive.url}"/>
  <bean id="template" class="org.springframework.jdbc.core.JdbcTemplate"
      c:data-source-ref="hive-ds"/>

Using Hive with Spring Hadoop
Reuse existing knowledge of Spring’s rich ResultSet-to-POJO mapping features:
  public long count() {
    return jdbcTemplate.queryForLong("select count(*) from " + tableName);
  }
  List<Password> result = jdbcTemplate.query("select * from passwords",
    new ResultSetExtractor<List<Password>>() {
      public List<Password> extractData(ResultSet rs) throws SQLException {
        // extract data from result set
      }
    });

Standard Hive – Thrift API
HiveClient is not thread-safe and throws checked exceptions:
  public long count() {
    HiveClient hiveClient = createHiveClient();
    try {
      hiveClient.execute("select count(*) from " + tableName);
      return Long.parseLong(hiveClient.fetchOne());
    // checked exceptions
    } catch (HiveServerException ex) {
      throw translateException(ex);
    } catch (org.apache.thrift.TException tex) {
      throw translateException(tex);
    } finally {
      try {
        hiveClient.shutdown();
      } catch (org.apache.thrift.TException tex) {
        logger.debug("Unexpected exception on shutting down HiveClient", tex);
      }
    }
  }
  protected HiveClient createHiveClient() {
    TSocket transport = new TSocket(host, port, timeout);
    HiveClient hive = new HiveClient(new TBinaryProtocol(transport));
    try {
      transport.open();
    } catch (TTransportException e) {
      throw translateException(e);
    }
    return hive;
  }
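The boilerplate in the Thrift example (create a client per call, translate checked exceptions, always shut down) is the kind of code a "Template" helper class factors out. The sketch below shows that pattern in plain Java. All types here (`Client`, `ClientFactory`, `ClientCallback`, `DataAccessFailure`, `StubClient`) are hypothetical stand-ins invented for illustration; this is not the Spring for Apache Hadoop API.

```java
class TemplateSketch {

    // Hypothetical stand-in for a client that, like the Thrift client above, throws checked exceptions.
    interface Client {
        String fetchOne(String query) throws Exception;
        void shutdown() throws Exception;
    }

    interface ClientFactory {
        Client create() throws Exception;
    }

    interface ClientCallback<T> {
        T doWith(Client client) throws Exception;
    }

    // Unchecked exception, so callers are not forced to catch it.
    static class DataAccessFailure extends RuntimeException {
        DataAccessFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Create a fresh client per call, run the callback, translate exceptions, always shut down.
    static <T> T execute(ClientFactory factory, ClientCallback<T> action) {
        Client client = null;
        try {
            client = factory.create();
            return action.doWith(client);
        } catch (RuntimeException ex) {
            throw ex;
        } catch (Exception ex) {
            throw new DataAccessFailure("client call failed", ex);
        } finally {
            if (client != null) {
                try {
                    client.shutdown();
                } catch (Exception ex) {
                    // as in the slide, a failed shutdown is reported but not rethrown
                    System.err.println("Unexpected exception on shutdown: " + ex);
                }
            }
        }
    }

    // Stub used only to make the sketch runnable without a server.
    static class StubClient implements Client {
        boolean closed;

        public String fetchOne(String query) throws Exception {
            if (query.isEmpty()) {
                throw new Exception("empty query");
            }
            return "42";
        }

        public void shutdown() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        StubClient stub = new StubClient();
        long count = Long.parseLong(execute(() -> stub, c -> c.fetchOne("select count(*) from passwords")));
        System.out.println(count + " closed=" + stub.closed); // prints 42 closed=true
    }
}
```

With a helper like this, the `count()` method from the slide shrinks to a single `execute` call, and thread-safety follows from never sharing a client between calls.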

Spring Hadoop – Batch & Integration

Hadoop Workflows managed by Spring Batch
Reuse same Batch infrastructure and knowledge to manage Hadoop workflows Step can be any Hadoop job type or HDFS script

Capabilities: Spring + Hadoop + Batch
Pipeline stages: Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute
Use Spring Batch for file-, DB-, and NoSQL-driven applications:
  Collect: process local files
  Transform: scripting or Java code to transform and enrich
  RT Analysis: N/A
  Ingest: (batch/aggregate) write to HDFS or split/filtering
  Batch Analysis: orchestrate Hadoop steps in a workflow
  Distribute: copy data out of HDFS to structured storage
JMX enabled, along with a REST interface for job control.

Spring Batch Configuration for Hadoop
  <job id="job1">
    <step id="import" next="wordcount">
      <tasklet ref="import-tasklet"/>
    </step>
    <step id="wordcount" next="pig">
      <tasklet ref="wordcount-tasklet"/>
    </step>
    <step id="pig" next="parallel">
      <tasklet ref="pig-tasklet"/>
    </step>
    <split id="parallel" next="hdfs">
      <flow>
        <step id="mrStep">
          <tasklet ref="mr-tasklet"/>
        </step>
      </flow>
      <flow>
        <step id="hive">
          <tasklet ref="hive-tasklet"/>
        </step>
      </flow>
    </split>
    <step id="hdfs">
      <tasklet ref="hdfs-tasklet"/>
    </step>
  </job>
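The control flow this job definition describes (three sequential steps, then two flows in parallel, then a final step once both flows finish) can be sketched with a plain `ExecutorService`. This is only an illustration of the ordering: `WorkflowSketch` and `run` are hypothetical names, each "step" just records its name, and none of Spring Batch's restart, retry, or job-repository behavior is modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WorkflowSketch {

    // Returns the order in which the steps ran.
    static List<String> run(ExecutorService pool) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        // Sequential steps: each one starts only after the previous one has finished.
        for (String step : List.of("import", "wordcount", "pig")) {
            log.add(step);
        }

        // Split: run both flows in parallel and wait for both before moving on.
        List<Callable<Void>> flows = List.of(
                () -> { log.add("mrStep"); return null; },
                () -> { log.add("hive"); return null; });
        try {
            for (Future<Void> flow : pool.invokeAll(flows)) {
                flow.get();
            }
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("parallel flow failed", ex);
        }

        // Final step after the split has completed.
        log.add("hdfs");
        return log;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // "mrStep" and "hive" may appear in either order; everything else is fixed.
            System.out.println(run(pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

The declarative XML buys what this sketch leaves out: persisted step status, restartability, and operational control over the job.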

Spring Batch Configuration for Hadoop
Reuse previous Hadoop job definitions:
  <script-tasklet id="import-tasklet">
    <script location="clean-up-wordcount.groovy"/>
  </script-tasklet>
  <tasklet id="wordcount-tasklet" job-ref="wordcount-job"/>
  <job id="wordcount-job" scope="prototype"
      input-path="${input.path}"
      mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
      reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>
  <pig-tasklet id="pig-tasklet">
    <script location="org/company/pig/handsome.pig" />
  </pig-tasklet>

Capabilities: Spring + Hadoop + SI
Pipeline stages: Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute
Use Spring Integration for event-driven applications:
  Collect: single-node or distributed data collection (TCP/JMS/Rabbit)
  Transform: scripting or Java code to transform and enrich
  RT Analysis: connectivity to multiple analysis techniques
  Ingest: write to HDFS, split/filter data stream to other stores
JMX enabled, plus a control bus for starting/stopping individual components.

Ingesting – Copying Local Log Files into HDFS
Poll a local directory for files; files are rolled over every 10 min.
Copy files to a staging area and then to HDFS.
Use an aggregator to wait to “process all files available every hour” before launching the MR job.
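The aggregator step above can be pictured as a buffer that collects file names and releases them as one batch when the hour changes. The sketch below is a minimal plain-Java illustration of that release rule, assuming files carry a timestamp and that "every hour" means a change of clock hour; `HourlyAggregatorSketch` and its `add` method are hypothetical and are not Spring Integration's aggregator API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class HourlyAggregatorSketch {

    private static final long MILLIS_PER_HOUR = 3_600_000L;

    private final List<String> pending = new ArrayList<>();
    private long currentHour = -1;

    // Buffers the file; returns the completed batch of the previous hour when a file
    // from a new hour arrives, otherwise an empty list.
    List<String> add(String fileName, long timestampMillis) {
        long hour = timestampMillis / MILLIS_PER_HOUR;
        List<String> released = Collections.emptyList();
        if (currentHour != -1 && hour != currentHour) {
            released = new ArrayList<>(pending); // this batch would trigger the MR job
            pending.clear();
        }
        currentHour = hour;
        pending.add(fileName);
        return released;
    }

    public static void main(String[] args) {
        HourlyAggregatorSketch aggregator = new HourlyAggregatorSketch();
        // six files rolled over every 10 minutes within the first hour: nothing is released yet
        for (int i = 0; i < 6; i++) {
            aggregator.add("log-" + i, i * 600_000L);
        }
        // the first file of the next hour releases the previous hour's batch
        System.out.println(aggregator.add("log-6", MILLIS_PER_HOUR));
        // prints [log-0, log-1, log-2, log-3, log-4, log-5]
    }
}
```

A production aggregator would also need a timeout-based release, so that the last batch is not held forever when no further file arrives.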

Ingesting Syslog into HDFS
Use the syslog adapter.
A transformer categorizes messages.
Route to specific channels based on category.
One route leads to an HDFS write, with filtered data stored in Redis.

Ingesting Multi-node Syslog into HDFS
Syslog collection across multiple machines.
Use TCP adapters to forward events, or other middleware.

Ingesting JDBC to HDFS
Use Spring Batch: JdbcItemReader, FileItemWriter
  <step id="step1">
    <tasklet>
      <chunk reader="jdbcItemReader" processor="itemProcessor" writer="flatFileItemWriter"
          commit-interval="100" retry-limit="3"/>
    </tasklet>
  </step>
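The `commit-interval="100"` attribute above means items are read and processed one at a time but written in chunks of 100. A minimal plain-Java sketch of that loop follows. It is an illustration only: `ChunkSketch` and `processInChunks` are hypothetical names, the reader is a plain `Iterator`, and transactions and `retry-limit` handling are not modeled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class ChunkSketch {

    // Reads items one by one, processes each, and hands them to the writer in chunks.
    // Returns the number of chunks written.
    static <I, O> int processInChunks(Iterator<I> reader, Function<I, O> processor,
                                      Consumer<List<O>> writer, int commitInterval) {
        int chunks = 0;
        List<O> buffer = new ArrayList<>();
        while (reader.hasNext()) {
            buffer.add(processor.apply(reader.next()));
            if (buffer.size() == commitInterval) {
                writer.accept(new ArrayList<>(buffer)); // one "commit" per full chunk
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) {
            writer.accept(new ArrayList<>(buffer)); // final partial chunk
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int id = 0; id < 250; id++) {
            rows.add(id);
        }
        int chunks = processInChunks(rows.iterator(), id -> "row-" + id,
                chunk -> System.out.println("wrote " + chunk.size() + " items"), 100);
        System.out.println(chunks + " chunks"); // prints 3 chunks (after three "wrote" lines: 100, 100, 50)
    }
}
```

Writing in chunks keeps the number of write operations against HDFS or a database low, while a failure costs at most one chunk of rework.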

Exporting HDFS to local Files
Use FsShell; include it as a step in a Batch workflow.
Spring Batch can fire events when jobs end…
SI can poll HDFS…
  <hdp:script id="hdfsCopy" language="groovy">
    <hdp:property name="sourceDir" value="${sourceDir}"/>
    <hdp:property name="outputDir" value="${outputDir}"/>
    // use the shell (made available under variable fsh)
    fsh.copyToLocal(sourceDir, outputDir);
  </hdp:script>
  <step id="hdfsStep">
    <script-tasklet script-ref="hdfsCopy"/>
  </step>

Exporting HDFS to JDBC
Use Spring Batch: MultiFileItemReader, JdbcItemWriter
  <step id="step1">
    <tasklet>
      <chunk reader="flatFileItemReader" processor="itemProcessor" writer="jdbcItemWriter"
          commit-interval="100" retry-limit="3"/>
    </tasklet>
  </step>

Exporting HDFS to Mongo
Use Spring Batch: MultiFileItemReader, MongoItemWriter
  <step id="step1">
    <tasklet>
      <chunk reader="flatFileItemReader" processor="itemProcessor" writer="mongoItemWriter"/>
    </tasklet>
  </step>

CEP – Style Data Pipeline
Esper for CEP functionality.
Gemfire for Continuous Query as well as “data capacitor”-like functionality.
Greenplum Database as another ‘big data store’ for ingestion.
[Diagram labels: HTTP, Transform, Esper, Gemfire, HDFS, Route, Consumer, Endpoint, Filter, GPDB]

Thank You!

Resources
Prepping for GA – feedback welcome
Project Page: springsource.org/spring-data/hadoop
Source Code: github.com/SpringSource/spring-hadoop
Books
