How to Build Big Data Pipelines for Hadoop Dr. Mark Pollack.


1 How to Build Big Data Pipelines for Hadoop Dr. Mark Pollack

2 Big Data
Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. It is a subjective and moving target; big data in many sectors today ranges from tens of TB to multiple PB.

3 Enterprise Data Trends 3

4 Value from Data
–Value from data exceeds hardware & software costs
–Value in connecting data sets: grouping e-commerce users by user agent (Orbitz shows more expensive hotels to Mac users)
–See The Data Access Landscape - The Value of Data
Example user agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3c

5 Spring and Data Access
Spring has always provided excellent data access support
–Transaction management
–Portable data access exception hierarchy
–JDBC – JdbcTemplate
–ORM – Hibernate, JPA, JDO, iBATIS support
–Cache support (Spring 3.1)
The Spring Data project started in 2010; its goal is to refresh Spring's data access support in light of the new data access landscape.

6 Spring Data Mission Statement 6 Provide a familiar and consistent Spring-based programming model for Big Data, NoSQL, and relational stores while retaining store-specific features and capabilities.

7 Spring Data – Supported Technologies
–Relational: JPA, JDBC Extensions
–NoSQL: Redis, HBase, Mongo, Neo4j, Lucene, Gemfire
–Big Data: Hadoop (HDFS and M/R, Hive, Pig, Cascading), Splunk
–Access: Repositories, QueryDSL, REST

8 A View of a Big Data System (diagram)
Components shown: data streams (log files, sensors, mobile), SaaS and social sources, an ingestion engine, an unstructured data store, stream processing, interactive processing (structured DB), batch analysis, real-time analytics, analytical apps, integration apps, a distribution engine, and monitoring/deployment. The highlights mark where Spring projects can be used to provide a solution.

9 Big Data Problems are Integration Problems
Real-world big data solutions require workflow across systems and share the core components of a classic integration workflow. Big data solutions need to integrate with existing data and apps, supporting both event-driven processing and batch workflows.

10 Spring projects offer substantial integration functionality
–Spring Integration for building and configuring message-based integration flows using input and output adapters, channels, and processors
–Spring Batch for building and operating batch workflows and manipulating data in files and ETL; the basis for JSR 352 in Java EE 7

11 Spring projects offer substantial integration functionality
–Spring Data for manipulating data in relational DBs as well as a variety of NoSQL databases and data grids (inside GemFire 7.0)
–Spring for Apache Hadoop for orchestrating Hadoop and non-Hadoop workflows in conjunction with Batch and Integration processing (inside GPHD 1.2)

12 Integration is an essential part of Big Data 12

13 Some Existing Big Data Integration tools 13

14 Hadoop as a Big Data Platform 14

15 Spring for Hadoop – Goals
Hadoop has a poor out-of-the-box programming model; applications are generally a collection of scripts calling command-line apps. Spring simplifies developing Hadoop applications by providing a familiar and consistent programming and configuration model across a wide range of use cases:
–HDFS usage
–Data analysis (MR/Pig/Hive/Cascading)
–Workflow
–Event streams
–Integration
while allowing you to start small and grow.

16 Relationship with other Spring projects 16

17 Spring Hadoop – Core Functionality 17

18 Capabilities: Spring + Hadoop
Declarative configuration
–Create, configure, and parameterize Hadoop connectivity and all job types
–Environment profiles – easily move from dev to qa to prod
Developer productivity
–Create well-formed applications, not spaghetti script applications
–Simplify HDFS and FsShell API with support for JVM scripting
–Runner classes for MR/Pig/Hive/Cascading for small workflows
–Helper Template classes for Pig/Hive/HBase

19 Core Map Reduce idea 19

20 Counting Words – Configuring M/R (standard Hadoop APIs)

    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);

21 Counting Words – M/R Code (standard Hadoop API – Mapper)

    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

22 Counting Words – M/R Code (standard Hadoop API – Reducer)

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
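For reference, the configuration snippet and the two classes above can be assembled into a single runnable driver. A minimal sketch with the imports they assume; the class name, the setJarByClass call, and the use of TokenizerMapper/IntSumReducer (rather than the Map/Reduce placeholders on slide 20) are illustrative additions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCountDriver.class);       // so the cluster can locate the job jar
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(TokenizerMapper.class);      // mapper from slide 21
            job.setReducerClass(IntSumReducer.class);       // reducer from slide 22
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }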

23 Running Hadoop Example Jars – standard Hadoop command line vs. SHDP (Spring Hadoop)

    bin/hadoop jar hadoop-examples.jar wordcount /wc/input /wc/output

24 Running Hadoop Tools – standard Hadoop command line vs. SHDP

    bin/hadoop jar -conf myhadoop-site.xml -D ignoreCase=true wordcount.jar org.myorg.WordCount /wc/input /wc/output

25 Configuring Hadoop

    fs.default.name=${hd.fs}

26 HDFS and Hadoop Shell as APIs
Access all bin/hadoop fs commands through FsShell – mkdir, chmod, test, ...

    class MyScript {
        @Autowired FsShell fsShell;

        @PostConstruct
        void init() {
            String outputDir = "/data/output";
            if (fsShell.test(outputDir)) {
                fsShell.rmr(outputDir);
            }
        }
    }

27 HDFS and FsShell as APIs
FsShell is designed to support JVM scripting languages

copy-files.groovy:
    // use the shell (made available under variable fsh)
    if (!fsh.test(inputDir)) {
        fsh.mkdir(inputDir);
        fsh.copyFromLocal(sourceFile, inputDir);
        fsh.chmod(700, inputDir)
    }
    if (fsh.test(outputDir)) {
        fsh.rmr(outputDir)
    }

28 HDFS and FsShell as APIs – the same script declared inline in appCtx.xml:

    // use the shell (made available under variable fsh)
    if (!fsh.test(inputDir)) {
        fsh.mkdir(inputDir);
        fsh.copyFromLocal(sourceFile, inputDir);
        fsh.chmod(700, inputDir)
    }
    if (fsh.test(outputDir)) {
        fsh.rmr(outputDir)
    }

29 HDFS and FsShell as APIs – externalize the script and reference it from appCtx.xml
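The same shell operations can also be driven from plain Java outside of a script. A minimal sketch, assuming SHDP's FsShell can be constructed directly from a Hadoop Configuration; the namenode URI and paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.springframework.data.hadoop.fs.FsShell;

    public class HdfsHousekeeping {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:9000");   // placeholder namenode

            FsShell fsh = new FsShell(conf);    // same API the Groovy scripts above use
            String inputDir = "/wc/input";
            if (!fsh.test(inputDir)) {
                fsh.mkdir(inputDir);
                fsh.copyFromLocal("data/words.txt", inputDir);      // hypothetical local file
            }
        }
    }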

30 $> demo

31 Streaming Jobs and Environment Configuration

    bin/hadoop jar hadoop-streaming.jar \
      -input /wc/input -output /wc/output \
      -mapper /bin/cat -reducer /bin/wc \
      -files stopwords.txt

    env=dev java -jar SpringLauncher.jar applicationContext.xml

dev properties:
    input.path=/wc/input/
    output.path=/wc/word/
    hd.fs=hdfs://localhost:9000

32 Streaming Jobs and Environment Configuration

    bin/hadoop jar hadoop-streaming.jar \
      -input /wc/input -output /wc/output \
      -mapper /bin/cat -reducer /bin/wc \
      -files stopwords.txt

    env=qa java -jar SpringLauncher.jar applicationContext.xml

qa properties:
    input.path=/gutenberg/input/
    output.path=/gutenberg/word/
    hd.fs=hdfs://darwin:9000

33 Word Count – Injecting Jobs
Use dependency injection to obtain a reference to the Hadoop Job, then perform additional runtime configuration and submit.

    public class WordService {
        @Autowired
        private Job mapReduceJob;

        public void processWords() {
            mapReduceJob.submit();
        }
    }

34 Pig 34

35 What is Pig?
An alternative to writing MapReduce applications, intended to improve productivity. Pig applications are written in the Pig Latin language, a high-level data processing language in the spirit of sed and awk, not SQL. Pig Latin describes a sequence of steps, each performing a transformation on items of data in a collection. It is extensible with user-defined functions (UDFs). A PigServer is responsible for translating Pig Latin to MapReduce.
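To make that last point concrete, here is a minimal sketch of driving Pig through the raw PigServer API, the plumbing that Spring Hadoop's Pig support wraps; the paths and aliases are illustrative:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {

        public static void main(String[] args) throws Exception {
            // Connect to the cluster and register Pig Latin statements;
            // the PigServer compiles them into MapReduce jobs.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerQuery("input_lines = LOAD '/tmp/books' AS (line:chararray);");
            pig.registerQuery("words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("word_groups = GROUP words BY word;");
            pig.registerQuery("word_count = FOREACH word_groups GENERATE group AS word, COUNT(words);");
            pig.store("word_count", "/tmp/number-of-words");   // triggers execution of the plan
            pig.shutdown();
        }
    }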

36 Counting Words – Pig Latin Script

    input_lines = LOAD '/tmp/books' AS (line:chararray);
    -- Extract words from each line and put them into a pig bag
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- filter out any words that are just white spaces
    filtered_words = FILTER words BY word MATCHES '\\w+';
    -- create a group for each word
    word_groups = GROUP filtered_words BY word;
    -- count the entries in each group
    word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
    ordered_word_count = ORDER word_count BY count DESC;
    STORE ordered_word_count INTO '/tmp/number-of-words';

37 Using Pig – standard Pig vs. Spring Hadoop
Spring Hadoop creates a PigServer and optionally executes scripts on application startup.

    pig -x mapreduce wordcount.pig
    pig wordcount.pig -P -p pig.exec.nocombiner=true

38 Spring's PigRunner
Execute a small Pig workflow (HDFS, Pig Latin, HDFS), passing the script parameters inputDir=${inputDir} and outputDir=${outputDir}.

39 Schedule a Pig Job
PigRunner implements Callable; use Spring's scheduling support.

    @Scheduled(cron = "0 0 * * * ?")
    public void process() {
        pigRunner.call();
    }

40 PigTemplate
Simplifies the programmatic use of Pig; common tasks are one-liners. Configured with Hadoop properties such as fs.default.name=${hd.fs} and mapred.job.tracker=${mapred.job.tracker}.

41 PigTemplate – Programmatic Use

    public class PigPasswordRepository implements PasswordRepository {
        private PigTemplate pigTemplate;
        private String pigScript = "classpath:password-analysis.pig";

        public void processPasswordFile(String inputFile) {
            String outputDir = baseOutputDir + File.separator + counter.incrementAndGet();
            Properties scriptParameters = new Properties();
            scriptParameters.put("inputDir", inputFile);
            scriptParameters.put("outputDir", outputDir);
            pigTemplate.executeScript(pigScript, scriptParameters);
        }
        //...
    }

42 Hive 42

43 What is Hive?
An alternative to writing MapReduce applications, intended to improve productivity. Hive applications are written using HiveQL, which is in the spirit of SQL. A HiveServer is responsible for translating HiveQL to MapReduce; access is via JDBC, ODBC, or Thrift RPC.

44 Counting Words – HiveQL

    -- import the file as lines
    CREATE EXTERNAL TABLE lines(line string);
    LOAD DATA INPATH 'books' OVERWRITE INTO TABLE lines;
    -- create a virtual view that splits the lines
    SELECT word, count(*) FROM lines LATERAL VIEW explode(split(line, ' ')) lTable as word GROUP BY word;

45 Using Hive – command line and JDBC

    $HIVE_HOME/bin/hive -f wordcount.sql -d ignoreCase=TRUE -h <hostname>

    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection("jdbc:hive://server:port/default", "", "");
    try {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery(...);
        while ( { ... }
    } catch (SQLException ex) {
    } finally {
        try { con.close(); } catch (Exception ex) {}
    }

46 Using Hive with Spring Hadoop
Access Hive using the JDBC client and use JdbcTemplate.
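A minimal sketch of that wiring done programmatically rather than in XML; the driver class is the one named on the previous slide, and the URL, host, and port are placeholders:

    import java.sql.Driver;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.datasource.SimpleDriverDataSource;

    public class HiveJdbcSetup {

        // Builds a JdbcTemplate on top of the Hive JDBC driver so that HiveQL
        // can be issued with the same API used for relational databases.
        public static JdbcTemplate hiveJdbcTemplate() throws Exception {
            Driver driver = (Driver) Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver").newInstance();
            SimpleDriverDataSource dataSource =
                    new SimpleDriverDataSource(driver, "jdbc:hive://localhost:10000/default", "", "");
            return new JdbcTemplate(dataSource);
        }
    }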

47 Using Hive with Spring Hadoop
Reuse existing knowledge of Spring's rich ResultSet-to-POJO mapping features.

    public long count() {
        return jdbcTemplate.queryForLong("select count(*) from " + tableName);
    }

    List<String> result = jdbcTemplate.query("select * from passwords", new ResultSetExtractor<List<String>>() {
        public List<String> extractData(ResultSet rs) throws SQLException {
            // extract data from result set
        }
    });
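For row-by-row mapping to a POJO, Spring's RowMapper is the usual alternative to a ResultSetExtractor. A small illustrative example; the PasswordRecord class and the column names are hypothetical:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.core.RowMapper;

    // Hypothetical POJO for one row of the passwords table.
    public class PasswordRecord {
        final String user;
        final int count;

        PasswordRecord(String user, int count) {
            this.user = user;
            this.count = count;
        }

        // Maps each ResultSet row to a PasswordRecord; column names are assumed.
        static final RowMapper<PasswordRecord> MAPPER = new RowMapper<PasswordRecord>() {
            public PasswordRecord mapRow(ResultSet rs, int rowNum) throws SQLException {
                return new PasswordRecord(rs.getString("user"), rs.getInt("count"));
            }
        };

        static List<PasswordRecord> findAll(JdbcTemplate jdbcTemplate) {
            return jdbcTemplate.query("select user, count from passwords", MAPPER);
        }
    }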

48 Standard Hive – Thrift API
HiveClient is not thread-safe and throws checked exceptions.

    public long count() {
        HiveClient hiveClient = createHiveClient();
        try {
            hiveClient.execute("select count(*) from " + tableName);
            return Long.parseLong(hiveClient.fetchOne());
        // checked exceptions
        } catch (HiveServerException ex) {
            throw translateException(ex);
        } catch (org.apache.thrift.TException tex) {
            throw translateException(tex);
        } finally {
            try {
                hiveClient.shutdown();
            } catch (org.apache.thrift.TException tex) {
                logger.debug("Unexpected exception on shutting down HiveClient", tex);
            }
        }
    }

    protected HiveClient createHiveClient() {
        TSocket transport = new TSocket(host, port, timeout);
        HiveClient hive = new HiveClient(new TBinaryProtocol(transport));
        try {
  ;
        } catch (TTransportException e) {
            throw translateException(e);
        }
        return hive;
    }
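This boilerplate is what the helper template classes mentioned on slide 18 aim to remove. A minimal sketch of the underlying callback idea, using hypothetical names rather than the actual Spring for Apache Hadoop HiveTemplate API; the Thrift and HiveClient types are the ones used in the snippet above:

    import org.apache.hadoop.hive.service.HiveClient;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class SimpleHiveTemplate {

        // Hypothetical callback: user code sees only the client, not its lifecycle.
        public interface HiveClientCallback<T> {
            T doInHive(HiveClient client) throws Exception;
        }

        private final String host;
        private final int port;
        private final int timeout;

        public SimpleHiveTemplate(String host, int port, int timeout) {
   = host;
            this.port = port;
            this.timeout = timeout;
        }

        public <T> T execute(HiveClientCallback<T> action) {
            TSocket transport = new TSocket(host, port, timeout);
            HiveClient client = new HiveClient(new TBinaryProtocol(transport));
            try {
      ;                                 // connect
                return action.doInHive(client);                  // run user code
            } catch (Exception ex) {
                throw new RuntimeException("Hive access failed", ex);  // unchecked translation
            } finally {
                try { client.shutdown(); } catch (Exception ignore) { /* best effort */ }
            }
        }
    }

With a wrapper like this, the count() method above collapses to a single execute(...) call whose callback runs the query and parses the result.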

49 Spring Hadoop – Batch & Integration 49

50 Hadoop Workflows Managed by Spring Batch
Reuse the same Batch infrastructure and knowledge to manage Hadoop workflows; a step can be any Hadoop job type or HDFS script.
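As a sketch of what "a step can be any Hadoop job type" means in practice, a Spring Batch Tasklet can wrap a pre-configured Hadoop Job; the class below is illustrative (Spring for Apache Hadoop ships its own runner/tasklet support for this):

    import org.apache.hadoop.mapreduce.Job;
    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;

    // Illustrative tasklet: runs a pre-configured Hadoop job as one step of a batch workflow.
    public class HadoopJobTasklet implements Tasklet {

        private final Job job;   // injected, e.g. the word-count job from earlier slides

        public HadoopJobTasklet(Job job) {
            this.job = job;
        }

        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
            boolean completed = job.waitForCompletion(true);    // block until the MR job finishes
            if (!completed) {
                throw new IllegalStateException("Hadoop job failed: " + job.getJobName());
            }
            return RepeatStatus.FINISHED;                       // single-shot step
        }
    }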

51 Capabilities: Spring + Hadoop + Batch
Spring Batch for file/DB/NoSQL driven applications
–Collect: process local files
–Transform: scripting or Java code to transform and enrich
–RT Analysis: N/A
–Ingest: (batch/aggregate) write to HDFS or split/filtering
–Batch Analysis: orchestrate Hadoop steps in a workflow
–Distribute: copy data out of HDFS to structured storage
–JMX enabled along with REST interface for job control
(Pipeline stages: Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use)

52 Spring Batch Configuration for Hadoop 52

53 Spring Batch Configuration for Hadoop – reuse previous Hadoop job definitions

54 Capabilities: Spring + Hadoop + SI
Spring Integration for event-driven applications
–Collect: single node or distributed data collection (TCP/JMS/Rabbit)
–Transform: scripting or Java code to transform and enrich
–RT Analysis: connectivity to multiple analysis techniques
–Ingest: write to HDFS, split/filter data stream to other stores
–JMX enabled + control bus for starting/stopping individual components
(Pipeline stages: Collect, Transform, RT Analysis, Ingest, Batch Analysis, Distribute, Use)

55 Ingesting: Copying Local Log Files into HDFS
Poll a local directory for files (files are rolled over every 10 minutes), copy them to a staging area and then to HDFS, and use an aggregator to wait until all files for the hour are available before launching the MR job.

56 Ingesting: Syslog into HDFS
Use the syslog adapter; a transformer categorizes messages, which are then routed to specific channels based on category. One route leads to an HDFS write, with filtered data stored in Redis.
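A minimal sketch of the categorize-and-route step as a plain POJO that Spring Integration can invoke as a router; the channel names and the categorization rule are hypothetical, and the surrounding flow (syslog and HDFS adapters) is wired in XML as on the slide:

    // Illustrative router: Spring Integration can invoke a POJO method
    // and use its return value as the name of the output channel.
    public class SyslogRouter {

        public String route(String syslogLine) {
            // Hypothetical categorization: error-level events go to the filtered
            // branch (HDFS write plus Redis), everything else to the plain HDFS write.
            if (syslogLine.contains("ERROR") || syslogLine.contains("CRIT")) {
                return "errorChannel";
            }
            return "hdfsWriteChannel";
        }
    }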

57 Ingesting: Multi-node Syslog into HDFS
Syslog collection across multiple machines; use TCP adapters (or other middleware) to forward events.

58 Ingesting: JDBC to HDFS
Use Spring Batch
–JdbcItemReader
–FileItemWriter
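A minimal sketch of that reader/writer pairing using concrete Spring Batch classes (the slide's names are shorthand); here JdbcCursorItemReader and FlatFileItemWriter, with a local staging file standing in for the HDFS target and the SQL, table, and paths as placeholders:

    import javax.sql.DataSource;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.file.FlatFileItemWriter;
    import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
    import org.springframework.core.io.FileSystemResource;
    import org.springframework.jdbc.core.SingleColumnRowMapper;

    // Illustrative wiring: stream rows out of a relational table and write them as lines.
    public class JdbcToFileConfig {

        public JdbcCursorItemReader<String> reader(DataSource dataSource) {
            JdbcCursorItemReader<String> reader = new JdbcCursorItemReader<String>();
            reader.setDataSource(dataSource);
            reader.setSql("select line from source_table");          // hypothetical table/column
            reader.setRowMapper(new SingleColumnRowMapper<String>(String.class));
            return reader;
        }

        public FlatFileItemWriter<String> writer() {
            FlatFileItemWriter<String> writer = new FlatFileItemWriter<String>();
            writer.setResource(new FileSystemResource("/tmp/staging/export.txt"));
            writer.setLineAggregator(new PassThroughLineAggregator<String>());
            return writer;
        }
    }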

59 Exporting HDFS to Local Files
Use FsShell, included as a step in the Batch workflow. Spring Batch can fire events when jobs end… SI can poll HDFS…

    // use the shell (made available under variable fsh)
    fsh.copyToLocal(sourceDir, outputDir);

60 Exporting HDFS to JDBC
Use Spring Batch
–MultiFileItemReader
–JdbcItemWriter

61 Exporting HDFS to Mongo
Use Spring Batch
–MultiFileItemReader
–MongoItemWriter

62 CEP-Style Data Pipeline
Pipeline components (diagram): HTTP endpoint, consumer, route, transform, filter, HDFS, Esper, Gemfire, GPDB.
–Esper for CEP functionality
–Gemfire for continuous query as well as data-capacitor-like functionality
–Greenplum Database as another big data store for ingestion

63 Thank You!

64 Resources
Prepping for GA – feedback welcome
–Project Page:
–Source Code:
–Books
