Data Analytics
김재윤, 이성민 (team leader), 이용현, 최찬희, 하승도

Contents Part I
1. Introduction
   - Data Analytics Cases
   - What is Data Analytics?
   - OLTP, OLAP
   - ROLAP
   - MOLAP
   - Column store
2. Applications & Data Management System
   - MATLAB
   - R
   - Impala
   - Splunk
   - HANA
3. Research Trend
   - HyPer (ICDE 2011): Combination of OLTP & OLAP
   - Starfish (CIDR 2011)
   - Crisis Informatics (ICSE 2011)

Contents Part II
4. Demo 1: Statistical Computing
   - MATLAB
   - R
5. Demo 2: Analytics on Hadoop
   - Pig
   - Hive
   - Impala
6. Demo 3: Real-time Analytics
   - Splunk

Overview
Criteria: Statistical Analytics, Query Processing, Time-series Analytics, Data Visualization, Open Source
- MATLAB: O X O O X
- R: O X O O O
- Impala: O X O
- Splunk: X O O O X
- HANA: O X X

Demo 1: Statistical Computing
- MATLAB
- R

MATLAB
Engineering software that provides a numerical analysis environment
- Matrix manipulations
- Plotting of functions and data
- Implementation of algorithms
- Creation of user interfaces
- Can interface with C, C++, Java, Fortran, and Python

MATLAB: Interface

MATLAB: Too slow to manage large data

MATLAB: Code example

Demo: Plot

Demo: Data Linking

Demo: Regression
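Assuming the regression demo fits a straight line by ordinary least squares, the underlying calculation can be sketched as follows. This is a stdlib-only Python illustration, not the MATLAB code shown in the demo; the function name `fit_line` and the sample points are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical sample points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

The slope is the ratio of the x-y covariance to the x variance, and the intercept follows from forcing the line through the mean point.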

Demo: Polynomial Fitting

R
Programming language for statistical computing and graphics
- Widely used among statisticians and data analysts
- Runs on Windows, Mac, and Linux
- Free to use
- Easily extensible through functions
- Provides statistical techniques
- Provides high-quality graphical techniques
- Many third-party libraries

R: R Language Example

R: RStudio Example

R: Vector Example

R: Matrix Example

R: Scatter Plot & Visualization Example

R: Plentiful Library Example

R: Heatmap Example

R: Line Graph Example

R: Linear Regression Example

Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala

Analytics on Hadoop
1. Mapper and Reducer programs
   - Write Java programs to analyze data stored in HDFS
2. SQL-like queries
   - Write queries in a high-level query language, as with Oracle or MySQL

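Analytics on Hadoop
1. Mapper and Reducer for word count

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCount extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new WordCount(), args);
            System.exit(res);
        }

        // Configure the job: args[0] is the input path, args[1] the output path.
        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "wordcount");
            job.setJarByClass(this.getClass());
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            return job.waitForCompletion(true) ? 0 : 1;
        }

        // Mapper: emit (word, 1) for every word in the input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

            @Override
            public void map(LongWritable offset, Text lineText, Context context)
                    throws IOException, InterruptedException {
                String line = lineText.toString();
                for (String word : WORD_BOUNDARY.split(line)) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    context.write(new Text(word), one);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

Ref. Cloudera Hadoop Tutorial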
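The data flow of the word-count job (map, shuffle, reduce) can be sketched in a single process as follows. This is a minimal Python illustration under stated assumptions, not Hadoop code: the function names are hypothetical, tokens are split on whitespace rather than with the Java word-boundary regex, and the shuffle that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical input lines.
counts = word_count(["to be or not to be", "to do"])
```

The mapper never sees more than one line and the reducer never sees more than one key, which is what lets the framework run many of each in parallel.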

Analytics on Hadoop
2. SQL-like queries for word count

    CREATE TABLE doc (text string);

    LOAD DATA LOCAL INPATH '/home/Documents/sentiment/Wikipedia.txt'
    OVERWRITE INTO TABLE doc;

    -- Split each line into words, one row per word, then count per word.
    -- The subquery in FROM is given an alias (w).
    SELECT word, COUNT(*)
    FROM (SELECT explode(split(text, ' ')) AS word FROM doc) w
    GROUP BY word;
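One detail of the query is worth checking on real text: splitting on a single space produces an empty token wherever two spaces are adjacent, and that empty token is then counted like any other word. The Python sketch below imitates the explode/split/GROUP BY steps under the assumption that the split behaves like Python's `str.split(' ')`; the helper name `explode_split` and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical rows of the doc table; the second row contains a double space.
rows = ["big data", "big  analytics"]


def explode_split(rows, sep=" "):
    """Imitate explode(split(text, ' ')): one output row per token."""
    for text in rows:
        yield from text.split(sep)


# GROUP BY word with COUNT(*), including the empty token.
raw_counts = Counter(explode_split(rows))

# The same aggregation with empty tokens filtered out first.
clean_counts = Counter(word for word in explode_split(rows) if word != "")
```

In the query itself, the equivalent fix would be a filter on the exploded column or a split pattern that matches runs of whitespace.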

SQL on Hadoop
1. Pig
   - Its SQL-like scripting language is called Pig Latin
   - Scripts are translated into MapReduce jobs automatically
2. Hive
   - Its SQL-like query language is called HiveQL (HQL)
   - Queries are also translated into MapReduce jobs
3. Impala
   - Supports most of HiveQL plus additional statements
   - Uses its own distributed processing daemon (impalad) instead of MapReduce
All three let users write complex data transformations without knowing Java.

SQL on Hadoop

                    Pig               Hive              Impala
Released year       2006              2008              2012
Dev. language       Java              Java              C++
Query language      Pig Latin         HiveQL            HiveQL
Query processing    Tuple-at-a-time   Tuple-at-a-time   Block-at-a-time
                    (MapReduce)       (MapReduce)       (impalad)
ODBC/JDBC           Yes               Yes               Yes
Latency             High              High              Low
Suitable jobs       Batch             Batch             Real-time
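The difference between tuple-at-a-time and block-at-a-time processing can be illustrated with a toy filter-and-sum operator. This is a simplified Python sketch, not how MapReduce or impalad is implemented: the function names, the block size, and the call counter are assumptions made for illustration. Both variants compute the same result, but the block variant invokes the operator once per batch of rows instead of once per row.

```python
def tuple_at_a_time(rows, predicate):
    """Process one row per operator invocation; return (total, invocations)."""
    calls = 0
    total = 0
    for row in rows:
        calls += 1  # one operator invocation per tuple
        if predicate(row):
            total += row
    return total, calls


def block_at_a_time(rows, predicate, block_size=4):
    """Process a batch of rows per operator invocation; return (total, invocations)."""
    calls = 0
    total = 0
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        calls += 1  # one operator invocation per block
        total += sum(row for row in block if predicate(row))
    return total, calls


def is_even(value):
    return value % 2 == 0


# Hypothetical input: ten integer rows.
rows = list(range(10))
tuple_result = tuple_at_a_time(rows, is_even)
block_result = block_at_a_time(rows, is_even)
```

Amortizing the per-invocation overhead across a block is one reason a block-oriented engine can reach lower latency on the same query.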

Benchmark: System Environment
- Cluster: 13 nodes (1 master + 12 slaves)
- CPU: Intel i5
- Memory: 32.0 GB (each node)
- HDD: 5.0 TB (each node)
- OS: Ubuntu 12.04
- Hadoop: 2.3.0
- Pig: 0.12.0
- Hive: 0.13.1
- Impala: 2.1.1

Benchmark: Data Set
- 1 GB of randomly generated sales transactions from TPC-DS
- Store_Sales: Date_FK, Customer_FK, Item_FK, number, cost, whole_cost, tax
- Date_Dim: Date_FK, quarter, day, month, year
- Item: Item_FK, color, company
- Customer: Customer_FK, name, salutation, country

Benchmark Query 1: Average sales cost in the first half of the year (Hive & Impala)
Legend: Aggregation, Join, Range, Point, Rank

    SELECT AVG(ss.ss_ext_wholesale_cost)
    FROM date_dim AS d, store_sales AS ss
    WHERE d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_qoy < 3;  -- quarters 1 and 2

Benchmark Query 1: Average sales cost in the first half of the year (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
         AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
        AS (d_date_sk:chararray, …, d_current_year:int);
    -- Join sales with dates, keep quarters 1 and 2, then average over all rows.
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    result = FILTER metadata BY d_qoy < 3;
    grouped = GROUP result ALL;
    avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
    STORE avg_sales INTO 'query1.txt';
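The logic that both versions of Query 1 express (join on the date key, range filter on the quarter, then a single average) can be traced on a few rows. The Python sketch below is a single-machine illustration with hypothetical sample data; it uses only the columns the query touches and a dictionary lookup as a stand-in for the join.

```python
from statistics import mean

# Hypothetical miniature tables with only the columns used by Query 1.
date_dim = [
    {"d_date_sk": 1, "d_qoy": 1},
    {"d_date_sk": 2, "d_qoy": 2},
    {"d_date_sk": 3, "d_qoy": 3},
]
store_sales = [
    {"ss_sold_date_sk": 1, "ss_ext_wholesale_cost": 10.0},
    {"ss_sold_date_sk": 2, "ss_ext_wholesale_cost": 20.0},
    {"ss_sold_date_sk": 3, "ss_ext_wholesale_cost": 90.0},
    {"ss_sold_date_sk": 1, "ss_ext_wholesale_cost": 30.0},
]

# Join: index the smaller table by its key, then probe it for every sales row.
dates_by_key = {d["d_date_sk"]: d for d in date_dim}
joined = [
    {**ss, **dates_by_key[ss["ss_sold_date_sk"]]}
    for ss in store_sales
    if ss["ss_sold_date_sk"] in dates_by_key
]

# Range filter: quarters 1 and 2 only (d_qoy < 3).
first_half = [row for row in joined if row["d_qoy"] < 3]

# Aggregation: one average over all remaining rows.
avg_sales = mean(row["ss_ext_wholesale_cost"] for row in first_half)
```

The Pig script spells out these three steps as separate relations, while the SQL version leaves their order to the query planner.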

Benchmark Query 2: Average sales cost on Sunday (Hive & Impala)
Legend: Aggregation, Join, Range, Point, Rank

    SELECT AVG(s.ss_ext_wholesale_cost)
    FROM store_sales AS s, date_dim AS d
    WHERE d.d_date_sk = s.ss_sold_date_sk
      AND d.d_day_name LIKE 'Sunday';

Benchmark Query 2: Average sales cost on Sunday (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
         AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
        AS (d_date_sk:chararray, …, d_current_year:int);
    -- Join sales with dates, keep Sundays, then average over all rows.
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    result = FILTER metadata BY d_day_name == 'Sunday';
    grouped = GROUP result ALL;
    avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
    STORE avg_sales INTO 'query2.txt';

Benchmark Query 3: The 20 customer birth countries with the lowest average sales cost on Sunday (Hive & Impala)
Legend: Aggregation, Join, Range, Point, Rank

    SELECT c.c_birth_country, AVG(ss.ss_ext_wholesale_cost) AS avg_sales
    FROM store_sales AS ss, customer AS c, date_dim AS d
    WHERE c.c_customer_sk = ss.ss_customer_sk
      AND d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_day_name LIKE 'Sunday'
      AND c.c_birth_country != ''
    GROUP BY c.c_birth_country
    ORDER BY avg_sales
    LIMIT 20;

Benchmark Query 3: The 20 customer birth countries with the lowest average sales cost on Sunday (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',')
         AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',')
        AS (d_date_sk:chararray, …, d_current_year:int);
    c = LOAD '/user/user01/customer.csv' USING PigStorage(',')
        AS (c_customer_sk:chararray, …, c_last_review_date:int);
    -- Join sales with dates, then join that result (not ss) with customers,
    -- so that d_day_name is still available to the filter.
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    metadata2 = JOIN metadata BY ss_customer_sk, c BY c_customer_sk;
    result = FILTER metadata2 BY (d_day_name == 'Sunday') AND (c_birth_country != '');
    grouped = GROUP result BY c_birth_country;
    -- Keep the group key so each average is labeled with its country.
    avg_table = FOREACH grouped GENERATE group AS c_birth_country,
                AVG(result.ss_ext_wholesale_cost) AS avg_sales;
    ordered = ORDER avg_table BY avg_sales;
    -- Match the LIMIT 20 of the SQL version.
    bottom20 = LIMIT ordered 20;
    STORE bottom20 INTO 'query3.txt';
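Query 3 adds grouping and ranking to the join-and-filter pattern of the earlier queries. The Python sketch below walks through the same steps on a few hypothetical rows: a two-way lookup join, the Sunday and non-empty-country filters, a per-country average, an ascending sort, and a limit. The function name `bottom_countries` and the sample data are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical miniature tables with only the columns used by Query 3.
date_dim = [
    {"d_date_sk": 1, "d_day_name": "Sunday"},
    {"d_date_sk": 2, "d_day_name": "Monday"},
]
customer = [
    {"c_customer_sk": 1, "c_birth_country": "KOREA"},
    {"c_customer_sk": 2, "c_birth_country": "JAPAN"},
    {"c_customer_sk": 3, "c_birth_country": ""},
]
store_sales = [
    {"ss_sold_date_sk": 1, "ss_customer_sk": 1, "ss_ext_wholesale_cost": 10.0},
    {"ss_sold_date_sk": 1, "ss_customer_sk": 1, "ss_ext_wholesale_cost": 30.0},
    {"ss_sold_date_sk": 1, "ss_customer_sk": 2, "ss_ext_wholesale_cost": 5.0},
    {"ss_sold_date_sk": 2, "ss_customer_sk": 2, "ss_ext_wholesale_cost": 100.0},
    {"ss_sold_date_sk": 1, "ss_customer_sk": 3, "ss_ext_wholesale_cost": 1.0},
]


def bottom_countries(store_sales, customer, date_dim, limit=20):
    """Return up to `limit` (country, average cost) pairs, lowest average first."""
    dates = {d["d_date_sk"]: d for d in date_dim}
    customers = {c["c_customer_sk"]: c for c in customer}
    costs_by_country = defaultdict(list)
    for ss in store_sales:
        d = dates.get(ss["ss_sold_date_sk"])
        c = customers.get(ss["ss_customer_sk"])
        if d is None or c is None:
            continue  # inner join: drop sales rows without a match
        if d["d_day_name"] == "Sunday" and c["c_birth_country"] != "":
            costs_by_country[c["c_birth_country"]].append(ss["ss_ext_wholesale_cost"])
    averages = [(country, mean(costs)) for country, costs in costs_by_country.items()]
    averages.sort(key=lambda pair: pair[1])  # ORDER BY avg_sales (ascending)
    return averages[:limit]                   # LIMIT


ranking = bottom_countries(store_sales, customer, date_dim)
```

Because the sort is ascending, the limit keeps the countries with the lowest averages, which is what "bottom 20" means in the query title.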

Benchmark: Our results

Benchmark: Results from Cloudera documents
Ref. http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Demo 3: Real-time Analytics
- Splunk

Splunk
An engine for real-time machine data
- Collects, indexes, analyzes, and visualizes machine data to identify problems, patterns, risks, and opportunities, and to drive better decisions for IT and the business
Machine data (unstructured data, no predefined schema)
- Logs, application queries, records (billing, call detail, events), click streams
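Working with data that has no predefined schema means fields are pulled out of the raw text when it is read, not when it is stored. The Python sketch below illustrates that idea only; it is not Splunk's API or search language. The log format, the regular expressions, and the field names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical raw events; real machine data would arrive from files or network inputs.
raw_events = [
    "2015-05-29 10:00:01 ERROR billing timeout user=42",
    "2015-05-29 10:00:02 INFO web request path=/home",
    "2015-05-29 10:00:05 ERROR web status=500 path=/cart",
]

# Assumed layout: date, time, level, source, then free text with key=value pairs.
EVENT = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<source>\S+) (?P<rest>.*)$"
)
KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def extract_fields(line):
    """Turn one raw line into a dict of fields; keep the raw text alongside."""
    match = EVENT.match(line)
    if match is None:
        return {"raw": line}  # unparseable lines are kept, just without fields
    fields = match.groupdict()
    # Any key=value pairs in the free-text part become fields of their own.
    fields.update(KEY_VALUE.findall(fields.pop("rest")))
    fields["raw"] = line
    return fields


events = [extract_fields(line) for line in raw_events]

# A simple analysis over the extracted fields: error count per source.
errors_by_source = Counter(
    event["source"] for event in events if event.get("level") == "ERROR"
)
```

Each event ends up with a different set of fields, which is acceptable precisely because nothing downstream depends on a fixed schema.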

Splunk demo (1): Simple commands using Windows application logs

Splunk demo (2): Foot traffic analytics using Cisco Meraki data

Reference
[1] Cloudera Hadoop Tutorial, http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_wordount1_source.html, 2015.05.29.
[2] Introduction to HIVE, http://amalgjose.wordpress.com/2013/10/19/an-introduction-to-apache-hive, 2015.05.29.
[3] SQL on Hadoop, Intelligent Data Systems Lab, Seoul Nat'l University.
[4] TPC Benchmarks Standard Specification, version 1.3.1, Transaction Processing Performance Council, 2015.02.
