Data Analytics
김재윤, 이성민 (team leader), 이용현, 최찬희, 하승도


1 Data Analytics
김재윤, 이성민 (team leader), 이용현, 최찬희, 하승도

2 Contents, Part I
1. Introduction
- Data Analytics Cases
- What is Data Analytics?
- OLTP, OLAP
- ROLAP
- MOLAP
- Column store
2. Applications & Data Management System
- MATLAB
- R
- Impala
- Splunk
- HANA
3. Research Trend
- HyPer (ICDE 2011): Combination of OLTP & OLAP
- Starfish (CIDR 2011)
- Crisis Informatics (ICSE 2011)

3 Contents, Part II
4. Demo 1: Statistical Computing
- MATLAB
- R
5. Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala
6. Demo 3: Real-time Analytics
- Splunk

4 Overview
Columns: Statistical Analytics, Query Processing, Time-series Analytics, Data Visualization, Open Source
MATLAB: OXOOX
R: OXOOO
Impala: OXO
Splunk: XOO OX
HANA: OXX

5 Demo 1: Statistical Computing
- MATLAB
- R

6 MATLAB
Engineering software that provides a numerical analytics environment
- Matrix manipulations
- Plotting of functions and data
- Implementation of algorithms
- Creation of user interfaces
- Can interface with C, C++, Java, Fortran, and Python

7 MATLAB: Interface

8 MATLAB: Too slow to manage large data

9 MATLAB: Code example

10 Demo: Plot

11 Demo: Data Linking

12 Demo: Regression

13 Demo: Polynomial Fitting

14 R
Programming language for statistical computing and graphics
- Widely used among statisticians and data analysts
- Runs on Windows, Mac, and Linux
- Free to use
- Easily extensible through functions
- Provides statistical techniques
- Provides high-quality graphical techniques
- Many third-party libraries

15 R: Language Example

16 R: RStudio Example

17 R: Vector Example

18 R: Matrix Example

19 R: Scatter Plot & Visualization Example

20 R: Plentiful Libraries Example

21 R: Heatmap Example

22 R: Line Graph Example

23 R: Linear Regression Example
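
The transcript keeps only the title of the linear regression demo, so the following is a minimal Python sketch of the computation such a demo typically shows: an ordinary least squares fit of a straight line, roughly what R's lm(y ~ x) does for a single predictor. The function name fit_line and the data points are hypothetical and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With noisy data the same two formulas give the line that minimizes the sum of squared vertical residuals.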

24 Demo 2: Analytics on Hadoop
- Pig
- Hive
- Impala

25 Analytics on Hadoop
1. Mapper and Reducer programs
- Writing Java programs to analyze data stored in HDFS
2. SQL-like queries
- Writing queries in a high-level language, similar to the SQL used with Oracle or MySQL

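26 Analytics on Hadoop
1. Mapper and Reducer for word count (Ref. Cloudera Hadoop Tutorial [1])

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCount extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new WordCount(), args);
            System.exit(res);
        }

        // Configures and submits the job: args[0] is the input path, args[1] the output path.
        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "wordcount");
            job.setJarByClass(this.getClass());
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            return job.waitForCompletion(true) ? 0 : 1;
        }

        // Emits (word, 1) for every token of each input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

            @Override
            public void map(LongWritable offset, Text lineText, Context context)
                    throws IOException, InterruptedException {
                String line = lineText.toString();
                for (String word : WORD_BOUNDARY.split(line)) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    context.write(new Text(word), one);
                }
            }
        }

        // Sums the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }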
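
To make the division of labor in the word count job easier to see, here is a small single-process Python sketch of the same three steps: map, shuffle (group by key), and reduce. It is an illustration only, not how Hadoop runs the job. The function names and sample lines are hypothetical, and tokenization is simplified to whitespace splitting rather than the word-boundary regex used in the Java code.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical input lines.
lines = ["to be or not to be", "to see or not to see"]
result = word_count(lines)
print(result)
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; the logic per key is the same.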

27 Analytics on Hadoop
2. SQL-like queries for word count

    CREATE TABLE doc (text string);

    LOAD DATA LOCAL INPATH '/home/Documents/sentiment/Wikipedia.txt'
    OVERWRITE INTO TABLE doc;

    -- Hive requires an alias for a subquery in the FROM clause.
    SELECT word, COUNT(*)
    FROM (SELECT explode(split(text, ' ')) AS word FROM doc) words
    GROUP BY word;

28 SQL on Hadoop
1. Pig
- Its SQL-like scripting language is called Pig Latin
- Scripts are translated into MapReduce jobs automatically
2. Hive
- Its SQL-like scripting language is called HiveQL (HQL)
- Queries are also translated into MapReduce jobs
3. Impala
- Supports most of HiveQL plus additional statements
- Uses its own distributed processing (impalad) instead of MapReduce
They enable users to write complex data transformations without knowing Java!

29 SQL on Hadoop

                 | Pig                         | Hive                        | Impala
Released Year    | 2006                        | 2008                        | 2012
Dev. Language    | Java                        | Java                        | C++
SQL              | Pig Latin                   | HiveQL                      | HiveQL
Query Processing | Tuple-at-a-time (MapReduce) | Tuple-at-a-time (MapReduce) | Block-at-a-time (Impalad)
ODBC/JDBC        | Yes                         | Yes                         | Yes
Latency          | High                        | High                        | Low
Suitable Jobs    | Batch                       | Batch                       | Real-time
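
The comparison above contrasts tuple-at-a-time with block-at-a-time query processing. The Python sketch below illustrates only the general idea behind those two labels: the same aggregate is computed either by handing one row per call to the operator or by handing over a whole block per call. It is not a model of MapReduce or impalad internals; the function names and the block size of 256 are assumptions made for illustration.

```python
def sum_tuple_at_a_time(rows):
    """Process one row per operator call (iterator style)."""
    total = 0
    calls = 0
    for row in rows:
        calls += 1          # one call per tuple
        total += row
    return total, calls


def sum_block_at_a_time(rows, block_size):
    """Process a whole block of rows per operator call."""
    total = 0
    calls = 0
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        calls += 1          # one call per block
        total += sum(block)
    return total, calls


rows = list(range(1, 1001))             # hypothetical column of 1,000 values
tuple_total, tuple_calls = sum_tuple_at_a_time(rows)
block_total, block_calls = sum_block_at_a_time(rows, block_size=256)
print(tuple_total, tuple_calls, block_total, block_calls)
```

Both variants return the same sum; the block variant simply pays the per-call overhead far fewer times, which is one reason block-oriented engines tend to show lower latency.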

30 Benchmark System Environment
- Cluster: 13 nodes (1 master + 12 slaves)
- CPU: Intel i5
- Memory: 32.0 GB (each node)
- HDD: 5.0 TB (each node)
- OS: Ubuntu 12.0.4
- Hadoop: 2.3.0
- Pig: 0.12.0
- Hive: 0.13.1
- Impala: 2.1.1

31 Benchmark Data Set
- Randomly generated 1 GB of sales transactions from TPC-DS

Store_Sales: Date_FK, Customer_FK, Item_FK, number, cost, whole_cost, tax
Date_Dim: Date_FK, quarter, day, month, year
Item: Item_FK, color, company
Customer: Customer_FK, name, salutation, country

32 Benchmark Query 1: Average sales cost in the first half of the year (Hive & Impala)
Tags: Aggregation, Join, Range, Point, Rank

    SELECT AVG(ss.ss_ext_wholesale_cost)
    FROM date_dim AS d, store_sales AS ss
    WHERE d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_qoy < 3;
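
For readers without a Hadoop cluster, Query 1 can be tried on a toy scale with Python's built-in sqlite3 module. The sketch below is only an illustration: SQLite stands in for Hive or Impala, the two tables are cut down to the columns the query touches, and every row is hypothetical. The query text itself is the one from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down, hypothetical versions of the two TPC-DS tables used by Query 1.
conn.executescript("""
    CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_qoy INTEGER);
    CREATE TABLE store_sales (ss_sold_date_sk INTEGER, ss_ext_wholesale_cost REAL);
""")
conn.executemany(
    "INSERT INTO date_dim VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 4)],       # one date per quarter
)
conn.executemany(
    "INSERT INTO store_sales VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 30.0), (3, 100.0), (4, 200.0)],
)

QUERY_1 = """
    SELECT AVG(ss.ss_ext_wholesale_cost)
    FROM date_dim AS d, store_sales AS ss
    WHERE d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_qoy < 3
"""
(avg_cost,) = conn.execute(QUERY_1).fetchone()
print(avg_cost)
conn.close()
```

Only the three sales dated in quarters 1 and 2 pass the `d_qoy < 3` filter, so the average is taken over 10, 20, and 30.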

33 Benchmark Query 1: Average sales cost in the first half of the year (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',') AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',') AS (d_date_sk:chararray, …, d_current_year:int);
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    result = FILTER metadata BY d_qoy < 3;
    grouped = GROUP result ALL;
    avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
    STORE avg_sales INTO 'query1.txt';

34 Benchmark Query 2: Average sales cost on Sunday (Hive & Impala)
Tags: Aggregation, Join, Range, Point, Rank

    SELECT AVG(s.ss_ext_wholesale_cost)
    FROM store_sales AS s, date_dim AS d
    WHERE d.d_date_sk = s.ss_sold_date_sk
      AND d.d_day_name LIKE 'Sunday';

35 Benchmark Query 2: Average sales cost on Sunday (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',') AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',') AS (d_date_sk:chararray, …, d_current_year:int);
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    result = FILTER metadata BY d_day_name == 'Sunday';
    grouped = GROUP result ALL;
    avg_sales = FOREACH grouped GENERATE AVG(result.ss_ext_wholesale_cost);
    STORE avg_sales INTO 'query2.txt';

36 Benchmark Query 3: Bottom 20 customer birth countries ordered by average sales cost on Sunday (Hive & Impala)
Tags: Aggregation, Join, Range, Point, Rank

    SELECT c.c_birth_country, AVG(ss.ss_ext_wholesale_cost) AS avg_sales
    FROM store_sales AS ss, customer AS c, date_dim AS d
    WHERE c.c_customer_sk = ss.ss_customer_sk
      AND d.d_date_sk = ss.ss_sold_date_sk
      AND d.d_day_name LIKE 'Sunday'
      AND c.c_birth_country != ''
    GROUP BY c.c_birth_country
    ORDER BY avg_sales
    LIMIT 20;

37 Benchmark Query 3: Bottom 20 customer birth countries ordered by average sales cost on Sunday (Pig)

    ss = LOAD '/user/user01/store_sales.csv' USING PigStorage(',') AS (ss_sold_date_sk:chararray, …, ss_net_profit:int);
    d = LOAD '/user/user01/date_dim.csv' USING PigStorage(',') AS (d_date_sk:chararray, …, d_current_year:int);
    c = LOAD '/user/user01/customer.csv' USING PigStorage(',') AS (c_customer_sk:chararray, …, c_last_review_date:int);
    -- Join sales with dates, then join that result (not ss again) with customers.
    metadata = JOIN ss BY ss_sold_date_sk, d BY d_date_sk;
    metadata2 = JOIN metadata BY ss_customer_sk, c BY c_customer_sk;
    result = FILTER metadata2 BY (d_day_name == 'Sunday') AND (c_birth_country != '');
    grouped = GROUP result BY c_birth_country;
    avg_table = FOREACH grouped GENERATE group AS c_birth_country, AVG(result.ss_ext_wholesale_cost) AS avg_sales;
    ordered = ORDER avg_table BY avg_sales;
    -- Keep only the 20 lowest averages, matching LIMIT 20 in the SQL version.
    bottom20 = LIMIT ordered 20;
    STORE bottom20 INTO 'query3.txt';
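
The Pig script for Query 3 is a chain of dataflow steps: two JOINs, a FILTER, a GROUP, a per-group AVG, and an ORDER, with the LIMIT 20 taken from the SQL version of the query. The Python sketch below walks through the same chain on a handful of in-memory rows so each step can be inspected. It is a single-process illustration under stated assumptions: the helper hash_join and all rows are hypothetical, and only the columns the query touches are kept.

```python
from collections import defaultdict

# Hypothetical toy rows; only the columns used by Query 3 are present.
store_sales = [
    {"ss_sold_date_sk": 1, "ss_customer_sk": 10, "ss_ext_wholesale_cost": 40.0},
    {"ss_sold_date_sk": 1, "ss_customer_sk": 11, "ss_ext_wholesale_cost": 10.0},
    {"ss_sold_date_sk": 1, "ss_customer_sk": 12, "ss_ext_wholesale_cost": 99.0},
    {"ss_sold_date_sk": 2, "ss_customer_sk": 10, "ss_ext_wholesale_cost": 70.0},
    {"ss_sold_date_sk": 1, "ss_customer_sk": 10, "ss_ext_wholesale_cost": 20.0},
]
date_dim = [
    {"d_date_sk": 1, "d_day_name": "Sunday"},
    {"d_date_sk": 2, "d_day_name": "Monday"},
]
customer = [
    {"c_customer_sk": 10, "c_birth_country": "KOREA"},
    {"c_customer_sk": 11, "c_birth_country": "JAPAN"},
    {"c_customer_sk": 12, "c_birth_country": ""},
]


def hash_join(left, left_key, right, right_key):
    """Inner equi-join: index the right side, then probe it with each left row."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], ())]


# JOIN sales with dates, then JOIN the result with customers.
joined = hash_join(store_sales, "ss_sold_date_sk", date_dim, "d_date_sk")
joined = hash_join(joined, "ss_customer_sk", customer, "c_customer_sk")

# FILTER: Sunday sales with a non-empty birth country.
result = [r for r in joined
          if r["d_day_name"] == "Sunday" and r["c_birth_country"] != ""]

# GROUP BY birth country, then compute AVG per group.
grouped = defaultdict(list)
for r in result:
    grouped[r["c_birth_country"]].append(r["ss_ext_wholesale_cost"])
avg_table = [(country, sum(costs) / len(costs)) for country, costs in grouped.items()]

# ORDER BY average ascending and keep the bottom 20.
bottom20 = sorted(avg_table, key=lambda row: row[1])[:20]
print(bottom20)
```

Pig and Hive compile the same logical steps into MapReduce jobs, where the joins and the grouping each involve a shuffle across the cluster.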

38 Benchmark: Our results

39 Benchmark: Results from Cloudera documents
Ref. http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

40 Demo 3: Real-time Analytics
- Splunk

41 Splunk
An engine for real-time machine data
- Collects, indexes, analyzes, and visualizes machine data to identify problems, patterns, risks, and opportunities, and to drive better decisions for IT and the business
Machine data (unstructured data, no predefined schema)
- Logs, application queries, records (billing, call detail, events), click streams

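42 Overview of Splunk
- Data indexing
- Search language

General form of a search:
    search | command arguments | command arguments | …

Subsearch example:
    sourcetype=syslog [ search login error | return 1 user ]

Relative time modifier syntax:
    [+|-]<time_integer><time_unit>@<time_unit>

Time range example:
    error earliest=-1d@d latest=-h@h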
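
The time range in the last search on this slide, earliest=-1d@d latest=-h@h, combines an offset with a snap-to unit. The Python sketch below shows one way to read such modifiers, assuming the offset is applied first and the result is then rounded down to the start of the snap unit. It is not Splunk code: the resolve function is hypothetical, it handles only the units s, m, h, and d, and the reference timestamp is made up.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
# Optional signed offset (amount defaults to 1), then optional "@unit" snap.
_MODIFIER = re.compile(r"^(?:([+-])(\d*)([smhd]))?(?:@([smhd]))?$")


def resolve(modifier, now):
    """Turn a relative time modifier such as '-1d@d' into an absolute datetime."""
    match = _MODIFIER.match(modifier)
    if not match or modifier == "":
        raise ValueError(f"unsupported time modifier: {modifier!r}")
    sign, amount, unit, snap = match.groups()

    result = now
    if unit:
        delta = timedelta(**{_UNITS[unit]: int(amount or 1)})
        result = result + delta if sign == "+" else result - delta

    # Snap: round down to the beginning of the given unit.
    if snap == "d":
        result = result.replace(hour=0, minute=0, second=0, microsecond=0)
    elif snap == "h":
        result = result.replace(minute=0, second=0, microsecond=0)
    elif snap == "m":
        result = result.replace(second=0, microsecond=0)
    elif snap == "s":
        result = result.replace(microsecond=0)
    return result


now = datetime(2015, 5, 29, 14, 37, 12)     # hypothetical "current" time
earliest = resolve("-1d@d", now)            # start of yesterday
latest = resolve("-h@h", now)               # start of the previous hour
print(earliest, latest)
```

Read this way, the search covers everything from midnight yesterday up to the top of the hour before the search was run.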

43 Splunk demo (1): Simple commands using Windows application logs

44 Splunk demo (2): Foot traffic analytics using Cisco Meraki data

45 Reference
[1] Cloudera Hadoop Tutorial, http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_wordount1_source.html, 2015.05.29.
[2] Introduction to HIVE, http://amalgjose.wordpress.com/2013/10/19/an-introduction-to-apache-hive, 2015.05.29.
[3] SQL on Hadoop, Intelligent Data Systems Lab, Seoul Nat'l University.
[4] TPC Benchmarks Standard Specification, version 1.3.1, Transaction Processing Performance Council, 2015.02.

