Cloud Computing Era (Practice)


1 Cloud Computing Era (Practice)
Phoenix Liau, Trend Micro

2 Three Major Trends to Change the World
Cloud Computing, Big Data, Mobile

3 What Is Cloud Computing?
The U.S. National Institute of Standards and Technology (NIST) defines cloud computing in terms of essential characteristics, service models, and deployment models.
Essential Characteristics:
- On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically, without requiring human interaction with each service provider.
- Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
- Resource pooling. The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
- Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
- Measured service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Service Models:
- Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
- Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
- Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
Deployment Models:
- Private cloud. The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
- Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premise or off premise.
- Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
- Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
In short: an as-a-service business model that uses Internet technology to deliver scalable and elastic IT capabilities to users.

4 It's About the Ecosystem
Cloud computing (IaaS, PaaS, SaaS) generates big data; structured and semi-structured data feed the enterprise data warehouse; big data leads to business insights, which in turn create competition, innovation, and productivity.
- SaaS: Google Doc/Mail/Drive, Microsoft Office 365, Salesforce.com (B2B); Dropbox, Netflix, Facebook (B2C)
- PaaS: Google App Engine, Microsoft Azure
- IaaS: Amazon AWS, Rackspace, CHT Hi-Cloud

5 What is Big Data? A set of files. A database. A single file.

6 What Is the Problem?
Getting the data to the processors becomes the bottleneck. A quick calculation: at a typical disk transfer rate of 75 MB/sec, transferring 100 GB of data to the processor takes approximately 22 minutes!
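A small sketch of that arithmetic (the 75 MB/sec single-disk rate is the slide's assumption), which also hints at why Hadoop spreads data across many disks read in parallel:
// Disk-transfer bottleneck, assuming the slide's 75 MB/sec figure.
public class TransferTime {
    public static void main(String[] args) {
        double dataMB = 100 * 1024;         // 100 GB expressed in MB
        double rateMBperSec = 75.0;         // typical single-disk transfer rate
        double seconds = dataMB / rateMBperSec;
        System.out.printf("1 disk:    %.1f minutes%n", seconds / 60);  // ~22 minutes

        int disks = 100;                    // hypothetical parallel read across 100 disks
        System.out.printf("%d disks: %.1f seconds%n", disks, seconds / disks);
    }
}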

7 The Era of Big Data – Are You Ready?
Businesses are driving the growth of big data. Storing huge volumes of data capably, managing them efficiently, and turning them into business value are major enterprise challenges. Overwhelming quantities of big data will strain enterprise storage infrastructure and data center architecture, causing chain reactions in database storage, data mining, business intelligence, cloud computing, and computing applications. It is predicted that ever more terabyte-scale data will be used for commercial analysis; such multi-terabyte collections can be called "large datasets". According to IDC, data utilization will increase forty-four fold, and the world's data volume will reach 35.2 ZB. The size of individual datasets will also grow, making them harder to analyze and comprehend.
Data for commercial analysis: 2011: multi-terabyte (TB); 2020: 35.2 ZB (1 ZB = 1 billion TB)

8 Who Needs It?
Enterprise database – when to use:
- Ad-hoc reporting (<1 sec)
- Multi-step transactions
- Lots of inserts/updates/deletes
Hadoop – when to use:
- Affordable storage/compute
- Unstructured or semi-structured data
- Resilient auto scalability

9 Hadoop!

10 The Apache Hadoop project – inspired by Google's MapReduce and Google File System papers
An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware. Open-source software plus commodity hardware means reduced IT costs. Hadoop is based on work done by Google: specifically, the papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004.

11 Hadoop Core: MapReduce and HDFS

12 HDFS: Hadoop Distributed File System
- Redundant, fault tolerant, scalable, self-healing
- Write once, read many times
- Java API and command-line tool
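A minimal sketch of the Java API in action (the path and file contents are hypothetical; cluster configuration is picked up from core-site.xml):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        // Write once...
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS");
        }
        // ...read many times.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
The command-line tool covers the same ground, e.g. hadoop fs -put hello.txt /user/demo/.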

13 MapReduce
- Two phases borrowed from functional programming: Map and Reduce
- Redundant, fault tolerant, scalable, self-healing
- Java API

14 Hadoop Core: the MapReduce engine (Java) layered on HDFS (Java), programmed against the Java API.

15 Word Count Example
Map input: key = byte offset, value = line
  0: "The cat sat on the mat"
  22: "The aardvark sat on the sofa"
Map output: key = word, value = count
Reduce output: value = sum of counts for each word

16 The Hadoop Ecosystem

17 The Ecosystem is the System
Hadoop has become the kernel of the distributed operating system for big data. No one uses the kernel alone: around it sits a collection of projects at Apache.

18 Relation Map
The ecosystem as a layered stack:
- Hue (web console)
- Mahout (data mining)
- Oozie (job workflow & scheduling)
- Sqoop/Flume (data integration)
- Pig/Hive (analytical languages)
- ZooKeeper (coordination)
- HBase (column NoSQL DB)
- MapReduce runtime (distributed programming framework)
- Hadoop Distributed File System (HDFS)

19 ZooKeeper – Coordination Framework

20 What is ZooKeeper?
A centralized service for maintaining configuration information and providing distributed synchronization: a set of tools for building distributed applications that can safely handle partial failures. ZooKeeper was designed to store coordination data:
- Status information
- Configuration
- Location information
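A minimal sketch of the client API (the ensemble address and znode path are assumptions); the same getData call that reads configuration can also leave a watch, the building block for synchronization:
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is an assumption).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Store a piece of configuration under a znode.
        zk.create("/config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; 'true' sets a watch for change notification.
        byte[] data = zk.getData("/config", true, null);
        System.out.println(new String(data));
        zk.close();
    }
}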

21 Flume / Sqoop – Data Integration Framework

22 What's the Problem with Data Collection?
Data collection is currently a priori and ad hoc:
- A priori: you must decide what you want to collect ahead of time
- Ad hoc: each kind of data source goes through its own collection path

23 What is Flume? (and how can it help?)
A distributed data collection service that efficiently collects, aggregates, and moves large amounts of data. It is fault tolerant, with many failover and recovery mechanisms: a one-stop solution for collecting data of all formats.

24 Flume: High-Level Overview
A logical node connects a source to a sink.

25 Flume Architecture
Logs flow from Flume nodes, possibly through intermediate Flume nodes, into HDFS.

26 Flume Sources and Sinks
Local files, HDFS, stdin/stdout, Twitter, IRC, IMAP

27 Sqoop
Easy, parallel database import/export. What do you want to do?
- Import data from an RDBMS into HDFS
- Export data from HDFS back into an RDBMS

28 Sqoop sits between the RDBMS and HDFS, moving data in both directions.

29 Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City ...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200

30 Pig / Hive – Analytical Language

31 Why Hive and Pig?
Although MapReduce is very powerful, it can also be complex to master. Many organizations have business or data analysts who are skilled at writing SQL queries but not at writing Java code, and programmers who are skilled at writing code in scripting languages. Hive and Pig are two projects that evolved separately to help such people analyze huge amounts of data via MapReduce. Hive was initially developed at Facebook, Pig at Yahoo!

32 Hive – Developed by Facebook
What is Hive? An SQL-like interface to Hadoop: a data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop, with MapReduce for execution and HDFS for storage.
Hive Query Language:
- Basic SQL: SELECT, FROM, JOIN, GROUP BY
- Equi-join, multi-table insert, multi-group-by
- Batch queries, e.g.:
  SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;

33 Hive translates SQL queries into MapReduce jobs.

34 Pig – Initiated by Yahoo!
A high-level scripting language (Pig Latin):
- Processes data one step at a time
- MapReduce programs become simple to write, easy to understand, and easy to debug
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

35 Pig compiles Pig Latin scripts into MapReduce jobs.

36 Hive vs. Pig
- Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language
- Schema: Hive keeps table definitions in a metastore; in Pig, a schema is optionally defined at runtime
- Programmatic access: Hive offers JDBC and ODBC; Pig offers PigServer
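As a sketch of that programmatic access, this is roughly what querying Hive over JDBC looked like with the HiveServer driver of this era (host, port, and table name are assumptions; PigServer plays the analogous role for Pig):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer1-era driver and URL; adjust for your deployment.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // The query runs as a MapReduce job behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT token, count(*) FROM wordcount GROUP BY token");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}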

37 WordCount Example
For the given sample input:
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
the map emits:
  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
and the reduce just sums up the values:
  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

38 WordCount Example in MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit <word, 1> for every token in the input line.
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts for each word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

39 WordCount Example in Pig
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;

40 WordCount Example in Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

41 The Story So Far
- SQL users reach Hadoop through Hive; script writers through Pig; both compile to MapReduce
- Java programmers write MapReduce directly; it runs on HDFS
- Sqoop moves data between RDBMSs (SQL) and HDFS; Flume moves data from filesystems (POSIX) into HDFS

42 HBase – Column NoSQL DB

43 Structured Data vs. Raw Data

44 HBase – Inspired by Google's Bigtable
- Coordinated by ZooKeeper
- Low-latency random reads and writes
- Distributed key/value store
- Simple API: PUT, GET, DELETE, SCAN

45 HBase – Data Model
- Cells are "versioned"
- Table rows are sorted by row key
- Region: a row range [start-key : end-key]

46 HBase – Workflow

47 HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'
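The same PUT and GET through the Java client API of this era, as a hedged sketch (table and column names reuse the shell example above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        HTable table = new HTable(conf, "mytable");

        // PUT: write one cell.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
        table.put(put);

        // GET: read it back.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}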

48 Oozie – Job Workflow & Scheduling

49 What is Oozie?
A Java web application: Oozie is a workflow scheduler for Hadoop, a cron for Hadoop. Workflows are triggered by time or by data availability. An Oozie workflow is a collection of actions (i.e., Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph), specified in hPDL (an XML Process Definition Language).
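A sketch of submitting such a workflow from Java with Oozie's client API (the server URL and HDFS application path are assumptions):
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieDemo {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Point at a workflow definition (workflow.xml in hPDL) on HDFS.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:8020/user/demo/app");
        conf.setProperty("nameNode", "hdfs://localhost:8020");

        String jobId = oozie.run(conf);   // submit and start the workflow
        System.out.println("Workflow job " + jobId + " submitted");
    }
}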

50 Oozie Features
Component independent: works with MapReduce, Hive, Pig, Sqoop, and Streaming.

51 Mahout – Data Mining

52 What is Mahout?
A machine-learning tool: distributed and scalable machine learning algorithms on the Hadoop platform, making intelligent applications easier and faster to build.

53 Mahout Use Cases
- Yahoo: spam detection
- Foursquare: recommendations
- SpeedDate.com: recommendations
- Adobe: user targeting
- Amazon: personalization platform

54 Use Case Example
Predict what a user likes based on:
- their historical behavior
- the aggregate behavior of people similar to them
A sketch of such a recommender follows.
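A minimal sketch using Mahout's Taste recommender API, assuming a hypothetical ratings.csv of userID,itemID,rating lines; user-based collaborative filtering is exactly the "people similar to them" idea above:
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // Historical behavior: userID,itemID,rating per line (hypothetical file).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // "People similar to them": the nearest 10 users by rating correlation.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 items user 1234 is predicted to like.
        List<RecommendedItem> recs = recommender.recommend(1234L, 3);
        for (RecommendedItem item : recs) {
            System.out.println(item.getItemID() + " @ " + item.getValue());
        }
    }
}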

55 Conclusion
Today, we introduced:
- Why Hadoop is needed
- The basic concepts of HDFS and MapReduce
- What sorts of problems can be solved with Hadoop
- What other projects are included in the Hadoop ecosystem

56 Recap – Hadoop Ecosystem
- Hue (web console)
- Mahout (data mining)
- Oozie (job workflow & scheduling)
- Sqoop/Flume (data integration)
- Pig/Hive (analytical languages)
- ZooKeeper (coordination)
- HBase (column NoSQL DB)
- MapReduce runtime (distributed programming framework)
- Hadoop Distributed File System (HDFS)

57 Trend Micro Cloud-Based Anti-Virus: Case Study

58 Collaboration in the underground

59 New Unique Malware Discovered
Internet threats are growing explosively. Malware variants of every kind, spam, and unidentified download sources evade detection by traditional security systems and keep growing, posing a serious security threat. One million unique malware samples appear every month, and "time to protection" and "complexity of threat" have become major challenges.

60 The traditional approach is no longer sufficient to handle today's big data
- From handling malware samples and sourcing, to finding the trace in the logs
- From defense inside the perimeter, to defense in the cloud

61 New Design Concept for Threat Intelligence
Threat intelligence is drawn from multiple dimensions and a variety of data sources: web crawlers, Trend Micro endpoint, mail, and web protection, honeypots, CDNs/xSPs, and human intelligence. With 150M+ worldwide endpoints/sensors, we leverage our worldwide customer base as sensors for identifying new and potential threats, shifting from "reactive" to "proactive" protection.

62 Challenges We Face
The concept is great, but 6 TB of data and 15 billion lines of logs are received daily. It becomes a big data challenge!

63 Issues to Address
Turning raw data into information, and information into threat intelligence and solutions:
- Volume: infinite
- Time: no delay
- Target: constantly changing threats

64 SPN High-Level Architecture
- Inputs: CDN logs, HTTP POST feedback, L4 traffic, spam, web pages, HTTP downloads, SPN feedback
- Ingestion: log receivers feeding log post-processing pipelines and Lumber Jack
- Core services: Global Object Cache (GOC), Tracking Logging System (TLS), malware classification, correlation platform
- SPN infrastructure: HBase and MapReduce on the Hadoop Distributed File System (HDFS), ad-hoc query (Pig), Circus (Ambari), message bus
- Outputs: feedback information to the Application, Web, and File Reputation Services

65 Trend Micro Big Data Processing Capacity
Data the cloud-based anti-virus service must process daily:
- 8.5 billion Web Reputation queries
- 3 billion Email Reputation queries
- 7 billion File Reputation queries
- 6 TB of raw logs collected from around the world
- Connections from 150 million endpoints

66 Trend Micro: Web Reputation Services
Block a malicious URL within 15 minutes of it going online! The pipeline, spanning technology, process, and operation:
- User traffic and honeypots: 8 billion queries/day
- Akamai CDN cache and rating server for known threats: 40% filtered; 4.8 billion queries/day pass on
- High-throughput web service for unknown URLs and prefiltering: 82% filtered; 860 million/day pass on
- Hadoop cluster: page download, web crawling, threat analysis, machine learning, data mining: 99.98% filtered
- Result: 25,000 malicious URLs per day, fed back into Trend Micro products and technology

67 Big Data Cases

68

69 LINE Data on HBase
Line data:
- MODEL: <key> -> <model>
- INDEX: <key> -> <property in model>
- User: <userID> -> <User obj>, <userID> <-> <phone>
Consistency in HBase:
- Contact model: stored using column qualifiers
- Supports range queries (e.g., a message box)
A hypothetical sketch of this layout follows.
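Purely as an illustration of "one row per user, one column qualifier per contact" and of range queries over row keys (the table name, key format, and layout are assumptions, not LINE's actual schema):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ContactSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "messenger");   // hypothetical table

        // One row per user; one column qualifier per contact.
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("contact"), Bytes.toBytes("user99"), Bytes.toBytes("Alice"));
        table.put(put);

        // Range query: message-box rows keyed "user42#msg#<timestamp>" scan in key order.
        Scan scan = new Scan(Bytes.toBytes("user42#msg#"), Bytes.toBytes("user42#msg~"));
        ResultScanner scanner = table.getScanner(scan);
        scanner.close();
        table.close();
    }
}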

70 Pig at LinkedIn: Aster, Hadoop, Voldemort, Azkaban, Pig

71 LinkedIn – Pig Example
views = LOAD '/data/awesome' USING VoldemortStorage();
views = LOAD '/data/etl/tracking/extracted/profile-view' USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');

72 Facebook Messages

73 Facebook Open Source Stack
- Memcached: app server cache
- ZooKeeper: small data coordination service
- HBase: database storage engine
- HDFS: distributed filesystem
- Hadoop: asynchronous MapReduce jobs

74 Questions?

75 Thank you!

