Making Fly Parviz Deyhim Obsession Scalable, Highly-available and Secure.

Presentation transcript:

1 Making Fly Parviz Deyhim

2 Obsession Scalable, Highly-available and Secure

4 Amazon Elastic MapReduce

5 What is EMR? Map-Reduce engine Integrated with tools Hadoop-as-a-service Massively parallel Cost effective AWS wrapper Integrated with AWS services

6 HDFS Amazon EMR Integration

7 HDFS Integration Amazon EMR Amazon S3 Amazon DynamoDB

8 HDFS Integration Analytics languages Data management Amazon EMR Amazon S3 Amazon DynamoDB

9 HDFS Integration Analytics languages Data management Amazon EMR Amazon RDS Amazon S3 Amazon DynamoDB

10 HDFS Integration Analytics languages Data management Amazon RedShift Amazon EMR Amazon RDS Amazon S3 Amazon DynamoDB Data Pipeline

12 Amazon EMR Concepts Master Node Core Nodes Task Nodes

13 Core Nodes Master Node Master instance group Amazon EMR cluster Core instance group HDFS DataNode (HDFS)

14 Core Nodes Master Node Master instance group Amazon EMR cluster Core instance group HDFS Can Add Core Nodes: More CPU More Memory More HDFS Space HDFS

15 Core Nodes Master Node Master instance group Amazon EMR cluster Core instance group HDFS Can’t remove core nodes: HDFS corruption HDFS

16 Task Nodes Master Node Master instance group Amazon EMR cluster Core instance group HDFS No HDFS Provides compute resources: CPU Memory

17 Task Nodes Master Node Master instance group Amazon EMR cluster Core instance group HDFS Can add and remove task nodes

18 Spark On Amazon EMR

19 Bootstrap Actions Ability to run or install additional packages/software on EMR nodes Simple bash script stored on S3 Script gets executed during node/instance boot time Script gets executed on every node that gets added to the cluster

20 Spark on Amazon EMR Bootstrap action installs Spark on EMR nodes Currently on Spark 0.7.3 & upgrading to 0.8 very soon

21 Why Spark on Amazon EMR? Deploy small and large Spark clusters in minutes EMR handles node recovery in case of failures Integration with EC2 Spot Market, Amazon Redshift, Amazon Data Pipeline, Amazon Cloudwatch, etc.

22 Why Spark on Amazon EMR? Shipping Spark logs to S3 for debugging – Define S3 bucket at cluster deploy time

23 Spark on EC2 Spot Market Bid on unused EC2 capacity Spark is memory hungry. Bid on large-memory instances at a fraction of the cost

24 Spark on EC2 Spot Market 1TB Memory Cluster
Instance Type | # of nodes | On-demand Cost | Spot Cost | Cost/GB of Memory Using Spot
M1.xlarge | 63 | $31/h | $4.41/h | 0.44c/GB/h
CC2.8xlarge | 16 | $39/h | $4.64/h | 0.46c/GB/h
M2.4xlarge | 15 | $24/h | $2.25/h | 0.22c/GB/h
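The per-GB figures in the table above can be roughly reproduced with a little arithmetic. The sketch below is not from the presentation: it assumes the "1TB Memory Cluster" is 1000 GB in total and that the on-demand and spot prices are for the whole cluster per hour. The function and constant names are made up for illustration.

```python
# Prices are taken from the slide's table. CLUSTER_MEMORY_GB is an assumption
# standing in for the slide's "1TB Memory Cluster".
CLUSTER_MEMORY_GB = 1000

# instance type -> (nodes, on-demand $/h, spot $/h), each for the whole cluster
CLUSTERS = {
    "M1.xlarge": (63, 31.00, 4.41),
    "CC2.8xlarge": (16, 39.00, 4.64),
    "M2.4xlarge": (15, 24.00, 2.25),
}


def spot_cents_per_gb_hour(spot_dollars_per_hour, memory_gb=CLUSTER_MEMORY_GB):
    """Spot cost of one GB of cluster memory for one hour, in cents."""
    return spot_dollars_per_hour * 100 / memory_gb


def spot_saving_fraction(on_demand_dollars_per_hour, spot_dollars_per_hour):
    """Fraction of the on-demand price saved by running on spot."""
    return 1 - spot_dollars_per_hour / on_demand_dollars_per_hour


for name, (nodes, on_demand, spot) in CLUSTERS.items():
    print(
        f"{name} x{nodes}: {spot_cents_per_gb_hour(spot):.3f} c/GB/h, "
        f"{spot_saving_fraction(on_demand, spot):.0%} below on-demand"
    )
```

Under these assumptions the M2.4xlarge cluster comes out cheapest per GB of memory, which matches the ordering in the table.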

25 Spark on EC2 Spot Market Master Node Master instance group Amazon EMR cluster HDFS 32GB Memory Launch initial Spark cluster with core nodes HDFS to store and checkpoint RDDs

26 Spark on EC2 Spot Market Master Node Master instance group Amazon EMR cluster HDFS 32GB Memory 256GB Memory Add Task nodes in spot market to increase memory capacity

27 Spark on EC2 Spot Market Master Node Master instance group Amazon EMR cluster HDFS 32GB Memory 256GB Memory Amazon S3 Create RDDs from HDFS or Amazon S3 with: sc.textFile OR sc.sequenceFile Run Computation on RDDs

28 Spark on EC2 Spot Market Master Node Master instance group Amazon EMR cluster HDFS 32GB Memory 256GB Memory Amazon S3 Save the resulting RDDs to HDFS or S3 with: rdd.saveAsSequenceFile OR rdd.saveAsObjectFile

29 Spark on EC2 Spot Market Master Node Master instance group Amazon EMR cluster HDFS 32GB Memory Shutdown TaskNodes when your job is done

30 Elastic Spark With Amazon EMR

31 Autoscaling Spark Master Node Amazon EMR cluster HDFS 32GB Memory

32 Autoscaling Spark Master Node Amazon EMR cluster HDFS 32GB Memory 256GB Memory

33 Elastic Spark When to Scale? – Depends on your job: CPU bound or memory intensive? – Probably both for Spark jobs Use CPU/memory utilization metrics to decide when to scale

34 Amazon EMR Cloudwatch metrics EMR integrates with Cloudwatch Provides many metrics. Examples: – Load Metrics – HDFS Metrics – S3 Metrics

35 Amazon EMR Cloudwatch metrics

36 Basics on Cloudwatch Metrics Pick any Cloudwatch metric Pick a threshold at which you would like to be notified when it is breached Set up Cloudwatch alarms based on your thresholds Receive SNS notifications in the form of: – – SNS – HTTP API Call
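The threshold idea on this slide can be illustrated with a small, simplified model of how an alarm might be evaluated. This is only a sketch, not Cloudwatch's actual implementation: it assumes a "greater than threshold" comparison and a fixed number of consecutive evaluation periods. The function name and parameters are hypothetical; the three state names mirror the alarm states Cloudwatch reports.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Simplified alarm evaluation (an illustration, not the real service logic).

    Returns "ALARM" when each of the last `evaluation_periods` datapoints is
    above `threshold`, "OK" when at least one of them is not, and
    "INSUFFICIENT_DATA" when there are too few datapoints to decide.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Hypothetical load samples, one per period, checked against a threshold of 4.
print(alarm_state([1, 5, 6], threshold=4, evaluation_periods=2))
```

Requiring several consecutive breaches is one way to avoid reacting to a single short spike.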

37 Take Manual Action Such As Adding More Task Nodes Monitor With Cloudwatch Receive Notification Basics on Cloudwatch Metrics

38 Take Automated Actions Monitor With Cloudwatch HTTP API Calls Basics on Cloudwatch Metrics

39 Spark Autoscaling Based on Load Setup Cloudwatch alarm on EMR “TotalLoad” metric Receive /SNS/HTTP notification Add more worker nodes by adding EMR task nodes
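As a rough illustration of the automated path, the sketch below parses an SNS HTTP notification carrying a Cloudwatch alarm and decides how many task nodes the cluster should have. It is an assumption-laden sketch rather than the presenter's code: the function name, the `step` and `max_task_nodes` parameters, and the alarm name in the example are all hypothetical, and the actual EMR resize call is deliberately left out.

```python
import json


def task_nodes_after_alarm(sns_body, current_task_nodes, step=2, max_task_nodes=20):
    """Return the desired task-node count after an SNS-delivered alarm.

    `sns_body` is the JSON text of an SNS HTTP notification whose "Message"
    field holds the alarm details as a JSON string. `step` and
    `max_task_nodes` are hypothetical tuning knobs.
    """
    envelope = json.loads(sns_body)
    if envelope.get("Type") != "Notification":
        return current_task_nodes  # e.g. a subscription confirmation
    alarm = json.loads(envelope["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return current_task_nodes  # alarm cleared or has insufficient data
    return min(current_task_nodes + step, max_task_nodes)


# Example notification, built by hand for illustration.
example_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "AlarmName": "emr-total-load-high",  # hypothetical alarm name
        "NewStateValue": "ALARM",
        "Trigger": {"MetricName": "TotalLoad"},
    }),
})
print(task_nodes_after_alarm(example_body, current_task_nodes=4))
```

The returned count would then be passed to whatever mechanism resizes the EMR task instance group. Capping it with `max_task_nodes` keeps a noisy alarm from growing the cluster without bound.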

40 Spark Autoscaling Based on Memory Spark needs memory – Lots of it!! How to scale based on memory usage?

41 Spark Metrics Spark 0.8 provides cluster metrics Source and Sink topology

42 Spark Metrics Spark Metric Sources (metrics.properties):
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

43 Spark Metrics Spark Metric Sinks (metrics.properties): Package: org.apache.spark.metrics.sink ConsoleSink JmxSink CsvSink GangliaSink

44 Spark Metrics Spark Metric Sinks (metrics.properties): CloudwatchSink

45 Spark Metrics & Cloudwatch

46 Monitor Spark metrics with Cloudwatch Set up Cloudwatch alarms and get notified if any metric reaches your threshold. Example: if JvmHeapUsed > 20G Receive notification and take manual or automated actions

47 Spark Streaming and Amazon Kinesis

48 Amazon Kinesis Kinesis

49 Amazon Kinesis
CreateStream – Creates a new Data Stream within the Kinesis Service
PutRecord – Adds new records to a Kinesis Stream
DescribeStream – Provides metadata about the Stream, including name, status, Shards, etc.
GetNextRecord – Fetches next record for processing by user business logic
MergeShard / SplitShard – Scales Stream up/down
DeleteStream – Deletes the Stream

50 Amazon Kinesis Kinesis

51 Spark Streaming and Amazon Kinesis SparkStreaming Kinesis Receiver Extends NetworkReceiver Creates a single Receiver per shard and reads from Kinesis

52 Misc. New AWS instances provide enhanced networking in VPC C3 I2 Higher PPS Less Jitter Great CPU Power Suitable for Spark: Serialization and Shuffle

53 What Would You Like To See On Spark By Amazon? Send Feedback To: Parviz

