Iterative Computing on Massive Data Sets


1 Iterative Computing on Massive Data Sets
William Rory Kronmiller

2 Why Do I Care?

3 Some Companies Using Spark
Alibaba, Databricks, Amazon, eBay, Baidu, Facebook, Berkeley AMPLab, NASA JPL, OpenTable, CERN, Tencent, Concur, TripAdvisor

4 Some Things You Can Do With Spark
Real-Time Advertisement Selection, Genome Sequencing, Network Intrusion Detection (IDS), Fraud Detection

5 Okay, But What Is It?

6 Spark: The Distributed System
UC Berkeley research project, designed to run on a cluster of unreliable machines. Core abstraction: Resilient Distributed Datasets (RDDs), which are partitioned, parallel collections that are immutable, recoverable, lazy, and in-memory.
Spark originated as a UC Berkeley project, originally built just to show off Berkeley's distributed job scheduler, Mesos. Like the projects leading up to it, it is designed under the assumption that worker machines will fail. The core of the system is the Resilient Distributed Dataset (RDD), a programming abstraction you can think of as a partitioned collection of tuples, somewhat similar to a view in a database. You write Spark programs by transforming one or more RDDs into another RDD, a very functionally oriented programming model. RDDs are lazy and immutable and keep track of their lineage, so nodes can fail without the computation having to stop, and with less checkpointing. Materialized RDDs can be cached in memory for efficient iterative and ad-hoc computation, and lazy evaluation also makes pipelining easier. tl;dr: RDDs let you perform ad-hoc and iterative computation on massive data sets using a cluster of consumer-grade hardware.
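To make the laziness and lineage points concrete, here is a minimal Scala sketch, assuming an existing SparkContext `sc` (as provided by spark-shell) and a hypothetical input path; nothing executes until the final action.

    val lines  = sc.textFile("hdfs:///data/events.log")   // hypothetical path; only records lineage
    val errors = lines.filter(_.contains("ERROR"))        // transformation: still lazy
    errors.cache()                                        // keep materialized partitions in memory for reuse
    println(errors.count())                               // action: forces the lineage to be evaluated
    println(errors.count())                               // second pass reuses the cached partitions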

7 Output Operators Trigger Materialization
Sometimes called output operations (DStreams); also called actions (RDDs). Examples: reduce(), collect(), count(), foreach(), take(), saveAsTextFile().
Output operators, a.k.a. actions, are operations on RDDs that trigger materialization. They are the functions you call when you are done manipulating the data and want to see the results. If your Spark job has no output operator, it will not do anything.
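A short sketch of the transformation/action split, again assuming an existing SparkContext `sc` and a hypothetical output path:

    val nums    = sc.parallelize(1 to 1000000)        // source RDD
    val squares = nums.map(n => n.toLong * n)         // lazy transformation
    println(squares.reduce(_ + _))                    // action: triggers the computation
    squares.take(5).foreach(println)                  // another action
    squares.saveAsTextFile("hdfs:///out/squares")     // action: hypothetical output path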

8 Spark High Level Design
Spark can be broken up into three main components:
- Driver: manages a particular Spark job
- Cluster Manager: manages a pool of worker machines servicing one or more Spark jobs
- Worker: provides CPU for performing parallel computation and memory for hosting RDD partitions

9 Spark Driver
The core of a Spark job: talks to the cluster, creates and schedules tasks, tracks computation progress, hosts the WebUI.
To make a Spark job, you write a Scala, Java, or Python program that hooks into the Spark libraries. Your program consists of a series of transformations on RDDs and ends with some kind of output operation. You submit your program to a Spark driver (e.g. a program running on your laptop with VPN access to a Spark cluster). The driver breaks the program up into a directed acyclic graph (DAG) of tasks (RDD transformations), gets resources from the cluster, and schedules the tasks. The driver also hosts a WebUI for viewing the progress of the computation, usually on port 4040.
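As a sketch (the class name, app name, and path are hypothetical), the driver side of a job is just a program that builds a SparkContext, declares transformations, and ends with an action; it is then submitted to the cluster, typically with the spark-submit script:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("my-job")   // master URL usually supplied at submit time
        val sc   = new SparkContext(conf)
        val data = sc.textFile("hdfs:///data/input")      // hypothetical input
        println(data.count())                             // output operation ends the job
        sc.stop()
      }
    }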

10 Cluster Manager
Spark runs on multiple cluster managers: Mesos, YARN, Spark Standalone.
A nice feature of Spark is that it was built from day one to play nice with other computing frameworks: it can run on multiple cluster computing frameworks and just needs a way of getting resources on which to schedule tasks. It was originally built to run on Mesos, but can now also run on YARN or in "standalone mode", where it serves as its own resource manager.
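The choice of cluster manager shows up as the master URL the driver is pointed at; the hostnames and ports below are placeholders:

    new SparkConf().setMaster("local[*]")                   // no cluster: run locally on all cores
    new SparkConf().setMaster("spark://master-host:7077")   // Spark Standalone
    new SparkConf().setMaster("mesos://mesos-host:5050")    // Mesos
    new SparkConf().setMaster("yarn")                       // YARN (cluster located via Hadoop config)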

11 Spark Executor
Hosts RDD partitions and performs computations. Ideally, an executor runs on the HDFS host that stores its partition.
Not much more to add about Spark executors: data locality reduces latency, so run executors on the machines where the data is stored if practical.

12 Spark SQL
RDDs as SQL tables. DataFrame: similar to R data frames ¯\_(ツ)_/¯, but untyped :(. Dataset: typed :), and has become a superset of DataFrame. Compatible with Hive and HiveQL; compatible with JDBC and ODBC through the Thrift server.
Spark SQL adds support for high-performance SQL-like operations and allows RDDs to be represented like tables in SQL databases. DataFrames are untyped: you access columns by name using strings. Datasets are typed: you access columns using regular Scala/Java dot-operator syntax (typed Datasets are not available in Python). Datasets are now a superset of DataFrames, with a DataFrame being a Dataset of type Row. You can also run SQL queries against Spark and connect over ODBC.
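A minimal Scala sketch of the DataFrame/Dataset/SQL trio, assuming a hypothetical JSON file of people records:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder.appName("sql-sketch").getOrCreate()
    import spark.implicits._

    val df = spark.read.json("hdfs:///data/people.json")     // DataFrame = Dataset[Row], untyped
    df.filter($"age" > 21).select("name").show()              // columns referenced by name

    val ds = df.as[Person]                                    // typed Dataset
    ds.filter(_.age > 21).map(_.name).show()                  // plain dot-operator access

    df.createOrReplaceTempView("people")                      // register for SQL queries
    spark.sql("SELECT name FROM people WHERE age > 21").show()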

13 MLlib
Machine learning library. Huge; just Google it. Could be its own UPE IO.
Spark has a rich suite of machine learning libraries, and its high-performance iterative computation is well suited to ML. MLlib is probably one of the biggest focuses for the researchers developing Spark at the moment, so it is too big to cover in much detail here. Some things you can do: k-means clustering, PCA, TF-IDF, Word2Vec, linear regression, decision trees, collaborative filtering, and much more.
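As one example from the list above, here is a sketch of k-means with the DataFrame-based spark.ml API, assuming a hypothetical CSV of x/y points and the `spark` session from the Spark SQL slide:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    val raw = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///data/points.csv")                           // hypothetical numeric data
    val points = new VectorAssembler()
      .setInputCols(Array("x", "y")).setOutputCol("features")   // pack columns into a feature vector
      .transform(raw)
    val model = new KMeans().setK(3).setSeed(1L).fit(points)    // cluster into 3 groups
    model.clusterCenters.foreach(println)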

14 GraphX
Graph processing on Spark: edge computation, vertex-centric computation, Pregel API.
You can also treat RDDs as collections of vertices and edges and then perform graph computation on them. Could also be its own UPE IO.
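A tiny sketch of building a graph from vertex and edge RDDs (the names and relationships are made up), again assuming a SparkContext `sc`:

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph    = Graph(vertices, edges)                 // RDDs treated as a property graph

    graph.outDegrees.collect().foreach(println)           // simple vertex-centric statistic
    println(graph.pageRank(0.001).vertices.take(3).mkString(", "))   // built on the Pregel API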

15 Word Count Demo
Show the word count / TF-IDF demo and explain what is going on; a sketch of the word-count half appears below.
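For reference, the classic RDD word count looks roughly like this (hypothetical input path, SparkContext `sc` assumed):

    val counts = sc.textFile("hdfs:///data/books.txt")
      .flatMap(_.split("\\s+"))                        // split lines into words
      .map(word => (word, 1))                          // pair each word with a count of 1
      .reduceByKey(_ + _)                              // sum the counts per word
    counts.sortBy(-_._2).take(10).foreach(println)     // action: print the top ten words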

16 Postgres vs Spark
SQL databases are good for: transactions, low latency, small-to-medium data sets, one big server (probably), and efficiency from decades of query optimizers. Spark is good for: ETL, data mining, machine learning, and cloud environments.
PostgreSQL: unlimited database size, 32 TB table size, 1.6 TB row size. Spark: 2014 Daytona GraySort contest winner, sorting 100 TB in 23 minutes and 1000 TB in 243 minutes.
Some of you might be wondering why you would use something like Spark rather than a SQL database. It is not really a valid comparison: Spark is meant for data mining and processing and has lots of tools and features for that, while databases are meant to process transactions and to store and update data. I encourage people to Google how to use Spark with Postgres; a sketch follows below.
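In that spirit, the two play well together. A sketch of reading a Postgres table into Spark over JDBC, where the connection details are placeholders and the Postgres JDBC driver is assumed to be on the classpath:

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/shop")   // hypothetical database
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "secret")
      .load()
    orders.groupBy("customer_id").count().show()              // analyze it like any other DataFrame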

17 Hadoop MapReduce vs Spark
Both perform batch computation, both support the map/reduce paradigm, both are distributed and fault-tolerant, and both use HDFS; Spark RDDs additionally enable iterative computation.
Some of you might also be wondering how Spark compares to MapReduce. This is an easier comparison: Spark is compatible with HDFS and more capable than MapReduce. If you encounter a stack using MapReduce that works, you're fine; if you have a choice, I'd suggest Spark for its performance and flexibility.

18 Spark Streaming

19 Micro Batch Model
Spark uses batch computation: take a bucket of data, transform the bucket of data, profit. What about unbounded streams of data? Micro-batches: take a bucket of data every N seconds…
One of Spark's most powerful features is Spark Streaming: perform near-real-time analysis on unbounded data. How? Take a stream, slice it up into buckets, and treat each bucket like a regular Spark batch (see the sketch after the DStreams slide).

20 DStreams
Micro-batch programming abstraction: an unbounded collection of RDDs. Still parallel, fault-tolerant, etc… Still lazy.
Spark provides a programming abstraction for micro-batches called DStreams, which are collections of RDDs.
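A minimal streaming word-count sketch covering both of the last two slides, assuming a SparkContext `sc` and a hypothetical socket source; the stream is sliced into 10-second micro-batches, each handled like an ordinary batch of RDDs:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc    = new StreamingContext(sc, Seconds(10))          // micro-batch every 10 seconds
    val lines  = ssc.socketTextStream("localhost", 9999)        // hypothetical source: a DStream of lines
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()                                              // output operation: without one, nothing runs
    ssc.start()                                                 // start receiving and processing
    ssc.awaitTermination()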

21 Streaming DEMO

22 Try This At Home

23 spark.apache.org
cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
databricks.com
sparkhub.databricks.com
deepspace.jpl.nasa.gov
mapr.com/blog/game-changing-real-time-use-cases-apache-spark-on-hadoop
databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html
jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-architecture.html
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
mapr.com/blog/how-get-started-using-apache-spark-graphx-scala
blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html

