Presentation is loading. Please wait.

Presentation is loading. Please wait.

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,

Similar presentations


Presentation on theme: "Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,"— Presentation transcript:

1 Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin, Michael Franklin, Scott Shenker, Ion Stoica AMP Lab, UC Berkeley Making Big Data Analytics Interactive UC BERKELEY

2 Motivation Data is growing faster than computation speeds »Web apps, mobile, scientific, … Requires large clusters to analyze Programming clusters is hard »Failures, placement, load balancing

3 Spark Programming Model Manipulate distributed datasets as you would local collections Concept: resilient distributed datasets (RDDs) »Distributed collections of objects that are automatically reconstructed on failure »Built through parallel transformations (map, filter, etc) »Can be cached in memory for fast reuse

4 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2)) messages.cache() Block 1 Block 2 Block 3 Worker Master messages.filter(_.contains(“Linux”)).count() messages.filter(_.contains(“PHP”)).count() tasks results Cache 1 Cache 2 Cache 3 Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5 sec (vs 170 s for on-disk data)

5 Projects Built on Spark Spark available in Java, Scala, Python Growing open source community »500-member meetup »14 companies contributing Using it to develop a stack of fast analytics tools »Shark SQL, Spark Streaming, … Spark Mesos Shark Streaming Graph Hadoop, MPI... Learning Private Cluster Cloud BlinkDB Berkeley Data Analytics Stack (BDAS)

6 Find Out More Visit us at the AMP Lab: 4 th floor Soda Hall UC BERKELEY www.spark-project.org


Download ppt "Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,"

Similar presentations


Ads by Google