Spark in the Hadoop Ecosystem
Eric Baldeschwieler (a.k.a. Eric14)

Who is Eric14?
- A Hadoop ecosystem cheerleader & tech advisor
- Previously CTO/CEO of Hortonworks
- VP Hadoop, Yahoo!

Spark “on the radar”
- Yahoo! Hadoop team collaboration with Berkeley AMP/RAD Lab begins
- Spark example built for Nexus -> Mesos
- “Spark is 2 years ahead of anything at Google”
- Conviva seeing good results with Spark
- Yahoo! working with Spark / Shark
- Today: many success stories, early commercial support

Spark Today

Spark updates Hadoop
- Hardware has advanced since Hadoop started:
  - Very large RAMs, faster networks (10Gb+)
  - Bandwidth to disk not keeping up
- MapReduce is awkward for key big data workloads:
  - Low-latency dispatch (e.g. quick queries)
  - Iterative algorithms (e.g. ML, graph…)
  - Streaming data ingest
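To make the point about iterative algorithms concrete, here is a toy model in plain Python. It is not Spark's API: the `Dataset` class and `fit_mean` function are hypothetical names invented for this sketch, and the "disk" is simulated by a read counter. It only illustrates why an algorithm that makes many passes over the same data benefits when that data can stay in memory between passes.

```python
# Toy model (not Spark's API): count how often an iterative job re-reads its input.

class Dataset:
    """Stands in for a file on disk; `reads` counts full scans of it."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self._values)


def fit_mean(load, iterations=20, rate=0.5):
    """Gradient descent toward the mean; calls `load()` once per pass."""
    estimate = 0.0
    for _ in range(iterations):
        points = load()
        gradient = sum(estimate - x for x in points) / len(points)
        estimate -= rate * gradient
    return estimate


# Disk-bound style: every pass goes back to storage.
on_disk = Dataset([1.0, 2.0, 3.0, 6.0])
disk_result = fit_mean(on_disk.scan)

# Cached style: scan once, then iterate over the in-memory copy.
cached_source = Dataset([1.0, 2.0, 3.0, 6.0])
in_memory = cached_source.scan()
cached_result = fit_mean(lambda: in_memory)
```

Both runs compute the same estimate, but the first performs one full scan per iteration while the second performs a single scan in total.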

Spark, “lingua franca?”
- Support for many development techniques
  - SQL, streaming, graph & in-memory, MapReduce
  - Write “UDFs” once and use in all contexts
- Small, simple & elegant API
  - Easy to learn and use; expressive & extensible
  - Retains advantages of MapReduce (fault tolerance…)
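The "write once, use in all contexts" idea can be sketched with the Python standard library alone. This is an analogy, not Spark code: the function `clean_host`, the sample URLs, and the `visits` table are all made up for illustration, and SQLite stands in for a SQL engine. The point is only that one ordinary function can serve a batch job, a record-at-a-time stream, and a SQL query without being rewritten.

```python
import sqlite3


def clean_host(url):
    """The shared function: extract a lowercase host name from a URL string."""
    return url.split("//")[-1].split("/")[0].lower()


urls = ["http://Spark.apache.org/docs", "https://hadoop.apache.org/"]

# 1. Batch context: apply over a whole collection.
batch_hosts = [clean_host(u) for u in urls]


# 2. Streaming context: apply lazily to records as they arrive.
def stream(records):
    for record in records:
        yield clean_host(record)


stream_hosts = list(stream(iter(urls)))

# 3. SQL context: register the same function and call it from a query.
conn = sqlite3.connect(":memory:")
conn.create_function("clean_host", 1, clean_host)
conn.execute("CREATE TABLE visits (url TEXT)")
conn.executemany("INSERT INTO visits VALUES (?)", [(u,) for u in urls])
sql_hosts = [
    row[0]
    for row in conn.execute("SELECT clean_host(url) FROM visits ORDER BY rowid")
]
conn.close()
```

All three contexts call the same `clean_host` definition, so a fix or improvement to it applies everywhere at once.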

Spark often better
- Today you will hear many success stories from teams who have converted Hadoop-based workloads to Spark and seen:
  - Huge speedups
  - Big cost savings
- But there are cases where Hadoop is superior…
  - Proven to work at the largest scales
  - Mature & widely commercially supported
  - Much larger ecosystem of solutions and tools

Spark complements Hadoop
- Hadoop 2.x being widely deployed this year
  - Spark now just another kind of YARN job
- Spark leverages Hadoop ecosystem
  - HDFS, HCatalog, Data Input/OutputFormats
  - Huge investment in data collection & tooling

The Future

Spark and Hadoop
- Hadoop vs. BDAS is not the story!
- Yes, Spark is useful separately from Hadoop and may flourish independently…
- But with YARN, Spark can be brought to the data in every Hadoop 2 cluster in the world…
- Complementing the investment already made by enterprise.

Spark the “lingua franca”
- Data scientists & developers need an open standard for sharing their algorithms & functions, an “R” for big data.
- Spark is the best current candidate:
  - Open source - Apache Foundation
  - Expressive (MR, iteration, graphs, SQL, streaming)
  - Easily extended & embedded (DSLs, Java, Python…)

“The future is already here. It’s just not very evenly distributed.” – William Gibson