CS110: Discussion about Spark

Presentation transcript:

CS110: Discussion about Spark
Yijun Yuan
May 30th, 2018

00 Schedule
- Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Spark cluster and parallel programming (in lab)
From https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/

01 Big Data Problem and Possible Solutions
The Big Data Challenge:

01 Big Data Problem and Possible Solutions
Older solution: one giant server with lots of resources
- Data needs to be copied to the server in real time.
Scale-out solution: multiple machines for a single task
- More machines, plus better infrastructure and frameworks (storage, network, etc.).

01 Big Data Problem and Possible Solutions
Distributed system challenges:
- How to distribute the work?
- How to ensure coherence?
- How to deal with faults?

01 Big Data Problem and Possible Solutions
Big Data solutions:
- Hadoop (HDFS + MapReduce)
- Spark (in-memory computing on cluster resources)

01 Big Data Problem and Possible Solutions
MapReduce:
- Map: take a large problem, divide it into sub-problems, and run the same function on every sub-problem.
- Reduce: combine the outputs of all sub-problems.
Examples:
- Radix sort
- Word count
- Gradient descent
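The word count example can be sketched in plain Python to show the two phases. This is only an illustration that runs on one machine, not Hadoop or Spark code: the function names `map_phase`, `shuffle`, and `reduce_phase` and the input `chunks` are hypothetical, and a real framework would run each chunk on a different worker.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


# Hypothetical input, already split into chunks (one per worker).
chunks = ["spark and hadoop", "hadoop and mapreduce", "spark spark"]

# The same map function runs on every chunk; the results are then combined.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The map step never looks at more than one chunk, which is what lets the chunks be processed independently.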

01 Big Data Problem and Possible Solutions
Spark advantages:
1. High-level abstraction: focus on what, not how
2. Cluster computing
   a. Managed by a single master node
   b. Work is distributed to worker nodes
   c. Scalable and fault tolerant
3. Distributed storage
   a. Data is distributed when it is stored
   b. Replication for efficiency and fault tolerance
4. High performance through in-memory processing and caching

01 Big Data Problem and Possible Solutions
Spark and Hadoop are built to co-exist:
- Spark can use other storage systems (S3, local disks, NFS), but works best with HDFS.
- It uses Hadoop input and output formats.

01 Big Data Problem and Possible Solutions
Extensions of Spark

01 Big Data Problem and Possible Solutions
Spark use cases: a combination of massive data, intensive computing, and iterative algorithms, e.g. index building, graph creation, pattern recognition, and ML.
Reasons:
- Distributed storage
- Distributed computing
- In-memory processing and pipelining

02 Basic Spark Core
Spark shell

02 Basic Spark Core
- Spark Context: configuration of the file system
- RDD: Resilient Distributed Datasets

02 Basic Spark Core
RDD: Resilient Distributed Datasets
Operations:
- Actions return values (count, take, collect): this is where the calculation happens.
- Transformations define a new RDD (map, filter): they only set things up.
- An RDD is immutable.
- Piped functional programming: RDD operations take functions as parameters.
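The split between transformations and actions can be sketched with a small stand-in class. `MiniRDD` is hypothetical and is not how Spark implements RDDs; it assumes a single machine and only mirrors the behaviour described above: transformations return a new object without computing anything, and actions run the whole pipeline.

```python
import itertools


class MiniRDD:
    """Hypothetical stand-in for an RDD: immutable, with lazy transformations."""

    def __init__(self, source):
        # `source` is a zero-argument callable returning a fresh iterator.
        self._source = source

    @classmethod
    def from_list(cls, items):
        data = tuple(items)  # frozen copy, so the dataset cannot change
        return cls(lambda: iter(data))

    # Transformations: return a new MiniRDD and compute nothing yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the pipeline and return a plain Python value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def take(self, n):
        return list(itertools.islice(self._source(), n))


calls = []  # records every element the mapped function actually sees


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD.from_list(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
work_before_action = len(calls)   # still 0: only the pipeline was defined
result = evens_squared.collect()  # the action triggers the computation
print(work_before_action, result)
```

Because every transformation builds a new object, `numbers` is unchanged after the pipeline is defined, which matches the "RDD is immutable" point.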

03 Work with RDD
- RDD creation
- RDD basics
- Sampling
- Set operations
- Aggregations
- Key/value pairs
We run the examples step by step in a Python notebook.
API doc: https://spark.apache.org/docs/2.2.0/api/python/index.html
PySpark tutorial: https://github.com/jadianes/spark-py-notebooks

03 RDD creation
- textFile
- parallelize

03 RDD basics
- map
- filter
- collect
- count
- take

03 Sampling
- sample
- takeSample
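The difference between the two sampling calls can be sketched in plain Python, assuming sampling without replacement. The helper names `sample` and `take_sample` below are hypothetical local functions, not the PySpark methods: the first keeps each element independently with probability `fraction`, so the result size varies, while the second returns an exact number of elements as a local list.

```python
import random


def sample(items, fraction, seed=None):
    """Keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def take_sample(items, num, seed=None):
    """Return exactly `num` elements (or all of them if fewer exist)."""
    rng = random.Random(seed)
    pool = list(items)
    return rng.sample(pool, min(num, len(pool)))


data = list(range(100))
approx = sample(data, 0.1, seed=42)      # roughly 10 elements, not exactly
exact = take_sample(data, 10, seed=42)   # always 10 elements
print(len(approx), len(exact))
```

In Spark the first kind of call yields another distributed dataset, whereas the second brings a fixed-size list back to the driver, so it suits small samples.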

03 Set operations
- subtract
- distinct
- cartesian
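What these three operations compute can be sketched on ordinary lists. The functions below are hypothetical local helpers, not the PySpark methods, and the sample data is made up; note that Spark makes no promise about element order, while this sketch happens to keep input order.

```python
from itertools import product


def subtract(left, right):
    """Keep the elements of `left` that do not appear in `right`."""
    exclude = set(right)
    return [x for x in left if x not in exclude]


def distinct(items):
    """Drop duplicates (first occurrence is kept in this sketch)."""
    return list(dict.fromkeys(items))


def cartesian(left, right):
    """Every (a, b) pair with a taken from `left` and b from `right`."""
    return list(product(left, right))


visited = ["home", "cart", "home", "help", "cart"]  # made-up data
only_new = subtract(distinct(visited), ["help"])
pairs = cartesian([1, 2], ["x", "y"])
print(only_new, pairs)
```

The size of a cartesian product is the product of the two input sizes, so on large datasets it is an expensive operation.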

03 Aggregations
- reduce
- aggregate
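A plain-Python sketch of the two aggregations, assuming the data is already split into partitions. The helper `aggregate` is a hypothetical local function that mirrors the idea behind the RDD method: a zero value, one function that folds elements inside each partition, and one that merges the partition results. `reduce` needs the result to have the same type as the elements; `aggregate` does not, which is why it can compute a (sum, count) pair.

```python
from functools import reduce

# Made-up data, already split into three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]

# reduce: combine all elements with one associative, commutative function.
total = reduce(lambda a, b: a + b, (x for part in partitions for x in part))


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partials with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# Accumulate a (sum, count) pair so the mean can be computed in one pass.
sum_count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # add one element to a partial
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # merge two partials
)
mean = sum_count[0] / sum_count[1]
print(total, sum_count, mean)
```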

03 Key/value pairs
- reduceByKey
- countByKey
- combineByKey
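The three key/value operations can be sketched on partitioned lists of `(key, value)` pairs. The functions below are hypothetical local helpers, not the PySpark methods. The sketch assumes the usual three-function shape of combineByKey (create a combiner from the first value, merge a value into a combiner, merge two combiners) and expresses reduce-by-key as a special case of it.

```python
from collections import Counter


def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Combine values per key inside each partition, then across partitions."""
    result = {}
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        for key, combiner in local.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


def reduce_by_key(partitions, fn):
    """Special case: the combiner has the same type as the values."""
    return combine_by_key(partitions, lambda v: v, fn, fn)


def count_by_key(partitions):
    """Count how many pairs carry each key."""
    return dict(Counter(key for part in partitions for key, _ in part))


# Made-up data, already split into two partitions.
partitions = [[("a", 1), ("b", 4)], [("a", 3), ("b", 2), ("c", 5)]]

sums = reduce_by_key(partitions, lambda x, y: x + y)
counts = count_by_key(partitions)

# Per-key average: the combiner is a (sum, count) pair, not a single value.
sum_counts = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {key: s / n for key, (s, n) in sum_counts.items()}
print(sums, counts, averages)
```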

THANKS!