CS 239 – Big Data Systems Fall 2018

Harry Xu UCLA

My Research Background
  Programming languages and compilers
    Static and dynamic program analysis
    Compiler
    Runtime system
  Big Data systems
    Dataflow systems
    Graph systems
    Distributed systems
    Single-machine disk-based systems
  Some industrial experience
    Microsoft: created and solely developed an optimizing compiler for Cosmos/Scope that improved the overall performance of production jobs by up to 3x
    IBM: created and developed a series of profiling tools for large-scale systems
  Big Data system support for scalable program analysis
  Language/runtime support for scalable systems

[Figure: diagram labeled "BigDatalog", "Application Circle", and "Infrastructure Circle"]

This Course: Big Data Systems
  What it is about
    Low-level infrastructures
    Programming models
    Runtimes
    Scalability and efficiency
  What it is NOT about
    High-level applications
    Workloads
    Data collection and usage
  An example
    We are going to discuss some papers on machine learning systems
    We are NOT going to discuss learning models and algorithms because I don’t know much about them

Industrial Relevance
  Many papers came directly from industry
    GFS, MapReduce, Bigtable, Spanner, TensorFlow (Google)
    HDFS (Yahoo)
    Azure, Trill, Dryad, Naiad (Microsoft)
    Spark, Tachyon (Databricks)
  Applications vs. systems
    Many people can develop applications
    Few people can develop systems
    Applications are specific to domains, while the skills required to build infrastructures are generic

Goals to Achieve
  Understand what systems are available for data analytics
  Understand fundamental challenges in system design
  Understand how to design a customized system for a certain workload
  Gain experience with system development by proposing and implementing a new idea

What This Course is Related To
  Distributed systems
  Database systems
  Computer architecture
  Networking
  Storage (memory, disk, file system, etc.)
  Graph algorithms
  Statistics
  Machine learning

Aspects of Big Data Processing
  Where to put data?
  How to process data at scale?
  How to process different types of data?
    Structured data
    Unstructured data
    Streaming data
    Graph data
    Data for model training
  How to take advantage of technological advances?
  How to make processing efficient?

Topics Covered (I)
  Distributed storage systems
    HDFS, GFS, Bigtable, Spanner, and Azure storage
  Dataflow engines
    MapReduce, Dryad, AsterixDB, Spark
  Batch processing
    Hive, Spark SQL, and SCOPE
  Resource management
    Mesos, YARN, LATE, Borg, Sparrow

Topics Covered (II)
  Stream processing
    Storm, Flink, Kafka, Naiad, Trill, SVE, Drizzle
  Graph processing
    Pregel, Ligra, GraphChi, X-Stream, GridGraph
  Machine learning
    TensorFlow, Parameter Servers, Project Adam

Why Do We Need Those Systems
  Enablers
  Better performance
    Scalability
    Efficiency
    Energy
  Easy/flexible programmability

Course Structure
  Paper critiques
    Due before each presentation day
  Presentation
    20-25 minutes
    Participation in active discussion
  Project
    2-3 students form a group, working on an innovative idea in system development

Things about Presentations/Critiques
  Reuse slides as much as possible
  A good rule of thumb is to follow this order
    What problems does the paper solve?
    Why are they (serious) problems?
    Why aren’t they already solved?
    What are the main challenges?
    How did the authors overcome them?
    What evidence did the authors show that the problems are solved?
    Questions, concerns, opportunities for improvement
