Advanced Topics in Distributed Systems Fall 2011 Instructor: Costin Raiciu.

We’ve gotten used to great applications

Enabling Such Apps is Hard
Apps
– Process huge amounts of data
– Are fast
– Are reliable
One machine is not enough
– Limited reliability, speed
Supercomputers are expensive

What Makes These Applications Tick?

Distributed Systems

This course…
Cares about technology relating to distributed systems:
– Networks
– Virtual machines
– Distributed filesystems
– Distributed computation
We care about details, not about products
– Why?

Traditional Data Center Network Topology
[Figure: racks of servers, top-of-rack switches, aggregation switches and a core switch; link labels: 1Gbps, 10Gbps]

Fat Tree Topology [Al-Fares et al., 2008; Clos, 1953]
[Figure: K pods with K switches each, drawn for K=4; aggregation switches above racks of servers; link label: 1Gbps]
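The sizes implied by "K pods with K switches each" can be worked out with a short sketch. It assumes the standard k-ary fat-tree construction (identical k-port switches, with each pod's k switches split evenly into edge and aggregation layers); the function name `fat_tree_sizes` is made up for this illustration.

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts for a k-ary fat tree built from k-port switches.

    Assumes the standard construction: k pods, each with k/2 edge and
    k/2 aggregation switches, and k/2 hosts attached to every edge switch.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even number")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,      # (k/2)^2
        "hosts": k * half * half,          # k^3 / 4
    }


if __name__ == "__main__":
    for name, count in fat_tree_sizes(4).items():
        print(f"{name}: {count}")
```

For K=4 this gives 4 pods, 8 edge switches, 8 aggregation switches, 4 core switches and 16 hosts.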

Inside a Machine: Virtualization
Many operating systems running on a single box
Provides:
– Isolation
– Flexibility
– Better utilization of the machine

How do we store data?
Distributed filesystem
– NFS: UNIX-like semantics
   – Single server
   – Limited scalability
– Google File System
   – Optimized for large-batch writes and sequential reads
   – Tolerates inconsistency

How do we get work done?
MapReduce
– Apply the same function in parallel to different data on many machines
– Aggregate results
Useful for:
– Building big web-search indices
– Processing large amounts of data (PB)
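The two steps above (apply a function to every record, then aggregate per key) can be sketched in a single process. This is only an illustration of the programming model, not Hadoop's API; the helper names `map_reduce`, `word_count_map` and `word_count_reduce` are hypothetical. A real framework runs the map calls on many machines and shuffles the intermediate pairs over the network.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Single-process sketch of the MapReduce model."""
    groups = defaultdict(list)
    # Map phase: every record independently yields (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # stands in for the shuffle step
    # Reduce phase: aggregate all values that share a key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    """Emit (word, 1) for each whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Sum the partial counts for one word."""
    return sum(counts)


if __name__ == "__main__":
    lines = ["the quick fox", "the lazy dog"]
    print(map_reduce(lines, word_count_map, word_count_reduce))
```

Because each map call touches only its own record, the map phase can be split across machines without coordination; only the grouping by key needs data to move.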

This is just a taster

Course outline
Distributed Apps we care about
– Distributed Computation (MapReduce, Dryad, Hadoop)
– Distributed Filesystems (NFS and GFS)
– Web search
– Caching (Memcached)
– Distributed Hash Tables (Chord, Dynamo)
– NoSQL databases (BigTable, Cassandra)
Infrastructure: networks
– Topologies: FatTree, VL2, BCube
– Using capacity: Hedera, MPTCP
– Performance Optimizations: Incast, DCTCP

Course outline [2]
Infrastructure: OS abstractions
– Virtual Machines (Xen, VMM)
– Distributed memory (Ivy)
Security
– Information Leakage
– Good Isolation vs. High Utilization (Seawall, CloudPolice)

Course Admin
Lectures:
– 2 hours per week, Tuesday 8-10 EC102
Lab classes:
– 2 hours per week, Tuesday EG106
– Project discussions
– Help with practical issues
– Help with high level goals, theory
Website: curs.cs.pub.ro
– If you have problems, let me know

Grading
Project: 5p
– Groups of 3-4 students
– 4 stages: to help you get the job done easily, without last minute work over Christmas
Exam: 3p
Presentation (1h): 1p
Class participation: 1p

Presentation
Select one topic before the end of October (list will be posted this week)
– Presentation date is fixed
– If you miss your presentation, you lose 2p
Class participation
– 2 papers presented per course by your colleagues
– Read them before and take part in discussion

Exam
Open book
Need to understand and think, not memorize
Studying 3 days before the exam won't work
– You need to take part in classes and read up

Projects
Large scale data processing with MapReduce
– We will use Apache Hadoop
– We will run code on Amazon EC2 (and maybe on local clusters)
– Several datasets you can choose from

Datasets available
– Crawled set of HTML pages from .uk
– Wikipedia Page Traffic Statistics
– Apache Mail Archives
– Million Song Dataset
– M-Lab dataset: Network Path and Application Diagnosis tool
– Human genome
– US Census databases
– Freebase data dump

Stage 1
Choose a dataset to use
Select one or more questions to answer using the dataset
Write a small Hadoop script to parse a subset of the data
Come up with a few simple graphs (e.g. dataset size, histograms)
Start writing: introduction to your report, problem statement
Start the implementation and evaluation
– Size of dataset, time to do one pass, etc.
Strict deadline [1p]: November 1st
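A "small Hadoop script" for this stage could look like the sketch below, written in the style of a Hadoop Streaming job, where the mapper and reducer read text lines and emit tab-separated key/value lines. Everything specific here is an assumption for illustration: the input is assumed to be whitespace-separated lines of the form `project page_title view_count bytes` (a guess at a page-traffic layout that should be checked against the real files), and the function names and the `map`/`reduce` command-line switch are hypothetical.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'project<TAB>views' for each well-formed input line.

    Assumed (not verified) record layout: project page_title view_count bytes
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # skip malformed records instead of failing the job
        yield f"{fields[0]}\t{fields[2]}"


def reducer(lines):
    """Sum views per project; expects input already sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for project, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{project}\t{sum(int(views) for _, views in group)}"


# Run as "python job.py map" or "python job.py reduce", reading stdin.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Before paying for cluster time, the same file can be checked on a small sample locally by piping the sample through the map step, then `sort`, then the reduce step. The per-project totals it prints are the kind of number that feeds a first histogram.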

Stage 2
How do we solve the problem?
– Review related work
– Select potential approaches
   – Discuss pros/cons
Implementation and evaluation
– Implement the code
– Run experiments
– Refine code and reiterate
Goal: 70% of functionality should be implemented
Deadline [1p]: December 1st
– Output in report:
   – Implementation section
   – Early evaluation section

Stage 3
Final implementation
Evaluation
What did we learn?
Deadline [1p]: December 21st
– In-class project presentation: 10 mins

Stage 4
Write-up
– Polish report
– Create a coherent story
– Convince me that this is useful
Deadline to hand in final report: last day of semester (January 14th) [1p]