Google Cloud computing techniques (Lecture 03) 18th Jan 20161Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059.

Slides:



Advertisements
Similar presentations
Introduction to cloud computing
Advertisements

Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
The google file system Cs 595 Lecture 9.
Distributed Computations
Lecture 6 – Google File System (GFS) CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation.
Lecture 7 – Bigtable CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation is licensed.
CS 345A Data Mining MapReduce. Single-node architecture Memory Disk CPU Machine Learning, Statistics “Classical” Data Mining.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
7/2/2015EECS 584, Fall Bigtable: A Distributed Storage System for Structured Data Jing Zhang Reference: Handling Large Datasets at Google: Current.
Distributed Computations MapReduce
BigTable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
BigTable CSE 490h, Autumn What is BigTable? z “A BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by.
Inexpensive Scalable Information Access Many Internet applications need to access data for millions of concurrent users Relational DBMS technology cannot.
David Gibbs and Govardhan Tanniru Georgia State University Department of Computer Science P.O. Box 3965 Atlanta, GA
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google∗
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
Cloud Computing. Evolution of Computing with Network (1/2) Network Computing  Network is computer (client - server)  Separation of Functionalities Cluster.
MapReduce.
Google and Cloud Computing Google 与云计算 王咏刚 Google 资深工程师.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach M. Burrows, T. Chandra, A. Fikes, R.E.
1 The Google File System Reporter: You-Wei Zhang.
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
Introduction to MapReduce Amit K Singh. “The density of transistors on a chip doubles every 18 months, for the same cost” (1965) Do you recognize this.
Jeffrey D. Ullman Stanford University. 2 Chunking Replication Distribution on Racks.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
Google’s Big Table 1 Source: Chang et al., 2006: Bigtable: A Distributed Storage System for Structured Data.
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
MapReduce M/R slides adapted from those of Jeff Dean’s.
Hypertable Doug Judd Zvents, Inc.. hypertable.org Background.
Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
Eduardo Gutarra Velez. Outline Distributed Filesystems Motivation Google Filesystem Architecture The Metadata Consistency Model File Mutation.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
CSC590 Selected Topics Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Essential Google by Zhongyuan Wang Outline Motivation & Goals Problems Solution : BigTable File System vs Database Google’s Database : Google.
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
Bigtable : A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows,
Bigtable: A Distributed Storage System for Structured Data
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Lecture 24: GFS.
Bigtable: A Distributed Storage System for Structured Data Google Inc. OSDI 2006.
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Presenter: Chao-Han Tsai (Some slides adapted from the Google’s series lectures)
Bigtable A Distributed Storage System for Structured Data.
From Coulouris, Dollimore, Kindberg and Blair Distributed Systems: Concepts and Design Chapter 3 System Models.
Lecture 7 Bigtable Instructor: Weidong Shi (Larry), PhD
Hadoop Aakash Kag What Why How 1.
Map Reduce.
CSE-291 (Cloud Computing) Fall 2016
Google and Cloud Computing
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung Google Presented by Jiamin Huang EECS 582 – W16.
MapReduce Simplied Data Processing on Large Clusters
Hadoop Basics.
John Kubiatowicz (with slides from Ion Stoica and Ali Ghodsi)
Presentation transcript:

Google Cloud computing techniques (Lecture 03) 18th Jan 20161Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

The Google File System

The Google File System (GFS) A scalable distributed file system for large distributed data intensive applications Multiple GFS clusters are currently deployed. The largest ones have: storage nodes 300+ TeraBytes of disk storage heavily accessed by hundreds of clients on distinct machines

Introduction Shares many same goals as previous distributed file systems performance, scalability, reliability, etc GFS design has been driven by four key observation of Google application workloads and technological environment

Intro: Observations 1 1. Component failures are the norm constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system 2. Huge files (by traditional standards) Multi GB files are common I/O operations and blocks sizes must be revisited

Intro: Observations 2 3. Most files are mutated by appending new data This is the focus of performance optimization and atomicity guarantees 4. Co-designing the applications and APIs benefits overall system by increasing flexibility

The Design Cluster consists of a single master and multiple chunkservers and is accessed by multiple clients

Operation Log Record of all critical metadata changes Stored on Master and replicated on other machines Defines order of concurrent operations Changes not visible to clients until they propigate to all chunk replicas Also used to recover the file system state

System Interactions: Leases and Mutation Order

Introduction to MapReduce 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

MapReduce: Insight ”Consider the problem of counting the number of occurrences of each word in a large collection of documents” How would you do it in parallel ? 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Map operation Map, a pure function, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs. e.g. (doc—id, doc-content) Draw an analogy to SQL, map can be visualized as group-by clause of an aggregate query. 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Reduce operation On completion of map phase, all the intermediate values for a given output key are combined together into a list and given to a reducer. Can be visualized as aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute. 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

MapReduce: Execution overview 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

MapReduce: Example 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

MapReduce in Parallel: Example 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

MapReduce: Some More Apps Distributed Grep. Count of URL Access Frequency. Clustering (K-means) Graph Algorithms. Indexing Systems MapReduce Programs In Google Source Tree 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

MapReduce: Extensions and similar apps PIG (Yahoo) Hadoop (Apache) DryadLinq (Microsoft) 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Large Scale Systems Architecture using MapReduce User App MapReduce Distributed File Systems (GFS) 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

BigTable: A Distributed Storage System for Structured Data 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Introduction BigTable is a distributed storage system for managing structured data. Designed to scale to a very large size Petabytes of data across thousands of servers Used for many Google projects Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, … Flexible, high-performance solution for all of Google’s products 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank, … Per-user data: User preference settings, recent queries/search results, … Geographic locations: Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, … Scale is large Billions of URLs, many versions/page (~20K/version) Hundreds of millions of users, thousands or q/sec 100TB+ of satellite image data 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Why not just use commercial DB? Scale is too large for most commercial databases Even if it weren’t, cost would be very high Building internally means system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly Much harder to do when running on top of a database layer 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

BigTable Distributed multi-level map Fault-tolerant, persistent Scalable Thousands of servers Terabytes of in-memory data Petabyte of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically Servers adjust to load imbalance 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Basic Data Model A BigTable is a sparse, distributed persistent multi-dimensional sorted map (row, column, timestamp) -> cell contents Good match for most Google applications 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

WebTable Example Want to keep copy of a large collection of web pages and related information Use URLs as row keys Various aspects of web page as column names Store contents of web pages in the contents: column under the timestamps when they were fetched. 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Rows Name is an arbitrary string Access to data in a row is atomic Row creation is implicit upon storing data Rows ordered lexicographically Rows close together lexicographically usually on one or a small number of machines 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

Rows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines. Can exploit this property by selecting row keys so they get good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys 18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore

18th Jan Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore