Presentation is loading. Please wait.

Presentation is loading. Please wait.

Google Cloud computing techniques (Lecture 03) 18th Jan 20161Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059.

Similar presentations


Presentation on theme: "Google Cloud computing techniques (Lecture 03) 18th Jan 20161Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059."— Presentation transcript:

1 Google Cloud computing techniques (Lecture 03) 18th Jan 20161Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

2 The Google File System

3 The Google File System (GFS) A scalable distributed file system for large distributed data intensive applications Multiple GFS clusters are currently deployed. The largest ones have: 1000+ storage nodes 300+ TeraBytes of disk storage heavily accessed by hundreds of clients on distinct machines

4 Introduction Shares many same goals as previous distributed file systems performance, scalability, reliability, etc GFS design has been driven by four key observation of Google application workloads and technological environment

5 Intro: Observations 1 1. Component failures are the norm constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system 2. Huge files (by traditional standards) Multi GB files are common I/O operations and blocks sizes must be revisited

6 Intro: Observations 2 3. Most files are mutated by appending new data This is the focus of performance optimization and atomicity guarantees 4. Co-designing the applications and APIs benefits overall system by increasing flexibility

7 The Design Cluster consists of a single master and multiple chunkservers and is accessed by multiple clients

8 Operation Log Record of all critical metadata changes Stored on Master and replicated on other machines Defines order of concurrent operations Changes not visible to clients until they propigate to all chunk replicas Also used to recover the file system state

9 System Interactions: Leases and Mutation Order

10 Introduction to MapReduce 18th Jan 201610Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

11 MapReduce: Insight ”Consider the problem of counting the number of occurrences of each word in a large collection of documents” How would you do it in parallel ? 18th Jan 201611Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

12 Map operation Map, a pure function, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs. e.g. (doc—id, doc-content) Draw an analogy to SQL, map can be visualized as group-by clause of an aggregate query. 18th Jan 201612Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

13 Reduce operation On completion of map phase, all the intermediate values for a given output key are combined together into a list and given to a reducer. Can be visualized as aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute. 18th Jan 201613Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

14 MapReduce: Execution overview 18th Jan 201614Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

15 MapReduce: Example 18th Jan 201615Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

16 MapReduce in Parallel: Example 18th Jan 201616Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

17 18th Jan 201617Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

18 MapReduce: Some More Apps Distributed Grep. Count of URL Access Frequency. Clustering (K-means) Graph Algorithms. Indexing Systems MapReduce Programs In Google Source Tree 18th Jan 201618Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

19 MapReduce: Extensions and similar apps PIG (Yahoo) Hadoop (Apache) DryadLinq (Microsoft) 18th Jan 201619Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

20 Large Scale Systems Architecture using MapReduce User App MapReduce Distributed File Systems (GFS) 18th Jan 201620Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

21 BigTable: A Distributed Storage System for Structured Data 18th Jan 201621Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

22 Introduction BigTable is a distributed storage system for managing structured data. Designed to scale to a very large size Petabytes of data across thousands of servers Used for many Google projects Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, … Flexible, high-performance solution for all of Google’s products 18th Jan 201622Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

23 Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank, … Per-user data: User preference settings, recent queries/search results, … Geographic locations: Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, … Scale is large Billions of URLs, many versions/page (~20K/version) Hundreds of millions of users, thousands or q/sec 100TB+ of satellite image data 18th Jan 201623Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

24 Why not just use commercial DB? Scale is too large for most commercial databases Even if it weren’t, cost would be very high Building internally means system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly Much harder to do when running on top of a database layer 18th Jan 201624Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

25 BigTable Distributed multi-level map Fault-tolerant, persistent Scalable Thousands of servers Terabytes of in-memory data Petabyte of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically Servers adjust to load imbalance 18th Jan 201625Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

26 Basic Data Model A BigTable is a sparse, distributed persistent multi-dimensional sorted map (row, column, timestamp) -> cell contents Good match for most Google applications 18th Jan 201626Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

27 WebTable Example Want to keep copy of a large collection of web pages and related information Use URLs as row keys Various aspects of web page as column names Store contents of web pages in the contents: column under the timestamps when they were fetched. 18th Jan 201627Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

28 Rows Name is an arbitrary string Access to data in a row is atomic Row creation is implicit upon storing data Rows ordered lexicographically Rows close together lexicographically usually on one or a small number of machines 18th Jan 201628Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

29 Rows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines. Can exploit this property by selecting row keys so they get good locality for data access. Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys 18th Jan 201629Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

30 18th Jan 201630Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

31 18th Jan 201631Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

32 18th Jan 201632Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

33 18th Jan 201633Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

34 18th Jan 201634Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

35 18th Jan 201635Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

36 18th Jan 201636Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

37 18th Jan 201637Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

38 18th Jan 201638Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

39 18th Jan 201639Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059

40 18th Jan 201640Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059


Download ppt "Google Cloud computing techniques (Lecture 03) 18th Jan 20161Dr.S.Sridhar, Director, RVCT, RVCE, Bangalore-560059."

Similar presentations


Ads by Google