
1 Homework 1: Common Mistakes
- Memory leak
- Storing memory pointers instead of data

2 Memory Leak
A program that uses “new” (“malloc”) without “delete” (“free”) suffers from a memory leak. Why?
- C++ “new” (C “malloc”) allocates space from the heap. If the program loses access to this space, that memory remains unusable until the program stops executing.
- Example pseudo-code:
    For cntr ranging from 1 to 100 do
    BEGIN
        Record = new byte[1024];
        Set Record fields;
        Insert the record into the database;
    END
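A minimal C++ rendering of the leaky pattern (a sketch; the record contents and the insert routine are hypothetical stand-ins for the homework's database API):

    #include <cstddef>
    #include <cstring>

    // Hypothetical stand-in for the database insert routine; a real DBMS
    // call would copy the bytes it needs.
    void insert_into_database(const unsigned char* record, std::size_t len) {
        (void)record; (void)len;
    }

    void leaky_insert_loop() {
        for (int cntr = 1; cntr <= 100; ++cntr) {
            unsigned char* record = new unsigned char[1024]; // 1 KB from the heap
            std::memset(record, 0, 1024);                    // "set Record fields"
            insert_into_database(record, 1024);
            // No delete[] here: the pointer goes out of scope at the end of
            // each iteration, so every 1 KB block stays unreachable until
            // the program exits.
        }
    }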

3 Memory Leak
Why is a memory leak bad?
- As the program executes, the available heap space shrinks.
- This space is allocated from the virtual memory managed by the operating system.
- If the program's virtual address space exceeds the available memory and the system starts to thrash, the program becomes very, very, ..., very slow!

4 Memory Leak
A program that uses “new” (“malloc”) without “delete” (“free”) suffers from a memory leak. The fix: return each allocation to the heap once it is no longer needed.
- Correct pseudo-code:
    For cntr ranging from 1 to 100 do
    BEGIN
        Record = new byte[1024];
        Set Record fields;
        Insert the record into the database;
        delete Record;
    END
The delete returns the 1 kilobyte back to the heap!
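In real C++, memory allocated with the array form new[] must be released with delete[] (plain delete on an array is undefined behavior). A sketch of the corrected loop, reusing the hypothetical insert routine from the previous example:

    void correct_insert_loop() {
        for (int cntr = 1; cntr <= 100; ++cntr) {
            unsigned char* record = new unsigned char[1024];
            std::memset(record, 0, 1024);        // "set Record fields"
            insert_into_database(record, 1024);  // the DBMS copies the bytes
            delete[] record;                     // return the 1 KB to the heap
        }
    }

Using a std::vector<unsigned char> (or std::unique_ptr<unsigned char[]>) instead of a raw new would release the memory automatically when the variable goes out of scope.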

5 Bad Design: Store Pointers
Insertion of record 1 inserts pointers (not data). Why is this design bad?
- When the program stops executing, all the memory addresses (pointers) stored in the database become invalid.
[Figure: record 1 with fields id = 5, age = 25, and a name field holding a pointer to the string "Shahram".]

6 Bad Design: Store Pointers
Insertion of record 1 inserts pointers (not data). Why is this design bad?
- When the program stops executing, all the memory addresses (pointers) stored in the database become invalid.
- The operating system may move "Shahram" around in memory, invalidating the stored address.
[Figure: the same record, with the stored pointer now referencing stale bytes.]

7 Good Design: Store Data
Serialize the record to generate a self-contained array of bytes. Insertion of record 1 then inserts the data itself into the DBMS. This is the right design!
[Figure: record 1 with the values 5, 25, and "Shahram" stored inline.]
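One possible serialization (a sketch; the record layout follows the figure, and the length-prefixed encoding is just one reasonable choice):

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Person { std::int32_t id; std::int32_t age; std::string name; };

    // Flatten the record into a self-contained byte array: fixed-width
    // fields first, then the name's length and its characters. No memory
    // addresses survive serialization.
    std::vector<unsigned char> serialize(const Person& p) {
        std::vector<unsigned char> buf;
        auto append = [&buf](const void* src, std::size_t n) {
            const unsigned char* b = static_cast<const unsigned char*>(src);
            buf.insert(buf.end(), b, b + n);
        };
        append(&p.id, sizeof p.id);
        append(&p.age, sizeof p.age);
        std::uint32_t len = static_cast<std::uint32_t>(p.name.size());
        append(&len, sizeof len);
        append(p.name.data(), len);
        return buf;   // store these bytes in the DBMS, not the struct itself
    }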

8 Homework 2
Posted on the 585 web site (http://dblab.usc.edu/csci585) and due on Feb 24th.
Objectives:
- Use of primary and secondary indexes.
- Highlight a few limitations of BDB when configured in main memory.
- Use of transactions to maintain the integrity of a main-memory database.
Read the homework description for the in-class review this Thursday.

9 Gamma DBMS (Part 3): Function Shipping versus Data Shipping Evaluation
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California

10 Data Shipping
The client retrieves data from the node and performs the computation locally.
Limitation: the servers are dumb, and shipping raw data consumes the limited network bandwidth.
[Figure: a node transmits its data to the client, which then processes f(x).]

11 Function Shipping
The client ships the function to the node for processing, and only the relevant output is sent back to the client.
- The function f(x) should produce less data than the original data stored in the database.
- This minimizes demand for the network bandwidth.
[Figure: the node processes f(x) locally and transmits only its output.]
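The contrast in miniature (a sketch; the Node type and its methods are illustrative, not a real Gamma interface):

    #include <functional>
    #include <vector>

    using Row = std::vector<unsigned char>;
    using Predicate = std::function<bool(const Row&)>;

    struct Node {
        std::vector<Row> rows;                         // data stored at this node
        std::vector<Row> all_rows() { return rows; }   // ship every row out
        std::vector<Row> run(const Predicate& f) {     // evaluate f(x) here
            std::vector<Row> out;
            for (const Row& r : rows) if (f(r)) out.push_back(r);
            return out;
        }
    };

    // Data shipping: every row crosses the network; the client filters.
    std::vector<Row> data_shipping(Node& node, const Predicate& f) {
        std::vector<Row> result;
        for (const Row& r : node.all_rows())
            if (f(r)) result.push_back(r);
        return result;
    }

    // Function shipping: the predicate travels to the node; only the
    // matching rows cross the network.
    std::vector<Row> function_shipping(Node& node, const Predicate& f) {
        return node.run(f);
    }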

12 Gamma
Gamma is based on function shipping.
- The hybrid-hash join partitions the referenced tables across the nodes of the shared-nothing architecture. Data does not leave the realm of the shared-nothing hardware.
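The partitioning idea in one line (a sketch; Gamma's actual split tables and hash functions differ):

    #include <cstddef>
    #include <functional>
    #include <string>

    // Route a row to a node by hashing its join key. Rows of A and Bprime
    // with equal keys land on the same node and can be joined locally.
    std::size_t node_for_key(const std::string& join_key, std::size_t num_nodes) {
        return std::hash<std::string>{}(join_key) % num_nodes;
    }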

13 Service Time
Focus on query service time (only one request executing in the system) as a function of the input table size.
- Hash partition the table.
- Store the results of each query back in the database. Why?

14 Why?
Seek time is a function of the distance traveled by the disk head.

15 Join Queries
Join tables A and Bprime.
- A is 10x the size of Bprime.
- The join produces the same number of records as Bprime.
Note that re-partitioning the table is not that expensive.

16 Join Queries
Join tables A and Bprime.
- A is 10x the size of Bprime.
- The join produces the same number of records as Bprime. Why?

17 How to Evaluate?
Focus on the use of parallelism and the scalability of the system. How?
- Speedup: given a table with r rows and a query, if the service time of the system is X with one node, does it speed up by a factor of n with n nodes?
- Scaleup: if the service time of a query referencing a table with r rows on a system with n nodes is X, does the service time remain X with a table of m*r rows and m*n nodes?
Both metrics measure the service time of the system because only one request is submitted to the system.
[Figure: ideal speedup and scaleup curves as a function of the number of nodes.]
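In symbols, writing T(n, r) for the service time of the query on n nodes against an r-row table, the two metrics can be stated as (a standard formulation consistent with the definitions above):

    \mathrm{Speedup}(n) = \frac{T(1, r)}{T(n, r)} \qquad
    \mathrm{Scaleup}(m) = \frac{T(n, r)}{T(mn, mr)}

Linear speedup means Speedup(n) = n; linear scaleup means Scaleup(m) = 1.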

18 Selection Predicates: Speedup
Super-linear speedup with the 1% non-clustered index selection and the 10% clustered index selection. The referenced table consists of 1 million rows.

19 Selection Predicates: Scaleup

20 Join Predicates: Speedup
One bucket, starting with 5 nodes.
- The results would have been super-linear if Bprime did not fit in the main memory of 5 nodes.

21 Join Predicates: Scaleup
Overhead of parallelism: the scheduler must activate, coordinate, and de-activate the different operators.

22 2009: Evolution of Gamma
A shared-nothing architecture consisting of thousands of nodes!
- A node is an off-the-shelf, commodity PC.
Google File System
Google’s Bigtable Data Model
Google’s Map/Reduce Framework
Yahoo’s Pig Latin
...

23 Gamma in 2009
A shared-nothing architecture consisting of thousands of nodes!
- A node is an off-the-shelf, commodity PC.
All of these systems divide & conquer:
Google File System
Google’s Bigtable Data Model
Google’s Map/Reduce Framework
Yahoo’s Pig Latin
...

24 Gamma in 2009
Source code for Pig and Hadoop is available for free download.
[Figure: the same stack, with Hadoop providing the file system and Map/Reduce layers and Pig providing Pig Latin.]

25 References
Pig Latin
- Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
Map/Reduce
- Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.
Bigtable
- Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
GFS
- Ghemawat et al. The Google File System. SOSP 2003.

26 Overview: Pig Latin
A high-level program that specifies a query execution plan.
- Example: for each sufficiently large category, retrieve the average pagerank of high-pagerank urls in that category.
- SQL, assuming a table urls(url, category, pagerank):
    SELECT category, AVG(pagerank)
    FROM urls
    WHERE pagerank > 0.2
    GROUP BY category
    HAVING COUNT(*) > 1000000;

27 Overview: Pig Latin
A high-level program that specifies a query execution plan.
- Example: for each sufficiently large category, retrieve the average pagerank of high-pagerank urls in that category.
- Pig Latin:
    1. Good_urls = FILTER urls BY pagerank > 0.2;
    2. Groups = GROUP Good_urls BY category;
    3. Big_groups = FILTER Groups BY COUNT(Good_urls) > 1000000;
    4. Output = FOREACH Big_groups GENERATE category, AVG(Good_urls.pagerank);

28 Overview: Map/Reduce (Hadoop)
A programming model that makes parallelism transparent to the programmer.
- The programmer specifies:
  - a map function that processes a key/value pair to generate a set of intermediate key/value pairs. This divides the problem into smaller "intermediate key/value" sub-problems.
  - a reduce function that merges all intermediate values associated with the same intermediate key. This solves each sub-problem. Final results might be stored across R files.
- The run-time system takes care of:
  - partitioning the input data across nodes,
  - scheduling the program's execution,
  - handling node failures,
  - coordination among the nodes.
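A single-process sketch of the map/reduce contract, using the canonical word-count example (the function names and the sequential "shuffle" are illustrative; a real framework runs map and reduce tasks in parallel across nodes):

    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // map: a document -> intermediate (word, 1) pairs
    std::vector<std::pair<std::string, int>> map_fn(const std::string& doc) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream in(doc);
        std::string word;
        while (in >> word) out.emplace_back(word, 1);
        return out;
    }

    // reduce: a word and all of its intermediate counts -> total count
    int reduce_fn(const std::string& /*word*/, const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    // Driver: group intermediate pairs by key (the "shuffle"), then reduce.
    std::map<std::string, int> word_count(const std::vector<std::string>& docs) {
        std::map<std::string, std::vector<int>> groups;
        for (const std::string& d : docs)
            for (const auto& kv : map_fn(d)) groups[kv.first].push_back(kv.second);
        std::map<std::string, int> result;
        for (const auto& g : groups) result[g.first] = reduce_fn(g.first, g.second);
        return result;
    }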

29 Overview: Bigtable
A data model (a schema):
- A sparse, distributed, persistent, multi-dimensional sorted map.
- Data is partitioned across the nodes seamlessly.
- The map is indexed by a row key, a column key, and a timestamp.
- Each value in the map is an un-interpreted array of bytes.
- (row: byte[], column: byte[], time: int64) -> byte[]
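A toy, single-machine rendering of that signature, with std::string standing in for byte[] (illustrative only; real Bigtable splits the sorted map into tablets spread across many nodes):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <tuple>

    // Key = (row, column, timestamp); the map is sorted by row, then
    // column, then time.
    using Key = std::tuple<std::string, std::string, std::int64_t>;
    using Table = std::map<Key, std::string>;  // value: uninterpreted bytes

    int main() {
        Table t;
        // Hypothetical cell: the contents column of one web-page row.
        t[Key{"com.cnn.www", "contents:", 1}] = "<html>...</html>";
        return 0;
    }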

30 Overview: Bigtable
Used in many different applications supported by Google.

31 Overview: GFS
A highly available, distributed file system for inexpensive commodity PCs.
- Treats node failures as the norm rather than the exception.
- Stores and retrieves multi-GB files.
- Assumes files are append-only (instead of updates that modify existing data in place).
- Provides an atomic append operation so multiple clients can append to a file with minimal synchronization.
- Uses a relaxed consistency model to simplify the file system and enhance performance.

32 How to start?
Bottom-up, starting with GFS.
Google File System
Google’s Bigtable Data Model
Google’s Map/Reduce Framework
Yahoo’s Pig Latin
...

33 Google File System: Assumptions

34 Google File System: Assumptions (Cont…)

35 GFS: Interfaces
Create, delete, open, close, read, and write files.
Snapshot a file:
- Creates a copy of the file.
Record append operation:
- Allows multiple clients to append data to the same file concurrently, while guaranteeing the atomicity of each individual client's append.

36 GFS: Architecture
One master and multiple chunkservers.
- A file is partitioned into fixed-size chunks.
- Each chunk has a globally unique 64-bit chunk handle.
- Each chunk is replicated on several chunkservers. The degree of replication is application specific; the default is 3.
Software:
- The master maintains all file system meta-data: the namespace, access control info, the mapping from files to chunks, and the current locations of chunks.
- A GFS client caches meta-data about the file system.
- Clients receive data directly from chunkservers.
- Neither clients nor chunkservers cache file data.

37 GFS: Architecture
One master and multiple chunkservers.
- A file is partitioned into fixed-size (64 MB) chunks.
- Each chunk has a globally unique 64-bit chunk handle.
- Each chunk is replicated on several chunkservers. The degree of replication is application specific; the default is 3.
Software:
- The master maintains all file system meta-data: the namespace, access control info, the mapping from files to chunks, and the current locations of chunks.
- A GFS client caches meta-data about the file system.
- Clients receive data directly from chunkservers; the client chooses one of the replicas.
- Neither clients nor chunkservers cache file data.
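Because chunks have a fixed size, a client can translate a (file name, byte offset) pair into a chunk index on its own, and then ask the master only for that chunk's handle and replica locations. A sketch of the arithmetic:

    #include <cstdint>

    constexpr std::uint64_t kChunkSize = 64ULL * 1024 * 1024;  // 64 MB chunks

    // Which chunk of the file holds this byte offset?
    std::uint64_t chunk_index(std::uint64_t byte_offset) {
        return byte_offset / kChunkSize;
    }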

38 GFS Master
A single master simplifies the software design.
The master monitors the availability of chunkservers using heart-beat messages.
A single master is a single point of failure:
- The master does not store chunk location information persistently: when the master starts (and whenever a chunkserver joins), it asks each chunkserver about its chunks.
- The master's meta-data: the file and chunk namespaces, the mapping from files to chunks, and the location of each chunk's replicas.

39 Mutation = Update
A mutation is an operation that changes the contents or the metadata of a chunk.
Content mutations:
- Are performed on all of a chunk's replicas.
- The master grants a chunk lease to one of the replicas, the primary.
- The primary picks a serial order for all mutations to the chunk.
Lease:
- Granted by the master, typically for 60 seconds.
- The primary may request extensions.
- If the master loses communication with a primary, it can safely grant a new lease to another replica after the current lease expires.
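A sketch of the master's lease bookkeeping for one chunk (the field names and the use of epoch seconds are illustrative; only the 60-second term comes from the slide):

    #include <cstdint>

    struct Lease {
        int primary_replica;      // replica that serializes mutations
        std::int64_t expires_at;  // lease expiry, in epoch seconds
    };

    // After expiry the old primary has stopped accepting mutations, so
    // the master may safely grant a new lease to another replica.
    bool may_grant_new_lease(const Lease& lease, std::int64_t now) {
        return now >= lease.expires_at;
    }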

40 Updates

