April 30th – Scheduling / parallel

CSE 344, April 30th – Scheduling / Parallel

Disk Scheduling Query optimization so far has been hardware independent: good DB design, good estimation. But not all disk I/Os are created equal: sectors close to each other are preferable to read together.

Disk Scheduling Disk I/O behavior: it is very rare for requests to come in one at a time. Requests come in batches, e.g. reading a whole file. How does the hardware process a batch?

Disk Scheduling Suppose sectors are ordered from the outside to the inside of the disk. Given a collection of sectors, how do we read them with the smallest amount of head movement?

Disk Scheduling What are some strategies for processing the following batch? 95, 180, 34, 119, 11, 123, 62, 64 Assume sectors are numbered from 0-199 and that we start at sector 50

Disk Scheduling What are some strategies for processing the following batch? 95, 180, 34, 119, 11, 123, 62, 64 Assume sectors are numbered from 0-199 and that we start at sector 50 Ideas?
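To make the comparisons on the following slides concrete, here is a minimal Python sketch of the running example. The batch and start position come from the slide; the total_movement helper and all other names are illustrative, and the later sketches reuse them.

```python
# The batch of requested sectors and the starting head position (from the slide).
REQUESTS = [95, 180, 34, 119, 11, 123, 62, 64]
START = 50

def total_movement(order, start=START):
    """Total head movement (in tracks) when servicing `order` in sequence."""
    pos, total = start, 0
    for sector in order:
        total += abs(sector - pos)
        pos = sector
    return total
```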

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 FIFO. Naive solution Pros/cons?

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 FIFO. Naive solution Pros/cons? + easy to add new sectors to the queue + almost no computation to maintain - non-optimal, easy to create adversarial batches - doesn't really take advantage of batches (644 tracks of head movement on this batch)
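A sketch of FIFO under the same assumptions, reusing the helper above:

```python
# FIFO services requests exactly in arrival order.
fifo_order = list(REQUESTS)
print(total_movement(fifo_order))  # 644 tracks
```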

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Get closest (Shortest seek time first) Pros/cons?

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Get closest (Shortest seek time first) Pros/cons? + efficient (236 tracks) - costly to maintain - starvation: requests far from the head can wait indefinitely
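A sketch of shortest seek time first, reusing the earlier helper; sstf is an illustrative name:

```python
def sstf(requests, start=START):
    """Repeatedly service the pending request closest to the head."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda s: abs(s - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

print(total_movement(sstf(REQUESTS)))  # 236 tracks
```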

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Sorting: 50,11,34,62,64,95,119,123,180 Pros/cons?

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Sorting: 50, 11, 34, 62, 64, 95, 119, 123, 180 Pros/cons? + fewer track movements (208 tracks) + no starvation - costly to maintain and to add new requests - doesn't account for the start position
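The sorting strategy under the same assumptions:

```python
# Service requests in sorted order, regardless of where the head starts.
sorted_order = sorted(REQUESTS)  # [11, 34, 62, 64, 95, 119, 123, 180]
print(total_movement(sorted_order))  # 208 tracks (first seeking from 50 down to 11)
```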

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 How do we modify the "sorting" algorithm to better take advantage of the start position?

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 How do we modify the "sorting" algorithm to better take advantage of the start position? How does an elevator schedule rides?

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 How do we modify the "sorting" algorithm to better take advantage of the start position? How does an elevator schedule rides? Start in a position, go in one direction until you reach the end, then repeat going the other way.

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Elevator algorithm (SCAN) Pros/cons?

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Elevator algorithm (SCAN) Pros/cons? Notice that the algorithm goes all the way to the ends (0-199)

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Elevator algorithm (SCAN) Pros/cons? + no starvation + efficient (230 tracks) - some maintenance
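A sketch of SCAN sweeping downward first, which is the direction that yields the 230 tracks above; the function name and low_end parameter are illustrative:

```python
def scan_down_first(requests, start=START, low_end=0):
    """Elevator (SCAN): service everything at or below the head in
    descending order, touch the low end of the disk, then service the
    remaining requests in ascending order."""
    down = sorted((s for s in requests if s <= start), reverse=True)
    up = sorted(s for s in requests if s > start)
    return down + [low_end] + up

print(total_movement(scan_down_first(REQUESTS)))  # 230 tracks
```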

Disk Scheduling A weird fact about disks: positioning the arm accurately on a nearby track takes longer than sweeping it across a large number of tracks. Why might this matter?

Disk Scheduling A weird fact about disks: positioning the arm accurately on a nearby track takes longer than sweeping it across a large number of tracks. Why might this matter? Scan in only one direction, then quickly move the arm back to the beginning (quicker than a series of short, accurate seeks): C-SCAN

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 Elevator algorithm (C-SCAN) Pros/cons? + no starvation + efficient (187 tracks + one large movement) - some maintenance ~ still goes all the way from 0 to 199

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 What if we don’t insist on going all the way to the ends? - need “accurate” arm movement + can save some articulation - might delay reads from inner/outer sectors

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 What if we don't insist on going all the way to the ends? (C-LOOK) - need "accurate" arm movement + can save some articulation (157 tracks + one large movement) - might delay reads from inner/outer sectors
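A sketch of C-LOOK under the same assumptions; the serviced movement is counted separately from the single large repositioning jump, matching the figure above:

```python
def c_look(requests, start=START):
    """C-LOOK: sweep down only as far as the lowest pending request,
    then jump to the highest pending request and sweep down again."""
    down = sorted((s for s in requests if s <= start), reverse=True)
    rest = sorted((s for s in requests if s > start), reverse=True)
    return down, rest

down, rest = c_look(REQUESTS)
serviced = total_movement(down) + total_movement(rest, start=rest[0])
print(serviced)  # 157 tracks, plus one large jump from 11 back up to 180
```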

Disk Scheduling 95, 180, 34, 119, 11, 123, 62, 64 C-LOOK (circular look)

Query Evaluation Steps SQL query → Parse & Rewrite Query → Logical plan (RA) → Select Logical Plan → Select Physical Plan (query optimization) → Physical plan → Query Execution → Disk Scheduling → Disk. Analogy with sorting: we have decided the ordering of operations, and now we need to choose their implementation. For instance, there are many ways to implement sort.

Query Evaluation Design stack: Query → DBMS → Hardware. So far: single-machine optimization. Next: hardware scaleup.

Why compute in parallel? Multi-cores: Most processors have multiple cores This trend will likely increase in the future Big data: too large to fit in main memory Distributed query processing on 100x-1000x servers Widely available now using cloud services Recall HW3 and HW6

Performance Metrics for Parallel DBMSs Nodes = processors, computers Speedup: more nodes, same data → higher speed Scaleup: more nodes, more data → same speed
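A numeric illustration of the two metrics; all timings below are made up for the example:

```python
# speedup = time on 1 node / time on P nodes, same data       (ideal: P)
# scaleup = time for 1 unit of data on 1 node /
#           time for P units of data on P nodes               (ideal: 1.0)
one_node = 100.0                     # hypothetical seconds, base workload
p, p_nodes_same_data = 10, 12.5      # hypothetical time on 10 nodes
speedup = one_node / p_nodes_same_data       # 8.0, sub-linear (ideal: 10)

p_nodes_p_data = 110.0               # hypothetical time, 10 nodes, 10x data
scaleup = one_node / p_nodes_p_data          # ~0.91 (ideal: 1.0)
print(speedup, scaleup)
```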

Linear vs. Non-linear Speedup [chart: speedup as a function of # nodes (=P), from ×1 to ×15; the ideal curve is linear]

Linear vs. Non-linear Scaleup [chart: batch scaleup as a function of # nodes (=P) and data size, from ×1 to ×15; the ideal curve is flat]

Why Sub-linear Speedup and Scaleup? Startup cost: the cost of starting an operation on many nodes. Interference: contention for resources between nodes. Skew: the slowest node becomes the bottleneck.

Architectures for Parallel Databases Shared memory Shared disk Shared nothing

Shared Memory Nodes share both RAM and disk. Dozens to hundreds of processors. Example: SQL Server runs on a single machine and can leverage many threads to speed up a query (check your HW3 query plans). Easy to use and program. Expensive to scale: last remaining cash cows in the hardware industry. [diagram: processors P connected through an interconnection network to global shared memory and disks D]

Shared Disk All nodes access the same disks. Found in the largest "single-box" (non-cluster) multiprocessors. Example: Oracle. No need to worry about shared memory. Hard to scale: existing deployments typically have fewer than 10 machines. [diagram: processors P, each with private memory M, connected through an interconnection network to shared disks D]

Shared Nothing Cluster of commodity machines on a high-speed network. Called "clusters" or "blade servers". Each machine has its own memory and disk: lowest contention. Example: Google. Because all machines today have many cores and many disks, shared-nothing systems typically run many "nodes" on a single physical machine. Easy to maintain and scale. Most difficult to administer and tune. We discuss only shared nothing in class. [diagram: machines, each with its own processor P, memory M, and disk D, connected by an interconnection network]

Approaches to Parallel Query Evaluation Inter-query parallelism: one transaction per node; good for transactional workloads. Inter-operator parallelism: one operator per node; good for analytical workloads. Intra-operator parallelism: one operator on multiple nodes; good for both? We study only intra-operator parallelism: most scalable. [diagram: the same plan, Purchase joined with Product on pid=pid and with Customer on cid=cid, replicated across nodes under each approach]

Distributed Query Processing Data is horizontally partitioned on many servers Operators may require data reshuffling First let’s discuss how to distribute data across multiple nodes / servers

Single Node Query Processing (Review) Given relations R(A,B) and S(B,C), no indexes: Selection σ_{A=123}(R): scan file R, select records with A=123. Group-by γ_{A,sum(B)}(R): scan file R, insert into a hash table using A as key; when a new key is equal to an existing one, add B to the value. Join R ⋈ S: scan file S, insert into a hash table using B as key; then scan file R, probing the hash table using B.
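A minimal in-memory sketch of the three plans, assuming relations are plain Python lists of tuples rather than files on disk; the function names are illustrative:

```python
from collections import defaultdict

def select_eq(R, a_val=123):
    """Selection: scan R, keep records with A = a_val."""
    return [(a, b) for (a, b) in R if a == a_val]

def group_by_sum(R):
    """Group-by: scan R, accumulate B in a hash table keyed on A,
    adding B to the value when the key already exists."""
    sums = defaultdict(int)
    for a, b in R:
        sums[a] += b
    return dict(sums)

def hash_join(R, S):
    """Join on B: build a hash table on S.B, then probe it with each
    tuple of R, as the slide describes."""
    table = defaultdict(list)
    for b, c in S:
        table[b].append(c)
    return [(a, b, c) for (a, b) in R for c in table[b]]
```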

Horizontal Data Partitioning [diagram: a table R(K, A, B, …) whose tuples are split into P chunks, one per server 1, 2, …, P] Which tuples go to which server?

Horizontal Data Partitioning Block partition: partition tuples arbitrarily such that size(R_1) ≈ … ≈ size(R_P). Hash partitioned on attribute A: tuple t goes to chunk i, where i = h(t.A) mod P + 1. (Recall: calling hash functions is free in this class.) Range partitioned on attribute A: partition the range of A into -∞ = v_0 < v_1 < … < v_P = ∞; tuple t goes to chunk i if v_{i-1} < t.A ≤ v_i.
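Hedged sketches of the three partitioning schemes; P, h, and the function names are assumptions for illustration:

```python
import bisect

P = 4         # hypothetical number of servers
h = hash      # stand-in for any "good" hash function

def block_partition(tuples):
    """Arbitrary (here round-robin) assignment: chunk sizes differ by
    at most one tuple."""
    return [tuples[i::P] for i in range(P)]

def hash_chunk(t, attr=0):
    """Chunk i = h(t.A) mod P + 1, as on the slide (chunks are 1-indexed)."""
    return h(t[attr]) % P + 1

def range_chunk(t, boundaries, attr=0):
    """boundaries = [v_1, ..., v_{P-1}]; the tuple goes to the chunk i
    whose range (v_{i-1}, v_i] contains t.A."""
    return bisect.bisect_left(boundaries, t[attr]) + 1
```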

Uniform Data vs. Skewed Data Let R(K,A,B,C); which of the following partition methods may result in skewed partitions? Block partition: uniform. Hash-partition on the key K: uniform, assuming a good hash function. Hash-partition on the attribute A: may be skewed; e.g., when all records have the same value of the attribute A, all records end up in the same partition. Range partition: may be skewed. Keep this in mind in the next few slides.
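A tiny self-contained demonstration of the hash-partition skew case, with made-up data:

```python
P = 4  # hypothetical number of servers

# R(K, A) where every record has the same value of A: h(t.A) mod P is
# identical for all tuples, so a single chunk receives everything.
records = [(k, "same") for k in range(1000)]
chunks = [0] * P
for k, a in records:
    chunks[hash(a) % P] += 1
print(chunks)  # one entry is 1000, the rest are 0: maximally skewed
```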