Nectar: Efficient Management of Computation and Data in Data Centers. Lenin Ravindranath, Pradeep Kumar Gunda, Chandu Thekkath, Yuan Yu, Li Zhuang.

Presentation transcript:


Motivation
Resources are poorly managed in a data center
Computation
– Redundant computations waste resources
Storage
– Manually managed
– Unused files occupy space
– Redundant output files

Goal
Efficiently manage resources in a cluster
[Diagram: Nectar manages both Computation and Storage.]

Key Insight
A single query interface for computation and data access
[Diagram: the User reaches the Data Center's Computation and Storage through the DryadLINQ Query Interface.]


Computation
PROBLEM: Redundant computation
– Programs share sub queries, e.g. X.Select(…) and X.Select(…).Where(…)
– Programs share partial data sets, e.g. X.Select(…) and (X+X’).Select(…)
SOLUTION: Caching
– Cache results of popular sub queries
– Automatically rewrite the user query to use the cache
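The rewrite idea on this slide can be illustrated with a toy Python sketch. It is not Nectar's implementation: the `run` function, the use of operator names as cache keys, and the in-memory dictionary are all assumptions made for illustration (Nectar works on DryadLINQ queries and, per the later slides, keys its cache by fingerprints). The sketch reuses the longest cached prefix of an operator chain and executes only the remaining operators.

```python
from typing import Callable, Dict, List, Tuple

Op = Tuple[str, Callable[[list], list]]   # (operator name, function over a list)
Cache = Dict[Tuple[str, ...], list]        # (dataset, op names...) -> cached result


def run(dataset: str, data: list, ops: List[Op], cache: Cache) -> Tuple[list, int]:
    """Run ops over data, reusing the longest cached prefix of the chain.

    Returns (result, number of operators actually executed).
    Keying by operator name is a simplification of this sketch.
    """
    names = [name for name, _ in ops]
    start, result = 0, data
    # Look for the longest prefix of the operator chain already in the cache.
    for k in range(len(ops), 0, -1):
        key = (dataset, *names[:k])
        if key in cache:
            start, result = k, cache[key]
            break
    executed = 0
    # Execute only the operators that the cache did not cover.
    for k in range(start, len(ops)):
        result = ops[k][1](result)
        cache[(dataset, *names[: k + 1])] = result
        executed += 1
    return result, executed
```

With this sketch, running X.Select(…) first and X.Select(…).Where(…) afterwards executes the Select only once; the second call starts from the cached Select result.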

Does caching help?
– Analyzed logs from production clusters
– Logs of 3 months (Oct – Dec 2008)
– 33 virtual clusters, jobs
– Parsed SCOPE programs, extracted sub queries
– Simulated caching

Caching helps
– About 50% cache hit on 10 clusters
– More than 30% cache hit on 20 clusters
– 35% on average


Storage
PROBLEM: Manually managed
– Unused files occupy space
– 50% of the data was never accessed in the last 275 days

Storage
SOLUTION: Automatically manage data
– Track usage and delete infrequently used files
– Store the programs that recompute the data

Query Interface
[Diagram: User, DryadLINQ Query Interface, and the Data Center's Computation and Storage.]


[Diagram: User, DryadLINQ Query Interface, Nectar, and the Data Center's Computation and Storage.]

Nectar Architecture
[Diagram. Components: Nectar Client, Query Rewriter, Cache Server, DryadLINQ, Dryad, Cluster. Flow labels: DryadLINQ program, programs P and P’, query for cache entries, add R to cache, add T to cache.]

Nectar Architecture
– Query Rewriter
– Nectar Client
– Cache Server

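Query Rewriter
[Diagram: Select over X produces R, which is cached. For Select over X+X’, Select runs only on X’ to produce R’, and the result is Concat (R+R’) using the cached R.]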

Query Rewriter
[Diagram: Order by. The cached result R for X is combined with the result R’ for X’ by Merge Sort (R+R’).]
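The two rewrites in these diagrams can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are invented here, inputs are plain in-memory lists rather than partitioned tables, and the cached Order by result is assumed to be already sorted.

```python
import heapq
from typing import Callable, List, Optional


def select_incremental(cached_r: list, x_new: list, f: Callable) -> list:
    """Select over X+X' given cached R = Select(X): Concat(R, Select(X'))."""
    return cached_r + [f(x) for x in x_new]


def order_by_incremental(cached_sorted_r: list, x_new: list,
                         key: Optional[Callable] = None) -> List:
    """Order by over X+X' given cached sorted R: sort only X', then merge.

    heapq.merge performs the merge step of a merge sort over the two
    already-sorted inputs, so the cached data is never re-sorted.
    """
    return list(heapq.merge(cached_sorted_r, sorted(x_new, key=key), key=key))
```

In both cases only the new data X’ is processed from scratch; the work already done for X is taken from the cache.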

Query Rewriter
Generates multiple plans
– Using multiple cache entries
Selects the best plan
– Based on benefit: execution time, output size, whether the pipeline is broken
Operators supported
– Select, Where, Order by, Group by, Join
Example queries: X.Select(…), X.Select(…).Where(…)
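The slide names the inputs to plan selection (execution time, output size, whether the pipeline is broken) but not how they are combined. The Python sketch below shows one possible way to score candidate plans. The `Plan` fields, the scoring formula, the read bandwidth, and the pipeline penalty are all hypothetical choices for illustration, not Nectar's actual cost model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    name: str
    saved_seconds: float    # execution time avoided by using cache entries
    cached_bytes: int       # size of cached output that must be read back
    breaks_pipeline: bool   # True if using the cache splits a pipelined stage


def benefit(plan: Plan, bytes_per_second: float = 100e6,
            pipeline_penalty_seconds: float = 5.0) -> float:
    """Hypothetical score in seconds: time saved, minus the cost of reading
    the cached data, minus a fixed penalty when the pipeline is broken."""
    cost = plan.cached_bytes / bytes_per_second
    if plan.breaks_pipeline:
        cost += pipeline_penalty_seconds
    return plan.saved_seconds - cost


def best_plan(plans: List[Plan]) -> Plan:
    """Pick the candidate plan with the highest benefit."""
    return max(plans, key=benefit)
```

Including a "no cache" plan with zero savings among the candidates means a cache entry is used only when it scores better than recomputing.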


Cache Server
Components: SQL Server, Cache Policy, Garbage Collector
Stored per cache entry: URI, Query Fingerprint, Query + Data Fingerprint, Execution Time, Output Size
Other labels in the diagram: Inquire Stats, Usage Stats, Fingerprints

Cache policy
Insertion policy
– Always add program output to the cache
– Sub query outputs are added to the cache when:
  popularity exceeds a threshold
  savings exceed a threshold
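A minimal Python sketch of this insertion policy follows. The function name, the way popularity is measured (a lookup count), the unit of savings (seconds), and the default thresholds are assumptions. The slide also does not say whether both conditions must hold; the sketch assumes they must.

```python
def should_cache(is_program_output: bool, lookups: int, saved_seconds: float,
                 min_lookups: int = 3, min_saved_seconds: float = 60.0) -> bool:
    """Decide whether a result is inserted into the cache.

    Final program outputs are always cached. A sub query output is cached
    only when it is both popular and worth the space (assumed: both
    thresholds must be exceeded; threshold values are placeholders).
    """
    if is_program_output:
        return True
    return lookups > min_lookups and saved_seconds > min_saved_seconds
```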

Garbage Collector
Storage pressure
– Delete infrequently used files
Deletion policy
– Based on savings
– Cache type
Mark and sweep algorithm
– Delete cache entry
– Reachability analysis
– Delete files
[Diagram: the garbage collector acts on (1) the Cache Server and (2) the Distributed FS.]
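The mark and sweep steps can be sketched in Python as follows. This is a simplified illustration, not the Nectar garbage collector: the cache is modeled as a dictionary from fingerprint to file name, the file system as a set of file names, and the deletion policy as a caller-supplied predicate `should_evict` (hypothetical; the slide says only that the policy is based on savings and cache type).

```python
from typing import Callable, Dict, Set


def collect(cache: Dict[str, str], files: Set[str],
            should_evict: Callable[[str], bool]) -> Set[str]:
    """Two-step collection; mutates cache and files, returns deleted files.

    cache maps a fingerprint to the file holding the cached result.
    files is the set of files present in the (simulated) distributed FS.
    """
    # Step 1: delete the cache entries selected by the deletion policy.
    for fingerprint in [fp for fp in cache if should_evict(fp)]:
        del cache[fingerprint]
    # Step 2: mark every file still referenced by a cache entry, then
    # sweep (delete) the files that are no longer reachable.
    reachable = set(cache.values())
    garbage = files - reachable
    files.difference_update(garbage)
    return garbage
```

Deleting the cache entry first means a file becomes garbage only once nothing in the cache refers to it, so the sweep never removes a file that a live entry still points to.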

What if I try to access a garbage collected file?

Nectar Architecture
– Query Rewriter
– Nectar Client
– Cache Server
– Program Store

Program Store
– Stores the programs executed in the cluster
– Each output file is tied to the program that generates it
– If a file has been deleted and is needed again, the program is re-executed to regenerate the output
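The regenerate-on-demand behaviour can be sketched in Python. The `ProgramStore` class below, its method names, and the use of in-memory dictionaries in place of a distributed file system are assumptions made for illustration only.

```python
from typing import Callable, Dict


class ProgramStore:
    """Toy model: every output file remembers the program that produced it."""

    def __init__(self) -> None:
        self.programs: Dict[str, Callable[[], list]] = {}
        self.files: Dict[str, list] = {}

    def run_and_store(self, path: str, program: Callable[[], list]) -> None:
        """Execute the program and tie its output file to it."""
        self.programs[path] = program
        self.files[path] = program()

    def delete(self, path: str) -> None:
        """Simulate garbage collection of the output; the program is kept."""
        self.files.pop(path, None)

    def read(self, path: str) -> list:
        """Return the file, re-running its program if the file was deleted."""
        if path not in self.files:
            self.files[path] = self.programs[path]()
        return self.files[path]
```

Reading a deleted file is therefore slower but still succeeds, which is what makes it safe to delete infrequently used files.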

Managing Data
[Diagram, writing: program P calls ToPartitionedTable (lenin\foo.pt). Labels: Nectar Client, P’, DryadLINQ, Dryad, Program Store (FP, Program), Cache Server (FP), Distributed FS (foo.pt, usrNectar, A31E4.pt).]

Managing Data
[Diagram, reading: program P calls FromPartitionedTable (lenin\foo.pt). Labels: Nectar Client, DryadLINQ, Dryad, Program Store (FP, Program), Cache Server (FP), Distributed FS (foo.pt, usrNectar, A31E4.pt).]

Managing Data
[Diagram, reading, continued: same labels as the previous slide, plus a Program that produces KJ1LM.pt.]

Goal
Efficiently manage resources in a cluster
[Diagram: Nectar manages Computation and Storage together: unified computation and data.]

Distributed cache servers
– Multiple cache servers, each with its own SQL Server
– Partitioned by query fingerprint
– The Nectar Client selects a server with a hash based on the query fingerprint
– Centralized garbage collector
– Program store
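Partitioning by a hash of the query fingerprint can be sketched in Python as below. The function name, the server names in the usage note, the choice of SHA-256, and the simple modulo mapping are assumptions; the slide states only that the hash is based on the query fingerprint.

```python
import hashlib
from typing import List


def server_for(query_fingerprint: str, servers: List[str]) -> str:
    """Map a query fingerprint to one cache server, deterministically.

    A stable hash is used (rather than Python's built-in hash(), which is
    randomized per process for strings) so every client picks the same server.
    """
    digest = hashlib.sha256(query_fingerprint.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

For example, `server_for("A31E4", ["cache-0", "cache-1", "cache-2"])` always returns the same one of the three (hypothetical) server names, so lookups and insertions for a given query reach the same partition.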

Summary
We built Nectar
– Automatically manages data
– Efficiently manages computation
Components
– Query Rewriter: automatically rewrites queries to use the cache
– Cache server: popular sub queries are cached; entries are garbage collected based on usage
– Program store: stores the programs that regenerate the output

Status
Almost done with development
– Query Rewriter: including other operators
– Fingerprinter: program static analysis
– Cache Server
– Program Store
In the process of deploying

Can we do better?

Cluster Utilization
– Most clusters have more than 40% idle time
– Even the busiest clusters have 10-20% idle time

Exploiting idle time
Do speculative caching
– Cache popular data before the query is issued
– Run programs on new streams when they become available
No side effects
– Executed only when the cluster is idle
– Low priority jobs
– Output garbage collected with high priority
– More electric bill? Not really!

Questions

Backup

Caching Results