Workload-Aware Data Partitioning in Community-Driven Data Grids


Workload-Aware Data Partitioning in Community-Driven Data Grids
Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper
Department of Computer Science, Technische Universität München, Germany

Should I Split or Replicate?
- Many challenges and opportunities for database research in e-science
- High-throughput data management
- Correlation of distributed data sources
- Community-driven data grids
- Dealing with data skew and query hot spots
- Workload awareness by employing a cost model during partitioning

Query Load Balancing via Partitioning
[Figure slides: animation of balancing query load by splitting a hot region across nodes; the final frame marks with an X a case where partitioning alone does not help.]

Query Load Balancing via Replication
[Figure slides: animation of balancing query load by replicating a hot region across nodes.]

The AstroGrid-D Project
- German Astronomy Community Grid: http://www.gac-grid.org/
- Funded by the German Ministry of Education and Research
- Part of D-Grid

Up-Coming Data-Intensive Applications
- Alex Szalay, Jim Gray (Nature, 2006): “Science in an exponential world”
- Data rates: terabytes per day/night, petabytes per year
- Projects: LHC, LSST, LOFAR, Pan-STARRS

The Multiwavelength Milky Way
[Figure slide: http://adc.gsfc.nasa.gov/mw/]

Research Challenges
- Deal directly with terabyte/petabyte-scale data sets
- Integrate with existing community infrastructures
- Provide high throughput for growing user communities

Current Sharing in Data Grids
- Data autonomy: policies allow partners to access data
- Each institution ensures availability (replication) and scalability
- Various organizational structures [Venugopal et al. 2006]: centralized, hierarchical, federated, hybrid

Community-Driven Data Grids (HiSbase)
[Figure slide.]

“Distribute by Region – not by Archive!”
[Figure slides: animation showing archive data being redistributed across nodes by sky region rather than by source archive.]

Mapping Data to Nodes
[Figure slide: regions of the partitioning scheme are assigned to grid nodes.]
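
The mapping itself is only shown as a figure. As a rough, hypothetical illustration of how spatial data can be linearized and assigned to nodes, here is a Z-order (space-filling curve) sketch; the grid resolution, coordinate normalization, and range-based node assignment are assumptions for illustration, not the HiSbase implementation:

```python
# Sketch: linearize 2D sky coordinates with a Z-order (Morton) key and
# assign contiguous key ranges to nodes. All parameters are illustrative.

GRID_BITS = 16  # resolution of the integer grid per dimension (assumption)

def morton_key(x: int, y: int, bits: int = GRID_BITS) -> int:
    """Interleave the bits of x and y into a Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions: y
    return key

def node_for_point(ra: float, dec: float, num_nodes: int) -> int:
    """Map (ra, dec) to a node id by splitting the key space into ranges."""
    max_cell = (1 << GRID_BITS) - 1
    x = min(int((ra % 360.0) / 360.0 * (1 << GRID_BITS)), max_cell)
    y = min(int((dec + 90.0) / 180.0 * (1 << GRID_BITS)), max_cell)
    key_space = 1 << (2 * GRID_BITS)
    return morton_key(x, y) * num_nodes // key_space

print(node_for_point(ra=180.0, dec=0.0, num_nodes=8))  # -> 6
```

Because the Z-order curve keeps nearby points on nearby keys, a contiguous key range mostly corresponds to a contiguous sky region, which is what makes region-based distribution possible.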

Workload-Aware Training Phase
- Incorporate query traces during the training phase
- Base the partitioning scheme on both data load and query load
- Challenges:
  - Balance query load without losing data load balancing
  - Approximate the real query hot spots from a query sample
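
A minimal sketch of such a training phase, assuming quadtree-style splitting driven by a pluggable region weight function; the region layout and stop conditions are assumptions, not the paper's algorithm:

```python
# Regions are tuples (x1, y1, x2, y2, points, queries); points are (x, y)
# pairs, queries are rectangles (qx1, qy1, qx2, qy2).

def split4(region):
    """Split a region into its four quadrants, partitioning its load."""
    x1, y1, x2, y2, points, queries = region
    mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    quadrants = [(x1, y1, mx, my), (mx, y1, x2, my),
                 (x1, my, mx, y2), (mx, my, x2, y2)]
    children = []
    for (a, b, c, d) in quadrants:
        pts = [p for p in points if a <= p[0] < c and b <= p[1] < d]
        # A query is charged to every quadrant its rectangle intersects.
        qs = [q for q in queries
              if not (q[2] <= a or q[0] >= c or q[3] <= b or q[1] >= d)]
        children.append((a, b, c, d, pts, qs))
    return children

def train(root, weight, budget, min_extent=1e-6):
    """Split the heaviest regions until every region fits the weight budget."""
    todo, done = [root], []
    while todo:
        r = todo.pop()
        if weight(r) <= budget or (r[2] - r[0]) <= min_extent:
            done.append(r)
        else:
            todo.extend(split4(r))
    return done
```

With a weight function that accounts for queries (such as the heat function sketched below), heavily queried areas end up covered by many small regions, which is exactly what balances query load.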

Dealing with Query Hot Spots
- Query skew is triggered by increased interest in particular subsets of the data
- Two well-known query load balancing techniques: data partitioning and data replication
- Goal: find trade-offs between the two

When to Split (Partition) or to Replicate
Considers partition characteristics:
- Amount of data (few/many data points)
- Number of queries (few/many)
- Extent of regions and queries (small/big queries)

Data points \ Queries | Few | Many, small | Many, big
----------------------+-----+-------------+-----------
Few                   | ─   | SPLIT       | REPLICATE
Many                  |     |             |
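
The captured row of the decision table can be written as a rule. The following is a hedged sketch with made-up thresholds, not the paper's cost model:

```python
from typing import Optional

def decide(num_points: int, query_areas: list, region_area: float,
           many_points: int = 10_000, many_queries: int = 100,
           big_fraction: float = 0.5) -> Optional[str]:
    """Return 'split', 'replicate', or None for one histogram region."""
    if num_points < many_points and len(query_areas) < many_queries:
        return None  # neither a data hot spot nor a query hot spot
    # A query counts as "big" when it covers a large share of the region;
    # splitting cannot spread such queries, so replication is preferred.
    big_queries = sum(1 for a in query_areas if a / region_area >= big_fraction)
    if big_queries > len(query_areas) / 2:
        return "replicate"
    return "split"  # mostly small queries (or plain data skew): split

print(decide(50_000, [0.8, 0.9, 0.7], region_area=1.0))  # -> replicate
```

The intuition: a small query touches only part of a region, so splitting lets different nodes serve different queries; a big query touches every fragment a split would produce, so only replicas can share its load.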

Region Weight Functions
- Data only: #objects in a region
- Queries only: #queries in a region
- Scaled queries: approximate the “real” extent of a hot spot and avoid overfitting to the training query set
- Heat of a region: #objects × #queries
- Extents of regions and queries: replicate when there are many big queries
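
Minimal sketches of the first weight functions named above, using the (x1, y1, x2, y2, points, queries) region tuples from the training sketch; the floor of 1 in the heat function is an assumption to keep pure data skew visible:

```python
def weight_data(region):
    """Data only: number of objects in the region."""
    return len(region[4])

def weight_queries(region):
    """Queries only: number of training queries touching the region."""
    return len(region[5])

def weight_heat(region):
    """Heat: #objects * #queries, coupling data and query load so that
    regions holding many objects AND attracting many queries weigh most.
    max(..., 1) keeps regions the training queries miss splittable on
    data volume alone (an assumption, see the data-balancing challenge)."""
    return len(region[4]) * max(len(region[5]), 1)
```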

Evaluation
- Weight functions: data, heat, extent
- Data sets: observational (Pobs), simulation
- Workloads: SDSS query log, synthetic
- Partitioning scheme properties: load distribution, communication overhead
- Throughput measurements: distributed setup, FreePastry simulator

Load Distribution
- Uniform data set from the Millennium simulation
- Workload with an extreme hot spot
- In the following: 1024 partitions
- Heat of a region (#data × #queries), normalized across all partitioning schemes
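
The slide does not spell out the normalization; one plausible reading, stated as a formula (an assumption, not taken from the paper), divides each region's heat by the total heat so that schemes with different partitionings become comparable:

$$\widehat{\mathit{heat}}(r) \;=\; \frac{|O_r| \cdot |Q_r|}{\sum_{r' \in P} |O_{r'}| \cdot |Q_{r'}|}$$

where $O_r$ is the set of objects in region $r$, $Q_r$ the set of queries touching $r$, and $P$ the set of all regions of the scheme.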

Query-unaware Training
[Chart slide: load distribution.]

Training with Scaled Queries (scaled 50×)
[Chart slide: load distribution.]

Training with Scaled Queries (scaled 400×)
[Chart slide: load distribution.]

Heat-based, Extent-based Training
[Chart slide: load distribution.]

Communication Overhead for Pobs
[Chart slide.]

Throughput for Pobs
[Chart slide.]

Load Balancing During Runtime
- Complement workload-aware partitioning with runtime load balancing
- Short-term peaks: master-slave approach with load monitoring
- Long-term trends: histogram evolution, based on load monitoring
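
A speculative sketch of the monitoring side, separating short-term peaks (which would trigger master-slave delegation) from long-term trends (which would feed histogram evolution); window sizes, thresholds, and the class itself are assumptions:

```python
from collections import deque

class LoadMonitor:
    """Track per-node query load over two horizons (illustrative only)."""

    def __init__(self, short_window=10, long_window=1000, peak_factor=3.0):
        self.short = deque(maxlen=short_window)  # recent load samples
        self.long = deque(maxlen=long_window)    # long-horizon samples
        self.peak_factor = peak_factor

    def record(self, queries_per_sec: float) -> None:
        self.short.append(queries_per_sec)
        self.long.append(queries_per_sec)

    def short_term_peak(self) -> bool:
        """True when recent load spikes well above the long-term average,
        suggesting temporary delegation to slave replicas rather than a
        change to the partitioning scheme."""
        if not self.long:
            return False
        long_avg = sum(self.long) / len(self.long)
        short_avg = sum(self.short) / len(self.short)
        return short_avg > self.peak_factor * max(long_avg, 1e-9)
```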

HiSbase Related Work
- On-line load balancing
- Hundreds of thousands to millions of nodes
- Reacting fast
- Treating objects individually
[Figure slide contrasting these approaches with HiSbase.]

Should I Split or Replicate?
- Many challenges and opportunities for database research in e-science
- High-throughput data management
- Correlation of distributed data sources
- Community-driven data grids
- Dealing with data skew and query hot spots
- Workload awareness by employing a cost model during partitioning

Thank You for Your Attention
Get in touch:
- Database systems group, TU München
- Web site: http://www-db.in.tum.de
- E-mail: scholl@in.tum.de
- The HiSbase project: http://www-db.in.tum.de/research/projects/hisbase/

Queries Intersecting Multiple Regions
[Chart slide.]

Regions Without Queries
[Chart slide.]

Throughput for Pobs (300 nodes, simulation)
[Chart slide.]

Throughput for Pobs (1000 nodes, simulation)
[Chart slide.]

Throughput (Region-Uniform Queries)
[Chart slide.]