Frontiers in Massive Data Analysis Chapter 3

Presentation transcript:

Frontiers in Massive Data Analysis Chapter 3

- Difficult to include data from multiple sources
- Each organization develops a unique way of representing the data
- Organizations are co-developing shared metadata structures

- Instead of developing a complicated metadata structure, different organizations share their data through a basic set of operations
- More complex tools are developed as they are needed

- Data created by mining confidential data must meet certain legal and corporate privacy requirements
- Private data has to be protected from malicious users as well

- Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving towards more processors instead of faster processors
- I/O performance has to increase to meet the requirements of supporting multiple cores simultaneously

- Hardware elements that can perform specialized tasks quickly
- GPUs are often used for fast floating-point computation, but are limited by I/O bottlenecks and limited software tools

- CPUs have become more parallel by increasing both the number of cores per socket and the number of operations that can be executed per clock cycle
- More cores running at a lower clock speed can deliver better performance and power efficiency than fewer, faster cores

- The data stream management system (DSMS) runs queries on (typically real-time) input streams
- The feeds are analyzed and summarized continuously

- Can use a structured query language similar to SQL that uses windowing to limit how much of the stream is analyzed at once
- Can also use a "boxes-and-arrows" system that provides a graphical interface: the user selects which task executes in each box and connects the boxes with arrows to define how data flows through the analysis
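The windowing idea above can be sketched in a few lines of Python. This is only an illustration, not any particular DSMS query language: the function name `windowed_averages`, the count-based window, and the sample readings are all assumptions made for the example.

```python
from collections import deque

def windowed_averages(stream, window_size):
    """Yield the mean of the most recent `window_size` readings after each arrival."""
    # A bounded deque drops the oldest reading automatically once it is full,
    # so only the current window is ever held in memory.
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Hypothetical sensor feed; a real stream would be unbounded.
readings = [10, 20, 30, 40, 50]
averages = list(windowed_averages(readings, window_size=3))
print(averages)
```

The query is re-evaluated on every arrival, but it only ever looks at the last three readings, which is what keeps a continuous query over an unbounded feed tractable.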

- A clustered system consists of multiple high-performance nodes that execute submitted jobs
- Think of the HPC systems on campus
- A job manager controls load balancing and queue management
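As a rough sketch of what a job manager does, the Python below takes jobs from a first-in, first-out queue and places each one on the node with the least accumulated work. The function `schedule`, the cost estimates, and the node names are hypothetical; real job managers use far richer policies (priorities, reservations, fair share).

```python
import heapq
from collections import deque

def schedule(jobs, node_names):
    """Assign queued jobs in FIFO order to whichever node has the least work so far."""
    queue = deque(jobs)                         # items are (job_name, estimated_cost)
    load = [(0, name) for name in node_names]   # min-heap keyed on accumulated cost
    heapq.heapify(load)
    placement = {}
    while queue:
        job, cost = queue.popleft()             # queue management: oldest job first
        busy, node = heapq.heappop(load)        # load balancing: least-loaded node
        placement[job] = node
        heapq.heappush(load, (busy + cost, node))
    return placement

placement = schedule([("a", 5), ("b", 2), ("c", 1), ("d", 4)], ["node1", "node2"])
print(placement)
```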

- Provides access to distributed file systems stored on different servers
- The user is presented with a standard file system that hides the underlying distributed systems

- POSIX-compliant systems provide the same interface that a standalone file system would provide
- Makes it simple to convert programs to use clustered resources

- Metadata is managed separately by dedicated servers, which forward client requests to the correct file server
- Distributed systems run into synchronization issues as the cluster grows large

- These systems were designed to solve the issues that POSIX systems encounter in large clusters
- Metadata is still handled by dedicated servers

- Designed to handle distributed analysis tasks
- Uses a large block size (64 MB) to minimize metadata requests by clients
- Clients are expected to handle inconsistencies in the file system by comparing checksums
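A small worked example may help show why the large block size matters and what a client-side checksum comparison looks like. The 4 KB comparison size, the function names, and the use of SHA-256 are assumptions for illustration; the slide does not say which checksum such a file system uses.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size given above
SMALL_BLOCK = 4 * 1024          # 4 KB, an assumed conventional block size for comparison

def blocks_needed(file_size, block_size):
    """Number of blocks (and so block-location lookups) needed to cover a file."""
    return -(-file_size // block_size)   # integer ceiling division

one_gib = 1024 ** 3
large_lookups = blocks_needed(one_gib, CHUNK_SIZE)
small_lookups = blocks_needed(one_gib, SMALL_BLOCK)
print(large_lookups, small_lookups)

def checksum(data: bytes) -> str:
    # SHA-256 is a stand-in; any agreed checksum would serve the same role.
    return hashlib.sha256(data).hexdigest()

def is_consistent(data: bytes, expected: str) -> bool:
    """Client-side check: does the data read back match the recorded checksum?"""
    return checksum(data) == expected

recorded = checksum(b"block contents")
print(is_consistent(b"block contents", recorded))
print(is_consistent(b"block c0ntents", recorded))
```

Under these assumptions, reading a 1 GiB file touches 16 blocks at 64 MB but 262,144 blocks at 4 KB, so the metadata servers answer far fewer location requests.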

- A map step runs on a collection of nodes, each processing its own partition of the data; a shuffle step then hashes the output so that records with a common key are passed to the same node
- Simplifies analysis on distributed data
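The map, shuffle, and reduce steps can be imitated on a single machine to show how the data flows. This is a minimal sketch, not the API of any MapReduce framework: the word-count task, the function names, and the use of `zlib.crc32` as the partitioning hash are assumptions for the example.

```python
import zlib
from collections import defaultdict

def map_phase(partition):
    """Emit a (word, 1) pair for every word in one node's partition of the input."""
    for line in partition:
        for word in line.split():
            yield word, 1

def shuffle(pairs, num_reducers):
    """Route each pair by a hash of its key so equal keys land on the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        idx = zlib.crc32(key.encode()) % num_reducers   # deterministic across runs
        buckets[idx][key].append(value)
    return buckets

def reduce_phase(bucket):
    """Combine all values that share a key."""
    return {key: sum(values) for key, values in bucket.items()}

# Two input partitions, as if stored on two different nodes.
partitions = [["big data big"], ["data cluster"]]
mapped = [pair for p in partitions for pair in map_phase(p)]
buckets = shuffle(mapped, num_reducers=2)
results = [reduce_phase(b) for b in buckets]
totals = {k: v for r in results for k, v in r.items()}
print(totals)
```

Because the shuffle sends every occurrence of a key to one reducer, each reducer can finish its counts without consulting the others.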

- Resources in a multi-tenant cluster are dynamically allocated as a user's needs change
- Allows users to gain access to large systems without the overhead associated with maintaining a large cluster

- Databases reliably store and retrieve data and can provide querying over the data sets
- Large parallel databases are spread over servers without a cluster file system managing nodes

- Data can be partitioned by spreading rows evenly among the nodes or by assigning rows to nodes based on a hash of some of the fields
- The nodes evaluate queries on their local partitions, then the results from each node are combined
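Both partitioning schemes, and the pattern of evaluating locally and then combining, can be sketched as follows. The table contents, function names, and the choice of `zlib.crc32` as the hash are assumptions for illustration rather than how any specific parallel database does it.

```python
import zlib

def round_robin_partition(rows, num_nodes):
    """Spread rows evenly: row i goes to node i mod num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes

def hash_partition(rows, num_nodes, key):
    """Place each row by a hash of one field, so equal values share a node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode()) % num_nodes
        nodes[idx].append(row)
    return nodes

def parallel_sum(nodes, column):
    """Each node sums its local partition; the partial sums are then combined."""
    partials = [sum(row[column] for row in node) for node in nodes]
    return sum(partials)

rows = [{"customer": c, "amount": a}
        for c, a in [("ann", 10), ("bob", 20), ("ann", 5), ("cy", 7)]]
rr = round_robin_partition(rows, 2)
hp = hash_partition(rows, 2, "customer")
total_rr = parallel_sum(rr, "amount")
total_hp = parallel_sum(hp, "amount")
print(total_rr, total_hp)
```

Either scheme gives the same query answer. Round-robin guarantees balanced partitions, while hashing keeps all rows for a given customer on one node, which helps queries that group or join on that field.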

- If certain tables are frequently joined together in queries, store them on the same node
- When joining tables from different nodes, transfer the smaller of the two tables
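The "transfer the smaller table" rule can be sketched as below. The function names and sample tables are hypothetical, and row count is used as a simple proxy for the number of bytes that would cross the network.

```python
def hash_join(build_rows, probe_rows, key):
    """Join two row lists on `key` by indexing one side and probing with the other."""
    index = {}
    for row in build_rows:
        index.setdefault(row[key], []).append(row)
    return [{**b, **p} for p in probe_rows for b in index.get(p[key], [])]

def distributed_join(table_a, table_b, key):
    """Ship the smaller table to the node holding the larger one, then join there.

    Returns the joined rows and the number of rows that had to be transferred.
    """
    if len(table_a) <= len(table_b):
        shipped, local = table_a, table_b
    else:
        shipped, local = table_b, table_a
    return hash_join(shipped, local, key), len(shipped)

# Hypothetical tables living on two different nodes.
customers = [{"cid": 1, "name": "ann"}, {"cid": 2, "name": "bob"}]
orders = [{"cid": 1, "total": 10}, {"cid": 1, "total": 5}, {"cid": 2, "total": 7}]
joined, rows_moved = distributed_join(orders, customers, "cid")
print(rows_moved, joined)
```

Here only the two customer rows move, rather than the three order rows. If the two tables were stored on the same node, as the first bullet suggests, nothing would need to move at all.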

- Parallel databases are very difficult to tune and populate with data
- Parallel programs are very difficult to develop and debug