Frontiers in Massive Data Analysis Chapter 3

• It is difficult to combine data from multiple sources
• Each organization develops its own way of representing the data
• Organizations are co-developing shared metadata structures

• Instead of developing a complicated metadata structure, different organizations share their data through a basic set of operations
• More complex tools are developed as they are needed

• Data created by mining confidential data must meet legal and corporate privacy requirements
• Private data must also be protected from malicious users

• Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving toward more processors instead of faster processors
• I/O performance has to increase to support multiple cores simultaneously

• Hardware elements that can perform specialized tasks quickly
• GPUs are often used for fast floating-point computation, but are limited by I/O bottlenecks and limited software tools

• CPUs have become more parallel by increasing both the number of cores per socket and the number of operations executed per clock cycle
• More cores running at a slower clock speed offer better performance and power efficiency

• The DSMS (data stream management system) runs queries on input streams, typically in real time
• The feeds are analyzed and summarized continuously

• Can use a structured query language similar to SQL, with windowing to limit how much data is analyzed
• Can also use a “boxes-and-arrows” system that provides a graphical interface: the user selects which task executes in each box and connects the boxes with arrows to define how data is analyzed
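The windowing idea can be sketched in a few lines of Python. This is only an illustration, not how any particular DSMS is implemented: the function name `sliding_window_avg`, the window size, and the sample readings are all made up for the example. It keeps only the most recent readings in memory and emits a running summary as each new value arrives.

```python
from collections import deque


def sliding_window_avg(stream, window_size):
    """Yield the average of the most recent `window_size` values (hypothetical helper)."""
    window = deque(maxlen=window_size)  # older values fall out automatically
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # the oldest value is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# Made-up sensor readings; a real feed would be unbounded.
readings = [10, 20, 30, 40, 50]
averages = list(sliding_window_avg(readings, window_size=3))
print(averages)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

Because the window is bounded, memory use stays constant no matter how long the stream runs, which is the point of limiting how much data a query sees.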

• A clustered system consists of multiple high-performance nodes that execute submitted jobs
• Think of the HPC systems on campus
• A job manager controls load balancing and queue management
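As a rough sketch of what a job manager does, the Python below takes jobs from a first-in, first-out queue and hands each one to the node with the least work so far. The policy, the function name `assign_jobs`, the node names, and the job costs are assumptions made for illustration; real schedulers use far richer policies.

```python
import heapq


def assign_jobs(job_costs, node_names):
    """Assign queued jobs, in order, to the currently least-loaded node (illustrative policy)."""
    heap = [(0, name) for name in node_names]  # (current load, node name)
    heapq.heapify(heap)
    placement = {name: [] for name in node_names}
    for job_id, cost in enumerate(job_costs):  # jobs leave the queue in FIFO order
        load, name = heapq.heappop(heap)       # node with the smallest load
        placement[name].append(job_id)
        heapq.heappush(heap, (load + cost, name))
    return placement


# Hypothetical queue of four jobs with estimated costs, and two nodes.
placement = assign_jobs([5, 3, 2, 4], ["node-a", "node-b"])
print(placement)  # {'node-a': [0, 3], 'node-b': [1, 2]}
```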

• Provides access to distributed file systems stored on different servers
• The user is presented with a standard file system that hides the underlying distributed systems

• POSIX-compliant systems provide the same interface that a standalone file system would provide
• This makes it simple to convert programs to use clustered resources

• Metadata is managed separately by dedicated servers, which forward client requests to the correct file server
• Distributed systems run into synchronization issues as the cluster grows large

• These systems were designed to solve the issues that POSIX systems encounter in large clusters
• Metadata is still handled by dedicated servers

• Designed to handle distributed analysis tasks
• Uses a large block size (64 MB) to minimize metadata requests by clients
• Clients are expected to handle inconsistencies in the file system by comparing checksums
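Two of these points lend themselves to a small worked example in Python. The first function counts how many blocks, and so roughly how many metadata entries, a file needs at a given block size; the 4 KB comparison size is an assumption chosen for contrast. The second shows a client-side checksum comparison; CRC-32 is used only because it ships with Python, and the slides do not say which checksum the file system uses.

```python
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted above


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold `file_size` bytes (ceiling division)."""
    return -(-file_size // block_size)


def verify_block(data, expected_checksum):
    """Client-side check: does the data match the checksum stored with it?"""
    return zlib.crc32(data) == expected_checksum


one_tib = 1024 ** 4
print(block_count(one_tib))            # 16384 blocks at 64 MB
print(block_count(one_tib, 4 * 1024))  # 268435456 blocks at 4 KB

stored = zlib.crc32(b"record")          # checksum written with the block
print(verify_block(b"record", stored))  # True
print(verify_block(b"recorx", stored))  # False: the copy is corrupted
```

The four orders of magnitude between the two block counts is why a large block size keeps the metadata servers out of the clients' way.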

• A map step runs on a collection of nodes, each working on its own partition of the data; a shuffle step then hashes the output so that records with a common key are passed to the same node
• Simplifies analysis on distributed data
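The map, shuffle, and reduce steps can be imitated on a single machine with a word count, as in the Python sketch below. It illustrates the pattern only and is not the API of any real framework: the function names, the two-reducer setup, and the input chunks are invented for the example. CRC-32 stands in for the hash so that the partitioning is the same on every run.

```python
import zlib
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one partition of the input."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs, num_reducers):
    """Shuffle: hash each key so that all pairs with the same key reach the same reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        idx = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[idx][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


chunks = ["a b a", "b c"]  # stand-ins for partitions held on different nodes
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
buckets = shuffle(mapped, num_reducers=2)

counts = {}
for bucket in buckets:
    counts.update(reduce_phase(bucket))
print(sorted(counts.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```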

• Resources in a multi-tenant cluster are allocated dynamically as a user’s needs change
• Allows users to access large systems without the overhead of maintaining a large cluster

• Databases reliably store and retrieve data and support queries over the data sets
• Large parallel databases are spread over servers without a cluster file system managing the nodes

• Data can be partitioned by spreading it evenly among the nodes or by hashing some of the fields
• Each node evaluates queries on its local partition, and the per-node results are then combined
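Both partitioning schemes, and the "evaluate locally, then combine" query pattern, fit in a short Python sketch. The table, the column names, the three-node layout, and the helper names are all hypothetical, and CRC-32 is just a convenient deterministic hash.

```python
import zlib


def partition_round_robin(rows, num_nodes):
    """Spread rows evenly: row i goes to node i mod num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def partition_by_hash(rows, key, num_nodes):
    """Place each row on the node chosen by a hash of one of its fields."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_nodes
        nodes[idx].append(row)
    return nodes


def local_sum(partition, column):
    """The piece of the query that each node runs on its own partition."""
    return sum(row[column] for row in partition)


# Made-up table of orders.
rows = [
    {"customer": "ann", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "ann", "amount": 7},
    {"customer": "cat", "amount": 3},
]

hashed = partition_by_hash(rows, "customer", num_nodes=3)
spread = partition_round_robin(rows, num_nodes=3)

# Combine the per-node partial results into the final answer.
total = sum(local_sum(p, "amount") for p in hashed)
print(total)  # 25
```

Hash partitioning keeps all rows for one customer on one node, which helps per-customer queries; round-robin gives the most even spread but scatters related rows.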

• If certain tables are frequently joined in queries, store them on the same node
• When joining tables from different nodes, transfer the smaller of the two
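The "transfer the smaller table" rule can be sketched as a hash join in Python. The function below is a simplified stand-in for what a parallel database's planner does, not its real interface: the name `join_shipping_smaller`, the tables, and the row-count cost measure are assumptions made for the example.

```python
def join_shipping_smaller(left, right, key):
    """Join two tables held on different nodes, moving only the smaller one.

    Returns the joined rows and the number of rows that had to be transferred.
    """
    small, large = (left, right) if len(left) <= len(right) else (right, left)

    # Build a lookup on the smaller table, as if it had been shipped to the other node.
    index = {}
    for row in small:
        index.setdefault(row[key], []).append(row)

    # Probe the lookup with the larger table, which never leaves its node.
    joined = []
    for row in large:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined, len(small)


# Hypothetical tables stored on two different nodes.
orders = [
    {"cust_id": 1, "item": "disk"},
    {"cust_id": 2, "item": "cpu"},
    {"cust_id": 1, "item": "ram"},
]
customers = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
]

joined, rows_shipped = join_shipping_smaller(orders, customers, "cust_id")
print(rows_shipped)  # 2: the customers table is the one that moves
print(len(joined))   # 3
```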

• Parallel databases are very difficult to tune and populate with data
• Parallel programs are very difficult to develop and debug
