Frontiers in Massive Data Analysis Chapter 3


1 Frontiers in Massive Data Analysis Chapter 3

2  Combining data from multiple sources is difficult  Each organization develops its own way of representing the data  Organizations are co-developing shared metadata structures

3  Instead of developing a complicated metadata structure, different organizations share their data with a basic set of operations  More complex tools are developed as they are needed

4  Data produced by mining confidential data must meet legal and corporate privacy requirements  Private data also has to be protected from malicious users

5  Raw processing speed is no longer increasing as quickly as it once did, so manufacturers are moving towards more processors instead of faster processors  I/O performance has to increase to keep multiple cores supplied with data simultaneously

6  Some hardware elements can perform specialized tasks quickly  GPUs are often used for fast floating-point computation, but are held back by I/O bottlenecks and a limited set of software tools

7  CPUs have become more parallel by increasing both the number of cores per socket and the number of operations executed per clock cycle  For parallel workloads, more cores running at a lower clock speed can give better performance and power efficiency than fewer, faster cores

8  A data stream management system (DSMS) runs queries on (typically real-time) input streams  The feeds are analyzed and summarized continuously

9  A DSMS can use a structured query language similar to SQL, with windowing to limit how much data is analyzed at a time  It can also use a “boxes-and-arrows” system that provides a graphical interface: the user selects which task executes in each box and connects the boxes with arrows to define how data flows through the analysis
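The windowing idea can be sketched in plain Python. The function name `sliding_average` and the window size are illustrative assumptions, not the API of any particular DSMS; the sketch only shows that each result is computed from the most recent `window` readings rather than from the whole stream.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` items each time a new item arrives."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # old readings fall out automatically
    total = 0.0
    for value in stream:
        if len(recent) == window:
            total -= recent[0]  # this reading is about to be evicted by append
        recent.append(value)
        total += value
        yield total / len(recent)


readings = [10.0, 20.0, 30.0, 40.0]
averages = list(sliding_average(readings, window=2))
```

Because the function is a generator, it produces one summary value per incoming reading and never needs to hold more than `window` readings in memory.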

10  A clustered system consists of multiple high-performance nodes that execute submitted jobs  Think of the HPC systems on campus  A job manager controls load balancing and queue management
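One simple way a job manager might combine queue management with load balancing is sketched below. The `schedule` function, the cost estimates, and the node names are all hypothetical; real cluster schedulers use far richer policies (priorities, reservations, fair share).

```python
import heapq
from collections import deque


def schedule(jobs, node_names):
    """Assign queued jobs, in submission order, to the least-loaded node.

    `jobs` is a list of (job_name, estimated_cost) pairs; the cost
    estimate is an assumption made for this sketch.
    """
    queue = deque(jobs)  # FIFO queue of submitted jobs
    load = [(0, name) for name in node_names]  # (current load, node)
    heapq.heapify(load)
    placement = {}
    while queue:
        job, cost = queue.popleft()
        current, node = heapq.heappop(load)  # node with the least work so far
        placement[job] = node
        heapq.heappush(load, (current + cost, node))
    return placement


placement = schedule([("a", 5), ("b", 2), ("c", 2), ("d", 1)], ["node1", "node2"])
```

Here the long job `a` occupies `node1`, so the three short jobs all land on `node2`, which keeps the two nodes' total load equal.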

11  Provides access to distributed file systems stored on different servers  The user is presented with a standard file system that hides the underlying distributed systems

12  POSIX-compliant systems provide the same interface that a standalone file system would provide  This makes it simple to convert programs to use clustered resources

13  Metadata is managed separately by dedicated servers which forward client requests to the correct file server  Distributed systems run into synchronization issues as the cluster grows large
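The separation between metadata and file data can be illustrated with a toy model. The classes `MetadataServer` and `FileServer` and the example paths are hypothetical and ignore replication, caching, and locking; the sketch only shows the metadata server looking up the owner of a path and forwarding the request.

```python
class FileServer:
    """Holds file contents for the paths assigned to it."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def read(self, path):
        return self.files[path]


class MetadataServer:
    """Knows which file server owns each path, but stores no file data itself."""

    def __init__(self):
        self.location = {}  # path -> FileServer

    def register(self, path, server, data):
        server.files[path] = data
        self.location[path] = server

    def read(self, path):
        # Forward the client's request to the server that owns the file.
        return self.location[path].read(path)


fs1, fs2 = FileServer("fs1"), FileServer("fs2")
meta = MetadataServer()
meta.register("/data/a.csv", fs1, b"1,2,3")
meta.register("/data/b.csv", fs2, b"4,5,6")
```

Every request passes through the single `location` table, which hints at why keeping metadata synchronized becomes harder as the cluster grows.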

14  These systems were designed to solve the issues that POSIX systems encounter in large clusters  Metadata is still handled by dedicated servers

15  Designed to handle distributed analysis tasks  Uses a large block size (64 MB) to minimize metadata requests by clients  Clients are expected to handle inconsistencies in the file systems by comparing checksums
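Two of these points lend themselves to a small sketch: how a 64 MB block size keeps the number of blocks (and so metadata lookups) low, and how a client might detect a bad block by comparing checksums. The slide does not say which checksum is used; CRC32 via Python's `zlib` is an assumption made here for illustration, as are the function names.

```python
import math
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks (and so metadata entries) needed for a file."""
    return math.ceil(file_size / block_size)


def verify_block(data: bytes, expected_crc: int) -> bool:
    """Client-side check: recompute the checksum and compare with the stored one."""
    return zlib.crc32(data) == expected_crc


block = b"sensor readings 001"
stored_crc = zlib.crc32(block)      # recorded when the block was written
corrupted = b"sensor readings 991"  # same length, two bytes changed

one_gib_blocks = block_count(1024 ** 3)  # blocks needed for a 1 GiB file
```

A 1 GiB file needs only 16 blocks at this size, whereas a 4 KB block size would need 262,144, each a potential metadata request.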

16  Map tasks run on a collection of nodes, each over its own partition of the data; the output is then shuffled by hash so that records with a common key are passed to the same node  Simplifies analysis on distributed data
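The slide does not name the system, but the description matches the map, shuffle, and reduce pattern. The following is a minimal single-process sketch of that pattern as a word count; the function names and the use of CRC32 as the partitioning hash are assumptions, and the "nodes" are just Python lists.

```python
from collections import defaultdict
import zlib


def map_phase(chunk):
    """Emit (key, 1) for every word in one node's chunk of the input."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs, num_nodes):
    """Route each pair to a node chosen by hashing its key."""
    buckets = [defaultdict(list) for _ in range(num_nodes)]
    for key, value in pairs:
        # A stable hash, so the same key always maps to the same node.
        node = zlib.crc32(key.encode()) % num_nodes
        buckets[node][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Combine all values that share a key on one node."""
    return {key: sum(values) for key, values in bucket.items()}


chunks = ["big data big", "data big cluster"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
buckets = shuffle(mapped, num_nodes=2)
counts = {}
for bucket in buckets:
    counts.update(reduce_phase(bucket))
```

Because every occurrence of a key ends up in exactly one bucket, each node can reduce its bucket independently and the per-node results can simply be merged.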

17  Resources in a multi-tenant cluster are dynamically allocated as a user’s needs change  Allows users to gain access to large systems without the overhead associated with maintaining a large cluster

18  Databases reliably store and retrieve data and can provide querying over the data sets  Large parallel databases are spread over servers without a cluster file system managing nodes

19  Data can be partitioned by evenly spreading data among the nodes or spreading the data based on hashes on some of the fields  The nodes evaluate queries on local partitions then combine the results from each node
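Both partitioning strategies, and the "evaluate locally, then combine" step, can be sketched as follows. The function names, the sample rows, and the CRC32 hash are illustrative assumptions; partitions are modeled as in-memory lists rather than separate servers.

```python
import zlib


def round_robin_partition(rows, num_nodes):
    """Spread rows evenly: row i goes to node i mod num_nodes."""
    parts = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        parts[i % num_nodes].append(row)
    return parts


def hash_partition(rows, num_nodes, field):
    """Place rows by a hash of one field, so equal values land on the same node."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        node = zlib.crc32(str(row[field]).encode()) % num_nodes
        parts[node].append(row)
    return parts


def distributed_sum(parts, field):
    """Each node sums its local partition; the partial sums are then combined."""
    partials = [sum(row[field] for row in part) for part in parts]
    return sum(partials)


rows = [{"customer": c, "amount": a}
        for c, a in [("ann", 10), ("bob", 5), ("ann", 7), ("cy", 3), ("bob", 1)]]
rr = round_robin_partition(rows, 3)
hp = hash_partition(rows, 3, "customer")
```

Either layout gives the same total, but only the hash layout guarantees that all rows for one customer sit on a single node, which matters for per-customer queries.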

20  If certain tables are frequently joined together in queries, store them on the same node  When joining tables from different nodes, transfer the smaller of the two
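The two rules above amount to a small planning decision, sketched here under strong simplifying assumptions: each table lives entirely on one node and is described only by a hypothetical `(name, node, row_count)` tuple. A real optimizer would also weigh partitioning, row width, and network cost.

```python
def plan_join(left, right):
    """Decide how to join two tables described by (name, node, row_count).

    Returns the name of the table to move, or None if no transfer is needed.
    """
    left_name, left_node, left_rows = left
    right_name, right_node, right_rows = right
    if left_node == right_node:
        return None  # co-located tables join locally
    # Otherwise ship the smaller table to the node holding the larger one.
    return left_name if left_rows <= right_rows else right_name


orders = ("orders", "node1", 5_000_000)
customers = ("customers", "node2", 20_000)
line_items = ("line_items", "node1", 40_000_000)
```

With these example sizes, joining `orders` to `customers` moves the 20,000-row table, while joining `orders` to `line_items` needs no transfer at all because both were placed on `node1`.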

21  Parallel databases are very difficult to tune and populate with data  Very difficult to develop and debug parallel programs

