Zhangxi Lin, The Rawls College,

Zhangxi Lin, The Rawls College, 2016-03-31
Data Lake & Data Hub

Data Warehouse
Popular for business intelligence tasks, but increasingly being replaced by less-structured data lakes, which allow more flexibility. The limitation of a data warehouse is that it stores data from various sources in specific static structures and categories, which dictate at the very point of entry what kind of analysis is possible on that data. This was sufficient during the early stages of business intelligence, when analysis was done primarily on proprietary databases and the scope was restricted to canned reports and dashboards with limited, pre-defined interaction paths. The approach has started to fall apart in the world of big data discovery, where it is very difficult to ascertain upfront all the intelligence and insights one could derive from the variety of sources (proprietary databases, files, third-party tools, social media, and the web) that keep cropping up on a regular basis.

Data Lake
A large-scale storage repository and processing engine.
Provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs". The term was coined by James Dixon, Pentaho chief technology officer. Dixon initially used the term to contrast with "data mart", which is a smaller repository of interesting attributes extracted from the raw data. One example of a data lake is a repository built on Apache Hadoop and its distributed file system (HDFS).
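The contrast between a warehouse that fixes its structure at the point of entry and a lake that keeps the raw data can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample events, the column names, and the helper functions are hypothetical and do not correspond to any particular warehouse or Hadoop API.

```python
import json

# Hypothetical raw events from two different sources (illustrative data only).
RAW_EVENTS = [
    '{"source": "crm", "customer": "a1", "amount": 120.0, "channel": "web"}',
    '{"source": "social", "customer": "a1", "text": "great service", "likes": 3}',
]

# Assumed warehouse schema, decided before any data arrives.
WAREHOUSE_COLUMNS = ("customer", "amount")


def load_into_warehouse(raw_lines):
    """Schema on write: keep only records and fields that fit the fixed schema."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(col in record for col in WAREHOUSE_COLUMNS):
            rows.append(tuple(record[col] for col in WAREHOUSE_COLUMNS))
    return rows


def store_in_lake(raw_lines):
    """Lake-style storage: retain every record exactly as it arrived."""
    return list(raw_lines)


def read_from_lake(lake, fields):
    """Schema on read: the consumer chooses the fields at query time."""
    result = []
    for line in lake:
        record = json.loads(line)
        if all(field in record for field in fields):
            result.append({field: record[field] for field in fields})
    return result


warehouse_rows = load_into_warehouse(RAW_EVENTS)
lake = store_in_lake(RAW_EVENTS)
social_view = read_from_lake(lake, ("customer", "likes"))
```

In this sketch the social media record never reaches the warehouse because it does not fit the predefined columns, while the lake can still answer a question about `likes` that nobody anticipated at load time.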

Top Five Differences between Data Lakes and Data Warehouses
Retain all data
Support all data types
Support all users
Adapt easily to changes
Provide faster insights

Data Hub
A collection of data from multiple sources, organized for distribution, sharing, and often subsetting. Generally this data distribution takes the form of a hub-and-spoke architecture. A data hub differs from a data warehouse in that it is generally unintegrated and often holds data at different grains. It differs from an operational data store because a data hub does not need to be limited to operational data. A data hub differs from a data lake by homogenizing data and possibly serving it in multiple desired formats, rather than simply storing it in one place, and by adding other value to the data such as de-duplication, quality, security, and a standardized set of query services. A data lake tends to store data in one place for availability, and allows or requires the consumer to process or add value to the data.
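The value a hub adds on top of plain storage (homogenizing, de-duplicating, and serving data in more than one format) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the source records, the field aliases, and the functions `homogenize`, `build_hub`, and `serve` are hypothetical names chosen for illustration, and the rule "first record wins" is just one possible de-duplication policy.

```python
import csv
import io
import json

# Assumed mapping from source-specific field names to the hub's shared names.
FIELD_ALIASES = {"Email": "email", "e_mail": "email", "full_name": "name"}

# Hypothetical spoke systems feeding the hub (illustrative data only).
CRM = [{"Email": "Ann@Example.com ", "full_name": "Ann Lee"}]
BILLING = [
    {"e_mail": "ann@example.com", "name": "Ann Lee"},
    {"e_mail": "bob@example.com", "name": "Bob Ray"},
]


def homogenize(record):
    """Rename fields to the shared names and normalize the values."""
    clean = {}
    for key, value in record.items():
        clean[FIELD_ALIASES.get(key, key)] = value.strip()
    clean["email"] = clean["email"].lower()
    return clean


def build_hub(*sources):
    """Merge all sources, de-duplicating on email (first record wins)."""
    hub = {}
    for source in sources:
        for record in source:
            clean = homogenize(record)
            hub.setdefault(clean["email"], clean)
    return list(hub.values())


def serve(hub, fmt):
    """Serve the same homogenized data in the format a consumer asks for."""
    if fmt == "json":
        return json.dumps(hub, sort_keys=True)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["email", "name"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(hub)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


hub = build_hub(CRM, BILLING)
as_csv = serve(hub, "csv")
as_json = serve(hub, "json")
```

A data lake, by contrast, would keep the CRM and billing records as they arrived and leave the renaming and de-duplication to each consumer.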

Turn ‘Data Lake’ into an Enterprise Data Hub
Hadoop can be used as a “data lake”: a scalable data repository, built on the cheap-and-deep storage economics of HDFS (Hadoop Distributed File System), that captures data from anywhere and in any format for future analysis. As Hadoop deployments shift from proof-of-concept sandbox experiments to enterprise-grade, mission-critical production solutions, they take on new workloads, and those workloads need all the power and flexibility of the Hadoop ecosystem components. Customers with existing investments in non-HDFS data lakes are just as eager to attack new analytic and processing workloads as everyone else. For them, setting up an alternative HDFS-based Hadoop cluster using Direct Attached Storage (DAS) would mean copying data from the existing NAS-based data lake into a separate Hadoop installation. Copying is expensive; copying terabytes or petabytes is prohibitively so.
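A rough back-of-the-envelope calculation shows why bulk copying becomes prohibitive. The sketch below assumes an ideal, fully sustained network link and ignores protocol overhead, replication, and verification; the 10 Gbit/s figure and the function name `copy_days` are illustrative assumptions, not numbers from the presentation.

```python
SECONDS_PER_DAY = 86_400


def copy_days(data_bytes, link_bits_per_second):
    """Estimate the days needed to copy data over an ideal, sustained link."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


ONE_TERABYTE = 1e12  # bytes, decimal units
ONE_PETABYTE = 1e15  # bytes, decimal units
ASSUMED_LINK = 10e9  # hypothetical 10 Gbit/s link

terabyte_days = copy_days(ONE_TERABYTE, ASSUMED_LINK)
petabyte_days = copy_days(ONE_PETABYTE, ASSUMED_LINK)
```

Under these assumptions one terabyte takes about 800 seconds, while one petabyte takes roughly 9.3 days of uninterrupted transfer, before accounting for the duplicate storage the second copy requires.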

Enterprise Data Hub
