Top five challenges facing the Enterprise Data Warehouse (EDW)


1 Top five challenges facing the Enterprise Data Warehouse (EDW)
1. Offload ETL (50x-100x): reduce processing and storage costs by up to 100x with Hadoop; free EDW processing capacity for high-value analytics and reporting
2. Offload data (~50%): identify and move cold, rarely used data to Hadoop; store more data for longer, years versus months of transaction data
3. Add new data sources: store new data types, such as unstructured and semi-structured, web, social, and IoT data
4. Data quality (low confidence): enable faster, more effective, and more reliable decision making by ensuring that data is trustworthy
5. Data governance (no governance): data stewardship, business glossary, data lineage; an ungoverned data lake is an unmanageable data lake

We have 20 years of experience working with clients that use Enterprise Data Warehouses, and in talking with those clients we have found five challenges that almost all of them face. At least some of these may look familiar to you.

1. Higher cost of storage in the EDW versus Hadoop; ETL consumes 50%-90% of EDW processing capacity. Offloading saves significant money on software licenses and hardware infrastructure, and provides a better means of storing and accessing data. One big reason the traditional EDW costs so much is that Extract, Transform, and Load (ETL) processing historically runs inside the data warehouse engine. Originally this was deemed a best practice because the ETL process ran faster when it was close to the data and used the EDW's parallel database engine, but it can consume tremendous amounts of processing capacity on that system, reducing the capacity available for data retrieval, which can negatively impact mission-critical processing times and business intelligence performance. The only short-term remedy is to keep buying more processing capacity. And storage capacity on an EDW can be 100 times more expensive than the same amount of storage on a Hadoop-based warehouse or repository.

2. Unused/cold data in the EDW. EDWs have also often been misused by customers and have become a dumping ground for all data. Data that is rarely used (cold data) gets loaded into the EDW because companies are unsure whether they will need it for analytics and reporting, so rather than take a chance they load all or most of their data. This quickly consumes storage capacity, forcing companies to archive data, which can make it hard or even impossible to access.

3. Unstructured and semi-structured data; months versus years of history; web and IoT data not stored. Almost more important than cost, traditional EDWs cannot ingest and use the new sources of data that may be needed in the analytics and reporting that run off the EDW. Modern analytics require new types of data to be accurate and relevant, such as web data or IoT data (sensor readings, consumer sentiment, and so on), and EDWs do not possess the ingestion and storage technology for this unstructured data. And because EDWs must be so concerned about storage and cost, companies cannot retain the long historical periods that could let their analytic processing produce better results and truly harness their data for revenue growth.

4. Garbage in, garbage out; complex data cleansing routines cannot be pushed into the database; most data cleansing tools are not massively scalable. Enterprise Data Warehouses are commonly plagued by quality issues in the data they hold. This "unclean" data can lead to incomplete or inaccurate analytic processing and business intelligence reports, giving business users and decision makers low confidence in the data and the analytic results, which almost defeats the purpose. EDWs are not designed to run complex data cleansing routines, and doing so would consume too much of their expensive processing capacity.

5. No data stewardship, no business glossary, no data lineage. Data governance is also a big concern for data stored in an EDW. It is not just a question of who is allowed to use or access the data: a lack of business meaning, ownership, and data stewardship means data gets dumped in and, once there, is never found or used because other users do not know what it is for. With no ownership, accountability, or stewardship, and no record of where the data came from or how it got into the warehouse, much of the data will never be used. These five challenges have to be addressed, or your organization will end up chasing its tail trying to use its data.
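As a minimal illustration of the cold-data offload idea above, the sketch below flags tables whose last query falls outside a retention window, making them candidates to move to Hadoop. The table names, dates, and one-year threshold are invented for the example; a real implementation would read access statistics from the EDW's catalog.

```python
from datetime import datetime, timedelta

# Hypothetical last-access log: table name -> most recent query date.
last_access = {
    "sales_2015": datetime(2016, 1, 10),
    "sales_2023": datetime(2024, 5, 2),
    "web_clicks": datetime(2024, 4, 28),
    "legacy_orders": datetime(2014, 7, 1),
}

def cold_tables(access_log, as_of, cold_after_days=365):
    """Return tables not queried within the window: candidates to offload to Hadoop."""
    cutoff = as_of - timedelta(days=cold_after_days)
    return sorted(t for t, last in access_log.items() if last < cutoff)

candidates = cold_tables(last_access, as_of=datetime(2024, 6, 1))
print(candidates)  # ['legacy_orders', 'sales_2015']
```

In practice the threshold would differ per workload, and the flagged tables would be bulk-copied to HDFS and then dropped or truncated in the EDW.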

2 EDW offloading is the largest reason that companies are adopting Hadoop data lakes, driven by these factors:
The compelling opportunity to realize significant cost reduction
The significant benefits of establishing a governed data lake that provides self-service analytics for non-technical users, who analyze traditional structured data enriched with new forms of data
Setting the foundational layer for operationalizing future actionable insights in a customer-centric or product-centric way

3 Low cost, efficient bulk movement of data
EDW offloading can involve up to three related activities:
1. Moving or supplementing data integration from the EDW to Hadoop
2. Moving unused data from the EDW to Hadoop
3. Storing new types of data in Hadoop for enriching EDW analytics
Extract EDW data X-way parallel; move data Y-way parallel with data repartitioning; load Hadoop Z-way parallel. Different degrees of parallelism can run in different parts of the job.
High-performance, parallel interfaces to both the EDW and HDFS, with no limitations on throughput and performance: an IBM HDFS loading test achieved 15 TB/hr, and doubling the hardware doubled throughput to 30 TB/hr.
Move data without landing to disk.
Same easy drag-and-drop paradigm; a zero-coding environment with roughly a 10x productivity gain over hand coding.
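The repartitioning step between stages of different widths can be sketched as hash partitioning on a key, so that an X-way extract can feed a Z-way load. This is a toy illustration with invented data and partition counts; the IBM tooling described above does this without hand coding.

```python
import zlib

def partition_id(key_value, n_partitions):
    """Stable hash partitioning: the same key always lands in the same partition."""
    return zlib.crc32(str(key_value).encode("utf-8")) % n_partitions

def repartition(rows, key, n_partitions):
    """Regroup extracted rows so a differently-parallel load stage can consume them."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[partition_id(row[key], n_partitions)].append(row)
    return parts

# Output of a hypothetical 2-way extract, repartitioned 3 ways for the Hadoop load.
extracted = [{"cust_id": i, "amount": i * 10} for i in range(6)]
loads = repartition(extracted, key="cust_id", n_partitions=3)
assert sum(len(p) for p in loads) == len(extracted)  # no rows lost in the shuffle
```

A stable hash (here `zlib.crc32` rather than Python's process-randomized `hash`) matters because rows with the same key must land in the same partition across runs and processes.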

4 Classification and Validation
In each phase of the EDW offloading activities, organizations have an opportunity to add data quality processing and data governance (data policies and rules for classification and validation) as part of the offloading process. Pushing ETL workloads into the EDW has prevented organizations from implementing either. And remember, governing the data in the Hadoop data lake entails more than governing that particular data alone: governance needs to know about and encompass the original transactional data used to derive the data in the lake, the BI reports, the users, and the other related data elements that consume it.
Improved data quality creates trustworthy data for better decision making. Most EDWs do not support comprehensive data quality processing; reliance on pushing ETL into the database has prevented it, so "garbage in, garbage out" is usually the case for EDW analytics. Hadoop without comprehensive data quality is useless: it will simply produce garbage in, garbage out faster and at a lower cost than the EDW. Many organizations therefore implement data quality processing as part of the EDW offloading process itself.
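A minimal sketch of the classification-and-validation idea: each rule tags a record as passing or failing, and failing records are quarantined before the data reaches the lake. The field names and rules here are hypothetical, standing in for the data policies and rules a governance catalog would define.

```python
def not_null(field):
    """Rule: the field must be present and non-null."""
    return lambda rec: rec.get(field) is not None

def in_range(field, lo, hi):
    """Rule: the field must be a value between lo and hi inclusive."""
    return lambda rec: rec.get(field) is not None and lo <= rec[field] <= hi

# Hypothetical policy set for an offloaded orders feed.
RULES = {
    "customer_id present": not_null("customer_id"),
    "order_total plausible": in_range("order_total", 0, 1_000_000),
}

def validate(records, rules):
    """Split records into clean rows and quarantined (row, failed-rule-names) pairs."""
    clean, quarantined = [], []
    for rec in records:
        failed = [name for name, rule in rules.items() if not rule(rec)]
        (quarantined if failed else clean).append((rec, failed))
    return [r for r, _ in clean], quarantined

records = [
    {"customer_id": 1, "order_total": 250.0},
    {"customer_id": None, "order_total": 99.0},
    {"customer_id": 3, "order_total": -5.0},
]
clean, bad = validate(records, RULES)
print(len(clean), len(bad))  # 1 2
```

Running validation in the offload pipeline, rather than inside the EDW engine, is exactly the point of the slide: the cheap Hadoop tier absorbs the cleansing work the warehouse cannot afford.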

5 Only IBM offers a modular solution for all eight of the EDW offloading requirements
The eight requirements: 1. Move data; 2. Transform and integrate data; 3. Replicate data; 4. Improve data quality; 5. Govern data; 6. Augment and enrich data; 7. Reference architecture; 8. Implementation patterns
The IBM components: Trusted Analytics Foundation; DataStage®; QualityStage®; BigIntegrate; BigQuality; InfoSphere® Information Server; Information Governance Catalog; Data Replication

6 Capabilities
Required capability | Why this is important | IBM solution
1. Move data | Low cost, efficient movement of data | DataStage/BigIntegrate
2. Transform and integrate | Reduce costs while leveraging existing assets |
3. Improve data quality | No data quality means garbage in, garbage out | QualityStage/BigQuality/Information Analyzer
4. Govern your data | Ungoverned Hadoop means unmanageable Hadoop | Information Governance Catalog (IGC)
5. Replicate | Deliver data where and when needed | IBM Data Replication
6. Augment and enrich | Increase ROI from EDW analytics |
7. Reference architecture | Reduce project costs and risks | IBM Enterprise Analytics Reference Architecture
8. Implementation patterns | Reduce project costs and risks | IBM Analytics Implementation Patterns

