Presentation is loading. Please wait.

Presentation is loading. Please wait.

Navigator Management Partners LLC Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business.

Similar presentations


Presentation on theme: "Navigator Management Partners LLC Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business."— Presentation transcript:

1 Navigator Management Partners LLC Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business Intelligence Peers Neelam Mohanty

2 TYPICAL DATA WAREHOUSING NEEDS  Identify when changes are made to the source systems  Uniquely identify a source record  Be usable – allow users to drill up, drill down or across  Handle overlapping data from multiple systems  Allocate header level facts to detail line-item data  Handle semantic complexity  Eliminate mismatches when integrating data from multiple sources  Store historical and current data

3 REQUIREMENTS FROM SOURCE SYSTEMS  Change Data Capture (CDC)  Natural Key  Data Quality  Hierarchies  Grain  History  Conflict Resolution  Semantic Complexity

4 CHANGE DATA CAPTURE (CDC)  Why the need to identify changes?  Data Warehouses need to identify slowly changing dimensions  Master Data Management (MDM) needs to identify when changes were made to master data attributes  Capture changes as they happen in source systems for improving Data Quality  Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables  Capture deletions, edits and inserts

5 CHANGE DATA CAPTURE (CDC)  Why is CDC important?  Reduces load time, required resources and associated costs  Provides a solution for the continued and accelerating growth in data volumes  Supports lower delivery latencies for real-time data warehousing  Mitigates the risk of failure in long-running ETL jobs.  Four ways of implementing CDC:  Database Log Scraping  Audit Columns  Timed Extracts  Full database “diff compare”

6 UNIQUE IDENTIFIER/NATURAL KEY  Why is a unique identifier important?  Data Warehouses typically store years of history – the source system may only store the last three months of history and reuse the unique identifier  Data Warehouses typically need to integrate multiple source systems  Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States  Fuzzy vs Exact matches

7 DATA QUALITY  Why does it happen?  Bad businesses processes  Inadequate software training  Bad software design and implementation  Consequences on BI  Bad or delayed decisions  Revenue Impacts  Money spent in standardizing or cleansing data downstream  Affects ability to reach customers, cross-sell and up-sell  Efficiency Impacts  Time and resources spent on fixing data quality issues

8 HIERARCHIES  A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree.many-to-one relationships dimension  Why important?  Usability - Hierarchies are the key to navigating dimensions  Query Performance  Level  A position in a hierarchy  Hierarchies are not always officially part of the source system

9 THE HOLY GRAIN(L)  Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form.  An invoice (the parent) has many line items (the children).  Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets.  Parent-child transaction databases commonly have facts of different granularity  Data Warehouses often try to allocate header level facts down to the child line item detail level  This provides the ability to slice and dice and roll up the facts to the child line-item detail

10 HISTORY  Data Warehouses typically need to do a one-time history load when they go live  Also called initial load or initial/seeding load  The data in the operational systems may often be in multiple formats  It is believed that data quality gets poorer and poorer as the data gets older  Missing data value issues often must be dealt with when loading history

11 SURVIVORSHIP (CONFLICT RESOLUTION)  Overlapping data from multiple systems  Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values  Resolution strategies can be  Instance based – rely on the actual data values  Deciding versus Mediating strategies  Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated

12 SEMANTIC COMPLEXITY  Different users of a database have different conceptions of what the data represents  Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits  What does null mean?  The record is of someone who does not have a mobile phone  The record is of someone who has a mobile phone, and chose not to supply the number  The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null  What do the all 0’s and all 9’s mean?

13 QUESTIONS?  Contact me at

14 CHANGE DATA CAPTURE (CDC)  Why the need to identify changes?  Data Warehouses need to identify slowly changing dimensions  Master Data Management (MDM) needs to identify when changes were made to master data attributes  Capture changes as they happen in source systems for improving Data Quality  Data warehouses want to transfer only the relevant changes to the source data since the last transfer (also called incremental loading). Doing a ‘full refresh’ is usually undesirable other than for smaller tables  Capture deletions, edits and inserts

15 CHANGE DATA CAPTURE (CDC)  Why is CDC important?  Reduces load time, required resources and associated costs  Provides a solution for the continued and accelerating growth in data volumes  Supports lower delivery latencies for real-time data warehousing  Mitigates the risk of failure in long-running ETL jobs.  Four ways of implementing CDC:  Database Log Scraping  Audit Columns  Timed Extracts  Full database “diff compare”

16 UNIQUE IDENTIFIER/NATURAL KEY  Why is a unique identifier important?  Data Warehouses typically store years of history – the source system may only store the last three months of history and reuse the unique identifier  Data Warehouses typically need to integrate multiple source systems  Knowing a birthdate, gender and ZIP code is enough to identify 87 percent of the people living in the United States  Fuzzy vs Exact matches

17 SURVIVORSHIP (CONFLICT RESOLUTION)  Overlapping data from multiple systems  Survivorship (Conflict resolution) – Rules to define how to assemble a single record from two or more records with overlapping attributes that may contain conflicting values  Resolution strategies can be  Instance based – rely on the actual data values  Deciding versus Mediating strategies  Metadata based – rely on freshness of data or reliability of source or which data element was most recently updated

18 HIERARCHIES  A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels collectively makes up a dimension or a tree.many-to-one relationships dimension  Why important?  Usability - Hierarchies are the key to navigating dimensions  Query Performance  Level  A position in a hierarchy  Hierarchies are not always officially part of the source system

19 THE HOLY GRAIN(L)  Most of the basic control documents that transfer money and goods (or services) from place to place take the parent child form.  An invoice (the parent) has many line items (the children).  Other obvious examples besides invoices include orders, bills of lading, insurance policies, and retail sales tickets.  Parent-child transaction databases commonly have facts of different granularity  Data Warehouses often try to allocate header level facts down to the child line item detail level  This provides the ability to slice and dice and roll up the facts to the child line-item detail

20 DATA QUALITY  Why does it happen?  Bad businesses processes  Inadequate software training  Bad software design and implementation  Consequences on BI  Bad or delayed decisions  Revenue Impacts  Money spent in standardizing or cleansing data downstream  Affects ability to reach customers, cross-sell and up-sell  Efficiency Impacts  Time and resources spent on fixing data quality issues

21 HISTORY  Data Warehouses typically need to do a one-time history load when they go live  Also called initial load or initial/seeding load  The data in the operational systems may often be in multiple formats  It is believed that data quality gets poorer and poorer as the data gets older  Missing data value issues often must be dealt with when loading history

22 SEMANTIC COMPLEXITY  Different users of a database have different conceptions of what the data represents  Take the example of a table of mobile phone numbers – there are values of all 0’s, nulls, all 9’s and valid digits  What does null mean?  The record is of someone who does not have a mobile phone  The record is of someone who has a mobile phone, and chose not to supply the number  The record is of someone who has a mobile phone, but forgot to supply the number or the number was hard to decipher and recorded as null  What do the all 0’s and all 9’s mean?

23 QUESTIONS?  Contact me at


Download ppt "Navigator Management Partners LLC Business Analysis Professional Development Day – Sep 2014 How to understand and deliver requirements to your Business."

Similar presentations


Ads by Google