Prasenjit Ghosh. Director Balram Mishra. Project Manager


1 Data Validation Framework for challenging business environment in modern era
Prasenjit Ghosh, Director
Balram Mishra, Project Manager
Abhisek Mohanty, Project Manager
Devipriya Selvaraju, Technology Lead
Infosys Limited

2 Abstract
In today's IT industry, data analytics plays a crucial role in solving business challenges. Insights are generated from massive, diverse data accumulated from a variety of sources. Structured and unstructured data are brought together from these sources into a data lake, and the required insights are then generated from the data lake using modern user interface technologies. The fundamental challenge lies at the core: the quality of the base data after migrations and transformations, with "huge data volume" and "diverse data sources" at its root. It is therefore essential to have a comprehensive data validation framework that addresses these challenges and is flexible enough to be plugged into functions such as developer self-testing, independent QA, and load testing. We present a practitioner's view based on a real project challenge and the solution framework implemented using a readily available open-source framework. The early benefits from using this solution are encouraging, and the solution can be enhanced or leveraged further depending on context-specific needs.

3 Challenges in evaluating data quality post migration/transformation
Variety in data sources: Oracle, MongoDB, CSV, etc.
Types of data transformation: one-time migrations as well as incremental updates.
Developer / Tester / QA:
- Inability to identify missing records.
- Inability to validate data at attribute level.
- Challenges in tracking records over incremental migration.
Automation / Process / Easy-to-use tool:
- Manual testing is effort-consuming and error-prone.
- Need for automated reporting: summary, details, and trend analysis.

4 Addressing the Challenges

5 Practitioner’s View: Case Study : Real Time Problem Statement
# | Problem Statement | Description
1 | Mismatch Count | Mismatch in the record count between source and target
2 | Missing Record Set | Records dropped in migration/transformation
3 | Attribute Mismatch | Mismatch in the attributes of the records
4 | Incremental Validation | Issues with incremental data migration from source to target
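The first three problem statements can be illustrated with a small in-memory comparison. This is a minimal sketch, not the framework described in the deck: it assumes both sides have already been loaded as lists of dictionaries and that every record carries a unique key field. The names `validate`, `index_by_key`, and the `id` key are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def index_by_key(records: List[Record], key: str) -> Dict[Any, Record]:
    """Index records by their key field for constant-time lookup."""
    return {rec[key]: rec for rec in records}


def validate(source: List[Record], target: List[Record], key: str = "id") -> Dict[str, Any]:
    """Compare source and target and report the first three problem types."""
    src = index_by_key(source, key)
    tgt = index_by_key(target, key)

    # 1. Mismatch count: difference in raw record counts.
    count_difference = len(source) - len(target)

    # 2. Missing record set: keys present in source but absent from target.
    missing_keys = sorted(set(src) - set(tgt))

    # 3. Attribute mismatch: for keys on both sides, collect every field
    #    whose value differs, as (source_value, target_value).
    attribute_mismatches: Dict[Any, Dict[str, tuple]] = {}
    for k in sorted(set(src) & set(tgt)):
        fields = sorted(set(src[k]) | set(tgt[k]))
        diff = {
            f: (src[k].get(f), tgt[k].get(f))
            for f in fields
            if src[k].get(f) != tgt[k].get(f)
        }
        if diff:
            attribute_mismatches[k] = diff

    return {
        "count_difference": count_difference,
        "missing_keys": missing_keys,
        "attribute_mismatches": attribute_mismatches,
    }
```

At the record volumes reported later in the deck (millions of rows), a comparison like this would normally be pushed down to a query engine rather than done in application memory; the sketch only shows what each check means.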

6 Practitioner’s View: Case Study : Solution Approach

7 Practitioner’s View: Case Study : Result (1/3)
Current Pain Points | Solution
Tedious way to validate the record count between source and target. | Precise count difference between source and Mongo.
Inability to identify missing records. | Identification of missing records.
Inability to validate data at attribute level. | Identification of attribute-level mismatches.
Challenges in tracking records over incremental migration. | Incremental validation for count, missing data, and attribute mismatch.
Late evaluation of final result. | Shift-left: validation by the developers themselves.
Apart from addressing the above pain points, this solution is capable of:
- CSV and Mongo data comparison
- Mongo and Elastic comparison
- Comparison of multiple data sources (e.g. Oracle, Mongo, and Elastic) in one go
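Incremental validation can be sketched by restricting the checks to records that changed since the previous run. The deck does not say how increments are tracked; the example below assumes each record carries a last-modified timestamp and that the validator stores a checkpoint between runs. The function name `validate_increment` and the fields `id` and `updated_at` are hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List

Record = Dict[str, Any]


def validate_increment(
    source: List[Record],
    target: List[Record],
    last_checkpoint: datetime,
    key: str = "id",
    ts_field: str = "updated_at",  # assumed last-modified column
) -> Dict[str, Any]:
    """Validate only the source records modified after the last checkpoint."""
    changed = [rec for rec in source if rec[ts_field] > last_checkpoint]
    tgt = {rec[key]: rec for rec in target}

    # Changed records that never arrived in the target.
    missing_keys = sorted(rec[key] for rec in changed if rec[key] not in tgt)

    # Changed records that arrived but differ in at least one attribute.
    mismatched_keys = sorted(
        rec[key] for rec in changed if rec[key] in tgt and tgt[rec[key]] != rec
    )

    # The next run starts from the newest timestamp seen in this run.
    next_checkpoint = max((rec[ts_field] for rec in changed), default=last_checkpoint)

    return {
        "checked": len(changed),
        "missing_keys": missing_keys,
        "mismatched_keys": mismatched_keys,
        "next_checkpoint": next_checkpoint,
    }
```

Records older than the checkpoint are deliberately skipped, which keeps each run proportional to the size of the increment rather than the full data set.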

8 Practitioner’s View: Case Study : Result (2/3)
Developers use this tool to validate the accuracy of data transfer/transformation between Oracle and Mongo. The benefits realized are:
- Quick visibility of mismatches across huge volumes of data
- Visibility into data latency via comparison of source and target
- Help in performance-tuning the data stream flow
- Decoupling of the data quality check from the migration itself, thereby enabling focus and faster turnaround in delivering large migrations
Data Entity | Use Case | Record Count | Query Execution Time | Accuracy % Before Using Apache Drill | Accuracy % After Using Apache Drill
User Role | Identification of Missing Records | 6,387,608 | 18 mins | 85% | 98%
Products Master | | 1,338,572 | 8 mins | |
Item Master | | 4,546,279 | 10 mins | |
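A missing-record check of the kind listed in the table could be expressed as a single Apache Drill query that joins the two stores. The deck does not show the queries actually used, so this is only a sketch: the storage plugin names (`oracle`, `mongo`), schema, table, and key names are hypothetical, and key columns from different stores may need a `CAST` before they compare equal. The payload shape follows Drill's REST query endpoint (`POST /query.json`); sending it is left out so the example stays self-contained.

```python
import json


def missing_records_query(
    source_table: str, target_table: str, source_key: str, target_key: str
) -> str:
    """Build a Drill SQL anti-join: source keys with no matching target row.

    Identifiers are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    return (
        f"SELECT s.{source_key} AS missing_key "
        f"FROM {source_table} s "
        f"LEFT JOIN {target_table} t ON s.{source_key} = t.{target_key} "
        f"WHERE t.{target_key} IS NULL"
    )


def drill_rest_payload(sql: str) -> str:
    """Wrap a SQL string in the JSON body expected by Drill's /query.json."""
    return json.dumps({"queryType": "SQL", "query": sql})


# Hypothetical plugin/schema/table names, for illustration only.
sql = missing_records_query(
    source_table="oracle.APP.USER_ROLE",
    target_table="mongo.appdb.user_role",
    source_key="ROLE_ID",
    target_key="role_id",
)
payload = drill_rest_payload(sql)
```

Because Drill addresses each store through its own storage plugin, the same query shape could in principle be reused for the CSV, Mongo, and Elastic comparisons by changing only the table names.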

9 Practitioner’s View: Case Study : Result (3/3)
Because Apache Drill can integrate with a reporting tool (Tableau in our case), we get ready-made dashboards on the required dimensions, such as overall summary and trends over time. The trend graphs give insights such as:
- Data mismatches increase as data inflow increases.
- Latency increases as data inflow increases.
- Data drops increase as data inflow increases.
- Data inflow spikes at month end, quarter end, and year end.
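The relationship between inflow and mismatches that the trend graphs reveal can also be quantified outside the dashboard. The sketch below computes a per-run mismatch rate and a Pearson correlation between inflow and mismatch counts. The run summaries are made-up illustration values, not results from the project, and the field names `inflow` and `mismatches` are hypothetical.

```python
import math
from typing import Dict, Sequence


def mismatch_rate(run: Dict[str, int]) -> float:
    """Fraction of records in a run that failed validation (0.0 for an empty run)."""
    return run["mismatches"] / run["inflow"] if run["inflow"] else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up validation-run summaries, for illustration only.
runs = [
    {"inflow": 1000, "mismatches": 10},
    {"inflow": 2000, "mismatches": 30},
    {"inflow": 4000, "mismatches": 90},
]
rates = [mismatch_rate(run) for run in runs]
r = pearson([run["inflow"] for run in runs], [run["mismatches"] for run in runs])
```

A coefficient close to 1 would be consistent with the "mismatch grows with inflow" observation, though a correlation over a handful of runs is only indicative.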

10 Practitioner’s View: Case Study : Additional Points
Achieving performance:
- Distributed query optimization and execution
- Columnar execution
- Runtime compilation and code generation
- Vectorization
- Optimistic/pipelined execution
Achieving security:
- Authentication
- Encryption
- Impersonation
- Authorization

11 References & Appendix migration.pdf

12 Question & Answers

13 Thank You!!!

