Hadoop and Spark Dynamic Data Models Amila Kottege Software Developer


1 Hadoop and Spark: Dynamic Data Models
Amila Kottege, Software Developer
Ontario Teachers' Pension Plan

2 Agenda
What we do
What we're building
How we're building it

3 What we do
Asset Liability Model: a Monte Carlo simulation that projects the pension's liabilities. It simulates ~300 variables and projects them into the future.
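The deck doesn't show the model itself; as a rough illustration of the Monte Carlo idea only, here is a toy sketch in plain Python. The variable, its dynamics, and all parameters are hypothetical stand-ins, not the actual Asset Liability Model (which simulates ~300 variables, not one):

```python
import random

def simulate_liabilities(n_trials=1000, n_years=30, seed=42):
    """Toy Monte Carlo projection: each trial evolves one liability value
    forward year by year under random growth shocks. The real model
    simulates ~300 variables; this sketch uses a single one."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        liability = 100.0  # hypothetical starting liability, arbitrary units
        path = []
        for _ in range(n_years):
            growth = rng.gauss(0.03, 0.08)  # hypothetical drift/volatility
            liability *= (1.0 + growth)
            path.append(liability)
        results.append(path)
    return results

paths = simulate_liabilities()
final_values = [p[-1] for p in paths]
mean_final = sum(final_values) / len(final_values)
```

The real simulation distributes this kind of work and takes about 1.5 hours per run; the sketch only shows the trial/projection structure.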

4 What we do
A simulation takes about 1.5 hours. The business expects to be able to analyze the results immediately afterward, and it runs 5000+ simulations a year.

5 What we're building
A reporting system to help the business perform analysis: a reporting engine based on the Hadoop ecosystem (HDFS, Spark, Hive), plus a set of reusable calculations and algorithms in Spark (common statistical calculations and specific business calculations).

6 What we're building
Two main report types. Static (canned) reports: users provide inputs and configure canned reports. Dynamic reports: users want exploratory reports, self-serve, with the ability to manipulate the data themselves.

7 (diagram slide)

8 (pipeline diagram)
Calculation 1 → Output of Calculation 1
Calculation 2 → Output of Calculation 2
Calculation 3 → Output of Calculation 3
Output Combiner
Calculation 4 → Output of Calculation 4
Calculation 5 → Output of Calculation 5
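One way to read the pipeline above, independent calculations each producing an output that a combiner merges, is the following minimal sketch. The function names and data shapes are illustrative assumptions, not the production Spark code:

```python
def calc(name, data):
    # Stand-in for one real calculation: tag and transform the input.
    return {"source": name, "result": [x * 2 for x in data]}

def output_combiner(outputs):
    # Merge the independent calculation outputs into one result set,
    # ready to feed downstream calculations.
    combined = []
    for out in outputs:
        combined.extend(out["result"])
    return combined

data = [1, 2, 3]
outputs = [calc(f"Calculation {i}", data) for i in (1, 2, 3)]
merged = output_combiner(outputs)  # -> [2, 4, 6, 2, 4, 6, 2, 4, 6]
```

In the real system each `calc` would be a Spark job and the combiner's output would itself be input to further calculations.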

9 What we're building
Static reports are simple: perform calculations based on user input and produce an Excel file with the results. Dynamic reporting is difficult, and self-serve is difficult: how do we provide a simple interface for the business to analyse the results of the calculations in a self-serve manner?

10 What we're building
Self-serve, for us, means performing the complex calculations upon user request, generating new data, and allowing the business to slice and dice this newly created data, which sometimes includes raw output from the simulation.

11 What we're building
We looked at many self-serve BI tools: Tableau, QlikView, and Power Pivot. Each has its benefits, but all required a well-built data model, and each either loaded the whole data model to the client side or sent queries back to the server every time a filter changed.

12 (diagram slide)

13 What we're building
The data size is too large to fit on a client computer, and constantly sending queries back and forth is not the best user experience. Changing a large data model is a very difficult and slow process. Does the user even need all the data, from all previous reports?

14 How we're building it
No, the user does not need all the data; very few cases, if any, exist where they want all of it. Picking one tool for everything is difficult, so we use the correct tool when needed.

15 (diagram slide)

16 How we're building it
Each report becomes its own database (Hadoop + Hive). Databases in Hive exist upon query, so the effect on us is minimal.
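A minimal sketch of the per-report-database idea, using Hive's standard `CREATE DATABASE` / `CREATE TABLE` DDL. The report naming scheme and table layout below are hypothetical, not the deck's actual schema:

```python
def report_ddl(report_id):
    """Generate HiveQL that gives one report its own database.
    Creating a Hive database is a cheap metastore operation, which is
    why 'each report becomes its own database' costs little."""
    db = f"report_{report_id}"
    return [
        f"CREATE DATABASE IF NOT EXISTS {db}",
        f"CREATE TABLE IF NOT EXISTS {db}.fact_results "
        f"(sim_id INT, variable STRING, value DOUBLE)",
    ]

statements = report_ddl(42)
```

Because each report's tables live in their own Hive database, a BI tool can be pointed at exactly one report's data rather than one giant shared model.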

17 How we're building it

18 How we're building it

19 How we're building it
No magic here: Spark's DataFrames. Each calculation/report has a predictable output structure, and we leverage this structure to create facts and dimensions.
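As a rough sketch of deriving facts and dimensions from a predictable output structure, here is the idea in plain Python standing in for Spark DataFrame operations. The column names and rows are hypothetical:

```python
def build_star_schema(rows):
    """Split flat calculation output into a dimension table (unique
    variable names with surrogate keys) and a fact table that
    references the dimension by key."""
    dim = {}      # variable name -> surrogate key
    facts = []
    for row in rows:
        var = row["variable"]
        if var not in dim:
            dim[var] = len(dim) + 1
        facts.append({
            "variable_key": dim[var],
            "sim_id": row["sim_id"],
            "value": row["value"],
        })
    dim_table = [{"variable_key": k, "variable": v} for v, k in dim.items()]
    return facts, dim_table

rows = [
    {"sim_id": 1, "variable": "inflation", "value": 0.02},
    {"sim_id": 1, "variable": "equity_return", "value": 0.07},
    {"sim_id": 2, "variable": "inflation", "value": 0.03},
]
facts, dims = build_star_schema(rows)
```

In Spark this would be DataFrame transformations (e.g. selecting distinct values for the dimension and joining the key back onto the facts) rather than Python loops, but the fact/dimension split is the same.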

20 How we're building it
Data models can grow with no dependency on the past. We are not tied to a single tool (Tableau, QlikView, PowerPivot, etc.), and the system does most of the hard work (Spark, Hive, HDFS).

21 Where we are
We generate data models per report, and generate an Excel file that connects to the correct database. Currently in UAT.

22 Thank you.

