1
Data Modeling for Predictive Analytics
SQL Saturday Providence – Dec 10 | Beth Wolfset, Data Architect
2
SQL Saturday 575 - Thanks to our Sponsors!
Gold · Silver · Bronze · Swag · Blog
3
About BlueMetal (an event sponsor)
Modern technology, craftsman quality. We're an interactive design and technology architecture firm matching the most experienced consultants in the industry to the most challenging business and technical problems facing our clients. Founded August 2010; as of October 2015 we are an Insight company.
6 years in operation | 5 locations | 6 service areas | 4 industry specializations
Locations: Boston, New York, Chicago, DC, Tempe
Service Areas: Intelligent Customer Applications, Modern Business Applications, Real-Time Business, Hybrid Cloud, Modern Workplace, Branch Infrastructure
We are hiring Data Engineers in Boston and NYC.
4
Data Is An Asset
"Whether you want it or not, the amount and variety of data are expanding exponentially. Embrace that trend and transition your organizations to understand information as a competency that needs the right people, processes and platforms." - John Lewis, president & CEO, consumer group, NA, at Nielsen
"Organizations integrating high-value, diverse, new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20%." - Regina Casonato, et al., Gartner Research
The Big Mystery: What's Big Data Really Worth? - Vipal Monga, 2014. A lack of standards for valuing information confounds accountants and economists.
What's Your Big Data Worth? - Ellis Booker, 2012. Big data experts say accounting rules need to catch up to the fact that information has value that should be reflected on a company's books.
Information Management in the 21st Century - Regina Casonato, Anne Lapkin, Mark A. Beyer, Yvonne Genovese, Ted Friedman, 2011. From "Information as a Byproduct" to "Information as an Asset."
Gartner Says Worldwide Enterprise IT Spending to Reach $2.7 Trillion in 2012 - Peter Sondergaard, 2011.
"Information is the oil of the 21st century, and analytics is the combustion engine." - Peter Sondergaard, Gartner Research
5
Why Are We Here? Here are the questions I address during this presentation:
- What is data modeling?
- What is the predictive analytics process?
- Database types, and how NoSQL fits in
- What are the types of data models, and where are they appropriate?
- A word on other data topics: data architecture and NoSQL, Code First vs. Model First
6
What is Data Modeling
Data modeling is a process used to define and analyze the data requirements needed to support the business processes within the scope of corresponding information systems in organizations. The process therefore involves professional data modelers working closely with business stakeholders, as well as potential users of the information system.
Some history:
- Structured Analysis and Design Technique (SADT) is a methodology for describing systems as a hierarchy of functions. SADT is a structured analysis modelling language which uses two types of diagrams: activity models and data models. It was developed from the late 1960s by Douglas T. Ross and was later formalized and published as IDEF0.
- Ingres began as a research project at the University of California, Berkeley, starting in the early 1970s and ending in 1985.
- In 1979, RSI introduced Oracle V2 (Version 2) as the first commercially available SQL-based RDBMS, a landmark event in the history of relational databases.
Put another way, data modeling is the process of documenting a complex software system design as an easily understood diagram, using text and symbols to represent the way data needs to flow. The diagram can be used as a blueprint for the construction of new software or for re-engineering a legacy application. Data modeling involves a progression from conceptual model to logical model to physical schema.
A metaphor: the data modeler takes a blank sheet of paper and decides where to place the lines - how many lines, where to put them, and how many different sets of lines. They set up the guidelines so that everyone can read and understand the information, and they document what is expected in each column and how it should be used. Someone can write on paper without lines, but others may not be able to understand what is written. (Compare visual notetaking, or sketchnoting.)
Schema-on-Write vs. Schema-on-Read:
- Schema-on-Write: you decide how the data will be laid out before you start collecting/writing it. This limits the data collected, so design for flexibility and growth, and plan for downstream uses of the data.
- Schema-on-Read: find a pattern in the data after the fact. This works with data from a variety of sources and keeps raw data that may not be useful now.
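To make the contrast concrete, here is a minimal T-SQL sketch (mine, not from the deck): a schema-on-write table whose layout is typed up front, and a schema-on-read query that imposes structure on raw JSON only at query time. Table names are illustrative, and OPENJSON assumes SQL Server 2016 or later.

```sql
-- Schema-on-write: the layout is decided before any data is written.
CREATE TABLE dbo.Enrollment (
    EnrollmentId INT          NOT NULL PRIMARY KEY,
    CourseId     VARCHAR(10)  NOT NULL,
    FirstName    NVARCHAR(50) NOT NULL,
    LastName     NVARCHAR(50) NOT NULL,
    EnrolledOn   DATE         NOT NULL
);

-- Schema-on-read: raw documents are kept as-is, and a structure is
-- imposed only at query time (OPENJSON requires SQL Server 2016+).
CREATE TABLE dbo.RawEnrollmentFeed (Payload NVARCHAR(MAX) NOT NULL);

SELECT j.CourseId, j.FirstName, j.LastName
FROM dbo.RawEnrollmentFeed
CROSS APPLY OPENJSON(Payload)
     WITH (CourseId  VARCHAR(10)  '$.course.courseId',
           FirstName NVARCHAR(50) '$.first_name',
           LastName  NVARCHAR(50) '$.last_name') AS j;
```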
7
The Data Journey
Transactional source systems (OLTP, ERP, CRM, LOB) and historical/IoT data (logs, devices, web, sensors, social) flow through ETL into staging and the data warehouse, and on to BI and analytics: dashboards and reporting.
Microsoft technologies along this journey: SQL Server, SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), Power BI, Azure Machine Learning (AML), HDInsight, Azure Data Lake, Microsoft Azure.
(This slide is a modification of a Microsoft slide found in several presentations; the predictive analytics image is from dreamstime.com.)
8
Predictive Analytics Process
Identify Business Objectives → Exploratory Analysis → Develop Statistical Model → Test Statistical Model → Implement Statistical Model → Monitor & Analyze
- Identify Business Objectives: the business questions, with a list of actions given the possible outcomes. There are many interesting discoveries in the data, but if they cannot lead to business actions (product enhancements, customer loyalty, additional revenue), then exploring them can be a waste of resources.
- Exploratory Analysis: the data scientist works with the available data. This may include transactional/historical data, warehouse data, or IoT data. It includes recognizing what data is available (corporate or public), any transformations needed, and findings with their statistical significance. The scientist uses reports or simple visualizations to make these discoveries.
- Develop Statistical Model: a more formal phase of creating statistical model(s) by determining the relevant factors. It includes dividing the data into training and testing sets (the testing sets are not used in this phase) and determining which models fit the training set best.
- Test Statistical Model: running the model against the test data set. The model cannot be 'tweaked' against the test data set, or that makes it a training set. There are different methods of training and validating models.
- Implement Statistical Model: at this point, the model has been developed and tested and is now being 'productionized' for use. At this phase, special data sets may be created for ongoing execution of the model.
- Monitor & Analyze: watching the model for performance, plus ongoing validation of the model as behaviors or business practices change over time, which could reduce the original model's fit.
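As one illustration of the train/test division described above, a hedged T-SQL sketch follows. It keys an 80/20 split on a stable identifier so the same row lands in the same set on every run; dbo.StudentActivity is a hypothetical source table, not one from the deck.

```sql
-- A reproducible 80/20 train/test split keyed on a stable identifier,
-- so the same row lands in the same set on every run.
-- dbo.StudentActivity is a hypothetical source table.
SELECT sa.*,
       CASE WHEN ABS(CHECKSUM(sa.StudentId)) % 100 < 80
            THEN 'train' ELSE 'test'
       END AS DataSet
FROM dbo.StudentActivity AS sa;
```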
9
Popular Types of Databases
Database Type: Examples
- Relational (SQL): SQL Server, SQL Database, Oracle, Sybase, MySQL, MS Access
- Document: Lotus Notes, Couchbase, CouchDB, MongoDB, OrientDB, Raven, Terrastore, Azure DocumentDB
- Graph & Resource Description Framework (RDF): Neo4j, Flock, HyperGraph, Infinite Graph, Jena, Sesame, AllegroGraph
- Search Engine: Elasticsearch, Splunk, Solr, MarkLogic, Sphinx
- Key-Value: Berkeley DB, LevelDB, Memcached, Riak, Redis, Azure Tables
- Column-Family: Cassandra, HBase, Hypertable, Amazon SimpleDB
- Object-Oriented: Caché, ObjectStore, Objectivity/DB, Db4o, Versant
- Hierarchical
Data stewards work with data regardless of where and how it will be persisted. Most of this information is from the book NoSQL Distilled and from Understanding NoSQL on Microsoft Azure by David Chappell; products offered by Microsoft are included above.
10
NoSQL: Data Architecture and "Not Only SQL"
NoSQL means Not Only SQL:
- It implies a non-relational database.
- It does not imply there is no structure to the data.
- Data may be stored in methods that do not require the structure to be understood a priori; the structure of the data may be defined at query time (it generally uses schema-on-read).
NoSQL does not mean no data architect. The data architect understands what data is available and the methods of accessing it, regardless of the management system. The data still requires documentation.
With storage becoming cheaper and new ways to analyze data, companies are becoming data hoarders.
11
Data Models
Model types covered: conceptual, logical, physical, star schemas, summary tables and cubes, and others.
Business case: capture information for an educational system. This will include information about the employees, faculty, students, alumni, courses, and educational organization.
12
Data Modeling Tools
(Compared on: creator; supported database platforms; supported OSs; supported data models - conceptual, logical, physical; supported notations; forward engineering; reverse engineering; model/database comparison and synchronization; repository.)
- ERwin Data Modeler (ERwin Inc., formerly part of CA Technologies): Access, IBM DB2, Informix, Ingres, MySQL, Oracle, Progress, MS SQL Server, Sybase, Teradata. Windows. Conceptual, logical, physical. IDEF1X, IE (crow's feet), and more. Forward engineering: yes; can update the database and/or update the model. The Workgroup edition provides collaboration.
- ER/Studio (Embarcadero, acquired by IDERA): Access, IBM DB2, Informix, Hitachi HiRDB, Firebird, InterBase, MySQL, MS SQL Server, Netezza, Oracle, PostgreSQL, Sybase, Teradata, Visual FoxPro, and others via ODBC/ANSI SQL. Conceptual, logical, physical, ETL. IDEF1X, IE (crow's feet). ER/Studio Repository and Team Server (formerly Portal/CONNECT) for collaboration.
- Enterprise Architect (Sparx Systems): IBM DB2, Firebird, InterBase, Informix, Ingres, Access, MS SQL Server, MySQL, SQLite, Oracle, PostgreSQL, Sybase. Windows, Linux, Mac. Conceptual, logical, and physical, plus MDA transform of logical to physical. IDEF1X, UML DDL, Information Engineering, and ERD. Multi-user collaboration using a file, DBMS, or cloud repository (or transfer via XMI, CVS/TFS, or Difference Merge).
- SQL Server Management Studio (Microsoft): MS SQL Server. Physical only.
- Oracle SQL Developer Data Modeler (Oracle): Oracle, MS SQL Server, IBM DB2. Cross-platform. Logical, physical.
- PowerDesigner (Sybase): MS SQL Server, Oracle, PostgreSQL, MySQL, IBM DB2, Informix.
This is a subset of the data modeling tools on the market. ERwin and ER/Studio are listed first as they are the most commonly used tools and encompass the fullest feature sets. Oracle SQL Developer Data Modeler is based on the Oracle Designer product and was scaled for Oracle JDeveloper; it is now free and part of the SQL Developer product. PowerDesigner was quite popular at one time. Enterprise Architect is embedded in the EA suite, but does not include the full features of the other data modeling tools. SQL Server Management Studio allows diagrams to be developed, or a model to be reverse engineered from a reporting database; it does not encompass many other features. Visio does allow the development of ER diagrams in the Professional edition when the 'data' option is included, but it does not include the features of other data modeling tools. Entity Framework has tooling to allow the creation of an Entity Data Model; this is minimal functionality for a data modeling tool and in theory supports the physical model only.
13
Conceptual Model
A conceptual data model is a high-level or coarse model, abstract in structure and content, that is intended to represent a business area. The conceptual data model can also be called the domain model. This model should be consistent regardless of the application or method of persistence.
(Diagram: supertype/subtype relationship.)
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
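The deck's running example (Person as a supertype of Faculty and Student, which reappears on the physical-model slide) suggests how a supertype/subtype would eventually land in a relational schema. A minimal, illustrative T-SQL sketch follows; the names are assumptions, not taken from the slides.

```sql
-- Person is the supertype; Faculty and Student are subtypes that share
-- the supertype's key (names are illustrative, not from the slides).
CREATE TABLE dbo.Person (
    PersonId  INT          NOT NULL PRIMARY KEY,
    FirstName NVARCHAR(50) NOT NULL,
    LastName  NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Faculty (
    PersonId   INT NOT NULL PRIMARY KEY
               REFERENCES dbo.Person (PersonId),  -- 1:1 with the supertype
    Department NVARCHAR(50)  NOT NULL,
    Salary     DECIMAL(10,2) NULL
);

CREATE TABLE dbo.Student (
    PersonId INT NOT NULL PRIMARY KEY
             REFERENCES dbo.Person (PersonId),
    Fee      DECIMAL(10,2) NOT NULL DEFAULT (0)
);
```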
14
Logical Model
A logical data model (LDM) is a type of data model showing a representation of the organization's data, independent of any particular technology.
Key constructs (shown on the diagram: entity, attribute, primary/surrogate key, business/alternate key, foreign key, supertype, subtype, audit columns, domains):
- Entity: a distinguishable person, place, thing, event, or concept about which information is kept.
- Attribute: a relevant property or characteristic of an entity.
- Primary key: uniquely identifies the row.
- Surrogate key: an attribute or set of attributes generated strictly to serve as an entity's primary key. The data in a surrogate key has no inherent meaning or purpose except to uniquely identify every instance of the entity.
- Business key / alternate key: a natural key that ensures uniqueness according to the business rule. Keys can contain multiple columns.
- Foreign key: field(s) in one table that uniquely identify a row of another table.
- Audit columns: every table should have audit columns to determine the date and time of every entry.
- Domains: define the object once and use it repeatedly by applying the domain to entity attributes and table columns.
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
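To make these constructs concrete, here is a hedged T-SQL sketch showing each one in a single table definition. The tables and names are illustrative assumptions consistent with the educational business case, not the presenter's schema.

```sql
-- Each construct above, rendered in T-SQL (illustrative names):
CREATE TABLE dbo.Department (
    DepartmentKey  INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Department PRIMARY KEY,
    DepartmentName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Course (
    CourseKey     INT IDENTITY(1,1) NOT NULL              -- surrogate key
        CONSTRAINT PK_Course PRIMARY KEY,                 -- primary key
    CourseId      VARCHAR(10) NOT NULL                    -- business/natural key
        CONSTRAINT AK_Course_CourseId UNIQUE,             -- alternate key
    Title         NVARCHAR(100) NOT NULL,                 -- attribute
    DepartmentKey INT NOT NULL
        CONSTRAINT FK_Course_Department                   -- foreign key
        REFERENCES dbo.Department (DepartmentKey),
    CreatedAt     DATETIME2 NOT NULL
        CONSTRAINT DF_Course_CreatedAt DEFAULT SYSUTCDATETIME(),  -- audit column
    UpdatedAt     DATETIME2 NULL                          -- audit column
);
```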
15
Physical Model
A physical data model (or database design) is a representation of a data design which takes into account the capabilities and constraints of a given database management system.
Sample JSON that would be valid for this problem is shown below (SSN values intentionally blank). The JSON is not equivalent to a data model, but if the data is not represented correctly in the JSON, it cannot be used to populate the database correctly.

Course-centric document:
{
  "course": {
    "courseId": "ENG 301",
    "instructor": {
      "name": { "first_name": "Severus", "last_name": "Snape" },
      "ssn": " "
    },
    "students": [
      { "first_name": "Harry", "last_name": "Potter", "ssn": " " },
      { "first_name": "Luna", "last_name": "Lovegood", "ssn": " " }
    ]
  }
}

Person-centric document:
{
  "person": {
    "name": { "first_name": "Severus", "last_name": "Snape" },
    "ssn": " ",
    "active_status": true,
    "person_type": "External",
    "role": {
      "faculty": { "department": "Engineering", "salary": "80000" },
      "student": { "fee": 0 }
    },
    "course": { "courseId": "ENG 301" }
  }
}

The entity in the logical model may become the table in the physical model; the attribute in the logical model may become the column in the physical model. The logical model uses business terminology; the physical model uses the database terminology, and these may not be the same. The physical model may use the terminology of a third-party tool, or it may be named based on legacy systems.
What about reverse engineering? Reverse engineering uses an existing database to determine tables, columns, PKs, and FKs. Some tools will reverse engineer and then try to build an ER diagram. Some tools can infer FKs when none have been declared, by using column names or other indexes. No tool can infer the meaning of a column.
(Diagram: supertype/subtype relationship.)
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
16
Star Schema
A star schema is a representation of a dimensional data model which consists of facts and dimensions: a central fact table holding measures, joined to surrounding dimension tables holding descriptive attributes.
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
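A minimal star for the deck's educational business case might look like the following T-SQL sketch: one fact table of enrollment measures joined to three dimensions. The table and column names are illustrative assumptions.

```sql
-- One fact table of enrollment measures joined to three dimensions.
CREATE TABLE dbo.DimStudent (
    StudentKey INT IDENTITY(1,1) PRIMARY KEY,
    FirstName  NVARCHAR(50) NOT NULL,
    LastName   NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.DimCourse (
    CourseKey INT IDENTITY(1,1) PRIMARY KEY,
    CourseId  VARCHAR(10)   NOT NULL,
    Title     NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.DimDate (
    DateKey  INT  PRIMARY KEY,   -- e.g. 20161210
    FullDate DATE NOT NULL
);

CREATE TABLE dbo.FactEnrollment (
    StudentKey    INT NOT NULL REFERENCES dbo.DimStudent (StudentKey),
    CourseKey     INT NOT NULL REFERENCES dbo.DimCourse (CourseKey),
    DateKey       INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    TuitionAmount DECIMAL(10,2) NOT NULL   -- additive measure
);
```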
17
Summary Tables and Cubes
A summary or aggregation table is created when metrics have been standardized and reports or analytics want to use the information frequently or consistently.
- Aggregation or summary tables are what you get in SSAS. They allow complex aggregations to be done in advance, especially for KPIs or other metrics executed multiple times by multiple people, so everyone gets a consistent response to the same question.
- Within Analysis Services there are two types of 'data models': Multidimensional (disk-based, allows data mining, queried via SQL Server Data Tools or pivot tables) and Tabular (memory-based, like Power Pivot, no data mining, uses the DAX query language). These govern how the aggregation data is stored and accessed; the two modes cannot coexist on the same Analysis Services instance.
- You have to know the business needs and metric definitions in advance; this should have been done in the analysis effort.
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
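Outside of SSAS, the same idea can be done directly in T-SQL. A hedged sketch, assuming the illustrative star schema from the previous slide:

```sql
-- Pre-aggregate a frequently requested metric so every report answers
-- the same question the same way (assumes the star sketched earlier).
SELECT d.DateKey,
       c.CourseId,
       COUNT(*)             AS EnrollmentCount,
       SUM(f.TuitionAmount) AS TotalTuition
INTO   dbo.SummaryEnrollmentByCourse
FROM   dbo.FactEnrollment AS f
JOIN   dbo.DimCourse AS c ON c.CourseKey = f.CourseKey
JOIN   dbo.DimDate   AS d ON d.DateKey   = f.DateKey
GROUP BY d.DateKey, c.CourseId;
```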
18
Predictive Analytics Process
Identify Business Objectives → Exploratory Analysis → Develop Statistical Model → Test Statistical Model → Implement Statistical Model → Monitor & Analyze
There are two areas of modeling here:
- Source data used to build the predictive algorithm. This could come from any data, but most likely a star schema in the warehouse plus unstructured data; it could also come from a cube.
- The predictive model itself. Once defined, a data model may be needed to support the predictive model. This could include a subset of the data (the predictors), aggregated columns following transformations done in the algorithm (sums, logarithms), and possibly other statistics.
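As an illustration of the second area, a hedged T-SQL sketch of a persisted feature table follows, again assuming the illustrative star schema from earlier; the feature choices are assumptions, not the presenter's.

```sql
-- Persist a subset of predictors plus a transformed column so that
-- scoring reuses exactly the features the model was trained on.
SELECT f.StudentKey,
       COUNT(*)                      AS CoursesTaken,       -- predictor
       SUM(f.TuitionAmount)          AS TotalTuition,       -- predictor
       LOG(SUM(f.TuitionAmount) + 1) AS LogTotalTuition     -- +1 guards LOG(0)
INTO   dbo.StudentFeatures
FROM   dbo.FactEnrollment AS f
GROUP BY f.StudentKey;
```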
19
Other Areas
- Data Model Patterns
- Master Data Management
- Code First vs. Model First
- Data Flow Diagrams
- Use Case Diagrams
- Data Governance
- Other areas?
20
Data Model Patterns: events, demographic data, logs, taxonomies & reference data, business objects, auditing, utilization.
When the same object is represented in different ways, it is difficult to get a cross-functional look. These could be common data objects across systems. In a transactional system, these are areas for master data management; in a warehouse, they will be transformed before storing. They could be objects standard across organizations: purchase order header, purchase order line item. For predictive analytics, the data should be clean and may be manipulated.
21
Master Data Management
Master data: core data that is essential to the operation of the business - a consistent and uniform set of identifiers and extended attributes that describes the core entities.
Master Data Management: a methodology that identifies the most critical information within an organization and creates a single view of truth to power business processes; a discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets. It may be technology enabled.
(Diagram: a Master Plan List fed from tCustomer, Customer Config, and Sales Customer Data. Which is the right customer address?)
Mastering is always about having disparate sources of data and bringing them together. Three ways to master data:
- Mutually exclusive: the data is in different sources but there is no overlap. Mastering brings the sources into a single 'list' with a common structure. Example: a class list across departments (Engineering Classes + Math Classes + Philosophy Classes → Master Class List), or vendor contract data.
- Vertically fragmented: the data is in different sources, and different attributes of the data live in different sources. Mastering creates a single view of the data that appears as if all attributes are in one master record. It is important to identify the source of the data by attribute in this type of mastering. Only one application can create a new record, but different applications may update the attributes they are the source for. Example: most likely the plan list. On the slide: one source knows "Name: Severus Snape, SSN, Address: 9 Galen St, Phone"; another knows "Name: S. Snape, Degree: Engineering"; a third knows "Name: Prof. Snape, Emp Id: 456"; the mastered record is "Name: Prof. Severus Snape, SSN, Emp Id: 456, Address: 9 Galen St, Phone, Degree: Engineering".
- Match and merge: the data is in different sources and there is overlap of both rows and attributes. Mastering the data into a single view requires a complicated set of rules: understanding the sources and attributes, ranking the owners, and recognizing the same information arriving from those sources in different forms. These are the match and merge rules. Example: a provider data master.
Why do MDM?
- Portal: provides a single logical view of data with consistent and trustworthy information; users see the same data consistently across all applications and all plans.
- Efficiency: reduces overhead; integrates data.
- Compliance: facilitates industry pressures and government mandates; allows data integrity for the data object.
- Reuse: expedites computing in multiple systems, architectures, platforms, and applications.
- Scalability: supports the projected growth.
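A drastically simplified match-and-merge step can be sketched with the T-SQL MERGE statement. This is a hedged illustration only - the table names are hypothetical, and real rules also rank sources and handle fuzzy matches:

```sql
-- Match on SSN; update the attributes this source owns; insert rows
-- that have no master record yet. Real rules also rank sources.
MERGE dbo.MasterPerson AS m
USING dbo.HrPersonFeed AS s          -- hypothetical source system
   ON m.Ssn = s.Ssn                  -- the match rule
WHEN MATCHED THEN
    UPDATE SET m.FullName   = s.FullName,
               m.Department = s.Department
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Ssn, FullName, Department)
    VALUES (s.Ssn, s.FullName, s.Department);
```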
22
Data Abstraction Layer
Code First vs. Model First
Code First:
- Build code-data structures in memory; an OR/M maps the structures to the database.
- Measured on speed to provide functionality.
- Limits the need to understand database access.
- Goals: ease of development; understand the code.
Model First:
- Understand the data and its future growth; use standards and templates.
- Measured on multiple uses across applications; consistency of the model facilitates efforts.
- Goals: efficient storage; performant retrieval; understand the database.
Data Abstraction Layer: when the same object is represented in different ways, it is difficult to get a cross-functional look. In a transactional system, these are areas for master data management; in a warehouse, they will be transformed before storing; predictive analytics efforts may have to perform manipulation. Abstraction options include microservices and stored procedures (a sketch follows).
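As a sketch of the stored-procedure form of abstraction layer named on the slide - assuming the illustrative Person/Student tables from earlier and a hypothetical dbo.Enrollment table:

```sql
-- Callers depend on the procedure's contract, not on the tables, so the
-- underlying model can change without breaking applications.
-- (dbo.Enrollment here is a hypothetical enrollment table.)
CREATE PROCEDURE dbo.GetCourseRoster
    @CourseId VARCHAR(10)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT p.FirstName, p.LastName
    FROM dbo.Person     AS p
    JOIN dbo.Student    AS st ON st.PersonId = p.PersonId
    JOIN dbo.Enrollment AS e  ON e.PersonId  = st.PersonId
    WHERE e.CourseId = @CourseId;
END;
```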
23
Data Flow Diagram
A data flow diagram is a representation of the movement of data through a system. Be aware that some people will try to pass off a data flow diagram as a data model.
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
24
Use Case Diagram
A use case diagram is a representation of a user's interaction with the system. Use case diagrams are geared toward actors and the things they can do.
(Data contexts: Transactional Data, Warehouse Data, IoT Data, Relational Data, Non-Relational Data.)
25
Twitter: @beth_wolfset
Thank you. We appreciate your interest, and look forward to working with you in the future! Beth Wolfset
Data Governance - the guardian of the data:
- What is collected, how, and why
- Definitions
- Permissions to access, for functionality and data
- Data integration rules: transformation, integrity
- Validation of data handling