Data Analytics Summit III

Data Analytics Summit III
Path-to-Purchase Analytics Using a Data Lake and Spark March 9th, 2017 PRESENTED BY Joe Caserta

Caserta Timeline #HUAnalytics @joe_caserta 2016 2014 2013 2012 2009
Awarded Fastest Growing Big Data Companies 2016 2014 Top 20 Most Powerful Big Data consulting firms Awarded Top 20 Big Data Companies 2016 Dedicated to Data Governance Techniques on Big Data (Innovation) Established best practices for big data ecosystem implementations 2013 Launched Big Data Warehousing (BDW) Meetup NYC:4,000 Members Launched Data Science, Data Interaction and Cloud practices 2012 Laser focus on extending Data Analytics with Big Data solutions 2009 Launched Big Data practice 2004 Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley) 2001 Founded Caserta Concepts in NYC Web log analytics solution published in Intelligent Enterprise magazine 1996 Data Analysis, Data Warehousing and Business Intelligence since 1996 Began consulting database programing and data modeling 1986 30+ years hands-on experience building database solutions

About Caserta Concepts
Consulting Data Innovation and Modern Data Engineering Award-winning company Internationally recognized work force Strategy, Architecture, Implementation, Governance Innovation Partner Strategic Consulting Advanced Architecture Build & Deploy Leader in Enterprise Data Solutions Big Data Analytics Data Warehousing Business Intelligence Data Science Cloud Computing Data Governance

Awards & Recognition #HUAnalytics @joe_caserta
Top 10 Fastest Growing Big Data Companies 2016 a consequence of having built a strong innovative business - Awards & recognition - recognized in the market in 2013, 2014, 2015 They demonstrate sustained recognition over the years and not just many years ago - recent 5th of IT in NYC

Our Partners

Use Case Objectives Cross-Channel Behavior Tracking / Identity Resolution Access to Atomic Level Transaction Data Blend Data Assets Fast Data Onboarding and Discovery Ensure Data Quality Single platform / Landscape simplification Understand Path-to-Purchase Customer Journey Improve Customer Experience & Increase Sales

The Recipe: Build a Dynamic Data Platform
OLD WAY: Structure Data  Ingest Data  Analyze Data Fixed Capacity Monolith NEW WAY: Ingest Data  Analyze Data  Structure Data Dynamic Capacity Ecosystem RECIPE: Cloud Data Lake Holistic Architecture & Framework The paradigm shift is in the way we onboard and process data, Gone from first structuring data before ingest and analysis, to ingest, analyze and then structure. immediate access for analysts and data scientists streamlining the path to cash register We have moved from fixed capacity to on demand Large datasets and new data sets being added time, Could grow or shrink on demand, many providers are startups Minimize cost of operations Monolith to Ecosystem No one set of tools will solve everything Use a diverse set of technologies and let them evolve over time Solve for this using a combination of three concepts: Cloud Computing, Data lake, and the Polyglot Warehouse.

Analytics: The Whole Brain Challenge
Analytics Oriented Data Science Research Revenue Oriented Revenue Goals Monetizing Data Front Operations Oriented Shared Services Data Engineering Process Oriented Data Governance Compliance Back Most disruptive time in my career. More change in last 3 years than previous 27.

It Takes a Village! Chief Data Organization (Oversight)
Data Gov Coordinator Data Librarians Data Stewards Vertical Business Area [Sales/Finance/Marketing/Operations/Customer Svc] Planning Organization Project Managers Product Owner SCRUM Master IT Organization (Oversight) Enterprise Data Architect Solution Engineers Data Integration Practice User Experience Practice QA Practice Operations Practice Development Team Business Subject Matter Expertise Data Librarian/Data Stewardship Data Science/ Statistical Skills Data Engineering / Architecture Presentation/ BI Report Development Skills Data Quality Assurance DevOps Advanced Analytics Business Analysts Data Analysts Data Scientists Statisticians Data Engineers

Unexpected Reaction to Change
I’ve been doing it this way for 15 years. It works, don’t mess with it! People must learn: Evolution is inevitable. Evolve or die.

Moving the Status Quo Forces for Change Forces Resisting Change
Global economics Period of time in present role Intensity of competition Status & perks of office/dept under threat Reduce costs No apparent reasons for proposed changes Move to cross-functional teams Lack of understanding of proposed changes New executive leadership Fear of inability to cope with new technology Speed of technical change Concern over job security Social trends and changes Kurt Lewin’s Force Field analysis Status Quo

The Data Lake Paradigm Technology: Scalable distributed storage  S3
Pluggable fit-for-purpose processing  EMR Consistent extensible framework  Spark Dimensional Data Warehouse  Redshift Functional Capabilities: Remove barriers between data ingestion and analysis Democratize Data with Just Enough Data Governance JOE Ambiguous, some would say it’s a product, or platform Analogous to the old EDW days, “I have cognos, I have a data warehouse” To onboard data we provided several interfaces for the data providers. lightweight wrapper on native S3 service methods. data is stored in the S3 landing area of the data lake partitioning schema automatically (basically a key structure of client + ingestion date or data date). The goal here is to provide immediate access to our engineers and data scientists. T The real power of “late bind”. need to “dig in” and convert this raw data into information for our analytics platform and revenue systems ASAP need to remove all obstacles from this process. #TDWI

Why AWS?

Why Spark? “Big Box” tools vs ROI? Prohibitively expensive  limited by licensing $$$ Typically limited to the scalability of a single server We Spark! Development local or distributed is identical Beautiful high level API’s Full universe of Python modules Open source and Free Blazing fast! Databricks cloud makes it easier Spark has become our default processing engine for a myriad of engineering & science problems

Data Architecture is the new Data Model
Usage Pattern Data Governance Big Data Warehouse Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” Arbitrary/Ad-hoc Queries and Reporting Fully Governed ( trusted) Munging, Blending Machine Learning Data Quality and Monitoring Metadata, ILM , Security Data Catalog Data Integration Organize, Define, Complete Ingest Raw Data Metadata, ILM, Security Data has different audience and usage patterns each tier. All tiers work cohesively to comprise the Big Data Ecosystem All tiers are governed. Only the top tier is fully governed When to use late bind, decided when to structure on case by case. 7 components of gov: Org, Metadata, Security, DQ, Business Integration, MDM, ILM Organization This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. Metadata Definitions, lineage (where does this data come from), business definitions, technical metadata Privacy/Security Identify and control sensitive data, regulatory compliance Data Quality and Monitoring Data must be complete and correct. Measure, improve, certify Business Process Integration Policies around data frequency, source availability, etc. Master Data Management Ensure consistent business critical data i.e. Members, Providers, Agents, etc. Information Lifecycle Management (ILM) Data retention, purge schedule, storage/archiving

Data Lifecycle Data Engineer Output Ingest Data Scientist
Architect how data is organized & ensure operability Output Ingest Deliver & Deploy Model Understand Problem Data Scientist Deep analytics and modeling for hidden insights Communicate Results Ingest Data Business Analyst Work with data to apply insights to business strategy Create and build Models Explore and Understand Data App Developer Integrates data & insights with existing or new applications Evaluate Data Clean and Shape Data Analyze Transform Google CRISP-DM

Path-to-Purchase Methods
Single Touch Rules-Based Statistically Driven Assign the credit to the first or last exposure Assign the credit to each interaction based on business rules Assign the credit to interactions based on data-driven model AttributionType Ad-Click Mailing Ad-Click Mailing Ad-Click 100% 33% 33% 33% 27% 49% 24% Comments Last touch only Ignores bulk of customer journey Undervalues other interactions and influencers Subjective Assigns arbitrary values to each interaction Lacks analytics rigor to determine weights Looks at full behavior patterns Consider all touch points Can apply different models for best results Use data to find correlations between touch points (winning combinations)

Unifying the Customer Across Channels
Customer Data Integration (CDI): Match and manage customer information from all available sources Marketing channels: DMP, Salesforce, Adobe, Social, Direct Mail, Call Center, CRM In other words… We need to figure out how to LINK people across systems!

Mastering Master Data is Still MDM
Validate Standardize Match Survivorship

Standardization and Matching
Cleanse and Parse: Names Resolve nicknames Create deterministic hash, phonetic representation Addresses s Phone Numbers Matching: Join based on combinations of cleansed and standardized data to create match results: Spark map operations: Data cleansing, transformation, and standardization Address Parsing: usaddress, postal-address, etc Name Hashing: fuzzy, etc Genderization: sexmachine, etc A COMBINATION OF DETERMINISTIC AND PROBALISTIC MATCHING OF various fields based on business rules

Mastering Unmanageable Source Data
Reveal Wait for the customer to “reveal” themselves Create link between anonymous self and known profile Vector May need behavioral statistical profiling Compare use vectors Rebuild Recluster all prior activities Rebuild the Graph

The Matching Process The matching process output gives us the relationships between customers: Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the same person (and this is a trivial case) And we need to cluster and identify our survivors (vertex) xid yid match_type 1234 4849 phone 5499 1235 address 7788 cookie

Graph to the Rescue Don’t think table… think Graph!
1234 4849 5499 7788 These matches are actually communities 1235 We just need to import our edges into a graph and “dump” out communities

Connected Components Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex This lowest number vertex can serve as our “survivor” (not field survivorship) xid yid 1234 4849 5499 1235 7788

Identity Resolution Process

Data Ecosystem Architecture
ODS ETL/ID Res

The Notebook is the ETL Tool

The Notebook is the Data Science Tool

The Redshift DW is still Dimensional

Use Graph for Data Lineage

The Goal: Top Conversion Paths
distributed lag-econometric model (time decay) discrete time survival model (all people start at different times and there is no universal ending) compare media touch points with positive result vs negative result use logistic link function to connect predictors and Y variables tokenization Source: Accelerom AG, Zurich

Recap Data Lake: S3 / NoSQL or SQL as Needed
Identity Resolution: Spark / Graph Frames ETL: Spark / Airflow Path-to-Purchase: S3 / Spark / MLlib Data Warehouse: RedShift Shared Interface: Notebooks (Jupyter or Databricks) Business Intelligence: Tableau

Thank You Joe Caserta President, Caserta Concepts @Joe_Caserta Data is not important, it’s what you do with it that’s important! Award-winning company Transformative Data Strategies Modern Data Engineering Advanced Architecture Innovation Partner Strategic Consulting Advanced Technical Design Build & Deploy Solutions BDW Meetup New York City 3,000+ members Knowledge sharing

Data Analytics Summit III

Similar presentations

Presentation on theme: "Data Analytics Summit III"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Analytics Summit III

Similar presentations

Presentation on theme: "Data Analytics Summit III"— Presentation transcript:

Similar presentations

About project

Feedback