Emerging Trends and Technologies in BI GlobalLogic Sunil K Singh

Presentation transcript:

Emerging Trends and Technologies in BI GlobalLogic Sunil K Singh January, 2011 GlobalLogic

Company Overview
10 years of leadership in global software R&D services
Provides full lifecycle product engineering and advisory services for ISVs and software-enabled businesses
Privately held and backed by Sequoia Capital, NEA, Draper Atlantic / NAV and Goldman Sachs
US $170M in revenue, 40%+ CAGR
175+ client partnerships under active management
5,500+ employees
Headquartered in the US with business offices in the UK, Germany, Israel and India
Global R&D Centers and Innovation Labs in the US, Ukraine, India, China and Argentina
“A product development company like GlobalLogic is doing more than just providing offshore developers - it is seeking to collaborate with clients at a strategic level and provide executives with on-demand access to global innovation networks.” (Forrester Research, “Being Innovative Means Moving Beyond the Hype”)

GlobalLogic: A Software R&D Services Company
GlobalLogic has created a network of global innovation hubs made up of some of the brightest and most innovative software minds, connected by a platform that supports agile collaboration, which together accelerate breakthrough products to market.

Industry Focus
Digital Media, Retail, Finance, Infrastructure, Electronics, Healthcare, Telecom, Mobile
Copyright GlobalLogic 2009

The BI (R) Evolution!

First came the Relational Database

Typical Retail Operational Database

create table product_categories (
    product_category_id    integer primary key,
    product_category_name  varchar(100) not null
);

create table manufacturers (
    manufacturer_id    integer primary key,
    manufacturer_name  varchar(100) not null
);

create table products (
    product_id           integer primary key,
    product_name         varchar(100) not null,
    product_category_id  references product_categories,
    manufacturer_id      references manufacturers
);

create table cities (
    city_id     integer primary key,
    city_name   varchar(100) not null,
    state       varchar(100) not null,
    population  integer not null
);

create table stores (
    store_id        integer primary key,
    city_id         references cities,
    store_location  varchar(200) not null,
    phone_number    varchar(20)
);

create table sales (
    product_id         not null references products,
    store_id           not null references stores,
    quantity_sold      integer not null,
    date_time_of_sale  date not null
);

Marketing Trying to Do Some Sales Analysis
How many Oreo cookies were sold yesterday in cities with a population of less than fifty thousand people?

select sum(sales.quantity_sold)
from sales, products, product_categories, manufacturers, stores, cities
where manufacturer_name = 'Oreo'
  and product_category_name = 'cookie'
  and cities.population < 50000
  and trunc(sales.date_time_of_sale) = trunc(sysdate-1) -- restrict to yesterday
  and sales.product_id = products.product_id
  and sales.store_id = stores.store_id
  and products.product_category_id = product_categories.product_category_id
  and products.manufacturer_id = manufacturers.manufacturer_id
  and stores.city_id = cities.city_id;

This query joins six tables through five join conditions, so it is very expensive to run against the operational database.
Let’s copy the data to another database for the marketing people.
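To make the cost of this query concrete, the following is a minimal sketch that runs the same six-table join against an in-memory SQLite database. It assumes SQLite's date() function as a stand-in for Oracle's trunc(), passes "today" in as a parameter instead of using sysdate, and uses made-up sample rows (city names, store addresses and the 'Acme' manufacturer are hypothetical).

```python
import sqlite3
from datetime import date, timedelta

# Operational schema from the slide, with SQLite column types (an assumption).
SCHEMA = """
create table product_categories (product_category_id integer primary key,
                                 product_category_name text not null);
create table manufacturers (manufacturer_id integer primary key,
                            manufacturer_name text not null);
create table products (product_id integer primary key,
                       product_name text not null,
                       product_category_id integer references product_categories,
                       manufacturer_id integer references manufacturers);
create table cities (city_id integer primary key, city_name text not null,
                     state text not null, population integer not null);
create table stores (store_id integer primary key,
                     city_id integer references cities,
                     store_location text not null, phone_number text);
create table sales (product_id integer not null references products,
                    store_id integer not null references stores,
                    quantity_sold integer not null,
                    date_time_of_sale text not null);
"""

def build_demo_db():
    """Create the schema and load a few hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript("""
        insert into product_categories values (1, 'cookie'), (2, 'cracker');
        insert into manufacturers values (1, 'Oreo'), (2, 'Acme');
        insert into products values (1, 'Oreo Original', 1, 1),
                                    (2, 'Acme Saltines', 2, 2);
        insert into cities values (1, 'Smallville', 'KS', 30000),
                                  (2, 'Metropolis', 'NY', 8000000);
        insert into stores values (1, 1, '1 Main St', null),
                                  (2, 2, '500 Broadway', null);
        insert into sales values
            (1, 1, 5,   '2011-01-09 10:00:00'),  -- counted
            (1, 1, 3,   '2011-01-09 15:30:00'),  -- counted
            (1, 2, 100, '2011-01-09 12:00:00'),  -- city too large
            (1, 1, 7,   '2011-01-08 09:00:00'),  -- wrong day
            (2, 1, 4,   '2011-01-09 11:00:00');  -- other manufacturer
    """)
    return conn

def oreo_sales_yesterday(conn, today):
    """Six-table join: units sold yesterday in cities under 50,000 people."""
    yesterday = (today - timedelta(days=1)).isoformat()
    row = conn.execute(
        """
        select coalesce(sum(s.quantity_sold), 0)
        from sales s
        join products p on s.product_id = p.product_id
        join product_categories pc
             on p.product_category_id = pc.product_category_id
        join manufacturers m on p.manufacturer_id = m.manufacturer_id
        join stores st on s.store_id = st.store_id
        join cities c on st.city_id = c.city_id
        where m.manufacturer_name = 'Oreo'
          and pc.product_category_name = 'cookie'
          and c.population < 50000
          and date(s.date_time_of_sale) = ?
        """,
        (yesterday,),
    ).fetchone()
    return row[0]

if __name__ == "__main__":
    print(oreo_sales_yesterday(build_demo_db(), date(2011, 1, 10)))
```

Every one of the five joins has to be evaluated on each run, which is what motivates moving analysis off the operational database.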

Then Came the Data Warehouse

Pick a FACT as the Center of the Data Warehouse
Marketing cares most about sales
Let us create a fact table on sales

create table sales_fact (
    sales_date    date not null,
    product_id    integer,
    store_id      integer,
    unit_sales    integer,
    dollar_sales  number
);

You can fill this table at a scheduled time from the operational database
This is your ETL process
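A scheduled load of the fact table could look roughly like the sketch below, written against SQLite for illustration. It is a full-refresh ETL step that aggregates operational sales by day, product and store. The operational schema on the earlier slide has no price column, so the unit_price column on products is a hypothetical addition, needed only to derive dollar_sales.

```python
import sqlite3

def build_operational_db():
    """Tiny operational database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- unit_price is an assumption; the slide's products table has no price
        create table products (product_id integer primary key,
                               unit_price real not null);
        create table sales (product_id integer not null,
                            store_id integer not null,
                            quantity_sold integer not null,
                            date_time_of_sale text not null);
        create table sales_fact (sales_date text not null,
                                 product_id integer,
                                 store_id integer,
                                 unit_sales integer,
                                 dollar_sales real);
        insert into products values (1, 2.5), (2, 1.0);
        insert into sales values
            (1, 1, 5, '2011-01-09 10:00:00'),
            (1, 1, 3, '2011-01-09 15:30:00'),
            (1, 2, 2, '2011-01-09 09:00:00'),
            (2, 1, 4, '2011-01-10 11:00:00');
    """)
    return conn

def load_sales_fact(conn):
    """Extract from sales, transform by aggregating per day, load the fact table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("delete from sales_fact")  # full refresh keeps reruns idempotent
        conn.execute("""
            insert into sales_fact
                (sales_date, product_id, store_id, unit_sales, dollar_sales)
            select date(s.date_time_of_sale),
                   s.product_id,
                   s.store_id,
                   sum(s.quantity_sold),
                   sum(s.quantity_sold * p.unit_price)
            from sales s
            join products p on p.product_id = s.product_id
            group by date(s.date_time_of_sale), s.product_id, s.store_id
        """)

def fact_rows(conn):
    """Return the fact table in a stable order for inspection."""
    return conn.execute("""
        select sales_date, product_id, store_id, unit_sales, dollar_sales
        from sales_fact
        order by sales_date, product_id, store_id
    """).fetchall()

if __name__ == "__main__":
    db = build_operational_db()
    load_sales_fact(db)
    for row in fact_rows(db):
        print(row)
```

A production load would more likely be incremental (only new days) than a full refresh; the full refresh is used here only to keep the sketch short.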

Different DIMENSIONS can be created about the FACT
For example, we are interested in sales from a store
Let us create a DIMENSION table

create table stores_dimension (
    stores_key      integer primary key,
    name            varchar(100),
    city            varchar(100),
    county          varchar(100),
    state           varchar(100),
    zip_code        varchar(100),
    date_opened     date,
    date_remodeled  date,
    store_size      varchar(100),
    ...
);

Now a query on sales by city takes a single join across two tables

select sd.city, sum(f.dollar_sales)
from sales_fact f, stores_dimension sd
where f.store_id = sd.stores_key
group by sd.city
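The single-join query can be tried out with a small in-memory SQLite database. This sketch assumes that the fact table's store_id column holds the dimension's stores_key, keeps only a few of the dimension columns, and uses hypothetical store names and cities.

```python
import sqlite3

def build_star_db():
    """One fact table and one dimension table with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table stores_dimension (stores_key integer primary key,
                                       name text,
                                       city text,
                                       state text);
        create table sales_fact (sales_date text not null,
                                 product_id integer,
                                 store_id integer,   -- assumed to hold stores_key
                                 unit_sales integer,
                                 dollar_sales real);
        insert into stores_dimension values
            (1, 'Store A', 'Austin', 'TX'),
            (2, 'Store B', 'Austin', 'TX'),
            (3, 'Store C', 'Boston', 'MA');
        insert into sales_fact values
            ('2011-01-09', 1, 1, 8, 20.0),
            ('2011-01-09', 1, 2, 2, 5.0),
            ('2011-01-09', 2, 3, 4, 4.0);
    """)
    return conn

def sales_by_city(conn):
    """Dollar sales per city: one join between the fact and the dimension."""
    return conn.execute("""
        select sd.city, sum(f.dollar_sales)
        from sales_fact f
        join stores_dimension sd on f.store_id = sd.stores_key
        group by sd.city
        order by sd.city
    """).fetchall()

if __name__ == "__main__":
    for city, dollars in sales_by_city(build_star_db()):
        print(city, dollars)
```

Because city is stored directly on the dimension row, the stores-to-cities join of the operational schema disappears; that denormalization is the trade made by a star schema.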

Traditional Approach to BI
[Diagram, reconstructed from its labels. Flow: Enterprise Systems -> Staging -> Data Warehouse -> Datamart / OLAP layer -> End User Tools, with a feedback loop.
Enterprise Systems (each feeding an Extract step): core production systems, financial systems, sales systems, other systems & flat files, external data.
Staging (Transform, then Load): data cleanup, lookup, validation, mapping, value, sort, join, aggregation, etc.
OLAP operations: slice, dice, rollup, drilldown, pivot.
End User Tools: enterprise reporting (Crystal, BIRT…), analytic applications (SAS, SPSS…), machine learning, decision modeling.]

Data Warehouse
A collection of a large amount of data which is cleaned, transformed and cataloged, and is made available for use in data mining, online analytical processing, market research and decision support
Method of storage: Normalized vs. Dimensional
  Normalized: follows rules similar to database normalization; tables are grouped by subject area
  Dimensional: transactions are split into “Facts” and “Dimensions”. Facts are numeric measures, whereas dimensions are the reference information that describes the facts

Data Warehouse (Cont.)
Schema design: Snowflake or Star schema
Read-only access
The term OLAP was created as a slight modification of the traditional database term OLTP (OnLine Transaction Processing)
  MOLAP: Multi-dimensional OLAP, which uses a multi-dimensional cube to store the data
  ROLAP: Relational OLAP, with an RDBMS as the underlying storage technology
  HOLAP: Hybrid OLAP, which uses a mix of relational and multi-dimensional technology
ETL stands for Extract, Transform, Load
  Some shops use home-grown ETL written in languages such as shell script, Perl, Python, Ruby and Java
  Others use ETL tools: Informatica, SAP and MS SSIS (commercial); Talend and Pentaho Kettle (open source)
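One way to picture the multi-dimensional cube behind MOLAP is as a mapping from dimension coordinates to a measure. The sketch below is a toy model, not how any particular OLAP engine stores data: the dimension names, the cell values and the helper functions slice_cube and roll_up are all hypothetical. It shows two of the classic cube operations, slice and rollup.

```python
from collections import defaultdict

# Order of the coordinates in every cell key (hypothetical dimensions).
DIMENSIONS = ("city", "product", "month")

# Each cell maps (city, product, month) to a dollar_sales measure.
CUBE = {
    ("Austin", "cookies", "2011-01"): 20.0,
    ("Austin", "milk", "2011-01"): 5.0,
    ("Boston", "cookies", "2011-01"): 4.0,
    ("Boston", "cookies", "2011-02"): 6.0,
}

def slice_cube(cells, dimension, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return {key: m for key, m in cells.items() if key[position] == value}

def roll_up(cells, keep):
    """Rollup: sum the measure over every dimension not listed in keep."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for key, measure in cells.items():
        totals[tuple(key[p] for p in positions)] += measure
    return dict(totals)

if __name__ == "__main__":
    # Total sales per city, across all products and months.
    print(roll_up(CUBE, ("city",)))
    # Cookie sales per month: slice on product, then roll up to month.
    print(roll_up(slice_cube(CUBE, "product", "cookies"), ("month",)))
```

A ROLAP engine would answer the same questions with group-by queries over fact and dimension tables instead of a pre-built cube.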

Then Came the Internet and the Explosion of Data on the Web

Web 2.0 BI Approach
[Diagram, reconstructed from its labels. Requests arrive from the Internet at the website and pass through a load balancer in the DMZ. Each web node has a request dispatcher, a request logger (producing log entries), a service processor and a service responder driven by rules and operational data. In the corporate data center, Map/Reduce tasks process the log entries together with external inputs (web crawler, transaction-related info, third-party suppliers such as Doubleclick, web application data providers) to produce user behavior analysis, trend analysis, customer behavior statistics and decision support, and feed new rules back to the web nodes.]

And suddenly Data Mining is the new BI !

Data Mining – a process view
Many definitions
  Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Why Mine Data – Commercial Viewpoint
Lots of data is being collected and warehoused
  Web data: Yahoo! collects 10 GB/hour
  Purchases at department/grocery stores: Walmart records 20 million transactions per day
  Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
  Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data – Scientific Viewpoint
Data is collected and stored at enormous speeds (GB/hour)
  Remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of Earth Science data per year
  Telescopes scanning the skies: sky survey data
  Gene expression data
  Scientific simulations: terabytes of data generated in a few hours
Traditional techniques are infeasible for raw data
Data mining may help scientists
  in automated analysis of massive data sets
  in hypothesis formation

Common Data Mining Techniques
Clustering
Predictive Modeling
Anomaly Detection
Association Rules
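Of these techniques, association rules are the easiest to show in a few lines. The sketch below computes the two standard measures of a rule such as "milk => bread" over a set of shopping baskets: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The baskets are made-up sample data, and a real miner such as Apriori would also search for candidate itemsets rather than score a single given rule.

```python
# Hypothetical market baskets; each basket is the set of items in one purchase.
BASKETS = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "beer"},
]

def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    wanted = frozenset(itemset)
    return sum(1 for basket in transactions if wanted <= basket)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent (0.0 if the antecedent never occurs)."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0
    both = count(transactions, set(antecedent) | set(consequent))
    return both / base

if __name__ == "__main__":
    print("support(milk, bread) =", support(BASKETS, {"milk", "bread"}))
    print("confidence(milk => bread) =", confidence(BASKETS, {"milk"}, {"bread"}))
```

A rule is usually kept only if both measures exceed thresholds chosen by the analyst.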

Amazon.com Case Study: Personalized Customer Relationship Management

Amazon.com 5-step loyalty model
Step                            Amazon’s action
1. Need creation                anticipate / stimulate
2. Information search           provide / assist
3. Evaluate alternatives        assist / negate
4. Purchase transaction         optimise / reward
5. Post-purchase experience     add value

Step 1: Need Creation (anticipate / stimulate)

Step 2: Information Search (provide / assist)

Step 3: Evaluation of Alternatives (assist / negate)

Step 4: Purchase Optimisation/Reward (optimise / reward)
1-click purchase
‘Slippery check-out counter’ vs. ‘sticky aisles’

Step 5: Post-purchase Experience (add value)

Internet Marketing Insight – Jeff Bezos
Role of advertisement: get the customer to the store
Role of customer experience: get the customer to buy
Brick & mortar stores
  Getting the customer to the store is the hard part
  Shopping cart abandonment is not common, since the overhead of going to another store is very high, especially in Minnesota winters!
  Marketing expenses: 80% for advertisement; 20% for customer experience
The 80-20 rule should be reversed for on-line stores

Difference in Two BI Approaches
Traditional (enterprise approach)
  Mainly used for executive reports, consumed by humans
  Medium-sized data volume at enterprise scale, not web scale
  Very batch-oriented; weekly or monthly is the norm
  ETL (Informatica)
  Data warehouse (RDBMS, fact / dimension tables, star / snowflake schema)
  Multi-dimensional (ROLAP, MOLAP, slice / dice / rollup / drilldown)
  Analytic tool (Business Objects)
Modern (Web 2.0 company approach)
  Mainly used for data mining and for an automatic feedback loop for adaptation
  Gigantic data volume at web scale, from many different sources
  Tight feedback loop; latency is within seconds or minutes
  ETL (more tolerant of unclean data, but must process it at high speed)
  Data warehouse (distributed file systems, NoSQL)
  Map/Reduce parallel processing (Hadoop)
  Analytic tool (Hive / R)

BI with Unstructured Data: Hadoop + Vertica

Big Data Comes in Three Forms
Unstructured: images, sound, video
Semi-structured: logs, data feeds, event streams
Fully structured: relational tables

Near Time BI Reporting on Continuous Data Stream
[Diagram, reconstructed from its labels. An expected high-volume incoming data stream from the operational system enters the processing system, where MOM / CEP / HOP handles the streaming data and feeds a real-time dashboard. MapReduce (M, R) jobs, supported by a lookup DB, write data to HDFS. In the BI reporting system, a BI adaptor and aggregator serve queries from any BI reporting tool for near-time reporting.]
The data volume will determine the underlying technology framework (MOM, CEP or HOP)
Using commodity hardware

Near Real-Time BI Reporting
Raw incoming data gets processed in real time
Depending on the velocity of the incoming data stream, different technologies will be used to pre-process the data
  MOM (Message Oriented Middleware)
  CEP (Complex Event Processing)
  HOP (Hadoop Online Prototype)
Incoming data will be divided into smaller batches and forwarded to a MapReduce processor
Processed data will typically be stored in a distributed file system such as HDFS
Processed data will be pushed to, or pulled by, the target BI reporting application or tools
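The batch-then-MapReduce part of this flow can be sketched in plain Python, without Hadoop. In the sketch below, an incoming event stream is cut into small batches, each batch goes through a map step and a reduce step, the per-batch results are appended to a list that stands in for HDFS, and a reporting function merges them on demand. The event shape (a dict with a "page" field) and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import islice

def micro_batches(stream, size):
    """Cut an incoming stream into lists of at most size events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def map_event(event):
    """Map step: emit one (key, value) pair per page view."""
    yield (event["page"], 1)

def reduce_counts(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def process_stream(stream, batch_size):
    """Run map and reduce over each batch and store the partial results.
    The returned list is a stand-in for files written to HDFS."""
    store = []
    for batch in micro_batches(stream, batch_size):
        pairs = (pair for event in batch for pair in map_event(event))
        store.append(reduce_counts(pairs))
    return store

def report(store):
    """Pull step for a BI tool: merge the stored partial results."""
    merged = defaultdict(int)
    for partial in store:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

if __name__ == "__main__":
    events = [{"page": p} for p in ["/home", "/cart", "/home", "/home", "/cart"]]
    stored = process_stream(events, batch_size=2)
    print(stored)
    print(report(stored))
```

Smaller batches shorten the delay before data shows up in a report, at the cost of more partial results to merge.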

What do people do with Hadoop?
Parse logs
Look for patterns
Archive data
Transform data

Vertica® Analytic Database
MPP columnar architecture
Second to sub-second queries
300GB/node load times
Scales to hundreds of TBs
Standard ETL & reporting tools
www.vertica.com

Availability, Scalability and Efficiency
…how fast can you go from data to answers?
Unstructured data needs to be analyzed to make sense of it.
Semi-structured data is parsed based on a spec (or by brute force).
Structured data can be optimized for ad-hoc analysis.

Hadoop / Vertica
Hadoop
  Distributed processing framework (MapReduce)
  Distributed storage layer (HDFS)
Vertica can be used as a data source and target for MapReduce
Data can also be moved between Vertica and HDFS (Sqoop)
Hadoop talks to Vertica via custom input and output formatters

Hadoop / Vertica
Vertica serves as a structured data repository for Hadoop
[Diagram: a Hadoop compute cluster (Map, Reduce) alongside Vertica]

Hadoop / Vertica
Vertica’s input formatter takes a parameterized query
Relational Map operations can be pushed down to the database
Vertica’s output formatter takes an existing table name or a description
Vertica output tables can be optimized directly from Hadoop

Hadoop / Vertica
Federate multiple Vertica database clusters with Hadoop
[Diagram: several Hadoop compute clusters (Map, Reduce) alongside multiple Vertica database clusters]

Data Mining for Computational Social Sciences A Case Study from Virtual Worlds

Online Games
Massively Multiplayer Online Role Playing Games (MMORPGs) are computer games that allow hundreds to thousands of players to interact and play together in a persistent online world
Popular MMO games: Everquest 2, World of Warcraft and Second Life

MMORPG – Everquest 2
MMORPGs (MMO Role Playing Games) are the most popular of MMO games
  Examples: World of Warcraft by Blizzard and Everquest 2 by Sony Online Entertainment
Various logs of players’ behavior are maintained
Player activity in the environment, as well as his/her chat, is recorded at regular time instances; each such record carries a time stamp and a location ID
Some of the logs capture different aspects of player behavior
  Guild membership history (member of, kicked out of, joined, left)
  Achievements (quests completed, experience gained)
  Items exchanged and sold/bought between players
  Economy (items/properties possessed/sold/bought, banking activity, looting, items found/crafted)
  Faction membership (faction affiliation, record of actions affecting faction affiliation)

Social Science Data Mining with EverQuest 2 Data
Improve understanding of the dynamics of group behavior
MMORPG data enables us to look at the dynamics of groups in a new way
  Multiple groups are part of a large social network
  Individuals from the social network can join or leave groups
  Groups are not isolated and some of them can be related, i.e. they may be geared towards specific objectives, each of which works towards a larger goal (e.g. different teams working towards disaster recovery)
  The emergence and destruction of groups, as well as their dynamic memberships, depend on the underlying social network as well as the environment

Thank You!
We are always looking for good engineers who are passionate about technology.
For more information, please contact info@globallogic.com