
1 Emerging Trends and Technologies in BI
Sunil K Singh, GlobalLogic
January 2011

2 Company Overview
10 years of leadership in global software R&D services
Provides full lifecycle product engineering and advisory services for ISVs and software-enabled businesses
Privately held and backed by Sequoia Capital, NEA, Draper Atlantic / NAV and Goldman Sachs
US $170M in revenue, 40%+ CAGR; 175+ client partnerships under active management; 5,500+ employees
Headquartered in the US with business offices in the UK, Germany, Israel and India; Global R&D Centers and Innovation Labs in the US, Ukraine, India, China and Argentina
“A product development company like GlobalLogic is doing more than just providing offshore developers — it is seeking to collaborate with clients at a strategic level and provide executives with on-demand access to global innovation networks.” — Forrester Research, “Being Innovative Means Moving Beyond the Hype”

3 GlobalLogic—A Software R&D Services Company
GlobalLogic has created a network of global innovation hubs, made up of some of the brightest and most innovative software minds and connected by a platform that supports agile collaboration, which together accelerate breakthrough products to market.

4 Industry Focus
Digital Media, Retail, Finance, Infrastructure, Electronics, Healthcare, Telecom, Mobile

5 The BI (R) Evolution!

6 First came the Relational Database

7 Typical Retail Operational Database
create table product_categories (
  product_category_id integer primary key,
  product_category_name varchar(100) not null
);

create table manufacturers (
  manufacturer_id integer primary key,
  manufacturer_name varchar(100) not null
);

create table products (
  product_id integer primary key,
  product_name varchar(100) not null,
  product_category_id integer references product_categories,
  manufacturer_id integer references manufacturers
);

create table cities (
  city_id integer primary key,
  city_name varchar(100) not null,
  state varchar(100) not null,
  population integer not null
);

create table stores (
  store_id integer primary key,
  city_id integer references cities,
  store_location varchar(200) not null,
  phone_number varchar(20)
);

create table sales (
  product_id integer not null references products,
  store_id integer not null references stores,
  quantity_sold integer not null,
  date_time_of_sale date not null
);

8 Marketing Trying to do Some Sales Analysis
How many Oreo cookies were sold yesterday in cities with a population of less than fifty thousand people?

select sum(sales.quantity_sold)
from sales, products, product_categories, manufacturers, stores, cities
where manufacturer_name = 'Oreo'
  and product_category_name = 'cookie'
  and cities.population < 50000
  and trunc(sales.date_time_of_sale) = trunc(sysdate-1)  -- restrict to yesterday
  and sales.product_id = products.product_id
  and sales.store_id = stores.store_id
  and products.product_category_id = product_categories.product_category_id
  and products.manufacturer_id = manufacturers.manufacturer_id
  and stores.city_id = cities.city_id;

This query joins six of the seven tables in the schema and is very expensive. Let's copy the data to another database for the marketing people.

9 Then Came the Data Warehouse

10 Pick a FACT as the Center of Data Warehouse
Marketing cares most about sales, so let us create a Fact table on sales:

create table sales_fact (
  sales_date date not null,
  product_id integer,
  store_id integer,
  unit_sales integer,
  dollar_sales number
);

You can fill this table at a scheduled time from the operational database. This is your ETL process.
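A minimal sketch of that scheduled load, assuming the nightly job aggregates yesterday's rows from the operational sales table; the unit_price column on products is a hypothetical addition (the operational schema shown earlier carries no price column):

insert into sales_fact (sales_date, product_id, store_id, unit_sales, dollar_sales)
select trunc(s.date_time_of_sale),            -- strip the time-of-day portion
       s.product_id,
       s.store_id,
       sum(s.quantity_sold),
       sum(s.quantity_sold * p.unit_price)    -- unit_price is an assumed column
from sales s, products p
where s.product_id = p.product_id
  and trunc(s.date_time_of_sale) = trunc(sysdate - 1)  -- yesterday only
group by trunc(s.date_time_of_sale), s.product_id, s.store_id;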

11 Different DIMENSIONS can be created about the FACT
For example, we are interested in sales from a store, so let us create a DIMENSION table:

create table stores_dimension (
  stores_key integer primary key,
  name varchar(100),
  city varchar(100),
  county varchar(100),
  state varchar(100),
  zip_code varchar(100),
  date_opened date,
  date_remodeled date,
  store_size varchar(100),
  ...
);

Now a query on sales by city takes one join across 2 tables:

select sd.city, sum(f.dollar_sales)
from sales_fact f, stores_dimension sd
where f.store_id = sd.stores_key
group by sd.city;

12 Traditional Approach to BI
Diagram: core production systems, financial systems, sales systems and other systems & flat files (plus external data) are Extracted into a Staging area, where data cleanup takes place (lookup, validation, mapping, value sort, join, aggregation, etc.). Transformed data is Loaded into the Data Warehouse and Datamarts, and an OLAP layer (slice, dice, rollup, drilldown, pivot) serves the end-user tools: enterprise reporting (Crystal, BIRT…), analytic applications (SAS, SPSS…), machine learning and decision modeling, with a feedback loop back to the enterprise systems.

13 Data Warehouse
A collection of a large amount of data that is cleaned, transformed and cataloged, and made available for use in data mining, online analytical processing, market research and decision support.
Method of storage: Normalized vs. Dimensional
Normalized: similar to database normalization rules; tables are grouped by subject area
Dimensional: transactions are split into “Facts” and “Dimensions”; Facts are numbers, whereas Dimensions are reference information about the Facts

14 Data Warehouse (Cont.) Schema design – Snowflake or Star Schema
Read-only access.
The term OLAP was created as a slight modification of the traditional database term OLTP (OnLine Transaction Processing).
MOLAP: Multi-dimensional OLAP, which uses a multi-dimensional cube to store the data
ROLAP: Relational OLAP, with an RDBMS as the underlying storage technology (a rollup query sketch follows below)
HOLAP: Hybrid OLAP, which uses a mix of relational and multi-dimensional technology
ETL stands for Extract, Transform, Load. Some shops use home-grown ETL written in shell script, Perl, Python, Ruby or Java; others use ETL tools such as Informatica, SAP and MS SSIS (commercial) or Talend and Pentaho Kettle (open source).
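A minimal sketch of a ROLAP-style rollup over the star schema from the earlier slides, assuming the fact table's store column joins to stores_dimension.stores_key. GROUP BY ROLLUP is standard SQL: it returns city-level totals within each state, state subtotals and a grand total in one pass, which is exactly what a rollup / drilldown UI navigates.

select sd.state, sd.city, sum(f.dollar_sales) as dollar_sales
from sales_fact f, stores_dimension sd
where f.store_id = sd.stores_key
group by rollup (sd.state, sd.city);   -- drilldown reads the finer rows, rollup reads the subtotal rows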

15 Then Came the Internet and the Explosion of Data on the Web

16 Web 2.0 BI Approach: User Behavior Analysis
Diagram: requests from the Internet hit a load balancer in the DMZ, are routed by a request dispatcher to service processors that render the website, and are simultaneously written by a request logger as log entries. Logs, transaction-related information, web-crawler output and third-party / web application data providers (e.g. DoubleClick) flow into the corporate data center, where Map/Reduce tasks compute customer behavior statistics and trend analysis. A decision-support component turns the results into new rules and operational data, which the service responder uses to shape subsequent responses.

17 And suddenly Data Mining is the new BI !

18 Data Mining – a process view
Many definitions:
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

19 Why Mine Data – Commercial Viewpoint
Lots of data is being collected and warehoused:
Web data: Yahoo! collects 10 GB/hour
Purchases at department / grocery stores: Walmart records about 20 million transactions per day
Bank / credit card transactions
Computers have become cheaper and more powerful.
Competitive pressure is strong: provide better, customized services for an edge (e.g. in Customer Relationship Management).

20 Why Mine Data – Scientific Viewpoint
Data is collected and stored at enormous speeds (GB/hour):
Remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of Earth Science data per year
Telescopes scanning the skies: sky survey data
Gene expression data
Scientific simulations: terabytes of data generated in a few hours
Traditional techniques are infeasible for raw data. Data mining may help scientists in the automated analysis of massive data sets and in hypothesis formation.

21 Common Data Mining Techniques
Clustering, Predictive Modeling, Anomaly Detection, Association Rules (a small SQL sketch of anomaly detection follows below).
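As a hedged illustration of just one of these techniques, anomaly detection, expressed over the sales_fact table from the warehouse slides: flag store/day combinations whose revenue lies more than three standard deviations from that store's average. The per-store z-score idea and the 3-sigma threshold are illustrative choices, not something prescribed in the deck.

select store_id, sales_date, daily_sales
from (
  select store_id, sales_date, daily_sales,
         avg(daily_sales)    over (partition by store_id) as avg_sales,
         stddev(daily_sales) over (partition by store_id) as sd_sales
  from (
    select store_id, sales_date, sum(dollar_sales) as daily_sales
    from sales_fact
    group by store_id, sales_date
  ) daily
) stats
where sd_sales > 0
  and abs(daily_sales - avg_sales) > 3 * sd_sales;   -- unusually high or low days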

22 Amazon.com Case Study: Personalized Customer Relationship Management

23 Amazon.com 5-step loyalty model
Step: Amazon’s action
Need creation: anticipate / stimulate
Information search: provide / assist
Evaluate alternatives: assist / negate
Purchase transaction: optimise / reward
Post-purchase experience: add value

24 Step 1: Need Creation (anticipate / stimulate)

25 Step 2: Information Search (provide / assist)

26 Step 3: Evaluation of Alternatives (assist / negate)

27 Step 4: Purchase Optimisation / Reward (optimise / reward)
1-click purchase: a ‘slippery check-out counter’ vs. ‘sticky aisles’

28 Step 5: Post-Purchase Experience (add value)

29 Internet Marketing Insight – Jeff Bezos
Role of advertisement: get the customer to the store. Customer experience: get the customer to buy.
Brick & mortar stores: getting the customer to the store is the hard part. Shopping cart abandonment is not common, since the overhead of going to another store is very high (especially in Minnesota winters!). Marketing expenses: 80% for advertisement, 20% for customer experience.
The rule should be reversed for online stores.

30 Difference in Two BI Approaches
Traditional (enterprise) approach:
Mainly used for executive reports, consumed by humans
Medium-sized data volume at enterprise scale, not web scale
Very batch-oriented; weekly or monthly is the norm
ETL (Informatica)
Data Warehouse (RDBMS, Fact / Dimension tables, Star / Snowflake schema)
Multi-dimensional (ROLAP, MOLAP, Slice / Dice / Rollup / Drilldown)
Analytic tool (Business Objects)

Modern (Web 2.0 company) approach:
Mainly used for data mining and an automatic feedback loop for adaptation
Gigantic data volume at web scale, from many different sources
Tight feedback loop; latency is within seconds or minutes
ETL (more tolerant of unclean data, but must be processed at high speed)
Data Warehouse (distributed file systems, NoSQL)
Map/Reduce parallel processing (Hadoop)
Analytic tool (Hive / R)

31 BI with Unstructured Data
Hadoop + Vertica

32 Big Data comes in Three Forms
Unstructured: images, sound, video
Semi-structured: logs, data feeds, event streams
Fully structured: relational tables

33 Near Time BI Reporting on Continuous Data Stream
Diagram: an expected high-volume incoming data stream is accepted by a processing system (MOM, CEP or HOP, depending on data volume) running on commodity hardware, fed to MapReduce jobs, and landed in HDFS alongside a lookup DB. A BI adaptor / aggregator answers queries from any BI reporting tool, driving a real-time dashboard and near-time reports for the operational and BI reporting systems. The data volume determines the underlying technology framework (MOM, CEP or HOP).

34 Near Real-Time BI Reporting
Raw incoming data gets processed in near real time. Depending on the incoming data streaming velocity, different technologies will be used to pre-process the data: MOM (Message Oriented Middleware), CEP (Complex Event Processing), or HOP (Hadoop Online Prototype). Incoming data is divided into smaller batches and forwarded to MapReduce processors. Processed data is typically stored in a distributed file system such as HDFS, and is then pushed or pulled to the target BI reporting application or tool (a Hive sketch of that last hop follows below).
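One hedged way to make those HDFS-landed batches queryable by a BI tool is a Hive external table over the output directory; the table name, columns and path below are illustrative assumptions, not part of the original architecture.

-- Hive external table over the directory where MapReduce lands processed batches
create external table processed_events (
  event_time   string,
  metric_name  string,
  metric_value double
)
row format delimited fields terminated by '\t'
location '/data/near_time_bi/processed';

-- The BI adaptor can then aggregate the most recent batches for the dashboard
select metric_name, sum(metric_value)
from processed_events
where event_time >= '2011-01-15 12:00:00'
group by metric_name;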

35 What do people do with Hadoop?
Parse logs, look for patterns, archive data, transform data (a Hive sketch of the first two follows below).
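A hedged Hive sketch of "parse logs and look for patterns", assuming tab-delimited web-server logs already sitting in HDFS; the table name, columns and path are illustrative.

-- External table over raw (tab-delimited) web server logs in HDFS
create external table access_log (
  ip      string,
  ts      string,
  url     string,
  status  int,
  bytes   bigint
)
row format delimited fields terminated by '\t'
location '/logs/webserver/';

-- Pattern to look for: the URLs that fail most often
select url, count(*) as errors
from access_log
where status >= 500
group by url
order by errors desc
limit 20;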

36 Vertica® Analytic Database
MPP columnar architecture
Second to sub-second queries
300 GB/node load times
Scales to hundreds of TBs
Standard ETL & reporting tools (a load-and-query sketch follows below)
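As a hedged sketch of what loading and reporting against Vertica can look like: a bulk load with COPY followed by an aggregate over the columnar store. The file path is illustrative and the exact COPY options vary by version.

-- Bulk-load a delimited extract into the fact table (path is illustrative)
copy sales_fact from '/data/exports/sales_2011_01.csv' delimiter ',' direct;

-- Columnar storage keeps selective aggregates like this fast
select store_id, sum(dollar_sales) as dollar_sales
from sales_fact
where sales_date >= '2011-01-01'
group by store_id;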

37 Availability, Scalability and Efficiency
…how fast can you go from data to answers? Unstructured data needs to be analyzed to make sense of it. Semi-structured data is parsed based on a spec (or by brute force). Structured data can be optimized for ad-hoc analysis.

38 Hadoop / Vertica
Hadoop provides the distributed processing framework (MapReduce) and the distributed storage layer (HDFS). Vertica can be used as a data source and target for MapReduce. Data can also be moved between Vertica and HDFS (sqoop). Hadoop talks to Vertica via custom Input and Output Formatters.

39 Hadoop / Vertica
Diagram: a Hadoop compute cluster runs Map and Reduce tasks against Vertica, which serves as the structured data repository for Hadoop.

40 Hadoop / Vertica
Vertica’s input formatter takes a parameterized query, so relational Map operations can be pushed down to the database. Vertica’s output formatter takes an existing table name or a table description, and Vertica output tables can be optimized directly from Hadoop.
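Purely to illustrate the push-down idea, this is the kind of parameterized query such an input formatter might be handed; the column choice and the per-split binding of the placeholder are assumptions, not the connector’s documented API.

-- Filtering and projection run inside Vertica; each input split binds its own value
select store_id, product_id, dollar_sales
from sales_fact
where sales_date = ?;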

41 Hadoop / Vertica
Diagram: several Hadoop compute clusters, each running Map and Reduce tasks, federate multiple Vertica database clusters.

42 Data Mining for Computational Social Sciences
A Case Study from Virtual Worlds

43 Online Games
Massively Multiplayer Online Role Playing Games (MMORPGs) are computer games that allow hundreds to thousands of players to interact and play together in a persistent online world. Popular MMO games: EverQuest 2, World of Warcraft and Second Life.

44 MMORPG – EverQuest 2
MMORPGs (MMO Role Playing Games) are the most popular type of MMO game. Examples: World of Warcraft by Blizzard and EverQuest 2 by Sony Online Entertainment.
Various logs of player behavior are maintained. Player activity in the environment, as well as his/her chat, is recorded at regular time intervals; each record carries a time stamp and a location ID. The logs capture different aspects of player behavior:
Guild membership history (member of, kicked out of, joined, left)
Achievements (quests completed, experience gained)
Items exchanged and sold/bought between players
Economy (items/properties possessed/sold/bought, banking activity, looting, items found/crafted)
Faction membership (faction affiliation, record of actions affecting faction affiliation)

45 Social Science Data Mining with EverQuest 2 Data
Improve understanding of the dynamics of group behavior. MMORPG data enables us to look at the dynamics of groups in a new way:
Multiple groups are part of a large social network
Individuals from the social network can join or leave groups
Groups are not isolated, and some of them can be related, i.e. they may be geared towards specific objectives, each of which works towards a larger goal (e.g. different teams working towards disaster recovery)
The emergence and destruction of groups, as well as their dynamic memberships, depend on the underlying social network as well as the environment
(A query sketch over the guild membership logs follows below.)
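A hedged sketch of what such an analysis could look like over the guild membership history mentioned on the previous slide, assuming a hypothetical guild_membership_log table (player_id, guild_id, event, event_time, where event is 'joined', 'left' or 'kicked'): weekly joins and departures per guild as a first cut at studying group emergence and destruction.

select guild_id,
       trunc(event_time, 'IW') as week,   -- ISO-week bucket (Oracle-style trunc)
       sum(case when event = 'joined' then 1 else 0 end) as joins,
       sum(case when event in ('left', 'kicked') then 1 else 0 end) as departures
from guild_membership_log
group by guild_id, trunc(event_time, 'IW')
order by guild_id, week;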

46 Thank You!
We are always looking for good engineers who are passionate about technology. For more information, please contact info@globallogic.com.

