Slide 4: About Infobright
Founded: 2006
Headquarters: Toronto, Canada; offices in Boston, MA and Warsaw, Poland
The Infobright Data Warehouse:
- Simplicity: no new schemas, no indices, no data partitioning, easy to maintain
- Scalability: designed for rapidly growing volumes; ideal for up to 30 TB
- Low TCO: industry-leading compression, less storage, industry-standard servers, low software costs, minimal ongoing operational expenses
The Open Source Solution:
- Community (open source) and Enterprise Editions are available
MySQL Integration:
- Leverages MySQL connectivity to ETL and BI
- Provides MySQL customers with a scalable, enterprise-ready data warehouse
- MySQL/Sun Microsystems invested in Infobright (Sept 15, 2008)
Slide 6: Data Warehousing Challenges
Traditional data warehousing:
- Labor intensive: heavy indexing and partitioning
- Hardware intensive: massive storage, big servers
- Expensive and complex
Growing pressures:
- More data, more data sources: real-time data, multiple databases, external sources
- More kinds of output needed by more users, more quickly
- Limited resources and budget
Slide 7: Data Warehousing – Raising the Bar
Early data warehouse characteristics:
- Integration of internal systems
- Monthly and weekly loads
- Heavy use of aggregates
Data warehousing matures:
- Near real-time updates
- Integration with master data management
- Data mining using discrete business transactions
- Provision of data for business-critical applications
New demands:
- Larger transaction volumes driven by the internet
- Impact of cloud computing
- More -> faster -> cheaper
Slide 9
Infobright is a good fit for:
- Loading millions of transactions within a limited batch window
- Summarizing transactional data for trend analysis
- Extracting transactional detail based on specific constraints
- Ad hoc query support across many dimensional attributes
Avoid using Infobright for:
- Real-time transactional updates (operational data entry)
- Full data extracts (SELECT * FROM ...)
- Row-based operations that need to access all columns of a table; these are typically better suited to row-based databases
Slide 10: Customer Experience – Load Speed
Business requirement:
- Mavenir: an OEM customer deploying a worldwide telco application
- The application provides operators with access to detailed SMS traffic
- Needed a low-cost solution able to load 20K records per second
- Peak of 70M messages per hour during Chinese New Year
Solution:
- Custom front end developed using the MySQL JDBC driver
- Design, test, and deployment completed in under 3 months with no assistance from Infobright
- Allowed for expansion from 7 to 90 days of online SMS history
- Supports a plan for 70% annual growth
- Rollout to allow for 120 concurrent users
Slide 11: Customer Experience – Query Performance
Business requirement:
- Sulake: an online social networking service with 126M users across 31 countries and 990M page impressions per month
- Needed to quickly analyze online spend on a daily basis to enhance the online experience and drive additional revenue
- The existing InnoDB solution was unable to process business queries in a reasonable time frame (queries took hours to complete)
- Business opportunities were being lost due to the inability to analyze subscriber behavior using transactions
Solution:
- The customer used the existing data model and deployed the application using Business Objects: Data Integrator for ETL, Web Intelligence for BI
- Existing ETL workflows were converted to Infobright in less than 4 weeks without assistance
- Historically long-running queries (hours) now run in minutes and seconds
- Compression also reduced the need for disk storage and cut overall I/O and network traffic
Slide 12: Customer Experience – TCO
Business requirement:
- A global provider of electronic trading solutions across 22 time zones and 700 financial exchanges
- Wanted to expand analytical access to financial transactions to include both current (30 days) and archived (4 years) transactions
- Expanding the existing Sybase solution was too costly
Solution:
- Infobright achieved the performance benchmarks within the first 3 days of a proof of concept using production data
- Load speed of 28,000 records per second
- A join of a 100M-row table with a 30M-row table, returning 400K rows, completed in 185 seconds
- Additional queries that did not complete using Sybase finished in minutes using Infobright
- The final solution was deployed using Pentaho Kettle for ETL and Crystal Reports for BI
- Success with a modest data size (150 GB) has opened opportunities for additional, more detailed transactional analysis
Slide 13: Customer Experience – Query Performance and TCO
Business requirement:
- TradeDoubler: a global digital marketing company based in Sweden, serving 1,600+ online advertisers across Europe and Asia
- TradeDoubler optimizes web marketing campaigns by analyzing web clicks, impressions, and purchases; analyzing terabytes of data about the results of its programs is central to the company's success
- Needed to process and analyze 20 billion online transactions per month
- Selected Infobright for rapid analytical results, seamless interoperability with its MySQL database, and low TCO
Solution:
- Deployed on a single $12,500 Dell server with 8 CPU cores and 16 GB RAM
- Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI
- In the POC, loaded more than 3.2 billion rows at over 300,000 rows per second
- In production, achieved 30x data compression
- Extremely fast query speed: 3 queries that previously did not return now complete within a minute
Slide 15: Introducing Infobright
Smarter architecture:
- Load data and go: no indices or partitions to build and maintain
- Knowledge Grid automatically updated as data packs are created or updated
- Super-compact data footprint can leverage off-the-shelf hardware
Key elements:
- Column orientation
- Data Packs: data stored in manageably sized, highly compressed packs, compressed using algorithms tailored to the data type
- Knowledge Grid: statistics and metadata "describing" the super-compressed data
Slide 16: Column vs. Row-Oriented
Sample table:
  EMP_ID  FNAME  LNAME   SALARY
  1       Moe    Howard  10000
  2       Curly  Joe     12000
  3       Larry  Fine    9000
Row-oriented storage: (1,Moe,Howard,10000; 2,Curly,Joe,12000; 3,Larry,Fine,9000)
- Works well if all the columns are needed for every query
- Efficient for transactional processing when all the data for the row is available
Column-oriented storage: (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000)
- Works well with aggregate results (sum, count, avg)
- Only the relevant columns need to be touched
- Consistent performance with any database design
- Allows for very efficient compression
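The layout contrast on this slide can be made concrete with a short sketch (illustrative Python, not Infobright code): the same three employee rows stored both ways, with an aggregate that reads only the SALARY column under the columnar layout.

```python
# Minimal sketch contrasting row-oriented and column-oriented layouts
# using the EMP table from the slide. Not Infobright code.

rows = [
    (1, "Moe", "Howard", 10000),
    (2, "Curly", "Joe", 12000),
    (3, "Larry", "Fine", 9000),
]

# Row-oriented: all values of one record are stored together.
row_store = rows

# Column-oriented: all values of one column are stored together.
columns = ("EMP_ID", "FNAME", "LNAME", "SALARY")
col_store = {name: [r[i] for r in rows] for i, name in enumerate(columns)}

def avg_salary_columnar(store):
    """An aggregate touches only the SALARY column; FNAME/LNAME are never read."""
    sal = store["SALARY"]
    return sum(sal) / len(sal)

print(avg_salary_columnar(col_store))
```

The dictionary-of-lists is the columnar layout in miniature: a `SELECT AVG(salary)` scans one contiguous list, which is also why per-column compression works so well (each list holds values of a single type and distribution).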
Slide 17: Data Packs and Compression
64K Data Packs:
- Each data pack contains 65,536 data values
Patent-pending compression algorithms:
- Compression is applied to each data pack individually
- The compression algorithm varies depending on data type and data distribution
Compression results:
- Results vary depending on the distribution of data among data packs
- A typical overall compression ratio seen in the field is 10:1
- Some customers have seen compression as high as 40:1
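The per-pack scheme above can be sketched as follows. This is an assumption-laden stand-in: Infobright's actual algorithms are proprietary, so generic zlib is used here purely to show why pack-local compression ratios vary with data distribution.

```python
# Sketch only: split a numeric column into 65,536-value packs and
# compress each pack independently. zlib stands in for Infobright's
# proprietary, type-specific algorithms.
import struct
import zlib

PACK_SIZE = 65536  # values per data pack, as stated on the slide

def pack_compression_ratios(values):
    """Return the raw/compressed size ratio for each data pack."""
    ratios = []
    for i in range(0, len(values), PACK_SIZE):
        pack = values[i:i + PACK_SIZE]
        raw = struct.pack(f"{len(pack)}i", *pack)  # 4-byte ints
        ratios.append(len(raw) / len(zlib.compress(raw)))
    return ratios

uniform = [7] * PACK_SIZE        # a highly repetitive pack
varied = list(range(PACK_SIZE))  # a pack of all-distinct values
print(pack_compression_ratios(uniform + varied))
```

The repetitive pack compresses far better than the all-distinct one, which is the mechanism behind the 10:1 typical vs. 40:1 best-case spread quoted on the slide.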
Slide 18: Knowledge Grid
This metadata layer is about 1% of the compressed data volume.
- Data Pack Nodes (DPNs): a separate DPN is created for every data pack in the database to store basic statistical information
- Character Maps (CMAPs): every data pack that contains text gets a matrix recording the occurrence of every possible ASCII character
- Histograms: created for every data pack that contains numeric data, dividing its range into 1,024 MIN-MAX intervals
- Pack-to-Pack Nodes (PPNs): track relationships between data packs when tables are joined; query performance improves as the database is used
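A toy version of the three per-pack structures makes the idea tangible. Everything below is a simplified guess at the shape of this metadata (the real on-disk formats are not public): a DPN as min/max/count, a histogram as 1,024 occupancy flags over the pack's min-max range, and a CMAP as the set of characters that occur anywhere in the pack.

```python
# Hypothetical sketches of Knowledge Grid metadata for one data pack.

def data_pack_node(values):
    """DPN: basic statistics describing a pack without decompressing it."""
    return {"min": min(values), "max": max(values), "count": len(values)}

def histogram(values, intervals=1024):
    """Flag which of 1,024 equal-width min-max intervals contain a value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / intervals or 1
    bits = [False] * intervals
    for v in values:
        bits[min(int((v - lo) / width), intervals - 1)] = True
    return bits

def cmap(strings):
    """CMAP: which characters occur anywhere in a text pack."""
    return {c for s in strings for c in s}

pack = [10, 12, 990, 995]
dpn = data_pack_node(pack)
print(dpn)
print("z" in cmap(["Toronto", "Boston"]))  # no row can contain 'z'
```

With these structures a query like `WHERE city LIKE '%z%'` can discard a whole pack from the CMAP alone, and `WHERE salary BETWEEN 100 AND 200` can discard packs whose histogram has no occupied interval in that range.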
Slide 19: A Simple Query Using the Knowledge Grid
SELECT count(*) FROM employees
WHERE salary > 50000 AND age < 65 AND job = 'Shipping' AND city = 'Toronto';
The diagram shows the salary, age, job, and city columns split into packs of rows (1 to 65,536; 65,537 to 131,072; 131,073 onward), with each pack classified as Completely Irrelevant, Suspect, or All values match.
1. Find the Data Packs with salary > 50000
2. Find the Data Packs that contain age < 65
3. Find the Data Packs that have job = 'Shipping'
4. Find the Data Packs that have city = 'Toronto'
5. Eliminate all packs that have been flagged as irrelevant
6. Finally, only the single remaining Suspect pack needs to be decompressed to answer the query
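The three-way classification used in the walkthrough above can be sketched from DPN min/max statistics alone. This is an illustrative simplification of the rough-set idea, not Infobright's actual implementation: for a predicate like `salary > 50000`, a pack is decided without decompression unless the threshold falls inside its min-max range.

```python
# Sketch: classify a pack for "salary > 50000" using only its DPN.
# Only Suspect packs ever need to be decompressed.

def classify(dpn_min, dpn_max, threshold):
    if dpn_max <= threshold:
        return "irrelevant"   # Completely Irrelevant: no value can match
    if dpn_min > threshold:
        return "all-match"    # All values match: count rows without decompressing
    return "suspect"          # Suspect: decompress and test each row

packs = [(10000, 45000), (60000, 90000), (40000, 70000)]
print([classify(lo, hi, 50000) for lo, hi in packs])
# → ['irrelevant', 'all-match', 'suspect']
```

For a `count(*)`, the all-match pack contributes its full row count from the DPN alone; intersecting the per-column classifications across salary, age, job, and city is what leaves only one pack to decompress in the slide's example.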
Slide 20: A Join Query Using the Knowledge Grid
SELECT MIN(sale), MAX(discount), name
FROM carsales, salesperson
WHERE carsales.id = salesperson.id
AND carsales.prov = 'ON' AND carsales.date = '2008-02-29'
GROUP BY name;
Tables: Car Sales (id, sale, discount, prov, date) and Sales Person (id, name).
1. Eliminate the Car Sales Data Packs that are irrelevant based on the constraints in the SQL
2. Determine the related Sales Person Data Packs based on the values of carsales.id found in the relevant Car Sales Data Packs
3. Create a Pack-to-Pack Node that stores the results of the join condition between Car Sales and Sales Person
4. Any subsequent queries can use the PPN to resolve joins between Car Sales and Sales Person
The PPN is a 0/1 matrix of carsales.id packs against salesperson.id packs, where a 1 indicates that the two Data Packs are related.
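The PPN matrix described above can be sketched in a few lines. This is a deliberately naive model (real PPNs are built incrementally from actual join executions, and the matrix here is computed from raw id lists for clarity): one boolean per pack pair, true when the two packs share at least one join-key value.

```python
# Illustrative Pack-to-Pack Node: which carsales id-packs share values
# with which salesperson id-packs. Later joins skip pairs marked False.

def build_ppn(carsales_packs, salesperson_packs):
    """Return a boolean matrix: ppn[i][j] is True iff pack i of
    carsales.id and pack j of salesperson.id have a value in common."""
    return [[bool(set(cp) & set(sp)) for sp in salesperson_packs]
            for cp in carsales_packs]

carsales_id_packs = [[1, 2, 3], [7, 8]]      # two packs of carsales.id
salesperson_id_packs = [[2, 9], [7]]         # two packs of salesperson.id
ppn = build_ppn(carsales_id_packs, salesperson_id_packs)
print(ppn)  # [[True, False], [False, True]]
```

On the next join between the same tables, any pack pair flagged False is skipped outright, which is why the slide notes that query performance improves as the database is used.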
Slide 22: MySQL/Infobright Architecture
Infobright ships with the full MySQL binaries; the MySQL architecture supplies database components such as connectors, security, and memory management.
MySQL layer:
- Connectors: Native C API, JDBC, ODBC, .NET, PHP, Python, Perl, Ruby, VB
- Connection pool: authentication, thread reuse, connection limits, memory checks, caches
- SQL interface, parser, MySQL loader, MySQL optimizer, caches & buffers
- Management services & utilities
- MyISAM stores views, users, permissions, and table definitions
Infobright components:
- IB Storage Engine: 64K Data Packs, the Compressor/Decompressor, and the Knowledge Grid (Data Pack Nodes and Knowledge Nodes)
- IB Optimizer and Executor: uses rough set algorithms and the Knowledge Grid to navigate the database
- IB Loader/Unloader: supports text-based and binary data formats
Slide 23: Optimized SQL for Infobright
The Infobright Optimizer supports a large subset of MySQL syntax and functions. When it encounters SQL syntax that is not supported, the query is executed by the MySQL optimizer instead.
Infobright-optimized SQL:
- SELECT statements
- Comparison operators
- Logical operators
- String comparison functions (LIKE, ...)
- Aggregate functions
- Arithmetic operators
Handled via the MySQL optimizer:
- Data Manipulation Language (INSERT/UPDATE/DELETE)
- Data Definition Language (CREATE & DROP)
- String functions
- Date/time functions
- Numeric functions
- Trigonometric functions
- CASE statements
Slide 24: Infobright Data Types
Most of the numeric, date, and string data types expected of a MySQL database engine are fully supported. The data types currently not implemented within Infobright are BLOB, ENUM, SET, and auto-increment columns.
Slide 25: ETL Integration
- Increased efficiency with popular platforms
- Deeper ETL integration: Jaspersoft, Talend, Pentaho
- Leverages the end-to-end data management provided by ETL tools
- Improved support for Data Manipulation Language (DML)
- Leverage existing IT tools and resources for fast, simple deployments and low TCO
Slide 26: Data Loading With and Without Custom ETL Connectors
Loading Infobright tables with custom connectors:
- Kettle from Pentaho
- Talend ETL from Talend
- Jaspersoft ETL (Talend) from Jaspersoft
Two ways to invoke the Infobright loader without connectors:
1. Generate a CSV or binary file and invoke the Infobright loader to load it.
2. Named pipe technique:
   - Create a named pipe (e.g. mkfifo /home/mysql/s_mysession1.pipe)
   - Launch the Infobright loader in the background to read from the pipe
   - Launch the ETL process that writes data to the named pipe
   - As the ETL process writes records to the named pipe, the loader reads them and writes them to an Infobright database table
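The named-pipe flow above can be demonstrated end to end in Python on a POSIX system. A background reader thread stands in for the Infobright loader here, since the point is the plumbing: the "loader" consumes rows as the "ETL" writes them, with no intermediate file on disk.

```python
# Sketch of the named-pipe loading technique (POSIX only).
# A reader thread plays the role of the Infobright loader.
import os
import tempfile
import threading

pipe = os.path.join(tempfile.mkdtemp(), "s_mysession1.pipe")
os.mkfifo(pipe)                        # step 1: create the named pipe

loaded = []

def loader():                          # step 2: loader reads in the background
    with open(pipe) as f:
        for line in f:
            loaded.append(line.strip().split(","))

t = threading.Thread(target=loader)
t.start()

with open(pipe, "w") as etl:           # step 3: ETL writes CSV rows to the pipe
    etl.write("1,Moe,10000\n")
    etl.write("2,Curly,12000\n")

t.join()                               # loader sees EOF when the writer closes
print(loaded)
```

Note that opening a FIFO blocks until both a reader and a writer are attached, so the loader must be started before (or concurrently with) the ETL process, exactly as the slide's step order prescribes.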
Slide 28: Comparison of ICE and IEE Features
Technical Support
  ICE: Forums and/or a one-time 4-hour support pack
  IEE: Available
Warranty and Indemnification
  ICE: No
  IEE: Included
INSERT/UPDATE/DELETE
  ICE: No
  IEE: Supported
Infobright Loader
  ICE: Up to 50 GB/hr
  IEE: Multi-threaded, up to 300 GB/hr
Data Load Types
  ICE: Text only
  IEE: Text & binary (100% faster)
MySQL Loader
  ICE: No
  IEE: Supported
Platform Support
  ICE: 32-bit Intel and AMD for Windows XP, Ubuntu 8.04, Fedora 9; 64-bit Intel and AMD for RHEL 5, CentOS 5, Debian
  IEE: 64-bit Intel and AMD for Windows Server 2003, Windows Server 2008, RHEL 5, CentOS 5, Debian, Solaris 10