
1 Infobright Meetup Host Avner Algom May 28, 2012
Good morning, I want to start by thanking the host for the opportunity to be with you here today.

2 Agenda Infobright WebCollage Zaponet Q/A Use cases
Paul Desjardins, VP Bus Dev, Infobright Inc: What part of the Big Data problem does Infobright solve? Where does Infobright fit in the database landscape?
Joon Kim, Sr. Sales Engineer: Technical overview; use cases
Lior Haham, WebCollage: WebCollage Case Study: Using Analytics for Hosted Web Applications
Asaf Birenzvieg, CEO, Zaponet: Introduction / experience with Infobright
Q/A
Together with Joon Kim, our Senior Sales Engineer, and Anu from InMobi, it is our pleasure to be able to share how our columnar database can help provide competitive advantage for your Big Data challenges. We are delighted that Anu has also joined us this morning. I am going to take 15 minutes and cover three topics: Where does Infobright fit in the database landscape? What specific part of the Big Data problem does Infobright solve? And I will share some sample use cases. Joon will then give you a review of the technology, and Anu will review how InMobi uses Infobright with Hadoop for analytics.

3 Growing Customer Base across Use Cases and Verticals
1000 direct and OEM installations across North America, EMEA and Asia. 8 of the top 10 global telecom carriers use Infobright via OEMs/ISVs. Verticals: logistics, manufacturing, business intelligence; online and mobile advertising / web analytics; government; utilities; research; financial services; telecom and security; gaming and social networks. Focused on machine-generated data use cases. 50+ employees: Toronto, Warsaw, Chicago, Dublin, other US. Increasingly used as an embedded DB within OEM/SaaS applications. 300 direct/OEM/SaaS customers, with more than 1000 installations through OEMs. End user / SaaS: online businesses and web analytics (online advertising, gaming, marketing services, online retail, etc.), mobile analytics, mid-tier capital markets, telecom, SaaS applications. OEMs: telecom ISVs and technology providers, SIEM, network management.

4 The Machine-Generated Data Problem
“Machine-generated data is the future of data management.” Curt Monash, DBMS2. Machine-generated/hybrid data: weblogs; computer and network events; CDRs; financial trades; sensors, RFID, etc.; online game data. Human-generated data: input from most conventional kinds of transactions, such as purchase/sale, inventory, manufacturing, and employment status changes. MORE DATA: more online activity means more web data; growth of mobile means more call data and web data; servers, networks and sensors generate lots of log/event data. MORE DATA MINING: target individual customers, identify micro-segments, find security threats, identify fraud. Machine-generated data is being defined in different ways by different people. Some say it is strictly data that is generated without any direct human intervention, and others say it also includes the machine tracking of human activities, such as web log data. Examples beyond web logs include all other types of logs, such as computer, network and security logs, sensor data, call detail records, financial trading data, ATM transactions or RFID tags. But whether the definition of machine-generated data is precise or a little more open, certain key characteristics nearly always apply: new records are added with a high frequency, and the data itself is seldom if ever changed. The problem? The rate of growth and volume of this data far exceeds other business data, but organizations need to be able to quickly and efficiently extract useful intelligence. This can be like finding the proverbial “needle in the haystack.”

5 Mobile Advertising Analytics
The Value in the Data: “Analytics drives insights; insights lead to greater understanding of customers and markets; that understanding yields innovative products, better customer targeting, improved pricing, and superior growth in both revenue and profits.” Accenture Technology Vision, 2011. Network Analytics: network optimization, troubleshooting, capacity planning, customer assurance, fraud detection. CDR Analytics: customer behavior analysis, marketing campaign/services analysis, network capacity optimization, compliance and audit. Mobile Advertising Analytics: need to capture web data, mobile data, and network data; mobile ad campaign analytics.

6 Current Technology: Hitting the Wall. Today’s database technology requires huge effort and massive hardware. (Chart: How Performance Issues Are Typically Addressed, by Pace of Data Growth.) While traditional databases are well suited for initially storing machine-generated data, they often have trouble analyzing it. They simply run out of horsepower in terms of volume, query speed, and the disk and processing infrastructure required to support it. The instinctive response to this challenge seems to be to throw more people or money at the problem, or to scale back by archiving further to reduce the size of the dataset. Recently a survey was conducted by the Independent Oracle Users Group (IOUG). A majority of respondents report having performance and budget issues due to exponential data growth. Key survey findings include the following: nine out of 10 respondents’ companies said data is growing rapidly, and business growth is driving this expansion in data stores. Sixteen percent of companies are experiencing data growth exceeding 50 percent a year. An overwhelming majority of respondents say growing volumes of data are inhibiting application performance to some degree (does that sound familiar?). The problem is even more acute at enterprises with the highest levels of data growth. However, most still attempt to address the problem with more hardware, versus smarter approaches. Many companies feel compelled to retain data for extended periods of time, forever in some cases, and are having difficulty making it accessible to end users. Budgets to keep up with bloated data stores also keep expanding. A sizable segment of companies with fast-growing data stores spend more than one-fourth of their IT budgets on storage requirements. All of these efforts generally yield a minimal short-term reprieve, but issues quickly resurface, which leads us to the data warehouse. This approach is seen by many as the only solution to the countless information management challenges presented by machine-generated data. The problem is, data warehouse projects are generally very costly in terms of people, hardware, software and maintenance. Source: Keeping Up with Ever-Expanding Enterprise Data, by Joseph McKendrick, Research Analyst, Unisphere Research, October 2010.

7 Infobright Customer Performance Statistics
Fast query response with no tuning or indexes. Customer-reported results (alternative database vs. Infobright):
Analytic Queries: 2+ hours with MySQL vs. under 10 seconds
Mobile Data (15MM events): 43 min with SQL Server vs. 23 seconds
Oracle Query Set: 10 seconds to 15 minutes vs. 0.43 to 22 seconds
BI Report: 7 hrs in Informix vs. 17 seconds
Data Load: 11 hours in MySQL ISAM vs. 11 minutes
This data is all from customer testing, not Infobright's. Performance always depends on many factors including the specific query, database size, data type, hardware configuration, etc. So, mileage will vary! Additional results, telecom CDRs from CBeyond: Query 1 ran in 46 secs on Infobright versus 5m55s on Oracle; Query 2 ran in 45 secs versus 4m04s; Query 3 ran in 46 secs versus 3m42s. What was interesting was that, in addition to the performance increase, the Oracle license would have been $500,000 upfront plus $100,000 annually, versus $30,000 annually for Infobright. Oracle required one server

8 Save Time, Save Cost Fastest time to value Minimal administration
Download in minutes, install in minutes No indexes, no partitions, no projections No complex hardware to install Minimal administration Self-tuning Self-managing Eliminate or reduce aggregate table creation Outstanding performance Fast query response against large data volume Load speeds over 2TB /hour with DLP High data compression 10:1 to 40:1+ Economical Low subscription cost Less data storage Industry-standard servers Fastest time to value Download in minutes No setup needed, install in minutes No need for specialized schemas Use existing data model No indexes to create Simple HW – standard laptop to high-powered server Low administration No indexes to maintain No data partitioning Self-tuning Standard SQL Eliminate or reduce aggregate table creation Outstanding compression = less time to backup High performance Designed for analytics Fast query response against large data volume High speed loader

9 Where does Infobright fit in the database landscape?
One Size DOESN’T fit all. Specialized databases deployed: excellent at what they were designed for; more open source specialized databases than commercial; cloud/SaaS use of specialty DBMSs has become popular; database virtualization has significantly lowered DBA costs. (Diagram: your warehouse surrounded by row, column, NoSQL, NewSQL and Hadoop databases.) The database landscape is in the midst of rapid change. I am old enough to remember when companies tried to position an Enterprise Data Warehouse as the single solution for organizations. That thinking is long gone; people have come to realize that, depending on your application, there are purpose-built databases: NoSQL, NewSQL, Hadoop, columnar databases and the traditional row-based databases.

10 The Emerging Database Landscape
Row / NewSQL*: structured data stored in rows on disk. Common use cases: transaction processing, interactive transactional applications. Positives: strong for capturing and inputting new records; robust, proven technology. Negatives: scale issues, less suitable for queries, especially against large databases. Key players: MySQL, Oracle, SQL Server, Sybase ASE.
Columnar: structured data is vertically striped and stored in columns on disk. Common use cases: historical data analysis, data warehousing, business intelligence. Positives: fast query support, especially for ad hoc queries on large datasets; compression. Negatives: not suited for transactions; import and export speed; heavy computing resource utilization. Key players: Infobright, Aster Data, Sybase IQ, Vertica, ParAccel.
NoSQL key-value store: data stored usually in memory with some persistent backup. Common use cases: a cache for storing frequently requested data for a web app. Positives: scalability; very fast storage and retrieval of unstructured and partly structured data. Negatives: usually all data must fit into memory; no complex query capabilities. Key players: Memcached, Amazon S3, Redis, Voldemort.
NoSQL document store: persistent storage along with some SQL-like querying functionality. Common use cases: web apps or any app which needs better performance without having to define columns in an RDBMS. Positives: persistent store with scalability features such as sharding built in, with better query support than key-value stores. Negatives: lack of sophisticated query capabilities. Key players: MongoDB, CouchDB, SimpleDB.
NoSQL column store: very large data storage, MapReduce support. Common use cases: real-time data logging such as in finance or web analytics. Positives: very high throughput for Big Data, strong partitioning support, random read/write access. Negatives: low-level API, inability to perform complex queries, high latency of response for queries. Key players: HBase, BigTable, Cassandra.

11 Why use Infobright to deal with large volumes of machine generated data?
EASY: to install, to use. AFFORDABLE: less HW, low SW cost. FAST: fast query, fast load.

12 Technical Overview of Infobright
Hello, my name is Joon Kim and I’d like to welcome everyone to our webinar today, an introduction to Infobright. We will be starting in a few minutes, so please be patient as we wait for everyone to log in. Over the next 60+ minutes, I’d like to give you a light overview of the column-based analytics database space, how Infobright fits into it, and how we can help you improve your results while decreasing your TCO. I’d like to start by having everyone take a quick survey to help me understand what you would like to gain from today’s session, so please take a moment to select one or more options for the following statement. Next, let’s get started; please feel free to ask questions in the area on the right and we can either answer them as we go, or wait till the end. Let’s get started. Joon Kim, Senior Sales Engineer

13 Key Components of Infobright
Column-oriented, smarter architecture: load data and go; no indices or partitions to build and maintain; the Knowledge Grid is automatically updated as data packs are created or updated; a super-compact data footprint can leverage off-the-shelf hardware. Knowledge Grid: statistics and metadata “describing” the super-compressed data. Data Packs: data stored in manageably sized, highly compressed data packs, with data compressed using algorithms tailored to the data type. One of the main drivers behind the development of Infobright is the need to support ad-hoc query capability. To address this need we selected a column-oriented architecture that breaks the data up into individual data packs of 65,536 row elements each. In this way the data is stored independent of row structure, and users should expect a query on any column in the database to have a consistent level of performance. Then, instead of indexing, we developed a Knowledge Grid that stores statistical information about each data pack (such as minimum and maximum value) to help determine which data packs are needed to answer each query. In many cases the answer is determined directly from the Knowledge Grid without touching the data packs at all. This results in amazing query performance as well as a large reduction in required storage. A major drawback of traditional DBMSs is the need to predict the kinds of queries a user would ask and thereby group data together and heavily index the relational tables. This means the need for more storage, in some cases 2:1, where if you had 1TB of raw data you would need 2TB of database storage to support those indexes. In contrast, the Knowledge Grid represents less than 1% of overall database storage. In addition, the data packs are compressed individually using patented compression algorithms, giving us our industry-leading overall compression ratios of anywhere from 10:1 to 40:1. Data is read in without having to alter a business’s existing data model, and there are no indexes to build or maintain. The Knowledge Grid is built automatically during the load, and doesn’t require any work on the part of the user. And because Knowledge Grid information is generated only relative to the data being loaded, our incremental load speeds are constant, regardless of the growing size of the database. What this means is that we’ve designed a product that provides fast answers to complex questions, a product that deals with massive amounts of information, and a product that does so without any additional burdens on IT. Rather than using a brute-force approach with more hardware power to increase query performance, we use knowledge about the data itself to intelligently isolate the relevant information and return results more quickly. Users are no longer limited to pre-determined queries and reporting structures, and the efforts of IT are reduced to a minimum.

14 Infobright Architecture

15 1. Column Orientation Incoming Data Column Oriented Layout
Incoming data (rows) with columns EMP_ID, FNAME, LNAME, SALARY: (1, Moe, Howard, 10000), (2, Curly, Joe, 12000), (3, Larry, Fine, 9000). Column-oriented layout: (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000). Works well with aggregate results (sum, count, avg.). Only columns that are relevant need to be touched. Consistent performance with any database design. Allows for very efficient compression.
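A minimal sketch of why this helps aggregates (the table name is assumed; the columns follow the slide's example): the query below only has to read the SALARY column, and the name columns are never touched.

SELECT SUM(salary), AVG(salary), COUNT(*)
FROM employees;   -- only the SALARY column is read; FNAME and LNAME stay compressed on disk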

16 2. Data Packs and Compression
Each data pack contains 65,536 data values. Compression is applied to each individual data pack, and the compression algorithm varies depending on data type and distribution (patent-pending algorithms). Compression results vary depending on the distribution of data among data packs. A typical overall compression ratio seen in the field is 10:1; some customers have seen results of 40:1 and higher. For example, 1TB of raw data compressed 10 to 1 would only require 100GB of disk capacity. During the load process, each column of data is segmented into Data Packs of 64K, or 65,536, row elements. Then, for each Data Pack, Infobright applies multiple compression algorithms, multiple times if necessary, to achieve the highest compression ratio for that Data Pack. The compression algorithms we use are a combination of industry standards and internally developed, patent-pending Infobright algorithms, and are chosen based on the data within the column. The overall compression ratio for each Data Pack will depend on the data type and the repetitiveness of the data from one Data Pack to the next. By addressing each Data Pack individually, we see that the compression varies within one column from one Data Pack to the next, and of course it varies from one column to the next as well. This means that the overall compression ratio for the whole table can be very high. Typical compression achieved by our customers is 10:1, but we frequently see up to 30:1 and 40:1 compression. Keep in mind that many databases with stated compression ratios of 5:1 and 10:1 will often add significantly to their footprint with indexes or projections, resulting in an overall storage requirement that can be equal to or greater than the original uncompressed data. Our compression method means a significant savings in storage requirements, a lower Total Cost of Ownership, and reduced I/O since smaller volumes of data are being moved around.
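A quick worked example of how the 65,536-value pack size plays out (the row and column counts are made up for illustration): a 10-million-row table has ceil(10,000,000 / 65,536) = 153 data packs per column, each compressed independently. The same arithmetic in MySQL-compatible SQL:

SELECT CEILING(10000000 / 65536) AS packs_per_column,
       CEILING(10000000 / 65536) * 20 AS packs_in_a_20_column_table;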

17 3. The Knowledge Grid Knowledge Grid Applies to the whole table
Knowledge Nodes are built for each Data Pack and hold information about the data: basic statistics calculated during load, numerical ranges, and character maps; other, dynamic nodes are calculated during query. The Knowledge Grid is a summary of statistics collected about each table as the data is loaded. It is information about the data and is comprised of components called Knowledge Nodes. The Knowledge Grid information is collected automatically and persisted for the use of our unique optimizer, without the need to define indexes in support of specific query patterns. There is no configuration or setup required in advance. One Knowledge Node is calculated for every Data Pack. It stores basic statistical information like minimum and maximum value. For numerical columns, the occurrence of data within numerical ranges is kept in a binary array, or numerical histogram. For character columns, the occurrence and position of characters are stored in a binary matrix, or Character Map. Other, dynamic Knowledge Nodes are built at the time of query. These Knowledge Nodes cache intermediate results to reduce the amount of work needed if similar queries need to be run later. In this way query performance continues to improve as these intermediate artifacts are stored for later use.
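As a small illustration (the employees table is the hypothetical one used elsewhere in this deck), queries that need only pack-level statistics can often be answered from the Knowledge Grid without decompressing a single Data Pack:

SELECT MIN(salary), MAX(salary), COUNT(*)
FROM employees;   -- resolvable from per-pack min/max values and row counts alone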

18 Knowledge Grid Internals
Data Pack Nodes (DPNs): a separate DPN is created for every data pack created in the database to store basic statistical information. Character Maps (CMAPs): every Data Pack that contains text creates a matrix that records the occurrence of every possible ASCII character. Histograms: histograms are created for every Data Pack that contains numeric data, creating 1024 MIN-MAX intervals. Data Pack Nodes contain a set of statistics about the data that is stored and compressed in each of the Data Packs. There is always a 1-to-1 relationship between Data Packs and DPNs. DPNs always exist, so Infobright has some information about all the data in the database, unlike traditional databases where indexes are created for only a subset of columns. A CMAP is built for every character-based Data Pack. It is basically a binary matrix that tracks the occurrence of any possible ASCII character in the first 64 character positions of the data. For example, the FNAME data pack built for “Larry”, “Curly”, and “Moe” would have a “1” for position 3 of character “r”, since 2 values have an “r” in that position, whereas the CMAP matrix will have a “0” for position 1 of character “a”, since none of the names start with the letter “a”. A histogram (HIST) is built for every data pack that contains numeric data. It basically breaks the data down into 1024 intervals of minimum and maximum range. Each interval is flagged with a “1” or “0” to indicate if the data pack has a value in the range. Pack-to-Pack Nodes (PPNs) track the relationships of Data Packs between 2 tables when they are joined. These nodes are built and persistently stored by queries that join tables; in this way the Knowledge Grid grows as the database is used, and query performance actually improves over time. The last point is that the Knowledge Grid footprint is actually quite small and represents about 1% of the compressed database size. This means that if you have a database with 1TB of raw data, it will be compressed to about 100GB when stored in Infobright, with about 1GB of storage needed for the Knowledge Grid, whereas traditional indexing can typically double the size of your analytical database.
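A hedged sketch of how a CMAP eliminates packs, reusing the Larry/Curly/Moe example above (the table and query are illustrative): because no value in that FNAME pack has an "a" in character position 1, the Character Map alone rules the pack out and it is never decompressed.

SELECT COUNT(*)
FROM employees
WHERE fname LIKE 'a%';   -- rejected per pack via the CMAP, before any compressed data is touched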

19 Optimizer / Granular Computing Engine
Query received; the engine iterates on the Knowledge Grid; each pass eliminates Data Packs; if any Data Packs are needed to resolve the query, only those are decompressed. (Diagram: the query flows through the Knowledge Grid, roughly 1% of the compressed volume, down to the compressed data to produce the query results.) Q: How are my sales doing this year? When a query comes in, it’s run through the Infobright optimizer, which looks at the Knowledge Grid first to resolve the query. Because the Knowledge Grid stores aggregate information about the data from the Data Packs, the query can often be answered using only the Knowledge Grid, without having to look at the data specifically. The Knowledge Grid also stores information about the range of values within each Data Pack, so in cases where more detail is required, the Knowledge Grid will narrow the field of search to only those Data Packs relevant to the query, and then only decompress and read these relevant Data Packs.

20 How the Optimizer Works
SELECT count(*) FROM employees WHERE salary > 50000 AND age < 65 AND job = ‘Shipping’ AND city = ‘TORONTO’; Columns: salary, age, job, city. Rows 1 to 65,536; 65,537 to 131,072; 131,073 to ... Find the Data Packs with salary > 50000 (legend: completely irrelevant / suspect / all values match). Find the Data Packs that have city = ‘TORONTO’. Find the Data Packs that contain age < 65. Find the Data Packs that have job = ‘Shipping’. All packs ignored. Now we eliminate all rows that have been flagged as irrelevant; only this pack will be decompressed. Finally we have identified the data pack that needs to be decompressed. Let’s see how the Knowledge Grid is used to evaluate a simple query: SELECT DISTINCT city FROM employees WHERE salary > 50000 AND age < 65 AND job = ‘Shipping’ AND city = ‘TORONTO’; First, looking at the constraint salary > 50000, we can eliminate a number of Data Packs using the min-max values in the Data Pack Nodes: in this case 3 Data Packs were found to have no values greater than 50000, and 1 Data Pack was found to have all values with salary > 50000. age < 65 flags 2 Data Packs as suspect and 2 Data Packs that have all values of age < 65. job = ‘Shipping’ flags 2 Data Packs as suspect. city = ‘TORONTO’ eliminates 2 more Data Packs and flags 2 as suspect. Now we eliminate all rows that have been flagged as irrelevant, and we have only 1 Data Pack left to decompress. Actually, if the query had been something like count(*), then no decompression would have been needed at all; the Knowledge Grid would have been able to answer the question directly.
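To make the contrast in the notes concrete, here are the two variants side by side (schema as on the slide); this is a sketch of the behavior described above, not output from a specific run.

-- Often resolvable from Knowledge Grid statistics alone, with no decompression:
SELECT COUNT(*)
FROM employees
WHERE salary > 50000 AND age < 65 AND job = 'Shipping' AND city = 'TORONTO';

-- Needs actual column values, so the one remaining "suspect" pack is decompressed:
SELECT DISTINCT city
FROM employees
WHERE salary > 50000 AND age < 65 AND job = 'Shipping' AND city = 'TORONTO';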

21 Infobright Architected on MySQL
“The world’s most popular open source database” Infobright leverages the connectors and interoperability of MySQL standards. We are built within the MySQL architecture. This integration with MySQL allows our solution to tie in seamlessly with any ETL and BI tool that is compatible with MySQL. For current MySQL users who are looking for a highly scalable analytic database, Infobright is ideal – It scales from hundreds of gigabytes to 50 terabytes and more, has the ease of use MySQL users expect, and uses the familiar MySQL administrative interface. Simple scalability path for MySQL users and OEMs No new management interface to learn Enables seamless connectivity to BI tools and MySQL drivers for ODBC, JDBC, C/C++, .NET, Perl, Python, PHP, Ruby, Tcl, etc.
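A minimal sketch of what this looks like in practice, assuming a standard MySQL client session against an Infobright server (the table definition is hypothetical; ENGINE=BRIGHTHOUSE is the columnar engine used in the sample script on the next slide):

SHOW ENGINES;                     -- the familiar MySQL statement; the Infobright engine is listed alongside MySQL's own
CREATE TABLE web_events (
  event_time  DATETIME,
  url         VARCHAR(255),
  bytes_sent  INT
) ENGINE=BRIGHTHOUSE;             -- same DDL as any MySQL table, different storage engine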

22 Sample Script (Create Table, Import, Export)
USE Northwind;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
  CustomerID varchar(5),
  CompanyName varchar(40),
  ContactName varchar(30),
  ContactTitle varchar(30),
  Address varchar(60),
  City varchar(15),
  Region char(15),
  PostalCode char(10),
  Country char(15),
  Phone char(24),
  Fax varchar(24),
  CreditCard float(17,1),
  FederalTaxes decimal(4,2)
) ENGINE=BRIGHTHOUSE;

-- Import the text file.
SET AUTOCOMMIT=0;
SET @bh_dataformat = 'txt_variable';
LOAD DATA INFILE "/tmp/Input/customers.txt" INTO TABLE customers
  FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
  LINES TERMINATED BY '\r\n';
COMMIT;

-- Export the data into BINARY format.
SET @bh_dataformat = 'binary';
SELECT * INTO OUTFILE "/tmp/output/customers.dat" FROM customers;

-- Export the data into TEXT format.
SELECT * INTO OUTFILE "/tmp/output/customers.text"
  FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
  LINES TERMINATED BY '\r\n'
  FROM customers;
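After the load, ordinary MySQL-style SQL can be run against the table; a simple sanity check might look like this (purely illustrative):

SELECT Country, COUNT(*) AS customer_count
FROM customers
GROUP BY Country
ORDER BY customer_count DESC;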

23 Infobright 4.0 – Additional Features
Built-in intelligence for machine-generated data: find the ‘needle in the haystack’ faster. DomainExpert: intelligence about machine-generated data drives faster performance; an enhanced Knowledge Grid with domain intelligence; automatically optimizes the database with no fine tuning; users can directly add domain expertise to drive faster performance. Near-real-time, ad-hoc analysis of Big Data. DLP: linear scalability of data load for very high performance. Rough Query: data mining “drill down” at RAM speed. Identify patterns in common web data types; compress and load faster; query and decompress faster. Enables you to define your own data types. No fine tuning of the DB. Near-real-time analysis. No query-based tuning required.

24 Work with Data Even Faster
DomainExpert: Breakthrough Analytics. Enables users to add intelligence into the Knowledge Grid directly, with no schema changes. Pre-defined/optimized for web data analysis: IP addresses, email addresses, URL/URI. Can cut query time in half when using this data definition. DomainExpert intelligence automatically optimizes the database.

25 DomainExpert: Prebuilt plus DIY options
Pattern recognition enables faster queries. Patterns are defined and stored; complex fields are decomposed into more homogeneous parts; the database uses this information when processing a query. Users can also easily add their own data patterns, identifying strings, numerics, or constants. Financial trading example: a ticker feed value “AAPL-350,354,347,349” is encoded as “%s-%d,%d,%d,%d”. This enables higher compression, with no schema changes.

26 Near-real time ad-hoc analysis
Get Data In Faster: DLP. Near-real-time ad-hoc analysis; linear scalability of data load for very high performance. The Distributed Load Processor (DLP) is an add-on product to IEE which linearly scales load performance. Remote servers compress data and build Knowledge Grid elements on the fly, which are then appended to the data server running the main Infobright database. It’s all about speed: faster loads and queries. Total load speed depends on the number and type of remote servers. Loads are faster due to distributing work onto other server nodes, with load speeds above 2TB per hour; queries are faster as well.

27 Get Data In Faster: Hadoop
Near-real-time ad-hoc analysis with Hadoop connectivity: use the right tool for the job. Big Data / Hadoop support: the DLP Hadoop connector extracts data from HDFS and loads it into Infobright at high speeds. You load hundreds of terabytes or petabytes into Hadoop for bulk storage and batch processing, then load terabytes into Infobright for near-real-time analytics using the Hadoop connector and DLP. Infobright and Hadoop are a perfect complement for analyzing Big Data.
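A hedged sketch of the batch hand-off this describes, assuming the summarized data has already been exported from HDFS to a delimited file (the connector and DLP automate this step; the file path and table are purely illustrative):

-- Bulk-load an HDFS extract into an Infobright table for interactive queries.
LOAD DATA INFILE '/data/hdfs_export/ad_events_2012-05-28.csv'
INTO TABLE ad_events
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';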

28 Rough Query: Speed Up Data Mining by 20x
Rough Query: another Infobright breakthrough. It enables very fast iterative queries to quickly drill down into large volumes of data. “Select roughly” to instantaneously see the interval ranges for relevant data; this uses only the in-memory Knowledge Grid information. Filtering can narrow results. Need more detail? Drill down further with a rough query, or query for the exact answer. Near-real-time ad-hoc analysis; data mining “drill down” at RAM speed.
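A hedged sketch of the iterative pattern described above (the trades table is hypothetical, and the exact ROUGHLY keyword placement is an assumption inferred from the slide's "select roughly" wording): the first statement returns approximate ranges from the in-memory Knowledge Grid only, and the second fetches the exact answer once the interesting interval has been found.

SELECT ROUGHLY MIN(amount), MAX(amount)
FROM trades
WHERE trade_date = '2012-05-28';   -- near-instant: answered from Knowledge Grid intervals

SELECT MIN(amount), MAX(amount)
FROM trades
WHERE trade_date = '2012-05-28';   -- exact: decompresses only the relevant Data Packs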

29 The Value Infobright Delivers
High performance with much less work and lower cost Faster queries without extra work No indexes No projections or cubes No data partitioning Faster ad-hoc analytics Fast load / High compression Multi-machine Distributed Load Processor Query while load (DLP) 10:1 to 40:1+ compression Lower costs Less storage and servers Low cost HW Low-cost subscriptions 90% less administration Faster time to production Download in minutes Minimal configuration Implement in days

30 Q & A

31 Infobright Use Cases

32 Infobright and Hadoop in Video Advertising: LiveRail
LiveRail’s Need / Infobright’s Solution. LiveRail’s platform enables publishers, advertisers, ad networks and media groups to manage, target, display and track advertising in online video. With a growing number of customers, LiveRail was faced with managing increasingly large data volumes. They also needed to provide near real-time access to their customers for reporting and ad hoc analysis. LiveRail chose two complementary technologies to manage hundreds of millions of rows of data each day: Apache Hadoop and Infobright. Detail is loaded hourly into Hadoop and at the same time summarized and loaded into Infobright. Customers access Infobright 7x24 for ad-hoc reporting and analysis and can schedule time if needed to access cookie-level data stored in Hadoop. “Infobright and Hadoop are complementary technologies that help us manage large amounts of data while meeting diverse customers’ needs to analyze the performance of video advertising investments.” Andrei Dunca, CTO of LiveRail

33 Example in Mobile Analytics: Bango
Bango’s Need / Infobright’s Solution. A leader in mobile billing and analytics services utilizing a SaaS model. Received a contract with a large media provider: 150 million rows per month, 450GB per month on SQL Server. SQL Server could not support the required query performance. Needed a database that could scale for much larger data sets with fast query response, fast implementation and low maintenance, in a cost-effective solution. Results: reduced queries from minutes to seconds; reduced the size of one customer’s database from 450 GB to 10 GB for one month of data.
1 Month Report (5MM events): 11 min on SQL Server vs. 10 secs on Infobright
Mobile Data (15MM events): 43 min on SQL Server vs. 23 secs on Infobright
Complex Filter (10MM events): 29 min on SQL Server vs. 8 secs on Infobright

34 Online Analytics: Yahoo!
Customer’s Need / Infobright’s Solution. The Pricing and Yield Management team is responsible for pricing online display ads and requires sophisticated analysis of terabytes of ad impression data. With the prior database, they could only store 30 days of summary data. They needed a database that could: store 6 months+ of detailed data, reduce the hardware needed, eliminate database admin work, and execute ad-hoc queries much faster. With Infobright: loading over 30 million records per day; can now store all detailed data and retain 6 billion records; 6TB of data is compressed to 600GB on disk; queries are very fast, and Yahoo! can do ad-hoc analysis without manual tuning; easy to maintain and support. “Using Infobright allows us to do pricing analyses that would not have been possible before. We now have access to all of our detailed Web impression data, and we can keep 6x the amount of data history we could previously.” Sr. Director PYM, Yahoo! Setting the prices for display ads is the responsibility of Yahoo!'s Pricing and Yield Management (PYM) team. They are also responsible for pricing analytics and pricing-yield business operations for all of Yahoo!'s display advertising business. This involves sophisticated analysis of very large volumes of impression-level Web data by highly skilled analysts within the PYM team. With the move toward finer and finer ad targeting (using more attributes to determine which ad to display), the process of pricing has become more complex as the team must price each specific set of attributes. In addition to setting the right price in the guaranteed display ad market, the PYM team must also determine how impressions are monetizing in their secondary marketplace; this is required for establishing appropriate pricing and deal evaluation. Not surprisingly, as more attributes are captured for analysis, the volume of data collected every day continues to increase. For the PYM pricing analysis application, that equals 20 to 30 million records per day. Previously Yahoo! used a traditional row-based database to capture and store the data. However, the high cost and high administrative effort it took was a barrier to storing all of the data the PYM team needed. Only summary data of user behavior could be stored, and the data history was limited to 30 days. This severely limited the number of attributes that could be considered in the pricing analysis, thereby limiting the analysis the PYM team could do and making the analysis less accurate. The PYM team decided they needed a new database to meet their needs: allow the storage of all of the detailed impression data instead of only summary data; extend data history to 6 months versus the 30 days they could keep previously; provide fast queries and flexible ad-hoc analytics; reduce the amount of hardware required; reduce the administrative effort involved in managing and tuning the database. Infobright solution: Store all detailed data: Yahoo! is currently loading close to 30 million records every day. This now includes all of the detailed data that the team wanted to have access to, rather than just the summary information they were able to store previously. Keep much more history, in much less space: Yahoo! now has the ability to store the six months of data they need for the most accurate analysis. This means that there are about 6 billion records in the database. What's more, as Infobright provides outstanding levels of data compression, the approximately 6TB of raw data that has been loaded only uses 600GB on disk (10:1 compression).
Faster queries, flexible ad-hoc analytics: Queries that took several minutes previously now run in seconds with Infobright. Easier to maintain and support: This allows the PYM team to focus on their primary responsibilities rather than devoting effort to database maintenance.

35 Case Study: JDSU Annual revenues exceeded $1.3B in 2010
4700 employees are based in over 80 locations worldwide Communications sector offers instruments, systems, software, services, and integrated solutions that help communications service providers, equipment manufacturers, and major communications users maintain their competitive advantage

36 JDSU Service Assurance Solutions
Ensure high quality of experience (QoE) for wireless voice, data, messaging, and billing. Used by many of the world’s largest network operators. At a Tier 1 carrier, load requirements were: 300,000 records per second, 18 million records per minute, 1 billion rows per hour. The total amount of data to be stored is between 700 TB and 1.8 PB.

37 JDSU Project Goals New version of Session Trace solution that would:
Support very fast load speeds to keep up with increasing call volume and the need for near real-time data access. Reduce the amount of storage by 5x, while also keeping much longer data history. Reduce overall database licensing costs 3x. Eliminate customers’ “DBA tax,” meaning it should require zero maintenance or tuning while enabling flexible analysis. Continue delivering the fast query response needed by Network Operations Center (NOC) personnel when troubleshooting issues, supporting up to 200 simultaneous users.

38 High Level View

39 Session Trace Application
For deployment at Tier 1 network operators, each site will store between 6 and 45TB of data, and the total data volume will range from 700TB to 1PB of data. The Session Trace application supports a broad set of network protocols, including the latest generation of 4G/LTE networks. Call data streams in at high speeds and populates multiple Infobright instances, based on geographic regions. The application includes a Web-based front end that passes information to a query generator, which then sends the query to the appropriate database instance.

40 Infobright Implementation

41 Save Time, Save Cost Fastest time to value Minimal administration
Download in minutes, install in minutes No indexes, no partitions, no projections No complex hardware to install Minimal administration Self-tuning Self-managing Eliminate or reduce aggregate table creation Outstanding performance Fast query response against large data volume Load speeds over 2TB /hour with DLP High data compression 10:1 to 40:1+ Economical Low subscription cost Less data storage Industry-standard servers Fastest time to value Download in minutes No setup needed, install in minutes No need for specialized schemas Use existing data model No indexes to create Simple HW – standard laptop to high-powered server Low administration No indexes to maintain No data partitioning Self-tuning Standard SQL Eliminate or reduce aggregate table creation Outstanding compression = less time to backup High performance Designed for analytics Fast query response against large data volume High speed loader

42 What Our Customers Say. “Using Infobright allows us to do pricing analyses that would not have been possible before.” “With Infobright, [this customer] has access to data within minutes of transactions occurring, and can run ad-hoc queries with amazing performance.” "Infobright offered the only solution that could handle our current data load and scale to accommodate a projected growth rate of 70 percent, without incurring prohibitive hardware and licensing costs." “Using Infobright allowed JDSU to meet the aggressive goals we set for our new product release: reducing storage and increasing data history retention by 5x, significantly reducing costs, and meeting the fast data load rate and query performance needed by the world’s largest network operators.”

43 Where does Infobright fit in the database landscape?
One Size DOESN’T fit all. Specialized databases deployed: excellent at what they were designed for; more open source specialized databases than commercial; cloud/SaaS use of specialty DBMSs has become popular; database virtualization has significantly lowered DBA costs. (Diagram: your warehouse surrounded by row, column, NoSQL, NewSQL and Hadoop databases.) The database landscape is in the midst of rapid change. I am old enough to remember when companies tried to position an Enterprise Data Warehouse as the single solution for organizations. That thinking is long gone; people have come to realize that, depending on your application, there are purpose-built databases: NoSQL, NewSQL, Hadoop, columnar databases and the traditional row-based databases.

44 NoSQL: Unstructured Data Kings
Tame the Unstructured: Store Anything, Keep Everything. Schema-less Designs. Extreme Transaction Rates. Massive Horizontal Scaling. Heavy Data Redundancy. Niche Players. Top NoSQL Offerings.

45 120+ Variants : Find More at nosql-databases.org
NoSQL breakout: Key-Value, Document Store, Hybrid, Column Store, Graph. NoSQL Key-Value Store: a single object stored in memory with some persistent backup; used as a cache for frequently requested data in web applications, online shopping carts, and social-media sites. NoSQL Document Store: objects persistently stored with some caching; indexing schema-less data to power web-traffic analysis, user-behavior/action analysis, and log-file analysis in real time. NoSQL Column Store: massive, persistent data storage; schema-less groupings of similar traits allow dynamic, all-inclusive websites (think Facebook), disparate data warehousing, and real-time, interactive communities (think Twitter).

46 What do we see with NoSQL
Strengths: application focused, programmatic API, capacity, lookup speed, streaming data. Weaknesses: generally no SQL interface, programmatic interfaces, expensive infrastructure, complexity, limits with analytics. What is the summary on NoSQL? Like the other databases, it has great strengths, but it is not a silver bullet to meet your BI and analytic requirements. It is great at running batch jobs against incredible volumes of unstructured data. We have several customers running Infobright in conjunction with Hadoop; Anu, who will be sharing the InMobi story, will relay their experience doing this.

47 Lest We Forget Hadoop Value Add
Scalable, fault-tolerant distributed system for data storage and processing. Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage. MapReduce: fault-tolerant distributed processing. Value add: flexible (store schema-less data and add as needed); affordable (low cost per terabyte); broadly adopted (an Apache project with a large, active ecosystem); proven at scale (petabyte+ implementations in production today). Nature of the data: complex data, multiple data sources, lots of it. Nature of the analysis: batch processing, parallel execution; spread data over a cluster of servers and take the computation to the data.

48 Hadoop Data Extraction
Nature of the data: complex data, multiple data sources, lots of it. Nature of the analysis: batch processing, parallel execution; spread data over a cluster of servers and take the computation to the data.

49 NewSQL: Operational, Relational Powerhouses
Overclock Relational Performance. Scale Out, Scale “Smart”. New, Scalable SQL. Extreme Transaction Rates. Diverse Technologies. ACID Compliance.

