
1 Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses
Matt Casters: Chief Architect, Data Integration, and Kettle Project Founder
MySQL User Conference, Wednesday, April 25, 2007

2 Agenda
- Big News
- Data Integration challenges and open source BI adoption
- Pentaho company overview
- Pentaho Data Integration Fundamentals: schema design, Kettle basics
- Demonstration
- Resources and links

3 Announcing Pentaho Data Integration 2.5.0
Again we offer big improvements over smash-hit version 2.4.0:
- Advanced error handling
- Tight Apache VFS integration, allowing us to directly load and save files from any location: file systems, web servers, FTP sites, ZIP files, tar files, etc.
- Dimension key caching, dramatically improving speed
- A slew of new job entries and steps (including MySQL bulk operations)
- Hundreds of bug fixes

4 Managing Data Chaos: Data Integration Challenges
- Data is everywhere: customer order information in one system, customer service information in another
- Data is inconsistent: the record of the customer is different in each system
- Performance is an issue: running queries to summarize 3 years of data in the operational system takes forever AND it brings the operational system to its knees
- The data is never ALL in the data warehouse: acquisitions, Excel spreadsheets, new applications
[Diagram: Customer Service History, Customer Order History, Marketing Campaigns, XML, and an acquired system all feeding the Data Warehouse]

5 How Pentaho Extends MySQL with ETL
MySQL provides:
- Data storage
- SQL query execution
- Heavy-duty sorting, correlation, aggregation
- Integration point for all BI tools
Kettle provides:
- Data extraction, transformation, and loading
- Dimensional modeling
- SQL generation
- Aggregate creation (see the sketch below)
- Data enrichment / calculations
- Data migration
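As an illustration of that split: Kettle can generate aggregate-creation SQL like the following and hand it to MySQL, which does the heavy lifting. The table and column names here are hypothetical, not from the deck:

  CREATE TABLE monthly_sales AS
  SELECT YEAR(order_date)  AS yr,
         MONTH(order_date) AS mth,
         SUM(amount)       AS total_amount  -- MySQL performs the aggregation
    FROM orders
   GROUP BY YEAR(order_date), MONTH(order_date);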

6 Sample Companies that Use MySQL and Kettle from Pentaho
“With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution.”
“We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements, from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”

7 “We chose Pentaho because it has a full range of functionality, exceptional flexibility, and a low total cost of ownership because of its open source business model. We can start delivering value to our business users quickly with embedded, web-based reporting, while integrating our disparate data sources for more strategic benefits down the road.”
Other Kettle users… and thousands more.

8 Pentaho Introduction
- World's most popular enterprise open source BI suite: 2 million lifetime downloads, averaging 100K / month
- Founded in 2004: pioneer in professional open source BI
- Key projects: JFreeReport (reporting), Kettle (data integration), Mondrian (OLAP), Pentaho BI Platform, Weka (data mining)
- Management and board: proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle; open source leaders including Larry Augustin, New Enterprise Associates, Index Ventures
- MySQL Gold Partner

9 Overview: Data Warehouse Data Flow
From source systems… to the data warehouse… to reports… to analyses… to dashboard reports… to better information

10 Pentaho Introduction
[Diagram: BI pyramid from operational systems (Sales, Marketing, Inventory, Financial, Production) through departmental reports, aggregates, and analysis up to strategic scorecards]

11 The star schema: a new data model is needed
Because data from various sources is “mixed”, we need to design a new data model: a star schema. A star schema is designed based on the requirements and populated by the ETL engine. During modeling we split the requirements up into facts and dimensions.

12 The star schema: a new data model is needed
After grouping the dimension attributes by subject we get our data model. For example:
[Diagram: star schema with Customer, Product, Date, and Order dimensions around the Order Line fact table]

13 Overview: A new data model is needed
The fact table contains ONLY facts and dimension technical keys:

Column             | Type          | Data type
date_tk            | Technical key | Bigint
customer_tk        | Technical key | Bigint
order_tk           | Technical key | Bigint
product_tk         | Technical key | Bigint
number_of_products | Fact          | Smallint
turnover           | Fact          | Float
pct_discount       | Fact          | Tinyint
discount           | Fact          | Float
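As a minimal MySQL sketch of this fact table; the table name is taken from the star schema diagram on the previous slide, not from actual deck DDL:

  CREATE TABLE order_line_fact (
    date_tk            BIGINT NOT NULL,  -- technical key into the date dimension
    customer_tk        BIGINT NOT NULL,  -- technical key into the customer dimension
    order_tk           BIGINT NOT NULL,  -- technical key into the order dimension
    product_tk         BIGINT NOT NULL,  -- technical key into the product dimension
    number_of_products SMALLINT,         -- fact
    turnover           FLOAT,            -- fact
    pct_discount       TINYINT,          -- fact
    discount           FLOAT             -- fact
  );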

14 Overview: A new data model is needed
The dimensions contain technical fields, as in this customer dimension entry for customer_id = 100 (NAL = Name, Address & Location):

TK  | Version | date_from | date_to | cust_id | name    | NAL*     | Birth_date
101 | 1       |           |         | 100     | Matt C. | Address1 | 1900-01-01

If the address changes (at time T1) we get a new entry in the dimension. This is called a Ralph Kimball type II dimension update:

TK  | Version | date_from | date_to | cust_id | name    | NAL*     | Birth_date
101 | 1       |           | T1      | 100     | Matt C. | Address1 | 1900-01-01
542 | 2       | T1        |         | 100     | Matt C. | Address2 | 1900-01-01
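For concreteness, a minimal MySQL sketch of this customer dimension; the column data types are assumptions, not the deck's actual DDL:

  CREATE TABLE customer_dim (
    customer_tk BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, -- technical key (TK)
    version     INT,          -- version number of this record for the customer
    date_from   DATETIME,     -- validity start (NULL = from the beginning of time)
    date_to     DATETIME,     -- validity end (NULL = still the current version)
    cust_id     INT,          -- natural/business key from the source system
    name        VARCHAR(64),
    address     VARCHAR(128), -- the NAL (Name, Address & Location) attributes
    birth_date  DATE
  );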

15 Overview: A new data model is needed
If the birth_date changes we update all entries in the dimension. This is called a Ralph Kimball type I dimension update (NAL = Name, Address & Location):

TK  | Version | date_from | date_to | cust_id | name    | NAL*     | Birth_date
101 | 1       |           | T1      | 100     | Matt C. | Address1 | 1969-02-14
542 | 2       | T1        |         | 100     | Matt C. | Address2 | 1969-02-14
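Kettle's dimension update step performs both kinds of change for you; purely as an illustration, the equivalent SQL against the sketch table above would look roughly like this:

  -- Type II (address change at time T1): close the current version, add a new one
  SET @t1 = NOW();
  UPDATE customer_dim
     SET date_to = @t1
   WHERE cust_id = 100 AND date_to IS NULL;
  INSERT INTO customer_dim (version, date_from, cust_id, name, address, birth_date)
  VALUES (2, @t1, 100, 'Matt C.', 'Address2', '1900-01-01');

  -- Type I (birth_date correction): overwrite the attribute in ALL versions
  UPDATE customer_dim
     SET birth_date = '1969-02-14'
   WHERE cust_id = 100;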

16 Implications
- We are making it easier to create reports by using star schemas
- We are shifting work from the reporting side to the ETL side
- We need a good toolset to do ETL because of the complexities
- We need to turn everything upside down… and this is where Pentaho Data Integration comes in.

17 Data Transformation and Integration Examples
- Data filtering: is not null, greater than, less than, includes
- Field manipulation: trimming, padding, upper- and lowercase conversion
- Data calculations: + - × /, average, absolute value, arctangent, natural logarithm
- Date manipulation: first day of month, last day of month, add months, week of year, day of year
- Data type conversion: string to number, number to string, date to number
- Merging fields & splitting fields
- Looking up data: look up in a database, in a text file, an Excel sheet, …
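Kettle performs all of these inside its own engine, with no SQL required; purely for orientation, here are rough MySQL equivalents of a few of the listed operations (the source table and columns are hypothetical):

  SELECT TRIM(name)                             AS trimmed_name,   -- field manipulation
         UPPER(name)                            AS upper_name,
         ABS(discount)                          AS abs_discount,   -- data calculation
         LAST_DAY(order_date)                   AS month_end,      -- date manipulation
         DATE_ADD(order_date, INTERVAL 1 MONTH) AS next_month,
         CAST(turnover AS CHAR)                 AS turnover_text   -- type conversion
    FROM source_orders
   WHERE customer_id IS NOT NULL;                                  -- data filtering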

18 Pentaho Data Integration (Kettle) Components
Spoon:
- Connect to data sources
- Define transformation rules and design target schema(s)
- Graphical job execution workflow engine for defining multi-stage and conditional transformation jobs
Pan:
- Command-line execution of single, pre-defined transformation jobs
Kitchen:
- Scheduler for multi-stage jobs
Pentaho BI Platform:
- Integrated scheduling of transformations or jobs
- Ability to call real-time transformations and use output in reports and dashboards

19 Demonstration (a sketch of the setup follows this list)
- Create a MySQL database + repository
- Create dimensions
- Create facts
- Auditing & incremental loading
- Jobs
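A hedged sketch of the kind of starting point the demo implies; the database and table names are illustrative, not the actual demo script:

  CREATE DATABASE dwh;
  USE dwh;
  -- a simple audit table to support auditing & incremental loading:
  -- each run records a high-water mark so the next run picks up only new rows
  CREATE TABLE load_audit (
    batch_id     BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    table_name   VARCHAR(64),
    loaded_up_to DATETIME,
    loaded_at    DATETIME
  );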

20 Case Study: Pentaho Data Integration
- Organization: Flemish Government Traffic Centre
- Use case: monitoring the state of the road network
- Application requirement: integrate minute-by-minute data from 570 highway locations for analysis
- Technical challenges: large volume of data, more than 2.5 billion rows
- Business usage: users can now compare traffic speeds based on weather conditions, time of day, date, season
- Best practices: clearly understand business user requirements first; there are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way

21 Case Study: Replacement of Proprietary Data Integration
- Organization: large, public, North American based genetics and pharmaceutical research firm
- Application requirement: data warehouse for analysis of patient trials and research spending
- Incumbent BI vendor: Oracle (Oracle Warehouse Builder)
- Decision criteria: ease of use, openness, cost of ownership
- “It was so much quicker and easier to do the things we wanted to do, and so much easier to maintain when our users' business requirements change.”
- Best practices: evaluate replacement costs holistically; treat migrations as an opportunity to improve a deployment, not just move it; good deployments are iterative and evolve regularly, and if users like what you give them, they will probably ask for more

22 Summary and Resources
Pentaho and MySQL can help you manage your data infrastructure: Extraction, Transformation and Loading for data warehousing and data migration.
- kettle.pentaho.org: Kettle project homepage
- kettle.javaforge.com: Kettle community website (forum, source, documentation, tech tips, samples, …)
- www.pentaho.org/download/: all Pentaho modules, pre-configured with sample data; developer forums and documentation
- Ventana Research Open Source BI Survey: www.mysql.com
- White paper: http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
- Kettle webinar: http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
- Roland Bouman's blog on Pentaho Data Integration and MySQL: http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html

