
1 Open Source Data Mining Software
Tao Su, Xin Xiao
Computer and Network Security Group
Data Base And Data Mining Group

2 Free Software License
- A notice that grants the recipient of a piece of software extensive rights to modify and redistribute that software.
- The rights holder (usually the author) can lift the restrictions that copyright law would otherwise impose by accompanying the software with a free software license.
- The most widely used free software license is the GNU General Public License (GPL).

3 Open Source Software
- Source code is available and licensed under an open-source license.
- Often developed in a public, collaborative manner.
- Gives users the freedom to run, copy, distribute, study, change and improve the software.

4 What is Data Mining?
- The process of extracting new and useful knowledge from large amounts of data.
- Used to solve many business problems, such as customer behavior modeling, credit scoring and product recommendation.
- Adopted in many industries, e.g. retail, banking, finance and medicine.

5 Data Mining Framework (CRISP-DM)
1. Business objective
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

6 Data Mining Techniques (1)
Prediction methods use some variables to predict unknown or future values of other variables.
- Classification: generalize known structure to apply to new data, e.g. email -> "legitimate" or "spam".
- Regression: find a function that models the data with the least error, e.g. advertising expenditure -> sales amount.
- Deviation Detection: detect significant deviations from normal behavior, e.g. credit card fraud detection.
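The regression example on this slide (advertising expenditure -> sales amount) can be sketched in a few lines. This is a minimal illustration using ordinary least squares for a straight line, which is only one way to "model the data with the least error". The function name `fit_line` and the data values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and the resulting sales amount.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)

# Predict the sales amount for a new, unseen expenditure.
predicted_sales = slope * 5.0 + intercept
print(slope, intercept, predicted_sales)
```

The fitted line is then used for prediction: a new advertising expenditure goes in, and an estimated sales amount comes out.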

7 Data Mining Techniques (2)
Description methods find human-interpretable patterns that describe the data.
- Clustering: discover clusters whose members are more similar to each other than to the rest of the data, e.g. subdivide a market into distinct subsets of customers based on their geographical and lifestyle information.
- Association Rule Discovery: search for relationships between items, e.g. determine which products are frequently bought together in a supermarket.
- Sequential Pattern Discovery: find sets of ordered items that occur together frequently in some sequences, e.g. DNA sequence analysis.
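To make the clustering idea concrete, here is a small sketch of k-means, one common clustering algorithm (the slides do not name a specific one). The function name `kmeans`, the fixed starting centroids and the sample points are all assumptions made for illustration. Each point could stand for a customer described by two numeric attributes.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers, each described by two numeric attributes.
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(customers, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Starting centroids are passed in explicitly so that the run is repeatable. Real tools usually pick them at random or with a seeding heuristic.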

8 Open Source Software in Data Mining
- Commonly used in data mining research, education, industrial applications and enterprises.
- Available at no cost, so anyone is free to learn data mining techniques with it.
- Integrates different techniques, including data preprocessing.
- Allows new methods to be added and the source code to be modified for specific purposes.
- Can operate on large datasets and access data sources in different formats.

9 Open Source Data Mining Software Examples
A data mining software system may integrate many operations and provide an easy-to-use (often graphical) user interface for carrying out data mining tasks effectively.
- Orange
- Weka
- ScaVis
- KNIME
- R
- RapidMiner

10 Orange
- Component-based data mining and machine learning software suite
- Python bindings and libraries for scripting
- Graphical user interface built on the cross-platform Qt framework
- Distributed free under the GNU General Public License
- Multi-platform

11 Orange: Features
- Visual Programming: the data analysis process can be designed through visual programming.
- Visualization: scatter plots, bar charts, trees, dendrograms, networks, heatmaps.
- Large Toolbox: over 100 widgets and growing.
- Scripting Interface: use the Python scripting interface to program new algorithms and develop complex data analysis procedures.
- Extensible: users can develop their own widgets, extend the scripting interface or create self-contained add-ons.

12 Weka
- Developed at the University of Waikato, New Zealand
- A popular suite of machine learning software
- Written in Java
- Uses the Explorer as its user interface; can also be accessed from the command line
- Free software available under the GNU General Public License
- Multi-platform

13 ScaVis
- Designed for interactive scientific plots in 2D and 3D for scientific computation, data analysis and data visualization
- Written in Java and Jython
- Runs scripts or Java code in either GUI-driven mode or batch mode
- Mixed license: the core engine is GPL, but the installer, documentation and other components are free only for non-commercial purposes
- Multi-platform

14 ScaVis: Features
- Analytic Computations: in this mode, the Matlab/Octave high-level interpreted language can be used.
- Statistical Packages: more than 10 thousand Java classes and methods packed in a 50 MB library pack.
- API for data input and output: native Java I/O, Python I/O, Java-native SQL database, object-oriented database, etc.
- IDE with code assist: also supports C/C++, PHP, FORTRAN, etc. Code assist makes it possible to mix Jython/Python code with LaTeX equations when writing scientific articles.

15 KNIME
- Developed at the University of Konstanz, Germany
- Integrates various components for machine learning and data mining
- Written in Java and based on Eclipse
- Graphical user interface
- Distributed free under the GNU General Public License
- Multi-platform

16 KNIME: Features
- Open Integration Platform: over 1000 modules.
- Big Data Extensions: available for distributed frameworks such as Hadoop.
- Integration: integrates Weka modules and R scripts.
- Extensible: based on the Eclipse platform, it can be extended through its modular API.

17 R
- Programming language and software environment for statistical computing and graphics
- Polls and surveys show that R's popularity has increased substantially in recent years
- Written in C, Fortran and R
- Uses a command-line interface
- R is a GNU project, freely available under the GPL
- Multi-platform

18 RapidMiner
- Environment for machine learning, data mining, text mining, predictive analytics and business analytics
- Ranked first among data mining/analytic tools used for real projects in the 2010 KDnuggets poll
- More than 500 operators
- Developed in Java
- Distributed under the AGPL open source license, which is based on the GPL
- Provides a GUI and generates an XML file describing the process
- Integrates many extensions, such as the Weka extension and the R extension
- Multi-platform

19 Data Mining Process Example in RapidMiner
- In the DBDMG group at PoliTo, most people use RapidMiner as their tool for data mining analysis.
- Example: a process that extracts frequent itemsets and rules using the FP-Growth algorithm (an association rule technique).
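The output of such a process (frequent itemsets, then rules) can be illustrated without RapidMiner. The sketch below is not FP-Growth: it enumerates candidate itemsets by brute force, which gives the same result on small data but would not scale the way FP-Growth does. Two standard measures are assumed here: support, the fraction of transactions containing an itemset, and confidence, the support of the whole itemset divided by the support of the rule's left-hand side. The function names, thresholds and the shopping-basket data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support.
    Brute-force enumeration, NOT FP-Growth; suitable for tiny examples only."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
                found = True
        if not found:
            # If no itemset of this size is frequent, no larger one can be.
            break
    return result


def association_rules(itemsets, min_confidence):
    """Derive rules (antecedent, consequent, confidence) from frequent itemsets."""
    rules = []
    for itemset, support in itemsets.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                lhs = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so lhs is present.
                confidence = support / itemsets[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

itemsets = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.8)
for lhs, rhs, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), confidence)
```

With these thresholds the only rule that survives is butter -> bread: every basket that contains butter also contains bread.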

20 Add || Modify || Extend Code in RapidMiner
There are several ways to modify the code in RapidMiner for a specific purpose:
- Write a script directly in the "Execute Script" operator to build a new operator.
- Run RapidMiner in Eclipse, and add or modify the Java classes there.
- Write your own RapidMiner Extension in Eclipse and build a new plugin.

21 References
- Xiaojun Chen, Graham Williams, and Xiaofei Xun. A Survey of Open Source Data Mining Systems. In: Proceedings of the 2007 ACM international conference on Emerging technologies in knowledge discovery and data mining (PAKDD'07).
- http://en.wikipedia.org/wiki/Data_mining

