Open Source Data Mining Software
Tao Su, Xin Xiao
Computer and Network Security Group; Data Base and Data Mining Group

Free Software License
- A notice that grants the recipient of a piece of software extensive rights to modify and redistribute that software.
- The rights-holder (usually the author) of a piece of software can remove the restrictions of copyright law by accompanying the software with a free software license.
- The most widely used free software license is the GNU General Public License (GPL).

Open Source Software
- Source code is available and licensed under an open-source license.
- Often developed in a public and collaborative manner.
- Provides users with the freedom to run, copy, distribute, study, change and improve the software.

What is Data Mining?
- The process of extracting new and useful knowledge from large amounts of data.
- Used to solve many business problems, such as customer behavior modeling, credit scoring and product recommendation.
- Adopted in many industries, e.g. retail, banking, finance and medicine.

Data Mining Framework (CRISP-DM)
1. Business objective
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

Data Mining Techniques (1)
Prediction methods use some variables to predict unknown or future values of other variables.
- Classification: generalize a known structure to apply it to new data, e.g. email -> “legitimate” or “spam”.
- Regression: find a function that models the data with the least error, e.g. advertising expenditure -> sales amount.
- Deviation Detection: detect significant deviations from normal behavior, e.g. credit card fraud detection.
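
The regression case (advertising expenditure -> sales amount) can be illustrated with a minimal sketch of ordinary least squares for a single input variable. The function name `fit_line` and the data values are made up for illustration; they do not come from the slides or from any of the tools discussed later.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and the resulting sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)

# Predict the sales amount for an expenditure not seen in the data.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

Real data rarely falls exactly on a line; the same function then returns the line with the smallest total squared error.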

Data Mining Techniques (2)
Description methods find human-interpretable patterns that describe the data.
- Clustering: discover groups whose members are more similar to each other than to the rest of the data, e.g. subdivide a market into distinct subsets of customers based on their geographical and lifestyle information.
- Association Rule Discovery: search for relationships between items, e.g. determine which products are frequently bought together in a supermarket.
- Sequential Pattern Discovery: find sets of ordered items that frequently occur together in sequences, e.g. DNA sequence analysis.
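
As an illustration of clustering, here is a minimal sketch of k-means on two-dimensional points. The slides do not name a specific clustering algorithm, so k-means is an assumption chosen for brevity; the function name `kmeans`, the deterministic initialization and the customer data are all hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters; returns (centroids, clusters)."""
    # Assumption: use the first k points as starting centroids so the
    # result is deterministic. Real tools usually pick them at random.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, clusters


# Hypothetical customers described by two numeric attributes.
customers = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]

centroids, clusters = kmeans(customers, k=2)
print(centroids)
print(clusters)
```

With this data the two groups are well separated, so the assignment stops changing after the first iteration.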

Open Source Software in Data Mining
- Commonly used in data mining research, education, industrial applications and enterprises.
- Available at no cost, so anyone is free to use it to learn data mining techniques.
- Integrates different techniques, including data preprocessing.
- Allows extension with new methods and modification of the source code for specific purposes.
- Able to operate on large datasets and to access data sources in different formats.

Open Source Data Mining Software Examples
A data mining software system may integrate many operations and provide an easy-to-use (often graphical) user interface for performing data mining tasks effectively.
- Orange
- Weka
- ScaVis
- KNIME
- R
- RapidMiner

Orange
- Component-based data mining and machine learning software suite
- Python bindings and libraries for scripting
- Graphical user interface built on the cross-platform Qt framework
- Distributed free under the GNU General Public License
- Supports multiple platforms

Orange: Features
- Visual Programming: the data analysis process can be designed through visual programming
- Visualization: scatter plots, bar charts, trees, dendrograms, networks, heatmaps
- Large Toolbox: over 100 widgets and growing
- Scripting Interface: use the Python scripting interface to program new algorithms and develop complex data analysis procedures
- Extensible: users can develop their own widgets, extend the scripting interface or create self-contained add-ons

Weka
- Developed at the University of Waikato, New Zealand
- A popular suite of machine learning software
- Written in Java
- Uses the Explorer as its user interface; can also be accessed from the command line
- Free software available under the GNU General Public License
- Supports multiple platforms

ScaVis
- Designed for interactive scientific plots in 2D and 3D, for scientific computation, data analysis and data visualization
- Written in Java and Jython
- Runs scripts or Java code in either GUI-driven mode or batch mode
- Mixed license: the core engine is GPL, but the installer, documentation and other components are free only for non-commercial purposes
- Supports multiple platforms

ScaVis: Features
- Analytic Computations: in this mode, the Matlab/Octave high-level interpreted language can be used
- Statistical Packages: more than 10,000 Java classes and methods packed into a 50 MB library package
- API for data input and output: native Java I/O, Python I/O, Java-native SQL database, object-oriented database, etc.
- IDE with code assist: also supports C/C++, PHP, FORTRAN, etc. Code assist allows Jython/Python code to be mixed with LaTeX equations to produce scientific articles.

KNIME
- Developed at the University of Konstanz, Germany
- Integrates various components for machine learning and data mining
- Written in Java and based on Eclipse
- Graphical user interface
- Distributed free under the GNU General Public License
- Supports multiple platforms

KNIME: Features
- Open Integration Platform: over 1000 modules
- Big Data Extensions: available for distributed frameworks such as Hadoop
- Integration: integrates Weka modules and R scripts
- Extensible: based on the Eclipse platform, it can be extended through its modular API

R
- Programming language and software environment for statistical computing and graphics
- Polls and surveys show that R's popularity has increased substantially in recent years
- Written in C, Fortran and R
- Uses a command-line interface
- R is a GNU project, freely available under the GPL
- Supports multiple platforms

RapidMiner
- Environment for machine learning, data mining, text mining, predictive analytics and business analytics
- Ranked first among data mining/analytic tools used for real projects in the 2010 KDnuggets poll
- More than 500 operators
- Developed in Java
- Distributed under the AGPL open source license, which is based on the GPL
- Provides a GUI and generates an XML file describing the process
- Integrates many extensions, such as the Weka extension and the R extension
- Supports multiple platforms

Data Mining Process Example in RapidMiner
- In the DBDMG group at PoliTo, most people use RapidMiner as their tool for data mining analysis.
- Example: a process that extracts frequent itemsets and association rules using the FP-Growth algorithm (an association rule technique).
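
To show what such a process computes, the following sketch finds frequent itemsets and association rules by plain enumeration of candidate itemsets. It is not FP-Growth and not RapidMiner code: FP-Growth produces the same frequent itemsets much more efficiently by building a prefix tree instead of testing every candidate. The function names, thresholds and shopping baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Support = fraction of transactions containing the whole candidate.
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
                found = True
        if not found:
            break  # no frequent itemset of this size, so none of a larger size
    return result


def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above the threshold."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Confidence = support(itemset) / support(antecedent).
                confidence = support / itemsets[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

The lookup `itemsets[antecedent]` is safe because every subset of a frequent itemset is itself frequent at the same support threshold.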

Add || Modify || Extend Code in RapidMiner
There are several ways to modify the code in RapidMiner for a specific purpose:
- Write a script directly in the “Execute Script” operator to build a new operator.
- Run RapidMiner in Eclipse, and add or modify the Java classes there.
- Write your own RapidMiner extension in Eclipse and build it as a new plugin.

References
- Xiaojun Chen, Graham Williams, and Xiaofei Xun. A Survey of Open Source Data Mining Systems. In: Proceedings of the 2007 ACM international conference on Emerging technologies in knowledge discovery and data mining (PAKDD'07).