Application of Data Mining Techniques on Survey Data using R and Weka


1 Application of Data Mining Techniques on Survey Data using R and Weka
Supunmali Ahangama 29/11/2013

2 Outline
Introduction to data mining in R
Introduction to data mining in Weka
Example

3 What is R?

4 Why Learn R?
R offers more analytical methods and now over 1000 add-on packages are available
R is far more flexible in the type of data it can analyze
R’s procedures (functions) are open for you to see and modify
R is free
If you already know SAS or SPSS, why should you bother to learn R? Both SAS and SPSS are excellent statistics packages. I use them both almost daily. If they meet your needs, and you do not mind paying for them, there is little point in learning another package. However, R offers a lot:
R offers more analytical methods. There are now well over 1000 add-on packages available for R, and R can download and install them directly from the Internet. It takes most statistics packages at least 5 years to add a major new analytic method. Statisticians who develop new methods often work in R, so R users often get to use new methods immediately.
You can use R while knowing very little about it. You can do all your data management with any software you prefer, and learn just enough R to import a file and run the procedure you need. If you are an SPSS user, you can run R programs from within SPSS programs, allowing you to do much of your work in a familiar environment while avoiding the cost of the various add-on modules for SPSS.
R is far more flexible in the type of data it can analyze. While SAS and SPSS require you to store your data in rectangular datasets, R offers a rich variety of data structures that are much more flexible. You can perform analyses that include variables from different data structures easily without having to merge them.
R’s language is more powerful than SAS or SPSS. R developers write most of their analytic methods using the R language; SAS and SPSS developers do not use their own languages to develop their procedures.
R’s procedures, which it calls functions, are open for you to see and modify. Functions that you write in R are automatically on an equal footing with those that come with the software. The ability to write your own completely integrated procedures in SAS or SPSS requires using a different language such as C or Python, and in the case of SAS, a developer’s kit.
R’s graphics are extremely flexible and are of publication quality. They are flexible enough to overlay data from different datasets, even at different levels of aggregation.
R runs on almost any computer, including Windows, Macintosh, Linux, and UNIX.
R has full matrix capabilities that are quite similar to MATLAB, and it even offers a MATLAB emulation package [6]. For a comparison of R and MATLAB, see id=gettingstarted:translations:octave2r.
R is free.

5 The Popularity of R is Growing Fast
#1 most used data mining tool (in both 2010 and 2011), up from #5 in 2007
An increasing number of data miners consider R their primary tool: #2 in primary tool rankings (in 2011), up from #7 in 2008
Reference: Rexer Analytics Data Miner Survey summary report

6 Data Mining Software
Reference: Rexer Analytics Data Miner Survey summary report

7 Graphical User Interface (GUI)
RStudio
R Commander
Rattle
Deducer
Revolution Analytics
Reference: Rexer Analytics Data Miner Survey summary report

8 Rattle Installation
Start up R (v3.0.2) and then run:
> install.packages("rattle")
> library("rattle")
> rattle()
Artificial Neural Network (ANN) package: neuralnet 1.32

9 Weka Waikato Environment for Knowledge Analysis
A collection of machine learning algorithms and visualization tools
Written in Java
RWeka – An R interface for Weka
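As a rough illustration of the RWeka interface mentioned above, the sketch below wraps Weka's multilayer perceptron so it can be called from R. It assumes the RWeka package and a Java runtime are installed; the helper name `fit_weka_mlp`, the control values, and the use of the built-in `iris` data are illustrative choices, not part of the original slides.

```r
# Hypothetical helper: fit Weka's multilayer perceptron through RWeka.
# Assumes RWeka (and Java) are installed; otherwise it returns NULL.
fit_weka_mlp <- function(formula, data) {
  if (!requireNamespace("RWeka", quietly = TRUE)) {
    message("RWeka is not available; run install.packages(\"RWeka\") first.")
    return(NULL)
  }
  # Build an R interface to the Weka classifier class.
  MLP <- RWeka::make_Weka_classifier(
    "weka/classifiers/functions/MultilayerPerceptron")
  # L = learning rate, N = training epochs, H = hidden layers ("a" = automatic).
  MLP(formula, data = data,
      control = RWeka::Weka_control(L = 0.3, N = 500, H = "a"))
}

# Example call on R's built-in iris data (a stand-in, not the survey data).
model <- fit_weka_mlp(Species ~ ., iris)
if (!is.null(model)) print(model)
```

The guard keeps the script runnable on machines without RWeka; with RWeka present, `predict(model, newdata)` can be used on the fitted object in the usual R way.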

10 Data set (Y. Hayashi & R. Setiono 2010)
Aim: To discover factors that could be used to distinguish consumers who eat out frequently from those who do not.
The survey was conducted in major cities in Taiwan in 2003 among consumers aged 15 to 64 years.
Target: Class 1 if the respondent’s eat-out frequency is less than 25 times per month on average; Class 2 otherwise.
Predictor variables: the respondent’s eating-out considerations and personal characteristics (socio-demographics, psychological information).
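The target coding described above could be written in R roughly as follows. This is a minimal sketch on made-up values; the data frame `survey` and the column `eat_out_per_month` are hypothetical names, not taken from the original data set.

```r
# Hypothetical responses: average number of times eating out per month.
survey <- data.frame(
  respondent = 1:6,
  eat_out_per_month = c(3, 24, 25, 40, 10, 31)
)

# Class 1: fewer than 25 times per month; Class 2: otherwise (25 or more).
survey$class <- factor(ifelse(survey$eat_out_per_month < 25, 1, 2),
                       levels = c(1, 2),
                       labels = c("Class 1", "Class 2"))

# Class balance of the coded target.
print(table(survey$class))
```

Storing the target as a factor lets R modelling functions treat the problem as classification rather than regression.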

11 Methodology: ANN
ANN: Artificial Neural Network
A set of connected input and output units in which each connection has a weight associated with it
The network learns by adjusting the weights so that it can predict the correct class label of the input tuples
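The idea of learning by adjusting weights can be sketched in base R with a very small network. The example below is an assumption-laden toy: one hidden layer of two sigmoid units, plain gradient descent on squared error, and a made-up logical-OR data set. It is not the network or training procedure used in the study.

```r
set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Toy data (hypothetical): logical OR of two binary inputs.
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 1)

n_hidden <- 2
W1 <- matrix(rnorm(2 * n_hidden, sd = 0.5), nrow = 2)  # input -> hidden weights
b1 <- rep(0, n_hidden)                                 # hidden biases
W2 <- rnorm(n_hidden, sd = 0.5)                        # hidden -> output weights
b2 <- 0                                                # output bias

# Forward pass: hidden activations, then the output activation.
forward <- function(X, W1, b1, W2, b2) {
  H <- sigmoid(sweep(X %*% W1, 2, b1, "+"))
  out <- sigmoid(as.vector(H %*% W2) + b2)
  list(H = H, out = out)
}

mse <- function(out, y) mean((out - y)^2)
loss_before <- mse(forward(X, W1, b1, W2, b2)$out, y)

lr <- 0.5  # learning rate (arbitrary choice)
for (epoch in 1:5000) {
  f <- forward(X, W1, b1, W2, b2)
  # Gradients of half the summed squared error, propagated backwards.
  d_out <- (f$out - y) * f$out * (1 - f$out)   # one value per training tuple
  d_hid <- (d_out %o% W2) * f$H * (1 - f$H)    # tuples x hidden units
  W2 <- W2 - lr * as.vector(t(f$H) %*% d_out)
  b2 <- b2 - lr * sum(d_out)
  W1 <- W1 - lr * (t(X) %*% d_hid)
  b1 <- b1 - lr * colSums(d_hid)
}

loss_after <- mse(forward(X, W1, b1, W2, b2)$out, y)
cat("MSE before:", loss_before, " after:", loss_after, "\n")
```

Each pass nudges every weight in the direction that reduces the prediction error, which is the weight adjustment the slide refers to.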

12 ANN Black box

13 Tools
Rattle GUI
Weka

14 Techniques for Ferreting Out Information from Trained ANN
Sensitivity analysis
Probe the ANN with test inputs and record the outputs
Determine the impact or effect of an input variable on the output: hold the other inputs at some fixed value (e.g. mean or median value) and vary only that input while monitoring the change in outputs
A measure of the degree to which each input contributes to the output error: the largest error → the largest impact
Rule extraction
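The probing procedure above can be sketched in base R. In this illustrative example a logistic regression stands in for the trained ANN (any fitted model with a `predict` method could be substituted), the data are simulated, and the variable names `x1`, `x2`, `x3` are hypothetical. Sensitivity is measured here as the range of the predicted output when one input is varied and the others are held at their means, which is one simple choice among several.

```r
set.seed(42)
n <- 500
# Simulated inputs: x1 has a strong effect, x2 a weak one, x3 none.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x1 + 0.5 * dat$x2))

# Stand-in for the trained network.
model <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

inputs <- c("x1", "x2", "x3")
# One-row data frame holding every input at its mean.
baseline <- as.data.frame(lapply(dat[inputs], mean))

sensitivity <- sapply(inputs, function(v) {
  # Vary only input v across its observed range.
  grid <- seq(min(dat[[v]]), max(dat[[v]]), length.out = 25)
  probe <- baseline[rep(1, length(grid)), , drop = FALSE]
  probe[[v]] <- grid
  p <- predict(model, newdata = probe, type = "response")
  max(p) - min(p)  # change in output attributable to v
})

print(round(sort(sensitivity, decreasing = TRUE), 3))
```

Using the median instead of the mean as the fixed value only requires swapping `mean` for `median` when building `baseline`.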

15 Sensitivity Analysis
Relative importance is defined in terms of the network weights, where
w_ji = weight from the ith input node to the jth hidden node
w_kj = weight from the jth hidden node to the kth output node
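The formula on this slide is not preserved in the transcript. As an assumption, the sketch below uses Garson's algorithm, one widely used weight-based measure built from the same two sets of weights: each input's share of the absolute input-to-hidden weights at every hidden node is scaled by the absolute hidden-to-output weight, summed over hidden nodes, and normalised to sum to 1. It may differ from the formula the presenter showed. The weight values and the function name `garson_importance` are hypothetical.

```r
# Hypothetical weights for a network with 3 inputs, 2 hidden nodes, 1 output.
# w_in[j, i] = weight from input i to hidden node j    (w_ji)
# w_out[j]   = weight from hidden node j to the output (w_kj, single output k)
w_in <- matrix(c( 0.8, -0.2, 0.1,
                 -0.5,  0.4, 0.3), nrow = 2, byrow = TRUE,
               dimnames = list(c("h1", "h2"), c("x1", "x2", "x3")))
w_out <- c(h1 = 1.2, h2 = -0.7)

garson_importance <- function(w_in, w_out) {
  a <- abs(w_in)
  share <- a / rowSums(a)        # each input's share within each hidden node
  contrib <- share * abs(w_out)  # scale each hidden node's row by |w_kj|
  s <- colSums(contrib)          # total contribution per input
  s / sum(s)                     # normalise so the importances sum to 1
}

ri <- garson_importance(w_in, w_out)
print(round(ri, 3))
```

Because the measure uses absolute values, it ranks inputs by strength of influence but says nothing about the direction of the effect.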

16 In a nutshell
Appreciation of R as a leading statistical tool
How the Rattle GUI and Weka can be used for data mining
How an ANN can be applied to a consumer behaviour study
Identification of the relationship between predictors and the dependent variable through sensitivity analysis

17 Thank You.

