1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.

1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010

2 © Goharian & Grossman 2003 Course Outline l Introduction l Data Pre-processing l Data Mining Algorithms »Recommenders »Classifiers –Naïve Bayes –Decision Trees –Support Vector Machines »Clustering –K-Means –LDA »Other –Association Rules

3 © Goharian & Grossman 2003 Introduction l Mining useful facts from a large amount of data. Examples: –Product A sells a lot when we sell –People who take a loan are more likely to default if they have the following characteristics –A person committing credit card fraud is likely to do X –A person who likes these movies will probably like movies about X

4 © Goharian & Grossman 2003 Different Data Sources l Relational Database l Data Warehouse l Flat Files l Web l Object Oriented database l Multi Media

5 © Goharian & Grossman 2003 Data Warehouse l Many enterprises consolidate data from their different homogeneous and heterogeneous data repositories into one common data source called Data Warehouse (DW). l Data Warehouse contains current and historical data to be used for planning and forecasting in Decision Support Systems (DSS). l Traditional Databases are operational databases that are day-to-day data. l Star-schema, Snow-Flakes, Galaxy are modeling schemes in DW.

6 © Goharian & Grossman 2003 Data Warehouse (Cont’d) l To improve the performance in DW different techniques such as Summarization and Denormalization are used. l Usually but not always DW is accessed by On-Line Analytical Processing (OLAP). l SQL gives a precise answer to a user query. l OLAP gives a multi-dimensional view of data and is as extension of some aggregate functions in SQL. l OLAP Operations are Slice, Dice, Roll-up, and Drill- down.

7 © Goharian & Grossman 2003 OLAP vs. Data Mining l OLAP is a data summarization/ aggregation tool that facilitates the data analysis for the user by providing a multi-dimensional view of the data. l Data Mining Tool provides an automated discovery of knowledge and gives more in-depth knowledge about data and hidden information. l OLAM (OLAP Mining) is the integration of OLAP with Data Mining.

8 © Goharian & Grossman 2003 OLAM Architecture DB Data warehouse OLAP Graphical user interface DB Data Mining/OLAM

9 © Goharian & Grossman 2003 DM vs Statistics and ML l Data Mining (DM), Statistics and ML have lots in common: »Finding patterns »Building models to make predictions l Statistics largely relies on probabilistically rational mathmatical models. l ML is extremely useful, but tends not to care about large data sets. If a process runs out of memory, its not an ML problem. l DM wants accuracy but is designed for very large datasets.

10 © Goharian & Grossman 2003 Scalability l Statistical approach deal with small data sets. »Believe that all data must be cleaned and reduced. l Machine Learning deal with small data sets. »Goal is to make machine learn. »Applications such as Chess Playing rather than applications that deal market analysis. l Real life data to be mined can be huge, thus need scalable algorithms.

11 © Goharian & Grossman 2003 DM Algorithms l Supervised (Classification / Categorization) »Bayesian »Neural Network »Decision Tree »Others: Genetic Algorithms, Fuzzy Set, K-Nearest Neighbor l Unsupervised »Association Rules »Clustering l Collaborative Filtering

12 © Goharian & Grossman 2003 Supervised vs. Unsupervised l Supervised algorithms »Learning by example: –Use training data which has correct answers (class label attribute) –Create a model by running the algorithm on the training data –Identify a class label for the incoming new data l Unsupervised algorithms » Do not use training data. »Classes may not be known in advance.

13 © Goharian & Grossman 2003 Supervised Algorithms Training Data Test Data Classifier Algorithm Model 1 2 3 Classification 4

14 © Goharian & Grossman 2003 Collaborative Filtering l Use user recommendations such as »Ratings »Clicks »Purchases l Provide recommendations to other users l All users “collaborate” in order to “filter” the search for the best item.

15 © Goharian & Grossman 2003 Introduction to Evaluation (Evaluating Supervised Algorithms) l Goal: We want to measure the effectiveness of a classification (supervised) algorithm. »Take the training dataset »Build model »Test using the training dataset l This usually leads to a very optimistic result that has little ability to predict the real accuracy of the model.

16 © Goharian & Grossman 2003 Cross-Validation (Cont’d) l Another approach is to take the training data set, cut it in half and use half on training and half for testing. l This leads to potential errors in estimating the real classification rate because the half we hold for testing may be very different that the half we used for training.

17 © Goharian & Grossman 2003 l Take the data set and use the first 90 percent of the data for training and then test on the final ten percent. Then use the next 10 percent for testing, etc. 10-fold Cross Validation Train Test Run 1 Train Test Run 2 Train Test Run 10 Train …

18 © Goharian & Grossman 2003 10-fold Cross Validation l Each run will result in a particular classification rate. l Ex: If we classified 50/100 of the test records correctly our classification rate for that run is 50%. l Choose the model that generated the highest classification rate. The final classification rate for the model is the average of the ten classification rates.

19 © Goharian & Grossman 2003 DM Open Source l Weka »Pros –Weka has numerous algorithms and plenty of data mining classes uses it. –Has a great tool to run experiments, try lots of algorithms and see which one works. »Cons –Not typically use for large production systems. Lacks large-scale distributed implementation. l Mahout »Runs on top of Hadoop - - that’s where it got its name. It’s a highly scalable system. Hadoop is meant for widespread distributed processing. More on it later. »Its real -- used by real, production systems.

20 © Goharian & Grossman 2003 Open Source l A reasonable goal of this course is to make it so you can meaningfully contribute to open source. l This will ensure you understand the concepts being taught and it also will help you as you progress. l If you do that the conversation about you will change from: »“oh I did a cool data mining project in school and got an A” l To: »“I designed and implemented a new, highly scalable algorithm that is now downloaded and used by 200 major systems. People have gone through it and found a few bugs, but most were relatively small, and I work to improve it from time to time”.

21 © Goharian & Grossman 2003 Summary l Data Mining algorithms are used to detect the information that we did not know. l There are various data sources, types,formats and applications for Data Mining. l Scalability differentiates DMfrom statistics and Machine Learning. l Mahout is a robust, open source framework for implementation of data mining algorithms.

1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.

Similar presentations

Presentation on theme: "1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.

Similar presentations

Presentation on theme: "1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010."— Presentation transcript:

Similar presentations

About project

Feedback