Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Introduction Peixiang Zhao.

1 Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Introduction Peixiang Zhao

2 Welcome to CIS4930
Course website:
– http://www.cs.fsu.edu/~zhao/cis4930/main.html
– Everything about the course can be found here: syllabus, announcements, policies, schedules, slides, assignments, resources, …
– Make sure you check the course website periodically
Please read the class syllabus, policies, and lecture schedule; ask now if you have questions

3 Teaching Staff
Instructor: Peixiang Zhao
– Research interest
  Generally, data and information science, including database systems and data mining
  Specifically, graph data, information network analysis, large-scale data-intensive computation and analytics
– Brief history
  Illinois (Ph.D. from UIUC)
  Florida (Assistant professor at FSU since Aug. 2012)
TA:
– Yongjiang Liang (liang@cs.fsu.edu)
– Office hours: Tuesday 10am – 11am

4 Prerequisite
Must know how to program, and have a background in data structures and algorithms
– COP3330: Object-oriented Programming
– COP4530: Data structures and algorithms
– Knowledge of probability theory, statistics, and linear algebra

5 Textbook
Data Mining: Concepts and Techniques, 3rd edition
– Jiawei Han, Micheline Kamber, Jian Pei
References
– Introduction to Data Mining
– Data Mining: The Textbook
– The Elements of Statistical Learning
– Pattern Recognition and Machine Learning

6 Course Format
Two 75-min lectures/week
– Lecture slides are used to complement the lectures, not to substitute for the textbook
Four homework assignments (40%)
– Written assignments and machine problems; datasets or software might be provided
– Individual work
– Due right before the class starts on the due date
– No late homework will be accepted
One midterm (15%) and one final (40%)
– Check the dates and make sure there is no conflict!
Quizzes (5%)

7 You Tell Me -- Why Are You Taking this Course?
– https://www.youtube.com/watch?v=vbb-AjiXyh0
– https://www.youtube.com/watch?v=1i6uESo98Yo
– Data mining tops LinkedIn’s list of the “hottest skills of 2014”
– Data scientist: the sexiest job of the 21st century (Harvard Business Review)
– Data scientist: 2015’s hottest profession (Mashable)

8 Why Data Mining?
Big Data
However, we are drowning in data, but starving for knowledge!
– There is often information “hidden” in the data that is not readily evident
– Human analysts may take weeks to discover useful information
– Much of the data is never analyzed at all

9 What is Data Mining
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– a.k.a. KDD (knowledge discovery in databases)
– Data to be mined
  Relational databases, data warehouses; data streams and sensor data; time-series data, temporal data, sequence data; graphs, social networks and multi-linked data; spatial data and spatiotemporal data; multimedia data; text data; WWW data
– Knowledge to be obtained
  Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis

10 The Goal: Decision Support
Typical procedure
– Data --> Knowledge --> Action/Decision --> Goal
Examples
– Netflix collects user ratings of movies --> What types of movies you will like --> Recommend new movies to you --> Users stay with Netflix
– Gene sequences of cancer patients --> Which genes lead to cancer? --> Appropriate treatment --> Save lives
– Road traffic --> Which road is likely to be congested? --> Suggest better routes to drivers --> Save time and energy

11 Example: Association Rule Mining
Data
– A set of transactions, each of which consists of a set of items
Association rules
– A set of rules that characterize associations between items
[Table: market-basket transactions]
Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
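
How strongly a rule such as {Diaper, Milk} --> {Beer} holds is commonly measured by its support and confidence. The sketch below is a minimal illustration of those two measures, not the method used in the lecture. The transaction table is hypothetical, since the slide's own table did not survive in this transcript, and the function names `support` and `confidence` are chosen here for illustration.

```python
# Hypothetical market-basket data, invented for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


if __name__ == "__main__":
    for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
        print(sorted(lhs), "-->", sorted(rhs),
              "support =", support(lhs | rhs, transactions),
              "confidence =", confidence(lhs, rhs, transactions))
```

A rule miner such as Apriori would enumerate candidate itemsets and keep only rules whose support and confidence exceed user-chosen thresholds; this sketch only scores rules that are already given.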

12 Example: Classification
Process
– Construct models (functions) based on training data with known class labels
– Describe and distinguish classes or concepts for future prediction
– Predict testing data with unknown class labels
Applications
– Spam identification
– Treatment prediction
– Document categorization
– …
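
The train-then-predict process can be sketched with a k-nearest-neighbor classifier, one of the simplest ways to map features to class labels. This is an illustrative sketch only: the two features (number of links, number of exclamation marks), the toy messages, and the function name `knn_predict` are assumptions made here, not material from the course.

```python
import math
from collections import Counter

# Hypothetical training data: (features, class label).
# Features are (number of links, number of exclamation marks) in a message.
training = [
    ((8, 5), "spam"),
    ((7, 6), "spam"),
    ((9, 4), "spam"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
    ((2, 1), "ham"),
]


def knn_predict(training, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Testing data: feature vectors whose class labels are unknown.
    for message in [(6, 5), (1, 1)]:
        print(message, "is classified as", knn_predict(training, message))
```

Here the "model" is just the stored training set; methods such as decision trees or SVMs instead fit an explicit function during training.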

13 Ads Targeting
[Figure: training and testing data, each with feature columns and a class-label column]
A classifier: f(x) = y, mapping features --> class labels

14 Fraud Detection
[Figure: a training set with categorical attributes, continuous attributes, and a class column is used to learn a classifier (model), which is then applied to a test set]

15 Example: Clustering
Goal
– Finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups
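
One common way to pursue this goal is k-means. The sketch below is a minimal version of Lloyd's algorithm under stated assumptions: the 2-D points and the starting centroids are hypothetical, the caller supplies the initial centroids, and a fixed number of iterations stands in for a proper convergence check.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

if __name__ == "__main__":
    centroids, clusters = kmeans(points, [points[0], points[3]])
    for centroid, cluster in zip(centroids, clusters):
        print("centroid", centroid, "members", cluster)
```

The result depends on the initial centroids, which is why practical implementations usually try several starting configurations.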

16 Example: Outlier Detection
Outliers (anomalies)
– Global: observations inconsistent with the rest of the dataset
– Local: observations inconsistent with their neighborhoods; a local instability or discontinuity
Applications
– Fraud/intrusion detection
– Customized marketing
– Weather prediction
“One person’s noise could be another person’s signal.” - Edward Ng
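
The difference between global and local outliers can be made concrete with two simple detectors. This is a sketch under assumptions, not the lecture's method: the global detector uses a z-score against the whole dataset, the local detector compares each value with the median of its neighbors in a sequence, and the thresholds, window size, and sample series are all hypothetical.

```python
import statistics


def global_outliers(values, threshold=2.0):
    """Indices whose z-score against the whole dataset exceeds `threshold`."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def local_outliers(values, window=2, threshold=3.0):
    """Indices that differ from the median of nearby values by more than `threshold`."""
    flagged = []
    for i, v in enumerate(values):
        # Up to `window` neighbors on each side, excluding the point itself.
        neighbors = values[max(0, i - window):i] + values[i + 1:i + 1 + window]
        if neighbors and abs(v - statistics.median(neighbors)) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical series: a steady upward trend with one local jump at index 5.
# The value 12 is unremarkable globally (12 also appears later) but breaks
# the pattern of its neighborhood.
trend = [1, 2, 3, 4, 5, 12, 7, 8, 9, 10, 11, 12, 13, 14, 15]

# Hypothetical readings with one value far from everything else.
readings = [10, 11, 9, 10, 12, 10, 50]

if __name__ == "__main__":
    print("global outliers in readings:", global_outliers(readings))
    print("global outliers in trend:", global_outliers(trend))
    print("local outliers in trend:", local_outliers(trend))
```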

17 Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
– Classification
– Regression
– Outlier detection
Description methods: find human-interpretable patterns that describe the data
– Clustering
– Association rule mining
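
As a small instance of a prediction method, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. The data and the function name `fit_line` are hypothetical and chosen only to illustrate regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations of one variable (xs) and the variable to predict (ys).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

if __name__ == "__main__":
    slope, intercept = fit_line(xs, ys)
    print("prediction for x = 5:", slope * 5 + intercept)
```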

18 Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, surrounded by machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology]

19 The Top 10 Data Mining Algorithms
1. C4.5: classification
2. K-Means: clustering
3. SVM: classification
4. Apriori: association analysis
5. EM: statistical learning
6. PageRank: link mining
7. AdaBoost: bagging and boosting
8. kNN: classification
9. Naive Bayes: classification
10. CART: classification

20 Questions
Any questions? Please feel free to raise your hands.
