G54DMT – Data Mining Techniques and Applications Dr. Jaume Bacardit


1 G54DMT – Data Mining Techniques and Applications Dr. Jaume Bacardit Lecture 0: Introduction

2 Outline of the lecture
What is Data Mining?
Administrative bits
Module structure
Resources

3 We are buried in data…

7 And in business as well…
Generating better movie recommendation methods from customer ratings
Training set of 100M ratings from over 480K customers on 18K movies
Data collected between October 1998 and December 2005
$1M prize to generate a recommender system 10% better than the Netflix proprietary method
Took 3 years to solve the challenge

9 What is Data Mining?
“The extraction of knowledge from large amounts of data” (Han and Kamber, 2006)
“Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities” (Witten and Frank, 2005)

10 So what is the data?
At its origin, data can be heterogeneous: it can come from multiple sources and contain uncertainty (e.g. distorted or missing entries)
In most cases we will assume that data is structured as a table where the rows are instances and the columns are attributes
In certain cases the instances will also have one or more labels associated with them: a class
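
The table layout described above can be sketched in a few lines of Python. This is only an illustration: the attribute names, the values and the helper functions `column` and `count_missing` are hypothetical and do not come from the lecture.

```python
from collections import Counter

# Hypothetical toy dataset: each row is one instance, each column one attribute.
ATTRIBUTES = ["age", "income", "owns_car"]
instances = [
    [25, 18000.0, "no"],
    [47, 52000.0, "yes"],
    [33, None, "yes"],   # None marks a missing entry
    [61, 40000.0, "no"],
]
# One class label per instance (only available in labelled problems).
labels = ["reject", "accept", "accept", "reject"]


def column(rows, attributes, name):
    """Return all values of one attribute, i.e. one column of the table."""
    idx = attributes.index(name)
    return [row[idx] for row in rows]


def count_missing(rows):
    """Count the missing entries across the whole table."""
    return sum(value is None for row in rows for value in row)


print(len(instances), "instances x", len(ATTRIBUTES), "attributes")
print("income column:", column(instances, ATTRIBUTES, "income"))
print("missing entries:", count_missing(instances))
print("class distribution:", dict(Counter(labels)))
```

Keeping the labels in a separate list mirrors the distinction between attributes and class: the same table can be mined with or without the labels.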

11 Data can be… Piles of Records
Datasets with a high number of records
– This is probably the most visible dimension of large-scale data mining
– GenBank (the genetic sequence database from the NIH) contained (Feb 2008) more than 82 million gene sequences and more than 85 billion nucleotides

12 Data can be… High Dimensionality
High dimensionality domains
– Sometimes each record is characterized by hundreds or thousands of features (or even more)
– Microarray technology (like many other post-genomic data generation techniques) can routinely generate records with tens of thousands of variables
– Creating each record is usually very costly, so datasets tend to have a very small number of records. This imbalance between the number of records and the number of variables is yet another challenge
(Reinke, 2006, Image licensed under Creative Commons)

13 Data can be… Rare
Class unbalance
– It is a challenge to generate accurate classification models when not all classes are equally represented
– Contact Map prediction datasets (briefly explained later in the tutorial) routinely contain millions of instances, of which less than 2% are positive examples
– Tissue type identification is highly unbalanced (see figure) (Llora, Priya, Bhargava, 2009)
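
To see why class unbalance is a challenge, consider a minimal sketch with made-up labels in which exactly 2% of the instances are positive (an assumption chosen to echo the figure quoted above, not real contact map data). A trivial model that always predicts the majority class looks accurate while finding no positive example at all. The functions `accuracy` and `recall` are illustrative helpers, not part of any package named in the lecture.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive instances that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Synthetic labels: 20 positives (2%) and 980 negatives.
y_true = [1] * 20 + [0] * 980
# A "classifier" that ignores its input and always answers the majority class.
y_pred = [0] * len(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("recall on the rare class:", recall(y_true, y_pred))
```

Accuracy alone hides the failure on the rare class, which is why per-class measures are usually reported on unbalanced problems.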

14 Data can be… Lots of Classes
Yet another dimension of difficulty
The Reuters dataset is a text categorization task with 672 categories
Closely related to the class unbalance problem
Machine learning methods need to make an extra effort to ensure that underrepresented data is properly taken into account

15 And what do we do with the data? The whole process of integrating, cleaning, selecting, mining and visualising the data is generally known as Knowledge Discovery in Databases (KDD) (Han and Kamber, 2006)
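
The five KDD stages named above can be sketched as a chain of small functions. Everything here is a simplified assumption made for illustration: the records are invented, and the "mining" step is just a frequency count standing in for a real algorithm.

```python
from collections import Counter


def integrate(*sources):
    """Integration: merge the records of several sources into one list."""
    return [record for source in sources for record in source]


def clean(records):
    """Cleaning: drop records that contain a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]


def select(records, attributes):
    """Selection: keep only the attributes relevant to the task."""
    return [{a: r[a] for a in attributes} for r in records]


def mine(records, attribute):
    """Mining: a frequency count, used here as a placeholder for a real method."""
    return Counter(r[attribute] for r in records)


def visualise(pattern):
    """Visualisation: render the counts as a plain-text bar chart."""
    return [f"{value:<8} {'#' * count}" for value, count in pattern.most_common()]


# Two hypothetical data sources with the same schema.
shop_a = [{"customer": "ann", "genre": "drama", "rating": 4},
          {"customer": "bob", "genre": None, "rating": 2}]
shop_b = [{"customer": "cat", "genre": "comedy", "rating": 5},
          {"customer": "dan", "genre": "drama", "rating": 3}]

records = select(clean(integrate(shop_a, shop_b)), ["genre", "rating"])
pattern = mine(records, "genre")
for line in visualise(pattern):
    print(line)
```

In practice each stage is far more involved, and the process is usually iterative rather than a single pass.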

16 Fields related to Data Mining
Machine Learning
– “How to construct programs that learn from experience” (Mitchell, 1997)
– ML generally concentrates on the central part of the KDD process, the pattern extraction
– Also, ML is generally seen to focus on the algorithms, while DM focuses on the process
Pattern recognition
– Mathematical view of the pattern extraction process, as opposed to the computational view of ML
Text mining
– Focused on analyzing human texts. Very specialised version of DM

17 Educational aims
To provide the students with a strong knowledge of data mining and its application to real-world scenarios
To understand the need for data mining to analyse large-scale real-world data
To provide the students with a sneak peek of the challenges and opportunities of data mining

18 The objective of this module is to study the methods and application of data mining techniques. The focus of the module will be on the technology, but by illustrating their usage with challenging problems we aim to provide a clear understanding of how these methods can be applied in the real world
The successful completion of the module will endow a student with:
– Strong understanding of core data mining problems (e.g. classification, regression, clustering, feature and prototype selection, dimensionality reduction) and the state-of-the-art methods for solving them
– Strong understanding of the application of data mining to important real-world problems
– Familiarity with the operation and principles behind publicly available data mining packages (e.g. Weka)

19 Lectures and labs
Lectures: Thursdays, 15:00 – 17:00, JC-AMEN-B11+
Labs: Mondays, 11: :00, JC-COMPSCI-B52 (labs start on 11/2)
– The laboratory sessions will be used to develop the coursework. I will be present to answer questions
– Sometimes there will be directed sessions, but these will be few, and advertised in advance

20 Coursework
Coursework 1 (50% of the mark)
– Detailed study of one aspect of data preprocessing
– How to perform a proper ML evaluation protocol
– Deadline: 8/3/2013
Project 1 (50% of the mark)
– I will give you a challenging large-scale dataset and you are free to mine it using a combination of any of the techniques described in the module
– Deadline: 10/5/2013

21 How to contact me?
At lectures and lab sessions
My office is B81 in the Computer Science building. However, for many reasons, chances are that if you just pop by unannounced I won't be able to see you
Thus, the preferred contact method is

22 Module structure
Four topics (described in the next slides)
Some topics will take several lectures to cover
All lectures will be posted at
Take notes
– Not everything is in the slides
– I will use the whiteboard often
After each lecture I will provide a list of resources to complement the material
Also, whenever necessary, I will introduce background material
If you feel that you are missing some background material, tell me straight away!

23 Module structure
Topic 1: Preliminaries
– This topic deals with several concepts that will be used across the module
    Data infrastructure: simple and advanced file formats
    Experimental validation procedures
    Statistical tests
    Most popular data mining packages
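
As a taste of what an experimental validation procedure looks like, here is a minimal sketch of k-fold cross-validation index splitting. The slide only names the topic; choosing k-fold as the example is an assumption, and the function `k_fold_indices` is a hypothetical helper rather than the API of any particular package.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every instance appears in exactly one test fold, and each training set
    contains all the instances outside the current test fold.
    """
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)        # fixed seed for repeatability
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold_number, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold_number}: test={sorted(test)} train size={len(train)}")
```

A model would be trained on each `train` list and evaluated on the matching `test` list, with the k scores averaged at the end.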

24 Module structure
Topic 2: Data Preparation
– Which steps do we follow to transform the data in order to facilitate the pattern extraction process?
– Many methods fall in this category
    Feature selection
    Instance selection
    Dimensionality reduction
    Missing values handling
    Discretisation
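
Two of the preparation steps listed above, missing values handling and discretisation, can be illustrated with a short sketch. Mean imputation and equal-width binning are assumed here only because they are the simplest variants; the slide does not say which methods the module covers, and the function names are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def discretise_equal_width(values, n_bins):
    """Map each numeric value to the index of one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values identical: a single bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]             # hypothetical attribute with a gap
filled = impute_mean(ages)
bins = discretise_equal_width(filled, n_bins=2)
print("after imputation:", filled)
print("bin index per instance:", bins)
```

Both steps change the data the mining algorithm sees, so they should be fitted on training data only when used inside an evaluation protocol.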

25 Module structure
Topic 3: Data Mining
– This topic deals with the central part of the KDD pipeline, the extraction of patterns from data
– This process can be done in many different ways. The most usual ones are
    Classification
    Regression
    Clustering
    Association Rules Mining

26 Module structure
Topic 4: Applications
– We will see a few examples of how the methods studied throughout the module are applied to challenging real-world problems

27 Resources
Books
– J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2006
– I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2005
– Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997
– Chris Bishop, Pattern Recognition and Machine Learning, Springer, 2006
– Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, 2nd ed., Springer, 2009
Online resources
– KDNuggets, newsletter and website about data mining
Software packages
– WEKA
– RapidMiner
– Keel

28 Questions?

