Data Mining: Staying Ahead in the Information Age A Tutorial in Data Mining, YOR11, Cambridge, 29th March 2000. Robert Burbidge Computer Science, UCL


1 Data Mining: Staying Ahead in the Information Age A Tutorial in Data Mining, YOR11, Cambridge, 29th March 2000. Robert Burbidge Computer Science, UCL, London, UK. robert.burbidge@ptp.sira.co.uk http://www.cs.ucl.ac.uk/staff/r.burbidge

2 Definition ‘We are drowning in information, but starving for knowledge’ John Naisbitt Data Mining is the search for ‘nuggets’ of useful information Data Mining is an automated search for ‘interesting’ patterns in large databases

3 Overview Data Pre-Processing Analysis Business Solutions Aims Domain Knowledge

4 Before We Begin... Getting the Data Assessing Usefulness of the Data Noise in the Data Volume of Available Data Domain Knowledge and Expertise

5 Getting the Data Are the data easily available? –What format are the data in? –Are the data in a live database or a data warehouse? –Are the data online? [Figure: raw records arranged as a data matrix of objects and variables]

6 Assessing Usefulness of the Data Are the available data relevant to the task at hand? –E.g. to predict ice-cream sales information about the FTSE would (probably) not be useful Are there missing factors which are likely to be predictive? –E.g. temperature is likely to be predictive of ice-cream sales

7 Noise in the Data Are the data contaminated by noise? –E.g. experimental error, typing mistakes, corrupted storage media Can this be eliminated? –E.g. improved experimental set up, data cleaning How seriously is this likely to affect the results?

8 Volume of Available Data Are there enough data... –... to learn a useful concept? –... to give statistically significant results? Should more data be collected? –More examples –More information about the examples –Meta data

9 Domain Knowledge Domain knowledge can be incorporated into some techniques –To choose priors in Bayesian analysis –To encode invariances in the data –Expert systems Use of expertise can avoid blind search –Feature selection –Building a model

10 Résumé 1 Before we begin we must –Obtain the data –Make sure it’s useful –Make sure there’s enough –Identify available expert knowledge This is all pretty obvious –If you don’t do this you’re headed for trouble

11 Pre-Processing Visualization Feature Selection Feature Extraction Feature Derivation Data Reduction

12 Visualization Histogram plots –Identify Distributions Clustering –k-means –Kohonen nets –Relational –Hierarchical –Outlier detection
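
As an illustration of the k-means entry above, here is a minimal sketch of the standard assign-then-update loop on 2-D points. The sample points, the choice of the first k points as starting centres and the function name `kmeans` are assumptions made for this example, not part of the tutorial.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on a list of 2-D points; returns (centres, clusters)."""
    # Assumption: start from the first k points so the run is reproducible.
    centres = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centre to the mean of its members.
        new_centres = []
        for centre, members in zip(centres, clusters):
            if members:
                new_centres.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centres.append(centre)  # leave an empty cluster's centre alone
        if new_centres == centres:
            break  # converged: assignments will not change again
        centres = new_centres
    return centres, clusters


# Hypothetical data: two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 6.0), (2.0, 1.0), (6.0, 5.0)]
centres, clusters = kmeans(points, k=2)
```

The result depends on the starting centres, so in practice several starts are usually tried and the tightest clustering kept.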

13 Feature Selection Performance Measures –Filters –Wrappers Search Algorithms –Exhaustive –Branch-and-bound –Mathematical Programming –Stochastic [Figure: data matrix of objects and variables before and after feature selection, with fewer variables retained]
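
A filter scores each variable on its own, without training the final model. The sketch below ranks variables by the absolute Pearson correlation with the target and keeps the best ones. The scoring choice, the function names and the toy data (loosely echoing the ice-cream example) are assumptions for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def filter_select(columns, target, keep):
    """Rank variables by |correlation| with the target and keep the top `keep`."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Hypothetical data: one relevant variable and two distractors.
columns = {
    "temperature": [12, 18, 25, 30, 22],
    "ftse_change": [0.3, -0.1, 0.2, -0.4, 0.1],
    "day_of_month": [3, 9, 14, 21, 28],
}
sales = [110, 160, 240, 300, 205]
selected = filter_select(columns, sales, keep=1)
```

A wrapper would instead score whole subsets of variables by training and testing the chosen learner on each subset, which is why it needs one of the search algorithms listed on the slide.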

14 Feature Extraction Domain knowledge –E.g. edges in images Informative features –Kohonen nets –Principal components analysis Useful for visualization –Projecting data to two or three dimensions –Identifying the number of clusters
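
To make the projection idea concrete, here is a rough sketch that finds the first principal component by power iteration on the sample covariance matrix and projects the data onto it. Power iteration is one of several ways to obtain the component and is chosen here only because it needs no external library; the function name and data are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=100):
    """Return (unit direction, projected scores) for the leading component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalise.
    # Assumption: the all-ones start is not orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


# Hypothetical data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
```

Here the two variables collapse onto one derived feature with no loss, since the points lie on a line. Real data would keep the two or three leading components for plotting.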

15 Feature Derivation Transforming continuous attributes to discrete attributes –Fuzzy or rough linguistic concepts –Binning Deriving numeric features –Products, ratios, differences, etc –E.g. taking differences of start and finish times, taking ratios of price changes
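
The derivations listed above can be sketched in a few lines: binning a continuous attribute into linguistic bands, and deriving a difference and a ratio. The record layout, the cut points and the band names are all assumptions made for the example.

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Map a continuous value to a discrete label using sorted bin edges."""
    # bisect_right counts the edges that are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]


# Hypothetical records: (start_hour, finish_hour, old_price, new_price, age)
records = [
    (9.0, 17.5, 100.0, 105.0, 23),
    (8.5, 12.0, 50.0, 45.0, 61),
]

age_edges = [30, 55]                    # assumed cut points
age_labels = ["low", "medium", "high"]  # one more label than edges

derived = []
for start, finish, old_price, new_price, age in records:
    derived.append({
        "duration": finish - start,              # difference of finish and start times
        "price_ratio": new_price / old_price,    # ratio of successive prices
        "age_band": bin_value(age, age_edges, age_labels),
    })
```

Crisp bins like these put ages 29 and 30 in different bands. The fuzzy concepts mentioned on the slide soften that boundary by allowing partial membership.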

16 Data Reduction Large amounts of data require longer training times –Some data points are more relevant than others Reducing the modality of a variable –Makes solutions more easily interpretable [Figure: Support Vector Machine]

17 Résumé 2 Assess the data statistically Visualize the data Identify, extract or create useful features Reduce the size of the problem if necessary

18 Discovering Patterns and Rules Rule Induction Statistical Pattern Recognition Neural Networks Hybrid Systems Performance Analysis

19 Rule Induction Discover rules that describe the data –e.g. marketing – who buys what? IF age > 55 AND income > 20 000 THEN holiday IF age 20 THEN pension Easy to understand –identifies important features Can be fuzzified IF age_low AND income_high THEN car_high
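
The sketch below writes the first rule on the slide as a crisp test and then shows one possible fuzzy reading of the car rule. The membership ramps, their break points and the use of `min` for AND are assumptions chosen for illustration. The slide does not say how the concepts are defined.

```python
def crisp_holiday_rule(age, income):
    """Crisp rule from the slide: IF age > 55 AND income > 20 000 THEN holiday."""
    return age > 55 and income > 20_000


def ramp_up(x, low, high):
    """Membership rising linearly from 0 at `low` to 1 at `high`."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def age_low(age):
    # Assumed shape: fully 'low' at 25 or under, not 'low' at all by 45.
    return 1.0 - ramp_up(age, 25, 45)


def income_high(income):
    # Assumed shape: starts to count as 'high' at 20 000, fully 'high' at 40 000.
    return ramp_up(income, 20_000, 40_000)


def car_high(age, income):
    """Fuzzy rule: IF age_low AND income_high THEN car_high, with AND taken as min."""
    return min(age_low(age), income_high(income))
```

With these assumed ramps, `car_high(35, 30_000)` fires to degree 0.5, whereas a crisp rule would have to answer yes or no.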

20 Statistical Pattern Recognition Model the underlying distribution –Classification Bayesian solution is optimal Gives confidence values –Regression Identifies useful features Robust techniques to handle noise Difficult in many practical applications
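
When the class distributions are known, Bayes' rule turns them into posterior probabilities, and choosing the most probable class minimises the error rate. The sketch below assumes one numeric feature with a Gaussian distribution per class. The class names, priors, means and standard deviations are invented for the example.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def posteriors(x, classes):
    """classes maps label -> (prior, mean, std); returns label -> P(label | x)."""
    # Numerator of Bayes' rule for each class: prior times likelihood.
    joint = {label: prior * gaussian_pdf(x, mean, std)
             for label, (prior, mean, std) in classes.items()}
    evidence = sum(joint.values())  # normalising constant p(x)
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical model: buyers tend to be older than non-buyers.
classes = {"buyer": (0.5, 60.0, 10.0), "non_buyer": (0.5, 40.0, 10.0)}
at_boundary = posteriors(50.0, classes)
at_65 = posteriors(65.0, classes)
predicted = max(at_65, key=at_65.get)
```

The posterior itself is the confidence value: at the midpoint between the two means the classes are equally likely, while at 65 the buyer class is strongly favoured. The difficulty noted on the slide is that real distributions are rarely known and must be estimated.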

21 Neural Networks Based on neuronal brain model Each neuron forms a weighted sum of its inputs Flexible learners Prone to over-fitting Messy optimization problem [Figure: network with inputs, hidden layer and output]
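
The weighted-sum idea can be shown with a forward pass through a tiny network with one hidden layer. The sigmoid activation and all the weights are assumptions for illustration. Training, which is the messy optimization problem, is not shown.

```python
from math import exp


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_unit):
    """Feed inputs through the hidden neurons, then through one output neuron."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_unit
    return neuron(hidden, out_weights, out_bias)


# Hypothetical weights: two inputs, two hidden neurons, one output.
hidden_layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output_unit = ([1.0, -1.0], 0.0)
y = forward([1.0, 1.0], hidden_layer, output_unit)
```

Even this small network has nine adjustable numbers, which hints at why such models are flexible and also why they over-fit when data are scarce.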

22 Hybrid Systems Combine techniques for increased functionality and accuracy –function replacing neural network accurate but unreadable combine with a decision tree –committee multiple classifiers with different set-ups aggregate with a decision tree [Figure: inputs feed NN1, NN2 and NN3, whose outputs are aggregated by a decision tree to give the output]
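
A committee can be sketched as several classifiers whose answers are combined. For brevity this example aggregates by majority vote rather than with a decision tree as on the slide, and the three members are hypothetical threshold rules standing in for trained networks.

```python
from collections import Counter


def committee_predict(classifiers, x):
    """Return the label chosen by the largest number of committee members."""
    votes = [classify(x) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical members with different set-ups (stand-ins for NN1, NN2, NN3).
members = [
    lambda x: "fraud" if x["amount"] > 1000 else "ok",
    lambda x: "fraud" if x["amount"] > 5000 else "ok",
    lambda x: "fraud" if x["foreign"] else "ok",
]

decision = committee_predict(members, {"amount": 2000, "foreign": True})
```

Using an odd number of members with two labels avoids ties. A decision tree as the aggregator could additionally learn which member to trust in which region of the input space.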

23 Performance Analysis Accuracy –error rate –discrimination –variable costs Readability Time –training –using [Figure: ROC curve; Neyman-Pearson at 20%]
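
An ROC curve plots the true-positive rate against the false-positive rate as the decision threshold varies. The sketch below computes those points from classifier scores and then picks an operating point. It assumes that the 20% in the figure label means a cap of 20% on the false-positive rate, which the transcript does not confirm, and the scores and labels are invented.

```python
def roc_points(scores, labels):
    """Return (threshold, false_positive_rate, true_positive_rate) per distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def neyman_pearson_threshold(points, max_fpr):
    """Highest true-positive rate among points whose false-positive rate <= max_fpr."""
    feasible = [p for p in points if p[1] <= max_fpr]
    return max(feasible, key=lambda p: p[2]) if feasible else None


# Hypothetical classifier scores with true labels (1 = positive, 0 = negative).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
best = neyman_pearson_threshold(roc_points(scores, labels), max_fpr=0.2)
```

On this toy data the chosen threshold is 0.6, which catches four of the five positives while raising one false alarm in five negatives.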

24 Résumé 3 Identify key criteria Assess data characteristics Choose an algorithm Set the parameters Try combining multiple techniques to improve results Assess statistical significance

25 Post-Processing Understanding Significance Implementation

26 Understanding What does it mean? –if easily understandable, does it make sense? –if numeric, how to interpret Which features were important? –sensitivity analysis
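
One simple form of sensitivity analysis nudges each input in turn and records how much the model output moves. The sketch treats the model as a black box. The linear `model` function and the step size are hypothetical choices made so the result is easy to check.

```python
def sensitivity(model, x, delta=1e-3):
    """Approximate change in model output per unit change in each input."""
    base = model(x)
    effects = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += delta  # nudge one feature, hold the others fixed
        effects.append((model(perturbed) - base) / delta)
    return effects


def model(x):
    # Hypothetical trained model: strongly driven by x[0] and x[2], barely by x[1].
    return 3.0 * x[0] + 0.1 * x[1] - 2.0 * x[2]


effects = sensitivity(model, [1.0, 2.0, 3.0])
ranking = sorted(range(len(effects)), key=lambda i: abs(effects[i]), reverse=True)
```

For a non-linear model the effects depend on the point at which they are measured, so they are usually averaged over many representative inputs.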

27 Significance Are the results interesting? –are they new and unobvious? e.g. IF age > 100 THEN NOT pension –are they relevant? What is the significance? –are further studies required with more data specific to the discovered pattern –change of business plan

28 Implementation How to convince the money men –solid results –clear and concise How to test your hypothesis –experimental design –controlled studies to eliminate sampling bias

29 Résumé 4 Assess the usefulness of the results –Interpretability –Relevance to initial problem Identify the next step –Sales pitch –Further experiments –Field trials –Towards knowledge discovery

30 Example Applications at UCL Intelligent fraud detection with Fuzzy GAs (Lloyds TSB) Drug Design by SVMs (SmithKline Beecham and Glaxo-Wellcome) Consumer Profiling with Bayes Nets (Unilever) Process Control (AstraZeneca)

31 ‘Data Snooping’ – A Warning Artefacts – ‘patterns’ that aren’t there Sampling bias Statistical tests may not show significance –this does not mean results aren’t significant The extremum of a collection of Gaussians is highly skewed – beware coincidence Data mining is a dangerous tool in the wrong hands
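
The warning about extremes can be made concrete. If n independent test statistics are pure standard Gaussian noise, the chance that the largest one exceeds a threshold z is 1 - Φ(z)^n, which grows quickly with n. The sketch below evaluates this. Independence of the tests and the function names are assumptions of the example.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_max_exceeds(n, z):
    """P(the largest of n independent standard normals exceeds z)."""
    return 1.0 - normal_cdf(z) ** n


# 1.645 is roughly the one-sided 5% point of a single standard normal.
single = prob_max_exceeds(1, 1.645)
after_20 = prob_max_exceeds(20, 1.645)
after_100 = prob_max_exceeds(100, 1.645)
```

Under these assumptions, a pattern that looks significant at the 5% level is close to guaranteed to turn up somewhere once a hundred candidate patterns have been examined, even when the data contain nothing at all.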

32 Summary Get the right data Use domain knowledge Pre-process the data Discover patterns and rules –machine learning –statistics Analyze results – but be wary

33 Conclusions With vast amounts of data available, it has become necessary to use automated techniques Advances in data processing, machine learning and statistics have made this possible Data mining is a necessary tool for business survival in the information age

34 Internet Resources www.kdnuggets.com www.data-miners.com www.crisp-dm.org www.research.microsoft.com/profiles/fayyad www.cs.sfu.ca/~han etc...

