1 Data Mining
Knowledge Discovery in Databases
Data 3

2 Data Mining
Data mining is a capability to support the recognition of previously unknown but potentially useful relationships within large databases and data warehouses.
Aim: find useful patterns in the data.
Uses statistical, mathematical, artificial intelligence, and machine-learning techniques.

3 Data Mining Tools
Data mining tools use statistical or rules-based methods to identify patterns and create predictive models.
Tools look for patterns using a variety of models:
– Statistical methods, e.g. correlation
– Decision trees
– Case-based reasoning
– Neural computing
– Intelligent agents
– Genetic algorithms

4 Text Mining
Text mining:
– Analyse text documents.
– Find hidden content.
– Group by themes.
– Determine relationships between documents.

5 Process of Data Mining / Knowledge Discovery
[Figure: flow diagram] Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
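The stages named on this slide can be sketched as a chain of small functions. This is only an illustration: the records, field names, and function names (`clean`, `integrate`, `select`, `mine`, `evaluate`) are hypothetical, and the "mining" step is assumed here to be a simple frequency count rather than any particular algorithm from the slides.

```python
from collections import Counter

# Two hypothetical source "databases" with inconsistent formatting.
store_db = [{"customer": " Ann ", "item": "Beer"}, {"customer": "bob", "item": None}]
web_db = [{"customer": "ann", "item": "nappies"}, {"customer": "Cat", "item": "beer"}]


def clean(records):
    """Data cleaning: drop records with missing values and normalise text."""
    return [
        {key: value.strip().lower() for key, value in record.items()}
        for record in records
        if all(record.values())
    ]


def integrate(*sources):
    """Data integration: merge the cleaned sources into one 'warehouse' list."""
    return [record for source in sources for record in source]


def select(warehouse, field):
    """Selection: keep only the task-relevant field of each record."""
    return [record[field] for record in warehouse]


def mine(values):
    """Data mining (assumed here to be a plain frequency count)."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {pattern: count for pattern, count in patterns.items() if count >= min_count}


warehouse = integrate(clean(store_db), clean(web_db))
patterns = evaluate(mine(select(warehouse, "item")), min_count=2)
print(patterns)
```

In this toy run the record with a missing item is dropped during cleaning, and only items that appear at least twice survive the evaluation step.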

6 What does it let you do?
Data mining automates the process of sifting through historical data in order to discover new information.
Data mining techniques enable users to identify patterns and correlations within a set of data.
These can then be used as predictive models that anticipate behaviour or events based on trends in the data.

7 Correlation versus Causation
Correlation
– A statistical relation between two or more variables such that changes in the value of one variable are accompanied by changes in the value of the other.
Causation
– Changes in one variable cause changes in another.
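A small sketch may make the distinction concrete. It computes the Pearson correlation coefficient, one common measure of correlation (the slide does not name a specific measure), on made-up figures in which ice cream sales and sunburn cases both depend on temperature. All data and variable names are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative data: both series are driven by temperature.
temperature = [18, 21, 24, 27, 30]
ice_cream_sales = [20, 26, 32, 38, 44]
sunburn_cases = [3, 4, 5, 6, 7]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation between ice cream sales and sunburn cases: {r:.2f}")
```

The two series are perfectly correlated in this toy data, yet neither causes the other: temperature drives both. A mining tool that reports the correlation cannot, on its own, tell these situations apart.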

8 What do you need for Data Mining?
Massive data collection
Powerful computers
Data mining algorithms

9 Five Basic Operations
Clustering
– Identifies groups of items that share a particular characteristic.
Classification
– Infers the defining characteristics of a certain group.
Association
– Identifies relationships between events that occur at the same time.
Sequencing
– Identifies relationships over time.
Forecasting
– Estimates future values based on patterns within large sets of data.

10 Clustering
The process of identifying relationships between similar records without any preconceived notion of what that similarity might involve.
Examples:
– Disease clusters
– Similarities in customers' telephone usage
Often used as an exploratory exercise before further data mining using a classification technique.
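As a rough illustration of the telephone-usage example, the sketch below groups customers by monthly call minutes using a one-dimensional k-means. The slide does not name a clustering algorithm, so k-means is an assumption made here, as are the data, the choice of two clusters, and the function names.

```python
def assign(values, centres):
    """Put each value into the cluster of its nearest centre."""
    clusters = [[] for _ in centres]
    for value in values:
        nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
        clusters[nearest].append(value)
    return clusters


def kmeans_1d(values, k, max_iterations=100):
    """Very small 1-D k-means with a deterministic starting point."""
    ordered = sorted(values)
    # Start each centre at the middle of one of k equal slices of the sorted data.
    centres = [float(ordered[(2 * i + 1) * len(ordered) // (2 * k)]) for i in range(k)]
    for _ in range(max_iterations):
        clusters = assign(values, centres)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        new_centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, assign(values, centres)


# Hypothetical monthly call minutes for six customers.
minutes = [12, 15, 20, 310, 295, 330]
centres, clusters = kmeans_1d(minutes, k=2)
print(clusters)
```

No labels are supplied: the light and heavy users emerge from the data alone. The number of clusters still has to be chosen, which is one reason clustering is usually treated as exploratory.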

11 Classification
The DM system learns from examples of the data how to partition or classify the data, i.e. it formulates classification rules which can be used for prediction.
– Example: a bank classifies customers and may offer them differing levels of service, different offers, and different charges. It can also build loan approval models.
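To show what "formulating a classification rule from examples" can look like, here is a minimal sketch that learns a single income threshold for loan approval. It is a one-rule learner chosen for brevity, not the method of any particular tool, and the training data, units, and names are all hypothetical.

```python
def learn_threshold(examples):
    """Learn a rule 'value >= threshold means True' with the fewest training errors.

    examples is a list of (value, label) pairs with boolean labels and at
    least two distinct values.
    """
    values = sorted({value for value, _ in examples})
    # Candidate thresholds are the midpoints between neighbouring distinct values.
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]

    def errors(threshold):
        return sum((value >= threshold) != label for value, label in examples)

    return min(candidates, key=errors)


# Hypothetical history: (annual income in thousands, loan was repaid).
history = [(18, False), (22, False), (25, False), (31, True), (40, True), (52, True)]
threshold = learn_threshold(history)


def approve(income):
    """Apply the learned rule to a new applicant."""
    return income >= threshold


print(threshold, approve(30), approve(20))
```

The rule is learned from labelled examples and then applied to unseen applicants, which is what separates classification from clustering. Real loan models use many attributes, not one.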

12 Association
Looks for links between records in a data set, e.g. items purchased at the same time.
Patterns can be identified to indicate probabilities, e.g.
Total transactions: 500,000
Transactions with nappies: 20,000
Transactions with beer: 30,000
Transactions with nappies + beer: 10,000
– Beer and nappies occur together in 2% of transactions.
– “When people buy beer they buy nappies 1/3 of the time.”
– “When people buy nappies they buy beer 50% of the time.”
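The three percentages on this slide can be reproduced directly from the counts. In association-rule terminology (terms the slide itself does not use) the first figure is usually called the support of the item pair and the other two are the confidences of the two rules. The function and variable names below are illustrative.

```python
from fractions import Fraction

# Counts taken from the slide's example.
total_transactions = 500_000
nappies = 20_000
beer = 30_000
nappies_and_beer = 10_000


def support(count_both, total):
    """Share of all transactions that contain both items."""
    return Fraction(count_both, total)


def confidence(count_both, count_first_item):
    """Share of transactions with the first item that also contain the second."""
    return Fraction(count_both, count_first_item)


support_both = support(nappies_and_beer, total_transactions)
conf_beer_to_nappies = confidence(nappies_and_beer, beer)
conf_nappies_to_beer = confidence(nappies_and_beer, nappies)

print(f"beer and nappies together: {float(support_both):.0%} of transactions")
print(f"beer buyers who also buy nappies: {conf_beer_to_nappies}")
print(f"nappy buyers who also buy beer: {float(conf_nappies_to_beer):.0%}")
```

Note that the two rules have different strengths even though they describe the same 10,000 transactions, because each is measured against a different base count.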

13 Sequential Analysis
A form of association used to track relationships over time.
– E.g. health insurance claims.
– E.g. 10% of customers who bought a tent bought a backpack within one month.
– Weather patterns, e.g. a tidal wave in Hawaii follows a hurricane in the N. Atlantic x% of the time.
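A statistic like the tent and backpack figure could be computed along the following lines. This is a sketch under stated assumptions: the purchase log is invented, "within one month" is approximated as 30 days, and only each customer's earliest tent purchase is considered.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer, item, purchase date).
purchases = [
    ("c1", "tent", date(2024, 5, 1)),
    ("c1", "backpack", date(2024, 5, 20)),
    ("c2", "tent", date(2024, 5, 3)),
    ("c2", "backpack", date(2024, 8, 1)),
    ("c3", "tent", date(2024, 6, 10)),
    ("c4", "backpack", date(2024, 6, 11)),
]


def follow_on_rate(log, first, then, window):
    """Fraction of buyers of `first` who bought `then` within `window` afterwards."""
    first_dates = {}
    for customer, item, day in log:
        if item == first:
            # Keep only the earliest purchase of the first item per customer.
            first_dates[customer] = min(day, first_dates.get(customer, day))
    if not first_dates:
        return 0.0
    followers = set()
    for customer, item, day in log:
        if item == then and customer in first_dates:
            gap = day - first_dates[customer]
            if timedelta(0) <= gap <= window:
                followers.add(customer)
    return len(followers) / len(first_dates)


rate = follow_on_rate(purchases, "tent", "backpack", timedelta(days=30))
print(f"{rate:.0%} of tent buyers bought a backpack within 30 days")
```

Unlike plain association, the order and the time gap matter here: a backpack bought three months later, or by someone who never bought a tent, does not count.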

14 Forecasting
Concerns the prediction of continuous variables, e.g. sales, share values, stock market levels, oil prices.
Often done with regression functions: statistical methods for examining the relationship between variables in order to predict a future value.
Two types:
– Forecasting a single continuous value based on unordered examples, e.g. predicting income based on personal details.
– Predicting one or more values based on a sequential pattern (time series forecasting).
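The simplest regression function is a straight line fitted by least squares. The sketch below fits one to a short, made-up monthly sales series and extrapolates one month ahead; the figures and names are hypothetical, and real time series forecasting would normally also account for seasonality and noise.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales figures for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [102, 108, 113, 121, 126]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # extrapolate to month 6
print(f"trend: {slope:.1f} per month, forecast for month 6: {forecast:.1f}")
```

The same `fit_line` function covers the first forecasting type on the slide if `xs` holds an unordered attribute (such as years of experience) instead of a time index.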

15 Data Mining Tools in more detail
Case-based reasoning
– Uses historical cases to identify patterns.
Neural computing
– Examines historical data for pattern recognition, e.g. to identify potential customers for a new product.
Intelligent agents
– Retrieve information from large databases.
Other tools, e.g. decision trees, rule induction, data visualisation.

16 Some Key Application Areas
Data mining is used in many different areas. Two big areas are:
– Market analysis and management
Initial data is gathered from credit card transactions, loyalty cards, discount coupons, customer complaint calls, lifestyle studies, and focus groups.
– Fraud detection and management

17 Examples: Market analysis and management
Target marketing
– Find clusters of “model” customers who share the same characteristics, e.g. interests, income.
Determine customer purchasing patterns over time.
Cross-market analysis uses associations/correlations between product sales and makes predictions based on the association information.
Customer profiling
– What types of customers buy what products.
Identifying customer requirements
– Identify the best products for different customers; use prediction to find what factors will attract new customers.

18 Fraud detection and management
Used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances.
Examples:
– Auto insurance: detect a group of people who stage accidents to collect on insurance.
– Money laundering: detect suspicious money transactions.
– Medical insurance: detect professional patients, rings of doctors, and rings of references.

19 Text Mining
Application of data mining to unstructured or less structured files.
Text mining operates with less structured information and helps organisations to:
– Find hidden content of documents, including useful relationships.
– Relate documents across unnoticed divisions, e.g. customers in two product divisions have the same characteristics.
– Group documents by themes, e.g. all customers who have similar complaints.
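One simple way to relate documents or group similar complaints is to compare the sets of words they contain. The sketch below uses Jaccard similarity (shared words divided by all distinct words), which is an assumption made for illustration and not a technique named on the slide. The complaint texts and the tiny stop-word list are invented.

```python
# Words too common to say anything about a document's theme (illustrative list).
STOP_WORDS = {"the", "was", "and", "a", "of"}


def tokens(text):
    """Lower-case word set of a document, without punctuation or stop words."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    return words - STOP_WORDS


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical customer complaints.
complaints = {
    "d1": "Delivery was late and the box was damaged",
    "d2": "Late delivery, damaged box",
    "d3": "The invoice shows the wrong price",
}

word_sets = {name: tokens(text) for name, text in complaints.items()}
similar = jaccard(word_sets["d1"], word_sets["d2"])
different = jaccard(word_sets["d1"], word_sets["d3"])
print(similar, different)
```

Documents with a high score can be grouped under one theme (here, delivery problems), while a score of zero means the two complaints share no content words.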

20 Some more example applications by area
Marketing: predicting which customers will respond to internet banners or buy a product; segmenting customer demographics.
Banking: forecasting bad loans and fraudulent credit card usage, credit card spending by new customers, and which customers will respond best to new loan offers.
Retailing and Sales: predicting sales, correct stock levels, and distribution schedules.
Manufacturing and Production: predicting when to expect machinery failures; finding key factors that control the optimisation of manufacturing capacity.

21 Brokerage and Securities Trading: predicting when bond prices will change, forecasting the range of stock fluctuation for particular issues, determining when to trade stock.
Insurance: forecasting claim amounts and medical coverage costs, classifying the most important elements that affect medical coverage, predicting which customers will buy new policies.
Computer Hardware and Software: predicting drive failure, forecasting creation time for new chips, predicting potential security violations.
Government and Defence: forecasting the cost of moving military equipment, testing strategies for potential military engagements, predicting resource consumption.

22 Airlines: capturing data on which customers are flying and the destinations of those who change carriers mid-journey.
Healthcare: correlating demographics of patients with critical illnesses.
Broadcasting: predicting which programs are best shown in prime time and how to maximise returns by inserting advertisements.
Police: tracking crime patterns, locations, criminal behaviour, and attributes to help crack criminal cases.

23 Problems with data mining
Need clear business objectives and access to the appropriate data.
Need the right data.
– Bad data quality can lead to spurious results.
Models are not fail-safe.
Privacy, property, and other legal and ethical issues.
Companies must change their mode of operation and maintain the effort (e.g. loyalty programs such as air miles).

24 Conclusion
Data mining is an attractive-sounding technology which is still evolving.
The key is that the algorithms discover useful relationships.
– Unlike standard research, where researchers hypothesise correlations and then search for them.
There are ethical issues:
– E.g. criminal profiling.
