Knowledge discovery & data mining: Tools, methods, and experiences
Fosca Giannotti and Dino Pedreschi, Pisa KDD Lab, CNUCE-CNR & Univ. Pisa


1 Knowledge discovery & data mining: Tools, methods, and experiences
Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000

2 Konstanz, 27-28.3.2000 EDBT2000 tutorial - Intro
Contributors and acknowledgements
- The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
- The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
- In particular:
  - Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
  - Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
  - Daniel A. KEIM, University of Halle
  - Daniel Silver, CogNova Technologies
- The EDBT2000 board who accepted our tutorial proposal

3 Tutorial goals
- Introduce you to the major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
- Provide a systematization of the many concepts in this area, along the following lines:
  - the process
  - the methods applied to paradigmatic cases
  - the support environment
  - the research challenges
- Important issues that will not be covered in this tutorial:
  - methods: time series, exception detection, neural nets
  - systems: parallel implementations

4 Tutorial Outline
1. Introduction and basic concepts
   1. Motivations, applications, the KDD process, the techniques
2. Deeper into DM technology
   1. Decision Trees and Fraud Detection
   2. Association Rules and Market Basket Analysis
   3. Clustering and Customer Segmentation
3. Trends in technology
   1. Knowledge Discovery Support Environment
   2. Tools, Languages and Systems
4. Research challenges

5 Introduction - module outline
- Motivations
- Application Areas
- KDD Decisional Context
- KDD Process
- Architecture of a KDD system
- The KDD steps in short

6 Evolution of Database Technology: from data management to data analysis
- 1960s:
  - Data collection, database creation, IMS and network DBMS.
- 1970s:
  - Relational data model, relational DBMS implementation.
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
- 1990s:
  - Data mining and data warehousing, multimedia databases, and Web technology.

7 Motivations: “Necessity is the Mother of Invention”
- Data explosion problem:
  - Automated data collection tools, mature database technology and the internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- We are drowning in information, but starving for knowledge! (John Naisbitt)
- Data warehousing and data mining:
  - On-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

8 A rapidly emerging field
- Also referred to as: Data dredging, Data harvesting, Data archeology
- A multidisciplinary field:
  - Database
  - Statistics
  - Artificial intelligence
    - Machine learning, Expert systems and Knowledge Acquisition
  - Visualization methods

9 Motivations for DM
- Abundance of business and industry data
- Competitive focus - Knowledge Management
- Inexpensive, powerful computing engines
- Strong theoretical/mathematical foundations
  - machine learning & logic
  - statistics
  - database management systems

10 What is DM useful for?
[Figure labels: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]
Increase the knowledge on which decisions are based, e.g., impact on marketing.

11 The Value Chain
- Data
  - Customer data
  - Store data
  - Demographical data
  - Geographical data
- Information
  - X lives in Z
  - S is Y years old
  - X and S moved
  - W has money in Z
- Knowledge
  - A quantity Y of product A is used in region Z
  - Customers of class Y use x% of C during period D
- Decision
  - Promote product A in region Z.
  - Mail ads to families of profile P
  - Cross-sell service B to clients C

12 Application Areas and Opportunities
- Marketing: segmentation, customer targeting, ...
- Finance: investment support, portfolio management
- Banking & Insurance: credit and policy approval
- Security: fraud detection
- Science and medicine: hypothesis discovery, prediction, classification, diagnosis
- Manufacturing: process modeling, quality control, resource allocation
- Engineering: simulation and analysis, pattern recognition, signal processing
- Internet: smart search engines, web marketing

13 Classes of applications
- Market analysis
  - target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
- Risk analysis
  - Forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Fraud detection
- Text (news group, email, documents) and Web analysis.

14 Market Analysis
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information.

15 Market Analysis (2)
- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification).
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports;
  - statistical summary information (data central tendency and variation)

16 Risk Analysis
- Finance planning and asset evaluation:
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - summarize and compare the resources and spending
- Competition:
  - monitor competitors and market directions (CI: competitive intelligence).
  - group customers into classes and apply class-based pricing procedures
  - set pricing strategy in a highly competitive market

17 Fraud Detection
- Applications:
  - widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach:
  - use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients, rings of doctors and rings of references

18 Fraud Detection (2)
- More examples:
  - Detecting inappropriate medical treatment:
    - The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving A$1m per year).
  - Detecting telephone fraud:
    - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  - Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

19 Other applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
  - Watch for the PRIVACY pitfall!

20 What is KDD? A process!
- The selection and processing of data for:
  - the identification of novel, accurate, and useful patterns, and
  - the modeling of real-world phenomena.
- Data mining is a major component of the KDD process: automated discovery of patterns and the development of predictive and explanatory models.

21 The KDD process
[Figure: Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]

22 The KDD Process: Core Problems & Approaches
- Problems:
  - identification of relevant data
  - representation of data
  - search for valid pattern or model
- Approaches:
  - top-down deduction by expert
  - interactive visualization of data/models
  - * bottom-up induction from data *
[Figure labels: Data Mining, OLAP]

23 The steps of the KDD process
- Learning the application domain:
  - relevant prior knowledge and goals of application
- Data consolidation: creating a target data set
- Selection and Preprocessing
  - Data cleaning (may take 60% of effort!)
  - Data reduction and projection:
    - find useful features, dimensionality/variable reduction, invariant representation.
- Choosing functions of data mining
  - summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Interpretation and evaluation: analysis of results.
  - visualization, transformation, removing redundant patterns, …
- Use of discovered knowledge

24 The virtuous cycle
[Figure labels: Identify Problem or Opportunity; Measure effect of Action; Act on Knowledge; Results; Strategy; Problem]

25 Applications, operations, techniques

26 Roles in the KDD process

27 Data mining and business intelligence
[Figure: layered diagram with increasing potential to support business decisions toward the top]
- Roles: End User, Business Analyst, Data Analyst, DBA
- Layers, top to bottom:
  - Making Decisions
  - Data Presentation: Visualization Techniques
  - Data Mining: Information Discovery
  - Data Exploration: OLAP, MDA; Statistical Analysis, Querying and Reporting
  - Data Warehouses / Data Marts
  - Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

28 Architecture of a KDD system
[Figure: a Graphical User Interface on top of the chain Data Sources -> Data Consolidation -> Warehouse -> Selection and Preprocessing -> Data Mining -> Interpretation and Evaluation -> Knowledge]

29 A business intelligence environment

30 The KDD process
[Figure: the KDD process diagram, repeated from slide 21]

31 Data consolidation and preparation
Garbage in, garbage out
- The quality of results relates directly to the quality of the data
- 50%-70% of KDD process effort is spent on data consolidation and preparation
- Major justification for a corporate data warehouse

32 Data consolidation
From data sources to consolidated data repository
[Figure: sources (RDBMS, Legacy DBMS, Flat Files, External) -> Data Consolidation and Cleansing -> Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]

33 Data consolidation
- Determine preliminary list of attributes
- Consolidate data into working database
  - Internal and External sources
- Eliminate or estimate missing values
- Remove outliers (obvious exceptions)
- Determine prior probabilities of categories and deal with volume bias
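
The consolidation steps on this slide can be illustrated with a minimal Python sketch. It is not taken from the tutorial: the function names, the choice of the median for estimating missing values, and the two-standard-deviation outlier rule are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev


def impute_missing(values):
    """Estimate missing values (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=2.0):
    """Keep only values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]


def class_priors(labels):
    """Prior probability of each category, as a relative frequency."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Hypothetical attribute with one missing value and one obvious exception.
ages = [23, 31, None, 27, 29, 25, 300]
filled = impute_missing(ages)
cleaned = drop_outliers(filled)
priors = class_priors(["ok", "ok", "ok", "fraud"])
```

A strongly skewed prior such as the one above is a simple sign of the volume bias the slide mentions: the rare category may need to be resampled or reweighted before mining.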

34 The KDD process
[Figure: the KDD process diagram, repeated from slide 21 with stage labels only]

35 Data selection and preprocessing
- Generate a set of examples
  - choose sampling method
  - consider sample complexity
  - deal with volume bias issues
- Reduce attribute dimensionality
  - remove redundant and/or correlating attributes
  - combine attributes (sum, multiply, difference)
- Reduce attribute value ranges
  - group symbolic discrete values
  - quantize continuous numeric values
- Transform data
  - de-correlate and normalize values
  - map time-series data to static representation
- OLAP and visualization tools play key role
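
Three of the preprocessing operations above (normalizing values, quantizing continuous values, spotting correlating attributes) could look roughly like the following Python sketch. The helper names, the equal-width binning scheme and the sample data are assumptions chosen for illustration, not something the slides prescribe.

```python
from statistics import mean


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantize(values, n_bins):
    """Map continuous values to equal-width bin indices 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def pearson(xs, ys):
    """Pearson correlation; values near +1 or -1 suggest a redundant attribute."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical income attribute (in thousands) and a linearly derived copy.
income = [20, 35, 50, 80]
normalized = min_max_normalize(income)
bins = quantize(income, n_bins=3)
redundancy = pearson(income, [2 * v + 1 for v in income])
```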

36 The KDD process
[Figure: the KDD process diagram, repeated from slide 21 with stage labels only]

37 Data mining tasks and methods
- Automated Exploration/Discovery
  - e.g., discovering new market segments
  - clustering analysis
- Prediction/Classification
  - e.g., forecasting gross sales given current factors
  - regression, neural networks, genetic algorithms, decision trees
- Explanation/Description
  - e.g., characterizing customers by demographics and purchase history
  - decision trees, association rules
[Figure residue: scatter plot over x1, x2; curve f(x) over x; sample rule "if age > 35 and income < $35k then..."]

38 Automated exploration and discovery
- Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
- Distance-based numerical clustering
  - metric grouping of examples (K-NN)
  - graphical visualization can be used
- Bayesian clustering
  - search for the number of classes which result in best fit of a probability distribution to the data
  - AutoClass (NASA) is one of the best examples
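
As an illustration of distance-based numerical clustering, here is a small k-means sketch in Python. Note that k-means is swapped in as a familiar example of metric grouping; the slide itself does not name it. The naive initialization (the first k points), the fixed iteration count and the sample points are assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters by Euclidean distance to the centroids."""
    # Naive initialization (an assumption of this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return assignment, centroids


# Two visually separate groups of hypothetical 2-D examples.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centers = kmeans(points, k=2)
```

Unlike the Bayesian approach on the slide, this sketch needs the number of clusters k to be chosen in advance.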

39 Prediction and classification
- Learning a predictive model
- Classification of a new case/sample
- Many methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Genetic algorithms
  - Nearest neighbor clustering algorithms
  - Statistical (parametric and non-parametric)

40 Generalization and regression
- The objective of learning is to achieve good generalization to new unseen cases.
- Generalization can be defined as a mathematical interpolation or regression over a set of training points
- Models can be validated with a previously unseen test set or using cross-validation methods
[Figure residue: curve f(x) plotted over x]
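
The idea of regression over training points, validated on a previously unseen test set, can be sketched in Python as follows. The least-squares line fit, the helper names and the synthetic data (generated from y = 2x + 1) are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between predictions and true values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic training points and a held-out test set the model never saw.
train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]
test_x, test_y = [4, 5], [9, 11]

slope, intercept = fit_line(train_x, train_y)
test_error = mean_squared_error(test_x, test_y, slope, intercept)
```

With real, noisy data the test error is normally larger than the training error; a large gap between the two indicates poor generalization.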

41 Classification and prediction
- Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
- Use the obtained model to predict some unknown or missing attribute values based on other information.

42 Summarizing: inductive modeling = learning
Objective: Develop a general model or hypothesis from specific examples
- Function approximation (curve fitting)
- Classification (concept learning, pattern recognition)
[Figure residue: classes A and B plotted over x1, x2; curve f(x) over x]

43 Explanation and description
- Learn a generalized hypothesis (model) from selected data
- Description/Interpretation of model provides new knowledge
- Methods:
  - Inductive decision tree and rule systems
  - Association rule systems
  - Link Analysis
  - …

44 Exception/deviation detection
- Generate a model of normal activity
- Deviation from model causes alert
- Methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Statistical methods
  - Visualization tools
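
A very simple statistical instance of "model of normal activity, deviation causes alert" is sketched below in Python. The model here is just the mean and standard deviation of past observations, and the three-sigma threshold is an assumption; real systems would use richer models such as those listed above.

```python
from statistics import mean, stdev


def fit_normal_model(history):
    """Model normal activity as the mean and sample standard deviation."""
    return mean(history), stdev(history)


def is_alert(value, model, threshold=3.0):
    """Raise an alert when a value deviates more than threshold sigmas."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma


# Hypothetical call durations (minutes) observed during normal activity.
history = [10, 12, 11, 9, 10, 11, 12, 10]
model = fit_normal_model(history)

# New observations checked against the model.
alerts = [v for v in [11, 45, 9] if is_alert(v, model)]
```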

45 Outlier and exception data analysis
- Time-series analysis (trend and deviation):
  - Trend and deviation analysis: regression, sequential pattern, similar sequences, e.g., stock analysis.
  - Similarity-based pattern-directed analysis
  - Full vs. partial periodicity analysis
- Other pattern-directed or statistical analysis

46 The KDD process
[Figure: the KDD process diagram, repeated from slide 21 with stage labels only; the first stage is labeled "Data Consolidation and Warehousing"]

47 Are all the discovered patterns interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures:
  - easily understood by humans
  - valid on new or test data with some degree of certainty
  - potentially useful
  - novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.
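
The two objective measures named above, support and confidence, can be computed for an association rule as in this Python sketch. The definitions are the standard ones (fraction of transactions containing an itemset, and the conditional frequency of the consequent given the antecedent); the function names and the basket data are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # bread and milk together
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # rule: bread => milk
```

A mining system would typically keep only the rules whose support and confidence exceed user-chosen thresholds.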

48 Completeness vs. optimization
- Find all the interesting patterns: Completeness.
  - Can a data mining system find all the interesting patterns?
- Search for only interesting patterns: Optimization.
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns: mining query optimization.

49 Interpretation and evaluation
Evaluation
- Statistical validation and significance testing
- Qualitative review by experts in the field
- Pilot surveys to evaluate model accuracy
Interpretation
- Inductive tree and rule models can be read directly
- Clustering results can be graphed and tabled
- Code can be automatically generated by some systems (IDTs, regression models)

50 Interpretation and evaluation
- Visualization tools can be very helpful
  - sensitivity analysis (I/O relationship)
  - histograms of value distribution
  - time-series plots and animation
  - requires training and practice
[Figure residue: plot with axes Response, Velocity, Temp]

51 Important dates of data mining
- 1989 IJCAI Workshop on KDD
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
- 1991-1994 Workshops on KDD
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
- 1995-1998 AAAI Int. Conf. on KDD and DM (KDD’95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD
- 1999 SIGKDD’99 Conf.

52 References - general
- P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
- M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To appear.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
- M. Berry and G. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.
- W. H. Inmon, J. D. Welch, and K. L. Glassey. Managing the Data Warehouse. Wiley, 1997.
- T. Mitchell. Machine Learning. McGraw-Hill, 1997.

53 Main Web resources
- http://www.kdnuggets.com
  - KDD Newsletter and comprehensive website
- http://www.acm.org/sigkdd
  - ACM SIGKDD
- http://www.research.microsoft.com/datamine/
  - Journal of Data Mining and Knowledge Discovery

54 Tutorial Outline
- Introduction and basic concepts
  - Motivations, applications, the KDD process, the techniques
- Deeper into DM technology
  - Decision Trees and Fraud Detection
  - Association Rules and Market Basket Analysis
  - Clustering and Customer Segmentation
- Trends in technology
  - Knowledge Discovery Support Environment
  - Tools, Languages and Systems
- Research challenges

