2 Introduction
  Motivation: Why data mining?
  What is data mining?
  Data mining: On what kind of data?
  Data mining functionality
  Are all the patterns interesting?
  Classification of data mining systems
  Major issues in data mining
3 Why Data Mining?
  The explosive growth of data: from terabytes to petabytes
    Data collection and data availability
      Automated data collection tools, database systems, Web, computerized society
    Major sources of abundant data
      Business: Web, e-commerce, transactions, stocks, …
      Science: remote sensing, bioinformatics, scientific simulation, …
      Society and everyone: news, digital cameras, …
  We are drowning in data, but starving for knowledge!
  “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
4 Evolution of Database Technology
  1960s:
    Data collection, database creation, IMS and network DBMS
  1970s:
    Relational data model, relational DBMS implementation
  1980s:
    RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
    Application-oriented DBMS (spatial, scientific, engineering, etc.)
  1990s:
    Data mining, data warehousing, multimedia databases, and Web databases
  2000s:
    Stream data management and mining
    Data mining and its applications
    Web technology (XML, data integration) and global information systems
5 What Is Data Mining?
  Data mining (knowledge discovery from data)
    Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  Alternative name
    Knowledge discovery in databases (KDD)
  Watch out: Is everything “data mining”?
    Query processing
    Expert systems or statistical programs
6 Why Data Mining? Potential Applications
  Data analysis and decision support
    Market analysis and management
      Target marketing, customer relationship management (CRM), market basket analysis, market segmentation
    Risk analysis and management
      Forecasting, customer retention, quality control, competitive analysis
    Fraud detection and detection of unusual patterns (outliers)
7 Why Data Mining? Potential Applications
  Other applications
    Text mining (news group, documents) and Web mining
    Stream data mining
    Bioinformatics and bio-data analysis
8 Market Analysis and Management
  Where does the data come from?
    Credit card transactions, discount coupons, customer complaint calls
  Target marketing
    Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
    Determine customer purchasing patterns over time
9 Market Analysis and Management
  Cross-market analysis
    Associations and correlations between product sales, and prediction based on such associations
  Customer profiling
    What types of customers buy what products
  Customer requirement analysis
    Identify the best products for different customers
    Predict what factors will attract new customers
10 Fraud Detection & Mining Unusual Patterns
  Approaches: clustering & model construction for frauds, outlier analysis
  Applications: health care, retail, credit card service, telecommunications
    Medical insurance
      Professional patients and rings of doctors
      Unnecessary or correlated screening tests
    Telecommunications
      Phone call model: destination of the call, duration, time of day or week
      Analyze patterns that deviate from an expected norm
    Retail industry
      Analysts estimate that 38% of retail shrink is due to dishonest employees
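The telecommunications bullet (flag patterns that deviate from an expected norm) can be illustrated with a very small sketch. The slide does not name a method; the z-score rule, the threshold of 2.0, and the call durations below are all assumptions chosen for illustration, not the approach of any particular fraud system.

```python
from statistics import mean, stdev

def flag_deviations(values, threshold=2.0):
    """Return indices of values whose z-score magnitude exceeds `threshold`.

    The z-score uses the sample mean and sample standard deviation of `values`.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical call durations (minutes) for one subscriber
durations = [3, 4, 2, 5, 3, 4, 3, 2, 4, 95]
suspicious = flag_deviations(durations)
print(suspicious)  # index of the 95-minute call
```

A single extreme value inflates the standard deviation it is measured against, so on small samples a threshold of 3 may never fire; a real phone call model would more likely compare against a norm learned from historical data.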
11 Other Applications
  Internet Web Surf-Aid
    IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
12 Data Mining: A KDD Process
  Data mining: the core of the knowledge discovery process
  [Figure: KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
13 Steps of a KDD Process
  Learning the application domain
    Relevant prior knowledge and goals of application
  Creating a target data set: data selection
  Data cleaning and preprocessing (may take 60% of effort!)
  Data reduction and transformation
    Find useful features, dimensionality/variable reduction
  Choosing functions of data mining
    Summarization, classification, regression, association, clustering
  Choosing the mining algorithm(s)
  Data mining: search for patterns of interest
  Pattern evaluation and knowledge presentation
    Visualization, transformation, removing redundant patterns, etc.
  Use of discovered knowledge
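The steps above can be sketched as a chain of small functions. This is only a toy illustration: the record fields (`id`, `age`, `product`), the age bands, and the `min_count` threshold are hypothetical, and the mining step is a plain frequency count standing in for a real algorithm.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that have a missing (None) field."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: discretize age into two coarse bands."""
    return [("under_30" if r["age"] < 30 else "30_plus", r["product"])
            for r in records]

def mine(pairs):
    """Data mining (summarization): count each (age band, product) pair."""
    return Counter(pairs)

def evaluate(counts, min_count):
    """Pattern evaluation: keep only pairs seen at least `min_count` times."""
    return {pair: c for pair, c in counts.items() if c >= min_count}

# Hypothetical customer records; record 4 has a missing age
records = [
    {"id": 1, "age": 24, "product": "phone"},
    {"id": 2, "age": 27, "product": "phone"},
    {"id": 3, "age": 45, "product": "laptop"},
    {"id": 4, "age": None, "product": "phone"},
    {"id": 5, "age": 52, "product": "laptop"},
    {"id": 6, "age": 31, "product": "phone"},
]

selected = select(clean(records), ["age", "product"])
patterns = evaluate(mine(transform(selected)), min_count=2)
print(patterns)
```

Keeping each stage as a separate function mirrors the slide's point that cleaning, selection, and transformation are distinct steps that precede the mining step itself.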
14 Architecture: Typical Data Mining System
  [Figure: layered architecture, top to bottom: Graphical user interface → Pattern evaluation → Data mining engine → Database or data warehouse server → (data cleaning & data integration, filtering) → Databases and Data Warehouse; a Knowledge base supports the upper layers]
15 Data Mining: On What Kinds of Data?
  Relational database
  Data warehouse
  Transactional database
  Advanced database and information repository
    Spatial and temporal data
    Time-series data
    Stream data
    Multimedia database
    Text databases & WWW
16 Data Mining Functionalities
  Concept description: characterization and discrimination
    Generalize, summarize, and contrast data characteristics
  Association (correlation and causality)
    Diaper → Beer [support = 0.5%, confidence = 75%]
  Classification and prediction
    Construct models (functions) that describe and distinguish classes or concepts for future prediction
    Presentation: decision tree, classification rule, neural network
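The two numbers attached to the Diaper → Beer rule are its support and confidence. A minimal sketch of how they are computed follows; the five baskets are invented for illustration and do not reproduce the slide's 0.5% support.

```python
def support_confidence(transactions, antecedent, consequent):
    """Compute support and confidence of the rule antecedent -> consequent.

    support    = fraction of transactions containing antecedent and consequent
    confidence = fraction of antecedent-containing transactions that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions
                 if (antecedent | consequent) <= set(t))
    support = n_both / n if n else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

s, c = support_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={s:.0%}, confidence={c:.0%}")  # support=60%, confidence=75%
```

Here three of five baskets contain both items (support 60%), and three of the four diaper baskets also contain beer (confidence 75%).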
17 Data Mining Functionalities
  Cluster analysis
    Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
    Maximizing intra-class similarity & minimizing interclass similarity
  Outlier analysis
    Outlier: a data object that does not comply with the general behavior of the data
    Useful in fraud detection, rare events analysis
  Trend and evolution analysis
    Trend and deviation: regression analysis
    Sequential pattern mining, periodicity analysis
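To make the clustering bullet concrete, here is a minimal one-dimensional k-means sketch. The slide does not name an algorithm; k-means is used here only as a familiar example, and the house prices and initial centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Basic k-means on scalar values, starting from the given centers.

    Returns the final centers and the cluster assignment as lists of points.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, clusters

# Hypothetical house prices (in thousands)
prices = [100, 110, 120, 400, 410, 420]
centers, clusters = kmeans_1d(prices, centers=[100, 400])
print(centers)   # [110.0, 410.0]
print(clusters)  # [[100, 110, 120], [400, 410, 420]]
```

No class labels are supplied: the two groups emerge only from the distances between points, which is the sense in which clustering forms new classes.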
18 Are All the “Discovered” Patterns Interesting?
  Data mining may generate thousands of patterns: not all of them are interesting
    Suggested approach: human-centered, query-based, focused mining
  Interestingness measures
    A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  Objective vs. subjective interestingness measures
    Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty
19 Data Mining: Confluence of Multiple Disciplines
  [Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
20 Data Mining: Classification Schemes
  Different views, different classifications
    Kinds of data to be mined
    Kinds of knowledge to be discovered
    Kinds of techniques utilized
    Kinds of applications adapted
21 Multi-Dimensional View of Data Mining
  Data to be mined
    Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, WWW
  Knowledge to be mined
    Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
    Multiple/integrated functions and mining at multiple levels
22 Multi-Dimensional View of Data Mining
  Techniques utilized
    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
  Applications adapted
    Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.
23 OLAP Mining: Integration of Data Mining and Data Warehousing
  Coupling of data mining systems, DBMS, and data warehouse systems
  On-line analytical mining of data
    Integration of mining and OLAP technologies
  Interactive mining of multi-level knowledge
    Necessity of mining knowledge and patterns at different levels of abstraction
  Integration of multiple mining functions
    Characterized classification, first clustering and then association
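Mining at different levels of abstraction can be illustrated by counting the same transactions at the item level and again after rolling items up to a category. The concept hierarchy and the transactions below are hypothetical, and the sketch shows only the roll-up idea, not an OLAP engine.

```python
from collections import Counter

# Hypothetical concept hierarchy: specific item -> more general category
HIERARCHY = {
    "skim milk": "milk",
    "whole milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}

def count_at_level(transactions, level):
    """Count in how many transactions each item (or category) appears.

    level="item" counts the raw items; level="category" first rolls every
    item up to its category through HIERARCHY.
    """
    counts = Counter()
    for t in transactions:
        if level == "category":
            names = {HIERARCHY[item] for item in t}
        else:
            names = set(t)
        counts.update(names)  # each name counted once per transaction
    return counts

transactions = [
    ["skim milk", "wheat bread"],
    ["whole milk", "white bread"],
    ["skim milk", "white bread"],
    ["whole milk"],
]

item_counts = count_at_level(transactions, "item")
category_counts = count_at_level(transactions, "category")
print(item_counts["skim milk"], category_counts["milk"])  # 2 4
```

A pattern that looks weak at the item level (skim milk appears in 2 of 4 transactions) can be strong at the category level (milk appears in all 4), which is one reason interactive drilling up and down is useful.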
24 Major Issues in Data Mining
  Mining methodology
    Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
    Performance: efficiency, effectiveness, and scalability
    Pattern evaluation: the interestingness problem
    Incorporation of background knowledge
    Handling noise and incomplete data
    Parallel, distributed, and incremental mining methods
    Integration of the discovered knowledge with existing knowledge: knowledge fusion
25 Major Issues in Data Mining
  User interaction
    Data mining query languages and ad-hoc mining
    Expression and visualization of data mining results
    Interactive mining of knowledge at multiple levels of abstraction
  Applications and social impacts
    Domain-specific data mining & invisible data mining
    Protection of data security, integrity, and privacy
26 Summary
  Data mining: discovering interesting patterns from large amounts of data
  A natural evolution of database technology, in great demand, with wide applications
  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  Mining can be performed in a variety of information repositories
  Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
  Data mining systems and architectures
  Major issues in data mining
27 Where to Find References?
  More conferences on data mining
    PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
  Data mining and KDD
    Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
    Journals: Data Mining and Knowledge Discovery, KDD Explorations
  Database systems
    Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
    Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
  AI & machine learning
    Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
    Journals: Machine Learning, Artificial Intelligence, etc.
  Statistics
    Conferences: Joint Stat. Meeting, etc.
    Journals: Annals of Statistics, etc.
  Visualization
    Conference proceedings: CHI, ACM-SIGGraph, etc.
    Journals: IEEE Trans. Visualization and Computer Graphics, etc.