2 Outline What is data mining? Basic Data Mining Tasks Classification ClusteringAssociationData mining AlgorithmsAre all the patterns interesting?
3 What is Data MiningAmount of data in databases and files grows exponentially 9Petabytes for Earth observation project in 2010 and 14Petabytes in 2015.Data Mining is interested in finding information in these huge data sourcesTypical database query: SQL, Access and other database languages to get dataData Mining query differs from Database queryQuery not well formulatedData in many sourcesOutput is mostly either visual or multimediaData Mining algorithms to get the information are consisting of three partsModel – The purpose of the algorithm to fit the model to the dataPreferences – Criteria to decide which model is betterSearch – All algorithms require some search techniques
4 Data Mining Information retrieval Statistic Knowledge Based System AlgorithmsMachine Learning
5 Statistic is not Data Mining A big objection to data mining was that it was looking for so many vague connections that it was sure to find things that were bogusThe Rhine Paradox: a great example of how not to conduct scientific research.David Rhine was a parapsychologist in the 1950’s who hypothesized that some people had Extra-Sensory Perception (ESP).He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
6 Example(con’t)He told these people they had ESP and called them in for another test of the same type.Alas, he discovered that almost all of them had lost their ESP.What did he conclude?You shouldn’t tell people that they have ESP: it causes them to lose it
7 Example (con’t) What has really happened: There are 1024 combinations of red and bluecombinations of red and blue of length 10.Thus with probability 0.98 at least one person will guessthe sequence of red blue correctly
8 Knowledge Based System are not Data Mining KDD process selects the data and finds knowledge in the dataData Mining in addition trying to make inferences from the dataHowever, the boundaries are not easy to define
9 Machine Learning is not Data Mining Machine Learning design systems that can learn in the process of processing dataCheckers program designed by one of the scientist eventually learned to play better than the program designerData Mining incorporates the Machine learning methods but also benefits from the methods of other disciplines such as database and statistic
10 What is Data MiningData Mining major task is to find all and only interesting patterns in a set of data sourcesFind all interesting patterns means – CompletenessCan it be doneHeuristic vs Exhaustive searchFind only interesting patterns – ConsistencyIs it possibleApproaches: Generate all patterns and filter out uninteresting patterns; generate only patterns that are interesting
11 Data Mining: On What Kind of Data? Relational databases – Universal relation vs Multirelational searchData warehousesTransactional databasesAdvanced DB and information repositoriesObject-oriented and object-relational databasesSpatial databasesTime-series data and temporal dataSensor DataText databases and multimedia databasesHeterogeneous and legacy databasesWWW
12 Data Mining: On What Kind of Data? Attribute Types:Categorical – attribute that has a finite number of valuesOrdinal – attributes can be ordered by their valuesAttribute Transformations:Continuing - attribute that may have infinite but countable set of values. These attributes always can be orderedInterval scaleBooleanNominal – attributes that cannot be ordered by their valuesOperational - example measurement of programming productivity as a*m(n+m)log(a+b)/2b, where a is the number of unique operators,b is the number of unique operands, n-number of total operators occurrences and m the number of total operands occurrences
13 Data Mining Models Data Mining Descriptive Models Predictive Models TimeseriesClusteringSequenceDiscoverySummarizationClassificationSRegressionAssociationRulesPrediction
14 ClassificationGiven a set of classes, distribute the data into a given set of classes so that a newly arrived data will be with the high probability will fall into one of the classes.Credit Card example: 4 classes: authorize; request more info; do not authorize; contact policeData is a set of credit card applications that contain Name, age, credit score, address, income, own or rent primary residence, etc.
15 RegressionRegression is a process of mapping a given data to some function. Regression may be linear (mapping into a linear function the set of given data or non-linear function.For example, one may map saving amount to a person age as follows:samt = a*age+b, where constant a and b aredetermined by existing dataFitting the rest of the data into a defined function should have the least possible error
16 Time Series AnalysisGiven data that changes with time to predict the data behavior based on the known dataExample: predict stock market, predict the stock price of a specific companyVisualization is an important tool of time series analysisThere are special operations on time series that facilitate the time series analysis
17 Prediction Differences between Classification and Prediction: Classification deals with an existing dataPrediction deals with future eventsMathematical Models are normally used for prediction: Weather forecast, quake forecast, etc.
18 ClusteringClustering is a process of distributing given data into several sets so that distance between different sets is larger than the distance between elements in the same setDifference between Clustering and Classification is that the number of clusters is not known in advance, whereas the number of classes is known in advance.Examples
19 Association Rules and Sequence Discovery Association rules discovery relates to uncovering unexpected relationships between data attribute values. For example people who buy coffee may not buy tee, or man who buy diapers also buy beer. However, women who buy diapers do not buy beerSequence discovery – an ability to determine sequential patterns in the data
20 Data Mining Tasks Data Selection Data Integration Data Cleaning Data TransformationData MiningOutlier AnalysisResult InterpretationTrend and Evolution Analysis
21 Data VisualizationGraphical Interface – bar charts, histograms, line graphsGeometric – scatter diagrams techniquesIcon based – figures, colors to improve results presentationHierarchical – Divide a display area into segmentsHybrid – a combination all of the above
22 Data Mining Major Issues Human InterfaceModel SelectionHow to deal with outliersResults InterpretationsVisualization ResultsDealing with large amounts of dataDimensionality CurseMultimedia DataMissing DataIrrelevant dataIntegrationApplication
23 Data Mining Major Issues Mining MethodologyMining different types of data in databasesInteractive data miningIncorporation of known dataNoise and incomplete dataPerformance and scalabilitySocial Impact –Data Privacy and Security
24 Potential Applications Database analysis and decision supportMarket analysis and managementtarget marketing, customer relation management, market basket analysis, cross selling, market segmentationRisk analysis and managementForecasting, customer retention, improved underwriting, quality control, competitive analysisFraud detection and managementOther ApplicationsText mining (news group, , documents) and Web analysis.Intelligent query answering
25 Market Analysis and Management (1) Where are the data sources for analysis?Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studiesTarget marketingFind clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.Determine customer purchasing patterns over timeConversion of single to a joint bank account: marriage, etc.Cross-market analysisAssociations/co-relations between product salesPrediction based on the association information
26 Market Analysis and Management (2) Customer profilingdata mining can tell you what types of customers buy what products (clustering or classification)Identifying customer requirementsidentifying the best products for different customersuse prediction to find what factors will attract new customersProvides summary informationvarious multidimensional summary reportsstatistical summary information (data central tendency and variation)
27 Corporate Analysis and Risk Management Finance planning and asset evaluationcash flow analysis and predictioncontingent claim analysis to evaluate assetscross-sectional and time series analysis (financial-ratio, trend analysis, etc.)Resource planning:summarize and compare the resources and spendingCompetition:monitor competitors and market directionsgroup customers into classes and a class-based pricing procedureset pricing strategy in a highly competitive market
28 Fraud Detection and Management (1) Applicationswidely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.Approachuse historical data to build models of fraudulent behavior and use data mining to help identify similar instancesExamplesauto insurance: detect a group of people who stage accidents to collect on insurancemoney laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)medical insurance: detect professional patients and ring of doctors and ring of references
29 Fraud Detection and Management (2) Detecting inappropriate medical treatmentDetecting telephone fraudTelephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.RetailAnalysts estimate that 38% of retail shrink is due to dishonest employees.
30 Other Applications Sports IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami HeatAstronomyJPL and the Palomar Observatory discovered 22 quasars with the help of data miningInternet Web Surf-AidIBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
31 Data Mining System Architecture Database, data warehouse, data files- set of data to be mined. Data Cleaning and data integration may be performed at this stageDatabase or data warehouse server is responsible for fetching relevant data. How to define relevancy?Knowledge Base – Domain knowledge that drives a search for patterns. Concept hierarchy, User Beliefs, Interestingness ConstraintsData Mining Engine-Functional algorithms to perform a search for domain expertsPattern Evaluation – Use knowledge base and other methods to narrow search for domain pattersGUI – Communicator between users and data mining system
32 Architecture of a Typical Data Mining System Graphical user interfacePattern evaluationData mining engineKnowledge-baseDatabase or data warehouse serverFilteringData cleaning & data integrationDataWarehouseDatabases
33 SummaryData mining: discovering interesting patterns from large amounts of dataA natural evolution of database technology, in great demand, with wide applicationsA KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentationMining can be performed in a variety of information repositoriesData mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.Classification of data mining systemsMajor issues in data mining