2 What is Data Mining?
- Different perspectives: CS, business, IT
- As a field of research in CS:
  - The science of extracting useful information from large data sets or databases
  - Also known as:
    - Knowledge Discovery and Data Mining (KDD)
    - Knowledge Discovery in Databases (KDD)

3 Knowledge Discovery and Data Mining (KDD)
- KDD lies at the intersection of statistics, machine learning, databases, pattern recognition, information retrieval, and artificial intelligence.

4 Data Mining Definitions
- Analysis of datasets to find unsuspected relationships
- Summarizing data in novel ways that are understandable and useful to the data owner
- Extraction of knowledge from data:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data
- Process of discovering patterns:
  - Automatically or semi-automatically, in large quantities of data
  - Patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic

5 Why Data Mining?
- Large datasets are common, due to advances in digital data acquisition and storage technology.
- Automatic data production leads to a need for automatic data consumption.
- Large databases mean vast amounts of information.
- The difficulty lies in accessing it.
- Business:
  - Supermarket transactions
  - Credit card usage records
  - Telephone call details
  - Government statistics
- Scientific:
  - Images of astronomical bodies
  - Molecular databases
  - Medical records

6 Why Data Mining?
- Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
  - Massive data collection
  - Powerful multiprocessor computers
  - Data mining algorithms

7 Example of Data Mining
- If a store tracks a customer's purchases and notices that the customer buys a lot of silk shirts, the data mining system will make a correlation between that customer and silk shirts.
- The store may begin direct-mail marketing of silk shirts to that customer, or it may instead try to get the customer to buy a wider range of products.
- Another example: analysts found that beer and diapers were often bought together.
- So place the high-profit diapers next to the high-profit beers.
- This technique is often referred to as "Market Basket Analysis".

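The market basket idea on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not a production algorithm: the transactions, item names, and the helper functions `pair_support` and `confidence` are all hypothetical and invented for this example. It counts how often each pair of items appears in the same basket (support) and estimates how often one item is bought given that another is (confidence).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; the items are illustrative, not real data.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate the share of baskets with `antecedent` that also hold `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
top_pair = max(support, key=support.get)  # most frequently co-purchased pair
beer_to_diapers = confidence(transactions, "beer", "diapers")
```

On this toy data the pair with the highest support is beer and diapers. Real market basket tools work on far larger data and prune the search (for example with the Apriori algorithm), which this sketch does not attempt.
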
8 Steps in the Evolution of Data Mining

| Evolutionary Step | Business Question | Enabling Technologies |
|---|---|---|
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses |
| Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases |

9 The Scope of Data Mining
- Automated prediction of trends and behaviors.
  - Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings.
- Automated discovery of previously unknown patterns.
  - An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
- More columns.
  - High-performance data mining allows users to explore the full depth of a database, without pre-selecting a subset of variables.
- More rows.
  - Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

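The "more rows" point can be checked with a small simulation. The sketch below is an assumption-laden illustration: the uniform random data, the trial count, and the function name `spread_of_sample_means` are all hypothetical choices made for this example. It draws many samples of a given size, computes the mean of each, and measures how much those means vary. For independent draws the spread is expected to shrink roughly in proportion to 1/sqrt(n).

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=200, seed=0):
    """Return the standard deviation of the sample mean over repeated samples.

    Each trial draws `sample_size` values from a uniform distribution on
    [0, 1); a fixed seed keeps the run reproducible.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.random() for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Compare a small sample with a sample one hundred times larger.
small = spread_of_sample_means(10)
large = spread_of_sample_means(1000)
```

With these settings `large` should come out at around a tenth of `small`, which is the sense in which larger samples lower estimation error.
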
10 Data Mining vs. Statistics
- The objective of the data mining exercise plays no role in the data collection strategy.
  - In this way it differs from much of statistics.
  - For this reason, data mining is referred to as secondary data analysis.
- KDD is more complicated than initially thought:
  - 80% preparing data
  - 20% mining data

11 Query: Database vs. Data Mining
- Database: when you know exactly what you are looking for.
- Data mining: when you only vaguely know what you are looking for.

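The contrast above can be made concrete with a toy example. Everything here is hypothetical: the sales records, the field names, and the two helper functions are invented for illustration. The first function answers a precise question, as a database query would. The second does not know in advance which attribute matters, so it scans all of them and reports the one whose groups differ most in average units sold, a very crude stand-in for a mining step.

```python
from collections import defaultdict

# Hypothetical sales records, invented for this sketch.
sales = [
    {"region": "New England", "month": "March", "channel": "web", "units": 120},
    {"region": "New England", "month": "April", "channel": "store", "units": 80},
    {"region": "Midwest", "month": "March", "channel": "store", "units": 90},
    {"region": "Midwest", "month": "April", "channel": "web", "units": 130},
    {"region": "South", "month": "March", "channel": "web", "units": 125},
    {"region": "South", "month": "April", "channel": "store", "units": 75},
]


def database_query(rows, region, month):
    """Exact question: total units for one known region and month."""
    return sum(
        r["units"] for r in rows if r["region"] == region and r["month"] == month
    )


def strongest_attribute(rows, target="units"):
    """Vague question: which attribute is most associated with the target?

    For each attribute, group rows by value and take the gap between the
    highest and lowest group mean of `target`.
    """
    gaps = {}
    for attr in rows[0]:
        if attr == target:
            continue
        groups = defaultdict(list)
        for r in rows:
            groups[r[attr]].append(r[target])
        means = [sum(values) / len(values) for values in groups.values()]
        gaps[attr] = max(means) - min(means)
    return max(gaps, key=gaps.get), gaps


march_units = database_query(sales, "New England", "March")
best_attr, gaps = strongest_attribute(sales)
```

On this data the query returns a single number for the question asked, while the scan surfaces the sales channel as the attribute with the largest gap, something the analyst did not have to specify beforehand.
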
12 Data Mining Tasks and Techniques
- Not so much a single technique
- The idea that there is more knowledge hidden in the data than shows itself on the surface
- Any technique that helps to extract more out of data is useful
- Five major task types:
  1. Exploratory Data Analysis (Visualization)
  2. Descriptive Modeling (Density estimation, Clustering)
  3. Predictive Modeling (Classification and Regression)
  4. Discovering Patterns and Rules (Association rules)
  5. Retrieval by Content (Retrieve items similar to pattern of interest)

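As one small instance of task type 3, predictive modeling by regression, the sketch below fits a straight line to a handful of points by ordinary least squares and extrapolates one step ahead. The monthly figures and the function name `fit_line` are hypothetical, chosen only so the arithmetic is easy to follow. Real predictive models would use more variables and validate on held-out data.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical unit sales for five consecutive months.
months = [1, 2, 3, 4, 5]
units = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, units)
next_month_forecast = slope * 6 + intercept  # extrapolate to month 6
```

Because the toy data lies exactly on a line, the fitted slope is 10 units per month and the forecast for month 6 is 150.
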
13 Privacy Concerns
- For example, if an employer has access to medical records, they may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut insurance costs, but it creates ethical and legal problems.
- Essentially, data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.

14 Notable Uses of Data Mining
- Data mining has been cited as the method by which the U.S. Army intelligence unit Able Danger supposedly identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al Qaeda cell operating in the U.S. more than a year before the attack.