Data mining in commerce
About 13 million customers per month contact the West Coast customer service call center of the Bank of America. In the past, each caller would have listened to the same marketing advertisement, whether or not it was relevant to the caller's interests. Chris Kelly, vice president and director of database marketing: "Rather than pitch the product of the week, we want to be as relevant as possible to each customer." Thus, based on individual customer profiles, the customer can be informed of new products that may be of greatest interest. Data mining helps to identify the type of marketing approach for a particular customer, based on the customer's individual profile.
Why mine data? Commercial viewpoint
- Lots of data is being collected: Web data, e-commerce, purchases at department/grocery stores, bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong: provide better, customized services
(R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications")
Why mine data? Scientific viewpoint
- Data is collected and stored at enormous speeds (GB/hour): remote sensors on a satellite, telescopes scanning the skies, microarrays generating gene expression data, scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists in classifying and segmenting data, and in hypothesis formation
(R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications")
Data mining in bioinformatics
Brain tumors represent the most deadly cancer among children. A gene expression database for pediatric brain tumors was built in an effort to develop more effective treatment.
Clearly, a lot of data is being collected. However, what is being learned from all this data? What knowledge are we gaining from all this information?
"We are drowning in information but starved for knowledge."
The problem today is not that there is not enough data. Rather, the problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge.
This is where data mining comes into play. There exist many definitions of data mining; here are just two of them.
"Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques." (www.gartner.com)
"Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases." (Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall, Upper Saddle River, NJ, 1998)
The growth in this field has been fueled by several factors:
- growth in data collection
- storing of the data in data warehouses
- availability of increased access to data from the Web
- competitive pressure to increase market share
- development of data mining software suites
- tremendous growth in computing power and storage capacity
Need for human direction of DM
Don't believe software vendors advertising their analytical software as a plug-and-play, out-of-the-box application providing solutions without the need for human interaction!
Data mining is not a product that can be bought; it is a discipline that must be mastered!
It is easy to do data mining badly. Software always gives some result. A little knowledge is especially dangerous:
- analysis carried out on unpreprocessed data can lead to erroneous conclusions; the models can be way off
- if deployed, the errors can lead to very expensive failures
The costly errors stem from the black-box approach.
Data mining trap
If we try hard enough, we always find some patterns. However, they may be just a matter of chance; they don't have to be characteristic of the process that generates the data.
A derogatory sporting definition of data mining: "Data mining means sorting through a huge volume of data, extracting decision rules that seem to favor one team over another, but without regard to whether or not there is any cause-and-effect relationship. Data mining is the equivalent of sitting a huge number of monkeys down at keyboards, and then reporting on the monkeys who happened to type actual words."
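The trap can be demonstrated in a few lines. This hypothetical sketch (not from the lecture) generates purely random win/loss outcomes and then searches dozens of arbitrary decision rules against them; some rule always fits the noise well, even though it predicts nothing.

```python
# Hypothetical sketch: searching enough arbitrary rules against pure
# noise will always turn up an apparent "pattern".
import random

random.seed(0)
# 20 games with purely random outcomes for one team: 1 = win, 0 = loss
outcomes = [random.randint(0, 1) for _ in range(20)]

best_rule, best_acc = None, 0.0
# Try 44 arbitrary rules of the form "the team wins iff game_index % k == r"
for k in range(2, 10):
    for r in range(k):
        preds = [1 if i % k == r else 0 for i in range(20)]
        acc = sum(p == o for p, o in zip(preds, outcomes)) / len(outcomes)
        if acc > best_acc:
            best_rule, best_acc = (k, r), acc

# The best rule fits the noise well, but says nothing about future games.
print("best rule:", best_rule, "accuracy:", best_acc)
```

Because the rules for k = 2 are complementary, the best rule is guaranteed an accuracy of at least 0.5 on this data, regardless of the outcomes; with 44 candidates it typically scores much higher, purely by chance.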
Instead, apply a "white-box" methodology, i.e. understand the algorithms and statistical model structures underlying the software.
The white-box approach is the reason why you are attending this lecture (apart from the fact that the lecture is compulsory).
Data mining as a process
One of the fallacies associated with DM is that DM represents an isolated set of tools. Instead, DM should be viewed as a process. The process is standardized in the CRISP-DM framework (http://www.crisp-dm.org/):
- Cross-Industry Standard Process for Data Mining
- developed in 1996 by analysts from DaimlerChrysler, SPSS, and NCR
- provides a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit
(NCR Corporation, formerly National Cash Register, is a technology company specializing in products for the retail, financial, travel, healthcare, food service, entertainment, gaming and public sector industries.)
CRISP-DM is a life cycle consisting of six phases. The phase sequence is adaptive (e.g. depending on the behavior and characteristics of the model, we may have to return to the data preparation phase for further refinement before moving forward to the model evaluation phase).
Business understanding phase
- formulate the project objectives and requirements
Data understanding phase
- collect the data
- use EDA (exploratory data analysis) to familiarize yourself with the data
- evaluate the quality of the data
Data preparation phase
- prepare the final data set from the initial raw data; this phase is very labor intensive
- select the cases and variables you want to analyze
- perform transformations of variables, if needed
- clean the raw data so they are ready for the modeling tools
Modeling phase
- select and apply appropriate modeling techniques
- calibrate model settings to optimize results
- often, several different techniques may be used
- if necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique
Evaluation phase
- evaluate models for quality and effectiveness
- establish whether some important facet of the business or research problem has not been accounted for sufficiently
Deployment phase
- make use of the models created
- examples of deployment: a report, or implementing a parallel DM process in another department
CRISP-DM example
Investigated patterns in the warranty claims for DaimlerChrysler automobiles.
Business understanding
- objectives: reduce costs associated with warranty claims and improve customer satisfaction
- specific business problems can be formulated: Are there interdependencies among warranty claims? Are past warranty claims associated with similar claims in the future?
(Jochen Hipp and Guido Lindner, "Analyzing warranty claims of automobiles: an application description following the CRISP-DM data mining process", in Proceedings of the 5th International Computer Science Conference (ICSC '99), pp. 31-40, Hong Kong, December 13-15, 1999)
Data understanding
- use of DaimlerChrysler's Quality Information System (QUIS), which contains information on over 7 million vehicles and is about 40 gigabytes in size
- QUIS contains production details about how and where a particular vehicle was constructed, plus warranty claim information
- the researchers stressed the fact that the database was entirely unintelligible to domain nonexperts
- experts from different departments had to be located and consulted, a task that turned out to be rather costly
Data preparation
- the QUIS DB did not contain all the information needed for the modeling purposes; e.g. the variable "number of days from selling date until first claim" had to be derived from the appropriate date attributes
- the researchers then turned to DM software, where they ran into a common roadblock: data format requirements varied from algorithm to algorithm
- the result was further exhaustive preprocessing of the data
- the researchers mention that the data preparation phase took much longer than they had planned
Modeling
- to investigate dependencies, researchers used Bayesian networks and association rules mining
- the details of the results are confidential, but we can get a general idea of the dependencies uncovered by the models: a particular combination of construction specifications doubles the probability of encountering an automobile electrical cable problem
Evaluation
- the researchers were disappointed that the association rules models were found to be lacking in effectiveness and to fall short of the objectives set for them in the business understanding phase: "In fact, we did not find any rule that our domain experts would judge as interesting."
- to account for this, the researchers point to the "legacy" structure of the database, in which automobile parts were categorized by garages and factories for historic or technical reasons, not designed for data mining
- they suggest redesigning the database to make it more amenable to knowledge discovery
Deployment
- it was a pilot project, without the intention to deploy any large-scale models from the first iteration
- product: a report describing the lessons learned from this project, e.g. changing the structure of the database (new variables, a different categorization of automobile parts)
Lessons learned
- uncovering hidden nuggets of knowledge in databases is a rocky road
- intense human participation and supervision is required at every stage of the data mining process
- there is no guarantee of positive results
Connection to other fields
Data mining draws on machine learning, pattern recognition, visualization, database systems, and statistics.
Machine learning
A subfield of artificial intelligence; the discipline concerned with the design and development of algorithms that allow computers to evolve behavior based on experience.
- experience: empirical data, such as from sensors or databases
- evolve behavior: usually through a search for patterns in the data
ML has a similar goal to DM, and DM uses algorithms from ML.
Pattern recognition
The problem of searching for patterns is a fundamental one, with a long and successful history. For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics.
Pattern recognition is the automatic discovery of regularities in data through the use of computer algorithms, and the use of these regularities to take actions such as classifying the data into different categories.
Pattern recognition example (from data to patterns): if a train has 2 wagons, it goes to the left.
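The toy rule above, once discovered, can be applied to new data to classify it. A minimal sketch (the function name is ours, not the lecture's):

```python
# Hypothetical sketch: applying a discovered regularity as a classifier.
def classify_train(wagon_count):
    """Learned pattern from the slide: trains with 2 wagons go to the left."""
    return "left" if wagon_count == 2 else "right"

print(classify_train(2))  # -> left
print(classify_train(4))  # -> right
```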
Connection to other fields
Data mining draws on machine learning, pattern recognition, visualization, database systems, and statistics.
Iris Sample Data Set
Many of the exploratory data techniques are illustrated with Fisher's iris plant data set:
- from the statistician Ronald Fisher, mid-1930s
- can be obtained from the UCI Machine Learning Repository
(based on WEKA tutorial)
The dataset contains flower dimension measurements on 50 samples of each of three species: iris setosa, iris versicolor, and iris virginica.
Let's examine the origins of Fisher's iris dataset. During the 1930s, the botanist Edgar Anderson traveled to the Gaspé Peninsula to study irises. His work resulted in the creation of a dataset (as we now term it) containing measurements of flower petal and sepal dimensions for each of three different species of iris. The dataset contains 50 samples of each of the three species, for a total of 150 samples. Anderson's dataset was then used by Ronald Fisher in a landmark study describing a means of classifying an iris of unknown species based on flower dimension measurements.
(Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems". Annals of Eugenics 7: 179-188.)
Anderson performed measurements on four flower dimensions: sepal length, sepal width, petal length, and petal width. He performed these measurements on the three iris species: setosa, versicolor, and virginica.
Data mining terminology:
- the four iris dimensions are termed attributes, input attributes, or features
- the three iris species are termed classes, or output attributes
- each example of an iris is termed a sample, instance, object, or data point
(based on WEKA tutorial)
Here is a segment of the Fisher's dataset spreadsheet that you downloaded, showing the four input attributes and one output attribute, or class, for five of the 150 instances, or samples, of iris. For this dataset, the input attributes are numerical attributes, meaning that the attributes are given as real numbers, in this case in centimeters. The output attribute is a nominal attribute, in other words a name for a particular species of iris.
(based on WEKA tutorial)
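The structure just described can be sketched directly in code: four numerical input attributes plus one nominal class per sample. This is an illustrative sketch, not the downloaded spreadsheet itself; the rows are representative iris measurements in centimeters.

```python
# Sketch of the dataset's structure in the terminology above.
samples = [
    # (sepal_length, sepal_width, petal_length, petal_width, class)
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
]

for *features, species in samples:
    # features are numerical attributes; species is the nominal class
    print(f"numerical attributes: {features}  nominal class: {species}")
```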
Statistics
Statistical analysis: summary statistics (mean, median, standard deviation).
Exploratory Data Analysis (EDA)
- a preliminary exploration of the data to better understand its characteristics
- created by the statistician John Tukey
- a nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
EDA
- helps to select the right tool for preprocessing or analysis
- people can recognize patterns not captured by data analysis tools
In EDA, as originally defined by Tukey:
- the focus was on visualization
- clustering and anomaly detection were viewed as exploratory techniques
- a human makes and validates hypotheses
In data mining, by contrast:
- clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
- the computer makes and validates hypotheses
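The summary statistics mentioned above can be computed with the standard library alone. A minimal sketch on a small illustrative sample of sepal lengths (in cm, not the full dataset):

```python
# Sketch: mean, median, and standard deviation of a small sample.
import statistics

sepal_length = [5.1, 4.9, 4.7, 6.4, 6.9, 5.5, 6.3, 5.8]

print("mean  :", round(statistics.mean(sepal_length), 2))
print("median:", round(statistics.median(sepal_length), 2))
print("stdev :", round(statistics.stdev(sepal_length), 3))
```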
Histogram of sepal length, colored by species. As it turns out for this histogram, dark blue is setosa, red is versicolor, and bright blue is virginica. The histogram shows that there are, for example, 16 samples in the lowest histogram bin for sepal length, all of which are setosa irises, and 7 samples in the highest bin, all of which are virginica.
(based on WEKA tutorial)
By comparing the histograms for all of the input attributes, we can begin to get a sense of how the four input attributes vary with different iris species. For example, it appears that iris setosa tends to have relatively small sepal length, petal length, and petal width, but relatively large sepal width. These are the types of patterns that data mining algorithms use to perform classification and other functions. Notice also that the species histogram verifies that we have 50 of each iris species in our dataset.
(based on WEKA tutorial)
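The per-species histograms above amount to binning an attribute and counting samples per (bin, species) pair. A sketch of that bookkeeping, with made-up data values and an assumed bin width of 0.5 cm:

```python
# Sketch: build per-species histogram counts for one attribute.
from collections import Counter

# (species, sepal_length) pairs -- illustrative, not the real dataset
data = [("setosa", 4.8), ("setosa", 5.0), ("versicolor", 5.9),
        ("versicolor", 6.1), ("virginica", 6.5), ("virginica", 7.2)]

bin_width = 0.5
bins = Counter()
for species, length in data:
    lo = int(length / bin_width) * bin_width   # left edge of the bin
    bins[(round(lo, 1), species)] += 1

for (edge, species), count in sorted(bins.items()):
    print(f"[{edge}, {edge + bin_width}) {species}: {count}")
```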
Here we see a plot of the dataset: sepal length is displayed on the x-axis, and sepal width on the y-axis. The plot screen also shows us which colors are associated with which classes on the various Weka screens.
(based on WEKA tutorial)
Connection to other fields
Data mining draws on machine learning, pattern recognition, visualization, database systems, and statistics.
Visualization
Can reveal hypotheses.
(based on WEKA tutorial)
Connection to other fields
Data mining draws on machine learning, pattern recognition, visualization, database systems, and statistics.
Data warehouse
- a data warehouse is a repository of an organization's electronically stored data
- data warehouses are designed to facilitate reporting and analysis
- technology: relational database systems, multidimensional database systems
Data warehousing is the process of constructing and using a data warehouse: the coordinated, periodic copying of data from various sources, both inside and outside the enterprise, into an environment optimized for analytical and informational processing.
Data warehousing includes:
- business intelligence tools
- tools to extract, transform, and load data
- tools to manage and retrieve metadata
(ERP: enterprise resource planning; CRM: customer relationship management)
Business intelligence tools
A type of application software designed to report, analyze and present data. They include:
- reporting and querying software: "Tell me what happened." (tools that extract, sort, summarize, and present selected data)
- OLAP (On-Line Analytical Processing): "Tell me what happened and why."
- data mining: "Tell me what might happen." (prediction); "Tell me something interesting." (relationships)
OLAP
Query and report data is typically presented in row after row of two-dimensional data. OLAP: "Tell me what happened and why." To support this type of processing, OLAP operates against multidimensional databases.
Example: iris data
We show how the attributes petal length, petal width, and species type can be converted to a multidimensional array.
- first, we discretize the petal width and length to have categorical values: low, medium, and high
- we get the following table; note the count attribute
Each unique tuple of petal width, petal length, and species type identifies one element of the array. This element is assigned the corresponding count value. The figure illustrates the result; all non-specified tuples are 0.
Slices of the multidimensional array are shown by cross-tabulations, one for each species: setosa, versicolor, and virginica.
Creating a Multidimensional Array
Two key steps in converting tabular data into a multidimensional array:
1. Identify which attributes are to be the dimensions and which attribute is to be the target attribute whose values appear as entries in the multidimensional array. The attributes used as dimensions must have discrete values; the target value is typically a count or a continuous value.
2. Find the value of each entry in the multidimensional array by summing the values (of the target attribute) or counts of all objects that have the attribute values corresponding to that entry.
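The two steps above can be sketched on the iris example: discretize petal length and width into low/medium/high, then count the objects falling into each cell. The discretization thresholds and data rows are illustrative, not the lecture's actual values.

```python
# Sketch: tabular data -> multidimensional count array (as a sparse dict).
from collections import Counter

def discretize(value, low_max, med_max):
    """Map a numeric value to a categorical level (thresholds assumed)."""
    if value <= low_max:
        return "low"
    return "medium" if value <= med_max else "high"

# (petal_length, petal_width, species) -- illustrative rows
rows = [(1.4, 0.2, "setosa"), (1.5, 0.2, "setosa"),
        (4.7, 1.4, "versicolor"), (6.0, 2.5, "virginica")]

cube = Counter()
for length, width, species in rows:
    cell = (discretize(length, 2.5, 5.0), discretize(width, 0.8, 1.7), species)
    cube[cell] += 1          # the count attribute; unspecified cells stay 0

print(cube[("low", "low", "setosa")])   # -> 2
```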
OLAP Operations: Data Cube
The key operation of OLAP is the formation of a data cube. A data cube is a multidimensional representation of data, together with all possible aggregates.
By all possible aggregates, we mean the aggregates that result from selecting a proper subset of the dimensions and summing over all remaining dimensions. For example, if we choose the species type dimension of the iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.
Data Cube Example
Consider a data set that records the sales of products at a number of company stores at various dates. This data can be represented as a three-dimensional array. There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total).
The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates and the overall total.
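The seven aggregates of such a product x store x date cube can be enumerated by summing over each proper subset of the dimensions. A sketch with made-up data and names:

```python
# Sketch: all aggregates of a 3-D sales cube (3 two-dim, 3 one-dim, 1 zero-dim).
from collections import defaultdict
from itertools import combinations

# (product, store, date) -> units sold; values are illustrative
sales = {("widget", "s1", "2009-01"): 3,
         ("widget", "s2", "2009-01"): 5,
         ("gadget", "s1", "2009-02"): 2}

dims = ("product", "store", "date")
for size in (2, 1, 0):                       # proper subsets of the dimensions
    for kept in combinations(range(3), size):
        agg = defaultdict(int)
        for key, units in sales.items():
            # sum over the dimensions that were not kept
            agg[tuple(key[i] for i in kept)] += units
        print([dims[i] for i in kept], dict(agg))
```

The empty subset yields the zero-dimensional aggregate, the overall total.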
OLAP Operations
Various operations are defined on the data cube:
- slicing: selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions
- dicing: selecting a subset of cells by specifying a range of attribute values
- roll-up and drill-down: changing the granularity of the data
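Slicing and dicing can be sketched on a dict-of-cells version of the iris count array from earlier (cell values are illustrative):

```python
# Sketch: slicing (fix a dimension value) vs. dicing (keep a value range).
cube = {("low", "low", "setosa"): 2,
        ("medium", "medium", "versicolor"): 1,
        ("high", "high", "virginica"): 1}

# Slice: specify a value for one dimension (species = "setosa"),
# leaving a two-dimensional result
setosa_slice = {(l, w): n for (l, w, s), n in cube.items() if s == "setosa"}
print(setosa_slice)  # -> {('low', 'low'): 2}

# Dice: specify a range along a dimension (petal length in {"low", "medium"}),
# keeping all three dimensions in place
dice = {cell: n for cell, n in cube.items() if cell[0] in ("low", "medium")}
print(len(dice))  # -> 2
```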
OLAP Operations: Roll-up and Drill-down
Attribute values often have a hierarchical structure:
- each date is associated with a year, month, and week
- a location is associated with a continent, country, state (province, etc.), and city
- products can be divided into various categories, such as clothing, electronics, and furniture
Note that these categories often nest and form a tree or lattice: a year contains months, which contain days; a country contains states, which contain cities.
This hierarchical structure gives rise to the roll-up and drill-down operations:
- for sales data, we can aggregate (roll up) the sales across all the dates in a month
- conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals
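A roll-up along the date hierarchy is just a grouped sum. This sketch (with made-up daily figures) rolls daily sales up to monthly totals; drilling back down requires keeping the finer-grained daily data around.

```python
# Sketch: roll up daily sales to monthly totals along the date hierarchy.
from collections import defaultdict

daily = {"2009-01-05": 4, "2009-01-20": 6, "2009-02-02": 3}

monthly = defaultdict(int)
for day, units in daily.items():
    monthly[day[:7]] += units    # "YYYY-MM-DD" -> "YYYY-MM": coarser level

print(dict(monthly))  # -> {'2009-01': 10, '2009-02': 3}
```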