1 Principles of data mining: Chapter Two
2 Chapter Overview: the process of data mining; approaches of data mining; categories of data mining problems; information patterns to be discovered; overview of data mining solutions; importance of evaluation; undertaking a data mining task in Weka; review of basic concepts in statistics and probability.
3 Data Mining Process. [Figure: process diagram linking Input Data through the stages Preparing Data, Data Mining (producing Patterns) and Post-processing to Output. Legend: a data mining stage; flow of control from one stage to the next stage; flow of control from one stage to the previous stage; repetition of the tasks at one stage.]
4 Data Mining Process: Preparation. Tasks: selecting relevant features; selecting relevant records; data cleaning; dealing with unknown data; data transformation; integrating data; getting necessary data details; formatting data into a form acceptable to the mining tool. [Figure: the data sets named in the diagram are the original data sets, the collected data set, the target data set, the pre-processed data set and the formatted data set.]
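A minimal sketch of three of the preparation tasks listed above (selecting relevant features, dealing with unknown data, data transformation). The record layout, the "?" marker for unknown values, mean imputation and min-max scaling are all assumptions chosen for illustration, not operations prescribed by the slide.

```python
from statistics import mean

# Hypothetical raw records; "?" marks an unknown value (an assumed convention).
raw = [
    {"id": 1, "age": "25", "income": "30000", "note": "a"},
    {"id": 2, "age": "?", "income": "42000", "note": "b"},
    {"id": 3, "age": "35", "income": "54000", "note": "c"},
]

def prepare(records, features):
    # Selecting relevant features: keep only the named columns.
    rows = [{f: r[f] for f in features} for r in records]
    for f in features:
        known = [float(r[f]) for r in rows if r[f] != "?"]
        fill = mean(known)  # dealing with unknown data: impute the mean
        lo, hi = min(known), max(known)
        for r in rows:
            value = fill if r[f] == "?" else float(r[f])
            # Data transformation: min-max scaling into [0, 1].
            r[f] = (value - lo) / (hi - lo) if hi > lo else 0.0
    return rows

prepared = prepare(raw, ["age", "income"])
```

Other imputation or scaling choices would fit the same slots; the point is only that each preparation task becomes an explicit, repeatable step.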
5 Data Mining Process: Mining. Tasks: determining data mining tasks; assigning roles to data for certain tasks; selecting data mining solution(s) for each task; setting necessary parameters for the solution; collecting result patterns. [Figure: the formatted data set is fed to mining solutions with their parameter settings, e.g. Solution1(p1, p2, …, pn), Solution2(t1, t2, …, tr), Solution3(w1, w2, …, wm), which produce patterns.]
6 Data Mining Process: Post-processing. Tasks: pattern evaluation; pattern selection; pattern interpretation. [Figure: Patterns → Evaluation (against criteria; reject) → valid patterns → Selection (accept) → selected patterns → Interpretation → knowledge learnt.]
7 Data Mining Process: Roles of participants in data mining. Participants include: data miners / data analysts, the main participants of a DM project; domain experts, the main collaborators of a DM project; decision makers, the clients of a DM project. Risk of human bias in the discovery process. Important roles of the domain expert: pattern interpretation (for usefulness); pattern evaluation (for significance); mining options (for suitable tasks, limited); advice on data pre-processing (for suitable operations, limited). Balancing the strengths of human and machine.
8 Data Mining Approaches. Hypothesis testing approach: top-down, led by a hypothesis statement. Procedure: forming a hypothesis statement; collecting and selecting relevant data; conducting data analysis and collecting patterns; interpreting the patterns to accept or reject the hypothesis. Discovery approach: bottom-up, without a hypothesis in mind. Procedure: collecting and preparing data of interest; conducting data analysis and discovering possible patterns; evaluating the importance and interestingness of the patterns.
9 Data Mining Approaches: Discovery approach (cont'd). Directed discovery (supervised learning): certain aspects of the outcome of the discovery, i.e. the goal, have been specified, and the task is to find patterns that satisfy the goal, e.g. patterns relating to the outcome of a class variable. Undirected discovery (unsupervised learning): no goal is specified, and the task is to find patterns that are significant in some way, e.g. associative links among attribute values.
10 Data Mining: Problems & Patterns. Classification: construct a classification model to determine the class of a given record. [Figure: (a) Model development phase: an example data set is fed to a model construction method, which produces a classification model. (b) Model use phase: an unseen data record with undetermined class is passed to the classification model, which outputs the data record with its class determined.]
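The two phases can be sketched with a one-nearest-neighbour classifier, chosen here only because it is short; the example records, class labels and function names are hypothetical and not taken from the slide.

```python
import math

def train(examples):
    # Model development phase: a nearest-neighbour "model" simply
    # stores the labelled example records.
    return list(examples)

def classify(model, record):
    # Model use phase: give the unseen record the class of the
    # closest stored example (Euclidean distance).
    nearest = min(model, key=lambda example: math.dist(example[0], record))
    return nearest[1]

# Hypothetical example data set: (attribute values, class label).
examples = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 5.5), "high"),
]

model = train(examples)
label = classify(model, (5.5, 4.5))
```

Other model forms, such as decision trees or rule lists, follow the same split between building the model once and applying it to unseen records.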
11 Data Mining: Problems & Patterns. Various forms of classification models: instance space; neural network; decision tree; list of ordered classification rules; function (linear regression); many more.
12 Data Mining: Problems & Patterns. Cluster detection: measure similarity among data objects and group them into clusters accordingly. [Figure: input data points are fed to a clustering method, which outputs the cluster memberships of the data points.]
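As an illustration of grouping by similarity, here is a small k-means sketch using Euclidean distance as the (dis)similarity measure. The points, the initial centroids and the fixed iteration count are assumptions made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    # Alternate between assigning each point to its nearest centroid
    # and moving each centroid to the mean of its members.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]  # keep an empty cluster's centroid
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical input data points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

The returned `clusters` list is the "cluster memberships" output named in the figure: one list of points per cluster.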
13 Data Mining: Problems & Patterns. Forms of clustering results: clusters of various shapes; hierarchical clustering results; ellipse-shaped clusters.
14 Data Mining: Problems & Patterns. Association rule mining: discover significant relationships between data objects. [Figure: an association mining method produces rules of the form X → Y.] Various associations: between values, e.g. Apple → Coke; between categories of values, e.g. Food → Magazine; between values of attributes, e.g. Married:yes → OwnHouse:yes; over time periods, e.g. year 1: Database → year 2: Data Mining.
15 Data Mining: Problems & Patterns. An example: Classification model? Clusters? Association rules?
16 Data Mining Solutions: An Overview. Classification solutions: decision tree, e.g. ID3; k nearest neighbour (kNN), e.g. PEBLS; rules, e.g. Sequential Cover; Bayes' theorem, e.g. Naïve Bayes; artificial neural network. Clustering solutions: partition-based methods, e.g. K-means; hierarchical methods, e.g. agglomeration; density-based methods, e.g. DBSCAN; model-based methods, e.g. Expectation-Maximisation; graph-based methods, e.g. Chameleon.
17 Data Mining Solutions: An Overview. Association rule solutions: greedy methods, e.g. Apriori; graph-based methods, e.g. FP-Growth. Methods for various associations: Boolean associations; generalised associations (multi-level associations); quantitative associations (multidimensional associations); sequential associations (sequential patterns). Because one type of data mining problem can be transformed into another, some solutions for one type can also be applied to another type.
18 Evaluation of Patterns. Importance of evaluating result patterns: a classification model must be accurate enough to be credible; clusters must genuinely exist; association rules must be strong enough to be believed; data descriptions must be general enough to cover a large part of the data set. How do we evaluate the discovered patterns?
19 Evaluation of Patterns. Possible measures of interestingness. Objective measures, based on data and pattern: conciseness of pattern, e.g. minimum description length; coverage, e.g. coverage for classification rules; reliability, e.g. accuracy of a classification model; peculiarity, e.g. measures of difference from the norm; diversity, e.g. tendency of clusters. Subjective measures, based on domain knowledge: novelty; surprisingness; usefulness; applicability.
20 Evaluation of Patterns. Commonly used measures. Accuracy rate or error rate for classification models: true positive, false positive, false negative (see section 6.5.1). Quality of clusters: quality of a cluster, overall quality of all clusters (see section 4.5.1). Strengths of associations: support, confidence, lift (see section and 8.6).
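A short sketch of how some of these measures are commonly computed, using the standard textbook definitions: accuracy as the fraction of correctly classified records, and support, confidence and lift for a rule X → Y. The transactions and counts are hypothetical, and the sections cited on the slide may define variants.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"apple", "coke"},
    {"apple", "coke", "bread"},
    {"apple", "bread"},
    {"coke"},
    {"bread"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Of the transactions containing X, the fraction also containing Y.
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence relative to the baseline frequency of Y;
    # a value above 1 suggests a positive association.
    return confidence(x, y) / support(y)

def accuracy(tp, tn, fp, fn):
    # Fraction of correctly classified records; error rate = 1 - accuracy.
    return (tp + tn) / (tp + tn + fp + fn)

rule_support = support({"apple", "coke"})
rule_confidence = confidence({"apple"}, {"coke"})
rule_lift = lift({"apple"}, {"coke"})
model_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
```

For the rule Apple → Coke in this toy data the lift is slightly above 1, so the association is only marginally stronger than chance.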
21 Data Mining in Weka Explorer. The roadmap. [Figure labels: Preprocess tab page; Classify tab page; Cluster tab page; Associate tab page; Tree Visualiser window; steps marked (1), (2), (3).]
22 Data Mining in Weka Explorer: Preprocess. [Figure labels: open data set from different sources; generate random data set; display & edit data; save data set into a file; filters for pre-processing; data summary; attribute display, selection & removal from the opened data set; selected attribute summary; selected attribute visualisation; visualise all attributes; feedback messages.]
23 Data Mining in Weka Explorer: Classify (as an example). [Figure labels: method selection & parameter setting; test option setting; result display window; task list, with a menu of options available on right click.]
24 Data Mining in Weka Explorer: Classify (as an example). [Figure labels: method list; selecting a specific method; selecting & changing parameters.]
25 Data Mining in Weka Explorer: Visualisation. [Figure labels: scatter plot of data objects of different classes; an example decision tree.]
26 Probability & Statistics: A Brief Review. Where are probability and statistics used? Patterns found from data are probabilistic in nature. They are used in various measures of evaluation, e.g. the confidence measure of association rules; in the data exploration stage for better understanding, e.g. maximum, minimum, mean, variance, skewness; during the mining process to assist the discovery of patterns, e.g. information gain for decision tree induction; as a part of patterns, e.g. naïve Bayes, Gaussian mixture model; and in the comparison of patterns, e.g. a classification model with significantly better accuracy.
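To make the information gain example concrete, the sketch below uses the usual entropy-based definition: the entropy of the class labels minus the weighted entropy after splitting on an attribute. The records and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    # Entropy of the target before the split minus the weighted
    # entropy of the target within each attribute value.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical records: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]

gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
```

A decision tree learner of this kind would split on the attribute with the larger gain, here `outlook`.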
27 Probability & Statistics: A Brief Review. Probability and conditional probability. Probability of an event, P(E), and its meaning when P(E) = 0, P(E) = 1 and 0 < P(E) < 1. Probabilities of multiple events: P(E and F); P(E or F) = P(E) + P(F) - P(E and F). Mutually exclusive events: P(E and F) = 0 and P(E or F) = P(E) + P(F). Conditional probability of event E given event F: P(E|F) = P(E and F)/P(F). Independent events: P(E and F) = P(E)P(F), and P(E|F) = P(E).
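These rules can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die; the events are hypothetical and chosen so that each rule has a matching case.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair six-sided die

def p(event):
    # Probability of an event as (favourable outcomes) / (all outcomes).
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
high = {4, 5, 6}   # greater than 3
odd = {1, 3, 5}    # mutually exclusive with `even`
small = {1, 2}     # independent of `even`

# Addition rule: P(E or F) = P(E) + P(F) - P(E and F).
p_even_or_high = p(even) + p(high) - p(even & high)

# Conditional probability: P(E|F) = P(E and F) / P(F).
p_even_given_high = p(even & high) / p(high)

# Mutually exclusive events: P(E and F) = 0, so P(E or F) = P(E) + P(F).
p_even_or_odd = p(even) + p(odd)

# Independent events: P(E and F) = P(E)P(F), and P(E|F) = P(E).
p_even_given_small = p(even & small) / p(small)
```

Using `Fraction` keeps every probability exact, so the identities hold with equality rather than up to rounding.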
28 Probability & Statistics: A Brief Review. Probability & conditional probability (example).
29 Probability & Statistics: A Brief Review. Probability distribution of random variables. Discrete random variable: P(X = x). Continuous random variable: P(a ≤ X < b). [Figure: distribution plots, with regions of a bell-shaped curve marked 68% and 95%.]
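Assuming the 68% and 95% labels refer to the familiar regions within one and two standard deviations of the mean of a normal distribution, the figures can be reproduced with the standard library. The interval probability uses P(a ≤ X < b) = F(b) - F(a), where F is the cumulative distribution function.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def interval_probability(dist, a, b):
    # P(a <= X < b) for a continuous random variable X with CDF dist.cdf.
    return dist.cdf(b) - dist.cdf(a)

within_one_sd = interval_probability(standard_normal, -1.0, 1.0)
within_two_sd = interval_probability(standard_normal, -2.0, 2.0)

# Discrete case for comparison: P(X = x) for one roll of a fair die.
die_pmf = {face: 1 / 6 for face in range(1, 7)}
```

For a continuous variable the probability of any single value is zero, which is why only interval probabilities are meaningful there.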
30 Probability & Statistics: A Brief Review. Basic statistics: sample mean, median and mode; variance and standard deviation; skewness.
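A brief sketch of these statistics on a hypothetical sample of ages. Skewness has several conventions; the one used here, the third central moment divided by the cubed population standard deviation, is an assumption and may differ from the formula on the original slide.

```python
import statistics

ages = [21, 22, 22, 25, 30, 42]  # hypothetical sample

sample_mean = statistics.mean(ages)
sample_median = statistics.median(ages)
sample_mode = statistics.mode(ages)
sample_variance = statistics.variance(ages)  # divides by n - 1
sample_std_dev = statistics.stdev(ages)      # square root of the variance

# Moment-based skewness: positive when the right tail is longer.
n = len(ages)
third_moment = sum((x - sample_mean) ** 3 for x in ages) / n
skewness = third_moment / statistics.pstdev(ages) ** 3
```

Here the single large value, 42, pulls the mean above the median, which shows up as positive skewness.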
31 Probability & Statistics: A Brief Review. Confidence interval estimate. The sample mean is only an estimate of the true mean of the data population. Central limit theorem: sample means follow a normal distribution in which the mean is the true population mean $\mu_X$ and the standard deviation is $\sigma_X/\sqrt{n}$. Based on the central limit theorem, and using the sample standard deviation $s_X$ in place of the true one, the following expression is used to estimate the interval for the true mean at a confidence level of $1-\alpha$: $\bar{X} \pm t_{\alpha/2,\,n-1}\, s_X/\sqrt{n}$.
32 Probability & Statistics: A Brief Review. Confidence interval estimate (example). For this data set, $n = 12$, $\bar{x}_{age} = 26$ and $s_{age}$ = (value missing). At a confidence level of 95%, i.e. $1 - \alpha = 0.95$ and $\alpha/2 = 0.025$, with $n - 1 = 11$, $t$ = (value missing). The interval estimate is: (expression missing). The interval is estimated as [21.347, (upper bound missing)] at a confidence level of 95%.
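The calculation can be sketched as follows. The twelve ages are hypothetical (chosen to have mean 26, but they are not the slide's data set, so the resulting interval differs from the one quoted), and the critical value 2.201 is the standard table value of Student's t for a two-sided 95% level with 11 degrees of freedom.

```python
import math
import statistics

# Table value: two-sided 95% critical value of Student's t, 11 degrees of freedom.
T_CRIT_95_DF11 = 2.201

def mean_confidence_interval(sample, t_crit):
    # Interval: sample mean +/- t * s / sqrt(n), with s the sample
    # standard deviation standing in for the unknown true one.
    n = len(sample)
    centre = statistics.mean(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return centre - half_width, centre + half_width

# Hypothetical sample of n = 12 ages.
ages = [19, 21, 22, 23, 24, 25, 26, 27, 29, 30, 32, 34]
low, high = mean_confidence_interval(ages, T_CRIT_95_DF11)
```

The interval is symmetric about the sample mean, and it narrows as the sample grows or as the sample standard deviation shrinks.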
33 Probability & Statistics: A Brief Review. Hypothesis testing, as an introduction to statistical inference and statistical significance. Procedure: forming the null and alternative hypotheses; deciding the level of significance p; determining a test statistic and calculating its value; comparing the calculated value against the known value and deciding whether the null hypothesis should be rejected.
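34 Probability & Statistics: A Brief Review. Hypothesis testing (example). Assuming $\mu_{age} = 25$. Hypotheses: null, $H_0: \mu_{age} = 25$; alternative, $H_1: \mu_{age} \neq 25$. Calculating the statistic as $t = (\bar{x}_{age} - \mu_{age})/(s_{age}/\sqrt{n})$ = (value missing), which is less than $t$ = (value missing) for $p/2$ = (value missing) and $n - 1 = 11$. Conclusion: the null hypothesis is not rejected, i.e. the difference between the sample mean and the population mean is insignificant.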
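A minimal sketch of the same procedure as a two-sided one-sample t-test. The ages are hypothetical rather than the slide's data, the hypothesised mean of 25 follows the slide, and 2.201 is the standard table value of Student's t for a 5% significance level (2.5% in each tail) with 11 degrees of freedom.

```python
import math
import statistics

# Table value: Student's t, 11 degrees of freedom, 2.5% in each tail.
T_CRIT_95_DF11 = 2.201

def one_sample_t(sample, mu0):
    # Test statistic: (sample mean - hypothesised mean) / (s / sqrt(n)).
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error

# Hypothetical sample of n = 12 ages.
ages = [19, 21, 22, 23, 24, 25, 26, 27, 29, 30, 32, 34]

t_stat = one_sample_t(ages, mu0=25)
# Reject the null hypothesis only if |t| exceeds the critical value.
reject_null = abs(t_stat) > T_CRIT_95_DF11
```

With this sample the statistic falls well inside the critical value, so the null hypothesis is not rejected.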
35 Chapter Summary. The data mining process involves preparation of data, mining of patterns and post-processing of the patterns. Top-down and bottom-up approaches are both useful. The discovery approach can be directed or undirected. Three main streams of data mining tasks and various forms of patterns and models are introduced. Specific solutions are required for specific types of problems. The importance of evaluating patterns must be appreciated. The normal procedure for conducting data mining in Weka is explained. Some important basic concepts in probability and statistics are reviewed.
36 References. Read Chapter 2 of Data Mining Techniques and Applications. Useful further references: Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, Chapter 1. Berry, M. J. A. and Linoff, G. (2004), Data Mining Techniques: For Marketing, Sales and Customer Relationship Management, 2nd ed., Wiley Computer Publishing, Chapters 1-2.