
1 Data Mining: Basic Concepts, Algorithms and Implementations (a summary)

2 Basic Concepts (Chapter 1)
- Data mining & machine learning
- Phases of data mining: data → knowledge & patterns
- Problems in data mining and learning schemes
- Data mining components: clustering, classification, association, numeric prediction
- Application examples

3 Why data mining?
- The process of discovering useful knowledge and structural patterns in data
- Large amounts of data (attributes and/or instances)
- Decisions need to be made from that data
- As the size of the data grows, extracting information from it becomes harder
- The goal is not to automate the decisions themselves, but the discovery process must be automatic or semi-automatic

4 Machine learning
- Data mining deploys machine learning
- Machines are trained to recognize patterns
- They improve from experience
- Only a small amount of data (a training set and a test set) is given to the learning machine

5 From data to knowledge
1. Data selection
2. Data preprocessing
3. Data transformation
4. Data mining
5. Output interpretation

6 Problems
- Data are "dirty": noisy, incomplete, inconsistent
- They need to be cleaned → data preprocessing, e.g. integration, transformation, reduction, discretization
- The limited data available might not represent the true real-world problem well

7 What to do in data mining
- Classification: find the class a new instance belongs to, e.g. whether a cell is a normal cell or a cancerous cell
- Association: find rules/conclusions among attributes, e.g. a high-blood-pressure patient is likely to also suffer heart disease

8 What to do (contd.)
- Clustering: group the instances into classes before any classes exist, e.g. dividing a new disease into different possible types/groups
- Numeric prediction: a variation of classification where the output is a numeric class, e.g. the frequency of cancerous cells found

9 Real world applications
- Loan/credit decision making
- Weather forecasting
- Medical diagnosis
- Biomedical data mining, DNA analysis
- Customers, marketing & sales
- Banking, financial data mining

10 What Is The Input? (Chapter 2)
- Concepts
- Instances/examples
- Attributes: nominal vs. numeric
- Preparing inputs

11 Concepts
- The thing to be learned, whether by classification learning, association learning, clustering or numeric prediction
- The data with the hidden information inside
- In practice they take the form of relations/tables
- The output produced by learning is a concept description (knowledge/pattern)

12 Instances/Examples
- Individual, independent examples that are input to a machine learning scheme
- Part of the overall concept
- Positive examples vs. negative examples
- The set of input examples must be finite
- Some instances (in fact many) may have missing values for some of their attributes

13 Attributes
- Also known as features
- Generally two types
- Nominal/categorical/discrete attributes: unordered, distinct symbols; special type: dichotomy (only two values)
- Numeric/continuous attributes
- Sometimes nominal attributes may be ordered
- And numeric attributes may be discretized into nominal attributes

14 Preparing input
- Data gathering and selection: integration, refreshing
- Data format (.arff files)
- Attribute types (normalization & standardization)
- Missing, inaccurate values
- Duplicate data
- Data visualization

15 What About The Output? (Chapter 3)
- Decision tables
- Decision trees
- Classification rules
- Numeric predictions
- Instance-based representations
- Association rules
- Clustering

16 Decision tables
- The simplest form, just like the input
Decision trees
- Divide & conquer approach, top-down
- At each node one attribute is tested; multiway tests are also possible
- Leaf nodes → classes
- A missing value can be treated as just another possible value, hence another branch

17 Decision trees (contd.)
- Missing values: follow the most popular/nearest branch
Classification rules
- Can be directly inferred from decision trees
- If A then B
- Precondition/antecedent: a test, or a series of tests (ANDed together), comparing an attribute value with a constant
- Tests can also compare attributes with each other → relational rules
- Consequence/conclusion: class values

18 Classification rules (contd.)
- Inferring rules directly from the data is harder than reading them off a decision tree, but rules read off a tree need pruning to remove redundant tests
- Rules → tree? Possible, but not straightforward (replicated subtree problem)
- Rules are nearly minimal subsets
- Rules can specify exceptions for instances that differ only slightly from the majority of instances

19 Numeric prediction
- For numeric classes
- Can use decision trees (regression trees) or classification rules (regression equations)
- A regression equation is a linear equation
- A regression tree is more accurate than a regression equation, because the data might not be representable by a linear model, BUT the resulting tree structure can be very complicated
- A combination of the two is usually used

20 Instance-based
- Lazy learning approach
- Works directly from the examples
- Compares new instances with the available examples and works out their distance (differences in attribute values)
- Builds no explicit learning structures; does not really describe the patterns in the data

21 Association rules
- For association learning
- No different in form from classification rules
- BUT they can predict any attribute, and also combinations of attributes
- Coverage/support: the number of instances the rule predicts correctly
- Confidence/accuracy: the proportion of the instances the rule applies to that it predicts correctly
- A great many association rules can be derived

22 Association rules (contd.)
- Specify minimum coverage and accuracy values
- So only the best-supported rules are found, rather than one rule for every possible set of attributes

23 Clustering
- Applies before any classes exist
- The output takes the form of a diagram showing how the instances fall into clusters
- Some schemes allow one instance to belong to more than one cluster (overlapping clusters)
- Some express the probability that an instance falls into each cluster
- Another form is the dendrogram, a hierarchical tree structure of clusters in which lower-level clusters are more tightly knit

24 Learning Algorithms (Chapter 4)
- 1R algorithm
- Bayesian algorithm
- Decision trees [divide & conquer]
- Covering algorithm [separate & conquer]
- Linear models
- Instance-based algorithms
- Association rules → Apriori

25 1R
- "Simple" ideas often work well
- Not all attributes are relevant or contribute equally
- Some attributes contribute independently, while others need to be combined
- 1R generates a one-level decision tree: just one attribute is used to determine the class
- Evaluate the error rate of each attribute's rule set → choose the attribute with the smallest error rate (Table 4.1)
- Missing values are treated as just another value

26 1R (contd.)
- Can be applied to both nominal and numeric attributes
- Numeric attributes are converted to nominal attributes (sets of intervals) using simple discretization
- Sort the values of the numeric attribute, then partition them into intervals grouped according to the majority class
- Each interval is assigned a class value (see the sketch below)
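As an illustration (not from the book), a minimal Java sketch of 1R's core step for one nominal attribute: assign each attribute value its majority class and count the misclassified instances. The class name OneRSketch and the toy data are invented for this example.

import java.util.HashMap;
import java.util.Map;

public class OneRSketch {
    // data[i][attr] = nominal attribute values, labels[i] = class of instance i
    static int errorsForAttribute(String[][] data, String[] labels, int attr) {
        // count[value][class] = frequency of each class per attribute value
        Map<String, Map<String, Integer>> count = new HashMap<>();
        for (int i = 0; i < data.length; i++) {
            count.computeIfAbsent(data[i][attr], v -> new HashMap<>())
                 .merge(labels[i], 1, Integer::sum);
        }
        int errors = 0;
        for (Map<String, Integer> byClass : count.values()) {
            int total = 0, best = 0;
            for (int c : byClass.values()) { total += c; best = Math.max(best, c); }
            errors += total - best;   // instances not in the majority class
        }
        return errors;
    }

    public static void main(String[] args) {
        String[][] data = { {"sunny"}, {"sunny"}, {"rainy"}, {"overcast"} };
        String[] labels = { "no", "no", "yes", "yes" };
        // 1R computes this error count for every attribute and keeps the
        // attribute with the smallest count as its one-level decision tree.
        System.out.println("errors = " + errorsForAttribute(data, labels, 0));
    }
}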

27 Bayesian Algorithm
- Uses all attributes
- Assumes all attributes are equally important and independent of each other → e.g. Naive Bayes
- For each class, the probabilities of the observed attribute values given that class are multiplied together with the probability of the class itself (Table 4.2)
- The class with the highest probability is chosen
- Based on Bayes's rule of conditional probability:
  Pr[C|A] = Pr[A|C] × Pr[C] / Pr[A]
  where C = class label and A = a particular set of attribute values

28 Bayesian Algorithm (contd.)
- If a particular attribute value never occurs in the training set in conjunction with some class value, the final probability of that class would be zero
- Use the Laplace estimator: add 1 to each count in the numerator and the number of attribute values (e.g. 3 for a three-valued attribute) to the denominator
- Missing values are not a problem at all
- For numeric attributes, the values are assumed to follow a normal probability distribution (Table 4.4)
- Each class probability is then calculated using the probability density function
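A hedged sketch of the Naive Bayes scoring step for nominal attributes, with the Laplace estimator applied as described above; the class name NaiveBayesSketch and the toy weather-style data are invented for illustration.

import java.util.HashSet;
import java.util.Set;

public class NaiveBayesSketch {
    public static void main(String[] args) {
        String[][] data = { {"sunny","hot"}, {"sunny","mild"}, {"rainy","mild"}, {"overcast","hot"} };
        String[] labels = { "no", "no", "yes", "yes" };
        String[] test = { "sunny", "mild" };
        for (String c : new String[]{ "yes", "no" }) {
            int nc = 0;
            for (String l : labels) if (l.equals(c)) nc++;
            double score = (double) nc / labels.length;       // Pr[C]
            for (int a = 0; a < test.length; a++) {
                int match = 0, values = distinctValues(data, a);
                for (int i = 0; i < data.length; i++)
                    if (labels[i].equals(c) && data[i][a].equals(test[a])) match++;
                // Laplace estimator: +1 in the numerator, +#values in the denominator
                score *= (match + 1.0) / (nc + values);
            }
            System.out.println("score(" + c + ") = " + score); // pick the largest
        }
    }

    static int distinctValues(String[][] data, int a) {
        Set<String> s = new HashSet<>();
        for (String[] row : data) s.add(row[a]);
        return s.size();
    }
}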

29 Decision trees
- Construct a decision tree recursively, applying the divide & conquer method together with information theory
- Which attribute to split on first?
- The attribute whose split produces the purest child nodes
- Purity is measured by the amount of information gained [in bits] from each split on an attribute
- The amount of information required for an attribute's subsets = entropy
- Information gain of an attribute = the information of the overall class distribution minus the information remaining after splitting on the attribute

30 Decision trees (contd.)
- If an attribute splits all instances into pure classes (e.g. an ID code), the information remaining after the split is zero and the information gain is maximal
- Choose the attribute with maximum information gain to split on (see the sketch below)
- e.g. ID3 (nominal attributes only), C4.5, J4.8
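To make the gain computation concrete, here is a Java sketch (mine, not the book's code) using the book's weather example: splitting on outlook turns a [9 yes, 5 no] node into children [2,3], [4,0], [3,2], a gain of about 0.247 bits.

public class InfoGainSketch {
    // entropy of a class distribution, in bits
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double info = 0;
        for (int c : classCounts)
            if (c > 0) {
                double p = (double) c / total;
                info -= p * Math.log(p) / Math.log(2);
            }
        return info;
    }

    public static void main(String[] args) {
        int[] parent = { 9, 5 };                           // before the split
        int[][] children = { {2, 3}, {4, 0}, {3, 2} };     // after splitting on outlook
        int n = 14;
        double after = 0;
        for (int[] child : children) {
            int size = child[0] + child[1];
            after += (double) size / n * entropy(child);   // size-weighted child entropy
        }
        // gain = info(parent) - weighted info(children) ≈ 0.247 bits
        System.out.println("gain = " + (entropy(parent) - after));
    }
}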

31 Covering Algorithm
- For building classification rules
- Iterative separate & conquer approach
- Keep adding rules to cover as many instances of the class of interest (positive instances) as possible, BUT try to use as few rules as possible
- Choose the test that maximizes the probability of the desired classification, hence higher accuracy
- Accuracy is measured by p/t, where p = positive examples of the class covered and t = total instances covered by the new rule (the test being considered)

32 Covering Algorithm (contd.)
- Choose the test with maximum p/t (see the sketch below)
- If two or more tests have equal p/t, choose the one with greater coverage, i.e. greater p
- Instances covered by an accepted rule need not be considered in the next iteration
- e.g. the Prism algorithm (no numeric attributes)
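A minimal sketch of the selection criterion: for each candidate attribute-value test, compute p/t and keep the best, breaking ties on coverage. The class name CoveringSketch and the toy data are invented.

public class CoveringSketch {
    public static void main(String[] args) {
        String[][] data = { {"soft"}, {"soft"}, {"hard"}, {"hard"}, {"none"} };
        String[] labels = { "yes", "yes", "no", "yes", "no" };
        String target = "yes";
        String bestValue = null;
        double bestRatio = -1;
        int bestP = 0;
        for (String v : new String[]{ "soft", "hard", "none" }) {
            int p = 0, t = 0;
            for (int i = 0; i < data.length; i++)
                if (data[i][0].equals(v)) { t++; if (labels[i].equals(target)) p++; }
            double ratio = (t == 0) ? 0 : (double) p / t;
            // prefer larger p/t; break ties on larger p (greater coverage)
            if (ratio > bestRatio || (ratio == bestRatio && p > bestP)) {
                bestRatio = ratio; bestValue = v; bestP = p;
            }
        }
        System.out.println("if attr = " + bestValue + " then " + target
                + "  (p/t = " + bestRatio + ")");
    }
}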

33 Linear models
- Linear regression
- When all attributes and the outcome are numeric
- Express the class as a linear combination of the attributes with weights:
  x = w0 + w1*a1 + w2*a2 + ... + wk*ak
  where ai = value of attribute i, wi = weight, and x = the predicted class value
- The difference between the predicted and actual values must be minimized, hence the sum of the squares of these differences over all training instances is minimized
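For the single-attribute case the least-squares weights have a closed form, sketched below in Java; the class name LinearSketch and the sample numbers are invented for illustration.

public class LinearSketch {
    public static void main(String[] args) {
        double[] a = { 1, 2, 3, 4 };          // attribute values
        double[] x = { 2.1, 3.9, 6.2, 7.8 };  // numeric class values
        double aMean = 0, xMean = 0;
        for (int i = 0; i < a.length; i++) { aMean += a[i]; xMean += x[i]; }
        aMean /= a.length;
        xMean /= x.length;
        double num = 0, den = 0;
        for (int i = 0; i < a.length; i++) {
            num += (a[i] - aMean) * (x[i] - xMean);
            den += (a[i] - aMean) * (a[i] - aMean);
        }
        double w1 = num / den;           // slope minimizing the squared differences
        double w0 = xMean - w1 * aMean;  // intercept
        System.out.println("x = " + w0 + " + " + w1 + " * a");
    }
}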

34 Instance-based
- Distance function: determines which member of the training set is closest to an unknown test instance
- Example distance function: Euclidean distance
- Distance between (1) a training instance and (2) the unknown test instance:
  sqrt((a1(1) - a1(2))^2 + (a2(1) - a2(2))^2 + ... + (ak(1) - ak(2))^2)
  where a1 ... ak are the attributes, ax(1) is the training instance's value for attribute ax, and ax(2) is the test instance's value

35 Instance-based (contd.)
- The attribute values used in the distance calculation are normalized
- So if two instances have the same value for an attribute its contribution is 0, and if they differ it lies between 0 and 1
- Missing values are treated as having the largest possible distance, 1, from the training attribute
- Examples of instance-based algorithms: the nearest-neighbor and k-nearest-neighbor algorithms (see the sketch below)
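A minimal nearest-neighbor sketch using Euclidean distance over attribute values assumed already normalized to [0, 1]; the class name NearestNeighborSketch and the data are invented.

public class NearestNeighborSketch {
    static double distance(double[] p, double[] q) {
        double sum = 0;
        for (int i = 0; i < p.length; i++) sum += (p[i] - q[i]) * (p[i] - q[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // attribute values assumed already normalized to [0, 1]
        double[][] train = { {0.1, 0.9}, {0.8, 0.2}, {0.4, 0.5} };
        String[] labels = { "yes", "no", "yes" };
        double[] test = { 0.7, 0.3 };
        int best = 0;   // index of the closest training instance so far
        for (int i = 1; i < train.length; i++)
            if (distance(train[i], test) < distance(train[best], test)) best = i;
        System.out.println("predicted class: " + labels[best]);
    }
}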

36 Association rules
- The Apriori algorithm for mining association rules
- Because an enormous number of rules can result, mining should be limited by the coverage and accuracy of the associations, so that only interesting associations with high coverage are found
- Item: an attribute-value pair
- The first step of the algorithm generates all one-item sets with the minimum coverage
- These are then used to generate the two-item sets, then the three-item sets, and so on

37 Association rules (contd.)
- A two-item set can only satisfy minimum coverage if both its constituent one-item sets have minimum coverage; the same holds for three-item sets, and so on (see the sketch below)
- The second step generates, from the item sets, the rules that satisfy the minimum accuracy
- The Apriori algorithm cannot handle numeric attributes
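A hedged sketch of the item-set growth step: count one-item sets, keep those meeting minimum coverage, and build two-item candidates only from them. The class name AprioriSketch and the toy transactions are invented; Java 9+ is assumed for List.of/Set.of.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AprioriSketch {
    public static void main(String[] args) {
        // each transaction = a set of items (attribute=value pairs)
        List<Set<String>> transactions = List.of(
            Set.of("outlook=sunny", "windy=false"),
            Set.of("outlook=sunny", "windy=true"),
            Set.of("outlook=rainy", "windy=false"),
            Set.of("outlook=sunny", "windy=false"));
        int minCoverage = 2;
        // count one-item sets
        Map<String, Integer> count = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t) count.merge(item, 1, Integer::sum);
        List<String> frequent = new ArrayList<>();
        for (Map.Entry<String, Integer> e : count.entrySet())
            if (e.getValue() >= minCoverage) frequent.add(e.getKey());
        System.out.println("frequent one-item sets: " + frequent);
        // two-item candidates come only from frequent one-item sets, because
        // a superset can never have higher coverage than its subsets
        for (int i = 0; i < frequent.size(); i++)
            for (int j = i + 1; j < frequent.size(); j++) {
                int c = 0;
                for (Set<String> t : transactions)
                    if (t.contains(frequent.get(i)) && t.contains(frequent.get(j))) c++;
                if (c >= minCoverage)
                    System.out.println("{" + frequent.get(i) + ", "
                            + frequent.get(j) + "} coverage = " + c);
            }
    }
}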

38 Use Which Methods? (Chapter 5)
- Training & testing
- Cross-validation
- Comparing schemes

39 Training & testing
- Performance on the training set is not a good indicator, because data is limited
- In classification learning, for example, a training set with class information is available
- A classifier is generally produced from the training set
- The performance of the classifier is measured by its error rate, i.e. how accurately instances are assigned to their correct class
- The error rate on the training data is not likely to be a good indicator of future performance on new data; the classifier is overfitted to the training data

40 Training & testing (contd.)
- So a different set of data, the test set, must be used to estimate the error rate
- The larger the training sample, the better the classifier; the larger the test sample, the more accurate the error estimate
- All data is partitioned into data for training and testing (part of the data might even be used for validating the classifier): a certain amount is assigned for training and the rest for testing → the holdout procedure
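The holdout procedure itself is just a shuffle and a split; a minimal sketch, with the class name HoldoutSketch, the two-thirds/one-third ratio, and the index data invented for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldoutSketch {
    public static void main(String[] args) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < 12; i++) indices.add(i);
        Collections.shuffle(indices, new Random(42));   // random sampling
        int cut = indices.size() * 2 / 3;               // e.g. two thirds for training
        List<Integer> train = indices.subList(0, cut);
        List<Integer> test = indices.subList(cut, indices.size());
        System.out.println("train on " + train + ", estimate error rate on " + test);
    }
}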

41 Cross-validation
- The amount of data for training and testing is limited, and the holdout procedure partitions it into training and testing sets
- But the training (or testing) sample may not be representative, so the classifier may not learn well from the data
- Random sampling of the data for the training and test sets must be ensured
- A check may be needed to verify the classifier
- Cross-validation divides the whole data set into n folds (partitions) and performs n rounds of testing

42 Cross-validation (contd.)
- In each round, one partition is used as the test set and the rest as the training set (see the sketch below)
- Other validation schemes: leave-one-out cross-validation and bootstrap validation
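A sketch of the n-fold partitioning loop; the class name CrossValidationSketch, the fold count, and the index data are invented, and the actual training/testing is left as comments.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidationSketch {
    public static void main(String[] args) {
        int folds = 3;
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < 12; i++) indices.add(i);
        Collections.shuffle(indices, new Random(1));
        for (int f = 0; f < folds; f++) {
            List<Integer> test = new ArrayList<>(), train = new ArrayList<>();
            for (int i = 0; i < indices.size(); i++)
                (i % folds == f ? test : train).add(indices.get(i));
            // train the classifier on `train`, measure the error rate on `test`,
            // then average the error rates over all folds
            System.out.println("fold " + f + ": test = " + test);
        }
    }
}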

43 Comparing schemes
- Different applications or different types of data may require different approaches
- Some schemes are more suitable for certain problems
- The Weka Experimenter can report the results of several schemes over several tests on the same data set
- The error rates of the schemes can then be compared

44 Implementations (Chapter 6)
- Decision trees
- Rule induction
- Instance-based learning
- Clustering

45 Decision Trees
- Divide & conquer, recursive
- Numeric attributes
- Missing values
- Pruning decision trees
- Complexity of decision tree induction
- Decision trees → classification rules
- Examples

46 Numeric attributes
- Binary split: divide the instances into two groups by the numeric attribute, using information theory [entropy]
- Multiway test: testing against several different constants at a single node
- Prediscretize
Missing values
- Treat as another possible value (when it is significant)
- Or notionally split the instance into pieces using a numeric weighting scheme

47 Pruning the trees
- Simpler is better
- Postpruning (backward pruning) vs. prepruning (forward pruning)
- With combinations of attributes, individual contributions are hard to see
- Actions: subtree replacement & subtree raising/lifting
Subtree replacement
- Select some subtrees and replace them with single leaves
- Replaces an internal node by a leaf node
- May decrease the accuracy on the training set but increase the accuracy on an independently chosen test set
- Works from the leaves back up toward the root

48 Subtree raising
- Replaces an internal node by one of the nodes below it
- Used in the influential decision tree learner C4.5
- Time-consuming
- Raises the subtree of the most popular branch
- The instances below the replaced node need to be reclassified
Error rate
- To decide whether to do subtree replacement or subtree raising, the error rate must be estimated at internal nodes and leaf nodes
- Reduced-error pruning: hold back some data from tree building and use it as an independent test set for error estimation; disadvantage: the actual tree is built from less data
- Or use an error estimate based on the training set

49 Complexity
- n instances and m attributes
- Assume the depth of the tree is O(log n)
- Tree-building cost: O(mn log n)
  - at each tree depth, all n instances must be considered: O(n log n)
  - at each node, all m attributes must be considered: O(mn log n)
- Nominal attributes need not be reconsidered at every node, because attributes used further up the tree cannot be reused
- Numeric attributes must be sorted → O(n log n); numeric attributes can be reused, so they must be considered at every tree level

50 Complexity (contd.)
- In subtree replacement, an error estimate must be made for every tree node
- All nodes are considered for replacement
- The tree has at most n leaves
- Complexity of subtree replacement: O(n)
- For subtree raising, the basic complexity equals that of subtree replacement
- Added cost of subtree raising: reclassification of each instance at every node between its leaf and the root, O(log n) per instance
- n instances → O(n log n); reclassification near the root costs another factor of O(log n)
- Total complexity of subtree raising: O(n (log n)^2)
- So the total complexity of decision tree induction: O(mn log n) + O(n (log n)^2)

51 Trees → rules
- Rules can be read directly off a decision tree
- All tests encountered on the path from root to leaf are conjoined (ANDed together)
- The rules may need pruning
- Keep looking for conditions that can be deleted, and leave the rule alone when no more conditions can be removed
- Greedy approach to detecting redundant rules
- Re-evaluate the training instances against the new rule set
- Generating pruned rules from a decision tree is slower than generating classification rules directly

52 Examples
- ID3
- C4.5 (Ross Quinlan)
- C5.0
- CART
- PART
- J4.8
- SPRINT

53 Classification rules
- Separate & conquer approach
- Criteria: p/t, or the information gain between the set before and after applying the new rule
- Correctness-based measures prefer exact rules
- Information-based measures prefer rules with higher coverage
- Rules can be obtained from partial decision trees

54 Instance-based learning
Nearest-neighbor algorithm: disadvantages
- Tends to be slow for large training sets
- Performs badly with noisy data
- Performs badly when different attributes affect the outcome to different extents
- Does not perform explicit generalization (no patterns)
- As the number of instances increases, the accuracy of the model improves, BUT most of the examples are redundant

55 Pruning noisy examples
- Noisy data may cause instances to be repeatedly misclassified
- This can be dealt with by using the k nearest neighbors and assigning the majority class to the unknown instance
- The problem is how to determine the value of k
- k = 1 → the usual nearest-neighbor rule
- Another way to deal with noisy data is to monitor the performance of each stored example and discard the ones that do not perform well

56 Clustering
- The result of clustering may be exclusive, overlapping, probabilistic, or hierarchical
- Three different clustering methods: the k-means algorithm, incremental clustering (Cobweb), and EM
- The k-means algorithm produces disjoint clusters, while incremental clustering produces hierarchical ones (see the sketch below)
- The incremental algorithm uses category utility to measure the quality of a partition of instances into clusters
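A minimal k-means sketch on one-dimensional data, alternating assignment and center-update steps; the class name KMeansSketch, the points, and the initial centers are invented for illustration.

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = { 1.0, 1.2, 0.8, 5.0, 5.2, 4.8 };
        double[] centers = { 0.0, 6.0 };   // k = 2, arbitrary initial centers
        for (int iter = 0; iter < 10; iter++) {
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double p : points) {                  // assignment step
                int best = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) best = c;
                sum[best] += p;
                count[best]++;
            }
            for (int c = 0; c < centers.length; c++)   // center-update step
                if (count[c] > 0) centers[c] = sum[c] / count[c];
        }
        System.out.println("centers: " + centers[0] + ", " + centers[1]);
    }
}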

57 Introduction to Weka
- A data mining system from the University of Waikato, New Zealand
- Machine learning software implemented in Java
- CLI, Explorer and Experimenter interfaces
- Well suited for developing new machine learning / data mining schemes

58 Weka Knowledge Explorer
- Data preprocessing: selecting attributes, filtering
- Classification: different testing options (training set, supplied test set, cross-validation)
- Clustering: EM, simple k-means, Cobweb
- Association → Apriori
- Attribute selection
- Dataset visualization
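The same functionality is available programmatically. A minimal usage sketch of the Weka Java API, assuming a recent Weka 3 release (the package path weka.classifiers.trees.J48 follows later releases; the book-era version placed J4.8 elsewhere), with "weather.arff" as a placeholder file path.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // load an ARFF data set; "weather.arff" is a placeholder path
        Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class
        J48 tree = new J48();                           // Weka's C4.5 implementation (J4.8)
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}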

59 References
[1] Witten, I. H. and Frank, E. (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.
[2] The Queen's University of Belfast. Data Mining: An Introduction, Student Notes. http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_1.html
[3] Rastogi, R. and Shim, K. Scalable Algorithms for Mining Large Databases. Lucent Bell Laboratories. http://www.bell-labs.com/project/serendip
[4] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. School of Computing Science, Simon Fraser University, Canada. Chapters 3, 5, 7 and 10. http://www.cs.sfu.ca
[5] Mitchell, T. M. Machine Learning. School of Computer Science, Carnegie Mellon University. http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/ml.html

60 [6] The University of Waikato, New Zealand. Weka Data Mining System. http://www.cs.waikato.ac.nz/ml/weka/

61 (blank slide)

62 Integrated Clinico-Genomics: A Use-Case Scenario
[diagram slide: patient clinical information (CLIS/LIS/HPIS: demographics, history, physiological indicators, hematological, biochemical and pathologo-anatomical laboratory data from tumor samples/tissue) is linked with patient genomic information (GIS: DNA sequences, gene-expression profiles, differential gene markers); supporting technologies such as clustering, classification, discriminant analysis, gene selection and visualization relate focused clinical profiles (group A vs. group B) to focused gene-expression profiles (group X vs. group Y)]

63 (blank slide)

64 Towards an Integrated Clinico-Genomics Environment
[diagram slide: external genomic and clinical (cancer) information sources feed bioinformatics/functional genomics and medical informatics/clinical practice building blocks; information modeling covers clinical data models with a clinical ontology and genomic data models with the Gene Ontology; data extraction gateways connect these to data analysis, DSS and visualization components, alongside PACS images and population registries]

65 (blank slide)

66 ICGE: Integration Issues & Enabling Technology
[diagram slide: clinical information systems (CLIS/LIS/HPIS, plus PACS images and registries) and genomic information systems (GIS) sit behind wrappers with clinical and genomic data models and ontologies (UMLS, SNOP, ICD, COAS, HL7; Gene Ontology, MGED/MIAME); an RDF/XML generator and filters produce XML documents for a data analysis suite (data mining, DSS), and results are delivered through a web-based human-computer interface supporting user query formulation and workflow]

