Data Management and Database Technologies 1 DATA MINING Extracting Knowledge From Data Petr Olmer CERN

Slides:



Advertisements
Similar presentations
COMP3740 CR32: Knowledge Management and Adaptive Systems
Advertisements

Content-based Recommendation Systems
Data Mining Classification: Alternative Techniques
DECISION TREES. Decision trees  One possible representation for hypotheses.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Huffman code and ID3 Prof. Sin-Min Lee Department of Computer Science.
Decision Tree Approach in Data Mining
Data Mining Classification: Alternative Techniques
Data Mining Classification: Alternative Techniques
Kansas State University Department of Computing and Information Sciences Laboratory for Knowledge Discovery in Databases (KDD) KDD Group Research Seminar.
A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito.
Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A
Classification and Decision Boundaries
1 Chapter 10 Introduction to Machine Learning. 2 Chapter 10 Contents (1) l Training l Rote Learning l Concept Learning l Hypotheses l General to Specific.
1 Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge Brief introduction to lectures.
Relational Data Mining in Finance Haonan Zhang CFWin /04/2003.
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification.
Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS /29/05.
Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
Knowledge Representation. 2 Outline: Output - Knowledge representation  Decision tables  Decision trees  Decision rules  Rules involving relations.
Decision Trees (2). Numerical attributes Tests in nodes are of the form f i > constant.
Recommender systems Ram Akella November 26 th 2008.
Chapter 5 Data mining : A Closer Look.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
CSCI 347 / CS 4206: Data Mining Module 04: Algorithms Topic 06: Regression.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
1 An Introduction to Data Mining Hosein Rostani Alireza Zohdi Report 1 for “advance data base” course Supervisor: Dr. Masoud Rahgozar December 2007.
Issues with Data Mining
Data Mining – Output: Knowledge Representation
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
Text mining.
Bayesian Networks. Male brain wiring Female brain wiring.
Slides for “Data Mining” by I. H. Witten and E. Frank.
Introduction to Data Mining Group Members: Karim C. El-Khazen Pascal Suria Lin Gui Philsou Lee Xiaoting Niu.
Decision Trees & the Iterative Dichotomiser 3 (ID3) Algorithm David Ramos CS 157B, Section 1 May 4, 2006.
Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.
Decision Trees. Decision trees Decision trees are powerful and popular tools for classification and prediction. The attractiveness of decision trees is.
Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.
Bab 5 Classification: Alternative Techniques Part 1 Rule-Based Classifer.
Classification Techniques: Bayesian Classification
Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.7: Instance-Based Learning Rodney Nielsen.
Copyright R. Weber Machine Learning, Data Mining INFO 629 Dr. R. Weber.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
An Investigation of Commercial Data Mining Presented by Emily Davis Supervisor: John Ebden.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
MACHINE LEARNING 10 Decision Trees. Motivation  Parametric Estimation  Assume model for class probability or regression  Estimate parameters from all.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
Decision Trees Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Sections 4.1 Inferring Rudimentary Rules Rodney Nielsen.
1Ellen L. Walker Category Recognition Associating information extracted from images with categories (classes) of objects Requires prior knowledge about.
Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall Chapter 6.2: Classification Rules Rodney Nielsen Many.
LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.
1 Data Mining: Text Mining. 2 Information Retrieval Techniques Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms.
Data Management and Database Technologies 1 DATA MINING Extracting Knowledge From Data Petr Olmer CERN
Data Mining and Decision Support
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
Chapter 4: Algorithms CS 795. Inferring Rudimentary Rules 1R – Single rule – one level decision tree –Pick each attribute and form a single level tree.
DECISION TREES Asher Moody, CS 157B. Overview  Definition  Motivation  Algorithms  ID3  Example  Entropy  Information Gain  Applications  Conclusion.
Eick: kNN kNN: A Non-parametric Classification and Prediction Technique Goals of this set of transparencies: 1.Introduce kNN---a popular non-parameric.
BAYESIAN LEARNING. 2 Bayesian Classifiers Bayesian classifiers are statistical classifiers, and are based on Bayes theorem They can calculate the probability.
Data Mining CH6 Implementation: Real machine learning schemes(2) Reporter: H.C. Tsai.
Data Mining and Text Mining. The Standard Data Mining process.
Data Mining Lecture 11.
Data Mining CSCI 307, Spring 2019 Lecture 11
Presentation transcript:

Data Management and Database Technologies 1 DATA MINING Extracting Knowledge From Data Petr Olmer CERN

Data Management and Database Technologies 2 Petr Olmer: Data Mining Motivation ● What if we do not know what to ask? ● How to discover a knowledge in databases without a specific query? Computers are useless, they can only give you answers.

Data Management and Database Technologies 3 Petr Olmer: Data Mining Many terms, one meaning ● Data mining ● Knowledge discovery in databases ● Data exploration ● A non trivial extraction of novel, implicit, and actionable knowledge from large databases. – without a specific hypothesis in mind! ● Techniques for discovering structural patterns in data.

Data Management and Database Technologies 4 Petr Olmer: Data Mining What is inside? ● Databases – data warehousing ● Statistics – methods – but different data source! ● Machine learning – output representations – algorithms

Data Management and Database Technologies 5 Petr Olmer: Data Mining CRISP-DM CRoss Industry Standard Process for Data Mining task specification task specification data understanding data understanding data preparation data preparation deployment evaluation modeling

Data Management and Database Technologies 6 Petr Olmer: Data Mining Input data: Instances, attributes ABCD Mon21ayes Wed19byes Mon23ano Sun23dyes Fri24cno Fri18dyes Sat20cno Tue21bno Mon25bno

Data Management and Database Technologies 7 Petr Olmer: Data Mining Output data: Concepts ● Concept description = what is to be learned ● Classification learning ● Association learning ● Clustering ● Numeric prediction

Data Management and Database Technologies 8 Petr Olmer: Data Mining Task classes ● Predictive tasks – Predict an unknown value of the output attribute for a new instance. ● Descriptive tasks – Describe structures or relations of attributes. – Instances are not related!

Data Management and Database Technologies 9 Petr Olmer: Data Mining Models and algorithms ● Decision trees ● Classification rules ● Association rules ● k-nearest neighbors ● Cluster analysis

Data Management and Database Technologies 10 Petr Olmer: Data Mining Decision trees Inner nodes – test a particular attribute against a constant Leaf nodes – classify all instances that reach the leaf a b ca class C1 class C1 class C2 class C2 class C1 = 5 blue red > 0 <= 0hot cold

Data Management and Database Technologies 11 Petr Olmer: Data Mining Classification rules If precondition then conclusion An alternative to decision trees Rules can be read off a decision tree – one rule for each leaf – unambiguous, not ordered – more complex than necessary If (a>=5) then class C1 If (a 0) then class C1 If (a<5) and (b=“red”) and (c=“hot”) then class C2

Data Management and Database Technologies 12 Petr Olmer: Data Mining Classification rules Ordered or not ordered execution? ● Ordered – rules out of context can be incorrect – widely used ● Not ordered – different rules can lead to different conclusions – mostly used in boolean closed worlds ● only yes rules are given ● one rule in DNF

Data Management and Database Technologies 13 Petr Olmer: Data Mining Decision trees / Classification rules 1R algorithm for each attribute: for each value of that attribute: count how often each class appears find the most frequent class rule = assign the class to this attribute-value calculate the error rate of the rules choose the rules with the smallest error rate

Data Management and Database Technologies 14 Petr Olmer: Data Mining Decision trees / Classification rules Naïve Bayes algorithm ● Attributes are – equally important – independent ● For a new instance, we count the probability for each class. ● Assign the most probable class. ● We use Laplace estimator in case of zero probability. ● Attribute dependencies reduce the power of NB.

Data Management and Database Technologies 15 Petr Olmer: Data Mining Decision trees ID3: A recursive algorithm ● Select the attribute with the biggest information gain to place at the root node. ● Make one branch for each possible value. ● Build the subtrees. ● Information required to specify the class – when a branch is empty: zero – when the branches are equal: a maximum – f(a, b, c) = f(a, b + c) + g(b, c) ● Entropy:

Data Management and Database Technologies 16 Petr Olmer: Data Mining Classification rules PRISM: A covering algorithm ● For each class seek a way of covering all instances in it. ● Start with: If ? then class C1. ● Choose an attribute-value pair to maximize the probability of the desired classification. – include as many positive instances as possible – exclude as many negative instances as possible ● Improve the precondition. ● There can be more rules for a class! – Delete the covered instances and try again. only correct unordered rules

Data Management and Database Technologies 17 Petr Olmer: Data Mining Association rules ● Structurally the same as C-rules: If - then ● Can predict any attribute or their combination ● Not intended to be used together ● Characteristics: – Support = a – Accuracy = a / (a + b) Cnon C Pab non Pcd

Data Management and Database Technologies 18 Petr Olmer: Data Mining Association rules Multiple consequences ● If A and B then C and D ● If A and B then C ● If A and B then D ● If A and B and C then D ● If A and B and D then C

Data Management and Database Technologies 19 Petr Olmer: Data Mining Association rules Algorithm ● Algorithms for C-rules can be used – very inefficient ● Instead, we seek rules with a given minimum support, and test their accuracy. ● Item sets: combinations of attribute-value pairs ● Generate items sets with the given support. ● From them, generate rules with the given accuracy.

Data Management and Database Technologies 20 Petr Olmer: Data Mining k-nearest neighbor ● Instance-based representation – no explicit structure – lazy learning ● A new instance is compared with existing ones – distance metric ● a = b, d(a, b) = 0 ● a <> b, d(a, b) = 1 – closest k instances are used for classification ● majority ● average

Data Management and Database Technologies 21 Petr Olmer: Data Mining Cluster analysis ● Diagram: how the instances fall into clusters. ● One instance can belong to more clusters. ● Belonging can be probabilistic or fuzzy. ● Clusters can be hierarchical.

Data Management and Database Technologies 22 Petr Olmer: Data Mining Data mining Conclusion ● Different algorithms discover different knowledge in different formats. ● Simple ideas often work very well. ● There’s no magic!

Data Management and Database Technologies 23 Petr Olmer: Data Mining Text mining ● Data mining discovers knowledge in structured data. ● Text mining works with unstructured text. – Groups similar documents – Classifies documents into taxonomy – Finds out the probable author of a document – … ● Is it a different task?

Data Management and Database Technologies 24 Petr Olmer: Data Mining How do mathematicians work Settings 1: – empty kettle – fire – source of cold water – tea bag How to prepare tea: – put water into the kettle – put the kettle on fire – when water boils, put the tea bag in the kettle Settings 2: – kettle with boiling water – fire – source of cold water – tea bag How to prepare tea: – empty the kettle – follow the previous case

Data Management and Database Technologies 25 Petr Olmer: Data Mining Text mining Is it different? ● Maybe it is, but we do not care. ● We convert free text to structured data… ● … and “follow the previous case”.

Data Management and Database Technologies 26 Petr Olmer: Data Mining Google News How does it work? ● ● Search web for the news. – Parse content of given web sites. ● Convert news (documents) to structured data. – Documents become vectors. ● Cluster analysis. – Similar documents are grouped together. ● Importance analysis. – Important documents are on the top

Data Management and Database Technologies 27 Petr Olmer: Data Mining From documents to vectors ● We match documents with terms – Can be given (ontology) – Can be derived from documents ● Documents are described as vectors of weights – d = (1, 0, 0, 1, 1) – t1, t4, t5 are in d – t2, t3 are not in d

Data Management and Database Technologies 28 Petr Olmer: Data Mining TFIDF Term Frequency / Inverse Document Frequency TF(t, d) = how many times t occurs in d DF(t) = in how many documents t occurs at least once Term is important if its – TF is high – IDF is high Weight(d, t) = TF(t, d) · IDF(t)

Data Management and Database Technologies 29 Petr Olmer: Data Mining Cluster analysis Vectors – Cosine similarity On-line analysis – A new document arrives. – Try k-nearest neighbors. – If neighbors are too far, leave it alone.

Data Management and Database Technologies 30 Petr Olmer: Data Mining Text mining Conclusion ● Text mining is very young. – Research is on-going heavily ● We convert text to data. – Documents to vectors – Term weights: TFIDF ● We can use data mining methods. – Classification – Cluster analysis – …

Data Management and Database Technologies 31 Petr Olmer: Data Mining References ● Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations ● Michael W. Berry: Survey of Text Mining: Clustering, Classification, and Retrieval ● ●

Data Management and Database Technologies 32 Questions? Petr Olmer Computers are useless, they can only give you answers.