Comp3776: Data Mining and Text Analytics Intro to Data Mining By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources.

Presentation transcript:

Comp3776: Data Mining and Text Analytics
Intro to Data Mining
By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Comp3740 Knowledge Management and Adaptive Systems, School of Computing, University of Leeds)

What has Machine Learning got to do with Computing / Information Systems?
"Most international organizations produce more information in a week than many people could read in a lifetime" (Adriaans and Zantinge)

Objectives of knowledge discovery or machine learning or data mining
Data mining is about discovering patterns in data. For this we need:
– KD/DM techniques, algorithms, tools, e.g. BootCat, WEKA
– A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM

Data Mining, Machine Learning, Knowledge Discovery, Text Mining
Data Mining was originally about learning patterns from databases: data structured as records and fields.
Knowledge Discovery is an exotic term for DM?
Increasingly, data is unstructured text (WWW), so Text Mining is a new subfield of DM, focussing on Knowledge Discovery from unstructured text data.

define: data mining
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from artificial intelligence, statistics and pattern recognition.
en.wikipedia.org/wiki/Data_mining

define: text mining
Text mining, also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics...
en.wikipedia.org/wiki/Text_mining

define: knowledge discovery
Knowledge discovery is the process of finding novel, interesting, and useful patterns in data. Data mining is a subset of knowledge discovery. It lets the data suggest new hypotheses to test.
tutorial/glossary/go01.html
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from AI, statistics and pattern recognition.
en.wikipedia.org/wiki/Knowledge_discovery

Data Mining: Overview
Concepts, Instances or examples, Attributes
Data Mining Concept Descriptions
Each instance is an example of the concept to be learned or described. The instance may be described by the values of its attributes.

Instances
Input to a data mining algorithm is in the form of a set of examples, or instances. Each instance is represented as a set of features or attributes.
Usually in DB data mining this set takes the form of a flat file; each instance is a record in the file, and each attribute is a field in the record.
In text mining, an instance may be a word/term in context (surrounding words/document).
The concepts to be learned are formed from patterns discovered within the set of instances.

Concepts
The types of concepts we try to learn include:
Key indicators: features or terms specific to our domain
Clusters or natural partitions
– E.g. we might cluster customers according to their shopping habits.
– E.g. is this web-page British or American English?
Rules for classifying examples into pre-defined classes
– E.g. mature students studying information systems with a high grade for General Studies A level are likely to get a 1st class degree
General associations
– E.g. people who buy nappies are in general likely also to buy beer
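The strength of a general association such as "nappies, so probably beer" is often measured by support and confidence. The sketch below is only an illustration of those two measures: the baskets are invented, and the function names `support` and `confidence` are hypothetical helpers, not part of any tool mentioned in these slides.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


rule_support = support({"nappies", "beer"}, baskets)
rule_confidence = confidence({"nappies"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

With these toy baskets, two of the three nappy baskets also contain beer, so the rule holds for about two thirds of nappy buyers.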

More concepts
The types of concepts we try to learn include:
Unexpected (suspicious?) associations or coincidences
– E.g. known suspects A, B, C all phoned D last week
Numerical prediction
– E.g. look for rules to predict what salary a graduate will get, given A level results, age, gender, programme of study and degree result. This may give us an equation:
Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
(but are Gender, Programme really numbers???)
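One common answer to the question about Gender and Programme is to replace each nominal attribute with 0/1 indicator terms. This is an assumption about how the equation could be made workable, not something the slide states; the baseline gender value g_1 and the per-programme coefficients d_p are hypothetical symbols introduced here.

```latex
\[
\text{Salary} = a \cdot \text{A-level} + b \cdot \text{Age}
  + c \cdot [\text{Gender} = g_1]
  + \sum_{p} d_p \, [\text{Prog} = p]
  + e \cdot \text{Degree}
\]
```

Here [condition] is 1 when the condition holds and 0 otherwise, and the sum runs over every programme except one chosen as the baseline.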

DB Example: weather to play?

outlook {sunny, overcast, rainy}
temperature {hot, mild, cool}
humidity {high, normal}
windy {TRUE, FALSE}
play {yes, no}

sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
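As a rough illustration of learning classification rules from these records, the sketch below builds one rule per attribute (predict the majority class for each attribute value) and counts how many of the 14 instances each rule gets wrong. This simple one-attribute learner is chosen here for illustration; the slide itself does not prescribe an algorithm, and `one_attribute_rule` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".strip().splitlines()
DATA = [row.split(",") for row in ROWS]  # last field is the class (play)


def one_attribute_rule(data, col):
    """Map each value of column col to its majority class; return (rule, errors)."""
    counts = defaultdict(Counter)
    for row in data:
        counts[row[col]][row[-1]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Every instance outside the majority class of its value is an error.
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return rule, errors


results = {name: one_attribute_rule(DATA, i) for i, name in enumerate(ATTRIBUTES)}
for name, (rule, errors) in results.items():
    print(f"{name}: {rule} ({errors}/{len(DATA)} errors)")
```

On this data the outlook and humidity rules tie for the fewest errors, which suggests those two attributes carry the most information about play when taken alone.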

outlook {sunny, overcast, rainy}
temperature (numeric)
humidity (numeric)
windy {TRUE, FALSE}
play {yes, no}

sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes

Text mining example: Which English dominates the WWW, UK or US?
First catch your rabbit (Mrs Beeton's cookbook): other tools are possible, but WWW-BootCat was easier to use …
First: sign up for Domain, SketchEngine account, Google key; download seeds-en from (see comp3740 specifications and lecture notes …)

Example 2: Data Mining for an ontology
Ontology: the concepts in a discipline, and meaning-relationships between these concepts (01.ppt)
"Concepts" roughly equates to terminology: specialist words and phrases in a discipline
WordNet is freely available for general English
What about other languages? EuroWordnet, BalkaNet, … (but not ALL languages!)
What about specific domains? Domain-specific ONTOLOGIES have to be devised (by experts)
What about my own specific domain/language? Automatic extraction of key words / concepts from example documents (machine learning / knowledge discovery)

Automatic terminology extraction
Terminology extraction = thesaurus construction based on documents (either the retrieved set or the whole collection) as Corpus: the training text set
– define a measure of how close one index term is to another (in meaning-space, or literal distance?)
– for each term, form a neighbourhood comprising the nearest n terms
– treat these neighbourhoods like related thesaurus classes
– terms with similar neighbourhoods are treated as synonyms
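The steps above can be sketched as follows. The closeness scores are invented for illustration, the function names are hypothetical, and Jaccard overlap is just one possible way (an assumption here) to decide that two neighbourhoods are "similar".

```python
def neighbourhood(term, closeness, n):
    """The n terms closest to `term` under the given closeness scores."""
    scores = closeness[term]
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:n])


def neighbourhood_overlap(a, b, closeness, n):
    """Jaccard overlap of the two neighbourhoods, ignoring a and b themselves."""
    na = neighbourhood(a, closeness, n) - {a, b}
    nb = neighbourhood(b, closeness, n) - {a, b}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


# Hypothetical closeness scores (higher = closer), invented for illustration.
closeness = {
    "car": {"automobile": 0.9, "engine": 0.8, "road": 0.7, "banana": 0.1},
    "automobile": {"car": 0.9, "engine": 0.7, "road": 0.6, "banana": 0.1},
    "banana": {"fruit": 0.9, "peel": 0.8, "car": 0.1, "automobile": 0.1},
}

print(neighbourhood_overlap("car", "automobile", closeness, 3))
print(neighbourhood_overlap("car", "banana", closeness, 3))
```

Terms whose overlap exceeds some chosen threshold would then be treated as candidate synonyms; the threshold is a design choice that the slide leaves open.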

Finding coordinate terms
One attempt to define how close a term is to another: if two terms are both used to index the same document many times in the collection, then they are deemed to be close.
From the document-term matrix, compute the term-correlation matrix.
The term-correlation matrix can be normalised so that terms that index a lot of documents don't have an unfair advantage: reduce the weight of common words.
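A minimal sketch of this computation, assuming a binary document-term matrix and one common normalisation, s_ij = c_ij / (c_ii + c_jj - c_ij). The slide does not say which normalisation is intended, so that formula is an assumption, and the tiny matrix is invented for illustration.

```python
# Rows are documents, columns are index terms; 1 means the term indexes the document.
# The matrix is a made-up example, not real collection data.
TERMS = ["data", "mining", "corpus", "the"]
DOC_TERM = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
]


def term_correlation(doc_term):
    """c[i][j] = number of documents indexed by both term i and term j."""
    n_terms = len(doc_term[0])
    return [
        [sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
        for i in range(n_terms)
    ]


def normalise(c):
    """s[i][j] = c[i][j] / (c[i][i] + c[j][j] - c[i][j]); down-weights frequent terms."""
    n = len(c)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = c[i][i] + c[j][j] - c[i][j]
            if denom:
                s[i][j] = c[i][j] / denom
    return s


C = term_correlation(DOC_TERM)
S = normalise(C)
mining, data, the = TERMS.index("mining"), TERMS.index("data"), TERMS.index("the")
print(C[mining][data], C[mining][the])
print(round(S[mining][data], 3), round(S[mining][the], 3))
```

Before normalisation "mining" co-occurs equally often with "data" and with "the"; afterwards "data" scores higher, because "the" indexes every document and is penalised for it.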

Other ways to find specialist terms
Other ways to find domain-specific terms and relations:
– Collect a domain corpus, and find terms that differ from a generic gold standard corpus: the British National Corpus
– Collocation-groups: for each term, collect its collocations in the Corpus: other words it appears next to (or near to). If two terms have similar collocation-sets, then they are deemed to be close.
– Association matrix based on proximity: compute the average distance between pairs of terms (no. of words between them, literally), and use this as the closeness metric
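The first idea, comparing a domain corpus against a generic reference corpus, can be sketched with a simple ratio of smoothed relative frequencies. The two short strings stand in for real corpora (the reference would be something like the British National Corpus), and both the add-one smoothing and the `keyness` function are assumptions made for this illustration.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyness(domain_text, reference_text):
    """Ratio of add-one smoothed relative frequencies: domain / reference."""
    dom = Counter(tokens(domain_text))
    ref = Counter(tokens(reference_text))
    vocab = set(dom) | set(ref)
    n_dom = sum(dom.values()) + len(vocab)
    n_ref = sum(ref.values()) + len(vocab)
    return {w: ((dom[w] + 1) / n_dom) / ((ref[w] + 1) / n_ref) for w in dom}


# Toy stand-ins for a domain corpus and a generic reference corpus.
domain = "the corpus contains the term and the term is tagged"
reference = "the cat sat on the mat and the dog sat too"

scores = keyness(domain, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

Words that are frequent in the domain text but rare in the reference rise to the top, while common words such as "the" score close to 1.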

Why build a thesaurus?
– A thesaurus or ontology can be used to normalise a vocabulary and queries (or documents?)
– It can be used (with some human intervention) to increase recall and precision
– A generic thesaurus/ontology may not be effective in specialized collections and/or queries
– Semi-automatic construction of a thesaurus/ontology based on the retrieved set of documents has produced some promising results, e.g. Semantic Web

Data Mining: Key points
Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.
Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules) …
… and in terms of the output they provide (e.g. clustering algorithms provide a set of subclasses)
Selecting the right tools for the job is based on business objectives: what is the USE for the knowledge discovered?

A Data Mining consultant…
You should be able to:
– Decide which is the appropriate data mining technique for a given problem defined in terms of business objectives.
– Decide which is the most appropriate form of input (which attributes/features will be useful for learning) and output (what does your client want to see?)