Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute INTRODUCTION TO KNOWLEDGE DISCOVERY IN DATABASES AND DATA MINING
WHAT IS DATA MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]
[Diagram: Raw Data -> Data Mining -> Patterns]
Kinds of patterns: analytical patterns (rules, decision trees), statistical patterns (data distribution), visual patterns.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.
NEED FOR DATA MINING
Data are being gathered and stored extremely fast.
Computational tools and techniques are needed to help humans summarize, understand, and take advantage of accumulated data.
DATA ANALYSIS (KDD) PROCESS
[Diagram of the process; the labels that survive are grouped by stage below]
Data sources
Data management: databases, data warehouses
Data "pre"-processing: noisy/missing data, dimensionality reduction; produces clean data
Data analysis / data mining: analytical, statistical, visual; produces models/patterns
Model/pattern evaluation: quantitative, qualitative; produces a “good” model
Model/patterns deployment: prediction, decision support; applied to new data
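The stages on this slide can be walked through end to end on a toy problem. The following is only a minimal sketch under stated assumptions: the data, the function names (`preprocess`, `mine`, `evaluate`), the mean-imputation step, and the one-feature threshold rule are all hypothetical choices made for illustration, not methods taken from the presentation.

```python
from statistics import mean

# Hypothetical toy data: (feature value or None if missing, class label).
train = [(1.0, False), (2.0, False), (None, False),
         (6.0, True), (7.0, True), (8.0, True)]
test = [(1.5, False), (7.5, True), (6.5, True), (2.5, False)]


def preprocess(rows):
    """Pre-processing: replace missing values with the mean of the observed ones."""
    fill = mean(x for x, _ in rows if x is not None)
    return [(fill if x is None else x, y) for x, y in rows]


def mine(rows):
    """Data mining: pick the threshold t whose rule 'x >= t' makes the fewest errors."""
    candidates = sorted({x for x, _ in rows})

    def errors(t):
        return sum((x >= t) != y for x, y in rows)

    return min(candidates, key=errors)


def evaluate(threshold, rows):
    """Evaluation: fraction of held-out rows the rule classifies correctly."""
    return sum((x >= threshold) == y for x, y in rows) / len(rows)


clean = preprocess(train)             # clean data
threshold = mine(clean)               # model/pattern
accuracy = evaluate(threshold, test)  # quantitative evaluation
prediction = 9.0 >= threshold         # deployment: predict for new data

print(threshold, accuracy, prediction)
```

Each function stands in for one box of the diagram. A real project would replace each of them with a far more careful step, and would usually loop back to earlier stages when the evaluation is unsatisfactory.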
KDD IS INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS
Machine Learning (AI): contributes (semi-)automatic induction of empirical laws from observations and experimentation
Statistics: contributes language, framework, and techniques
Pattern Recognition: contributes pattern extraction and pattern matching techniques
Databases: contributes efficient data storage, data cleansing, and data access techniques
Data Visualization: contributes visual data displays and data exploration
High Performance Computing: contributes techniques for efficiently handling complexity
Application Domain: contributes domain knowledge
DATA MINING MODES
Confirmatory (verification): given a hypothesis, verify its validity against the data.
Exploratory (discovery):
  Prescriptive patterns: patterns for predicting the behavior of newly encountered entities.
  Descriptive patterns: patterns for presenting the behavior of observed entities in a human-understandable format.
WHAT DO YOU WANT TO LEARN FROM YOUR DATA? KDD APPROACHES
Approaches: classification, regression, clustering, summarization, dependency/association analysis, change/deviation detection.
[Figure: example outputs]
Rules: IF a & b & c THEN d & k; IF k & a THEN e
Partial rules and a diagram over A, B, C, D: IF A & B THEN; IF A & D THEN; values 0.5, 0.75, 0.3
Association rules: A, B -> C 80%; C, D -> A 22%
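The slide's association rule "A, B -> C 80%" can be made concrete with a small calculation. This sketch assumes the percentage is the rule's confidence, which the transcript does not state: the fraction of transactions containing the left-hand side that also contain the right-hand side. The transaction list and the function name `confidence` are hypothetical, chosen so that the rule comes out at 80%.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B"},
    {"C", "D"},
    {"A", "D"},
    {"B", "C"},
    {"C", "D"},
    {"D"},
]


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Counts transactions containing the antecedent, and among those,
    the ones that also contain the consequent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        raise ValueError("antecedent never occurs in the transactions")
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


conf_ab_c = confidence({"A", "B"}, {"C"}, transactions)
print(f"A, B -> C: {conf_ab_c:.0%}")  # 4 of the 5 transactions with A and B also have C
```

Association rule miners such as Apriori typically report support alongside confidence. This sketch leaves support out to stay close to the single number shown on the slide.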
COMMERCIAL DATA MINING SYSTEMS
MATLAB, Oracle Data Mining, and many more.
ACADEMIC DATA MINING SYSTEMS
WEKA: Frank et al., University of Waikato, New Zealand
RapidMiner: Klinkenberg et al., Univ. of Dortmund, Germany
R Programming Language: Ross Ihaka and Robert Gentleman, Univ. of Auckland, New Zealand
and many more.
DATA MINING RESOURCES – JOURNALS
Data Mining and Knowledge Discovery Journal
Newsletters:
  ACM SIGKDD Explorations Newsletter
Related Journals:
  TKDE: IEEE Transactions on Knowledge and Data Engineering
  TODS: ACM Transactions on Database Systems
  JACM: Journal of the ACM
  Data and Knowledge Engineering
  JIIS: Intl. Journal of Intelligent Information Systems
DATA MINING RESOURCES – CONFERENCES
KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
ICDM: IEEE International Conference on Data Mining
SDM: SIAM International Conference on Data Mining
PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
Related Conferences:
  ICML: Intl. Conf. on Machine Learning
  IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
  IJCAI: International Joint Conference on Artificial Intelligence
  AAAI: American Association for Artificial Intelligence Conference
  SIGMOD/PODS: ACM Intl. Conference on Data Management
  ICDE: International Conference on Data Engineering
  VLDB: International Conference on Very Large Data Bases
DATA MINING RESOURCES – BOOKS, DATASETS, … See resources webpage at: http://web.cs.wpi.edu/~ruiz/KDDRG/resources.html
SUMMARY
KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”.
The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of the resulting patterns.
Data mining is the discovery and extraction of patterns from data, not the extraction of data.
Important challenges in data mining: privacy, security, scalability, real-time processing, and handling non-conventional data.
KDDRG: KNOWLEDGE DISCOVERY AND DATA MINING RESEARCH GROUP
KDDRG Meetings
WHEN? Fridays at 1 pm
WHERE? Beckett Conference Room in Fuller Labs
To receive announcements of the talks, please subscribe to the KDDRG mailing list. I’ll send you an email with instructions on how to do so.