Southern Methodist University

Presentation transcript:

Southern Methodist University DATA MINING OVERVIEW ME Margaret H. Dunham CSE Department Southern Methodist University Dallas, Texas 75275 mhd@engr.smu.edu 10/30/02

UNCOVER HIDDEN INFORMATION
  Data is growing at a phenomenal rate.
  Users expect more sophisticated information.
  How? Uncover hidden information: DATA MINING

Data Mining Definition
  Finding hidden information in a database
  Fit data to a model
  Similar terms:
    Exploratory data analysis
    Data driven discovery
    Deductive learning

Database Processing vs. Data Mining Processing
  Database processing
    Query: Well defined; SQL
    Data: Operational data
    Output: Precise; subset of database
  Data mining processing
    Query: Poorly defined; no precise query language
    Data: Not operational data
    Output: Fuzzy; not a subset of database

Data Mining Development [figure not reproduced in transcript]

KDD Process (modified from [FPSS96C])
  Selection: Obtain data from various sources.
  Preprocessing: Cleanse data.
  Transformation: Convert to common format. Transform to new format.
  Data Mining: Obtain desired results.
  Interpretation/Evaluation: Present results to user in meaningful manner.

KDD Process Ex: Web Log
  Selection: Select log data (dates and locations) to use
  Preprocessing: Remove identifying URLs; remove error logs
  Transformation: Sessionize (sort and group)
  Data Mining: Identify and count patterns; construct data structure
  Interpretation/Evaluation: Identify and display frequently accessed sequences
  Potential User Applications: Cache prediction; personalization
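The transformation and data mining steps of the Web log example can be sketched roughly as follows. This is a minimal illustration, not the method from the slides: the record layout `(user_id, timestamp_seconds, url)`, the sample `LOG`, the 30-minute `timeout`, and the function names `sessionize` and `count_sequences` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned log records: (user_id, timestamp_seconds, url).
LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 60, "/cart"),
    ("u2", 10, "/home"), ("u2", 45, "/products"),
    ("u1", 4000, "/home"), ("u1", 4020, "/products"),
]


def sessionize(records, timeout=1800):
    """Sort by user and time, then group visits into sessions.

    A new session starts when a user is idle for more than `timeout`
    seconds (the 30-minute default is an assumption, not from the slides).
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for visits in by_user.values():
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions


def count_sequences(sessions, length=2):
    """Count contiguous page sequences of the given length across sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


sessions = sessionize(LOG)
patterns = count_sequences(sessions)
# The most frequently accessed sequences would be what gets displayed to the user.
print(patterns.most_common(3))
```

Frequent sequences found this way could feed the applications named on the slide, for instance prefetching the page that most often follows the current one.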

Basic Data Mining Tasks
  Classification maps data into predefined groups
    Pattern Recognition
    Regression
  Clustering partitions database into groups
    Groups not known a priori
    Determined by the data (similarity)
  Link Analysis uncovers relationships among data
    Association Rules
      Ex: 60% of the time bread is sold, so is peanut butter
    Sequence Analysis
      Ex: Most people who purchase CD players will purchase a CD within one week
    Not causal
    Not functional dependencies

Survey of Data Mining Tasks
  Classification
    Decision Trees
    Neural Networks
  Clustering
    Agglomerative
    Partitional
  Association Rules
  Web Mining

Classification Problem
  Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class.
  Actually divides D into equivalence classes.
  Prediction is similar, but may be viewed as having an infinite number of classes.

Classification Examples
  Pattern matching
  Fraud detection
  Identification of plant/animal species
  Profiling (this is not a bad word)
  Predicting terrorists or potential terrorist events
  Web searches (Information Retrieval)

Defining Classes
  Partitioning Based
  Distance Based

Decision Trees
  Decision Tree (DT): Tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.
  Popular technique for classification; leaf node indicates class to which the corresponding tuple belongs.
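The definition above maps fairly directly onto a small data structure. The sketch below is only an illustration: the `Node` and `Leaf` types, the `classify` function, and the loan-style questions in `TREE` are hypothetical and do not come from the slides, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # class predicted for every tuple that reaches this leaf


@dataclass
class Node:
    question: Callable[[dict], str]       # maps a tuple to one of the arc answers
    arcs: Dict[str, Union["Node", Leaf]]  # one child per possible answer


def classify(tree, tup):
    """Follow the arc matching each answer until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.arcs[tree.question(tup)]
    return tree.label


# Hypothetical hand-built tree; the attributes and thresholds are made up.
TREE = Node(
    question=lambda t: "yes" if t["income"] >= 50000 else "no",
    arcs={
        "yes": Leaf("approve"),
        "no": Node(
            question=lambda t: "yes" if t["has_collateral"] else "no",
            arcs={"yes": Leaf("approve"), "no": Leaf("reject")},
        ),
    },
)

print(classify(TREE, {"income": 20000, "has_collateral": True}))
```

Each internal node holds a question, each dictionary key is an arc (a possible answer), and each leaf carries the class assigned to the tuple.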

Decision Tree Example [figure not reproduced in transcript]

Neural Networks
  Based on observed functioning of human brain.
  (Artificial Neural Networks (ANN))
  Our view of neural networks is very simplistic.
  We view a neural network (NN) from a graphical viewpoint.
  Alternatively, a NN may be viewed from the perspective of matrices.
  Used in pattern recognition, speech recognition, computer vision, and classification.

Classification Using Neural Networks
  Typical NN structure for classification:
    One output node per class
    Output value is class membership function value
  Supervised learning
  For each tuple in training set, propagate it through NN. Adjust weights on edges to improve future classification.
  Algorithms: Propagation, Backpropagation, Gradient Descent
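The propagate-then-adjust loop can be sketched with a very small network. The example below is an assumption-laden illustration, not the network from the slides: it uses a single layer with one sigmoid output node per class and no hidden layer, so the backpropagation step reduces to the delta rule for the output nodes. The training set `DATA`, the learning rate, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate(weights, biases, tup):
    """Propagation: one output per class, the sigmoid of a weighted input sum."""
    return [sigmoid(sum(w * x for w, x in zip(row, tup)) + b)
            for row, b in zip(weights, biases)]


def train_step(weights, biases, tup, target, rate=0.5):
    """Gradient descent on the squared error for one training tuple."""
    outputs = propagate(weights, biases, tup)
    for j, (out, want) in enumerate(zip(outputs, target)):
        # Derivative of 0.5 * (out - want)**2 with respect to the node's net input.
        delta = (out - want) * out * (1.0 - out)
        for i, x in enumerate(tup):
            weights[j][i] -= rate * delta * x
        biases[j] -= rate * delta


def total_error(weights, biases, data):
    return sum(0.5 * (o - t) ** 2
               for tup, target in data
               for o, t in zip(propagate(weights, biases, tup), target))


# Hypothetical training set: class 0 when the first input is 0, class 1 otherwise.
DATA = [((0.0, 0.0), (1, 0)), ((0.0, 1.0), (1, 0)),
        ((1.0, 0.0), (0, 1)), ((1.0, 1.0), (0, 1))]

random.seed(0)
weights = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
biases = [0.0, 0.0]

error_before = total_error(weights, biases, DATA)
for _ in range(500):
    for tup, target in DATA:
        train_step(weights, biases, tup, target)
error_after = total_error(weights, biases, DATA)


def predict(tup):
    """Pick the class whose output node has the largest membership value."""
    outputs = propagate(weights, biases, tup)
    return outputs.index(max(outputs))
```

A network with hidden layers would additionally need the error passed backwards through those layers, which this single-layer sketch does not show.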

Neural Network Example [figure not reproduced in transcript]

Propagation [figure labels: Tuple, Input, Output]

Backpropagation [figure label: Error]

Clustering Problem
  Given a database D={t1,t2,…,tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1,…,k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
  A Cluster, Kj, contains precisely those tuples mapped to it.
  Unlike classification problem, clusters are not known a priori.

Clustering Examples
  Segment customer database based on similar buying patterns.
  Group houses in a town into neighborhoods based on similar features.
  Identify new plant species
  Identify similar Web usage patterns

Agglomerative Example [figure: items A, B, C, D, E; dendrogram with thresholds of 1 2 3 4 5]
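Since the figure did not survive in the transcript, a short sketch may help show how agglomerative clustering behaves at different thresholds. The distances in `DIST` are illustrative values chosen for this example and are not recovered from the slide; single link (minimum pairwise distance) is assumed as the merge criterion, and the function names are hypothetical.

```python
from itertools import combinations

# Illustrative symmetric distances between five items (not taken from the slide).
DIST = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
    ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 5,
    ("D", "E"): 3,
}


def distance(a, b):
    """Look up the distance for an unordered pair of distinct items."""
    return DIST.get((a, b), DIST.get((b, a)))


def single_link(c1, c2):
    """Single-link distance: the smallest distance between members of two clusters."""
    return min(distance(a, b) for a in c1 for b in c2)


def agglomerate(items, threshold):
    """Start with one cluster per item and merge the closest pair repeatedly.

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`.
    """
    clusters = [frozenset([item]) for item in items]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        if single_link(c1, c2) > threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


for threshold in range(0, 4):
    print(threshold, agglomerate("ABCDE", threshold))
```

Raising the threshold corresponds to cutting the dendrogram higher up: more merges are accepted and fewer, larger clusters remain.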

Association Rule Problem
  Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2, …, tn} where ti={Ii1,Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
  Link Analysis
  NOTE: Support of X ⇒ Y is same as support of X ∪ Y.

Example: Market Basket Data
  Items frequently purchased together: Bread, PeanutButter
  Uses: Placement; Advertising; Sales; Coupons
  Objective: increase sales and reduce costs

Association Rule Definitions
  Set of items: I={I1,I2,…,Im}
  Transactions: D={t1,t2, …, tn}, tj ⊆ I
  Itemset: {Ii1,Ii2, …, Iik} ⊆ I
  Support of an itemset: Percentage of transactions which contain that itemset.
  Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.

Association Rules Example
  I = {Beer, Bread, Jelly, Milk, PeanutButter}
  Support of {Bread, PeanutButter} is 60%
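The transaction table behind this example is not in the transcript, so the sketch below uses a hypothetical set of five transactions over the same items, chosen so that the support of {Bread, PeanutButter} comes out at 60%. Support follows the definition given above; confidence is computed with the usual formula support(X ∪ Y) / support(X), which the slides name but do not define. The function names are assumptions.

```python
# Hypothetical transactions over I = {Beer, Bread, Jelly, Milk, PeanutButter}.
TRANSACTIONS = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


bread_pb = support({"Bread", "PeanutButter"}, TRANSACTIONS)
conf_bread_to_pb = confidence({"Bread"}, {"PeanutButter"}, TRANSACTIONS)
conf_pb_to_bread = confidence({"PeanutButter"}, {"Bread"}, TRANSACTIONS)
print(f"support={bread_pb:.2f}")
print(f"conf(Bread => PeanutButter)={conf_bread_to_pb:.2f}")
print(f"conf(PeanutButter => Bread)={conf_pb_to_bread:.2f}")
```

The two rules share the same support but differ in confidence, which is why both thresholds appear in the problem statement.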

Web Data
  Web pages
  Intra-page structures
  Inter-page structures
  Usage data
  Supplemental data
    Profiles
    Registration information
    Cookies

Web Structure Mining
  Mine structure (links, graph) of the Web
  PageRank
  Create a model of the Web organization.
  May be combined with content mining to more effectively retrieve important pages.

PageRank
  Used by Google
  Prioritize pages returned from search by looking at Web structure.
  Importance of page is calculated based on number of pages which point to it (backlinks).
  Weighting is used to provide more importance to backlinks coming from important pages.
  PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
    PR(i): PageRank for a page i which points to target page p.
    Ni: number of links coming out of page i
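The formula can be applied iteratively, as in the rough sketch below. Several things here are assumptions rather than content from the slide: c is treated as a normalization constant that keeps the ranks summing to 1, every page is assumed to have at least one outgoing link, no damping factor is used, and the three-page graph `LINKS` and the function name `pagerank` are made up for illustration.

```python
def pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / Ni) over the pages i that link to p.

    `links` maps each page to the list of pages it points to. Every page is
    assumed to appear as a key and to have at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        raw = {p: 0.0 for p in pages}
        for i, targets in links.items():
            for p in targets:
                raw[p] += rank[i] / len(targets)  # PR(i) / Ni
        total = sum(raw.values())
        # c is taken to be 1 / total so that the ranks keep summing to 1.
        rank = {p: value / total for p, value in raw.items()}
    return rank


# Hypothetical Web graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
print({page: round(value, 3) for page, value in ranks.items()})
```

Page C has two backlinks and page B only one, and C ends up with the higher rank; page A has a single backlink, but it comes from the important page C, which illustrates the weighting described above.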

Web Usage Mining
  Extends work of basic search engines
  Search Engines
    IR application
    Keyword based
    Similarity between query and document
    Crawlers
    Indexing
    Profiles
    Link analysis

Web Usage Mining Applications
  Personalization
  Improve structure of a site’s Web pages
  Aid in caching and prediction of future page references
  Improve design of individual pages
  Improve effectiveness of e-commerce (sales and advertising)