Chapter 2: Data Mining Dr. Goutam Sarker,

Fellow: IE(I), Fellow: IETE(I), Senior Member: IEEE(USA), Associate Professor, CSE, NITD 9/23/2017 6:11 AM Data Mining / CSE Department/ Dr. Goutam Sarker

What is Data Mining?
The term “data mining” refers to finding relevant and useful information in databases.

Definition 1
Data mining, or knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown and potentially useful information from data. This encompasses a number of technical approaches, such as clustering, data summarization, classification and pattern recognition.

Definition 2
Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data.

Definition 3
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as well as statistical and mathematical techniques.

KDD vs. Data Mining
Knowledge Discovery in Databases (KDD): The term was formalized in 1989 to refer to the broad, high-level process of seeking knowledge from data.
Data mining: It is only one of the many steps involved in knowledge discovery in databases.
The steps in the knowledge discovery process include data selection, data cleaning and preprocessing, data transformation and reduction, data mining algorithm selection and, finally, post-processing and interpretation of the discovered knowledge. The KDD process tends to be highly iterative and interactive.

Stages of KDD
Selection.
Preprocessing.
Transformation.
Data Mining.
Interpretation and Evaluation.
Data Visualization.

Stages of KDD contd.
Selection: This stage is concerned with selecting or segmenting the data that are relevant to some criteria.
Preprocessing: This is the data cleaning stage, where unnecessary information is removed.
Transformation: The data is not merely transferred across, but transformed so that it is suitable for the task of data mining. In this stage, the data is made usable and navigable.
Data Mining: This stage is concerned with the extraction of patterns from the data.
Interpretation and Evaluation: The patterns obtained in the data mining stage are converted into knowledge, which in turn is used to support decision making.
Data Visualization: Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data.
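The stages above can be pictured as a chain of small functions. The following is only a minimal sketch, assuming a hypothetical list of purchase records; the record layout, the function names and the 0.5 threshold are illustrative choices, not part of the original slides.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, region, item). None marks a missing item.
RAW = [
    (1, "east", "Bread"), (1, "east", "butter"), (2, "east", None),
    (2, "east", "bread"), (3, "west", "milk"), (3, "east", "bread"),
    (3, "east", "Butter"),
]

def select(records, region):
    """Selection: keep only the records relevant to a criterion."""
    return [r for r in records if r[1] == region]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r[2] is not None]

def transform(records):
    """Transformation: normalise item names and group them per customer."""
    baskets = {}
    for customer, _region, item in records:
        baskets.setdefault(customer, set()).add(item.lower())
    return baskets

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def interpret(counts, n_baskets, threshold=0.5):
    """Interpretation: keep items present in at least `threshold` of the baskets."""
    return sorted(i for i, c in counts.items() if c / n_baskets >= threshold)

baskets = transform(preprocess(select(RAW, "east")))
frequent = interpret(mine(baskets), len(baskets))
print(frequent)
```

In practice the process is iterative: the result of the interpretation step often sends the analyst back to change the selection criterion or the cleaning rules and run the chain again.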

DBMS vs. DM
A DBMS supports query languages, which are useful for query-triggered data exploration, whereas data mining supports automatic data exploration. If we know exactly what information we are seeking, a DBMS query suffices; if we only vaguely know the possible correlations or patterns, data mining techniques are useful. One of the tasks of data mining is hypothesis testing, in which we formulate a hypothesis and test it by sifting through the database. The data mining application goes where the data naturally reside. This avoids performance degradation and takes full advantage of database technology.
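The contrast between the two kinds of exploration can be sketched with Python's built-in sqlite3 module. The table name, its columns and the sample rows are assumptions made purely for illustration. The first part asks one precise question with a query; the second part names no item in advance and simply counts every pair of items that occurs together.

```python
import sqlite3
from collections import Counter
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (txn INTEGER, item TEXT)")
rows = [(1, "bread"), (1, "butter"), (2, "bread"), (2, "butter"),
        (2, "milk"), (3, "milk"), (4, "bread"), (4, "butter")]
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# Query-triggered exploration: we know exactly what we are asking.
(bread_txns,) = conn.execute(
    "SELECT COUNT(DISTINCT txn) FROM purchases WHERE item = 'bread'"
).fetchone()

# Automatic exploration: no item is named in advance; every pair is counted.
baskets = {}
for txn, item in conn.execute("SELECT txn, item FROM purchases"):
    baskets.setdefault(txn, set()).add(item)
pair_counts = Counter(
    pair
    for items in baskets.values()
    for pair in combinations(sorted(items), 2)
)
top_pair, top_count = pair_counts.most_common(1)[0]
print(bread_txns, top_pair, top_count)
```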

Related Areas:
Statistics
Machine Learning: Supervised Learning. Unsupervised Learning.

Artificial Intelligence (AI) vs. Data Mining
The task of automatically discovering patterns in data has so far been mostly the domain of Artificial Intelligence. Two main aspects differentiate DM from AI:

1. Data Mining emphasizes the human understandability of discovered patterns, whereas in AI the discovered patterns are meant to be used by the machine itself.
2. Data Mining techniques are meant to scale to huge stores of data such as the World Wide Web (WWW). In contrast, traditional AI approaches have mostly been researched using small “toy” data sets that fit in main memory.

Data Mining has borrowed a good deal from AI, especially from the field of machine learning, in which a program dynamically improves itself. Almost all classification techniques of machine learning have been used in data mining. Only those classification models that are not easily understandable by human users (e.g. neural network techniques) have been omitted.

Goals and DM Techniques
Two fundamental goals of data mining:
Prediction
Description
Prediction makes use of existing variables in the database to predict unknown or future values of interest.
Description focuses on finding patterns that describe the data and on presenting them for user interpretation.

Classification of Techniques
User-guided or verification-driven data mining
Discovery-driven data mining, or automatic discovery of rules

Data Mining Techniques
Verification Model: In this process of data mining, the user makes a hypothesis and tests it on the data to verify its validity. The emphasis is on the user, who is responsible for formulating the hypothesis.
Discovery Model: The discovery model differs in its emphasis. Here the system automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalizations, without guidance from the user.

Discovery Driven Tasks
Discovery of association rules
Discovery of classification rules
Clustering
Discovery of frequent episodes
Deviation detection

Discovery of Association Rules
An association rule has the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions in the database that contain X tend to also contain Y. Given a database, the goal is to discover all rules whose support and confidence are greater than or equal to the minimum support and minimum confidence.
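The slides do not define support and confidence. The sketch below assumes the usual definitions: the support of X ⇒ Y is the fraction of transactions that contain every item of X and Y together, and the confidence is that support divided by the support of X alone. The enumeration is deliberately brute force and is not an efficient miner; the function names and the sample transactions are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def association_rules(transactions, min_support, min_confidence):
    """Brute-force rule discovery; practical only for tiny item universes.

    `min_support` is assumed to be greater than zero.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s_xy = support(itemset, transactions)
            if s_xy < min_support:
                continue
            # Split the frequent itemset into every antecedent X and consequent Y.
            for k in range(1, size):
                for x in combinations(itemset, k):
                    y = tuple(i for i in itemset if i not in x)
                    confidence = s_xy / support(x, transactions)
                    if confidence >= min_confidence:
                        rules.append((x, y, s_xy, confidence))
    return rules

TRANSACTIONS = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
rules = association_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.9)
for x, y, s, c in rules:
    print(f"{set(x)} => {set(y)}  support={s:.2f} confidence={c:.2f}")
```

On this toy data the only rule that passes both thresholds is butter ⇒ bread: the two items appear together in half of the transactions, and every transaction that contains butter also contains bread.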

Classification
Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known.

Clustering
Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms.
The objectives of clustering are:
To uncover natural groupings
To initiate hypotheses about the data
To find a consistent and valid organization of the data
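The slides do not name a particular clustering algorithm. As one illustration, the sketch below uses plain k-means on two-dimensional points, with the starting centroids passed in explicitly so that the run is reproducible; the sample points are hypothetical.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c  # an empty cluster keeps its old centroid
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(POINTS, centroids=[POINTS[0], POINTS[3]])
print(clusters)
```

No class labels are supplied; the two groups emerge from the distances alone, which is the sense in which clustering uncovers natural groupings.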

Discovery of Classification Rules
Classification involves finding rules that partition the data into disjoint groups. The input to classification is the training data set, whose class labels are already known. This can also be termed supervised learning.

There are several classification discovery models:
Decision Trees. Neural Networks. Genetic Algorithms.
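As a small illustration of learning a classification rule from labelled training data, the sketch below fits a one-level decision tree (often called a decision stump). It is the simplest possible member of the decision tree family, not a full tree-induction algorithm; it assumes exactly two classes and numeric features, and the loan data is hypothetical.

```python
def train_stump(samples):
    """Learn a one-level decision tree (a "stump") from labelled samples.

    `samples` is a list of (feature_vector, class_label) pairs with two classes.
    Returns (feature_index, threshold, label_if_below, label_otherwise).
    """
    labels = sorted({label for _, label in samples})
    best = None
    for f in range(len(samples[0][0])):
        values = sorted({x[f] for x, _ in samples})
        # Candidate thresholds lie midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for below, other in (labels, labels[::-1]):
                errors = sum(
                    (below if x[f] < threshold else other) != label
                    for x, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, below, other)
    return best[1:]

def classify(stump, x):
    """Apply the learned rule to a new, unlabelled feature vector."""
    f, threshold, below, other = stump
    return below if x[f] < threshold else other

# Hypothetical training set: (income in thousands, age) -> loan decision.
TRAINING = [
    ((20, 25), "reject"), ((25, 40), "reject"), ((30, 35), "reject"),
    ((50, 30), "approve"), ((60, 45), "approve"), ((80, 28), "approve"),
]
stump = train_stump(TRAINING)
print(stump, classify(stump, (55, 33)))
```

The learned stump reads as a single human-understandable rule: if income is below 40, reject; otherwise approve. It partitions the data into two disjoint groups.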

Frequent Episodes
Frequent episodes are sequences of events that occur frequently and close to each other; they are extracted from time sequences.

R is a set of event types.
A is a particular type of event; therefore A ϵ R.
An event is defined as a pair (A, t), where A ϵ R as above and t is the time at which the event occurs.

A sequence of events (also called an event sequence) on R is a triple (Ts, Tc, S), where
Ts = starting time
Tc = ending time
S = {(A1, t1), (A2, t2), …, (An, tn)} is the ordered sequence of events, such that

Ai ϵ R and Ts <= ti <= Tc for all i = 1, 2, …, n, and ti <= t(i+1) for all i = 1, 2, …, n-1

3 types of episodes
a) Serial episodes: The events occur in sequence.
b) Parallel episodes: There are no constraints on the order in which the event types occur.
c) Non-serial, non-parallel episodes: For example, occurrences of A and B precede an occurrence of C, and there is no constraint on the relative order of A and B.
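One way to make "occur frequently, close to each other" concrete is to slide a fixed-width window over the event sequence and count the fraction of windows in which an episode occurs. The sketch below follows that idea under several assumptions that are not in the slides: times are integers, no two events share a time, windows start at every integer from Ts to Tc - width + 1, and all function names are hypothetical.

```python
def windows(events, t_start, t_end, width):
    """Yield the events in each window covering times t .. t + width - 1."""
    for t in range(t_start, t_end - width + 2):
        yield [(a, ti) for a, ti in events if t <= ti < t + width]

def occurs_parallel(episode, window):
    """Parallel episode: every event type is present, in any order."""
    return set(episode) <= {a for a, _ in window}

def occurs_serial(episode, window):
    """Serial episode: the event types appear in the given order."""
    remaining = iter([a for a, _ in window])  # the window is already time-ordered
    # Each wanted type must be found after the position of the previous match.
    return all(any(a == wanted for a in remaining) for wanted in episode)

def frequency(episode, occurs, events, t_start, t_end, width):
    """Fraction of windows in which the episode occurs."""
    all_windows = list(windows(events, t_start, t_end, width))
    return sum(occurs(episode, w) for w in all_windows) / len(all_windows)

# Hypothetical event sequence (Ts=1, Tc=8, S), listed in time order.
EVENTS = [("A", 1), ("B", 2), ("C", 3), ("B", 5), ("A", 6), ("C", 7)]
serial_ab = frequency(("A", "B"), occurs_serial, EVENTS, 1, 8, width=3)
parallel_ab = frequency(("A", "B"), occurs_parallel, EVENTS, 1, 8, width=3)
print(serial_ab, parallel_ab)
```

Here A and B fall in the same window in three of the six windows, but A precedes B in only one of them, so the parallel episode is more frequent than the serial one. An episode would be reported as frequent when this fraction reaches a user-chosen minimum.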

Deviation Detection
Deviation detection identifies outlying points in a particular data set and explains whether they are due to noise or other impurities in the data, or due to trivial reasons.
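The identification half of this task can be sketched with a simple robust rule: flag any value that lies more than k median absolute deviations from the median. The rule, the default k = 3 and the sample readings are assumptions chosen for illustration; the slides do not prescribe a method, and explaining why a point deviates still requires domain knowledge.

```python
from statistics import median

def find_deviations(values, k=3.0):
    """Return values lying more than k median-absolute-deviations from the median."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    return [v for v in values if abs(v - centre) > k * mad]

# Hypothetical sensor readings with one outlying point.
READINGS = [10.1, 9.8, 10.0, 10.3, 9.9, 25.0, 10.2]
print(find_deviations(READINGS))
```

The median is used instead of the mean because a single extreme value shifts the mean and inflates the standard deviation, which can hide the very point being looked for.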

Mining Problems
Neural Networks
Genetic Algorithms
Rough Set Techniques
Support Vector Machines

Other Mining Problems:
Sequence Mining: Concerned with mining sequence data.
Web Mining: The World Wide Web is a fertile area for data mining research because of the huge amount of information available online.
Text Mining: Text documents are structured by means of information extraction, text categorization, etc.

Spatial Data Mining: The branch of data mining that deals with spatial (location) data.
Geographically referenced data
Digital mapping
Remote sensing

DM Applications: case studies
1. Housing Loan Prepayment Prediction
2. Crime Detection
3. Customer Retention
4. Brand Loyalty

5. Banking
Detection of patterns of fraudulent credit card use.
Identifying ‘loyal’ customers.
Determining ‘credit card spending’ by customer group.

6. Astronomy: Detection of unusual stars, galaxies, nebulae or supergalaxies may lead to the discovery of previously unknown phenomena and celestial bodies.

End of Chapter 2
