1 Data Mining & Knowledge Discovery: A Review of Issues and a Multi-strategy Approach Ryszard S. Michalski and Kenneth A. Kaufman 2

2 [Overview] Emergence of a new research area: data mining & knowledge discovery

3 [2.1] Introduction How can useful, task-oriented knowledge be extracted from abundant raw data? Limitation of traditional/current methods: they are primarily oriented toward explaining the quantitative and statistical characteristics of data.

4 Continued Traditional statistical methods can characterize a data set numerically and globally, but cannot derive symbolic, task-oriented descriptions.

5 Continued Moreover, traditional methods cannot by themselves take in domain knowledge and automatically generate the relevant attributes. Goal of research in this field: to develop computational models for acquiring knowledge from data and background knowledge.

6 Continued By applying machine learning together with traditional methods, task-oriented data characterizations and generalizations are derived. "Task-oriented" means that different knowledge should be obtainable from the same data, which ultimately calls for a multi-strategy approach (different tasks require different kinds of data exploration and knowledge generalization). The goal of the multi-strategy approach is to obtain knowledge in a form similar to the data descriptions a human expert would produce. Main constraint: the knowledge descriptions must be easy for a domain expert to understand and interpret.

7 Continued Distinction between data mining & knowledge discovery. Data mining: the application of machine learning and other methods to the enumeration of patterns over the data. Knowledge discovery: the whole data-analysis lifecycle.

8 [2.2] Machine learning & multi-strategy data exploration Two points are explained here: the relationship between machine learning methodology and the goals of data mining and knowledge discovery, and how methods of symbolic machine learning can be used to (semi-)automate the conceptual exploration of data and the generation of task-oriented knowledge from data.

9 [2.2.1] Determining general rules from specific cases Multi-strategy data exploration is based on symbolic inductive learning. Two types of data exploration operators: (1) Operators for defining general symbolic descriptions of a designated group or groups of entities in a data set. These describe the characteristics common to the entities within each group, and, via a mechanism called "constructive induction", can employ abstract concepts not present in the original data. This is learning "characteristic concept descriptions".

10 Continued (2) Operators for defining differences between different groups of entities: learning "discriminant concept descriptions". Basic assumptions in concept learning: examples have no errors; all attributes have specified values; all examples are located in the same database; all concepts have a precise (crisp) description that does not change over time.
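The two operator types can be illustrated with a minimal Python sketch. The animal table, attribute names, and the set-intersection style of generalization are invented for illustration; they are not taken from the chapter:

```python
def characteristic_description(group):
    """Attribute-value pairs shared by ALL entities in the group."""
    common = dict(group[0])
    for entity in group[1:]:
        common = {a: v for a, v in common.items() if entity.get(a) == v}
    return common

def discriminant_description(group, other):
    """The part of the characteristic description that no entity of the
    contrasting group satisfies -- it discriminates the two groups."""
    return {a: v for a, v in characteristic_description(group).items()
            if all(e.get(a) != v for e in other)}

birds = [{"legs": 2, "flies": "yes", "feathers": "yes", "warm_blooded": "yes"},
         {"legs": 2, "flies": "no",  "feathers": "yes", "warm_blooded": "yes"}]
cats  = [{"legs": 4, "flies": "no",  "feathers": "no",  "warm_blooded": "yes"}]

# Characteristic: everything birds have in common (includes warm_blooded)
print(characteristic_description(birds))
# Discriminant: only what sets birds apart from cats (warm_blooded is dropped)
print(discriminant_description(birds, cats))
```

The characteristic description is maximally specific about the group itself, while the discriminant description keeps only what distinguishes the group from the contrasting one.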

11 Continued Integrating qualitative & quantitative discovery: defining sets of equations for a given set of data points, together with the qualitative conditions under which each equation applies. Qualitative prediction: finding patterns in a sequence/process and using them to predict future inputs qualitatively.

12 [2.2.2] Conceptual clustering Another class of machine learning methods related to data mining & knowledge discovery. Similar to traditional cluster analysis, but with a key difference. In traditional clustering, the similarity measure is a function only of the properties (attribute values) of the entities: Similarity(A,B) = f(properties of A and B)

13 Continued In conceptual clustering, the similarity measure is a function of the properties of the entities and two other factors, the available concept language L and the surrounding examples E: Conceptual cohesiveness(A,B) = f(properties, L, E) Fig.: an illustration of the difference between closeness and conceptual cohesiveness. Two points A and B may be put into the same cluster from the viewpoint of the traditional method, but into different clusters by conceptual clustering.
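A toy sketch of this distinction (the "band" concepts and point coordinates are invented): two points that are nearest neighbors by Euclidean distance can still fall under different concepts of the language L, and so end up in different conceptual clusters:

```python
import math

# A tiny concept language L: each concept is a predicate over points.
concepts = {
    "left_band":  lambda p: p[0] < 1.0,
    "right_band": lambda p: p[0] >= 1.0,
}

def concept_of(p):
    """Assign a point to the first concept of L that covers it."""
    return next(name for name, holds in concepts.items() if holds(p))

A, B = (0.9, 5.0), (1.1, 5.0)   # very close in Euclidean distance
C    = (0.2, 5.0)               # far from A, but inside the same band

# Distance-based clustering would group A with its nearest neighbor B ...
assert math.dist(A, B) < math.dist(A, C)
# ... but concept membership separates A from B and groups A with C.
print(concept_of(A), concept_of(B), concept_of(C))
```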

14 [2.2.3] Constructive induction In learning rules or decision trees from examples, the initially given attributes may not be directly relevant to the learning problem at hand. Advantage of symbolic methods over statistical methods: symbolic methods can identify non-essential attributes more easily. How to improve the representation space: (1) removing less relevant attributes; (2) generating new relevant attributes; (3) abstracting attributes (grouping some attribute values). "Constructive induction" consists of two phases: (1) construction of the best representation space; (2) generation of the best hypothesis in the space found above.
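The three representation-space improvements can be sketched on a made-up table; the attribute names (`serial`, `area`, `size`), the product construction, and the discretization threshold are all illustrative assumptions:

```python
rows = [
    {"width": 3, "height": 5, "serial": 101},
    {"width": 2, "height": 4, "serial": 102},
]

def improve(row):
    """Apply the three improvements from the slide to one row."""
    r = {k: v for k, v in row.items() if k != "serial"}  # (1) remove irrelevant attr
    r["area"] = r["width"] * r["height"]                 # (2) generate a new attribute
    r["size"] = "big" if r["area"] >= 12 else "small"    # (3) abstract to a symbol
    return r

improved = [improve(r) for r in rows]
print(improved[0])  # {'width': 3, 'height': 5, 'area': 15, 'size': 'big'}
```

A learner would then look for its hypothesis in this improved space (phase 2), where a rule like "size = big" may be expressible even though no single original attribute captured it.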

15 [2.2.4] Selection of the most representative examples Databases are usually very large, so the process of determining and generating patterns/rules is quite time-consuming. Therefore, extracting the most representative cases of the given classes is necessary to make the process more efficient. [2.2.5] Integration of qualitative & quantitative discovery For a database containing numerical attributes, quantitative discovery can find equations that describe the relationships among those attributes well; but no single fixed quantitative equation holds under different qualitative conditions, so a method is needed for choosing the quantitative equation according to the qualitative condition. [2.2.6] Qualitative prediction The goal is not to predict a specific value of a variable (as in time-series analysis), but to qualitatively describe a plausible future object.

16 [2.2.7] Summarizing the ML-oriented approach Traditional statistical methods are oriented toward numerical characterization of a data set and are used for globally characterizing a given class of objects. Machine learning methods are primarily oriented toward symbolic, logic-style descriptions of data and can determine descriptions for predicting the class membership of future objects. A multi-strategy approach combining the two is necessary, since different types of questions require different exploratory strategies.

17 [2.3] Classification of data exploration tasks How the GDT (General Data Table) relates machine learning techniques to data exploration problems: (1) Learning rules from examples Designate one discrete attribute as the output attribute and the remaining attributes as inputs; using the given set of rows as training examples, determine the relationships (rules) between them. => This can be applied to any of the attributes. (2) Determining time-dependent patterns Detection of temporal patterns in sequences of data arranged along the time dimension in a GDT, using the multi-model method for qualitative prediction and the temporal constructive induction technique. (3) Example selection Select the rows of the table corresponding to the most representative examples of the different classes.
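Task (1) can be sketched as follows: a tiny, assumed GDT and a deliberately simple learner that keeps only single-condition rules consistent with every row. The table, attribute names, and one-condition restriction are illustrative, not the paper's algorithm:

```python
table = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

def learn_rules(rows, target):
    """Return every rule 'if attr = val then target = cls' consistent
    with all rows, treating `target` as the output attribute."""
    rules = []
    attrs = [a for a in rows[0] if a != target]
    for attr in attrs:
        for val in {r[attr] for r in rows}:
            classes = {r[target] for r in rows if r[attr] == val}
            if len(classes) == 1:        # the condition determines the class
                rules.append((attr, val, classes.pop()))
    return sorted(rules)

print(learn_rules(table, "play"))
# [('outlook', 'rainy', 'no'), ('outlook', 'sunny', 'yes')]
```

Because any discrete column can play the output role, the same call with a different `target` explores a different relationship in the same GDT.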

18 Continued (4) Attribute selection Also called feature selection: remove the columns corresponding to the attributes least relevant to learning, typically using attribute-selection measures such as gain ratio or promise level. (5) Generating new attributes Via the constructive induction described earlier, generate new relevant attributes from the initially given ones. (6) Clustering Likewise, via the conceptual clustering described earlier, partition the rows of the GDT into the desired groups (clusters). => The rules describing the resulting clusters are stored in the knowledge base. (7) Determining attribute dependencies Determine relationships (e.g., correlations, causal dependencies, logical dependencies) among attributes (columns) using statistical/logical methods.
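Gain ratio, one of the selection measures named in (4), can be computed with the standard C4.5-style definitions (entropy, information gain, split information). The toy table is invented; only the formulas are standard:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Information gain of `attr` w.r.t. `target`, normalized by split info."""
    n = len(rows)
    remainder = split_info = 0.0
    for val in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == val]
        p = len(subset) / n
        remainder += p * entropy(subset)   # expected entropy after the split
        split_info -= p * log2(p)          # penalizes many-valued attributes
    gain = entropy([r[target] for r in rows]) - remainder
    return gain / split_info if split_info else 0.0

rows = [{"outlook": "sunny", "windy": "no",  "play": "yes"},
        {"outlook": "sunny", "windy": "yes", "play": "yes"},
        {"outlook": "rainy", "windy": "no",  "play": "no"},
        {"outlook": "rainy", "windy": "yes", "play": "no"}]

# outlook perfectly predicts play; windy carries no information
print(gain_ratio(rows, "outlook", "play"), gain_ratio(rows, "windy", "play"))
```

An attribute-selection pass would drop the columns with the lowest scores (here, `windy`).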

19 Continued (8) Incremental rule update Update the working knowledge (rules) to accommodate new information. (9) Searching for approximate patterns in (imperfect) data Determine the best hypothesis that accounts for most of the available data. (10) Filling in missing data Determine plausible values for the missing entries through analysis of the currently available data. (11) Determining decision structures from declarative knowledge (decision rules) Once general decision rules have been hypothesized for a given data set (GDT), it is desirable to convert them into a decision tree (decision structure) so that they can be used to predict new cases.
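Task (10) can be sketched with one simple heuristic: fill a missing entry with the most frequent value among rows that agree on all other attributes. Both the data and the heuristic are illustrative assumptions, not the paper's method:

```python
from collections import Counter

rows = [{"color": "red", "shape": "round", "class": "apple"},
        {"color": "red", "shape": "round", "class": "apple"},
        {"color": "red", "shape": "long",  "class": "pepper"},
        {"color": "red", "shape": None,    "class": "apple"}]  # missing shape

def fill_missing(rows, attr):
    """Replace None entries of `attr` with the most common value among
    rows that match on every other attribute."""
    for row in rows:
        if row[attr] is None:
            matches = [r[attr] for r in rows
                       if r[attr] is not None
                       and all(r[k] == row[k] for k in row if k != attr)]
            if matches:
                row[attr] = Counter(matches).most_common(1)[0][0]
    return rows

fill_missing(rows, "shape")
print(rows[-1]["shape"])  # round
```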

