Core Text Mining Operations 2007 년 02 월 06 일 부산대학교 인공지능연구실 한기덕 Text : The Text Mining Handbook pp.19~41.

Slides:



Advertisements
Similar presentations
Recap: Mining association rules from large datasets
Advertisements

Mining Association Rules from Microarray Gene Expression Data.
DATA MINING Association Rule Discovery. AR Definition aka Affinity Grouping Common example: Discovery of which items are frequently sold together at a.
Mining Frequent Patterns II: Mining Sequential & Navigational Patterns Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
gSpan: Graph-based substructure pattern mining
IT 433 Data Warehousing and Data Mining Association Rules Assist.Prof.Songül Albayrak Yıldız Technical University Computer Engineering Department
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rules l Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami) l Fast Algorithms for.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
TC3 Meeting in Montreal (Montreal/Secretariat)6 page 1 of 10 Structure and purpose of IEC ISO - IEC Specifications for Document Management.
Data Mining Association Analysis: Basic Concepts and Algorithms
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
Temporal Pattern Matching of Moving Objects for Location-Based Service GDM Ronald Treur14 October 2003.
Data Mining Association Analysis: Basic Concepts and Algorithms
Fast Algorithms for Association Rule Mining
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.
Mining Association Rules
Elec471 Embedded Computer Systems Chapter 4, Probability and Statistics By Prof. Tim Johnson, PE Wentworth Institute of Technology Boston, MA Theory and.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Information Retrieval in Practice
Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation Lauri Lahti.
Social Network Analysis via Factor Graph Model
Chapter 10 Functional Dependencies and Normalization for Relational Databases.
Association Rules. 2 Customer buying habits by finding associations and correlations between the different items that customers place in their “shopping.
Census A survey to collect data on the entire population.   Data The facts and figures collected, analyzed, and summarized for presentation and.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Association Rules. CS583, Bing Liu, UIC 2 Association rule mining Proposed by Agrawal et al in Initially used for Market Basket Analysis to find.
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
RCDL Conference, Petrozavodsk, Russia Context-Based Retrieval in Digital Libraries: Approach and Technological Framework Kurt Sandkuhl, Alexander Smirnov,
Web Usage Mining for Semantic Web Personalization جینی شیره شعاعی زهرا.
Compression.  Compression ratio: how much is the size reduced?  Symmetric/asymmetric: time difference to compress, decompress?  Lossless; lossy: any.
Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003.
Mining various kinds of Association Rules
Expert Systems with Applications 34 (2008) 459–468 Multi-level fuzzy mining with multiple minimum supports Yeong-Chyi Lee, Tzung-Pei Hong, Tien-Chin Wang.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun SIGIR’08 Speaker: Yi-Ling Tai Date: 2009/02/09 Finding Question-Answer Pairs from Online.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
Finding frequent and interesting triples in text Janez Brank, Dunja Mladenić, Marko Grobelnik Jožef Stefan Institute, Ljubljana, Slovenia.
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.
Single Document Key phrase Extraction Using Neighborhood Knowledge.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Discovering Calendar-based Temporal Association Rules SHOU Yu Tao May. 21 st, 2003 TIME 01, 8th International Symposium on Temporal Representation and.
Finding Text Trends l Word usage tracks interest changes l Segment documents by time period l Phrase frequency = number of documents l Phrase must have.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
Gspan: Graph-based Substructure Pattern Mining
Profiling: What is it? Notes and reflections on profiling and how it could be used in process mining.
Chapter 12 Understanding Research Results: Description and Correlation
Market Basket Analysis and Association Rules
Targeted Association Mining in Time-Varying Domains
Data Mining Association Rules Assoc.Prof.Songül Varlı Albayrak
Practical Database Design and Tuning
Market Basket Analysis and Association Rules
Presentation transcript:

Core Text Mining Operations 2007 년 02 월 06 일 부산대학교 인공지능연구실 한기덕 Text : The Text Mining Handbook pp.19~41

Contents  II.1 Core Text Mining Operations  II.1.1 Distributions  II.1.2 Frequent and Near Frequent Sets  II.1.3 Associations  II.1.4 Isolating Interesting Patterns  II.1.5 Analyzing Document Collections over Time  II.2 Using Background Knowledge for Text Mining  II.3 Text Mining Query Languages

II Core Text Mining Operations  Core mining operations in text mining systems are algorithms of the creation of queries for discovering patterns in document collections.

II.1 CORE TEXT MINING OPERATIONS  Core text mining operations consist of various mechanisms for discovering patterns of concept with in a document collection.  Three types of patterns  Distributions ( = proportions)  Frequent and near frequent sets  Associations  Symbols  D : a collection of documents  K : a set of concepts  k : a concept

II.1.1 Distributions  Definition II.1. Concept Selection  Selecting some subcollection of documents that is labeled by one or more given concepts  D/K –Subset of documents in D labeled with all of the concepts in K  Definition II.2. Concept Proportion  The proportion of a set of documents labeled with a particular concept  f(D, K) = |D/K| / |D| –The fraction of documents in D labeled with all of the concepts in K

II.1.1 Distributions  Definition II.3. Conditional Concept Proportion  The proportion of a set of documents labeled with a concept that are also labeled with another concept  f(D, K 1 |K 2 ) = f(D/K 2, K 1 ) –The proportion of all those documents in D labeled with K 2 that are also labeled with K 1  Definition II.4. Concept Proportion Distribution  The proportion of documents in some collection that are labeled with each of a number of selected concepts  F K (D, x) –The proportion of documents in D labeled with x for any x in K

II.1.1 Distributions  Definition II.5. Conditional Concept Proportion Distribution  The proportion of those documents in D labeled with all the concepts in K’ that are also labeled with concept x(with x in K)  F K (D,x | K’) = F K (D/K | K’, x)  Definition II.6. Average Concept Proportion  Given a collection of documents D, a concept k, and an internal node in the hierarchy n, an average concept proportion is the average value of f(D,k | k’), where k’ ranges over all immediate children of n.  a(D,k | n) = Avg {k’ is a child of n} {f(D,k | k’)}

II.1.1 Distributions  Definition II.7. Average Concept Distribution  Given a collection of documents D and two internal nodes in the hierarchy n and n’, average concept distribution is the distribution that, for any x that is a child of n, averages x’s proportions over all children of n’  A n (D,x | n’ ) = Avg {k’ is a child of n’} {F n (D,x | k’)}

II.1.2 Frequent and Near Frequent Sets  Frequent Concept Sets  A set of concepts represented in the document collection with co- occurrences at or above a minimal support level (given as a threshold parameter s; i.e., all the concepts of the frequent concept set appear together in at least s documents)  Support –The number (or percent) of documents containing the given rule – that is, the co-occurrence frequency  Confidence –The percentage of the time that the rule is true

II.1.2 Frequent and Near Frequent Sets  Algorithm II.1 : The Apriori Algorithm (Agrawal and Srikant 1994)  Discovery methods for frequent concept sets in text mining

II.1.2 Frequent and Near Frequent Sets

 Near Frequent Concept Sets  An undirected relation between two frequent sets of concepts  This relation can be quantified by measuring the degree of overlapping, for example, on the basis of the number of documents that include all the concepts of the two concept sets.

II.1.3 Associations  Associations  Directed relations between concepts or sets of concepts  Associations Rule  An expression of the from A => B, where A and B are sets of features  An association rule A => B indicates that transactions that involve A tend also to involve B.  A is the left-hand side (LHS)  B is the right-hand side (RHS)  Confidence of Association Rule A => B (A, B : frequent concept sets)  The percentage of documents that include all the concept in B within the subset of those documents that include all the concepts in A  Support of Association Rule A => B (A, B : frequent concept sets)  The percentage of documents that include all the concepts in A and B

II.1.3 Associations  Discovering Association Rules  The Problem of finding all the association rules with a confidence and support greater than the user-identified values minconf (the minimum confidence level) and minsup (the minimum support level) thresholds  Two step of discovering associations –Find all frequent concept sets X (i.e., all combinations of concepts with a support greater than minsup). –Test whether X-B => B holds with the required confidence –X = {w,x,y,z}, B = {y,z}, X-B = {w,x} –X-B => B {w,x} => {y,z} –Confidence of association rule {w,x} => {y,z} confidence = support({w,x,y,z}) / support({w,x}

II.1.3 Associations  Maximal Associations (M-association)  Relations between concepts in which associations are identified in terms of their relevance to one concept and their lack of relevance to another  Concept X most often appear in association with Concept Y

II.1.3 Associations  Definition II.8. Alone with Respect to Maximal Associations  For a transaction t, a category g, and a concept-set X g i, one would say that X is alone in t if t ∩ g i = X. –X is alone in t if X is largest subset of g i that is in t –X is maximal in t … –t M-supports X … –For a document collection D, the M-support of X in D –number of transactions t D that M-support X.

II.1.3 Associations  The M-support for the maximal association  If D(X,g(Y)) is the subset of the document collection D consisting of all the transactions that M-support X and contain at least one element of g(Y), then the M-confidence of the rule

II.1.4 Isolating Interesting Patterns  Interestingness with Respect to Distributions and Proportions  Measures for quantifying the distance between an investigated distribution and another distribution => Sum-of-squares to measure the distance between two models D(P’ || P) = ∑ (p’(x) – p(x)) 2

II.1.4 Isolating Interesting Patterns  Definition II.9. Concept Distribution Distance  Given two concept distributions P’ K (x) and P k (x), the distance D(P’ K || P K ) between them D(P’ K || P K ) = ∑(P’ K (x) – P K (x)) 2  Definition II.10. Concept Proportion Distance  The value of the difference between two distributions at a particular point d(P’ K || P K ) = P’ K (x) – P K (x)

II.1.5 Analyzing Document Collections over Time  Incremental Algorithms  Algorithms processing truly dynamic document collections that add, modify, or delete documents over time  Trend Analysis  The term generally used to describe the analysis of concept distribution behavior across multiple document subsets over time  A two-phase process –First phase –Phrases are created as frequent sequences of words using the sequential patterns mining algorithms first mooted for mining structured databases –Second phase –A user can query the system to obtain all phrases whose trend matches a specified pattern.

II.1.5 Analyzing Document Collections over Time  Ephemeral Associations  A direct or inverse relation between the probability distributions of given topics (concepts) over a fixed time span  Direct Ephemeral Associations –One very frequently occurring or “peak” topic during a period seems to influence either the emergence or disappearance of other topics  Inverse Ephemeral Associations –Momentary negative influence between one topic and another  Deviation Detection  The identification of anomalous instances that do not fit a defined “standard case” in large amounts of data.

II.1.5 Analyzing Document Collections over Time  Context Phrases and Context Relationships  Definition II.11. Context Phrase –A subset of documents in a document collection that is either labeled with all, or at least one, of the concepts in a specified set of concepts. –If D is a collection of documents and C is a set of concepts, D/A(C) is the subset of documents in D labeled with all the concepts in C, and D/O(C) is the subset of documents in D labeled with at least one of the concepts in C. Both D/A(C) and D/O(C) are referred to as context phrases.

II.1.5 Analyzing Document Collections over Time  Context Phrases and Context Relationships  Definition II.12. Context Relationships –The relationship within a set of concepts found in the document collection in relation to a separately specified concept ( the context or the context concept) –If D is a collection of documents, c 1 and c 2 are individual concepts, and P is a context phase, R(D, c 1, c 2 | P) is the number of documents in D/P which include both c 1 and c 2, Formally, R(D, c 1, c 2 | P) = |(D/A({c 1,c 2 }))|P|.

II.1.5 Analyzing Document Collections over Time  The Context Graph  Definition II.13. Context Graph –A graphic representation of the relationship between a set of concepts as reflected in a corpus respect to a given context. –A context graph consists of a set of vertices (=nodes) and edges. –The vertices of the graph represent concepts –Weighted “edges” denote the affinity between the concepts. –If D is a collection of documents, C is a set of concepts, and P is a context phrase, the concept graph of D, C, P is a weighted graph G = (C,E), with nodes in C and a set of edges E = ({c1,c2} | R(D, c1, c2 | P) > 0). For each edge, {c 1,c 2 } E, one defines the weight of the edge, w{c 1,c 2 } = R(D, c 1, c 2 | P).

II.1.5 Analyzing Document Collections over Time  Example of Context Graph in the context of P Concept1(C 1 ) Concept3(C 3 ) Concept2(C 2 ) R(D, c 1, c 2 | P) = 10 R(D, c 1, c 3 | P) = 15

II.1.5 Analyzing Document Collections over Time  Definition II.14. Temporal Selection (“Time Interval”)  If D is a collection of documents and I is a time range, date range, or both, D I is the subset of documents in D whose time stamp, date stamp, or both, is within I. The resulting selection is sometimes referred to as the time interval.

II.1.5 Analyzing Document Collections over Time  Definition II.15. Temporal Context Relationship  If D is a collection of documents, c 1 and c 2 are individual concepts, P is a context phrase, and I is the time interval, then R I (D, c 1, c 2 | P) is the number of documents in D I in which c 1 and c 2 co-occur in the context of P – that is, R I (D, c 1, c 2 | P) is the number of D I /P that include both c 1 and c 2.  Definition II.16. Temporal Context Graph  If D is a collection of documents, C is a set of concepts, P is a context phrase, and I is the time range, the temporal concept graph of D, C, P, I is a weighted graph G = (C,E I ) with set nodes in C and a set of edges E I, where E I = ({c 1,c 2 } | R(D,c 1,c 2 |P) > 0). For each edge, {c 1, c 2 } E, one defines the weight of the edge by w I {c 1,c 2 } = R I (D,c 1,c 2 |P).

II.1.5 Analyzing Document Collections over Time  The Trend Graph  A representation that builds on the temporal context graph as informed by the general approaches found in trend analysis  New Edges –Edges that did not exist in the previous graph  Increased Edges –Edges that have a relatively higher weight in relation to the previous interval  Decreased Edges –Edges that have a relatively decreased weight than the previous interval.  Stable Edges –Edges that have about the same weight as the corresponding edge in the previous interval

II.1.5 Analyzing Document Collections over Time  Example of Trend Graph (Weather Trend Graph)

II.1.5 Analyzing Document Collections over Time  The Borders Incremental Text Mining Algorithm  The Borders algorithm can be used to update search pattern results incrementally.  Definition II.17. Border Set –X is a border set if it is not a frequent set, but any proper subset Y X is frequent set

II.1.5 Analyzing Document Collections over Time  The Borders Incremental Text Mining Algorithm (con’t)  Concept Set A = {A 1, …, A m }  Relations over A: –R old : old relation –R inc : increment –R new : new combined relation  s(X/R) : support of concept set X in the relation R  s * : minimum support threshold (minsup)  Property 1: if X is a new frequent set in Rnew, then there is a subset Y X such that Y is a promoted border  Property 2: if X is a new k-sized frequent set in R new, then for each subset Y X of size k-1, Y is one of the following: (a) a promoted border, (b) a frequent set, or (c) an old frequent set with additional support in R inc.

II.1.5 Analyzing Document Collections over Time  The Borders Incremental Text Mining Algorithm (con’t)  Stage 1: Finding Promoted Borders and Generating Candidates.  Stage 2: Processing Candidates

II.1.5 Analyzing Document Collections over Time  The Borders Incremental Text Mining Algorithm (con’t)