Core Text Mining Operations. February 6, 2007. Artificial Intelligence Lab, Pusan National University. Presenter: Han Gi-deok (한기덕). Text: The Text Mining Handbook, pp. 19-41.

1 Core Text Mining Operations. February 6, 2007. Artificial Intelligence Lab, Pusan National University. Han Gi-deok (한기덕). Text: The Text Mining Handbook, pp. 19-41.

2 Contents
 II.1 Core Text Mining Operations
– II.1.1 Distributions
– II.1.2 Frequent and Near Frequent Sets
– II.1.3 Associations
– II.1.4 Isolating Interesting Patterns
– II.1.5 Analyzing Document Collections over Time
 II.2 Using Background Knowledge for Text Mining
 II.3 Text Mining Query Languages

3 II Core Text Mining Operations
 Core mining operations in text mining systems are the algorithms that underlie the creation of queries for discovering patterns in document collections.

4 II.1 CORE TEXT MINING OPERATIONS
 Core text mining operations consist of various mechanisms for discovering patterns of concepts within a document collection.
 Three types of patterns
– Distributions (= proportions)
– Frequent and near frequent sets
– Associations
 Symbols
– D : a collection of documents
– K : a set of concepts
– k : a concept

5 II.1.1 Distributions
 Definition II.1. Concept Selection
 Selecting some subcollection of documents that is labeled with one or more given concepts
– D/K : the subset of documents in D labeled with all of the concepts in K
 Definition II.2. Concept Proportion
 The proportion of a set of documents labeled with a particular concept
– f(D, K) = |D/K| / |D| : the fraction of documents in D labeled with all of the concepts in K
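The selection and proportion operators of Definitions II.1 and II.2 can be sketched in Python, modeling each document as the set of concepts that label it. The collection and concept labels below are invented for illustration, not taken from the text.

```python
def select(D, K):
    """D/K: the documents in D labeled with all of the concepts in K."""
    return [d for d in D if K <= d]

def proportion(D, K):
    """f(D, K) = |D/K| / |D|."""
    return len(select(D, K)) / len(D)

# Illustrative collection: each document is its set of concept labels.
D = [{"iron", "steel"}, {"iron"}, {"gold"}, {"iron", "steel", "gold"}]
print(proportion(D, {"iron"}))          # fraction of documents labeled "iron"
print(proportion(D, {"iron", "steel"})) # fraction labeled with both concepts
```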

6 II.1.1 Distributions
 Definition II.3. Conditional Concept Proportion
 The proportion of a set of documents labeled with one concept that are also labeled with another concept
– f(D, K1 | K2) = f(D/K2, K1) : the proportion of all those documents in D labeled with K2 that are also labeled with K1
 Definition II.4. Concept Proportion Distribution
 The proportion of documents in some collection that are labeled with each of a number of selected concepts
– F_K(D, x) : the proportion of documents in D labeled with x, for any x in K

7 II.1.1 Distributions
 Definition II.5. Conditional Concept Proportion Distribution
 The proportion of those documents in D labeled with all the concepts in K' that are also labeled with concept x (with x in K)
– F_K(D, x | K') = F_K(D/K', x)
 Definition II.6. Average Concept Proportion
 Given a collection of documents D, a concept k, and an internal node in the hierarchy n, the average concept proportion is the average value of f(D, k | k'), where k' ranges over all immediate children of n.
– a(D, k | n) = Avg_{k' is a child of n} { f(D, k | k') }

8 II.1.1 Distributions
 Definition II.7. Average Concept Distribution
 Given a collection of documents D and two internal nodes in the hierarchy, n and n', the average concept distribution is the distribution that, for any x that is a child of n, averages x's proportions over all children of n'.
– A_n(D, x | n') = Avg_{k' is a child of n'} { F_n(D, x | k') }
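Definitions II.3 and II.6 combine as follows in a small sketch. The two-level hierarchy (a node whose children are "iron" and "gold") and the labeled documents are illustrative assumptions, not taken from the text.

```python
def proportion(D, K):
    """f(D, K) = |D/K| / |D|, with D/K the documents labeled with all of K."""
    return len([d for d in D if K <= d]) / len(D)

def cond_proportion(D, K1, K2):
    """f(D, K1 | K2) = f(D/K2, K1)."""
    DK2 = [d for d in D if K2 <= d]
    return proportion(DK2, K1) if DK2 else 0.0

def avg_proportion(D, k, children):
    """a(D, k | n): average of f(D, {k} | {k'}) over the children k' of n."""
    vals = [cond_proportion(D, {k}, {c}) for c in children]
    return sum(vals) / len(vals)

D = [{"metal", "iron"}, {"metal", "gold"}, {"metal", "gold", "iron"}]
# node n = "metal", with immediate children "iron" and "gold"
print(avg_proportion(D, "iron", ["iron", "gold"]))
```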

9 II.1.2 Frequent and Near Frequent Sets
 Frequent Concept Sets
 A set of concepts represented in the document collection with co-occurrences at or above a minimal support level (given as a threshold parameter s; i.e., all the concepts of the frequent concept set appear together in at least s documents)
 Support
– The number (or percentage) of documents containing the given rule, that is, the co-occurrence frequency
 Confidence
– The percentage of the time that the rule is true

10 II.1.2 Frequent and Near Frequent Sets
 Algorithm II.1: The Apriori Algorithm (Agrawal and Srikant 1994)
 A discovery method for frequent concept sets in text mining

11 II.1.2 Frequent and Near Frequent Sets
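A compact, illustrative sketch of Apriori-style frequent-set discovery over concept-labeled documents. This is a simplified level-wise loop, not the book's exact pseudocode; the data and threshold are invented.

```python
from itertools import combinations

def apriori(D, s):
    """Return {concept set: support} for all sets with support >= s."""
    items = sorted({c for d in D for c in d})
    freq = {}
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # count support of each candidate; keep those meeting the threshold
        counts = {X: sum(1 for d in D if X <= d) for X in candidates}
        level = {X: n for X, n in counts.items() if n >= s}
        freq.update(level)
        # join step (without the full subset prune): merge frequent k-sets
        # that differ in exactly one concept into (k+1)-set candidates
        keys = list(level)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == len(a) + 1})
    return freq

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori(D, 2))
```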

12 Near Frequent Concept Sets
 An undirected relation between two frequent sets of concepts
 This relation can be quantified by measuring the degree of overlap, for example, on the basis of the number of documents that include all the concepts of the two concept sets.

13 II.1.3 Associations
 Associations
– Directed relations between concepts or sets of concepts
 Association Rule
– An expression of the form A => B, where A and B are sets of features
– An association rule A => B indicates that transactions that involve A tend also to involve B.
– A is the left-hand side (LHS); B is the right-hand side (RHS)
 Confidence of Association Rule A => B (A, B : frequent concept sets)
– The percentage of documents that include all the concepts in B within the subset of those documents that include all the concepts in A
 Support of Association Rule A => B (A, B : frequent concept sets)
– The percentage of documents that include all the concepts in A and B

14 II.1.3 Associations
 Discovering Association Rules
 The problem of finding all the association rules with confidence and support greater than the user-specified thresholds minconf (the minimum confidence level) and minsup (the minimum support level)
 Two steps of discovering associations
– Find all frequent concept sets X (i.e., all combinations of concepts with a support greater than minsup).
– Test whether X - B => B holds with the required confidence.
 Example
– X = {w, x, y, z}, B = {y, z}, X - B = {w, x}
– X - B => B : {w, x} => {y, z}
– Confidence of the association rule {w, x} => {y, z}: confidence = support({w, x, y, z}) / support({w, x})
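The two-step procedure and the {w, x} => {y, z} example above can be sketched as follows. The support counts are invented for illustration.

```python
def confidence(support, X, B):
    """conf(X - B => B) = support(X) / support(X - B)."""
    return support[X] / support[X - B]

# Invented support counts for the frequent sets in the example.
support = {frozenset("wx"): 40, frozenset("wxyz"): 30}
X, B = frozenset("wxyz"), frozenset("yz")
conf = confidence(support, X, B)   # support({w,x,y,z}) / support({w,x})
minconf = 0.7                      # assumed user threshold
print(conf, conf >= minconf)
```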

15 II.1.3 Associations
 Maximal Associations (M-associations)
 Relations between concepts in which associations are identified in terms of their relevance to one concept and their lack of relevance to another
 Concept X most often appears in association with concept Y

16 II.1.3 Associations
 Definition II.8. Alone with Respect to Maximal Associations
 For a transaction t, a category g_i, and a concept set X ⊆ g_i, one says that X is alone in t if t ∩ g_i = X.
– X is alone in t if X is the largest subset of g_i that is in t
– X is maximal in t …
– t M-supports X …
– For a document collection D, the M-support of X in D is the number of transactions t ∈ D that M-support X.

17 II.1.3 Associations
 The M-support for the maximal association
 If D(X, g(Y)) is the subset of the document collection D consisting of all the transactions that M-support X and contain at least one element of g(Y), then the M-confidence of the rule …

18 II.1.4 Isolating Interesting Patterns
 Interestingness with Respect to Distributions and Proportions
 Measures for quantifying the distance between an investigated distribution and another distribution => sum of squares to measure the distance between two models:
– D(P' || P) = Σ_x (p'(x) - p(x))²

19 II.1.4 Isolating Interesting Patterns
 Definition II.9. Concept Distribution Distance
 Given two concept distributions P'_K(x) and P_K(x), the distance D(P'_K || P_K) between them is
– D(P'_K || P_K) = Σ_x (P'_K(x) - P_K(x))²
 Definition II.10. Concept Proportion Distance
 The value of the difference between two distributions at a particular point x
– d(P'_K || P_K) = P'_K(x) - P_K(x)
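Definitions II.9 and II.10 in a short sketch, with each distribution given as a dictionary mapping concepts in K to proportions. The distribution values are illustrative.

```python
def distribution_distance(P1, P2):
    """D(P1 || P2) = sum over x of (P1(x) - P2(x))^2."""
    return sum((P1.get(x, 0.0) - P2.get(x, 0.0)) ** 2
               for x in set(P1) | set(P2))

def proportion_distance(P1, P2, x):
    """d(P1 || P2) at point x = P1(x) - P2(x)."""
    return P1.get(x, 0.0) - P2.get(x, 0.0)

P1 = {"iron": 0.5, "gold": 0.5}   # investigated distribution
P2 = {"iron": 0.3, "gold": 0.7}   # reference distribution
print(distribution_distance(P1, P2))  # (0.2)^2 + (-0.2)^2
```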

20 II.1.5 Analyzing Document Collections over Time
 Incremental Algorithms
– Algorithms for processing truly dynamic document collections that add, modify, or delete documents over time
 Trend Analysis
– The term generally used to describe the analysis of concept distribution behavior across multiple document subsets over time
 A two-phase process
– First phase: phrases are created as frequent sequences of words using the sequential-pattern mining algorithms first mooted for mining structured databases.
– Second phase: a user can query the system to obtain all phrases whose trend matches a specified pattern.

21 II.1.5 Analyzing Document Collections over Time
 Ephemeral Associations
– A direct or inverse relation between the probability distributions of given topics (concepts) over a fixed time span
 Direct Ephemeral Associations
– One very frequently occurring or "peak" topic during a period seems to influence either the emergence or disappearance of other topics
 Inverse Ephemeral Associations
– A momentary negative influence of one topic on another
 Deviation Detection
– The identification of anomalous instances that do not fit a defined "standard case" in large amounts of data

22 II.1.5 Analyzing Document Collections over Time
 Context Phrases and Context Relationships
 Definition II.11. Context Phrase
– A subset of documents in a document collection that is labeled with either all, or at least one, of the concepts in a specified set of concepts
– If D is a collection of documents and C is a set of concepts, D/A(C) is the subset of documents in D labeled with all the concepts in C, and D/O(C) is the subset of documents in D labeled with at least one of the concepts in C. Both D/A(C) and D/O(C) are referred to as context phrases.
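The two context-phrase selections D/A(C) and D/O(C) of Definition II.11, sketched with invented documents and concepts:

```python
def select_all(D, C):
    """D/A(C): documents labeled with all of the concepts in C."""
    return [d for d in D if C <= d]

def select_any(D, C):
    """D/O(C): documents labeled with at least one of the concepts in C."""
    return [d for d in D if C & d]

D = [{"crude", "oil"}, {"oil"}, {"gold"}]
C = {"crude", "oil"}
print(len(select_all(D, C)), len(select_any(D, C)))
```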

23 II.1.5 Analyzing Document Collections over Time
 Context Phrases and Context Relationships
 Definition II.12. Context Relationship
– The relationship within a set of concepts found in the document collection in relation to a separately specified concept (the context, or context concept)
– If D is a collection of documents, c1 and c2 are individual concepts, and P is a context phrase, R(D, c1, c2 | P) is the number of documents in D/P that include both c1 and c2. Formally, R(D, c1, c2 | P) = |(D/A({c1, c2}))/P|.

24 II.1.5 Analyzing Document Collections over Time
 The Context Graph
 Definition II.13. Context Graph
– A graphic representation of the relationships among a set of concepts, as reflected in a corpus, with respect to a given context
– A context graph consists of a set of vertices (nodes) and edges: the vertices represent concepts, and weighted edges denote the affinity between concepts.
– If D is a collection of documents, C is a set of concepts, and P is a context phrase, the concept graph of D, C, P is a weighted graph G = (C, E), with nodes in C and a set of edges E = { {c1, c2} | R(D, c1, c2 | P) > 0 }. For each edge {c1, c2} ∈ E, one defines the weight of the edge as w{c1, c2} = R(D, c1, c2 | P).

25 II.1.5 Analyzing Document Collections over Time
 Example of a context graph in the context of P: three nodes, Concept1 (c1), Concept2 (c2), and Concept3 (c3), with edge weights R(D, c1, c2 | P) = 10 and R(D, c1, c3 | P) = 15.
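A sketch of building the weighted context graph of Definition II.13. Here the context phrase P is modeled as a simple predicate over documents, which is an assumption made for illustration; the documents are invented.

```python
from itertools import combinations

def context_relationship(D, c1, c2, in_context):
    """R(D, c1, c2 | P): documents in D/P containing both c1 and c2."""
    return sum(1 for d in D if in_context(d) and c1 in d and c2 in d)

def context_graph(D, C, in_context):
    """Edges {c1, c2} with weight R(D, c1, c2 | P) > 0."""
    edges = {}
    for c1, c2 in combinations(sorted(C), 2):
        w = context_relationship(D, c1, c2, in_context)
        if w > 0:
            edges[frozenset({c1, c2})] = w
    return edges

# "P" marks membership in the context phrase selection D/P.
D = [{"P", "c1", "c2"}, {"P", "c1", "c3"}, {"c1", "c2"}]
g = context_graph(D, {"c1", "c2", "c3"}, lambda d: "P" in d)
print(g)
```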

26 II.1.5 Analyzing Document Collections over Time
 Definition II.14. Temporal Selection ("Time Interval")
 If D is a collection of documents and I is a time range, date range, or both, D_I is the subset of documents in D whose time stamp, date stamp, or both is within I. The resulting selection is sometimes referred to as the time interval.

27 II.1.5 Analyzing Document Collections over Time
 Definition II.15. Temporal Context Relationship
 If D is a collection of documents, c1 and c2 are individual concepts, P is a context phrase, and I is a time interval, then R_I(D, c1, c2 | P) is the number of documents in D_I in which c1 and c2 co-occur in the context of P; that is, R_I(D, c1, c2 | P) is the number of documents in D_I/P that include both c1 and c2.
 Definition II.16. Temporal Context Graph
 If D is a collection of documents, C is a set of concepts, P is a context phrase, and I is a time range, the temporal concept graph of D, C, P, I is a weighted graph G = (C, E_I), with nodes in C and a set of edges E_I = { {c1, c2} | R_I(D, c1, c2 | P) > 0 }. For each edge {c1, c2} ∈ E_I, one defines the weight of the edge as w_I{c1, c2} = R_I(D, c1, c2 | P).

28 II.1.5 Analyzing Document Collections over Time
 The Trend Graph
– A representation that builds on the temporal context graph, informed by the general approaches found in trend analysis
 New Edges
– Edges that did not exist in the previous graph
 Increased Edges
– Edges whose weight is relatively higher than in the previous interval
 Decreased Edges
– Edges whose weight is relatively lower than in the previous interval
 Stable Edges
– Edges whose weight is about the same as that of the corresponding edge in the previous interval
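The four edge categories above can be sketched as a comparison of edge weights across two consecutive intervals. The tolerance `eps` used to decide when an edge is "stable" is an assumption, since the text does not fix one; the weather-themed edge weights are invented.

```python
def classify_edges(prev, curr, eps=0.1):
    """Label each edge in the current graph relative to the previous one."""
    labels = {}
    for e, w in curr.items():
        if e not in prev:
            labels[e] = "new"
        elif w > prev[e] * (1 + eps):
            labels[e] = "increased"
        elif w < prev[e] * (1 - eps):
            labels[e] = "decreased"
        else:
            labels[e] = "stable"
    return labels

prev = {"rain-wind": 10, "sun-heat": 8, "snow-cold": 5}
curr = {"rain-wind": 10, "sun-heat": 16, "snow-cold": 2, "fog-cold": 4}
print(classify_edges(prev, curr))
```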

29 II.1.5 Analyzing Document Collections over Time  Example of Trend Graph (Weather Trend Graph)

30 II.1.5 Analyzing Document Collections over Time
 The Borders Incremental Text Mining Algorithm
 The Borders algorithm can be used to update search-pattern results incrementally.
 Definition II.17. Border Set
– X is a border set if it is not a frequent set but every proper subset Y ⊂ X is a frequent set.
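Definition II.17 can be checked directly: X is a border set if X itself is not frequent but every proper subset of X is. The exhaustive subset check below is for clarity, not efficiency, and the data is illustrative.

```python
from itertools import combinations

def support(D, X):
    """Number of documents in D containing all of the concepts in X."""
    return sum(1 for d in D if X <= d)

def is_border(D, X, minsup):
    """True if X is a border set at minimum support minsup."""
    if support(D, X) >= minsup:
        return False  # X is frequent, so not a border set
    return all(support(D, frozenset(Y)) >= minsup
               for r in range(len(X))
               for Y in combinations(X, r))

D = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
print(is_border(D, frozenset({"a", "b"}), 2))
```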

31 II.1.5 Analyzing Document Collections over Time
 The Borders Incremental Text Mining Algorithm (cont'd)
 Concept set A = {A1, …, Am}
 Relations over A:
– R_old : the old relation
– R_inc : the increment
– R_new : the new combined relation
 s(X/R) : support of concept set X in the relation R
 s* : minimum support threshold (minsup)
 Property 1: if X is a new frequent set in R_new, then there is a subset Y ⊆ X such that Y is a promoted border.
 Property 2: if X is a new k-sized frequent set in R_new, then each subset Y ⊂ X of size k-1 is one of the following: (a) a promoted border, (b) a frequent set, or (c) an old frequent set with additional support in R_inc.

32 II.1.5 Analyzing Document Collections over Time
 The Borders Incremental Text Mining Algorithm (cont'd)
 Stage 1: Finding Promoted Borders and Generating Candidates
 Stage 2: Processing Candidates

33 II.1.5 Analyzing Document Collections over Time
 The Borders Incremental Text Mining Algorithm (cont'd)

