
1 CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
5. Document Representation and Information Retrieval

2 Typical Text Mining Work Flow (1)
1. Define the task: what exactly you want to extract from the documents or analyze them for.
   - E.g., sentiment analysis, prediction from text
2. Collect documents and create a corpus (a collection of documents).
3. Transform the documents to narrow down the part of the text to analyze.
   - A whole document (e.g., emails)
   - Part of a document (e.g., user comments)
   - Sentence-based (rather than document-based)
   - Specific words / co-occurrences / n-grams
4. Determine exactly which features would be useful.
   - Named entities, numbers, adjectives, etc.
"Common Text Mining Workflow", by Ricky Ho (https://dzone.com/articles/common-text-mining-workflow)

3 Typical Text Mining Work Flow (2)
5. Represent the documents by a Document * Term matrix (the 'Bag-of-Words' approach).
   - Rows: documents; columns: words/terms
   - Each cell holds the frequency of the term in the document
6. Possibly reduce the dimensions of the matrix.
   - Using SVD, multi-dimensional scaling, etc.
   - Drop unnecessary terms/columns.
7. Feed the matrix to mining/analytic techniques.
   - Supervised: regression, classification
   - Unsupervised: clustering
"Common Text Mining Workflow", by Ricky Ho (https://dzone.com/articles/common-text-mining-workflow)

4 Raw Document by Term Matrix
For text mining, the "bag-of-words model" is commonly used as the feature set. In this model, each document is represented as a word vector: a high-dimensional vector whose entries indicate the importance of each word in the document. Hence all documents in the corpus are represented together as one giant document-by-term matrix.

Document by Term Matrix (rows = documents, i.e., observations Obs 1 … Obs N; columns = terms, i.e., variables Var 1 … Var 5000):

         apple  cat  cats  dog  dogs  farm  …  White House  Senate
Doc 1      1     1    2     2    0     1    …       0          0
Doc 2      0     1    0     1    1     0    …       3          2
Doc 3      0     1    0     0    1     0    …       4          4
…          …     …    …     …    …     …    …       …          …
Doc N      2     2    2     3    0     1    …       0          0
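As a concrete illustration, here is a minimal sketch of building such a raw document-by-term count matrix in Python, assuming scikit-learn is available; the toy documents are invented for illustration.

```python
# Minimal sketch: build a raw document-by-term (bag-of-words) count matrix.
# Assumes scikit-learn is installed; the toy corpus is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat and the dog saw cats on the farm",
    "the senate met near the white house",
    "dogs chased the cat and the senate debated at the white house",
]

vectorizer = CountVectorizer()             # default tokenization and lowercasing
X = vectorizer.fit_transform(docs)         # sparse matrix: rows = documents, columns = terms

print(vectorizer.get_feature_names_out())  # the term (column) labels
print(X.toarray())                         # raw frequency counts per document
```

With a real corpus the resulting matrix has thousands of columns, which is exactly the situation the following slides address.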

5 Zipf's Law
"The product of the frequency of words (f) and their rank (r) is approximately constant."
Let t_1, t_2, …, t_n be the terms in a document collection, arranged from most frequent to least frequent, and let f_1, f_2, …, f_n be the corresponding frequencies. Then the frequency f_k of term t_k is proportional to 1/k.
Zipf's law and its variants help quantify the importance of terms in a document collection. (Konchady 2006)
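As a rough empirical check, here is a minimal Python sketch (standard library only; the repeated toy text stands in for a real corpus) that ranks terms by frequency and prints rank times frequency, which Zipf's law predicts should be roughly constant:

```python
# Minimal sketch: check that rank * frequency is roughly constant (Zipf's law).
# The toy text below is a stand-in; in practice you would load a real corpus.
from collections import Counter

text = ("the cat sat on the mat the dog sat on the farm "
        "the senate met and the cat saw the dog") * 50

counts = Counter(text.split())
ranked = counts.most_common()            # [(term, freq), ...] sorted by frequency

for rank, (term, freq) in enumerate(ranked, start=1):
    print(f"{rank:>4} {term:<10} f={freq:<6} rank*f={rank * freq}")
```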

6 Relevance of Zipf's Law to Text Mining
- Often, a few very frequent terms are not good discriminators.
  - Stop words, for example: the, and, an, or, of
  - These are often what linguistics calls closed-class words, a grammatical class that does not admit new members.
- Typically, a document collection contains:
  - a large number of infrequent terms,
  - a moderate number of terms of average frequency, and
  - a small number of high-frequency terms.
=> Terms that are neither high nor low frequency are the most informative.

7 Raw Document by Term Matrix
- The raw document-by-term matrix shows the frequency with which each term is used in each document. Here you can think of the documents as observations and the terms as variables.
- In that table (the same matrix shown on slide 4), each document is represented by a row vector of 5000 frequencies.
- Doc 1 has the row vector (1, 1, 2, 2, 0, 1, …, 0, 0).
- Notice that Doc 1 and Doc N have somewhat similar vectors, as do Doc 2 and Doc 3.

8 Applying Stemming, Filtering, and So On
- In the previous document-by-term table, each document was represented by 5000 terms. That is quite a lot of variables.
- By stemming terms, for example mapping cat and cats to the same stem, you reduce the number of columns of the document-by-term matrix.
- Applying synonym lists and filtering out very common and very rare terms also reduces the number of columns. In this example, you go from 5000 to 1000 terms.

Reduced Document by Term Matrix after Stemming, Filtering, Synonyms, and So On (columns = Var 1 … Var 1000):

         apple  cat (stemmed)  dog (stemmed)  farm  …  White House  Senate
Doc 1      1         3               2          1   …       0          0
Doc 2      0         1               2          0   …       3          2
Doc 3      0         1               1          0   …       4          4
…          …         …               …          …   …       …          …
Doc N      2         4               3          1   …       0          0
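A minimal sketch of this kind of vocabulary reduction, assuming scikit-learn; the one-line plural-stripping "stemmer" and the min_df cutoff are simplifications invented for illustration (a real pipeline would use a proper stemmer such as Porter's plus curated stop-word and synonym lists):

```python
# Minimal sketch: shrink the vocabulary by crude stemming plus frequency filtering.
# The "stemmer" below only strips a trailing 's' and is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat and the dogs saw cats on the farm",
    "the senate met at the white house",
    "dogs chased the cat near the white house and the senate",
]

def crude_stem(doc):
    # merge simple plurals, e.g. cat/cats and dog/dogs
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w
                    for w in doc.lower().split())

vectorizer = CountVectorizer(
    preprocessor=crude_stem,   # apply the toy stemmer before tokenizing
    stop_words="english",      # drop very common closed-class words
    min_df=2,                  # drop terms appearing in fewer than 2 documents
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```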

9 Transposing: Term by Document Matrix (Inverted List)
- Transposing the table into a term-by-document matrix of course provides exactly the same information.
- This term-by-document matrix is often the one presented for analytic purposes.
- One can also think of the terms as the objects and the documents as the variables. The term apple is represented by the vector (1, 0, 0, …, 2).
- In this table, the terms White House and Senate have similar row vectors.

Term by Document Matrix (rows = terms, i.e., observations Obs 1 … Obs 1000; columns = documents, i.e., variables Var 1 … Var N):

                 Doc 1  Doc 2  Doc 3  …  Doc N
apple              1      0      0    …    2
cat (stemmed)      3      1      1    …    4
dog (stemmed)      2      2      1    …    3
farm               1      0      0    …    1
…                  …      …      …    …    …
White House        0      3      4    …    0
Senate             0      2      4    …    0

10 The Sparse, High-Dimensional Vector Spaces
- After the frequency counts are obtained, you see that both terms and documents can be represented in vector spaces.
- However, in both cases, even after stemming and other filtering steps have been applied, you usually still face a very high-dimensional data set.
- In addition, the matrices of frequency counts are very sparse, because many words appear in only one or two documents. Typically, 90% or more of the cells in the matrices are 0.
- Also, the frequency counts are highly skewed, as described by Zipf's law: a small number of words occur many times.

11 Handling These Problems
- The problems of high dimensionality and sparseness are addressed by applying a key theorem of linear algebra, the singular value decomposition (SVD); see the sketch after this list.
- The problem of skewed frequency counts is addressed by applying weights to the frequencies.
  - Local weights, also called frequency weights, are calculated for each term in each document.
  - Term weights, also called global weights, are calculated for each term across the collection.
  - The final weight for each cell is the product of the two.
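A minimal sketch of the SVD-based dimension reduction, assuming NumPy; the random, mostly-zero count matrix stands in for a real weighted document-by-term matrix:

```python
# Minimal sketch: reduce a document-by-term matrix to k dimensions with a truncated SVD.
# Assumes NumPy; the random matrix stands in for a real (weighted) document-by-term matrix.
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(20, 100)).astype(float)   # 20 docs x 100 terms, mostly zeros

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5                                    # keep the k largest singular values
docs_k = U[:, :k] * s[:k]                # documents in the reduced k-dimensional space
terms_k = Vt[:k, :].T * s[:k]            # terms in the same reduced space

print(docs_k.shape, terms_k.shape)       # (20, 5) (100, 5)
```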

12 Step 1: Frequency Weights (Local Weights)
- There are three options for the frequency weights in the Text Filter node; the default is Log.
  - None:   L_i,j = a_i,j
  - Binary: L_i,j = 1 if term i appears in document j, and 0 otherwise
  - Log:    L_i,j = log_2(a_i,j + 1)
- Here a_i,j is the number of times that term i appears in document j.
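A minimal sketch of the three options applied elementwise to a matrix of raw counts a_i,j, assuming NumPy; the small matrix is invented for illustration (the weighting is elementwise, so it does not matter whether rows are terms or documents):

```python
# Minimal sketch: the three local (frequency) weight options on a raw count matrix.
# Assumes NumPy; the tiny count matrix below is invented for illustration.
import numpy as np

A = np.array([[1, 1, 2, 2, 0, 1],         # raw counts a_ij
              [0, 1, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0]], dtype=float)

L_none   = A                              # None:   L_ij = a_ij
L_binary = (A > 0).astype(float)          # Binary: 1 if the term occurs, else 0
L_log    = np.log2(A + 1.0)               # Log:    log2(a_ij + 1), the default

print(L_log.round(3))
```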

13 Weighted Term-Document Frequency Matrix

                     Documents
Term   ID    D1      D2      …    Dn
T1     1     L_1,1   L_1,2   …    L_1,n
T2     2     L_2,1   L_2,2   …    L_2,n
…      …     …       …       …    …

L_i,j = frequency weight for term i and document j

14 Step 2: Term Weights (Global Weights)
- There are four options for the term weight of each term:
  - Entropy (the default when no target is present)
  - Inverse Document Frequency (IDF)
  - Mutual Information (only used with a target, and the default when a target is present)
  - None

15 Term Weight Formulas
Notation:
- a_i,j = number of times term i appears in document j
- g_i = number of times term i appears in the whole document collection
- n = number of documents in the collection
- d_i = number of documents in which term i appears
- p_i,j = a_i,j / g_i
continued...

16 Term Weight Formulas
Entropy: a measure of information content (from information theory); the weight used here is actually 1 minus the normalized entropy of the term's distribution over documents:

  G_i = 1 + Σ_j [ p_i,j log_2(p_i,j) ] / log_2(n)

so a term concentrated in a few documents gets a weight near 1 (high information), while a term spread evenly over all documents gets a weight near 0 (low information).
Entropy is often used for relatively short texts.
continued...
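A minimal sketch of that entropy weight, assuming NumPy, with rows as terms, columns as documents, and 0·log2(0) treated as 0; the counts are invented for illustration:

```python
# Minimal sketch: entropy term weight G_i = 1 + sum_j p_ij * log2(p_ij) / log2(n),
# i.e. 1 minus the normalized entropy of term i's distribution over documents.
# Assumes NumPy; rows of A are terms, columns are documents.
import numpy as np

A = np.array([[10, 0, 0, 0],              # a term concentrated in one document
              [3, 3, 2, 2]], dtype=float) # a term spread across all documents
n = A.shape[1]                            # number of documents

g = A.sum(axis=1, keepdims=True)          # g_i: collection frequency of each term
p = np.divide(A, g, out=np.zeros_like(A), where=g > 0)

plogp = np.where(p > 0, p * np.log2(p), 0.0)   # treat 0*log2(0) as 0
G_entropy = 1.0 + plogp.sum(axis=1) / np.log2(n)

print(G_entropy.round(4))                 # concentrated term ~1.0, spread-out term near 0
```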

17 Term Weight Formulas
IDF (Inverse Document Frequency):

  G_i = log_2(n / d_i) + 1

so terms that appear in few documents get high weights (high information), and terms that appear in every document get the minimum weight of 1 (low information). (This is the form consistent with the IDF values in the simulation study on slide 24.)
IDF is commonly used when texts are more than a paragraph long.
continued...
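A minimal sketch of the IDF weight in that form, assuming NumPy; the counts are invented for illustration:

```python
# Minimal sketch: IDF term weight G_i = log2(n / d_i) + 1, where d_i is the number
# of documents containing term i. Assumes NumPy; rows of A are terms, columns are documents.
import numpy as np

A = np.array([[10, 0, 0, 0],               # a rare term: appears in 1 of 4 documents
              [3, 3, 2, 2]], dtype=float)  # a common term: appears in all 4 documents
n = A.shape[1]

d = (A > 0).sum(axis=1)                    # d_i: document frequency of each term
G_idf = np.log2(n / d) + 1                 # high for rare terms, 1 for ubiquitous terms

print(G_idf)                               # [3. 1.]
```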

18 Term Weight Formulas
Mutual Information:

  G_i = max over k of  log[ P(t_i, C_k) / ( P(t_i) P(C_k) ) ]

where
- C_1, C_2, …, C_k are the k levels of a categorical target variable,
- P(t_i) is the proportion of documents containing term i,
- P(C_k) is the proportion of documents having target level C_k,
- P(t_i, C_k) is the proportion of documents where term i is present and the target is C_k,
- (note that 0 ≤ G_i < ∞ and the log is base 10).
Mutual Information is used when the data has a target variable (i.e., for supervised learning/classification).
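A minimal sketch of that mutual-information weight, assuming NumPy; the binary term-presence matrix and target labels are invented for illustration:

```python
# Minimal sketch: mutual-information term weight
# G_i = max_k log10( P(t_i, C_k) / (P(t_i) * P(C_k)) ).
# Assumes NumPy; presence matrix and target labels are invented for illustration.
import numpy as np

presence = np.array([[1, 1, 0, 0, 0, 0],   # term 0: present only in "pos" documents
                     [1, 0, 1, 0, 1, 0]])  # term 1: spread across both classes
target = np.array(["pos", "pos", "pos", "neg", "neg", "neg"])

n = presence.shape[1]
levels = np.unique(target)

G_mi = []
for t in presence:
    p_t = t.sum() / n                                    # P(t_i)
    scores = []
    for c in levels:
        p_c = (target == c).mean()                       # P(C_k)
        p_tc = (t[target == c] == 1).sum() / n           # P(t_i, C_k)
        if p_tc > 0:
            scores.append(np.log10(p_tc / (p_t * p_c)))
    G_mi.append(max(scores))

print(np.round(G_mi, 4))   # the class-specific term gets the larger weight
```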


20 Step 3: Weighted Term-Document Frequency Matrix
After the frequency (local) and term (global) weights have been calculated, the final weight used for each cell is the product of the two:

  â_i,j = G_i L_i,j

where G_i is the term (global) weight for term i, and L_i,j is the frequency (local) weight for term i in document j.
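Putting the two steps together, here is a minimal end-to-end sketch assuming NumPy, log local weights, and the entropy global weight defined earlier; the raw counts are invented for illustration:

```python
# Minimal sketch: final weighted matrix a_hat_ij = G_i * L_ij, using log local weights
# and entropy global weights. Assumes NumPy; rows are terms, columns are documents.
import numpy as np

A = np.array([[1, 0, 0, 2],
              [1, 1, 1, 2],
              [2, 0, 0, 2],
              [0, 3, 4, 0]], dtype=float)
n = A.shape[1]

L = np.log2(A + 1.0)                                   # local (frequency) weights

g = A.sum(axis=1, keepdims=True)                       # global (term) weights: entropy
p = np.divide(A, g, out=np.zeros_like(A), where=g > 0)
plogp = np.where(p > 0, p * np.log2(p), 0.0)
G = 1.0 + plogp.sum(axis=1) / np.log2(n)

A_hat = G[:, None] * L                                 # final weighted term-document matrix
print(A_hat.round(3))
```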

21 Weighted Term-Document Frequency Matrix

                 Documents
Terms    D1      D2      …    Dn
T1       â_1,1   â_1,2   …    â_1,n
T2       â_2,1   â_2,2   …    â_2,n
…        …       …       …    …
Tm       â_m,1   â_m,2   …    â_m,n

22 Term Weight Guidelines
- When a target is present (e.g., ground-truth category labels), Mutual Information is the default. It is a good choice when it can be used.
- Entropy and IDF weights give higher weights to rare or low-frequency terms.
- Entropy and IDF weights give moderate to high weights to terms that appear with moderate to high frequency but in a small number of documents.
- Entropy and IDF weights vary inversely with the number of documents in which a term appears.
- Entropy is often superior for distinguishing between small documents that contain only a few sentences.
- Entropy is the only term weight that depends on the distribution of terms across documents.
continued...

23 Term Weight Guidelines
- Remember, you can suppress both frequency weights and term weights. If you choose no weights, then the raw cell counts are analyzed; that is, â_i,j = a_i,j.
- Be experimental. Try different weight settings to find what gives the most interpretable or most predictive results for your data.

24 Simulation Study of Term Weights

Term        Term Freq   Doc Freq   Entropy   IDF      Mutual Information
armadillo   102         2          0.8495    6.6439   0.4943
bear        105         64         0.1264    1.6439   0.1839
cat         113         59         0.1405    1.7612   0.0421
cow         110         66         0.1107    1.5995   0.2177
dog         107         66         0.1183    1.5995   0.0478
gopher      106         55         0.1580    1.8625   0.2665
hamster     109         65         0.1194    1.6215   0.4308
horse       109         62         0.1315    1.6897   0.1818
kitten      105         62         0.1303    1.6897   0.0307
moose       1934        100        0.0973    1.0000   0.0000
mouse       108         63         0.1296    1.6666   0.0943
otter       1           1          1.0000    7.6439   0.4943
pig         107         58         0.1440    1.7859   0.0592
puppy       115         58         0.1576    1.7859   0.5447
raccoon     967         50         0.2478    2.0000   0.1086
seal        10          10         0.5000    4.3219   0.2712
squirrel    100         100        0.0000    1.0000   0.0000
tiger       117         70         0.1027    1.5146   0.0070
walrus      25          25         0.3010    3.0000   0.0480
zebra       3812        100        0.1008    1.0000   0.0000

Note: N = 100 documents.

25 Retrieval of Documents
Finally, when a query is presented, we process each of its words, aggregate/combine the document vectors associated with those words, and return as the query result the documents whose relevancy value is above a threshold.
Other ways to retrieve documents (in IR):
- Document similarity: represent documents by the Doc*Term matrix, and compute the similarity between two documents using the cosine of the angle between the two term vectors, or other measures such as Euclidean distance.
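A minimal sketch of both ideas, assuming NumPy: score documents for a query by combining the query terms' rows of a (weighted) term-by-document matrix, and compare two documents by cosine similarity. The vocabulary, weights, and threshold are invented for illustration.

```python
# Minimal sketch: (1) retrieve documents for a query by combining the query terms'
# rows of a weighted term-by-document matrix; (2) cosine similarity between documents.
# Assumes NumPy; vocabulary, weights, and threshold are invented for illustration.
import numpy as np

terms = ["apple", "cat", "dog", "farm", "white house", "senate"]
W = np.array([[1.0, 0.0, 0.0, 2.0],      # rows: terms, columns: documents
              [3.0, 1.0, 1.0, 4.0],      # (already locally/globally weighted)
              [2.0, 2.0, 1.0, 3.0],
              [1.0, 0.0, 0.0, 1.0],
              [0.0, 3.0, 4.0, 0.0],
              [0.0, 2.0, 4.0, 0.0]])

def retrieve(query_words, threshold=1.0):
    rows = [W[terms.index(w)] for w in query_words if w in terms]
    scores = np.sum(rows, axis=0)                     # combine the terms' document vectors
    return [d for d, s in enumerate(scores) if s > threshold]

def cosine(doc_a, doc_b):
    a, b = W[:, doc_a], W[:, doc_b]                   # documents as term vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(retrieve(["cat", "farm"]))          # indices of documents scoring above the threshold
print(round(cosine(0, 3), 3))             # similarity between Doc 0 and Doc 3
```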

