CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
5. Document Representation and Information Retrieval

Typical Text Mining Work Flow (1)
1. Define the task: what exactly you want to extract from the documents, or what you want to analyze them for.
   – E.g., sentiment analysis, prediction from text
2. Collect documents and create a corpus (a collection of documents).
3. Transform the documents to narrow down the part of the text to analyze.
   – A whole document (e.g., e-mails)
   – Part of a document (e.g., user comments)
   – Sentence-based (rather than document-based)
   – Specific words/co-occurrences/n-grams
4. Determine exactly which particular features would be useful.
   – Named entities, numbers, adjectives, etc.
"Common Text Mining Workflow", by Ricky Ho

Typical Text Mining Work Flow (2)
5. Represent the documents by a Doc * Term matrix (the 'Bag-of-Words' approach).
   – Rows: documents; Columns: words/terms
   – Each cell shows the frequency of the term in the document
6. Possibly reduce the dimensions of the matrix.
   – Using SVD, multi-dimensional scaling, etc.
   – Drop unnecessary terms/columns.
7. Feed the matrix to mining/analytic techniques.
   – Supervised: regression, classification
   – Unsupervised: clustering
"Common Text Mining Workflow", by Ricky Ho
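
As a concrete illustration of steps 5-7, here is a minimal sketch that builds the Doc * Term matrix with scikit-learn's CountVectorizer; the library choice and the toy corpus are assumptions for illustration, not part of the original slides.

from sklearn.feature_extraction.text import CountVectorizer

docs = [                                   # hypothetical three-document corpus
    "the cat chased the dog on the farm",
    "the dog slept on the farm",
    "the senate debated the farm bill",
]

vectorizer = CountVectorizer()             # bag-of-words: one column per term
X = vectorizer.fit_transform(docs)         # Doc x Term matrix (sparse), rows = documents

print(vectorizer.get_feature_names_out())  # term (column) labels
print(X.toarray())                         # raw term frequencies per document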

Raw Document by Term Matrix
For text mining, the "bag-of-words" model is commonly used as the feature set. In this model, each document is represented as a word vector: a high-dimensional vector whose entries reflect the importance of each word in the document. Hence all the documents in the corpus together form one giant document-by-term matrix.

Document by Term Matrix (documents as observations/rows, terms as variables/columns):

           apple   cat   cats   dog   dogs   farm   ...   White House   Senate
  Doc 1      1      1     2      2     0      1     ...        0           0
  Doc 2      0      …     …      …     …      0     ...        3           2
  Doc 3      0      …     …      …     …      0     ...        4           4
  ...       ...    ...   ...    ...   ...    ...    ...       ...         ...
  Doc N      2      2     2      3     0      1     ...        0           0

Zipf’s Law
Let t_1, t_2, …, t_n be the terms in a document collection, arranged in order from most frequent to least frequent, and let f_1, f_2, …, f_n be the corresponding frequencies of the terms. Then the frequency f_k of term t_k is proportional to 1/k.
"The product of the frequency of words (f) and their rank (r) is approximately constant."
Zipf’s law and its variants help quantify the importance of terms in a document collection. (Konchady 2006)
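
A quick way to see Zipf's law on real data is to rank the words of a corpus by frequency and check that rank times frequency stays roughly constant. A small sketch (the corpus file name is a placeholder):

from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:    # any plain-text corpus
    words = f.read().lower().split()

counts = Counter(words)
for rank, (term, freq) in enumerate(counts.most_common(15), start=1):
    # r * f should be roughly the same constant across ranks
    print(f"{rank:>3}  {term:<15} f = {freq:<7} r * f = {rank * freq}")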

Relevance of Zipf’s Law to Text Mining
– Often, a few very frequent terms are not good discriminators.
  - stop words, for example: the, and, an, or, of
  - often words that are described in linguistics as closed-class words, a grammatical class that does not get new members
– Typically, a document collection contains:
  - a high number of infrequent terms
  - an average number of average-frequency terms
  - a low number of high-frequency terms
→ Terms that are neither high nor low frequency are the most informative.

Raw Document by Term Matrix
– The raw document by term matrix shows the frequency with which each term was used in each document. Here you can think of documents as observations and terms as variables.
– In the document by term table above, each document is represented by a row vector of 5000 frequencies.
– Doc 1 has the row vector (1, 1, 2, 2, 0, 1, …, 0, 0).
– Notice that Doc 1 and Doc N have somewhat similar vector values, as do Doc 2 and Doc 3.

Applying Stemming, Filtering, and So On
– In the previous document by term table, each document was represented by 5000 terms. That is quite a lot of variables.
– By stemming terms, such as putting cat and cats together, you reduce the number of columns of the document by term matrix.
– Applying synonym lists and filtering out very common and very rare terms also reduces the number of columns. In this example, you go from 5000 to 1000 terms.

Reduced Document by Term Matrix after Stemming, Filtering, Synonyms, and So On:

           apple   cat (stemmed)   dog (stemmed)   farm   ...   White House   Senate
  Doc 1      1           3               2           1    ...        0           0
  Doc 2      0           1               2           0    ...        3           2
  Doc 3      0           1               1           0    ...        4           4
  ...       ...         ...             ...         ...   ...       ...         ...
  Doc N      2           4               3           1    ...        0           0
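
A rough sketch of the stemming and filtering step, using NLTK's Porter stemmer together with scikit-learn's document-frequency cut-off; the specific libraries, thresholds, and toy corpus are assumptions, not the tooling used in the slides.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def stem_tokens(text):
    # lowercase, drop stop words, then stem so that cat/cats share one column
    tokens = [t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

docs = [                                    # hypothetical toy corpus
    "the cats chased a dog on the farm",
    "my dog and my cat sleep on the farm",
    "the senate debated the farm bill",
]

vectorizer = CountVectorizer(
    tokenizer=stem_tokens,                  # custom tokenizer with stemming
    min_df=2,                               # drop terms appearing in fewer than 2 documents
)
X_reduced = vectorizer.fit_transform(docs)  # fewer columns than the raw matrix
print(vectorizer.get_feature_names_out())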

Transposing: Term by Document Matrix (Inverted List)
– Transposing the table into a term by document matrix of course provides exactly the same information.
– This term by document matrix is often the one presented for analytic purposes.
– One can also think of the terms as the objects and the documents as the variables. The term apple is represented by the vector (1, 0, 0, …, 2).
– In this table, the terms White House and Senate have similar row vectors.

Term by Document Matrix:

                    Doc 1   Doc 2   Doc 3   ...   Doc N
  apple               1       0       0     ...     2
  cat (stemmed)       3       1       1     ...     4
  dog (stemmed)       2       2       1     ...     3
  farm                1       0       0     ...     1
  ...                ...     ...     ...    ...    ...
  White House         0       3       4     ...     0
  Senate              0       2       4     ...     0

The Sparse, High-Dimensional Vector Spaces
– After the frequency counts are obtained, you see that both terms and documents can be represented in vector spaces.
– However, in both cases, even after stemming and other filtering steps have been applied, you usually still face a very high-dimensional data set.
– In addition, the matrices of frequency counts are very sparse, because many words appear in only one or two documents. Typically, 90% or more of the cells in the matrices are 0.
– Also, the frequency counts are highly skewed, as described by Zipf’s law: a small number of words occur many times.
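
The usual remedy for storage is to keep the matrix in a sparse format that records only the non-zero cells. A small sketch with SciPy; the values and the library choice are illustrative assumptions.

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 0, 2],             # a tiny document-by-term count matrix
                  [0, 3, 0, 0],
                  [0, 0, 0, 1]])

sparse = csr_matrix(dense)                  # compressed sparse row storage
print(sparse.nnz, "non-zero cells out of", dense.size)   # 4 out of 12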

Handling These Problems
– The problems of high dimensionality and sparseness will be addressed by the application of a key theorem in linear algebra called singular value decomposition (SVD).
– The problem of skewed frequency counts is addressed by applying weights to the frequencies:
  - Local weights, also called frequency weights, are calculated for each term i in each document j.
  - Term weights, also called global weights, are calculated for each term i.
  - The final weight for each cell is the product of the two.

Step 1: Frequency Weights (Local Weights)
– There are three options for the frequency weights in the Text Filter node:
  - None:   L_i,j = a_i,j
  - Binary: L_i,j = 1 if term i is in document j, 0 otherwise
  - Log:    L_i,j = log2(a_i,j + 1)
  where a_i,j is the number of times that term i appears in document j.
– The default is Log.
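
The three options translate directly into array operations. A minimal sketch with NumPy, using an assumed toy count matrix a (rows = terms, columns = documents, as in the matrix on the next slide):

import numpy as np

a = np.array([[1, 0, 3],                    # a[i, j] = count of term i in document j
              [2, 1, 0]], dtype=float)

L_none   = a                                # None:   L_ij = a_ij
L_binary = (a > 0).astype(float)            # Binary: L_ij = 1 if term i is in doc j
L_log    = np.log2(a + 1)                   # Log:    L_ij = log2(a_ij + 1)  (the default)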

Weighted Term-Document Frequency Matrix

                       Documents
  Term   ID    D_1     D_2     ...    D_n
  T_1     1    L_1,1   L_1,2   ...    L_1,n
  T_2     2    L_2,1   L_2,2   ...    L_2,n
  ...    ...   ...     ...     ...    ...

  L_i,j = frequency weight for term i and document j

Step 2: Term Weights (Global Weights)
– There are four options for choosing the term weight for term i:
  - Entropy (the default when no target is present)
  - Inverse Document Frequency (IDF)
  - Mutual Information (only used with a target, and the default when a target is present)
  - None

Term Weight Formulas
  a_i,j = frequency with which term i appears in document j
  g_i   = frequency with which term i appears in the document collection
  n     = number of documents in the collection
  d_i   = number of documents in which term i appears
  p_i,j = a_i,j / g_i
continued...

Term Weight Formulas
Entropy – a measure of information content (in Information Theory); the weight used here is actually 1 – normalized entropy, i.e.

  G_i = 1 + Σ_j [ p_i,j * log2(p_i,j) ] / log2(n)

so a term spread evenly across all documents carries low information and gets a weight near 0, while a term concentrated in few documents carries high information and gets a weight near 1.
Entropy is often used for relatively short texts.
continued...

Term Weight Formulas
IDF (Inverse Document Frequency), in one common form:

  G_i = log2(n / d_i) + 1

A term that appears in many documents carries low information and gets a low weight; a term that appears in few documents carries high information and gets a high weight.
IDF is commonly used when texts are more than a paragraph long.
continued...

Term Weight Formulas
Mutual Information:

  G_i = max over k of  log10[ P(t_i, C_k) / ( P(t_i) * P(C_k) ) ]

where
– C_1, C_2, …, C_k are the k levels of a categorical target variable.
– P(t_i) is the proportion of documents containing term i.
– P(C_k) is the proportion of documents having target level C_k.
– P(t_i, C_k) is the proportion of documents where term i is present and the target is C_k.
– (Note that 0 ≤ G_i < ∞ and the log is base 10.)
Mutual Information is used when the data has a target variable (i.e., for supervised learning/classification).
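
The Entropy and IDF weights above can be computed directly from the count matrix. A sketch with NumPy; the exact formula variants (1 minus normalized entropy, and log2(n/d_i) + 1) are the common forms assumed here, and Mutual Information is omitted because it needs a categorical target.

import numpy as np

a = np.array([[1, 0, 3],                     # a[i, j]: count of term i in document j
              [2, 1, 1]], dtype=float)

n = a.shape[1]                               # number of documents
g = a.sum(axis=1, keepdims=True)             # g_i: frequency of term i in the collection
d = (a > 0).sum(axis=1)                      # d_i: number of documents containing term i

p = np.where(a > 0, a / g, 1.0)              # p_ij; zero cells set to 1 so p*log2(p) = 0
G_entropy = 1 + (p * np.log2(p)).sum(axis=1) / np.log2(n)   # 1 - normalized entropy
G_idf     = np.log2(n / d) + 1               # inverse document frequency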


Step 3: Weighted Term-Document Frequency Matrix
After the frequency (local) and term (global) weights have been calculated, the final weight used for each cell is the product of the two:

  â_i,j = G_i * L_i,j

where G_i is the term (global) weight for term i, and L_i,j is the frequency (local) weight for term i in document j.
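
Putting the two steps together, the weighted matrix is just the element-wise product. A self-contained sketch, with the Log local weight and the IDF global weight chosen purely for illustration:

import numpy as np

a = np.array([[1, 0, 3],                     # raw counts: rows = terms, cols = documents
              [2, 1, 1]], dtype=float)
n = a.shape[1]
d = (a > 0).sum(axis=1)                      # document frequency of each term

L = np.log2(a + 1)                           # local (frequency) weight: Log
G = np.log2(n / d) + 1                       # global (term) weight: IDF (assumed variant)

a_hat = G[:, None] * L                       # â_ij = G_i * L_ij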

Weighted Term-Document Frequency Matrix
An m x n matrix with the terms T_1, …, T_m as rows, the documents D_1, …, D_n as columns, and the weighted frequencies â_i,j as cell values.

Term Weight Guidelines
– When a target is present (e.g., ground-truth categories are available), Mutual Information is the default. It is a good choice when it can be used.
– Entropy and IDF weights give higher weights to rare or low-frequency terms.
– Entropy and IDF weights give moderate to high weights to terms that appear with moderate to high frequency, but in a small number of documents.
– Entropy and IDF weights vary inversely with the number of documents in which a term appears.
– Entropy is often superior for distinguishing between small documents that contain only a few sentences.
– Entropy is the only term weight that depends on the distribution of terms across documents.
continued...

Term Weight Guidelines
– Remember, you can suppress both frequency weights and term weights. If you choose no weights, then the raw cell counts are analyzed; that is, â_i,j = a_i,j.
– Be experimental. Try different weight settings to find what gives you the most interpretable or most predictive results for your data.

Simulation Study of Term Weights
[Table: Term Frequency, Document Frequency, Entropy, IDF, and Mutual Information weights for 20 animal terms (armadillo, bear, cat, cow, dog, gopher, hamster, horse, kitten, moose, mouse, otter, pig, puppy, raccoon, seal, squirrel, tiger, walrus, zebra); N = 100.]

Retrieval of Documents
Finally, when a query is presented, we process each of its words, aggregate/combine the corresponding document vectors, and return as the query result the documents whose relevance value is above a threshold.
Other ways to retrieve documents (in IR):
– Document similarity – represent the documents by the Doc * Term matrix, and compute the similarity between two documents using the cosine of the angle between their term vectors, or other measures such as Euclidean distance.
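
A minimal sketch of this kind of retrieval with scikit-learn: weight the documents and the query in the same vector space, score by cosine similarity, and keep everything above a threshold. The corpus, query, and threshold value are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs and cats on the farm",
        "the senate passed the farm bill"]
query = "farm cats"

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)              # Doc x Term matrix (TF-IDF weighted)
q = vectorizer.transform([query])               # the query treated as a short document

scores = cosine_similarity(q, D).ravel()        # relevance of each document to the query
threshold = 0.1
hits = [(docs[i], s) for i, s in enumerate(scores) if s > threshold]
print(sorted(hits, key=lambda t: -t[1]))        # ranked query result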