CSC 594 Topics in AI – Natural Language Processing

CSC 594 Topics in AI – Natural Language Processing Spring 2018 5. Word Association

What is Word Association? Word association is a relation that exists between two words. There are two basic types of relation: paradigmatic and syntagmatic.
Paradigmatic: A and B have a paradigmatic relation if they can be substituted for each other (i.e., A and B are in the same class), e.g., “cat” and “dog”; “Monday” and “Tuesday”.
Syntagmatic: A and B have a syntagmatic relation if they can be combined with each other (i.e., A and B are related semantically), e.g., “cat” and “scratch”; “car” and “drive”.
These two basic and complementary relations can be generalized to describe relations between any items in a language.
Coursera “Text Mining and Analytics”, ChengXiang Zhai

Why Mine Word Associations? Word associations are useful for improving the accuracy of many NLP tasks, such as POS tagging, parsing, entity recognition, acronym expansion, and grammar learning. They are also directly useful for many applications in text retrieval and mining:
Text retrieval (e.g., use word associations to suggest a variation of a query).
Automatic construction of a topic map for browsing, with words as nodes and associations as edges.
Comparing and summarizing opinions (e.g., which words are most strongly associated with “battery” in positive and negative reviews about the iPhone 6, respectively?).
Coursera “Text Mining and Analytics”, ChengXiang Zhai

Word Context Coursera “Text Mining and Analytics”, ChengXiang Zhai

Word Co-occurrence Coursera “Text Mining and Analytics”, ChengXiang Zhai

Mining Word Associations
Paradigmatic: represent each word by its context and compute context similarity; words with high context similarity are likely to have a paradigmatic relation.
Syntagmatic: count how many times two words occur together in a context (e.g., a sentence or paragraph) and compare their co-occurrence with their individual occurrences; words with high co-occurrence but relatively low individual occurrences are likely to have a syntagmatic relation.
Paradigmatically related words tend to have a syntagmatic relation with the same word, which enables joint discovery of the two relations. These ideas can be implemented in many different ways!
Coursera “Text Mining and Analytics”, ChengXiang Zhai
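
As a concrete illustration of the paradigmatic side, here is a minimal Python sketch (the toy corpus, window size, and helper names are illustrative assumptions, not from the lecture): each word is represented by counts of the words around it, and context similarity is measured with cosine similarity.

```python
# Paradigmatic discovery sketch: represent each word by its surrounding
# context words, then compare context vectors with cosine similarity.
from collections import Counter, defaultdict
import math

corpus = [
    "my cat eats fish on saturday".split(),
    "my dog eats meat on sunday".split(),
]
window = 2  # context = words within +/- 2 positions

context = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                context[w][sent[j]] += 1

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "cat" and "dog" share most of their contexts, so their similarity is high,
# which suggests a paradigmatic relation.
print(cosine(context["cat"], context["dog"]))
```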

Word Context as “Pseudo Document” Coursera “Text Mining and Analytics”, ChengXiang Zhai

Computing Similarity of Word Context Coursera “Text Mining and Analytics”, ChengXiang Zhai

Syntagmatic Relation – Word Collocation
A syntagmatic relation shows up as word co-occurrence, called collocation: if two words occur together in a context more often than by chance, they are in a syntagmatic relation (i.e., they are related words).
Coursera “Text Mining and Analytics”, ChengXiang Zhai

Word Probability
Word probability: how likely is a given word to appear in a text/context?
Coursera “Text Mining and Analytics”, ChengXiang Zhai

Binomial Distribution
Word (occurrence) probability is modeled by a binomial distribution.
Coursera “Text Mining and Analytics”, ChengXiang Zhai

Entropy as a Measure of Randomness
Entropy is a measure from information theory that indicates how even or skewed a distribution is: a large entropy means the distribution is even (less skewed), a small entropy means it is skewed (purer). Entropy ranges from 0 to $\log_2 c$ for a distribution over c values; for a boolean variable it lies between 0 and 1 bit. The entropy of a collection S with respect to a target attribute that takes on c values is

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

This is the average number of bits required to encode an instance in the dataset. For a boolean variable X, the entropy function yields

$$Entropy(X) = -\big[\, P(X=1) \log_2 P(X=1) + P(X=0) \log_2 P(X=0) \,\big]$$
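
A minimal sketch of the boolean case, for a word that either occurs (X=1) or does not (X=0) in a text segment; the probability values below are illustrative assumptions.

```python
import math

def entropy(probs):
    """Entropy in bits: -sum_i p_i * log2(p_i), skipping zero probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

p_occurs = 0.3                            # assumed P(X=1) for some word
print(entropy([p_occurs, 1 - p_occurs]))  # ~0.881 bits
print(entropy([0.5, 0.5]))                # 1.0 bit: maximally unpredictable
print(entropy([1.0, 0.0]))                # 0.0 bits: fully predictable
```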

Entropy for Word Probability Coursera “Text Mining and Analytics”, ChengXiang Zhai

Mutual Information (MI) as a Measure of Word Collocation
Mutual information is a concept from probability theory that measures the mutual dependence of two random variables, or equivalently a reduction of entropy: how much reduction in the entropy of X do we obtain by knowing Y (where more reduction means more predictability)?

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}$$

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
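
A small sketch of these two identities on a toy joint distribution (the numbers are illustrative assumptions): it computes I(X;Y) from the definition, in bits, and checks that the result equals H(X) - H(X|Y).

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution over X (word 1 occurs) and Y (word 2 occurs).
p = {(1, 1): 0.10, (1, 0): 0.10, (0, 1): 0.05, (0, 0): 0.75}

px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

# Definition: I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)

# Equivalent form: I(X;Y) = H(X) - H(X|Y)
h_x = entropy(px.values())
h_x_given_y = sum(py[y] * entropy([p[(x, y)] / py[y] for x in (0, 1)])
                  for y in (0, 1))
print(mi, h_x - h_x_given_y)  # the two values agree
```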

Mutual Information (MI) and Word Collocation Coursera “Text Mining and Analytics”, ChengXiang Zhai

Probabilities in MI Coursera “Text Mining and Analytics”, ChengXiang Zhai

Estimation of Word Probability Coursera “Text Mining and Analytics”, ChengXiang Zhai

Point-wise Mutual Information
Point-wise mutual information (PMI) is often used in place of MI. Whereas MI is an expectation over all outcomes of the two random variables, PMI measures the association for one specific pair of outcomes (e.g., both words occurring).
http://www.let.rug.nl/nerbonne/teach/rema-stats-meth-seminar/presentations/Suster-2011-MI-Coll.pdf

Vector Semantics: Positive Pointwise Mutual Information (PPMI)

Word-Word matrix: sample contexts of ± 7 words around each target word (example count matrix omitted).

Word-word matrix
The slide shows only a 4x6 excerpt, but the real matrix would be on the order of 50,000 x 50,000, so it is very sparse: most values are 0. That's OK, since there are lots of efficient algorithms for sparse matrices.
The size of the context window depends on your goals: the shorter the window (± 1-3 words), the more syntactic the representation; the longer the window (± 4-10 words), the more semantic the representation.
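
A short sketch of building such a co-occurrence matrix in sparse form (the toy corpus, window size, and use of scipy.sparse are illustrative assumptions):

```python
# Build a word-word co-occurrence matrix over a +/- `window` context,
# stored sparsely since most entries are zero.
from collections import Counter
from itertools import chain
from scipy.sparse import coo_matrix

corpus = [
    "the cat scratched the post".split(),
    "the dog chased the cat".split(),
]
window = 2  # +/- 1-3 -> more syntactic, +/- 4-10 -> more semantic

vocab = sorted(set(chain.from_iterable(corpus)))
index = {w: i for i, w in enumerate(vocab)}

counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(index[w], index[sent[j]])] += 1

rows, cols = zip(*counts.keys())
cooc = coo_matrix((list(counts.values()), (rows, cols)),
                  shape=(len(vocab), len(vocab)))
print(cooc.toarray())  # dense view only for this toy example
```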

Problem with raw counts
Raw word frequency is not a great measure of association between words: it is very skewed (“the” and “of” are very frequent, but maybe not the most discriminative). We would rather have a measure that asks whether a context word is particularly informative about the target word. Positive pointwise mutual information (PPMI) is such a measure.

Pointwise Mutual Information (Church & Hanks 1989)
Do events x and y co-occur more than if they were independent?

$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\, P(y)}$$

Applied to words: do words word1 and word2 co-occur more than if they were independent?

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\, P(word_2)}$$

Positive Pointwise Mutual Information
PMI ranges from $-\infty$ to $+\infty$, but the negative values are problematic: they say two things co-occur less than we expect by chance, which is unreliable without enormous corpora. Imagine $w_1$ and $w_2$ whose probabilities are each $10^{-6}$: it is hard to be sure that $p(w_1, w_2)$ is significantly different from $10^{-12}$. Plus it is not clear people are good at judging “unrelatedness”. So we just replace negative PMI values by 0. Positive PMI (PPMI) between word1 and word2:

$$\mathrm{PPMI}(word_1, word_2) = \max\!\left( \log_2 \frac{P(word_1, word_2)}{P(word_1)\, P(word_2)},\; 0 \right)$$

Computing PPMI on a term-context matrix
Given a matrix F with W rows (words) and C columns (contexts), where $f_{ij}$ is the number of times word $w_i$ occurs in context $c_j$:

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p_{i*} = \sum_{j=1}^{C} p_{ij}, \qquad p_{*j} = \sum_{i=1}^{W} p_{ij}$$

$$\mathrm{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}, \qquad \mathrm{ppmi}_{ij} = \max(\mathrm{pmi}_{ij}, 0)$$

p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37

pmi(information,data) = log2 ( .32 / (.37*.58) ) = .58 (.57 using full precision)
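
A numpy sketch of this computation; the count matrix below is a reconstruction chosen so that the (information, data) cell reproduces the worked numbers above (6/19, 11/19, 7/19), not necessarily the slide's exact table.

```python
import numpy as np

# Rows = words, columns = contexts; counts are a reconstruction for illustration.
words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

P = F / F.sum()                    # joint probabilities p_ij
pw = P.sum(axis=1, keepdims=True)  # row marginals p(w)
pc = P.sum(axis=0, keepdims=True)  # column marginals p(c)

with np.errstate(divide="ignore"):  # log2(0) -> -inf for zero-count cells
    pmi = np.log2(P / (pw * pc))
ppmi = np.maximum(pmi, 0)           # negative values (and -inf) clip to 0

i, j = words.index("information"), contexts.index("data")
print(round(pmi[i, j], 2))          # ~0.57, as in the worked example
```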

Weighting PMI
PMI is biased toward infrequent events: very rare words have very high PMI values. Two solutions:
Give rare words slightly higher probabilities.
Use add-one smoothing (which has a similar effect).

Weighting PMI: Giving rare context words slightly higher probability
Raise the context probabilities to the power $\alpha = 0.75$ and renormalize:

$$P_\alpha(c) = \frac{count(c)^{\alpha}}{\sum_{c'} count(c')^{\alpha}}$$

This helps because $P_\alpha(c) > P(c)$ for rare c. Consider two events with P(a) = .99 and P(b) = .01:

$$P_\alpha(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97, \qquad P_\alpha(b) = \frac{.01^{.75}}{.99^{.75} + .01^{.75}} = .03$$
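
A quick numeric check of the two-event example above (a sketch; the distribution is the one from the slide):

```python
import numpy as np

alpha = 0.75
p = np.array([0.99, 0.01])  # P(a), P(b) from the example

# Raise to alpha and renormalize: shifts a little mass toward the rare event.
p_alpha = p**alpha / np.sum(p**alpha)
print(p_alpha.round(2))     # [0.97 0.03]
```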

Other Word Collocation Measures
Likelihood Ratio (used in SAS Enterprise Miner)
• n is the number of documents that contain term B.
• k is the number of documents containing both term A and term B.
• p = k/n is the probability that term A occurs when term B occurs, assuming that the two terms are independent of each other.
The strength of association between terms A and B is then the natural log of the inverse of the probability of obtaining at least the k observed successes in a binomial distribution:

$$\mathrm{Strength} = \log_e \frac{1}{\mathrm{Prob}_k}, \qquad \mathrm{Prob}_k = \sum_{r=k}^{n} \binom{n}{r}\, p^{r} (1-p)^{(n-r)}$$
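
A sketch of this strength measure using scipy; the function name and interface are illustrative, not SAS Enterprise Miner's implementation. p is passed in explicitly, with the slide's p = k/n used in the example call.

```python
import math
from scipy.stats import binom

def strength(n, k, p):
    """Strength = ln(1/Prob), where Prob = P(X >= k) for X ~ Binomial(n, p)."""
    prob = binom.sf(k - 1, n, p)  # P(X >= k) = sum_{r=k}^{n} C(n,r) p^r (1-p)^(n-r)
    return math.log(1.0 / prob)

# Example with the concept-linking counts from the next slide: term B (insulin)
# is in n=58 documents, k=14 of which also contain term A (diabetes).
n, k = 58, 14
print(strength(n, k, p=k / n))
```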

Conditional Counts: Concept Linking
Centered term: the term chosen to investigate. Concept-linked term: a term that co-occurs with the centered term.
In this diagram the centered term is diabetes, which occurs in 63 documents: diabetes (63/63). The term insulin (and its stemmed variations) occurs in 58 documents, 14 of which also contain diabetes: +insulin (14/58).

The term diabetes occurs in 63 documents.

The term insulin and its variants occur in 58 documents, and 14 of those documents also contain the term diabetes.

Terms that are primary associates of insulin are secondary associates of diabetes.
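
A minimal sketch of the conditional counts behind concept linking (the tiny document collection below is an assumption for illustration; the lecture's example has diabetes in 63 documents and insulin in 58, 14 of them shared).

```python
# Count, for a centered term, how many documents contain each co-occurring
# term and how many of those also contain the centered term.
docs = [
    {"diabetes", "insulin"},
    {"diabetes", "diet"},
    {"insulin", "pump"},
    {"diabetes", "insulin", "glucose"},
]

centered = "diabetes"
with_centered = [d for d in docs if centered in d]
print(f"{centered} ({len(with_centered)}/{len(with_centered)})")

for term in sorted(set().union(*docs) - {centered}):
    n_term = sum(1 for d in docs if term in d)
    n_both = sum(1 for d in with_centered if term in d)
    print(f"+{term} ({n_both}/{n_term})")  # same (k/n) notation as the diagram
```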