1 Nara Institute of Science and Technology
Graph-Theoretic Approaches to Minimally-Supervised Natural Language Learning Mamoru Komachi Nara Institute of Science and Technology Good morning (or good afternoon to Prof. Pantel!). I'm Mamoru Komachi. Today, I will talk about graph-theoretic approaches to minimally supervised algorithms in natural language processing. In the previous talk, I discussed the connection between a well-known link analysis method called HITS, proposed by Kleinberg in 1999, and one of the state-of-the-art bootstrapping algorithms, called Espresso, proposed by Prof. Pantel in 2006. I will begin with a short recap of the last presentation and then talk a little more about other graph-theoretic approaches. Feb 16, 2010

2 Semantic drift is the central problem of bootstrapping
Input (Extracted from corpus) Output Instance Pattern New instance Singapore visa __ is Australia Generic patterns Patterns co-occurring with many irrelevant instances card Semantic category changed! Let me remind you of one of the central problems of minimally-supervised bootstrapping algorithms. Suppose we start with the term "Singapore" to extract other terms in the Travel category. In the course of bootstrapping, we will find patterns like "visa __ is" from a corpus. However, this pattern is not specific to location names, as it can extract many instances irrelevant to the seed instance. For example, it may extract an instance like "card", and the instance "card" will lead to erroneous instances such as "messages" and "words". The problem here is that once such *generic patterns* (emphasis) are acquired, the bootstrapping process will eventually learn instances irrelevant to the seeds. This problem is called *semantic drift* and is a major problem of bootstrapping. card greeting __ messages words Errors propagate to successive iterations

3 Two major problems solved by this work
Why does semantic drift occur? Is there any way to prevent semantic drift? So, here are the two major problems I tackled in my thesis.

4 Answers to the problems of semantic drift
Suggest a parallel between semantic drift in Espresso [Pantel and Pennacchiotti, 2006] style bootstrapping and topic drift in HITS [Kleinberg, 1999] Solve semantic drift using a "relatedness" measure (regularized Laplacian) instead of the "importance" measure (HITS authority) used in the link analysis community The main contributions of my thesis are two-fold. First, I investigated the cause of semantic drift, using a simplified version of a bootstrapping algorithm. From a graph-theoretic point of view, Espresso-style bootstrapping algorithms can be considered a computation of HITS, and so they inherit the properties of HITS. Therefore, Espresso is inherently prone to semantic drift, corresponding to topic drift in HITS. Second, I explored a way to solve semantic drift. I proposed two graph-theoretic methods to analyze and solve the problem of semantic drift. One of the lessons learned from the analysis is that a "relatedness" measure is relevant to many tasks in natural language processing, including word sense disambiguation and named entity extraction. Last time I only mentioned the relatedness measure used in the link analysis community. Today, I will talk about another measure as well. Let me start with the first one.

5 Three simplifications to reduce Espresso to HITS
For graph-theoretic analysis, we introduce three simplifications to Espresso. Repeat: pattern extraction, pattern ranking, pattern selection, instance extraction, instance ranking, instance selection, until a stopping criterion is met. The sketch of the Espresso algorithm looks like this. It follows the general framework of bootstrapping. I applied three simplifications to Espresso in order to allow a graph-theoretic analysis.
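Purely as an illustration of this loop structure, here is a schematic Python sketch; the six corpus-specific steps are passed in as callables, since their details (pattern extraction, PMI-based ranking, and so on) are not given on this slide, and the function is not the original Espresso implementation.

```python
# Schematic sketch of the Espresso-style bootstrapping loop shown on this slide.
# The corpus-specific steps are supplied as callables (hypothetical placeholders).

def espresso_loop(seeds, extract_patterns, rank_patterns, select_patterns,
                  extract_instances, rank_instances, select_instances,
                  max_iter=10):
    instances, patterns = set(seeds), set()
    for _ in range(max_iter):                                # until a stopping criterion is met
        patterns |= extract_patterns(instances)              # pattern extraction
        pattern_scores = rank_patterns(patterns, instances)  # pattern ranking
        patterns = select_patterns(pattern_scores)           # pattern selection
        instances |= extract_instances(patterns)             # instance extraction
        instance_scores = rank_instances(instances, patterns)  # instance ranking
        instances = select_instances(instance_scores)          # instance selection
    return instances, patterns
```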

6 Keep pattern-instance matrix constant in the main loop
Compute the pattern-instance matrix. Repeat: pattern extraction, pattern ranking, pattern selection, instance extraction, instance ranking, instance selection, until a stopping criterion is met. Simplification 1: remove the pattern/instance extraction steps; instead, pre-compute all patterns and instances once at the beginning of the algorithm. First, I removed the pattern and instance extraction steps from the main iteration loop. Instead, all patterns and instances are pre-computed once at the beginning of the algorithm. This modification has an additional advantage in computational efficiency, because it does not need to recompute all possible patterns and instances on each iteration. Though I will not explain this modification in detail, experimental results on its effect are described in Chapter 3.
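As a rough sketch of what pre-computing the pattern-instance matrix might look like, the following builds a PMI-weighted matrix from raw co-occurrence counts; the exact weighting used in the thesis is not shown on this slide, so the clipping of negative PMI values is an assumption.

```python
import numpy as np

def pmi_matrix(counts):
    """counts: (n_instances, n_patterns) array of co-occurrence counts.
    Returns a PMI-weighted instance-pattern matrix; negative PMI values
    are clipped to 0 (a common choice, assumed here)."""
    total = counts.sum()
    p_ip = counts / total                      # joint probability p(i, p)
    p_i = p_ip.sum(axis=1, keepdims=True)      # marginal p(i)
    p_p = p_ip.sum(axis=0, keepdims=True)      # marginal p(p)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ip / (p_i * p_p))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts contribute nothing
    return np.maximum(pmi, 0.0)
```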

7 Remove pattern/instance selection heuristics
Compute the pattern-instance matrix. Repeat: pattern ranking, pattern selection, instance ranking, instance selection, until a stopping criterion is met. Simplification 2: remove the pattern/instance selection steps, which retain only the highest-scoring k patterns / m instances for the next iteration (i.e., reset the scores of the other items to 0); instead, retain the scores of all patterns and instances. Second, I removed the pattern and instance selection heuristics from the main loop. Instead, the scores of all patterns and instances are retained on each iteration. The heuristic is to retain instances extracted in the early steps, since those instances are supposed to be less affected by semantic drift than instances extracted in later iterations.
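For concreteness, the selection heuristic that this simplification removes can be pictured as the following sketch (an illustration, not the thesis code): only the k highest-scoring items keep their scores, and everything else is reset to 0.

```python
import numpy as np

def select_top_k(scores, k):
    """Original Espresso: keep only the k highest scores, reset the rest to 0.
    Simplified Espresso skips this step and keeps all scores unchanged."""
    selected = np.zeros_like(scores)
    top = np.argsort(scores)[-k:]      # indices of the k highest-scoring items
    selected[top] = scores[top]
    return selected
```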

8 Remove early stopping heuristics
Compute the pattern-instance matrix. Repeat: pattern ranking, instance ranking, until the score vectors p and i converge. Simplification 3: no early stopping, i.e., run until convergence. Third, I dropped the stopping criterion from the algorithm. Instead, it runs until convergence. This is not actually a simplification, but it allows a graph-theoretic analysis of semantic drift.

9 Make Espresso look like HITS
p = pattern score vector, i = instance score vector, A = instance-pattern matrix. Pattern ranking: p = A^T i / |I|. Instance ranking: i = A p / |P|. Again, let me briefly review the matrix representation of the Espresso algorithm. We have a pattern score vector and an instance score vector on each iteration, written as p and i in bold font. The pattern and instance ranking updates can be rewritten in matrix form. |P| = number of patterns, |I| = number of instances; these act as normalization factors that keep the score vectors from growing too large.

10 Simplified Espresso
Input: initial score vector of seed instances; instance-pattern co-occurrence matrix M. Main loop: repeat pattern ranking (p = M^T i / |I|) and instance ranking (i = M p / |P|) until i and p converge. Output: instance and pattern score vectors i and p. OK, so after three simplifications, we obtain a simplified version of Espresso, which allows graph-theoretic treatment. What we want to extract are the top-ranked instances in the final instance score vector. It takes two inputs: an initial score vector of seed instances and an instance-pattern co-occurrence matrix. The initial score vector is a vector of dimension equal to the number of instances, with 1 at the positions of the seed instances and 0 elsewhere. The (i,p)-element of the instance-pattern matrix holds the pointwise mutual information between pattern p and instance i. The main loop iterates two steps: pattern ranking and instance ranking. It outputs the final instance and pattern score vectors when both have converged. -- There are three differences between Original Espresso and Simplified Espresso. First, no pattern extraction is performed in the main loop. We showed in *our previous work* that it is not essential for Espresso to do pattern extraction on each iteration, so we simply omit the step. Second, it only re-ranks patterns and instances on each iteration, instead of selecting the most confident patterns and instances. We will verify the effect of pattern and instance selection in the first experiment. Third, it runs until convergence instead of stopping early.
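Putting the pieces together, here is a minimal numpy sketch of Simplified Espresso following the update rules above (p = M^T i / |I|, i = M p / |P|); the tolerance and the convergence test on normalized score vectors are my own choices for illustration, since only the ranking matters.

```python
import numpy as np

def simplified_espresso(M, seed_indices, tol=1e-9, max_iter=1000):
    """M: (n_instances, n_patterns) instance-pattern matrix (e.g. PMI weights).
    seed_indices: positions of the seed instances.
    Returns the final instance and pattern score vectors."""
    n_inst, n_pat = M.shape
    i = np.zeros(n_inst)
    i[list(seed_indices)] = 1.0              # 1 at seed positions, 0 elsewhere
    p = np.zeros(n_pat)
    for _ in range(max_iter):                # run until convergence (simplification 3)
        p = M.T @ i / n_inst                 # pattern ranking
        i_new = M @ p / n_pat                # instance ranking
        # only the ranking matters, so test convergence of the normalized direction
        if np.linalg.norm(i_new / np.linalg.norm(i_new) - i / np.linalg.norm(i)) < tol:
            i = i_new
            break
        i = i_new
    return i, p
```

Run to convergence, this is essentially power iteration on M M^T, so the final ranking approaches the principal eigenvector of the co-occurrence graph regardless of the seed vector, which is precisely the drift discussed on the following slides.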

11 HITS Algorithm [Kleinberg 1999]
Input: initial hub score vector; adjacency matrix M. Main loop: repeat the authority update a = M^T h / α and the hub update h = M a / β until a and h converge (α, β: normalization factors). Output: hub and authority score vectors h and a. Here is the HITS algorithm using the same notation. As you can see, Simplified Espresso closely resembles the computation of HITS.
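For comparison, the standard HITS computation in the same style; the all-ones initial hub vector and L2 normalization (playing the role of α and β) are the usual textbook choices and are assumed here.

```python
import numpy as np

def hits(M, tol=1e-9, max_iter=1000):
    """M: (n_nodes, n_nodes) adjacency matrix; M[u, v] = 1 if u links to v.
    Returns the authority and hub score vectors a and h."""
    n = M.shape[0]
    h = np.ones(n)                       # initial hub score vector
    a = np.zeros(n)
    for _ in range(max_iter):
        a_new = M.T @ h                  # authority update
        a_new /= np.linalg.norm(a_new)   # normalization (alpha)
        h_new = M @ a_new                # hub update
        h_new /= np.linalg.norm(h_new)   # normalization (beta)
        if np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol):
            a, h = a_new, h_new
            break
        a, h = a_new, h_new
    return a, h
```

Structurally this is the same alternating multiplication by M^T and M as Simplified Espresso; the only real difference is the seed vector, which HITS forgets in the limit.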

12 Summary: Espresso and semantic drift
Semantic drift happens because Espresso is designed like HITS; HITS gives the same ranking list regardless of the seeds. Some heuristics reduce semantic drift, and early stopping is crucial for optimal performance. Still, these heuristics require many parameters to be calibrated, and calibration is difficult. To summarize what we have argued so far, semantic drift happens because Espresso is designed like HITS, and HITS gives the same ranking list for any seed. Some of the heuristics in Original Espresso help reduce semantic drift; in particular, early stopping is crucial for optimal performance. However, these heuristics require parameter tuning, and the optimal parameter values depend on the task and domain.

13 Main contributions of this work
Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999). Solve semantic drift with graph kernels used in the link analysis community. Let us move on to the second part of this talk. Here, we propose to use graph kernels as a substitute for bootstrapping in minimally-supervised NLP tasks. These kernels come from the link analysis community and are inherently less prone to semantic drift.

14 Q. What caused drift in Espresso?
A. Espresso's resemblance to HITS. HITS is an importance computation method (it gives a single ranking list for any seeds). Why not use another type of link analysis measure, one that takes the seeds into account? A "relatedness" measure (it gives different rankings for different seeds). The question, again, is the cause of semantic drift in Espresso. As I showed in the last talk, the cause is Espresso's resemblance to HITS. One may ask a further question: why does HITS cause topic drift? It is because HITS is an importance computation method, which gives a single ranking regardless of the seeds. So my suggestion is to use another kind of method from the link analysis community, a "relatedness" measure. A relatedness measure gives different rankings for different seeds and can be used as a similarity measure. My argument here is that a relatedness measure is appropriate for many NLP tasks, and I present empirical evidence for this argument.

15 The regularized Laplacian kernel
A relatedness measure. Has only one parameter. Normalized graph Laplacian: L = D^{-1/2} (D - A) D^{-1/2}, where A is the similarity matrix of the graph and D is the (diagonal) degree matrix. Regularized Laplacian matrix: R_α = sum_{n=0}^{∞} (-α L)^n = (I + α L)^{-1}, where α is a parameter. One such relatedness measure used in the link analysis community is the regularized Laplacian kernel. One of its advantages is that it takes higher-order relations into account, because it computes a weighted sum over all paths in the graph; it can therefore calculate the similarity between instances and patterns that do not directly co-occur. Another advantage is that it has only one parameter, and its performance is stable across that parameter. Each column of R_α gives the rankings relative to one node.
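A minimal numpy sketch of the kernel as written above, using the closed form R_α = (I + αL)^{-1} with the normalized Laplacian L = D^{-1/2}(D - A)D^{-1/2}; the dense inverse is only meant for small graphs.

```python
import numpy as np

def regularized_laplacian(A, alpha):
    """A: symmetric (n, n) similarity matrix of the graph (no isolated nodes).
    alpha: diffusion parameter.  Returns the kernel matrix R_alpha."""
    d = A.sum(axis=1)                                 # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))            # D^{-1/2}
    L = d_inv_sqrt @ (np.diag(d) - A) @ d_inv_sqrt    # normalized graph Laplacian
    return np.linalg.inv(np.eye(A.shape[0]) + alpha * L)

# Each column of the result ranks all nodes relative to one node, e.g.
# ranking_for_node_j = regularized_laplacian(A, 0.1)[:, j]
```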

16 Von Neumann kernel matrix
The von Neumann kernel: a mixture of a relatedness and an importance measure [Ito+ 08]. Has only one parameter. Small α: relatedness measure (co-citation matrix). Large α: importance measure (HITS authority vector). A: similarity matrix of the graph; D: (diagonal) degree matrix. Von Neumann kernel matrix: K_α = sum_{n=1}^{∞} α^{n-1} A^n = A (I - α A)^{-1}, where α is a parameter. I also use another graph kernel to show that an importance measure, the HITS authority vector, tends to cause semantic drift. The von Neumann kernel can be considered a mixture of a relatedness and an importance measure. It has only one parameter, just like the regularized Laplacian: a small α gives a relatedness measure, while a large α gives an importance measure. From the argument described earlier, the von Neumann kernel with a large α is expected to cause semantic drift. We will see whether this expectation is correct. Each column of K_α gives the rankings relative to one node.
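And a matching sketch of the von Neumann kernel in its usual closed form K_α = A(I - αA)^{-1}; the slide's formula image is not preserved, so the standard definition is assumed here.

```python
import numpy as np

def von_neumann_kernel(A, alpha):
    """A: symmetric (n, n) similarity matrix; alpha must satisfy
    0 <= alpha < 1/lambda, where lambda is the principal eigenvalue of A."""
    n = A.shape[0]
    return A @ np.linalg.inv(np.eye(n) - alpha * A)
```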

17 Condition of diffusion parameter
For convergence, the diffusion parameter α should be in the range 0 ≤ α < λ^{-1}, where λ is the principal eigenvalue of A. Limiting behavior: the regularized Laplacian gives the identity matrix as α → 0 and the uniform distribution as α → λ^{-1}; the von Neumann kernel gives the co-citation matrix as α → 0 and the HITS authority vector as α → λ^{-1}. The parameter α in the regularized Laplacian and the von Neumann kernel controls performance, and it should lie in the range from 0 up to, but not including, the inverse of the principal eigenvalue of A. α controls the effect of the seed instances in both algorithms: if α is small, the kernels tend toward a relatedness measure; if α is large, they tend toward an importance measure.
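A small helper (an illustration, not part of the thesis) for computing the admissible range of α from the principal eigenvalue of A:

```python
import numpy as np

def alpha_upper_bound(A):
    """Return 1/lambda for a symmetric similarity matrix A; any diffusion
    parameter alpha with 0 <= alpha < 1/lambda keeps the series convergent."""
    lam = np.linalg.eigvalsh(A).max()    # principal (largest) eigenvalue
    return 1.0 / lam
```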

18 K-step approximation to speed up the computation
Graph kernels require O(n^3) time (n: number of nodes), which is intractable for large graphs. The K-step approximation takes only the first K terms of the series, e.g. R_α ≈ sum_{n=0}^{K} (-α L)^n. The K-step approximation equals bootstrapping terminated at the K-th iteration. The approximation error is upper bounded in terms of the volume V of the matrix. Let me briefly mention the scalability of the regularized Laplacian, because Prof. Matsumoto asked about it last time. This approximation drastically reduces the time needed to obtain the ranking, while the error remains bounded. Another good property is that the K-step approximation can be regarded as a bootstrapping algorithm terminated at the K-th iteration.
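A sketch of the K-step approximation for the regularized Laplacian, applied directly to a seed score vector so that each step is a single matrix-vector product; the loop mirrors a bootstrapping run stopped after K iterations.

```python
import numpy as np

def k_step_scores(L, seed_vector, alpha, k):
    """Approximate R_alpha @ seed_vector with the first K+1 terms of the series
    sum_{n=0..K} (-alpha L)^n.  L: normalized graph Laplacian (n, n);
    seed_vector: 1 at seed nodes, 0 elsewhere; k: number of steps."""
    term = seed_vector.astype(float).copy()
    scores = term.copy()
    for _ in range(k):
        term = -alpha * (L @ term)       # next term of the series times the seed
        scores += term
    return scores
```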

19 Memory efficient computation of the regularized Laplacian
The similarity matrix is large and dense; the adjacency matrix is often large but sparse. After this factorization, the space complexity reduces to O(npk), where n is the number of nodes, p is the number of patterns, and k is the number of steps. The K-step approximation saves computation time. However, when the similarity matrix is very large, memory efficiency is also a problem. For example, when computing similarity over a web-scale corpus, the dimension of the similarity matrix can easily exceed tens of millions, so it cannot be loaded into memory in many situations. I proposed a factorization that takes advantage of the sparsity of the instance-pattern co-occurrence matrix.
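The following sketch shows one way to realize this idea under my reading of the slide: keep only the sparse instance-pattern matrix M, treat the instance-instance similarity as A = M M^T without ever materializing it, and expand every product A x as M (M^T x) inside the K-step loop. The construction A = M M^T is an assumption about how the similarity graph is built here.

```python
import numpy as np

def k_step_scores_factored(M, seed_vector, alpha, k):
    """M: sparse or dense (n_instances, n_patterns) co-occurrence matrix
    (e.g. a scipy.sparse CSR matrix).  Computes K-step regularized Laplacian
    scores without forming the dense similarity matrix A = M M^T.
    Assumes every instance co-occurs with at least one pattern."""
    ones = np.ones(M.shape[0])
    degrees = M @ (M.T @ ones)                 # row sums of A = M M^T
    d_inv_sqrt = 1.0 / np.sqrt(degrees)

    def laplacian_times(x):                    # L x = x - D^{-1/2} A D^{-1/2} x
        y = d_inv_sqrt * x
        y = M @ (M.T @ y)                      # A y, expanded through M
        return x - d_inv_sqrt * y

    term = seed_vector.astype(float).copy()
    scores = term.copy()
    for _ in range(k):
        term = -alpha * laplacian_times(term)
        scores += term
    return scores
```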

20 Experiments on graph kernels
Properties of the regularized Laplacian and the von Neumann kernel. Let me now present the experimental results for the graph kernels.

21 Word sense disambiguation task of Senseval-3 English Lexical Sample
Predict the sense of "bank": ... the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to ... Training instances are annotated with their sense: In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden. We use the word sense disambiguation task of the Senseval-3 English Lexical Sample data. Training instances are annotated with their true sense, and the task is to predict the sense of the target word in the test set. In this experiment, we predict the sense of "bank" as a benchmark. Predict the sense of the target word in the test set: possibly aligned to water a sort of bank (???) by a rushing river.

22 Word sense disambiguation by graph kernels
Seed instance = the instance whose sense we want to predict. Proximity measure = instance score vector given by the kernels. Seed instance. In order to perform WSD by bootstrapping, we construct a graph based on a similarity matrix computed from the final instance score vector. In this graph, an instance can be labeled or unlabeled, and the task is to predict the labels of the unlabeled instances. An unlabeled instance is given as a seed instance, and the system output is based on the k-nearest-neighbor algorithm, since similar instances tend to have the same label. We chose two heuristics of Espresso to contrast Simplified Espresso and Original Espresso. The first is pattern and instance selection; these parameters were hand-tuned for optimal performance in a preliminary experiment. The second is early stopping. Simplified Espresso lacks these heuristics, so semantic drift is unavoidable, while Original Espresso can more or less reduce semantic drift. (Question from Takamura-san: the pattern-instance co-occurrence matrix is built with the test instances included from the start; what happens when the test instances are not available at the start?)

23 Benchmark 1: Senseval-3 word sense disambiguation task
System output = k-nearest neighbor (k=3): i = (0.9, 0.1, 0.8, 0.5, 0, 0, 0.95, 0.3, 0.2, 0.4) → sense A. Seed instance: ... the financial benefits of the bank (finance) 's employee package ( cheap mortgages and pensions, etc ) , bring this up to ... In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden.
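A sketch of the k-nearest-neighbor decision rule on top of the kernel scores (k = 3), with illustrative variable names: rank the labeled training instances by their score relative to the seed (test) instance and take a majority vote.

```python
from collections import Counter
import numpy as np

def predict_sense(scores, labels, k=3):
    """scores: kernel scores of the labeled training instances relative to the
    seed (test) instance, e.g. one column of the kernel matrix restricted to
    labeled rows.  labels: sense label of each labeled instance."""
    top = np.argsort(scores)[::-1][:k]               # k highest-scoring neighbors
    return Counter(labels[j] for j in top).most_common(1)[0][0]

# On the slide's example vector the three highest scores are 0.95, 0.9 and 0.8;
# if those instances are labeled with sense A, the output is sense A.
```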

24 Regularized Laplacian is stable across a parameter
The regularized Laplacian is stable across the diffusion parameter. I cannot plot the performance at a diffusion factor of 0, since the regularized Laplacian kernel is then the identity matrix. Although there is no convergence guarantee when the diffusion factor exceeds the inverse of the principal eigenvalue of A, the matrix is still positive semi-definite and can therefore be used as a kernel.

25 Von Neumann Kernel tends to HITS authority
On the other hand, the von Neumann kernel with a large α tends toward the most frequent sense. This is expected, because the von Neumann kernel with a large α tends toward the HITS authority vector. As we can see, the von Neumann kernel with a small α achieves high accuracy, since it is then a relatedness measure, just like the regularized Laplacian. We can conclude that a relatedness measure is relevant for NLP tasks, and thus a small diffusion factor is preferable in practice.

26 Conclusions Semantic drift in Espresso is a parallel form of topic drift in HITS. The regularized Laplacian reduces semantic drift in bootstrapping for natural language processing tasks: it is inherently a relatedness measure (as opposed to an importance measure). To summarize, many bootstrapping methods are designed like importance computation methods in link analysis, which is the cause of semantic drift, because these methods converge to a single ranking list regardless of the seeds. In particular, semantic drift in Espresso is a parallel form of the topic drift observed in HITS. I have proposed to use relatedness measures from link analysis, namely the regularized Laplacian and the von Neumann kernel, for tasks in which bootstrapping has been used. These measures give different rankings for different seeds and are inherently less prone to drift.

