# Learning on the Test Data: Leveraging “Unseen” Features. Ben Taskar, Ming Fai Wong, Daphne Koller


Introduction: Most statistical learning models assume that data instances are IID samples from some fixed distribution. In practice, the data are often collected from different sources, at different times and locations, and under different circumstances. We usually build a statistical model of features under the assumption that future data will exhibit the same regularities as the training data. In many data sets, however, there are scope-limited features whose predictive power applies only to a certain subset of the data.

Examples:

1. Classifying news articles chronologically: New events, people and places appear (and disappear) in bursts over time. The training data might consist of articles taken over some time period; these are only somewhat representative of future articles. The test data may contain some features that are not observed in the training data.
2. Classifying customers into categories: Our training data might be collected from one geographical region, which may not represent the distribution in other regions.

We could sidestep this difficulty by mixing all the examples and selecting the training and test sets randomly. But this homogeneity cannot be ensured in real-world tasks, where only the non-representative training data is actually available for training. The test data may contain many features that were never, or only rarely, observed in the training data. These features may still be useful for classification. For example, in the news article task these local features might include the names of places or people currently in the news. In the customer example, they might include purchases of products that are specific to a region.

Scoped Learning: Suppose we want to classify news articles chronologically. The phrase “XXX said today” might appear in many places in the data for different values of “XXX”. Features like these are called scope-limited features, or local features. Another example: suppose there are two labels, grain and trade. Words like corn or wheat often appear in phrases such as “tons of wheat”. So we can learn that if a word appears in the context “tons of XXX”, it is likely to be associated with grain. If we then find a phrase like “tons of rye” in the test data, we can infer that rye has some positive interaction with the label grain. Scoped learning is a probabilistic framework that combines the traditional IID features with scope-limited features.

The intuitive procedure for using the local features is to use the information from the global (IID) features to infer the rules that govern the local information for a particular subset of the data. When the data exhibit scope, the authors found significant gains in performance over traditional models that use only IID features. All the data instances within a particular scope exhibit some structural regularity, and we assume that future data will exhibit the same structural regularity.

General Framework: Notion of scope: We assume that data instances are sampled from some set of scopes, each of which is associated with some data distribution. Different distributions share a probabilistic model for some set of global features, but can contain a different probabilistic model for a scope-specific set of local features. These local features may be rarely or never seen in the scopes comprising the training data.

Let X denote the global features, Z the local features, and Y the class variable. For each global feature X_i, there is a parameter γ_i. Additionally, for each scope S and each local feature Z_i, there is a parameter λ_i^S. Then the distribution of Y given all the features and weights is

$$P(Y \mid x, z, \gamma, \lambda^S) \propto \exp\Big(\sum_i \gamma_i Y x_i + \sum_i \lambda_i^S Y z_i\Big)$$
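As a rough illustration of this conditional distribution, the sketch below evaluates it for a single instance. It assumes binary labels coded as −1 and +1 and a log-linear form proportional to exp(Y(γ·x + λ·z)), which matches the potentials used later in the model; the function name `label_posterior` is hypothetical.

```python
import numpy as np

def label_posterior(x, z, gamma, lam):
    """P(Y = +1 | x, z, gamma, lam), assuming labels in {-1, +1}.

    Assumed form: P(Y = y) is proportional to exp(y * (gamma.x + lam.z)).
    """
    score = float(np.dot(gamma, x) + np.dot(lam, z))
    # exp(score) / (exp(score) + exp(-score)) simplifies to a logistic in 2 * score
    return 1.0 / (1.0 + np.exp(-2.0 * score))

# One global feature and one local feature, both present in the instance.
p = label_posterior(x=[1.0], z=[1.0], gamma=[0.3], lam=[0.7])
```

With all weights at zero the function returns 0.5, so any departure from 0.5 comes from the weighted evidence of the features present in the instance.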

Probabilistic model: We assume that the global weights can be learned from the training data, so their values are fixed when we encounter a new scope. The local feature weights are unknown and can be treated as hidden variables in the graphical model. Idea: We use the evidence from global features about the labels of some of the instances to modify our beliefs about the role of the local features present in these instances, so that these beliefs are consistent with the labels. By learning about the roles of these features, we can then propagate this information to improve accuracy on instances that are harder to classify using global features alone.

To implement this idea, we define a joint distribution over λ^S and y_1, ..., y_m. Why use Markov random fields: here the associations between the variables are correlational rather than causal. Markov random fields are used to model spatial interactions or interacting features.

Markov Network: Let V = (V_d, V_c) denote a set of random variables, where V_d are discrete and V_c are continuous variables. A Markov network over V defines a joint distribution over V, assigning a density over V_c for each possible assignment v_d to V_d. A Markov network M is an undirected graph whose nodes correspond to V. It is parameterized by a set of potential functions φ_1(C_1), ..., φ_l(C_l) such that each C ⊆ V is a fully connected subgraph, or clique, in M, i.e., each pair V_i, V_j ∈ C is connected by an edge in M. Here we assume that each φ(C) is a log-quadratic function. The Markov network then represents the distribution:

$$P(v) \propto \prod_{k=1}^{l} \phi_k(c_k)$$
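To make the product-of-potentials definition concrete, here is a minimal sketch for the purely discrete case. It enumerates every assignment of a few binary variables, multiplies the clique potentials, and normalizes. The helper `joint_table` and the example potential are illustrative assumptions, and continuous variables are not handled.

```python
import itertools
import numpy as np

def joint_table(num_vars, potentials):
    """Normalized joint over binary variables from clique potentials.

    potentials: list of (clique, fn) pairs, where clique is a tuple of
    variable indices and fn maps that clique's values to a positive number.
    """
    assignments = list(itertools.product([0, 1], repeat=num_vars))
    weights = np.array([
        np.prod([fn(*(v[i] for i in clique)) for clique, fn in potentials])
        for v in assignments
    ])
    return dict(zip(assignments, weights / weights.sum()))

# A single pairwise potential that favors agreement between two variables.
agree = lambda a, b: 2.0 if a == b else 1.0
joint = joint_table(2, [((0, 1), agree)])
```

Brute-force enumeration is only feasible for a handful of variables, which is one reason approximate inference is needed for realistic networks.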

In our case the log-quadratic model consists of three types of potentials:

1. φ(γ_i, Y_j, X_i^j) = exp(γ_i Y_j X_i^j) relates each global feature X_i^j in instance j to its weight γ_i and the class variable Y_j of that instance.
2. φ(λ_i, Y_j, Z_i^j) = exp(λ_i Y_j Z_i^j) relates each local feature Z_i^j to its weight λ_i and the label Y_j.
3. Finally, as the local feature weights are assumed to be hidden, we introduce a prior over their values, of the form φ(λ_i) = exp(−(λ_i − µ_i)² / 2σ²), a Gaussian with prior mean µ_i and variance σ².

Overall, our model specifies a joint distribution as follows:

$$P(\lambda, y \mid x, z, \gamma) \propto \prod_i \exp\Big(-\frac{(\lambda_i - \mu_i)^2}{2\sigma^2}\Big) \prod_j \exp\Big(\sum_i \gamma_i y_j x_i^j + \sum_i \lambda_i y_j z_i^j\Big)$$
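The joint distribution can be sketched as an unnormalized log-density, which is often the most convenient form for inference code. This sketch assumes labels in {−1, +1} and a Gaussian prior on each local weight with mean `mu` and a shared variance `sigma2`, consistent with the log-quadratic assumption; the function name and array layout are hypothetical.

```python
import numpy as np

def log_joint(lam, y, X, Z, gamma, mu, sigma2):
    """Unnormalized log of the joint over local weights lam and labels y.

    X: (m, n_global) global feature values, Z: (m, n_local) local feature
    values, y: (m,) labels assumed to be in {-1, +1}.
    """
    lam, y = np.asarray(lam, float), np.asarray(y, float)
    X, Z = np.asarray(X, float), np.asarray(Z, float)
    # Assumed Gaussian prior potential on each local feature weight
    log_prior = -np.sum((lam - np.asarray(mu, float)) ** 2) / (2.0 * sigma2)
    # Potentials tying each label to the fixed global weights
    global_term = np.sum(y * (X @ np.asarray(gamma, float)))
    # Potentials tying each label to the hidden local weights
    local_term = np.sum(y * (Z @ lam))
    return float(log_prior + global_term + local_term)

value = log_joint(lam=[1.0], y=[1], X=[[1.0]], Z=[[1.0]],
                  gamma=[0.5], mu=[0.0], sigma2=1.0)
```

Each of the three terms corresponds to one of the three potential types, so adding them is the log of the product of potentials.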

Markov network for two instances, two global features and three local features

The graph can be simplified further when we account for variables whose values are fixed. The global feature weights are learned from the training data, so their values are fixed, and we also know all the feature values. The resulting Markov network is shown below, assuming that the instance (x_1, z_1, y_1) contains the local features Z_1 and Z_2, and the instance (x_2, z_2, y_2) contains the local features Z_2 and Z_3.

[Figure: simplified Markov network over Y_1, Y_2, λ_1, λ_2, λ_3]

This can be reduced further. When Z_i^j = 0, there is no interaction between Y_j and the variable λ_i. In this case we can simply omit the edge between λ_i and Y_j. The resulting Markov network is shown below.

[Figure: reduced Markov network over Y_1, Y_2, λ_1, λ_2, λ_3, with an edge between Y_j and λ_i only where Z_i^j ≠ 0]
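The edge-pruning rule can be sketched directly from the local feature matrix: an edge between a label and a local weight is kept only where the feature value is nonzero. The helper `build_edges` and the 0-based indexing are assumptions made for illustration; the matrix encodes the two-instance example, where the first instance contains the first two local features and the second instance contains the last two.

```python
import numpy as np

def build_edges(Z):
    """Return (instance j, local feature i) pairs for which Z[j, i] != 0."""
    rows, cols = np.nonzero(np.asarray(Z))
    return [(int(j), int(i)) for j, i in zip(rows, cols)]

# Rows are instances, columns are local features (0-based indices).
Z = np.array([[1, 1, 0],
              [0, 1, 1]])
edges = build_edges(Z)
```

Here the two labels are linked only through the shared middle feature, which is the path along which information about one label can reach the other.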

In this model, we can see that the labels of all of the instances are correlated with the local feature weights of the features they contain, and thereby with each other. Thus, for example, if we obtain evidence (from global features) about the label Y_1, it would change our posterior beliefs about the local feature weight λ_2, which in turn would change our beliefs about the label Y_2. Thus, by running probabilistic inference over this graphical model, we obtain updated beliefs both about the local feature weights and about the instance labels.
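This propagation effect can be checked numerically on a tiny example. The sketch below is not the inference method used in the presentation: it is a brute-force calculation that assumes labels in {−1, +1} and a Gaussian prior on each local weight (mean `mu`, shared variance `sigma2`), so that the weights can be integrated out in closed form and the label configurations enumerated. The function name `label_marginals` is hypothetical.

```python
import itertools
import numpy as np

def label_marginals(global_scores, Z, mu, sigma2):
    """Exact P(Y_j = +1) with the local weights integrated out analytically.

    global_scores[j] is the fixed evidence gamma . x^j for instance j.
    Integrating exp(lam * a) against a Gaussian prior on lam contributes
    exp(mu * a + sigma2 * a**2 / 2), where a = sum_j y_j * Z[j, i].
    """
    s = np.asarray(global_scores, float)
    Z = np.asarray(Z, float)
    mu = np.asarray(mu, float)
    configs = np.array(list(itertools.product([-1.0, 1.0], repeat=len(s))))
    a = configs @ Z  # a[c, i]: signed count of feature i under label config c
    log_w = configs @ s + a @ mu + 0.5 * sigma2 * np.sum(a ** 2, axis=1)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    p = w / w.sum()
    return (p[:, None] * (configs > 0)).sum(axis=0)

Z = [[1, 1, 0],
     [0, 1, 1]]
# Instance 1 has strong global evidence; instance 2 has none.
marginals = label_marginals([2.0, 0.0], Z, mu=np.zeros(3), sigma2=1.0)
```

In this setup the second label has no global evidence of its own, yet its marginal moves away from 0.5 because it shares a local feature with the first instance. If the shared column of `Z` is zeroed for the second instance, its marginal returns to 0.5.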

Learning the Model: Learning global feature weights γ: In this case we simply learn the parameters from the training data, using standard logistic regression. Maximum-likelihood (ML) estimation finds the weights γ that maximize the conditional likelihood of the labels given the global features. Learning local feature distributions: Local features often follow recurring patterns, such as the context in which a word appears (for example, “tons of XXX”). We can exploit such patterns by learning a model that predicts the prior of the local feature weights using meta-features, that is, features of features. More precisely, we learn a model that predicts the prior mean µ_i for λ_i from some set of meta-features m_i. As our predictive model for the mean µ_i we choose a linear regression model, setting µ_i = w · m_i.
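The meta-feature regression µ_i = w · m_i can be sketched with ordinary least squares. The sketch assumes that the regression targets are feature weights already learned on the training set and that each feature is described by a fixed-length meta-feature vector; both function names and the toy numbers are hypothetical.

```python
import numpy as np

def fit_meta_regression(M_train, weights_train):
    """Least-squares fit of w in mu_i = w . m_i.

    M_train: (n_features, n_meta) meta-feature vectors of training features.
    weights_train: (n_features,) weights learned for those features
    (assumed here to be the regression targets).
    """
    w, *_ = np.linalg.lstsq(np.asarray(M_train, float),
                            np.asarray(weights_train, float), rcond=None)
    return w

def predict_prior_means(w, M_new):
    """Prior means for features whose weights were never learned."""
    return np.asarray(M_new, float) @ w

# Toy meta-features, e.g. indicators for two contexts a word appears in.
M_train = [[1, 0], [0, 1], [1, 1]]
weights_train = [2.0, -1.0, 1.0]
w = fit_meta_regression(M_train, weights_train)
mu_new = predict_prior_means(w, [[2, 1]])
```

The predicted means then serve as the centers of the priors on the hidden local weights, so an unseen word that appears in an informative context starts with a non-neutral prior.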

Using the model: Step 1: Given a training set, we first learn the model. In the training set, local and global features are treated identically. When applying the model to the test set, however, our first decision is to determine the set of local and global features. Step 2: Our next step is to generate the Markov network for the test set. Probabilistic inference over this model infers the effect of the local features. Step 3: We use Expectation Propagation for inference. It maintains approximate beliefs (marginals) over the nodes of the Markov network and iteratively adjusts them to achieve local consistency.
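The presentation does not say how the split into local and global features is made. One plausible rule, sketched below as an assumption, treats features that occur in enough training documents as global and treats test-set features that were never or only rarely seen in training as local. The threshold `min_train_count` and the function name are hypothetical.

```python
from collections import Counter

def split_features(train_docs, test_docs, min_train_count=2):
    """Split features into global and local sets by training frequency.

    Each document is a list of feature names (for example, words).
    The threshold rule is an illustrative assumption.
    """
    # Count the number of training documents each feature appears in
    train_counts = Counter(f for doc in train_docs for f in set(doc))
    global_feats = {f for f, c in train_counts.items() if c >= min_train_count}
    # Test features that are rare or unseen in training are treated as local
    test_feats = {f for doc in test_docs for f in doc}
    local_feats = test_feats - global_feats
    return global_feats, local_feats

train_docs = [["tons", "wheat"], ["tons", "corn"]]
test_docs = [["tons", "rye"]]
global_feats, local_feats = split_features(train_docs, test_docs)
```

In this toy case the context word is frequent enough to be global, while the word seen only in the test data becomes a local feature whose weight is inferred at test time.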

Experimental Results: Reuters: The Reuters news articles data set contains a substantial number of documents hand-labeled into the categories grain, crude, trade, and money-fx. Using this data set, six experimental setups are created by using all possible pairings of the four chosen categories. The resulting sequence of documents is divided into nine time segments with roughly the same number of documents in each segment.
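The construction of the setups can be sketched in a few lines. The sketch assumes the documents of a setup are already sorted chronologically and represents them only by their positions; the helper `time_segments` and the document count are hypothetical.

```python
import itertools
import numpy as np

categories = ["grain", "crude", "trade", "money-fx"]
# All unordered pairings of the four categories
setups = list(itertools.combinations(categories, 2))

def time_segments(num_docs, num_segments=9):
    """Split chronologically ordered document indices into contiguous
    segments of roughly equal size."""
    return np.array_split(np.arange(num_docs), num_segments)

segments = time_segments(100)
```

Choosing two of four categories yields the six setups, and `np.array_split` keeps segment sizes within one document of each other when the count is not divisible by nine.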

WebKB2: This data set consists of hand-labeled web pages from the Computer Science department web sites of four schools: Berkeley, CMU, MIT and Stanford. The pages are categorized into faculty, student, course and organization. Six experimental setups are created by using all possible pairings of the four categories.