Active Learning with Feedback on Both Features and Instances H. Raghavan, O. Madani and R. Jones Journal of Machine Learning Research 7 (2006) Presented by: John Paisley
Outline Discuss problem Discuss proposed solution Discuss results Conclusion
Problem of Paper Imagine you want to filter junk email with a classifier and you're willing to help train it by labeling examples, but you're impatient and want to label as few as possible. Or imagine you want to sort a database of news articles. This paper is concerned with speeding this process up, meaning reaching high performance in fewer iterations.
Suggestion of Paper Traditionally, active learning queries a user about instances (articles, emails, etc.) and the user provides a label for each instance (one-vs-rest in this paper). This paper suggests that the user also be queried about features (words) and their relevance for distinguishing classes, to speed up the learning process. The reason is that, apparently, in typical applications all words of a document are used as features in classification. The feature space is therefore very high-dimensional and, with only a few labeled instances, it's hard to build a good classifier. By asking about features, the dimensionality is (effectively) reduced early on, with the nuisance dimensions (effectively) removed.
Traditional Active Learning 1. Several instances are selected at random and labeled by a user. 2. A model is built (SVM using direct kernel here). 3. Sequentially, the most uncertain instances (those closest to the decision boundary; this is called uncertainty sampling) are selected and labeled, and the model is updated. 4. The algorithm terminates at some point (when a high enough level of performance is reached).
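The selection rule in step 3 can be sketched as follows. This is a minimal illustration, assuming a linear decision function w·x + b; the function names and the toy vectors are hypothetical and not taken from the paper.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b of a linear classifier.

    Its magnitude grows with the distance of x from the decision boundary.
    """
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def most_uncertain(weights, bias, pool):
    """Return the index of the pool instance closest to the boundary."""
    return min(
        range(len(pool)),
        key=lambda i: abs(decision_value(weights, bias, pool[i])),
    )


# Hypothetical toy data: three unlabeled documents as 2-d feature vectors.
weights, bias = [1.0, -1.0], 0.0
pool = [[3.0, 0.5], [1.0, 0.9], [0.2, 2.0]]

# The second document has the smallest |score|, so it is queried next.
query_index = most_uncertain(weights, bias, pool)
```

In a full loop, the queried instance would be labeled by the user, moved from the pool to the training set, and the model retrained before the next query.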
Their Feature Feedback Addition 1. (same) Several instances are selected at random and labeled by a user. 2. (same) A model (SVM using direct kernel here) is built. 3. (same) Sequentially, the most uncertain instances (those closest to the decision boundary; this is called uncertainty sampling) are selected and labeled, and the model is updated. 4. Then, the user is shown a list of features (words) and asked whether they are relevant to distinguishing this class from others. Their algorithm then incorporates this in further training by simply multiplying that dimension by 10 (arbitrary) to increase the impact that dimension has on classification (because of the direct kernel, I assume).
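The feature scaling in step 4 might look like the sketch below. The factor of 10 comes from the slide; the helper name, the vocabulary, and the counts are made up for illustration.

```python
def boost_features(vector, relevant_indices, factor=10.0):
    """Scale the dimensions the user marked as relevant; leave the rest alone."""
    return [
        value * factor if i in relevant_indices else value
        for i, value in enumerate(vector)
    ]


# Hypothetical vocabulary and one document's term counts.
vocab = ["football", "the", "score"]
doc = [1.0, 4.0, 2.0]

# Suppose the user marked "football" and "score" as relevant.
relevant = {vocab.index("football"), vocab.index("score")}
boosted = boost_features(doc, relevant)
```

Every training and test vector would need the same scaling applied, so that the boosted dimensions dominate inner products consistently.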
How They Assess Performance (1) Before humans are involved, they create an oracle that can rank features by importance (it has all labels a priori), as determined via Information Gain (IG). IG is computed from P(c), the probability of the class of interest; P(t), the probability of the word of interest appearing in an article; and P(c,t), their joint probability. The larger the IG, the more informative the word is for determining the class (e.g. football is informative for sports).
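As an illustration, the standard form of information gain between a binary class and a binary word indicator is the sum over c and t of P(c,t) log[P(c,t) / (P(c)P(t))]. The sketch below computes it from document counts. Whether the paper uses exactly this form (for instance, this log base) is an assumption, and the function and argument names are hypothetical.

```python
import math


def information_gain(n_ct, n_c, n_t, n):
    """Information gain between class membership and word presence.

    n_ct: documents in the class that contain the word
    n_c:  documents in the class
    n_t:  documents that contain the word
    n:    all documents
    """
    # Counts for the four (class, word) combinations.
    counts = {
        (1, 1): n_ct,
        (1, 0): n_c - n_ct,
        (0, 1): n_t - n_ct,
        (0, 0): n - n_c - n_t + n_ct,
    }
    p_c = {1: n_c / n, 0: 1 - n_c / n}
    p_t = {1: n_t / n, 0: 1 - n_t / n}

    ig = 0.0
    for (c, t), k in counts.items():
        if k > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            p = k / n
            ig += p * math.log2(p / (p_c[c] * p_t[t]))
    return ig
```

A word that appears independently of the class gets an IG of zero, while a word that appears in exactly the documents of a class covering half the collection gets one bit.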
How They Assess Performance (2) They devise their own performance metric, called efficiency, which is based on the F1 score. F1 is the harmonic mean of precision and recall, where precision is the fraction of (e.g.) articles classified as 1 that truly have label 1, and recall is the fraction of all articles with label 1 that are classified as 1. They set M = 1000, assuming that the classifier will be about perfect at that point, and they measure how far active learning (ACT) is from that perfection compared with random sampling. [Right: Efficiency is defined as one minus the blue area divided by the grey area. They only measure after seeing 42 documents throughout the paper.]
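A rough sketch of both quantities follows. The F1 computation is standard. The efficiency function is only one reading of the figure description: it assumes the blue area is the gap between the near-perfect reference F1 and the active learning curve, and the grey area is the same gap for random sampling. The names and the example curves are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def efficiency(f1_act, f1_rand, f1_reference):
    """One minus (gap under the ACT curve) / (gap under the RAND curve).

    f1_act, f1_rand: F1 after each labeled document, same length.
    f1_reference:    F1 assumed reachable with many labels (M = 1000 on the slide).
    """
    gap_act = sum(f1_reference - f for f in f1_act)
    gap_rand = sum(f1_reference - f for f in f1_rand)
    if gap_rand == 0:
        raise ValueError("random-sampling curve already matches the reference")
    return 1.0 - gap_act / gap_rand


# Made-up learning curves over three labeling steps.
example = efficiency([0.5, 0.7, 0.9], [0.3, 0.5, 0.7], 1.0)
```

Under this reading, an efficiency of 0 means active learning is no better than random sampling, and values closer to 1 mean it approaches the reference performance much sooner.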
Results with Oracle These results show the ideal performance of feature feedback, to see if it's worthwhile to begin with. Basically, they select the top n features that maximize performance (via Information Gain) and do active learning, reporting the efficiency after 42 documents as well as the F1 score after 7 and 22 documents. The F1 results are upper bounded by the far right column. The results indicate that selecting the most informative features speeds up learning (the uninformative features are distractions for the classifier in the early stages, when there are only a few labels).
Results with Human How well can a human label features compared with the oracle and, if not as well, is it still beneficial? Experiment: Have a human read an article and show the top 20 words from the oracle mixed in with some other words. Have the user mark each word as relevant or as not relevant/don't know. Below shows the human compared with the oracle. Also shown is the ability of 50 labeled documents (picked via uncertainty sampling) to select the top 20 words (via Information Gain), i.e. traditional active learning after 50 documents. What it says is that after seeing one document, a human can tell the relevant features better than the classifier can after 50. Kappa is a measure of how well the humans agree (which they say is good).
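For reference, agreement between two annotators is commonly measured with Cohen's kappa, sketched below. The slide does not say which kappa variant the paper reports, so treating it as two-annotator Cohen's kappa is an assumption, and the function name is hypothetical.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: both annotators pick the same category independently.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (observed - expected) / (1 - expected)
```

Here 1 stands for "relevant" and 0 for "not relevant/don't know"; a kappa of 1 means perfect agreement and 0 means agreement no better than chance.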
Putting Humans In the Loop They then took the human responses and simulated active learning with feature feedback. The experimenters were shown an article and the features to respond to (relevant or not) for that article, and they entered what the humans of the previous slide had said. UNC is no feature feedback, ORA is the oracle (correct answers for the feature queries), and HIL is the human response (as opposed to the oracle). The results indicate that humans speed up the active learning process.
Conclusions Knowing which features are relevant in the early stages of active learning helps speed up the process of building an accurate classifier. Far fewer instances need to be labeled for the classifier to reach high performance. Humans are able to identify these features (in the case of identifying words for documents).