
Online Learning by Projecting: From Theory to Large-Scale Web-Spam Filtering
Yoram Singer
Based on joint work with: Koby Crammer (UPenn), Ofer Dekel (Google/HUJI), Vineet Gupta (Google), Joseph Keshet (HUJI), Andrew Ng (Stanford), Shai Shalev-Shwartz (HUJI)
UT Austin AIML Seminar, Jan. 27, 2005

Online Binary Classification
Example instances: "No animal eats bees", "Pearls melt in vinegar", "Dr. Seuss finished Dartmouth", "There are weapons of mass destruction in Iraq"
Example labels: True, False, True

Binary Classification
Instances (documents, signals): x_t ∈ R^n
Labels (true/false, good/bad): y_t ∈ {+1, -1}
Classification and prediction: ŷ_t = sign(w · x_t)
Mistakes and losses: a mistake on round t means ŷ_t ≠ y_t

Online Binary Classification
Initialize your classifier (w_1)
For t = 1, 2, 3, …, T, …
  Receive an instance: x_t
  Predict label: ŷ_t
  Receive true label: y_t [suffer "loss"/error]
  Update classifier (w_t → w_{t+1})
Goal: suffer small losses while learning
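The protocol above can be sketched in code. The following is a minimal illustration (not from the talk), with a perceptron-style rule standing in for the "update classifier" step; all names are illustrative.

```python
# Minimal online-learning loop: predict, receive label, suffer loss, update.
def online_learn(examples, update):
    n = len(examples[0][0])
    w = [0.0] * n                          # initialize the classifier w_1
    mistakes = 0
    for x, y in examples:                  # for t = 1, 2, ..., T
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1    # predict label
        if y_hat != y:                     # receive true label, suffer loss
            mistakes += 1
        w = update(w, x, y, score)         # update classifier w_t -> w_{t+1}
    return w, mistakes

# Placeholder update: perceptron-style step whenever the margin is not positive.
def perceptron_update(w, x, y, score):
    if y * score <= 0:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return w
```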

Why Online?
Adaptive
Simple to implement
Fast, small memory footprint
Can be converted to batch learning (O2B)
Formal guarantees
But: might not be as effective as a well-designed batch learning algorithm

Linear Classifiers & Margins
The prediction is formed as follows: ŷ = sign(w · x)
The margin of an example (x, y) w.r.t. w is y (w · x)
Positive margin: correct prediction; negative margin: mistake

Separability Assumption

Classifier Update - Passive Mode

Prediction & Margin Errors

Hinge Loss: ℓ(w; (x, y)) = max{0, 1 − y (w · x)}
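A direct transcription of the hinge loss, assuming the standard form max{0, 1 − y(w · x)}; the function name is illustrative.

```python
# Hinge loss: zero when (x, y) is classified with margin >= 1, linear otherwise.
def hinge_loss(w, x, y):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)
```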

Version Space
In case of a prediction mistake, the new classifier must reside in the set of vectors consistent with the current example

Mistake → Aggressive Mode: the current classifier is projected onto the feasible (dual) space

Passive-Aggressive Update
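A minimal sketch of the binary-classification PA update, assuming the standard closed-form step size τ_t = ℓ_t / ‖x_t‖² from the passive-aggressive paper; function and variable names are illustrative.

```python
# Passive-aggressive step: stay put if the hinge loss is zero (passive),
# otherwise project w onto the half-space of vectors with margin >= 1 on (x, y).
def pa_update(w, x, y):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss on this round
    if loss == 0.0:
        return w                           # passive: constraint already satisfied
    sq_norm = sum(xi * xi for xi in x)
    tau = loss / sq_norm                   # aggressive: exact projection step
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

After an aggressive step the updated vector attains margin exactly 1 on the current example, which is what "projection onto the feasible set" means here.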

Three Decision Problems: A Unified View (Classification, Regression, Uniclass)

The Generalized PA Algorithm
Each example induces a set of consistent hypotheses (half-space, hyper-slab, ball for classification, regression, uniclass respectively)
The new vector is set to be the projection of the current vector onto the set of consistent hypotheses

Loss Bound (Classification)
If there exists u such that y_t (u · x_t) ≥ 1 for all t, then PA makes a bounded number of mistakes

Proof Sketch
Define: Δ_t = ‖w_t − u‖² − ‖w_{t+1} − u‖²
Upper bound: the sum of the Δ_t telescopes
Lower bound: follows from the update rule and a Lipschitz condition on the loss

Proof Sketch (Cont.)
Combining the upper and lower bounds yields the loss bound; L = B for classification and regression, L = 1 for uniclass
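The two proof-sketch slides can be written out following the standard passive-aggressive analysis (symbols Δ_t, u, τ_t, ℓ_t, ℓ_t(u) as in the PA paper; this reconstruction is an assumption, since the slide equations were lost):

```latex
% Definition (per round):
\Delta_t \;=\; \|w_t - u\|^2 \;-\; \|w_{t+1} - u\|^2
% Upper bound: the sum telescopes, and w_1 = 0:
\sum_{t=1}^{T} \Delta_t \;=\; \|w_1 - u\|^2 - \|w_{T+1} - u\|^2 \;\le\; \|u\|^2
% Lower bound, from the update w_{t+1} = w_t + \tau_t y_t x_t:
\Delta_t \;\ge\; \tau_t \bigl( 2\ell_t - \tau_t \|x_t\|^2 - 2\ell_t(u) \bigr)
```

Summing the lower bound over t and comparing with the telescoped upper bound gives the stated loss bounds.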

Unrealizable Case ???

Unrealizable Case (Classification): PA-I and PA-II

(Not-really) Aggressive Updates
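For reference, the step sizes that distinguish the three variants, as given in the passive-aggressive paper (ℓ_t is the hinge loss on round t, C the aggressiveness parameter):

```latex
% PA: enforce the margin constraint exactly
\tau_t = \frac{\ell_t}{\|x_t\|^2}
% PA-I: cap the step size at C
\tau_t = \min\Bigl\{ C,\; \frac{\ell_t}{\|x_t\|^2} \Bigr\}
% PA-II: soften the constraint via a quadratic penalty
\tau_t = \frac{\ell_t}{\|x_t\|^2 + \tfrac{1}{2C}}
```

PA-I and PA-II are "not really" aggressive: both shrink the step so a single noisy example cannot move the classifier arbitrarily far.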

Mistake Bound for PA-I
Loss suffered by PA-I on round t: ℓ_t
Loss suffered by any fixed vector u on round t: ℓ_t(u)
#Mistakes made by PA-I is at most: max{R², 1/C} (‖u‖² + 2C Σ_t ℓ_t(u)), where R = max_t ‖x_t‖

Loss Bound for PA-II
Loss suffered by PA-II on round t: ℓ_t
Loss suffered by any fixed vector u on round t: ℓ_t(u)
Cumulative squared loss (Σ_t ℓ_t²) of PA-II is at most: (R² + 1/(2C)) (‖u‖² + 2C Σ_t ℓ_t(u)²)

Beyond Binary Decision Problems
Applications and generalizations of PA:
Multiclass categorization
Topic ranking and filtering
Hierarchical classification
Sequence learning (Markov networks)
Segmentation of sequences
Learning of pseudo-metrics

Movie Recommendation System

Recommending by Projecting: project onto w, then apply thresholds to obtain a rank level (1, 2, 3, 4)

PRank Update (figures): rank levels are delimited by ordered thresholds on the projection w · x; on an error, the thresholds that disagree with the correct rank interval (e.g. thresholds {2, 3}) are moved toward the projection, and w is updated.
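A sketch of the PRank prediction and update, following Crammer and Singer's "Pranking with Ranking"; the representation (one weight vector plus ordered thresholds) matches the slides, while the function names are illustrative.

```python
# Predict the rank of x: the first threshold that w . x falls below.
def prank_predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x))
    for r, br in enumerate(b, start=1):    # k-1 thresholds define k rank levels
        if score < br:
            return r
    return len(b) + 1                      # top rank

# PRank update: move every violated threshold one unit toward the projection,
# and shift w by the total correction times x.
def prank_update(w, b, x, y):
    score = sum(wi * xi for wi, xi in zip(w, x))
    taus = []
    for r, br in enumerate(b, start=1):
        y_r = 1 if y > r else -1           # should the score lie above or below b_r?
        taus.append(y_r if (score - br) * y_r <= 0 else 0)
    total = sum(taus)
    w = [wi + total * xi for wi, xi in zip(w, x)]
    b = [br - t for br, t in zip(b, taus)]
    return w, b
```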

Demo: Online Movie Recommendation
EachMovie database: 74,424 registered viewers, 1,648 listed movies; viewers rated subsets of movies

Web Spam Filtering [With Vineet Gupta] Query: “hotels palo alto” Spammers: Cardinal Hotel - Palo Alto - Reviews of Cardinal Hotel... Palo Alto, California United States. Deals on Palo Alto hotels.... More Palo Altohotels.... Research other Palo Alto hotels. Is this hotel not right for you?... Cardinal Hotel - Palo Alto - Reviews of Cardinal Hotel Palo Alto Hotels - Cheap Hotels - Palo Alto Hotels... Book Palo Alto Hotels Online or Call Toll Free Keywords: Palo AltoHotel Discounts - Cheap Hotels in Palo Alto. Hotels In Palo Alto Palo Alto Hotels - Cheap Hotels - Palo Alto Hotels...

Enhancements for Web Spam
Various "signals" → features
Design of special kernels
Multi-tier feedback (label): +2 navigational site, +1 on topic, -1 off topic, -2 nuke the spammer
Loss is sensitive to site label
Algorithmic modifications due to scale: online-to-batch conversions, re-projections of old examples
Part of a recent revision to search (Google3)

Web Spam Filtering - Results
Specific queries and domains are heavily spammed: over 50% of the returned URLs for travel search; certain countries are more spam-prone
Training set size: over half a million domains
Training time: 2 hours to 5 days
Test set size: the entire web crawled by Google (over 100 million domains); a few hours to filter all domains on hundreds of CPUs
Current reduction achieved (estimate): 50% of spammers

Summary
Unified online framework for decision problems
Simple and efficient algorithms ("kernelizable")
Analyses for realizable and unrealizable cases
Numerous applications
Batch learning conversions & generalization
Generalizations using general Bregman projections
Approximate projections for large-scale problems
Applications of PA to other decision problems

Related Work
Projections Onto Convex Sets (POCS):
Y. Censor & S.A. Zenios, "Parallel Optimization" (Hildreth's projection algorithm), Oxford UP, 1997
H.H. Bauschke & J.M. Borwein, "On Projection Algorithms for Solving Convex Feasibility Problems", SIAM Review, 1996
Online Learning:
M. Herbster, "Learning Additive Models Online with Fast Evaluating Kernels", COLT 2001
J. Kivinen, A. Smola, and R.C. Williamson, "Online Learning with Kernels", IEEE Trans. on SP, 2004

Relevant Publications
Online Passive-Aggressive Algorithms, CDSS'03, CSKSS'05
A Family of Additive Online Algorithms for Category Ranking, CS'03
Ultraconservative Online Algorithms for Multiclass Problems, CS'02, CS'03
On the Algorithmic Implementation of Multiclass SVM, CS'03
Pranking with Ranking, CS'01, CS'04
Large Margin Hierarchical Classification, DKS'04
Learning to Align Polyphonic Music, SKS'04
Online and Batch Learning of Pseudo-Metrics, SSN'04
The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees, DSS'04
A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities, SCPVS'04

Hierarchical Classification: Motivation
Phonetic transcription of DECEMBER
Gross error: T ix s eh m bcl b er
Small errors: d AE s eh m bcl b er; d ix s eh NASAL bcl b er

Phonetic Hierarchy (tree)
PHONEMES → Sonorants, Silences, Obstruents
Sonorants → Nasals (n, m, ng), Liquids (l, y, w, r), Vowels (Front, Center, Back: oy, ow, uh, uw, aa, ao, er, aw, ay, iy, ih, ey, eh, ae)
Obstruents → Plosives (b, g, d, k, p, t), Fricatives (f, v, sh, s, th, dh, zh, z), Affricates (jh, ch)

Common Constructions
Ignore the hierarchy: solve as flat multiclass
A greedy approach: solve a multiclass problem at each node

Hierarchical Classifier
Associate a prototype w_v with each label (node) in the tree
Classification rule: predict the label whose root-to-node path attains the highest cumulative score

Hierarchical Classifier (cont.)
Define the composite prototype of label v as the sum of the prototypes along the path from the root to v
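The classification rule on these slides can be sketched as follows, assuming the tree is given by a parent map; the data layout and names are illustrative, not from the talk.

```python
# Nodes on the path from the root down to v (parent[root] is None).
def path_to_root(parent, v):
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return list(reversed(path))

# Score each candidate label by summing w_u . x over the prototypes u on its
# root-to-label path; predict the highest-scoring label.
def hier_predict(parent, prototypes, labels, x):
    def score(v):
        return sum(
            sum(wi * xi for wi, xi in zip(prototypes[u], x))
            for u in path_to_root(parent, v)
        )
    return max(labels, key=score)
```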

A Metric Over Labels
A given hierarchy defines a metric γ(a, b) over the set of labels via graph distance

From PA to Hieron
Replace the simple margin constraint with a tree-based margin constraint: the margin between the correct label and the predicted label must grow with their tree distance γ(correct, predicted)

Hieron Update (figures): only the prototypes along the paths of the correct and predicted labels (their symmetric difference, e.g. w6, w7, w10) are updated.

Sample Run on Synthetic Data The hierarchy given to the algorithm An edge indicates that prototypes are “close”

Experiments with Hieron
Datasets used (# train / # test / # labels / depth):
DMOZ (web pages): 8576 / 4-fold CV / 316 / 8
Speech (phonemes)
Synthetic data
Compared two models:
Hieron with knowledge of the correct hierarchy
Hieron without knowledge of the correct hierarchy (flat)

Experimental Results (DMOZ, Phoneme (TIMIT), Synthetic)
Each graph shows the difference between the error histograms of the two models
Hieron makes fewer "gross" mistakes
State-of-the-art results for frame-based phoneme classification