Visual Sense Disambiguation: A Multimodal Approach PhD thesis by Kate Saenko Computer Science and AI Lab Massachusetts Institute of Technology Advisor: Trevor Darrell

PhD defense by Kate Saenko, slide 2: The challenge of large scale object recognition
How to get examples of 10,000+ categories?
– Collection of training images is time-consuming, subjective
– But the Web has billions of images!
How to build precise models based on unlabeled image data?
How to learn visual models on the fly, based on user input?

Slide 3: Multimodal context
[Diagram labels: speech, text, image, collective knowledge]

Slide 4: Main Contributions
Proposed a method that combines images and spoken utterances to learn object models
Developed an unsupervised approach that learns visual models from unlabeled images, text, and dictionaries
[Examples shown: the spoken utterance "This is a bag…" and the web text "… The Tote is the perfect example of two handbag design principles that... The lines of this tote are incredibly sleek, but... The semi buckles that form the handle attachments are..."]

Slide 5: "This is a bag"

Slide 6: [Dictionary entry and web text for "bag"]
Noun
– bag, container (a flexible container with a single opening)
– bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories (especially by women))
– bag, travelling bag, suitcase (a rectangular container for carrying clothes)
Web text for "bag": "… The Tote is the perfect example of two handbag design principles that... The lines of this tote are incredibly sleek, but... The semi buckles that form the handle attachments are..."

Slide 7: Outline
Audio-visual object recognition
– Related work
– Fusion model and experiments*
Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects
*See Saenko and Darrell. Object category recognition using probabilistic fusion of speech and image classifiers. MLMI, 2007.

Slide 8: Audio-visual object recognition
[Diagram labels: speech, text, image, dictionary]

Slide 9: Task: object recognition with audio-visual input*
[Diagram labels: Speech recognizer, Speech DB, output label "lamp"]
*E.g. the BIRON robot; see S. Li and B. Wrede. "Why and how to model multi-modal interaction for a mobile robot companion," in Proc. AAAI, 2007.

Slide 10: Speech, image can be ambiguous…
[Examples shown: "a pan..." / "That's a pen!"; "Copy machine.."]
[Confusions shown: ant → fan, face → bass, piano → cannon]

Slide 11: Proposal: use both channels to help disambiguate underlying word
[Diagram label: object recognizer]

Slide 12: Fusion of speech and image classifiers
[Diagram labels: Speech recognizer, Speech DB, Image Classifier, Image DB, Object Classifier, output label "lamp"]
Improve existing method by using both modalities
Explore late fusion of classifier outputs
– Mean rule
– Product rule
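
The two late-fusion rules named on this slide can be sketched as follows. This is a minimal illustration, assuming each classifier outputs a posterior over the same label set; the function names and the example posteriors are hypothetical and are not taken from the thesis.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def fuse(speech, image, rule="product"):
    """Late fusion of two per-class posteriors (hypothetical helper).

    'mean' averages the two posteriors; 'product' multiplies them,
    which amounts to treating the two modalities as independent.
    """
    labels = speech.keys() & image.keys()
    if rule == "mean":
        fused = {c: (speech[c] + image[c]) / 2 for c in labels}
    elif rule == "product":
        fused = {c: speech[c] * image[c] for c in labels}
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return normalize(fused)


# Made-up posteriors: speech slightly prefers "pan", the image prefers "pen".
speech_post = {"pan": 0.5, "pen": 0.4, "fan": 0.1}
image_post = {"pan": 0.1, "pen": 0.6, "fan": 0.3}

fused_product = fuse(speech_post, image_post, rule="product")
fused_mean = fuse(speech_post, image_post, rule="mean")
best_product = max(fused_product, key=fused_product.get)
best_mean = max(fused_mean, key=fused_mean.get)
```

In this toy case both rules pick "pen" even though the speech channel alone would have picked "pan", which is the kind of cross-modal correction the slide proposes.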

Slide 13: Experiments with 101 objects
Asked users to speak object name for Caltech101, added noise
Plot shows benefit from fusion across noise levels

Slide 14: Remaining issues…
[Diagram label: object recognizer]

Slide 15: Unsupervised object models
[Diagram labels: speech, text, image, dictionary]

Slide 16: Next
Audio-visual object recognition
– Related work
– Fusion model and experiments
Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects

Slide 17: How can we learn a rich variety of visual concepts?

Slide 18: Image Sense Disambiguation
[Image captions for "watch": Would rather watch…; Suicide watch; Hurricane, tornado watch; Watch out!; Celebrity watch]

Slide 19: Text contexts
[Example web text:] icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website

Slide 20: Latent Dirichlet allocation (LDA) (Blei et al. '03)
One of several techniques for discovering latent dimensions in bag-of-words data
Topic 1: rolex service repair battery omega replica tag heuer breitling swiss replace gucci button price band …
Topic 2: new world media right said house april obama islam march bush war american time …
[Figure: LDA plate diagram with variables α, θ, z, w, β, φ and plates K, M, N_d; each topic has a word distribution P(w|z), each document has a topic distribution P(z|d)]
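
The two distributions in the diagram, P(w|z) and P(z|d), combine into a document's word distribution by summing over topics. The sketch below illustrates only that mixture step, not LDA inference; the four-word vocabulary reuses words from the two example topics, and every probability value is made up for illustration.

```python
# Hypothetical P(w | z) for two topics over a tiny vocabulary.
# The words come from the slide's example topics; the numbers do not.
word_given_topic = {
    "topic_1": {"rolex": 0.60, "battery": 0.30, "obama": 0.05, "war": 0.05},
    "topic_2": {"rolex": 0.05, "battery": 0.05, "obama": 0.50, "war": 0.40},
}

# Hypothetical P(z | d) for one document that is mostly about timepieces.
topic_given_doc = {"topic_1": 0.9, "topic_2": 0.1}


def word_distribution(word_given_topic, topic_given_doc):
    """Mixture of topics: P(w | d) = sum over z of P(w | z) * P(z | d)."""
    vocabulary = next(iter(word_given_topic.values())).keys()
    return {
        w: sum(word_given_topic[z][w] * topic_given_doc[z] for z in topic_given_doc)
        for w in vocabulary
    }


word_given_doc = word_distribution(word_given_topic, topic_given_doc)
```

Fitting the topics themselves requires an inference procedure such as variational Bayes or Gibbs sampling, which is outside the scope of this sketch.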

Slide 21: Latent Topics
[The example web text from slide 19, shown again]

Slide 22: Overview of approaches to web-based object model learning
Some learn only from image features
– (Li et al. 07) bootstrap from labeled images
– (Fergus et al. 05) select correct image topic
Some incorporate text features
– (Schroff et al. 07) use a category-independent text classifier
– (Berg and Forsyth 06) ask user to sort text topics
None address polysemy directly
– (Loeff et al. 06) do image sense discrimination, not identification
All rely on labeled images of correct sense

Slide 23: Next
Audio-visual object recognition
– Related work
– Fusion model and experiments
Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model*
– Concrete WISDOM: identifying tangible objects
*See Saenko and Darrell. Unsupervised learning of visual sense models for polysemous words. NIPS 2008.

Slide 24: How can we ground image senses in the absence of labeled examples?
WORDNET: Noun
– S: (n) watch, ticker (a small portable timepiece)
– S: (n) watch (a period of time (4 or 2 hours) during which some of a ship's crew are on duty)
– S: (n) watch, vigil (a purposeful surveillance to guard or observe)
– S: (n) watch (the period during which someone (especially a guard) is on duty)
– S: (n) lookout, lookout man, sentinel, sentry, watch, spotter, scout, picket (a person employed to keep watch for some anticipated event)
– S: (n) vigil, watch (the rite of staying awake for devotional purposes (especially on the eve of a religious festival))
WIKIPEDIA: Watch may also refer to:
– Watch system, a period of work duty
– Tropical cyclone warnings and watches, alerts issued to coastal areas threatened by severe storms
– Watch (Unix), a Unix command
– Watch (TV channel), a TV station launching in Autumn 2008
– Watch (computer programming)
– Help:Watching pages on Wikipedia
– Watch (dog), name of the pet dog in the Boxcar Children
D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL, 1995.

Slide 25: Web Image Sense DictiOnary Model (WISDOM)
[Diagram labels: noun; dictionary definitions; unlabeled text; dictionary model P(sense | data); web images; training images; sense-specific classifier]
[Example search results for "watch":]
– Search Engine Watch: "Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search..." (searchenginewatch.com/)
– watch - MDC: "Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in..." (developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch)
[Example web image: "fosil wrist watch", 800 x 628, 107k, jpg, amgmedia.com, labeled watch-1 (ticker)]
WISDOM does:
1. image sense disambiguation
2. dataset collection
3. classification of unseen images

Slide 26: WISDOM: Using dictionary entries to ground senses
Use entry text to learn a probability distribution over words for that sense
Problem: entries contain very little text
– Expand by adding synonyms, example sentences, etc.
– Still, very few words are covered!
[Example WordNet entry:]
S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
direct hyponym / full hyponym
– S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
– S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
– S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
– S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
– S: (n) wood mouse (any of various New World woodland mice)
direct hypernym / inherited hypernym / sister term
– S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)

Slide 27: WISDOM: Probabilistic dictionary-based model
Main idea:
– Using LDA, learn latent sense-like dimensions on a large amount of related text
– Model dictionary senses in LDA space: map image contexts to topics; map topics to senses
[Diagram: unlabeled text (the "watch" search-result snippets, e.g. Search Engine Watch and watch - MDC) fed into LDA]

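Slide 28: WISDOM sense model
Given a query word with sense s taking values in {1, …, S}, and a text document d, the probability of sense is [equation not preserved in this transcript]
[Figure: graphical model with nodes d, z, s and plate N]
Define the likelihood of topic z given sense s with entry words e_s = w_1, …, w_{E_s} as [equation not preserved in this transcript]
To compute probability of sense given topic [equation not preserved in this transcript]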
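
Because the slide's equations did not survive in this transcript, the following is only one plausible reading of the verbal description, not the thesis's exact formulas. It assumes (1) P(z|s) is the average of P(z|w) over the sense's dictionary entry words, (2) P(s|z) follows from Bayes' rule with a uniform prior over senses, and (3) P(s|d) marginalizes over the document's topics. All names and numbers are hypothetical.

```python
def topic_given_sense(entry_words, topic_given_word):
    """Assumed form: P(z | s) = average of P(z | w) over known entry words."""
    known = [topic_given_word[w] for w in entry_words if w in topic_given_word]
    if not known:
        raise ValueError("no entry word is covered by the topic model")
    num_topics = len(known[0])
    return [sum(p[z] for p in known) / len(known) for z in range(num_topics)]


def sense_given_topic(topic_given_senses):
    """Bayes' rule with an assumed uniform prior: P(s | z) proportional to P(z | s)."""
    senses = list(topic_given_senses)
    num_topics = len(topic_given_senses[senses[0]])
    table = []
    for z in range(num_topics):
        total = sum(topic_given_senses[s][z] for s in senses)
        table.append({s: topic_given_senses[s][z] / total for s in senses})
    return table


def sense_given_doc(sense_given_topics, topic_given_doc):
    """Marginalize over topics: P(s | d) = sum over z of P(s | z) * P(z | d)."""
    senses = sense_given_topics[0].keys()
    return {
        s: sum(sense_given_topics[z][s] * p_z for z, p_z in enumerate(topic_given_doc))
        for s in senses
    }


# Made-up P(z | w) over two topics for a few dictionary entry words.
topic_given_word = {
    "timepiece": [0.9, 0.1],
    "portable": [0.7, 0.3],
    "guard": [0.2, 0.8],
    "surveillance": [0.1, 0.9],
}
entries = {
    "watch-1 (ticker)": ["timepiece", "portable"],
    "watch-3 (vigil)": ["guard", "surveillance"],
}
per_sense = {s: topic_given_sense(ws, topic_given_word) for s, ws in entries.items()}
posterior = sense_given_doc(sense_given_topic(per_sense), topic_given_doc=[0.9, 0.1])
best_sense = max(posterior, key=posterior.get)
```

With these toy numbers a document dominated by the first topic is assigned to the timepiece sense.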

Slide 29: WISDOM: Incorporating Image Features
Use LDA to discover visual topics v = 1, …, L
Then estimate the conditional probability P(s|v)
Given a test image d_i*, we can compute [equation not preserved in this transcript]
Combine contributions of image and text: [equation not preserved in this transcript]
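
The image-side computation can be sketched in the same spirit. The slide's formulas are missing from the transcript, so this assumes that the image posterior marginalizes over visual topics and that image and text posteriors are combined by a normalized product; the actual combination rule in the thesis may differ. All values below are invented for illustration.

```python
def sense_given_image(sense_given_visual, visual_given_image):
    """Assumed form: P(s | image) = sum over v of P(s | v) * P(v | image)."""
    senses = next(iter(sense_given_visual.values())).keys()
    return {
        s: sum(sense_given_visual[v][s] * p_v for v, p_v in visual_given_image.items())
        for s in senses
    }


def combine(image_post, text_post):
    """Assumed combination rule: product of the two posteriors, renormalized."""
    product = {s: image_post[s] * text_post[s] for s in image_post}
    total = sum(product.values())
    return {s: value / total for s, value in product.items()}


# Hypothetical P(s | v) for two visual topics and P(v | image) for one test image.
sense_given_visual = {
    "v0": {"ticker": 0.9, "vigil": 0.1},
    "v1": {"ticker": 0.3, "vigil": 0.7},
}
visual_given_image = {"v0": 0.8, "v1": 0.2}

image_post = sense_given_image(sense_given_visual, visual_given_image)
text_post = {"ticker": 0.6, "vigil": 0.4}  # e.g. from the text-based sense model
combined = combine(image_post, text_post)
```

When the two modalities agree, as here, the product rule yields a more confident posterior than either modality alone.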

Slide 30: WISDOM classifier
[Diagram labels: noun; dictionary definitions; unlabeled text; dictionary model P(sense | data); web images; training images; SVM classifier]
[Same example search results and example web image as on slide 25]

Slide 31: Evaluation datasets
Collected by querying Image Search
– MIT-ISD: bass, face, mouse, speaker, watch
– MIT-OFFICE: cellphone, fork, hammer, keyboard, mug, pliers, scissors, stapler, telephone, watch
– UIUC-ISD: bass, crane, squash
[Example image labels: core, related, unrelated, ???]

Slide 32: Experimental Setup
1. Task: image sense disambiguation (ISD) in search results
– Separate images according to visual sense
– "core" labels are positive class, "related" and "unrelated" negative
– Metrics: true positives vs. false positives (ROC), recall-precision curve (RPC)
2. Task: object classification in a novel image
– Classify image as having correct object category or not
– "core" labels are positive class, other keywords' "core" senses are negative class
– Metric: percent correct
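
The two ranking metrics used for the ISD task can be computed from a scored list as sketched below. This is a generic illustration of ROC and recall-precision points, assuming higher scores mean "more likely the core sense"; the scores and labels are made up, and tied scores are not treated specially.

```python
def ranked_curves(scores, labels):
    """Return ROC points (FPR, TPR) and RPC points (recall, precision).

    labels: 1 for the positive class (e.g. "core"), 0 for negatives.
    One point is emitted after each item in descending score order.
    Requires at least one positive and one negative label.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    roc, rpc = [], []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))
        rpc.append((tp / n_pos, tp / (tp + fp)))
    return roc, rpc


# Hypothetical sense scores for five search results; 1 marks a "core" image.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
roc, rpc = ranked_curves(scores, labels)
```

Each curve in the result slides corresponds to one such list of points, computed once for the search engine's original ranking and once for the model's ranking.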

Slide 33: ISD example results
[Image panels: bass: raw web image data; bass: fish; bass: musical instrument; squash: raw web image data; squash: vegetable; squash: sports]

Slide 34: ISD Results: ROC using each WordNet sense for BASS
[Figure: ROC curves, true positive rate vs. false positive rate. Legend: yahoo; musical range; polyph. range; male singer; sea bass; freshwater bass; basso, voice; instrument; spiny fish]

Slide 35: ISD Results: RPC using true sense
[Figure: recall-precision curves for retrieval of core senses on UIUC-ISD. Legend: yahoo, wisdom]

Slide 36: Results: object classification
Baseline approach:
– Automatically generate sense-specific keywords from WordNet
– Append word to synonyms and direct hypernyms
– Limit queries to 3 terms
– Example: mouse + computer, mouse + electronic device
Plot shows average accuracy across five objects in the MIT-ISD dataset (each is a two-class problem with chance performance of 50%)
[Figure: accuracy (55% to 85%) vs. number of training images (50 to 300). Legend: baseline, wisdom]
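
The baseline's query generation can be sketched roughly as below. The slide does not say how a synonym that already contains the query word is handled; this sketch assumes the duplicate word is dropped, which reproduces the slide's "mouse + computer" example from the synonym "computer mouse". The function name and inputs are hypothetical.

```python
def sense_queries(word, synonyms, hypernyms, max_terms=3):
    """Build sense-specific search queries for one dictionary sense.

    Each synonym or direct hypernym is appended to the query word.
    Queries longer than max_terms words are discarded.
    """
    queries = []
    for phrase in synonyms + hypernyms:
        # Assumption: drop the query word itself if the phrase repeats it.
        extra = [term for term in phrase.split() if term != word]
        if not extra:
            continue
        terms = [word] + extra
        query = " ".join(terms)
        if len(terms) <= max_terms and query not in queries:
            queries.append(query)
    return queries


# Synonyms and direct hypernym as listed for the computer-mouse sense.
queries = sense_queries(
    "mouse",
    synonyms=["mouse", "computer mouse"],
    hypernyms=["electronic device"],
)
```

The resulting queries, "mouse computer" and "mouse electronic device", match the two examples on the slide.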

Slide 37: Next
Audio-visual object recognition
– Related work
– Fusion model and experiments
Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects*
*See Saenko and Darrell. Filtering Abstract Senses From Image Search Results. NIPS 2009.

Slide 38: Removing Abstract Senses
[Diagram: Query Word "cup" → Online Dictionary (noun "cup") → Concrete WISDOM → labeled senses]
– cup (a small open container usually used for drinking; usually has a handle) "he put the cup back in the saucer"; "the handle of the cup was missing". Object Sense: drinking container
– cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) "the school kept the cups is a special glass case". Object Sense: loving cup (trophy)
– a major sporting event or competition "the world cup", "the Stanley cup". Abstract Sense: sporting event

Slide 39: How can we identify abstract senses?
Mouse: Noun
– S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
– S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
– S: (n) mouse (person who is quiet or timid)
– S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen…)
Idea: use the ontological information available via WordNet
– semantic relations between concepts (hypernym, part, etc.)
– lexical tags:
[Diagram labels: mouse, rodent, beaver, mammal, cow, …]
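
One way to use hypernym relations for this purpose is to walk up each sense's hypernym chain and check whether it reaches a node that denotes physical things. The sketch below does this on a tiny hand-made graph for the "cup" senses; the graph is not real WordNet data, and the criterion (reaching "physical entity") is an assumption rather than the rule used in the thesis.

```python
# Hand-made toy hypernym links (child -> parent); NOT actual WordNet data.
TOY_HYPERNYM = {
    "cup (drinking container)": "container",
    "cup (trophy)": "vessel",
    "cup (sporting event)": "competition",
    "vessel": "container",
    "container": "artifact",
    "artifact": "physical entity",
    "competition": "event",
    "event": "abstraction",
}


def hypernym_chain(sense, hypernym_of):
    """Follow child -> parent links until the root (or a cycle) is reached."""
    chain = []
    node = sense
    while node in hypernym_of and hypernym_of[node] not in chain:
        node = hypernym_of[node]
        chain.append(node)
    return chain


def is_concrete(sense, hypernym_of, concrete_root="physical entity"):
    """Assumed criterion: a sense is concrete if its chain reaches concrete_root."""
    return concrete_root in hypernym_chain(sense, hypernym_of)


senses = ["cup (drinking container)", "cup (trophy)", "cup (sporting event)"]
concrete_senses = [s for s in senses if is_concrete(s, TOY_HYPERNYM)]
```

On this toy graph the drinking container and the trophy are kept, while the sporting event is filtered out as abstract.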

Slide 40: Experimental Setup
[Table: Concrete Senses Identified by WISDOM]
Task: ISD using concrete-sense WISDOM
– all "core" and "related" labels of keyword are positive class, "unrelated" labels are negative class

Slide 41: Results: Filtering visual senses
[Image results shown for: Yahoo Search: "telephone"]
DICTIONARY
1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)
2: (n) telephone, telephony (transmitting speech at a distance)

Slide 42: Results: Filtering visual senses
[Image results shown for: Artifact sense: "telephone"]
DICTIONARY
1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)
2: (n) telephone, telephony (transmitting speech at a distance)

Slide 43: Results: RPC of all concrete senses
[Figure: recall-precision curves for retrieval of core+related concrete senses on UIUC-ISD. Legend: yahoo, wisdom]

Slide 44: Further Improvement: Topic adaptation
Original LDA topics are learned on text-only unlabeled data
Adapt to image-text data via semi-supervised Gibbs sampling
E.g., one of the "fork" topics (two word lists shown):
– product bike null tool tube seal set price oil knife spoon spring ship use item accessory handle shop order remove store custom home weight steel supply cap clamp fit false...
– cutlery knife spoon product set price handle steel tool item stainless null bike tube seal oil knive kitchen utensil ship order use table spring supply design piece carve weight shop...

Slide 45: "fork": using original topics
[Image labels: unrelated: fork lift, road fork, bike fork, knife, …]

Slide 46: "fork": using adapted topics
[Image labels: unrelated: fork lift, road fork, bike fork, knife, …]

Slide 47: Results on MIT-OFFICE
The average area under the RPC improves from 0.47 to 0.57
[Figure: detailed RPCs. Legend: yahoo, wisdom]

Slide 48: Conclusions
Showed that combining speech with image input may be advantageous for object recognition
Presented WISDOM, an unsupervised method to learn sense-specific object models from images and text harvested from the web
Extended WISDOM to filter out non-physical word senses based on WordNet semantic structure

Slide 49: Future work: WISDOM-enabled interactive training
[Diagram labels: speech, text, image, dictionary, WISDOM, supervised classifier]
