“Hello! My name is... Buffy”: Automatic Naming of Characters in TV Video. Mark Everingham, Josef Sivic and Andrew Zisserman. Presented by Arun Shyam

Objective To label television or movie footage with the identity of the people present in each frame of the video. This is a challenging problem owing to changes in scale, lighting, pose, hair style, etc. The approach is to employ readily available textual annotation for TV, in the form of subtitles and transcripts, to automatically assign the correct name to each face image.

Outline Three main parts: 1) Processing the subtitles and script to obtain proposals for the names of the characters in the video. 2) Processing the video to extract face tracks and accompanying descriptors, and to extract descriptors for clothing. 3) Combining the textual and visual information to assign labels to the detected faces in the video. Test data: two 40-minute episodes of the TV serial “Buffy the Vampire Slayer”.

Subtitle and Script Processing Subtitles are extracted using a simple OCR algorithm. The script is obtained from a fan site in HTML format. Subtitles record what is being said and when, but not by whom. The script tells who says what, but not when. What we need is who, what and when. The solution is to align the script and subtitles with a dynamic time warping algorithm: write the subtitle text vertically and the script text horizontally; the task is then to find a path from top-left to bottom-right which moves only forward through either text. The word-level alignment is then mapped back onto the original subtitle units.
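As a rough sketch of this alignment step (in Python, with a simple 0/1 word-match cost and pre-tokenised word lists as inputs; both are assumptions rather than the paper's exact formulation), the minimum-cost path can be found by dynamic programming and backtracked to recover word correspondences:

# Dynamic-time-warping-style alignment of subtitle words against script words.
# Moves are: advance in both texts (diagonal), advance in the subtitles only
# (down), or advance in the script only (right).
def align(subtitle_words, script_words):
    n, m = len(subtitle_words), len(script_words)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                # 0 if the two words match, 1 otherwise (hypothetical cost).
                match = 0.0 if subtitle_words[i - 1].lower() == script_words[j - 1].lower() else 1.0
                candidates.append((cost[i - 1][j - 1] + match, (i - 1, j - 1)))
            if i > 0:
                candidates.append((cost[i - 1][j] + 1.0, (i - 1, j)))  # skip a subtitle word
            if j > 0:
                candidates.append((cost[i][j - 1] + 1.0, (i, j - 1)))  # skip a script word
            cost[i][j], back[i][j] = min(candidates)
    # Backtrack from the bottom-right corner to recover aligned word pairs.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return list(reversed(pairs))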

Subtitle-Script Alignment

Face Detection and Tracking A frontal face detector is run on every frame of the video; this is more reliable than multi-view face or full-person detection. Any individual who appears in the video for any length of time generates a face track, that is, a sequence of face instances across time. A track provides multiple examples of the character’s appearance. Face tracks are obtained from a set of point tracks, each starting at some frame in the shot and continuing until some later frame. For a given pair of faces in different frames, the number of point tracks which pass through both faces is counted, and if this number is large relative to the number of point tracks which are not common to both faces, a match is declared.
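A minimal sketch of this linking rule, assuming point tracks stored as frame-to-position maps and an illustrative overlap ratio of 0.5 (the paper's exact criterion may differ):

# Link face detections in different frames when many point tracks pass
# through both. `point_tracks` maps a track id to {frame: (x, y)};
# a detection is (frame, (xmin, ymin, xmax, ymax)).
def tracks_through(box, frame, point_tracks):
    xmin, ymin, xmax, ymax = box
    ids = set()
    for tid, positions in point_tracks.items():
        if frame in positions:
            x, y = positions[frame]
            if xmin <= x <= xmax and ymin <= y <= ymax:
                ids.add(tid)
    return ids

def same_character(det_a, det_b, point_tracks, min_ratio=0.5):
    ids_a = tracks_through(det_a[1], det_a[0], point_tracks)
    ids_b = tracks_through(det_b[1], det_b[0], point_tracks)
    common = len(ids_a & ids_b)
    not_common = len(ids_a ^ ids_b)
    # Declare a match if the shared tracks dominate the non-shared ones.
    return common > min_ratio * max(not_common, 1)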

Facial Feature Localization The output of the face detector gives an approximate location and scale of the face; the facial features are then located within it. Nine facial features are located: the left and right corners of each eye, the two nostrils and the tip of the nose, and the left and right corners of the mouth. To locate the feature positions, a Gaussian mixture model is used in which the covariance of each component is restricted to form a tree structure, with each variable dependent on a single parent variable. This gives better performance under pose variation and poor lighting. The appearance of each facial feature is assumed independent of the other features and is modeled by a feature/non-feature classifier that uses a variant of the AdaBoost algorithm and Haar-like image features.
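The sketch below illustrates how a candidate configuration of the nine features might be scored under such a tree-structured model; the tree layout, the isotropic pairwise Gaussians and the stubbed appearance score are simplifying assumptions, not the paper's exact model:

import math

# Tree over the nine features: each entry maps a feature to its parent
# (None for the root). This particular tree is an illustrative assumption.
PARENT = {
    "nose_tip": None,
    "left_nostril": "nose_tip", "right_nostril": "nose_tip",
    "left_eye_outer": "nose_tip", "left_eye_inner": "left_eye_outer",
    "right_eye_outer": "nose_tip", "right_eye_inner": "right_eye_outer",
    "mouth_left": "nose_tip", "mouth_right": "mouth_left",
}

def log_gaussian(dx, dy, mean, sigma):
    # Isotropic Gaussian on the child-parent offset (a simplification of
    # the full tree-structured covariance).
    return -((dx - mean[0]) ** 2 + (dy - mean[1]) ** 2) / (2 * sigma ** 2) \
           - math.log(2 * math.pi * sigma ** 2)

def configuration_score(positions, appearance_score, offset_means, sigma=3.0):
    """positions: {feature: (x, y)}; appearance_score: callable returning the
    boosted classifier's log-odds at a position; offset_means: {feature: (dx, dy)}."""
    score = 0.0
    for feature, (x, y) in positions.items():
        score += appearance_score(feature, x, y)           # unary appearance term
        parent = PARENT[feature]
        if parent is not None:                             # pairwise tree term
            px, py = positions[parent]
            score += log_gaussian(x - px, y - py, offset_means[feature], sigma)
    return score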

Face and Feature Detection

Representing Face Appearance Descriptors of the local appearance of the face are computed around each of the located facial features. This gives robustness to pose variation, lighting and partial occlusion compared to a global face descriptor. Before extracting descriptors, the face region is normalized to reduce scale uncertainty and pose variation: an affine transformation maps the located facial feature points to a canonical set of feature positions. Two descriptors were investigated: (i) the SIFT descriptor, and (ii) a simple pixel-wise descriptor formed by taking a vector of normalized pixels to obtain local photometric invariance. The face descriptor is formed by concatenating the descriptors for each facial feature.
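A rough sketch of the simpler pixel-wise descriptor and the concatenation step, assuming a fixed patch size and zero-mean, unit-norm normalisation (illustrative choices only):

import numpy as np

def patch_descriptor(gray, x, y, half=10):
    """Extract a (2*half+1)^2 patch around a feature point and normalise it
    to zero mean and unit norm for local photometric invariance."""
    patch = gray[int(y) - half:int(y) + half + 1,
                 int(x) - half:int(x) + half + 1].astype(np.float64)
    patch = patch - patch.mean()
    norm = np.linalg.norm(patch)
    return (patch / norm).ravel() if norm > 0 else patch.ravel()

def face_descriptor(gray, feature_points):
    """Concatenate per-feature descriptors into a single face descriptor.
    `feature_points` is the list of nine (x, y) locations after affine
    normalisation of the face region."""
    return np.concatenate([patch_descriptor(gray, x, y) for x, y in feature_points])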

Representing Clothing Appearance Matching faces alone is sometimes very difficult because of differences in expression, pose, lighting or motion blur. Additional cues to matching identity can be derived by representing the appearance of the clothing. For each face detection, a bounding box which is expected to contain the clothing of the corresponding character is predicted relative to the position and scale of the face detection. A color histogram of that bounding box is computed as its descriptor, in the YCbCr color space, which de-correlates the color components better than RGB. Note that while similar clothing appearance suggests the same character, observing different clothing does not necessarily imply a different character.
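A minimal sketch of such a clothing descriptor; the box offsets below the face, the bin count and the plain normalised histogram are illustrative assumptions:

import numpy as np

def clothing_descriptor(frame_rgb, face_box, bins=8):
    """Predict a clothing box below a detected face and describe it with a
    YCbCr colour histogram. The box offsets (half a face-height below the
    face, 1.5 face-widths wide) are illustrative assumptions."""
    xmin, ymin, xmax, ymax = face_box
    fh, fw = ymax - ymin, xmax - xmin
    H, W = frame_rgb.shape[:2]
    x0, x1 = max(0, xmin - fw // 2), min(W, xmax + fw // 2)
    y0, y1 = min(H, ymax + fh // 2), min(H, ymax + 2 * fh)
    region = frame_rgb[y0:y1, x0:x1].astype(np.float64)
    if region.size == 0:
        return np.zeros(bins ** 3)
    r, g, b = region[..., 0], region[..., 1], region[..., 2]
    # Standard RGB -> YCbCr (BT.601) conversion.
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    ycbcr = np.stack([y, cb, cr], axis=-1).reshape(-1, 3)
    hist, _ = np.histogramdd(ycbcr, bins=bins, range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum() if hist.sum() > 0 else hist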

Clothing Appearance Aid

Speaker Detection The naming proposed by the combined subtitles and script is highly ambiguous for a given frame because: (i) several detected faces may be present and we do not know which one is speaking; (ii) it might be a reaction shot, in which the camera shows someone other than the speaker. This ambiguity can be resolved using a visual cue, namely movement of the lips. A rectangular mouth region within each face detection is identified using the located mouth corners, and the mean squared difference of the pixel values within the region is computed between the current and previous frame. If the difference is above a high threshold, the face detection is classified as ‘speaking’; if it is below a low threshold, it is classified as ‘non-speaking’; if it lies in between, we ‘refuse to predict’.
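A small sketch of this three-way decision, assuming grayscale frames and illustrative threshold values (the actual thresholds are not given on the slide):

import numpy as np

SPEAKING, NON_SPEAKING, REFUSE = "speaking", "non-speaking", "refuse"

def classify_speaking(prev_gray, curr_gray, mouth_box,
                      low_thresh=20.0, high_thresh=80.0):
    """Classify a face detection as speaking / non-speaking / refuse-to-predict
    from inter-frame change in the mouth region. The two thresholds are
    illustrative assumptions."""
    x0, y0, x1, y1 = mouth_box  # rectangle derived from the located mouth corners
    prev = prev_gray[y0:y1, x0:x1].astype(np.float64)
    curr = curr_gray[y0:y1, x0:x1].astype(np.float64)
    msd = np.mean((curr - prev) ** 2)
    if msd > high_thresh:
        return SPEAKING
    if msd < low_thresh:
        return NON_SPEAKING
    return REFUSE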

Speaker Detection Ambiguity

Lip Movement Detection

Classification by Exemplar Sets Tracks for which a single identity is proposed are treated as exemplars with which to label the other tracks which have no, or uncertain, proposed identity. Each unlabeled face track F is represented as a set of face and clothing descriptors {f, c}. Exemplar sets {λi} have the same representation but are associated with a particular name. For a given track F, the quasi-likelihood that the face corresponds to a particular name λi is: p(F|λi) = (1/Z) · exp{−df(F,λi)²/2σf²} · exp{−dc(F,λi)²/2σc²}

Classification by Exemplar Sets The face distance df(F,λi) is defined as the minimum distance between the face descriptors in F and those in the exemplar tracks λi. The clothing distance dc(F,λi) is defined analogously. The quasi-likelihoods for each name λi are combined to obtain a posterior probability of the name by assuming equal priors on the names and applying Bayes’ rule: P(λi|F) = p(F|λi) / Σj p(F|λj). By thresholding the posterior, a “refusal to predict” mechanism is implemented: faces for which the certainty of naming does not reach some threshold are left unlabeled, which decreases the recall of the method but improves the accuracy of the labeled tracks.
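Putting the two slides together, a minimal sketch of the exemplar-based classification might look as follows; the sigma values and the posterior threshold are illustrative assumptions:

import numpy as np

def min_distance(track_descs, exemplar_descs):
    """Minimum pairwise Euclidean distance between two sets of descriptors."""
    return min(np.linalg.norm(a - b) for a in track_descs for b in exemplar_descs)

def name_posterior(track_faces, track_clothes, exemplars,
                   sigma_f=1.0, sigma_c=1.0):
    """Posterior over names for one face track. `exemplars` maps a name to a
    (face_descriptors, clothing_descriptors) pair. Sigma values are assumed."""
    quasi = {}
    for name, (ex_faces, ex_clothes) in exemplars.items():
        df = min_distance(track_faces, ex_faces)
        dc = min_distance(track_clothes, ex_clothes)
        quasi[name] = np.exp(-df ** 2 / (2 * sigma_f ** 2)) * \
                      np.exp(-dc ** 2 / (2 * sigma_c ** 2))
    z = sum(quasi.values())   # equal priors, Bayes' rule
    return {name: q / z for name, q in quasi.items()} if z > 0 else quasi

def label_track(posterior, threshold=0.8):
    """Refuse to predict when the best posterior is below the threshold
    (the threshold value is an assumption)."""
    name, p = max(posterior.items(), key=lambda kv: kv[1])
    return name if p >= threshold else None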

Results Speaker detection labels around 25% of face tracks with around 90% accuracy. No manual annotation of any data is performed, other than the ground-truth label for each face track used to evaluate the method. Recall here means the proportion of tracks which are assigned a name after applying the “refusal to predict” mechanism. Two baseline methods were compared to the proposed method: (i) “Prior” – label all tracks with the name which occurs most often in the script, i.e. Buffy (accuracy %); (ii) “Subtitles only” – label tracks directly with the names proposed by the aligned subtitles and script (accuracy 45%). Using the proposed method, if forced to assign a name to all face tracks, the accuracy obtained is around 69% in both episodes. Requiring only 80% of tracks to be labeled increases the accuracy to around 80%.

Results

Conclusion Promising results are obtained without any supervision beyond the readily available textual annotation. The detection method and appearance models could be improved by including additional weak cues such as hair or eye color, and by using a specific body tracker rather than a generic point tracker in cases where face detection is very difficult.