Using Webcast Text for Semantic Event Detection in Broadcast Sports Video IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 7, NOVEMBER 2008.


Outline
- Framework Review
- Web-Cast Text Analysis
  - Event detection from webcast text using pLSA
- Text/Video Alignment
  - Semantic Alignment using CRFM
  - Feature Extraction
  - Event Moment Detection
  - Event Boundary Detection
- Experimental Results

Web-Cast Text Analysis A simple way to detect events from webcast text is to match predefined keywords against related words in the text

Web-Cast Text Analysis
1) Since event types differ across sports games, keywords must be predefined separately for each sport
2) Even within the same sport (e.g., soccer), differences in presentation style and language make it difficult to represent one event type with a single predefined keyword
3) Since an event is described at the sentence level, a single keyword is not enough for robust event detection and may sometimes cause incorrect detections

Web-Cast Text Analysis Based on our observation, descriptions of the same event in webcast text share similar sentence structure and word usage We can therefore employ an unsupervised approach that first clusters the descriptions into groups corresponding to events, and then extracts keywords from the descriptions in each group for event detection

Web-Cast Text Analysis Here we apply probabilistic latent semantic analysis (pLSA) for text event clustering and detection Compared with LSA, pLSA is based on a mixture decomposition derived from a latent class model, which gives it a solid statistical foundation

LSA example

Web-Cast Text Analysis pLSA applies a latent variable model for co-occurrence data, called the aspect model, which associates an unobserved class variable z with each observation

Web-Cast Text Analysis In webcast text, each sentence-level description of an event is considered as a document, and all the documents are represented as a collection D = {d_1, ..., d_N} with words drawn from a vocabulary W = {w_1, ..., w_M}

Web-Cast Text Analysis Based on this representation, webcast text can be summarized in an N x M co-occurrence table of counts N_ij = n(d_i, w_j), where n(d_i, w_j) denotes how often the word w_j occurs in document d_i

Web-Cast Text Analysis A joint probability model p(w, d) over the N x M table is defined by the mixture

p(w, d) = Σ_z p(z) p(w|z) p(d|z)

where p(w|z) and p(d|z) are the class-conditional probabilities of a specific word/document conditioned on the unobserved class variable z, respectively

The Expectation Maximization (EM) algorithm is applied to estimate p(z), p(w|z), and p(d|z) in the latent variable model

The EM algorithm consists of two steps: 1) Expectation step, which computes the posterior probability of the latent variable:

p(z|d, w) = p(z) p(d|z) p(w|z) / Σ_z' p(z') p(d|z') p(w|z')

2) Maximization step, which re-estimates the parameters from the expected counts:

p(w|z) ∝ Σ_d n(d, w) p(z|d, w)
p(d|z) ∝ Σ_w n(d, w) p(z|d, w)
p(z) ∝ Σ_d Σ_w n(d, w) p(z|d, w)
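The two EM steps above can be sketched in NumPy. This is a minimal pLSA fitter for illustration, not the authors' implementation; it assumes the standard symmetric parameterization p(w, d) = Σ_z p(z) p(w|z) p(d|z):

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0):
    """Fit a pLSA aspect model to an N x M document-word count matrix via EM."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    p_z = np.full(K, 1.0 / K)                 # p(z)
    p_d_z = rng.random((N, K))                # p(d|z): each column is a distribution
    p_d_z /= p_d_z.sum(axis=0, keepdims=True)
    p_w_z = rng.random((M, K))                # p(w|z): each column is a distribution
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(z|d,w) for every (d, w) pair, shape N x M x K
        joint = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate parameters from expected counts n(d,w) * p(z|d,w)
        expected = counts[:, :, None] * post
        p_d_z = expected.sum(axis=1)
        p_d_z /= np.maximum(p_d_z.sum(axis=0, keepdims=True), 1e-12)
        p_w_z = expected.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=0, keepdims=True), 1e-12)
        p_z = expected.sum(axis=(0, 1))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z
```

For webcast text, `counts` would be the sentence-by-word co-occurrence table n(d_i, w_j) built in the preceding slide.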

Web-Cast Text Analysis Before applying pLSA, the webcast text corpus is preprocessed to filter out the names of players and teams, since they would bias the analysis result For example, "Rasheed Wallace shooting foul (Shaquille O'Neal draws the foul)" is processed into "shooting foul draws foul"

By applying pLSA, an N x K matrix of class-conditional probabilities p(z_k|d_i) is obtained:
N: the total number of documents in the webcast text
K: the total number of topics, corresponding to the event categories in the webcast text

Web-Cast Text Analysis We determine the optimal number of event categories K and cluster the documents into these categories (C_1, C_2, ..., C_K) in an unsupervised manner Each document is assigned to the category with the maximum class-conditional probability:

c(d_i) = argmax_k p(z_k|d_i)

Web-Cast Text Analysis Each document can be represented by its class-conditional probability distribution over the K categories as a K-dimensional feature vector

Web-Cast Text Analysis To measure the similarity between two documents in the K-dimensional latent semantic space, a similarity function is defined as follows:

Sim(d_i, d_j) = Σ_k v_i^(k) v_j^(k) / ( sqrt(Σ_k (v_i^(k))^2) · sqrt(Σ_k (v_j^(k))^2) )

where v_i^(k) and v_j^(k) are the kth components of the feature vectors of d_i and d_j, respectively

Web-Cast Text Analysis In the context of document clustering in our work, documents within the same cluster should have maximum similarity, while documents in different clusters should have minimum similarity
n_i: the number of documents in category C_i

Web-Cast Text Analysis The optimal number of event categories is determined by minimizing the value of this clustering criterion
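A minimal sketch of this model-selection step, assuming cosine similarity over the K-dimensional latent vectors and an inter/intra similarity ratio as the criterion to minimize (the slides omit the exact formula, so both choices here are illustrative):

```python
import numpy as np

def cosine_sim(u, v):
    """Similarity of two documents in the K-dimensional latent space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def cluster_quality(vectors, labels):
    """Ratio of mean inter-cluster to mean intra-cluster similarity.
    Smaller is better, matching the 'minimize' criterion; this ratio is
    one plausible choice, not necessarily the paper's exact objective."""
    intra, inter = [], []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_sim(vectors[i], vectors[j])
            (intra if labels[i] == labels[j] else inter).append(s)
    return (np.mean(inter) if inter else 0.0) / (np.mean(intra) + 1e-12)
```

In use, one would fit pLSA for each K in [2, 15], cluster the documents, and keep the K that minimizes `cluster_quality`.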

Web-Cast Text Analysis In sports games, an event is an occurrence that is significant and essential to the progress of the game; hence the number of event categories recorded in webcast text is normally less than 15, i.e., K ranges from 2 to 15

Web-Cast Text Analysis The figure illustrates the similarity matrix of the clustered soccer webcast text collected from Yahoo Match Cast Central, which contains 460 documents The documents are sorted by their category number

Web-Cast Text Analysis After all the documents have been clustered into their categories, the words of the documents in each category are ranked by their class-conditional probability p(w|z) The top-ranked word in each category is selected as the keyword representing that event type

Web-Cast Text Analysis After the keywords for all categories are identified, text events can be detected by finding the sentences that contain the keywords and analyzing the context information in the description, e.g.:
Kobe Bryant makes 20-foot jumper
Marcus Camby misses 15-foot jumper

Web-Cast Text Analysis Without context information, the precision of "jumper" event detection drops from 89.3% to 51.7% on our test dataset Therefore, we search for both the keyword and its adjacent words to ensure event detection accuracy
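A toy sketch of keyword-plus-context matching for the "jumper" examples above; the rule table is hand-written here purely for illustration, whereas the paper derives its keywords from the pLSA clusters:

```python
import re

# Hypothetical keyword -> adjacent context-word rules; in the paper these
# would come from the clustered webcast text, not be hard-coded.
EVENT_RULES = {
    "jumper": {"makes": "made_jumper", "misses": "missed_jumper"},
}

def detect_event(sentence):
    """Detect an event by matching a keyword together with its adjacent words."""
    words = re.findall(r"[a-z\-]+", sentence.lower())
    for i, w in enumerate(words):
        if w in EVENT_RULES:
            # scan a few words on either side of the keyword for context
            for ctx in words[max(0, i - 3):i] + words[i + 1:i + 4]:
                if ctx in EVENT_RULES[w]:
                    return EVENT_RULES[w][ctx]
            return None  # keyword alone is not reliable enough
    return None
```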

Web-Cast Text Analysis The extracted semantic information (event category, player, team, time-tag) is used to build a data structure that annotates the video event and facilitates search using metadata
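Such an annotation record might look like the following sketch; only the four semantic fields come from the slide, while the field names and the `matches` helper are illustrative:

```python
from dataclasses import dataclass

@dataclass
class EventAnnotation:
    event_category: str   # e.g. "shooting foul"
    player: str           # names are filtered out before pLSA, restored here
    team: str
    time_tag: str         # game time, e.g. "41:13"

    def matches(self, **query):
        """Metadata search: true if every given field equals the query value."""
        return all(getattr(self, k) == v for k, v in query.items())
```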

Text/Video Alignment
- Feature Extraction
  - Shot classification
  - Replay detection
- Event Moment Detection
- Event Boundary Detection

Event Moment Detection In broadcast sports video, the game time and the video time are not synchronized because of non-game scenes such as player introductions, the half-time break, and time-outs during the game Therefore, we need to recognize the game time shown in the video to detect the event moment

Event Moment Detection We proposed an approach that first detects the clock overlaid on the video and then recognizes the digits of the clock to obtain the game time This approach works well for soccer video because the clock never stops in soccer Once the game time is recognized at a certain video frame, the time of any other frame can be inferred from the recognized game time and the frame rate
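The inference from a single recognized (frame, game-time) pair can be sketched as below, assuming a constant frame rate and a continuously running clock (the soccer case):

```python
def infer_game_time(known_frame, known_seconds, query_frame, fps=25.0):
    """Infer the game time (in seconds) of any frame from one recognized
    (frame, game-time) pair and the frame rate."""
    return known_seconds + (query_frame - known_frame) / fps
```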

Event Moment Detection Because of unpredictable clock stops, the game time must be recognized in every frame of the video As the clock region may disappear while the clock is stopped, we employ a detection-verification-redetection mechanism, an improvement over our previous approach, to recognize the game time in broadcast sports video

We first segment the static overlaid region using a static-region detection approach Then the temporal neighboring pattern similarity (TNPS) measure is used to locate the clock-digit positions, because the pixels of the clock-digit region change periodically

As the clock may disappear while it is stopped in some sports games (e.g., basketball), we need to verify the matching result of the seconds digit using the following formula:

M(i) = (1/|R|) Σ_{(x,y)∈R} T_i(x,y) ⊙ D(x,y)

T_i: image pixel value of the template for digit i
D: image pixel value of the digit to be recognized
R: region of the digit
⊙: EQV operator

Event Moment Detection If M(i) stays below a matching threshold for a certain time (2 seconds in our work), verification fails
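The verification step can be sketched on binarized digit images, assuming per-pixel EQV (XNOR) agreement averaged over the digit region R and the 2-second failure rule:

```python
import numpy as np

def match_score(template, digit):
    """M(i): fraction of pixels where the binarized template T_i and the
    observed digit D agree (EQV, i.e. XNOR, per pixel)."""
    eqv = ~(template.astype(bool) ^ digit.astype(bool))  # 1 where pixels agree
    return eqv.mean()

def verify(scores_over_time, threshold=0.8, fps=25, fail_seconds=2.0):
    """Verification fails if the best match score stays below the threshold
    for fail_seconds consecutive seconds (clock disappeared or stopped)."""
    run = 0
    for s in scores_over_time:
        run = run + 1 if s < threshold else 0
        if run >= fail_seconds * fps:
            return False
    return True
```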

Event Moment Detection After verification succeeds, the game time is recorded as a time sequence linked to the frame numbers, called the time-frame index If verification fails during a clock stop, the related frames are labeled "--:--" in the time-frame index

Event Boundary Detection The event moment can be detected from the recognized game time However, an event should be a video segment that exhibits the whole process of the event (e.g., how the event developed, the players involved, the players' reactions to the event) rather than just a moment

Event Boundary Detection Hence, we have to detect the start and end boundaries of an event in the video We use a conditional random field model (CRFM) to model the temporal event structure and detect the event boundary

Event Boundary Detection We first train the CRFM using labeled video data to obtain the parameters for CRFM The features described in Feature Extraction are used to train CRFM

CRF model for event boundary detection

During detection, the shot containing the detected event moment is used as a reference shot to define a search range for event boundary detection The search range is empirically set to start at the first far-view shot before the reference shot and end at the first far-view shot after it

We then use the trained CRFM to compute probability scores for all shots within the search range and label each shot as event or non-event according to the higher probability The resulting label sequence may contain isolated points, which can be considered noise and may affect event boundary detection

Given the label sequence L = {l_1, ..., l_n}, where n is the length of the sequence, two steps are applied to identify and then correct the noise points

For each l_i, we first apply a criterion to identify whether it is a noise point, where m is the length of the identified interval predefined on L Then a window-based neighboring voting (NV) scheme is used to correct the value of the noise point

For l_i, the NV value is computed over the neighbors falling in the window centered at l_i:

NV(l_i) = Σ_{j=-k, j≠0}^{k} l_{i+j}

where [-k, k] is the width of the neighboring voting window
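The correction step can be sketched as a majority vote over the window [-k, k]; the slide's separate noise-point criterion with interval length m is not reproduced here, so voting is applied to every point as an approximation:

```python
def correct_noise(labels, k=2):
    """Window-based neighboring voting: replace each binary label by the
    majority of its neighbors in the window [-k, k]; ties keep the original."""
    out = list(labels)
    n = len(labels)
    for i in range(n):
        window = [labels[j] for j in range(max(0, i - k), min(n, i + k + 1)) if j != i]
        votes = sum(window)                # number of event-labeled neighbors
        if votes > len(window) / 2:
            out[i] = 1
        elif votes < len(window) / 2:
            out[i] = 0
    return out
```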

Experimental Results
- Text Analysis
- Text/Video Alignment

Text Analysis

We use the boundary detection accuracy (BDA) to evaluate the detected event boundaries on the testing video set In the definition, t_ds and t_de are the automatically detected start and end event boundaries, respectively; t_ms and t_me are the manually labeled start and end event boundaries, respectively; α is a weight, set to 0.5 in our experiment