
1 MPEG-7 Implementation: Extraction and Application of MPEG-7 Description Tools
A. B. Benítez, D. Zhong, A. Jaimes, H. Sundaram and S.-F. Chang
{ana, dzhong, ajaimes, sundaram, sfchang}@ee.columbia.edu
Digital Video and Multimedia Group (DVMM)
Department of Electrical Engineering, Columbia University

2 Overview
- MPEG-7 Standard: multimedia description
  - Describes structure, semantics and summaries, among others
  - Segmentation, searching, filtering, understanding and summarization of multimedia are still challenges
- AMOS: video object segmentation and retrieval
  - Semi-automatic segmentation based on region tracking
  - Retrieval based on visual features and spatio-temporal relations
- Visual Apprentice: learning of visual object/scene detectors
  - Users define visual classes and provide training examples
  - System combines features/learning algorithms at multiple levels
- IMKA: Intelligent Multimedia Knowledge Application
  - Multimedia to represent semantic/perceptual information about the world
  - Extracts multimedia knowledge for image retrieval
- KIA: high-level audio-visual summaries
  - Automatic AV scene segmentation and structure discovery
  - Generates video skims preserving semantics

3 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS: Video Object Segmentation and Retrieval
- Visual Apprentice: Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

4 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS: Video Object Segmentation and Retrieval
- Visual Apprentice: Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

5 Motivation for MPEG-7
- Explosive proliferation of multimedia content: soap operas, sports, news, consumer content, scientific content
- Need for efficient, intelligent, interoperable applications:
  - Storage and retrieval
  - Multimedia editing
  - Personalized TV
  - Remote sensing applications
  - Universal multimedia access
  - Surveillance applications
  - Etc.

6 MPEG-7 Standard
- Flexible, extensible, multi-level, and standard framework for describing multimedia
- Parts: Systems, DDL, Video, Audio, MDS, Software
- Scope: feature extraction -> MPEG-7 description -> search/filtering application
- Schedule: Call for Proposals (10/98), Working Draft (12/99), Committee Draft (10/00), International Standard (9/01)

7 MPEG-7 Framework
- Description Definition Language (DDL): language to create new Ds/DSs or extend existing ones (XML Schema)
- Description Schemes (DSs): structure and semantics of relations among Ds/DSs
- Descriptors (Ds): representation of a feature of AV data
(Diagram: the DDL defines Description Schemes and Descriptors (1..*); a description describes the features of an AV content item, signifying them to a user or system.)

8 Multimedia Description Schemes
(Diagram of the MDS organization:)
- Basic elements: schema tools, links & media localization, basic datatypes
- Content management: creation & production, media, usage
- Content description: structure, semantics
- Navigation & access: summaries, variations, views
- Content organization: models, collections
- User interaction: preferences, history

9 Multimedia Description Schemes
(Same MDS organization diagram as slide 8.)

10 Structure Description Tools
(Diagram: the Segment DS describes multimedia content and is specialized into StillRegion DS, VideoSegment DS, MovingRegion DS, etc.; segments carry tools such as TextAnnotation and SpatialMask, and are related through Decomposition DSs and the Segment Relation CS.)

11 Segments
(Figure: example segment types, e.g., still regions.)

12 Segment Attributes
- Color: Dominant Color, Scalable Color, Color Layout, Color Structure, GoF/GoP Color
- Texture: Homogeneous Texture, Texture Browsing, Edge Histogram
- Shape: Region Shape, Contour Shape, 3D Shape
- Motion: Camera Motion, Motion Trajectory, Parametric Motion, Motion Activity
- Localization: Region Locator, Spatio-Temporal Locator, Media Time
- Other: Face Recognition, Text Annotation, Creation Info, Usage Info, Media Info

13 Segment Relations
- Segment decompositions: spatial, temporal, spatio-temporal, media source
- Spatial relations: south, west, northwest, southwest, left, below, under, equal, inside, covers, overlaps, disjoint
- Temporal relations: before, meets, overlaps, during, contains, starts, finishes, equal; sequential, parallel (see the sketch below)
- Spatio-temporal relations: union, intersection
- Other relations: keyFor, annotates
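
As a hedged illustration (not the standard's normative text), the binary temporal relations above behave like Allen's interval relations over segment media time; a minimal Python sketch, assuming segments are (start, end) pairs in seconds:

# Classify the MPEG-7-style temporal relation that segment a bears to segment b.
def temporal_relation(a, b):
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equal"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_end == b_end and a_start > b_start:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start and a_end > b_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return None  # inverse relations are obtained by swapping a and b

# Example: a shot ending exactly where the next begins "meets" it.
assert temporal_relation((0.0, 4.2), (4.2, 9.0)) == "meets"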

14 Structure Description I
(Diagram: a still region, described by a TextAnnotation and ColorStructure, is spatially decomposed (no overlap, gap) into two still regions, each with a ContourShape and TextAnnotation, linked by a "left" spatial relation.)

15 Structure Description I - XML
<StillRegion id="SR1">
  <TextAnnotation>
    <FreeTextAnnotation> Alex shakes hands with Ana </FreeTextAnnotation>
  </TextAnnotation>
  <SpatialDecomposition overlap="false" gap="true">
    <StillRegion id="SR2">
      <TextAnnotation>
        <FreeTextAnnotation> Alex </FreeTextAnnotation>
      </TextAnnotation>
      <Relation xsi:type="SpatialSegmentRelationType" name="left" target="#SR1"/>
      <VisualDescriptor xsi:type="ContourShapeType"> ... </VisualDescriptor>
    </StillRegion>
    <StillRegion id="SR3">
      <TextAnnotation>
        <FreeTextAnnotation> Ana </FreeTextAnnotation>
      </TextAnnotation>
    </StillRegion>
  </SpatialDecomposition>
</StillRegion>

16 Structure Description II
(Diagram: a video segment is temporally decomposed (no overlap, no gap) into video segments; these are spatially decomposed (no overlap, gap) into moving regions, which are further spatially decomposed (no overlap, no gap); an "above" spatial relation holds between moving regions, and a keyFor relation links a segment to the video segment.)

17 Structure Description II
(Diagram, continued: the video segment carries MediaTime, Mosaic, GoFGoPColor, and TextAnnotation descriptors; the moving regions carry MediaTime, ScalableColor, ParametricMotion, TextureBrowsing, ContourShape, and TextAnnotation.)

18 Semantics Description Tools
(Diagram: the Semantic DS describes the narrative world captured by multimedia content; SemanticBase DS entities (Object DS, AgentObject DS, Event DS, Concept DS, SemanticState DS, SemanticPlace DS, SemanticTime DS) carry Labels, Definitions, etc., are grouped by the SemanticBag DS, linked by Semantic Relation DSs, and connected to Segment DSs through AnalyticModel DSs.)

19 Semantic Description
(Example semantic graph: a "Shake hands" event with Alex (Agent Object) as agent and Ana (Agent Object) as accompanier; a location relation to New York (Semantic Place) and a time relation to 9th Sept (Semantic Time); a property relation to the Friendship concept, which also carries symbolPerception and mediaPerception relations. Semantic entities carry labels (e.g., "man"), definitions (e.g., "a primate of the family Hominidae"), and properties (e.g., "tall", "slim").)

20 Semantic Description - XML
<Semantic id="SM1">
  <Label><Name>Alex shakes hands with Ana</Name></Label>
  <SemanticBase xsi:type="EventType" id="EV1">
    <Label><Name>Shake hands</Name></Label>
    <Relation type="urn:mpeg:mpeg7:cs:ObjectEventRelationCS:agent" target="#AO1"/>
    <Relation type="urn:mpeg:mpeg7:cs:ObjectEventRelationCS:accompanier" target="#AO2"/>
    <Relation type="urn:mpeg:mpeg7:cs:ConceptSemanticBaseRelationCS:property" target="#C1"/>
    <Relation type="urn:mpeg:mpeg7:cs:SemanticPlaceSemanticBaseRelationCS:location" target="#SP1"/>
    <Relation type="urn:mpeg:mpeg7:cs:SemanticTimeSemanticBaseRelationCS:time" target="#ST1"/>
  </SemanticBase>
  <SemanticBase xsi:type="AgentObjectType" id="AO1">
    <Label><Name>Alex</Name></Label>
    <Agent xsi:type="PersonType">
      <Name><GivenName> Alex </GivenName></Name>
    </Agent>
  </SemanticBase>
  <SemanticBase xsi:type="AgentObjectType" id="AO2">
    <Label><Name>Ana</Name></Label>
    <Agent xsi:type="PersonType">
      <Name><GivenName> Ana </GivenName></Name>
    </Agent>
  </SemanticBase>
  <SemanticBase xsi:type="ConceptType" id="C1">
    <Label><Name>Comradeship</Name></Label>
  </SemanticBase>
  <SemanticBase xsi:type="SemanticPlaceType" id="SP1">
    <Label><Name>New York</Name></Label>
  </SemanticBase>
  <SemanticBase xsi:type="SemanticTimeType" id="ST1">
    <Label><Name>September 9</Name></Label>
  </SemanticBase>
</Semantic>

21 Semantic Relations
(Example description from which semantic relations can be extracted:)
"The vessel is an example of Maya art created in Guatemala in the 8th century. The vessel's height is 14 cm and it has several paintings. The paintings show the realm of the lords of death with a death figure that dances and another figure that holds an axe and a handstone. The paintings represent sacrifice."

22 Summary Description Tools
(Diagram: a Hierarchical Summary DS contains Highlight Summary DSs built from Highlight Segment DSs, which locate the audio, visual, and AV context of Segment DSs.)

23 Summary Description
(Example: a hierarchical news summary, with a "skim all news" summary above International, Sports, and Environmental highlight summaries.)

24 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice: Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

25 Structure Description Tools
(Same Segment DS diagram as slide 10.)

26 AMOS
- Segmentation: uniform low-level feature region segmentation (e.g., using color and motion) and semantic video object segmentation
- Retrieval: low-level feature region-based and semantic video object-based similarity search
- Models semantic video objects by their underlying regions, visual features, and spatio-temporal relations

27 Video Object Segmentation
(Flowchart: (1) object definition: from user input on the starting frame, region segmentation produces homogeneous regions, and region aggregation groups them into the semantic video object, separating foreground from background regions; (2) region tracking: motion projection carries the regions into each succeeding frame.)

28 Object Projection and Tracking
(Figure: segmented foreground (FG) and background (BG) regions at frame n-1 are projected to frame n; uncovered holes become new regions. Background projection uses an egomotion model; the slide's equation is not preserved here.)
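
As a rough, hedged illustration of the kind of egomotion model commonly used for such background projection (the slide's actual equation is not preserved, so the six-parameter affine form below is an assumption):

# Sketch of a 6-parameter affine egomotion (global motion) model; parameter
# names are illustrative, not the authors' exact formulation.
import numpy as np

def project_points(points, a):
    """Warp (x, y) points from frame n-1 into frame n.

    points: (N, 2) array of pixel coordinates.
    a: affine parameters (a1..a6) such that
       x' = a1*x + a2*y + a3 and y' = a4*x + a5*y + a6.
    """
    x, y = points[:, 0], points[:, 1]
    xp = a[0] * x + a[1] * y + a[2]
    yp = a[3] * x + a[4] * y + a[5]
    return np.stack([xp, yp], axis=1)

# Identity parameters leave the background where it was.
pts = np.array([[10.0, 20.0], [30.0, 40.0]])
assert np.allclose(project_points(pts, [1, 0, 0, 0, 1, 0]), pts)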

29 Segmentation Results
(Plot: maximum boundary deviation and number of frames with user input, vs. frame number.)
- Uncovered regions caused the major tracking errors, which were corrected by a few user inputs
- Remaining errors reflect the accuracy limitation

30 Features and Relations
- Visual features for regions and objects: representative color, Tamura texture, shape descriptors, motion trajectory
- Spatio-temporal relations among the foreground regions of a semantic video object:
  - Spatial orientation graph (angle)
  - Spatial topological graph (contains, not contain, contained)
  - Temporal directional graph (start after, same time, before)

31 Mapping to MPEG-7
- The semantic video object and its regions map to a Spatio-Temporal Decomposition of a Moving Region into Moving Regions
- Visual features (representative color, Tamura texture, shape descriptors, motion trajectory) map to Visual Descriptors
- Spatio-temporal relations (spatial orientation, spatial topological, and temporal directional graphs) map to Segment Relations

32 Video Object Searching
- Region matching: for each query region, find a candidate region list based on visual feature distance (e.g., color, texture)
- Join & validation: join the candidate region lists and compute the total object distance (visual features + spatio-temporal relations) to rank the retrieved objects (see the sketch below)
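
A hedged sketch of this two-stage search; the region structure, field names, and scoring are illustrative placeholders, not the AMOS implementation:

from itertools import product

def region_distance(q_region, db_region):
    # Placeholder visual-feature distance (color, texture, shape, motion).
    return abs(q_region["color"] - db_region["color"])

def search_objects(query_regions, db_objects, k=3, top_n=10):
    results = []
    for obj in db_objects:
        # Stage 1, region matching: for each query region, keep the k nearest
        # candidate regions of this object by visual feature distance.
        candidates = [
            sorted(obj["regions"], key=lambda r: region_distance(q, r))[:k]
            for q in query_regions
        ]
        # Stage 2, join & validation: combine the candidate lists and score
        # the best joint assignment; the real system also adds a penalty for
        # violated spatio-temporal relations.
        best = min(
            sum(region_distance(q, r) for q, r in zip(query_regions, combo))
            for combo in product(*candidates)
        )
        results.append((best, obj["id"]))
    return sorted(results)[:top_n]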

33 Search Interface
(Screenshot: query canvas, query results, and feature weight controls.)

34 Demo Time
- AMOS segmentation system
- AMOS search system

35 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

36 Semantics Description Tools
(Same Semantic DS diagram as slide 18.)

37 The Visual Apprentice
- Focus on semantics, object/scene structure, and user preferences
- Visual object/scene detectors automatically assign semantic labels to objects (e.g., sky) or scenes (e.g., a handshake scene)
- User input and learning: the user defines models and provides training examples
- Learning algorithms + different features + training examples = automatic visual detectors

38 Definition Hierarchy
- Level 1: object
- Level 2: object-parts (object-part 1, object-part 2, ..., object-part n)
- Level 3: perceptual areas (perceptual-area 1, ..., perceptual-area n)
- Level 4: regions (region 1, region 2, ...)
User input:
- Define the hierarchy: decide its nodes and containment relationships
- For each node, label examples in images/videos, by clicking on regions or outlining areas
(A minimal data-structure sketch follows.)
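
A minimal sketch of such a hierarchy in plain Python; class and field names are illustrative, not the Visual Apprentice's code:

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str       # e.g., "batting", "pitcher", "grass"
    level: int       # 1 = object, 2 = object-part, 3 = perceptual area, 4 = region
    children: list = field(default_factory=list)
    examples: list = field(default_factory=list)   # user-labeled example regions

# The baseball hierarchy of the next slide:
# batting -> {ground -> {grass, sand}, pitcher, batter}, regions at the leaves.
batting = Node("batting", 1, children=[
    Node("ground", 2, children=[Node("grass", 3), Node("sand", 3)]),
    Node("pitcher", 2),
    Node("batter", 2),
])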

39 Definition Hierarchy
(Baseball example: level 1, object: batting; level 2, object-parts: ground, pitcher, batter; level 3, perceptual areas: grass and sand, under ground; level 4: regions.)

40 Definition Hierarchy Example
(Figure: the batting hierarchy, with regions attached under grass, sand, pitcher, and batter.)

41 Learning Detectors from User Input
- Training data: for each node of the hierarchy, a set of examples
- A superset of features (incl. MPEG-7) is extracted from each example:
  - Color (average LUV, Dominant Color, etc.)
  - Shape & location (perimeter, formfactor, eccentricity, etc.)
  - Texture (Edge Direction Histogram, Tamura, etc.)
  - Motion (trajectory, velocity, etc.)
- A superset of machine learning algorithms uses the training data

42 Learning Classifiers For Nodes
(Diagram: Stage 1: training data is obtained and feature vectors are computed (MPEG-7). Stage 2: a machine learning algorithm is trained for definition hierarchy D1, yielding visual detector/classifier C1. Stage 3: the classifiers generate MPEG-7 descriptions.)

43 An Example
(Diagram: a handshake object with object-parts Face 1, Handshake, and Face 2, each decomposed through perceptual areas into regions. A region classifier Cr feeds candidate face regions to the face object-part classifier Cf, which determines the face object-part.)

44 Learning Classifiers
- Stage 1: training data obtained and feature vectors computed (MPEG-7 features)
- Stage 2: classifiers learned; learning algorithms 1..n each yield a classifier (C1..Cn) for definition hierarchy D1, so there are multiple classifiers for D1
- Stage 3: classifiers/features selected

45 Training Summary
User:
- Defines the definition hierarchy
- Labels example images/video according to the hierarchy
- Semantic MPEG-7 descriptions are generated for the training set
System:
- Automatically segments the examples
- Extracts visual features for each node (structure, MPEG-7)
- Applies a set of learning algorithms to each node, selecting the best features for each algorithm
- Yields a set of classifiers for each node (with their best features); the best classifier is selected, or different classifiers are combined
- Assigns semantic labels (semantics, MPEG-7)
(A sketch of the per-node selection step follows.)
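
A hedged sketch of the per-node "train several algorithms, keep the best" step, using scikit-learn classifiers as stand-ins for the system's learning algorithms (which predate scikit-learn); per-algorithm feature selection is omitted for brevity:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def best_classifier_for_node(X, y):
    """X: feature vectors of a node's labeled examples; y: their labels."""
    candidates = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]
    # Score each candidate by cross-validation and keep the best performer.
    scored = [(cross_val_score(c, X, y, cv=5).mean(), c) for c in candidates]
    score, best = max(scored, key=lambda t: t[0])
    return best.fit(X, y), score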

46 Mapping to MPEG-7
(Diagram: in the batting example, hierarchy nodes map to moving and still regions with visual descriptors and visual features; node labels and their containment relations map to semantic entities and relations.)

47 Classification Summary
- Automatic segmentation
- Feature extraction (MPEG-7)
- Classification and grouping: regions, groups of regions, object/scene (e.g., Face 1, Handshake, Face 2)
- Generation of MPEG-7 descriptors

48 Experiments (I)
- Set I: sky images; hierarchy: sky object (level 1) directly over its regions (level 4)
- Set II: handshake images; hierarchy: handshake object (level 1); object-parts Face 1, Handshake, Face 2 (level 2); perceptual areas (level 3); regions (level 4)

49 Experiments (II)
- Set III: baseball video; hierarchy: batting object (level 1); object-parts ground, pitcher, batter (level 2); perceptual areas grass and sand, under ground (level 3); regions (level 4)

50 Overall Performance
Classification results (test-set sizes and accuracy values were not preserved in the transcript):

Image set    Training set     Test set   Accuracy   Precision   Recall
Baseball     60 video shots   --         --         100 %       64 %
Handshakes   80 images        --         --         70 %        74 %
Skies        45 images        --         --         87 %        50 %

51 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA (structure + semantics): Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

52 IMKA
Explore new frontiers for Multimedia Information Systems (MISs):
- Construct intelligent and interactive MISs for analysis, retrieval, navigation, and synthesis of multimedia
- Shift the paradigm towards more semantics-based, knowledge-driven MISs
- Anticipate the impact of MPEG-7 on MISs
- Improve the effectiveness and performance of MISs

53 The IMKA System
- MediaNet: multimedia knowledge representation framework
  - Extends traditional knowledge representations by incorporating perceptual alongside symbolic information
  - Defines and illustrates concepts and relations using multi-modal content and descriptors
  - Encoded using MPEG-7 description tools
- Implementation: semi-automatic construction of a MediaNet knowledge base; CBIR query expansion and multi-modal query translation
- Experimentation: MPEG-7 color image test set (5,466 images, 51 queries); initial results show improved retrieval effectiveness

54 MediaNet Evolution
(Diagram: WordNet relates lexical concepts through words at the semantic level; the MMT attaches AV content and descriptors to semantic concepts; a mirror thesaurus relates perceptual concepts through words and AV descriptors; MediaNet combines semantic and perceptual concepts, linked by words, AV content, descriptors, and descriptor similarity.)

55 MediaNet (Symbolic + Perceptual)
Novel multimedia representation of world concepts at symbolic and perceptual levels:
- Illustration of concepts using multimedia content
- Perceptual feature-based relations
- Weights, probabilities, and conceptual contexts
(Example: a Human concept, with the words "human", "man", "homo" and the definition "a primate of the family Hominidae", is a specialization of the Hominid concept (weight 0.5, probability 1.0) and has a "place of" relation to the Earth concept (word "earth"); a "similar shape" perceptual relation holds when the shape descriptor distance is below a threshold T; concepts are illustrated by multimedia content carrying descriptors, e.g., a shape descriptor (0.04 ...).)

56 MediaNet Constructs
- Concepts: real-world entities (rock, game), abstract concepts (beauty), unnamed objects (a texture pattern)
- Relations: semantic relations from WordNet, e.g., generalization (animal, dog); perceptual relations from content-based descriptor similarity, e.g., "has similar shape"
- Content: multimedia data (image, text), feature descriptors (color histogram), descriptor similarity (Euclidean); some representations are not relevant for some concepts (e.g., audio for the concept Sky)
(A minimal graph sketch follows.)
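
A minimal sketch of these constructs in plain Python; names are illustrative, not the IMKA implementation:

from dataclasses import dataclass, field

@dataclass
class Concept:
    labels: list                                     # words naming the concept
    examples: list = field(default_factory=list)     # illustrating media items
    features: dict = field(default_factory=dict)     # e.g., {"color_hist": [...]}

@dataclass
class Relation:
    kind: str          # semantic ("specialization") or perceptual ("similar_shape")
    source: Concept
    target: Concept
    weight: float = 1.0

rock = Concept(["rock", "stone"])
mineral = Concept(["mineral"])
kb = [Relation("specialization", rock, mineral, weight=0.9)]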

57 Mapping to MPEG-7
(Diagram: MediaNet's semantic and perceptual concepts map to MPEG-7 Semantic Entities with word Labels; the illustrating AV content maps to Segments with Segment Descriptors; concept relations map to Semantic Relations.)

58 Construction and Retrieval
- MediaNet construction: textual annotations, WordNet, an image network of examples, automatic feature extraction tools, and human assistance
- Intelligent CBIR using MediaNet: expand and translate queries across modalities; e.g., the text query "tapirs" is processed against the MediaNet KB, turned into content-based queries over the feature database, and answered by the content-based search engine

59 MediaNet Construction
- WordNet supplies senses and semantic relations (hype/hyponymy, mero/holonymy, antonymy) for the annotation words, e.g., "rock"/"stone" (rock, stone; rock candy; rock music; rock, careen, sway; rock, sway, shake; cradle, rock), "sky" (sky; flip, toss, sky, pitch), "sunset"
- Automatic feature extraction tools (color histogram, color coherence, Tamura texture, wavelet texture) process the annotated example images ("rock and sky", "sunset"), and per-concept feature centroids are formed (see the sketch below)
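
A hedged sketch of forming per-concept feature centroids from annotated example images, as described above; the feature extractor is a placeholder stub:

import numpy as np

def color_histogram(image):
    # Placeholder for one of the listed tools (color histogram, color
    # coherence, Tamura texture, wavelet texture).
    return np.asarray(image, dtype=float)

def concept_centroids(examples_by_concept):
    """examples_by_concept: {"sky": [img, ...], "rock": [...]} -> centroids."""
    return {
        concept: np.mean([color_histogram(img) for img in imgs], axis=0)
        for concept, imgs in examples_by_concept.items()
    }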

60 Multimodal Query Translation and Expansion
(Diagram: a query, either a visual example or a word such as "tapir", is mapped into the semantic space via the feature centroids and word labels; it is expanded along weighted semantic relations (hypo/hypernymy: 1, mero/holonymy: 2, antonymy: MAX) to related concepts (tapir, snake, monkey); the expanded query is translated into content-based queries in feature space, and each image is scored by a weighted minimum distance.)
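
A hedged sketch of this expansion and translation; the relation costs follow the diagram (hypo/hypernymy 1, mero/holonymy 2, antonymy effectively excluded), while the scoring is a paraphrase of "weighted minimum distance per image", not the exact IMKA formula:

import heapq
import numpy as np

REL_COST = {"hyponym": 1, "hypernym": 1, "meronym": 2, "holonym": 2}

def expand(start, edges, max_cost=2):
    """edges: {concept: [(relation, concept), ...]} -> {concept: cost}."""
    costs, frontier = {start: 0}, [(0, start)]
    while frontier:
        cost, concept = heapq.heappop(frontier)
        for rel, nxt in edges.get(concept, []):
            c = cost + REL_COST.get(rel, float("inf"))  # antonymy -> MAX
            if c <= max_cost and c < costs.get(nxt, float("inf")):
                costs[nxt] = c
                heapq.heappush(frontier, (c, nxt))
    return costs

def rank_images(query_concept, edges, centroids, image_features):
    """Translate the expanded concepts to feature space; rank each image by a
    weighted minimum distance to any expanded concept's centroid."""
    expanded = expand(query_concept, edges)
    scores = []
    for img_id, feat in image_features.items():
        d = min(
            np.linalg.norm(feat - centroids[c]) * (1 + cost)
            for c, cost in expanded.items()
            if c in centroids
        )
        scores.append((d, img_id))
    return sorted(scores)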

61 Evaluation
- Evaluation criteria: retrieval effectiveness (recall, precision) and additional functionality
- Ground truth: images from the MPEG-7 content set; 50 queries defined by MPEG-7 for color descriptor evaluation
- Semantic query "tapirs" defined by the authors, with relevance scores: tapir images 1.0, monkey images 0.75, snake images 0.5, butterfly and fish images 0.25
- MediaNet KB construction: 185 images in 50 classes with annotations; 96 concepts; relations: 50 specialization, 34 composition, 1 opposite
- Experiments: color histogram vs. several color and texture descriptors; visual query vs. text query

62 Experimental Results
(Plots: average precision vs. recall for the 50 color MPEG-7 queries, and precision vs. recall for the semantic query "tapirs"; each compares visual queries without MediaNet, visual queries with MediaNet, and text queries with MediaNet.)

63 Experiment Conclusions
Summary of retrieval effectiveness (visual queries, color histogram):

                 50 MPEG-7 queries      Semantic query "tapirs"
                 W/o MN      W/ MN      W/o MN      W/ MN
3-point avg.     0.71        0.66       0.43        0.80
11-point avg.    0.65        0.61       0.35        0.78

Conclusions:
- Improved retrieval effectiveness: roughly 100% improvement for the semantic query "tapirs"; similar performance on the 50 visual/semantic queries
- Color histogram confirmed as a relevant feature for retrieval
- Additional functionality: multi-modal queries, visual and/or textual

64 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA (structure + semantics): Multimedia Knowledge Framework, Extraction and Application
- KIA (summary): AV Scene Segmentation, Discovery and Summarization
- Summary

65 Summary Description Tools
(Same Hierarchical Summary DS diagram as slide 22.)

66 KIA
Visual skim generation: fully automatic reduction of the duration of the original video, given a target time.
Constraints:
- Preserve the semantics
- Preserve the frame rate
Motivation: on-demand summaries, browsing of digital archives, and fast-forwarding of streaming video while maintaining the frame rate.

67 Prior Work
- Informedia project [CHI 1998]
- Microsoft Research [ACM MM 1999]
- MoCA [J. VCIR 1996]
Issues:
- Shots are considered to be indivisible
- Little analysis of the effect of video syntax on semantics
(J. VCIR: Journal of Visual Communication and Image Representation; CHI: ACM Conference on Human Factors in Computing Systems.)

68 KIA Approach
- Automatic determination of computable scenes and structures
- Derive the relationship between the minimum time needed to comprehend a shot and its visual complexity
- Rules of film syntax are used for shot removal
- Finally, the problem is cast as maximizing an objective function subject to constraints (see the sketch below)
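
As a rough, hedged illustration of casting skim generation as constrained maximization (the actual objective, comprehension-time model, and film-syntax rules are in the cited papers), a greedy sketch that trims shots toward a target duration while never cutting a shot below its minimum comprehension time:

# min_time() is an illustrative stand-in for the paper's complexity-to-
# comprehension-time relationship, not the authors' model.
def min_time(shot):
    # Assumed: more visually complex shots need longer to comprehend.
    return 0.5 + 1.5 * shot["complexity"]   # seconds, complexity in [0, 1]

def make_skim(shots, target):
    """Shrink the total duration to `target`, keeping every shot comprehensible."""
    durations = {s["id"]: s["duration"] for s in shots}
    total = sum(durations.values())
    while total > target:
        # Greedily trim the shot with the most slack above its minimum time.
        slack, shot = max(
            ((durations[s["id"]] - min_time(s), s) for s in shots),
            key=lambda t: t[0],
        )
        if slack <= 0:       # constraints make the target infeasible
            break
        cut = min(slack, total - target)
        durations[shot["id"]] -= cut
        total -= cut
    return durations

shots = [{"id": 1, "duration": 10.0, "complexity": 0.2},
         {"id": 2, "duration": 8.0, "complexity": 0.9}]
print(make_skim(shots, target=6.0))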

69 Summary Creation

70 Mapping to MPEG-7
(Diagram: the generated skim segments map to MPEG-7 Highlight Segments.)

71 Experiment Results
- Original: 114 sec.; skim: 33 sec. (70% data reduction)
- User studies validate the approach: all the skims tested were deemed coherent by the users
- Excellent results for compression rates of 70-80%
(Sundaram and Chang, ACM MM 2000; Sundaram and Chang, ICME 2001)

72 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA (structure + semantics): Multimedia Knowledge Framework, Extraction and Application
- KIA (summary): AV Scene Segmentation, Discovery and Summarization
- Summary

73 Summary
- MPEG-7 Standard: multimedia description
  - Describes structure, semantics and summaries, among others
  - Segmentation, searching, filtering, understanding and summarization of multimedia are still challenges
- AMOS: video object segmentation and retrieval
  - Semi-automatic segmentation based on region tracking
  - Retrieval based on visual features and spatio-temporal relations
- Visual Apprentice: learning of visual object/scene detectors
  - Users define visual classes and provide training examples
  - System combines features/learning algorithms at multiple levels
- IMKA: Intelligent Multimedia Knowledge Application
  - Multimedia to represent semantic/perceptual world knowledge
  - Extracts multimedia knowledge for image retrieval
- KIA: high-level audio-visual summaries
  - Automatic AV scene segmentation and structure discovery
  - Generates video skims preserving semantics

74 For More Info, Papers, ... (I)
Columbia University:
- AMOS:
- Visual Apprentice:
- IMKA:
- KIA:

75 For More Info, Papers, ... (II)
- DVMM Group:
- ADVENT Project:
- MPEG Committee:

76 The End
Thanks for your attention!

