
1 MPEG-7 Implementation: Extraction and Application of MPEG-7 Description Tools
A. B. Benítez, D. Zhong, A. Jaimes, H. Sundaram and S.-F. Chang
{ana, dzhong, ajaimes, sundaram, sfchang}@ee.columbia.edu
Digital Video and Multimedia Group (DVMM)
Department of Electrical Engineering, Columbia University

2 Overview
- MPEG-7 Standard: multimedia description
  - Describes structure, semantics and summaries, among others
  - Segmentation, searching, filtering, understanding and summarization of multimedia are still challenges
- AMOS: video object segmentation and retrieval
  - Semi-automatic segmentation based on region tracking
  - Retrieval based on visual features and spatio-temporal relations
- Visual Apprentice: learning of visual object/scene detectors
  - Users define visual classes and provide training examples
  - System combines features/learning algorithms at multiple levels
- IMKA: Intelligent Multimedia Knowledge Application
  - Multimedia to represent semantic/perceptual information about the world
  - Extracts multimedia knowledge for image retrieval
- KIA: high-level audio-visual summaries
  - Automatic AV scene segmentation and structure discovery
  - Generates video skims preserving semantics

3 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS: Video Object Segmentation and Retrieval
- Visual Apprentice: Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

4 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS: Video Object Segmentation and Retrieval
- Visual Apprentice: Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

5 Motivation for MPEG-7
- Explosive proliferation of multimedia content: soap operas, sports, news, consumer content, scientific content
- Need for efficient, intelligent, interoperable applications:
  - Storage and retrieval
  - Multimedia editing
  - Personalized TV
  - Remote sensing applications
  - Universal multimedia access
  - Surveillance applications
  - Etc.

6 MPEG-7 Standard
- Flexible, extensible, multi-level, and standard framework for describing multimedia
- Parts: Systems, DDL, Video, Audio, MDS, Software
- Scope: feature extraction -> MPEG-7 description -> search/filtering application
- Schedule: Call for Proposals (10/98), Working Draft (12/99), Committee Draft (10/00), International Standard (9/01)

7 MPEG-7 Framework
- Description Definition Language (DDL): language to create new Ds/DSs or extend existing ones (XML Schema)
- Description Schemes (DSs): structure and semantics of relations among Ds/DSs
- Descriptors (Ds): representation of a feature of AV data
(Diagram: the DDL defines Description Schemes and Descriptors (1..*); a description describes the features of an AV content item, signifying them to a user or system.)

8 Multimedia Description Schemes
(Diagram of the MDS organization:)
- Basic elements: schema tools, links & media localization, basic datatypes
- Content management: creation & production, media, usage
- Content description: structure, semantics
- Navigation & access: summaries, variations, views
- Content organization: models, collections
- User interaction: preferences, history

9 Multimedia Description Schemes
(Same MDS organization diagram as slide 8.)

10 Structure Description Tools
(Diagram: the Segment DS describes multimedia content and is specialized into StillRegion DS, VideoSegment DS, MovingRegion DS, etc.; segments carry tools such as TextAnnotation and SpatialMask, and are related through Decomposition DSs and the Segment Relation CS.)

11 Segments
(Figure: example segment types, e.g., still regions.)

12 Segment Attributes
- Color: Dominant Color, Scalable Color, Color Layout, Color Structure, GoF/GoP Color
- Texture: Homogeneous Texture, Texture Browsing, Edge Histogram
- Shape: Region Shape, Contour Shape, 3D Shape
- Motion: Camera Motion, Motion Trajectory, Parametric Motion, Motion Activity
- Localization: Region Locator, Spatio-Temporal Locator, Media Time
- Other: Face Recognition, Text Annotation, Creation Info, Usage Info, Media Info

13 Segment Relations
- Segment decompositions: spatial, temporal, spatio-temporal, media source
- Spatial relations: south, west, northwest, southwest, left, below, under, equal, inside, covers, overlaps, disjoint
- Temporal relations: before, meets, overlaps, during, contains, starts, finishes, equal; sequential, parallel (see the sketch below)
- Spatio-temporal relations: union, intersection
- Other relations: keyFor, annotates
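
As a hedged illustration (not the standard's normative text), the binary temporal relations above behave like Allen's interval relations over segment media time; a minimal Python sketch, assuming segments are (start, end) pairs in seconds:

# Classify the MPEG-7-style temporal relation that segment a bears to segment b.
def temporal_relation(a, b):
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equal"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_end == b_end and a_start > b_start:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start and a_end > b_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return None  # inverse relations are obtained by swapping a and b

# Example: a shot ending exactly where the next begins "meets" it.
assert temporal_relation((0.0, 4.2), (4.2, 9.0)) == "meets"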

14 Structure Description I
(Diagram: a still region, described by a TextAnnotation and ColorStructure, is spatially decomposed (no overlap, gap) into two still regions, each with a ContourShape and TextAnnotation, linked by a "left" spatial relation.)

15 Structure Description I - XML
<StillRegion id="SR1">
  <TextAnnotation>
    <FreeTextAnnotation> Alex shakes hands with Ana </FreeTextAnnotation>
  </TextAnnotation>
  <SpatialDecomposition overlap="false" gap="true">
    <StillRegion id="SR2">
      <TextAnnotation>
        <FreeTextAnnotation> Alex </FreeTextAnnotation>
      </TextAnnotation>
      <Relation xsi:type="SpatialSegmentRelationType" name="left" target="#SR1"/>
      <VisualDescriptor xsi:type="ContourShapeType"> ... </VisualDescriptor>
    </StillRegion>
    <StillRegion id="SR3">
      <TextAnnotation>
        <FreeTextAnnotation> Ana </FreeTextAnnotation>
      </TextAnnotation>
    </StillRegion>
  </SpatialDecomposition>
</StillRegion>

16 Structure Description II
(Diagram: a video segment is temporally decomposed (no overlap, no gap) into video segments; these are spatially decomposed (no overlap, gap) into moving regions, which are further spatially decomposed (no overlap, no gap); an "above" spatial relation holds between moving regions, and a keyFor relation links a segment to the video segment.)

17 Structure Description II
(Diagram, continued: the video segment carries MediaTime, Mosaic, GoFGoPColor, and TextAnnotation descriptors; the moving regions carry MediaTime, ScalableColor, ParametricMotion, TextureBrowsing, ContourShape, and TextAnnotation.)

18 Semantics Description Tools
(Diagram: the Semantic DS describes the narrative world captured by multimedia content; SemanticBase DS entities (Object DS, AgentObject DS, Event DS, Concept DS, SemanticState DS, SemanticPlace DS, SemanticTime DS) carry Labels, Definitions, etc., are grouped by the SemanticBag DS, linked by Semantic Relation DSs, and connected to Segment DSs through AnalyticModel DSs.)

19 Semantic Description
(Example semantic graph: a "Shake hands" event with Alex (Agent Object) as agent and Ana (Agent Object) as accompanier; a location relation to New York (Semantic Place) and a time relation to 9th Sept (Semantic Time); a property relation to the Friendship concept, which also carries symbolPerception and mediaPerception relations. Semantic entities carry labels (e.g., "man"), definitions (e.g., "a primate of the family Hominidae"), and properties (e.g., "tall", "slim").)

20 Semantic Description - XML
<Semantic id="SM1">
  <Label><Name>Alex shakes hands with Ana</Name></Label>
  <SemanticBase xsi:type="EventType" id="EV1">
    <Label><Name>Shake hands</Name></Label>
    <Relation type="urn:mpeg:mpeg7:cs:ObjectEventRelationCS:agent" target="#AO1"/>
    <Relation type="urn:mpeg:mpeg7:cs:ObjectEventRelationCS:accompanier" target="#AO2"/>
    <Relation type="urn:mpeg:mpeg7:cs:ConceptSemanticBaseRelationCS:property" target="#C1"/>
    <Relation type="urn:mpeg:mpeg7:cs:SemanticPlaceSemanticBaseRelationCS:location" target="#SP1"/>
    <Relation type="urn:mpeg:mpeg7:cs:SemanticTimeSemanticBaseRelationCS:time" target="#ST1"/>
  </SemanticBase>
  <SemanticBase xsi:type="AgentObjectType" id="AO1">
    <Label><Name>Alex</Name></Label>
    <Agent xsi:type="PersonType">
      <Name><GivenName> Alex </GivenName></Name>
    </Agent>
  </SemanticBase>
  <SemanticBase xsi:type="AgentObjectType" id="AO2">
    <Label><Name>Ana</Name></Label>
    <Agent xsi:type="PersonType">
      <Name><GivenName> Ana </GivenName></Name>
    </Agent>
  </SemanticBase>
  <SemanticBase xsi:type="ConceptType" id="C1">
    <Label><Name>Comradeship</Name></Label>
  </SemanticBase>
  <SemanticBase xsi:type="SemanticPlaceType" id="SP1">
    <Label><Name>New York</Name></Label>
  </SemanticBase>
  <SemanticBase xsi:type="SemanticTimeType" id="ST1">
    <Label><Name>September 9</Name></Label>
  </SemanticBase>
</Semantic>

21 Semantic Relations
(Example description from which semantic relations can be extracted:)
"The vessel is an example of Maya art created in Guatemala in the 8th century. The vessel's height is 14 cm and it has several paintings. The paintings show the realm of the lords of death with a death figure that dances and another figure that holds an axe and a handstone. The paintings represent sacrifice."

22 Summary Description Tools
(Diagram: a Hierarchical Summary DS contains Highlight Summary DSs built from Highlight Segment DSs, which locate the audio, visual, and AV context of Segment DSs.)

23 Summary Description
(Example: a hierarchical news summary, with a "skim all news" summary above International, Sports, and Environmental highlight summaries.)

24 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice: Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

25 Structure Description Tools
(Same Segment DS diagram as slide 10.)

26 AMOS
- Segmentation: uniform low-level feature region segmentation (e.g., using color and motion) and semantic video object segmentation
- Retrieval: low-level feature region-based and semantic video object-based similarity search
- Models semantic video objects by their underlying regions, visual features, and spatio-temporal relations

27 Video Object Segmentation
(Flowchart: (1) object definition: from user input on the starting frame, region segmentation produces homogeneous regions, and region aggregation groups them into the semantic video object, separating foreground from background regions; (2) region tracking: motion projection carries the regions into each succeeding frame.)

28 Object Projection and Tracking
(Figure: segmented foreground (FG) and background (BG) regions at frame n-1 are projected to frame n; uncovered holes become new regions. Background projection uses an egomotion model; the slide's equation is not preserved here.)
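
As a rough, hedged illustration of the kind of egomotion model commonly used for such background projection (the slide's actual equation is not preserved, so the six-parameter affine form below is an assumption):

# Sketch of a 6-parameter affine egomotion (global motion) model; parameter
# names are illustrative, not the authors' exact formulation.
import numpy as np

def project_points(points, a):
    """Warp (x, y) points from frame n-1 into frame n.

    points: (N, 2) array of pixel coordinates.
    a: affine parameters (a1..a6) such that
       x' = a1*x + a2*y + a3 and y' = a4*x + a5*y + a6.
    """
    x, y = points[:, 0], points[:, 1]
    xp = a[0] * x + a[1] * y + a[2]
    yp = a[3] * x + a[4] * y + a[5]
    return np.stack([xp, yp], axis=1)

# Identity parameters leave the background where it was.
pts = np.array([[10.0, 20.0], [30.0, 40.0]])
assert np.allclose(project_points(pts, [1, 0, 0, 0, 1, 0]), pts)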

29 Segmentation Results
(Plot: maximum boundary deviation and number of frames with user input, vs. frame number.)
- Uncovered regions caused the major tracking errors, which were corrected by a few user inputs
- Remaining errors reflect the accuracy limitation

30 Features and Relations
- Visual features for regions and objects: representative color, Tamura texture, shape descriptors, motion trajectory
- Spatio-temporal relations among the foreground regions of a semantic video object:
  - Spatial orientation graph (angle)
  - Spatial topological graph (contains, not contain, contained)
  - Temporal directional graph (start after, same time, before)

31 Mapping to MPEG-7
- The semantic video object and its regions map to a Spatio-Temporal Decomposition of a Moving Region into Moving Regions
- Visual features (representative color, Tamura texture, shape descriptors, motion trajectory) map to Visual Descriptors
- Spatio-temporal relations (spatial orientation, spatial topological, and temporal directional graphs) map to Segment Relations

32 Video Object Searching
- Region matching: for each query region, find a candidate region list based on visual feature distance (e.g., color, texture)
- Join & validation: join the candidate region lists and compute the total object distance (visual features + spatio-temporal relations) to rank the retrieved objects (see the sketch below)
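
A hedged sketch of this two-stage search; the region structure, field names, and scoring are illustrative placeholders, not the AMOS implementation:

from itertools import product

def region_distance(q_region, db_region):
    # Placeholder visual-feature distance (color, texture, shape, motion).
    return abs(q_region["color"] - db_region["color"])

def search_objects(query_regions, db_objects, k=3, top_n=10):
    results = []
    for obj in db_objects:
        # Stage 1, region matching: for each query region, keep the k nearest
        # candidate regions of this object by visual feature distance.
        candidates = [
            sorted(obj["regions"], key=lambda r: region_distance(q, r))[:k]
            for q in query_regions
        ]
        # Stage 2, join & validation: combine the candidate lists and score
        # the best joint assignment; the real system also adds a penalty for
        # violated spatio-temporal relations.
        best = min(
            sum(region_distance(q, r) for q, r in zip(query_regions, combo))
            for combo in product(*candidates)
        )
        results.append((best, obj["id"]))
    return sorted(results)[:top_n]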

33 Search Interface
(Screenshot: query canvas, query results, and feature weight controls.)

34 Demo Time
- AMOS segmentation system
- AMOS search system

35 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA: Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

36 Semantics Description Tools
(Same Semantic DS diagram as slide 18.)

37 The Visual Apprentice
- Focus on semantics, object/scene structure, and user preferences
- Visual object/scene detectors automatically assign semantic labels to objects (e.g., sky) or scenes (e.g., a handshake scene)
- User input and learning: the user defines models and provides training examples
- Learning algorithms + different features + training examples = automatic visual detectors

38 Definition Hierarchy
- Level 1: object
- Level 2: object-parts (object-part 1, object-part 2, ..., object-part n)
- Level 3: perceptual areas (perceptual-area 1, ..., perceptual-area n)
- Level 4: regions (region 1, region 2, ...)
User input:
- Define the hierarchy: decide its nodes and containment relationships
- For each node, label examples in images/videos, by clicking on regions or outlining areas
(A minimal data-structure sketch follows.)
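
A minimal sketch of such a hierarchy in plain Python; class and field names are illustrative, not the Visual Apprentice's code:

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str       # e.g., "batting", "pitcher", "grass"
    level: int       # 1 = object, 2 = object-part, 3 = perceptual area, 4 = region
    children: list = field(default_factory=list)
    examples: list = field(default_factory=list)   # user-labeled example regions

# The baseball hierarchy of the next slide:
# batting -> {ground -> {grass, sand}, pitcher, batter}, regions at the leaves.
batting = Node("batting", 1, children=[
    Node("ground", 2, children=[Node("grass", 3), Node("sand", 3)]),
    Node("pitcher", 2),
    Node("batter", 2),
])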

39 Definition Hierarchy
(Baseball example: level 1, object: batting; level 2, object-parts: ground, pitcher, batter; level 3, perceptual areas: grass and sand, under ground; level 4: regions.)

40 Definition Hierarchy Example
(Figure: the batting hierarchy, with regions attached under grass, sand, pitcher, and batter.)

41 Learning Detectors from User Input
- Training data: for each node of the hierarchy, a set of examples
- A superset of features (incl. MPEG-7) is extracted from each example:
  - Color (average LUV, Dominant Color, etc.)
  - Shape & location (perimeter, formfactor, eccentricity, etc.)
  - Texture (Edge Direction Histogram, Tamura, etc.)
  - Motion (trajectory, velocity, etc.)
- A superset of machine learning algorithms uses the training data

42 Learning Classifiers For Nodes
(Diagram: Stage 1: training data is obtained and feature vectors are computed (MPEG-7). Stage 2: a machine learning algorithm is trained for definition hierarchy D1, yielding visual detector/classifier C1. Stage 3: the classifiers generate MPEG-7 descriptions.)

43 An Example
(Diagram: a handshake object with object-parts Face 1, Handshake, and Face 2, each decomposed through perceptual areas into regions. A region classifier Cr feeds candidate face regions to the face object-part classifier Cf, which determines the face object-part.)

44 Learning Classifiers
- Stage 1: training data obtained and feature vectors computed (MPEG-7 features)
- Stage 2: classifiers learned; learning algorithms 1..n each yield a classifier (C1..Cn) for definition hierarchy D1, so there are multiple classifiers for D1
- Stage 3: classifiers/features selected

45 Training Summary
User:
- Defines the definition hierarchy
- Labels example images/video according to the hierarchy
- Semantic MPEG-7 descriptions are generated for the training set
System:
- Automatically segments the examples
- Extracts visual features for each node (structure, MPEG-7)
- Applies a set of learning algorithms to each node, selecting the best features for each algorithm
- Yields a set of classifiers for each node (with their best features); the best classifier is selected, or different classifiers are combined
- Assigns semantic labels (semantics, MPEG-7)
(A sketch of the per-node selection step follows.)
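
A hedged sketch of the per-node "train several algorithms, keep the best" step, using scikit-learn classifiers as stand-ins for the system's learning algorithms (which predate scikit-learn); per-algorithm feature selection is omitted for brevity:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def best_classifier_for_node(X, y):
    """X: feature vectors of a node's labeled examples; y: their labels."""
    candidates = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]
    # Score each candidate by cross-validation and keep the best performer.
    scored = [(cross_val_score(c, X, y, cv=5).mean(), c) for c in candidates]
    score, best = max(scored, key=lambda t: t[0])
    return best.fit(X, y), score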

46 Mapping to MPEG-7
(Diagram: in the batting example, hierarchy nodes map to moving and still regions with visual descriptors and visual features; node labels and their containment relations map to semantic entities and relations.)

47 Classification Summary
- Automatic segmentation
- Feature extraction (MPEG-7)
- Classification and grouping: regions, groups of regions, object/scene (e.g., Face 1, Handshake, Face 2)
- Generation of MPEG-7 descriptors

48 Experiments (I)
- Set I: sky images; hierarchy: sky object (level 1) directly over its regions (level 4)
- Set II: handshake images; hierarchy: handshake object (level 1); object-parts Face 1, Handshake, Face 2 (level 2); perceptual areas (level 3); regions (level 4)

49 Experiments (II)
- Set III: baseball video; hierarchy: batting object (level 1); object-parts ground, pitcher, batter (level 2); perceptual areas grass and sand, under ground (level 3); regions (level 4)

50 Overall Performance
Classification results (test-set sizes and accuracy values were not preserved in the transcript):

Image set    Training set     Test set   Accuracy   Precision   Recall
Baseball     60 video shots   --         --         100 %       64 %
Handshakes   80 images        --         --         70 %        74 %
Skies        45 images        --         --         87 %        50 %

51 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA (structure + semantics): Multimedia Knowledge Framework, Extraction and Application
- KIA: AV Scene Segmentation, Discovery and Summarization
- Summary

52 IMKA
Explore new frontiers for Multimedia Information Systems (MISs):
- Construct intelligent and interactive MISs for analysis, retrieval, navigation, and synthesis of multimedia
- Shift the paradigm towards more semantics-based, knowledge-driven MISs
- Anticipate the impact of MPEG-7 on MISs
- Improve the effectiveness and performance of MISs

53 The IMKA System
- MediaNet: multimedia knowledge representation framework
  - Extends traditional knowledge representations by incorporating perceptual alongside symbolic information
  - Defines and illustrates concepts and relations using multi-modal content and descriptors
  - Encoded using MPEG-7 description tools
- Implementation: semi-automatic construction of a MediaNet knowledge base; CBIR query expansion and multi-modal query translation
- Experimentation: MPEG-7 color image test set (5,466 images, 51 queries); initial results show improved retrieval effectiveness

54 MediaNet Evolution
(Diagram: WordNet relates lexical concepts through words at the semantic level; the MMT attaches AV content and descriptors to semantic concepts; a mirror thesaurus relates perceptual concepts through words and AV descriptors; MediaNet combines semantic and perceptual concepts, linked by words, AV content, descriptors, and descriptor similarity.)

55 MediaNet (Symbolic + Perceptual)
Novel multimedia representation of world concepts at symbolic and perceptual levels:
- Illustration of concepts using multimedia content
- Perceptual feature-based relations
- Weights, probabilities, and conceptual contexts
(Example: a Human concept, with the words "human", "man", "homo" and the definition "a primate of the family Hominidae", is a specialization of the Hominid concept (weight 0.5, probability 1.0) and has a "place of" relation to the Earth concept (word "earth"); a "similar shape" perceptual relation holds when the shape descriptor distance is below a threshold T; concepts are illustrated by multimedia content carrying descriptors, e.g., a shape descriptor (0.04 ...).)

56 MediaNet Constructs
- Concepts: real-world entities (rock, game), abstract concepts (beauty), unnamed objects (a texture pattern)
- Relations: semantic relations from WordNet, e.g., generalization (animal, dog); perceptual relations from content-based descriptor similarity, e.g., "has similar shape"
- Content: multimedia data (image, text), feature descriptors (color histogram), descriptor similarity (Euclidean); some representations are not relevant for some concepts (e.g., audio for the concept Sky)
(A minimal graph sketch follows.)
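
A minimal sketch of these constructs in plain Python; names are illustrative, not the IMKA implementation:

from dataclasses import dataclass, field

@dataclass
class Concept:
    labels: list                                     # words naming the concept
    examples: list = field(default_factory=list)     # illustrating media items
    features: dict = field(default_factory=dict)     # e.g., {"color_hist": [...]}

@dataclass
class Relation:
    kind: str          # semantic ("specialization") or perceptual ("similar_shape")
    source: Concept
    target: Concept
    weight: float = 1.0

rock = Concept(["rock", "stone"])
mineral = Concept(["mineral"])
kb = [Relation("specialization", rock, mineral, weight=0.9)]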

57 Mapping to MPEG-7
(Diagram: MediaNet's semantic and perceptual concepts map to MPEG-7 Semantic Entities with word Labels; the illustrating AV content maps to Segments with Segment Descriptors; concept relations map to Semantic Relations.)

58 Construction and Retrieval
- MediaNet construction: textual annotations, WordNet, an image network of examples, automatic feature extraction tools, and human assistance
- Intelligent CBIR using MediaNet: expand and translate queries across modalities; e.g., the text query "tapirs" is processed against the MediaNet KB, turned into content-based queries over the feature database, and answered by the content-based search engine

59 MediaNet Construction
- WordNet supplies senses and semantic relations (hype/hyponymy, mero/holonymy, antonymy) for the annotation words, e.g., "rock"/"stone" (rock, stone; rock candy; rock music; rock, careen, sway; rock, sway, shake; cradle, rock), "sky" (sky; flip, toss, sky, pitch), "sunset"
- Automatic feature extraction tools (color histogram, color coherence, Tamura texture, wavelet texture) process the annotated example images ("rock and sky", "sunset"), and per-concept feature centroids are formed (see the sketch below)
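
A hedged sketch of forming per-concept feature centroids from annotated example images, as described above; the feature extractor is a placeholder stub:

import numpy as np

def color_histogram(image):
    # Placeholder for one of the listed tools (color histogram, color
    # coherence, Tamura texture, wavelet texture).
    return np.asarray(image, dtype=float)

def concept_centroids(examples_by_concept):
    """examples_by_concept: {"sky": [img, ...], "rock": [...]} -> centroids."""
    return {
        concept: np.mean([color_histogram(img) for img in imgs], axis=0)
        for concept, imgs in examples_by_concept.items()
    }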

60 Multimodal Query Translation and Expansion
(Diagram: a query, either a visual example or a word such as "tapir", is mapped into the semantic space via the feature centroids and word labels; it is expanded along weighted semantic relations (hypo/hypernymy: 1, mero/holonymy: 2, antonymy: MAX) to related concepts (tapir, snake, monkey); the expanded query is translated into content-based queries in feature space, and each image is scored by a weighted minimum distance.)
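
A hedged sketch of this expansion and translation; the relation costs follow the diagram (hypo/hypernymy 1, mero/holonymy 2, antonymy effectively excluded), while the scoring is a paraphrase of "weighted minimum distance per image", not the exact IMKA formula:

import heapq
import numpy as np

REL_COST = {"hyponym": 1, "hypernym": 1, "meronym": 2, "holonym": 2}

def expand(start, edges, max_cost=2):
    """edges: {concept: [(relation, concept), ...]} -> {concept: cost}."""
    costs, frontier = {start: 0}, [(0, start)]
    while frontier:
        cost, concept = heapq.heappop(frontier)
        for rel, nxt in edges.get(concept, []):
            c = cost + REL_COST.get(rel, float("inf"))  # antonymy -> MAX
            if c <= max_cost and c < costs.get(nxt, float("inf")):
                costs[nxt] = c
                heapq.heappush(frontier, (c, nxt))
    return costs

def rank_images(query_concept, edges, centroids, image_features):
    """Translate the expanded concepts to feature space; rank each image by a
    weighted minimum distance to any expanded concept's centroid."""
    expanded = expand(query_concept, edges)
    scores = []
    for img_id, feat in image_features.items():
        d = min(
            np.linalg.norm(feat - centroids[c]) * (1 + cost)
            for c, cost in expanded.items()
            if c in centroids
        )
        scores.append((d, img_id))
    return sorted(scores)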

61 Evaluation
- Evaluation criteria: retrieval effectiveness (recall, precision) and additional functionality
- Ground truth: images from the MPEG-7 content set; 50 queries defined by MPEG-7 for color descriptor evaluation
- Semantic query "tapirs" defined by the authors, with relevance scores: tapir images 1.0, monkey images 0.75, snake images 0.5, butterfly and fish images 0.25
- MediaNet KB construction: 185 images in 50 classes with annotations; 96 concepts; relations: 50 specialization, 34 composition, 1 opposite
- Experiments: color histogram vs. several color and texture descriptors; visual query vs. text query

62 Experimental Results
(Plots: average precision vs. recall for the 50 color MPEG-7 queries, and precision vs. recall for the semantic query "tapirs"; each compares visual queries without MediaNet, visual queries with MediaNet, and text queries with MediaNet.)

63 Experiment Conclusions
Summary of retrieval effectiveness (visual queries, color histogram):

                 50 MPEG-7 queries      Semantic query "tapirs"
                 W/o MN      W/ MN      W/o MN      W/ MN
3-point avg.     0.71        0.66       0.43        0.80
11-point avg.    0.65        0.61       0.35        0.78

Conclusions:
- Improved retrieval effectiveness: roughly 100% improvement for the semantic query "tapirs"; similar performance on the 50 visual/semantic queries
- Color histogram confirmed as a relevant feature for retrieval
- Additional functionality: multi-modal queries, visual and/or textual

64 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA (structure + semantics): Multimedia Knowledge Framework, Extraction and Application
- KIA (summary): AV Scene Segmentation, Discovery and Summarization
- Summary

65 Summary Description Tools
(Same Hierarchical Summary DS diagram as slide 22.)

66 KIA
Visual skim generation: fully automatic reduction of the duration of the original video, given a target time.
Constraints:
- Preserve the semantics
- Preserve the frame rate
Motivation: on-demand summaries, browsing of digital archives, and fast-forwarding of streaming video while maintaining the frame rate.

67 Prior Work
- Informedia project [CHI 1998]
- Microsoft Research [ACM MM 1999]
- MoCA [J. VCIR 1996]
Issues:
- Shots are considered to be indivisible
- Little analysis of the effect of video syntax on semantics
(J. VCIR: Journal of Visual Communication and Image Representation; CHI: ACM Conference on Human Factors in Computing Systems.)

68 KIA Approach
- Automatic determination of computable scenes and structures
- Derive the relationship between the minimum time needed to comprehend a shot and its visual complexity
- Rules of film syntax are used for shot removal
- Finally, the problem is cast as maximizing an objective function subject to constraints (see the sketch below)
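
As a rough, hedged illustration of casting skim generation as constrained maximization (the actual objective, comprehension-time model, and film-syntax rules are in the cited papers), a greedy sketch that trims shots toward a target duration while never cutting a shot below its minimum comprehension time:

# min_time() is an illustrative stand-in for the paper's complexity-to-
# comprehension-time relationship, not the authors' model.
def min_time(shot):
    # Assumed: more visually complex shots need longer to comprehend.
    return 0.5 + 1.5 * shot["complexity"]   # seconds, complexity in [0, 1]

def make_skim(shots, target):
    """Shrink the total duration to `target`, keeping every shot comprehensible."""
    durations = {s["id"]: s["duration"] for s in shots}
    total = sum(durations.values())
    while total > target:
        # Greedily trim the shot with the most slack above its minimum time.
        slack, shot = max(
            ((durations[s["id"]] - min_time(s), s) for s in shots),
            key=lambda t: t[0],
        )
        if slack <= 0:       # constraints make the target infeasible
            break
        cut = min(slack, total - target)
        durations[shot["id"]] -= cut
        total -= cut
    return durations

shots = [{"id": 1, "duration": 10.0, "complexity": 0.2},
         {"id": 2, "duration": 8.0, "complexity": 0.9}]
print(make_skim(shots, target=6.0))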

69 Summary Creation

70 Mapping to MPEG-7
(Diagram: the generated skim segments map to MPEG-7 Highlight Segments.)

71 Experiment Results
- Original: 114 sec.; skim: 33 sec. (70% data reduction)
- User studies validate the approach: all the skims tested were deemed coherent by the users
- Excellent results for compression rates of 70-80%
(Sundaram and Chang, ACM MM 2000; Sundaram and Chang, ICME 2001)

72 Outline
- MPEG-7 Standard: Structure, Semantics and Summarization Description Tools
- AMOS (structure): Video Object Segmentation and Retrieval
- Visual Apprentice (structure + semantics): Learning Object/Scene Detectors from User Input
- IMKA (structure + semantics): Multimedia Knowledge Framework, Extraction and Application
- KIA (summary): AV Scene Segmentation, Discovery and Summarization
- Summary

73 Summary
- MPEG-7 Standard: multimedia description
  - Describes structure, semantics and summaries, among others
  - Segmentation, searching, filtering, understanding and summarization of multimedia are still challenges
- AMOS: video object segmentation and retrieval
  - Semi-automatic segmentation based on region tracking
  - Retrieval based on visual features and spatio-temporal relations
- Visual Apprentice: learning of visual object/scene detectors
  - Users define visual classes and provide training examples
  - System combines features/learning algorithms at multiple levels
- IMKA: Intelligent Multimedia Knowledge Application
  - Multimedia to represent semantic/perceptual world knowledge
  - Extracts multimedia knowledge for image retrieval
- KIA: high-level audio-visual summaries
  - Automatic AV scene segmentation and structure discovery
  - Generates video skims preserving semantics

74 For More Info, Papers, ... (I)
Columbia University:
- AMOS:
- Visual Apprentice:
- IMKA:
- KIA:

75 For More Info, Papers, ... (II)
- DVMM Group:
- ADVENT Project:
- MPEG Committee:

76 The End
Thanks for your attention!

