
1 MPEG-7 Implementation: Extraction and Application of MPEG-7 Description Tools
A. B. Benítez, D. Zhong, A. Jaimes, H. Sundaram and S.-F. Chang
{ana, dzhong, ajaimes, sundaram, ee.columbia.edu
Digital Video and Multimedia Group (DVMM), Department of Electrical Engineering, Columbia University

2 Overview
MPEG-7 Standard: Multimedia description
– Describes structure, semantics and summaries, among others
– Segmentation, searching, filtering, understanding and summarization of multimedia are still challenges
AMOS: Video object segmentation and retrieval
– Semi-automatic segmentation based on region tracking
– Retrieval based on visual features and spatio-temporal relations
Visual Apprentice: Learning of visual object/scene detectors
– Users define visual classes and provide training examples
– System combines features/learning algorithms at multiple levels
IMKA: Intelligent Multimedia Knowledge Application
– Multimedia to represent semantic/perceptual information about the world
– Extracts multimedia knowledge for image retrieval
KIA: High-level audio-visual summaries
– Automatic AV scene segmentation and structure discovery
– Generates video skims preserving semantics

3 Outline
MPEG-7 Standard
– Structure, Semantics and Summarization Description Tools
AMOS
– Video Object Segmentation and Retrieval
Visual Apprentice
– Learning Object/Scene Detectors from User Input
IMKA
– Multimedia Knowledge Framework, Extraction and Application
KIA
– AV Scene Segmentation, Discovery and Summarization
Summary


5 Motivation for MPEG-7
Explosive proliferation of multimedia content
Efficient, intelligent, interoperable applications:
– Storage and retrieval
– Multimedia editing
– Personalized TV
– Remote sensing applications
– Universal Multimedia Access
– Surveillance applications
– Etc.
Examples: soap operas, sports (consumer content); news, scientific content

6 MPEG-7 Standard
Flexible, extensible, multi-level, and standard framework for describing multimedia.
Parts: Systems, DDL, Video, Audio, MDS, Software.
Scope: Feature Extraction → MPEG-7 Description → Search/Filtering Application.
Schedule: Call For Proposals 10/98, Working Draft 12/99, Committee Draft 10/00, International Standard 9/01.

7 MPEG-7 Framework
Description Definition Language (DDL)
– Language to create new Ds/DSs or extend existing ones (XML Schema)
Description Schemes (DSs)
– Structure and semantics of relations among Ds/DSs
Descriptors (Ds)
– Representation of a feature of AV data
(Diagram: the DDL defines DSs and Ds; Ds describe Features of AV content items; a Feature signifies the Data to a User or System.)

8 Multimedia Description Schemes
– Content description: Structure, Semantics
– Content management: Creation & Production, Media, Usage
– Content organization: Collections, Models
– Navigation & Access: Summaries, Variations, Views
– User interaction: User Preferences, User History
– Basic elements: Schema Tools, Basic Datatypes, Links & Media Localization, Basic Tools


10 Structure Description Tools Segment DS describes Multimedia Content Segment Relation CS Segment Decomposition DS VideoSegment DS MovingRegion DS... StillRegion DS TextAnnotation SpatialMask...

11 Still regions Segments

12 Segment Attributes
– Color: Dominant Color, Scalable Color, Color Layout, Color Structure, GoF/GoP Color
– Texture: Homogeneous Texture, Texture Browsing, Edge Histogram
– Shape: Region Shape, Contour Shape, 3D Shape
– Motion: Camera Motion, Motion Trajectory, Parametric Motion, Motion Activity
– Localization: Region Locator, Spatio-Temporal Locator, Media Time
– Other: Face Recognition, Text Annotation, Creation Info, Usage Info, Media Info

13 Segment Relations
– Segment decompositions: spatial, temporal, spatio-temporal, media source
– Spatial relations: south, west, northwest, southwest, left, below, under, equal, inside, covers, overlaps, disjoint
– Temporal relations: before, meets, overlaps, during, contains, starts, finishes, equal; sequential, parallel
– Spatio-temporal relations: union, intersection
– Other relations: keyFor, annotates
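The temporal relations listed above follow Allen's interval algebra. As a hedged sketch (function name and the interval encoding as (start, end) tuples are our own, not from MPEG-7), the relation between two temporal segments can be computed like this:

```python
# Sketch: classify the temporal relation of interval a to interval b,
# using the Allen-style relation names from the slide above.
def temporal_relation(a, b):
    """Intervals are (start, end) tuples with start < end."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 == s2 and e1 == e2:
        return "equal"
    if s1 == s2 and e1 < e2:
        return "starts"
    if e1 == e2 and s1 > s2:
        return "finishes"
    if s1 > s2 and e1 < e2:
        return "during"
    if s1 < s2 and e1 > e2:
        return "contains"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    return "other"  # inverse relations, not distinguished here
```

For example, a shot spanning frames 2..3 inside a scene spanning 1..4 is "during" the scene.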

14 Structure Description I Still Region TextAnnotation ColorStructure Spatial Relation Still Regions left Spatial Decomposition no overlap, gap ContourShape TextAnnotation

15 Structure Description I - XML Alex shakes hands with Ana Alex... Ana... left
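The XML markup on this slide did not survive the transcript; only its text nodes remain. As a hedged reconstruction, the sketch below builds an MPEG-7-style still-region description of the scene with Python's ElementTree. The element names follow the MPEG-7 MDS (StillRegion, TextAnnotation, FreeTextAnnotation, SpatialDecomposition), but the ids, the overlap/gap attributes, and the Relation encoding are illustrative assumptions, not the slide's exact markup.

```python
# Sketch of an MPEG-7-style structure description: a still region for the
# whole image, spatially decomposed into two still regions related by "left".
# Ids and attribute spellings are illustrative, not from the original slide.
import xml.etree.ElementTree as ET

def still_region(region_id, label):
    region = ET.Element("StillRegion", id=region_id)
    ann = ET.SubElement(region, "TextAnnotation")
    ET.SubElement(ann, "FreeTextAnnotation").text = label
    return region

scene = still_region("SR1", "Alex shakes hands with Ana")
decomp = ET.SubElement(scene, "SpatialDecomposition", overlap="false", gap="true")
decomp.append(still_region("SR2", "Alex"))
decomp.append(still_region("SR3", "Ana"))
ET.SubElement(scene, "Relation", type="left", source="SR2", target="SR3")

xml_text = ET.tostring(scene, encoding="unicode")
```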

16 Moving Region Spatial Decomposition no overlap, gap Structure Description II Video Segment Temporal Decomposition Video Segments no overlap, no gap Moving Regions Spatial Decomposition no overlap, no gap Spatial Relation above keyFor Other Relation

17 Structure Description II Video Segment Temporal Decomposition Moving Region Spatial Decomposition Moving Regions Spatial Decomposition Spatial Relation Video Segments above no overlap, no gap no overlap, gap no overlap, no gap MediaTime Mosaic GoFGoPColor TextAnnotation MediaTime ScalableColor ParametricMotion TextureBrowsing ContourShape TextAnnotation

18 Semantics Description Tools captures Content Narrative World describes Semantic DS Multimedia Definition... Label SemanticBag DS SemanticBase DS Object DS Event DS Concept DS SemanticState DS SemanticPlace DS SemanticTime DS AgentObject DS AnalyticModel DS Segment DS Semantic Relation DS

19 Semantic Description
(Diagram: Shake hands (Event) linked by agent and accompanier relations to Alex (Agent Object) and Ana (Agent Object); property Friendship (Concept); time 9th Sept (Semantic Time); location New York (Semantic Place); symbolPerception and mediaPerception relations to the media.)
Labels: man. Definitions: a primate of the family Hominidae. Properties: tall, slim.

20 Semantic Description - XML Alex shakes hands with Ana Shake hands Alex Ana Comradeship New York September 9

21 Semantic Relations
The vessel is an example of Maya art created in Guatemala in the 8th century. The vessel's height is 14 cm and it has several paintings. The paintings show the realm of the lords of death, with a death figure that dances and another figure that holds an axe and a handstone. The paintings represent sacrifice.

22 Summary Description Tools Audio Visual Context Hierarchical Summary DS Highlight Summary DS Highlight Segment DS Highlight Segment DS Highlight Summary DS Highlight Segment DS Highlight Segment DS

23 Summary Description
Summaries: International, Sports, Environmental, Skim all news

24 Outline
MPEG-7 Standard
– Structure, Semantics and Summarization Description Tools
AMOS (structure)
– Video Object Segmentation and Retrieval
Visual Apprentice
– Learning Object/Scene Detectors from User Input
IMKA
– Multimedia Knowledge Framework, Extraction and Application
KIA
– AV Scene Segmentation, Discovery and Summarization
Summary

25 Structure Description Tools Segment DS describes Multimedia Content Segment Relation CS Segment Decomposition DS VideoSegment DS MovingRegion DS... StillRegion DS TextAnnotation SpatialMask...

26 AMOS
Uniform low-level feature region segmentation (e.g. using color and motion)
Semantic video object segmentation
Low-level feature region-based similarity search
Semantic video object-based similarity search
Model semantic video objects:
– Underlying regions
– Visual features
– Spatio-temporal relations

27 Video Object Segmentation 1 Object Definition (user input) Region Segmentation starting frame Region Tracking Motion Projection succeeding frame 2 Homogeneous Regions Region Aggregation Video Objects Semantic Video Object Foreground Regions Background Regions

28 Object Projection and Tracking FG BG segmented regions at frame n-1 projected regions at frame n hole : new region Egomotion Model:
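The egomotion equation on this slide was lost in the transcript. As an assumption, a common choice for this kind of region projection is a 6-parameter affine model mapping pixel coordinates from frame n-1 into frame n; the sketch below uses that form, with parameter names of our own.

```python
# Sketch of region projection under a 6-parameter affine egomotion model.
# The slide's actual equation did not survive; the affine form here is an
# assumption, a common choice for projecting segmented regions forward.
def affine_project(points, params):
    """Map (x, y) points from frame n-1 into frame n.
    params = (a1, a2, a3, a4, a5, a6):
        x' = a1*x + a2*y + a3
        y' = a4*x + a5*y + a6
    """
    a1, a2, a3, a4, a5, a6 = params
    return [(a1 * x + a2 * y + a3, a4 * x + a5 * y + a6) for x, y in points]

# Identity rotation/scale plus a translation by (2, -1):
identity_shift = (1.0, 0.0, 2.0, 0.0, 1.0, -1.0)
```

Pixels of frame n not covered by any projected region ("holes" in the diagram) would then seed new regions.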

29 Segmentation Results
(Plots: number of frames with user input, and maximum boundary deviation, vs. frame #)
Uncovered regions caused major tracking errors, corrected by a few user inputs. Remaining errors reflect accuracy limitations.

30 Features and Relations
Visual features for regions and objects:
– Representative color
– Tamura texture
– Shape descriptors
– Motion trajectory
Spatio-temporal relations:
– Spatial orientation graph (angle)
– Spatial topological graph (contains, not contain, contained)
– Temporal directional graph (start after, same time, before)

31 Mapping to MPEG-7
Visual features for regions and objects (representative color, Tamura texture, shape descriptors, motion trajectory) map to Visual Descriptors of Moving Regions.
Spatio-temporal relations (spatial orientation, spatial topological, temporal directional graphs) map to Segment Relations and the spatio-temporal decomposition of the Moving Region.

32 Video Object Searching
Region matching: for each query region, find a candidate region list based on visual feature distance (i.e. color, texture, etc.)
Join & validation: join the candidate region lists and compute the total object distance (visual features + spatio-temporal relations)
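The two-step search above can be sketched as follows. This is a minimal illustration, not the AMOS implementation: the function names, the plain-tuple feature vectors, and the scalar spatio-temporal penalty are all assumptions.

```python
# Sketch of the two-step object search: per-region candidate lists ranked by
# feature distance, then a join scored by total object distance.
def feature_distance(f1, f2):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5

def candidate_regions(query_features, db_regions, k=3):
    """Step 1: for one query region, the k nearest database regions."""
    ranked = sorted(db_regions,
                    key=lambda r: feature_distance(query_features, r["features"]))
    return ranked[:k]

def object_distance(query_regions, db_regions, spatial_penalty):
    """Step 2: join/validation -- summed visual distances plus a penalty
    for mismatched spatio-temporal relations (0 if relations agree)."""
    visual = sum(feature_distance(q, r["features"])
                 for q, r in zip(query_regions, db_regions))
    return visual + spatial_penalty
```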

33 Search Interface Query Results Query Canvas Feature Weights

34 Demo Time AMOS Segmentation System AMOS Search System

35 Outline
MPEG-7 Standard
– Structure, Semantics and Summarization Description Tools
AMOS (structure)
– Video Object Segmentation and Retrieval
Visual Apprentice (structure + semantics)
– Learning Object/Scene Detectors from User Input
IMKA
– Multimedia Knowledge Framework, Extraction and Application
KIA
– AV Scene Segmentation, Discovery and Summarization
Summary

36 Semantics Description Tools captures Content Narrative World describes Semantic DS Multimedia Definition... Label SemanticBag DS SemanticBase DS Object DS Event DS Concept DS SemanticState DS SemanticPlace DS SemanticTime DS AgentObject DS AnalyticModel DS Segment DS Semantic Relation DS

37 The Visual Apprentice
Focus on semantics, object/scene structure, and user preferences.
Visual object/scene detectors:
– Automatically assign semantic labels to objects (e.g., sky) or scenes (e.g., handshake scene)
User input and learning:
– User defines models and provides training examples
– Learning algorithms + different features + training examples = automatic visual detectors

38 Definition Hierarchy
User input:
– Define hierarchy: decide nodes and containment relationships.
– For each node: label examples in images/videos.
– Labeling by clicking on regions or outlining areas.
(Hierarchy: Level 1: Object; Level 2: Object-parts 1..n; Level 3: Perceptual Areas 1..n; Level 4: Regions)

39 Definition Hierarchy Batter Regions Batting GroundPitcher GrassSand Regions Level 4: Region Level 2: Object-part Level 1: Object Level 3: Perceptual Area

40 Definition Hierarchy Example Batter Batting GroundPitcher GrassSand Regions

41 Learning Detectors from User Input
Training data:
– For each node of the hierarchy, a set of examples.
– A superset of features (incl. MPEG-7) is extracted from each example:
Color (Average LUV, Dominant Color, etc.)
Shape & location (Perimeter, Formfactor, Eccentricity, etc.)
Texture (Edge Direction Histogram, Tamura, etc.)
Motion (Trajectory, velocity, etc.)
A superset of machine learning algorithms is applied to the training data.
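The per-node training step can be sketched as below. As an illustration only, a nearest-centroid rule stands in for the "superset of machine learning algorithms" on the slide; the function names and the use of plain tuples as feature vectors are our own assumptions.

```python
# Sketch: each hierarchy node gets its own classifier learned from labeled
# example feature vectors (positives = examples of the node, negatives = not).
def centroid(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def train_node_classifier(positives, negatives):
    """Return a classify(features) -> bool function for one hierarchy node."""
    pos_c, neg_c = centroid(positives), centroid(negatives)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def classify(features):
        # Accept the region if it is closer to the positive centroid.
        return sq_dist(features, pos_c) <= sq_dist(features, neg_c)

    return classify
```

A detector for the whole hierarchy would combine such per-node classifiers bottom-up, from regions to perceptual areas, object-parts, and the object.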

42 Learning Classifiers for Nodes
Stage 1: Training data obtained and feature vectors computed (MPEG-7).
Stage 2: Training (machine learning algorithm applied per node of the definition hierarchy).
Stage 3: Classifiers generate MPEG-7 descriptions.
(Diagram: Definition Hierarchy → Machine Learning Algorithm → Visual Detector)

43 An Example Face 2 Regions Handshake Face 1 Handshake Regions Region Object-part Object Perceptual Area CrCr CfCf Face region classifier Determines Face o-p classifier input

44 Learning Classifiers
Stage 1: Training data obtained and feature vectors computed (MPEG-7 features).
Stage 2: Classifiers learned (Learning Algorithms 1 … n produce multiple classifiers C1 … Cn for node D1 of the definition hierarchy).
Stage 3: Classifiers/features selected.

45 Training Summary
User:
– Defines definition hierarchy
– Labels example images/video according to hierarchy (semantic MPEG-7 descriptions generated for training set)
System:
– Automatically segments examples
– Extracts visual features for each node (structure, MPEG-7)
– Applies set of learning algorithms to each node
– Selects best features for each algorithm, yielding a set of classifiers for each node
– Best classifier selected or different classifiers combined
– Assigns semantic labels (semantics, MPEG-7)

46 Mapping to MPEG-7 Batter Batting GroundPitcher GrassSand Regions Visual features Moving and Still Regions Visual Descriptors Semantic Entities and Relations

47 Classification Summary Face 2 Regions Handshake Face 1 Handshake Regions Region Object-part Object Perceptual Area Automatic segmentation Feature extraction (MPEG-7) Classification and grouping Generation of MPEG-7 descriptors –Regions, groups of regions, –Object/scene

48 Experiments (I) Set I: Sky images Set II: Handshake images Level 4: Region Level 2: Object-part Level 1: Object Level 3: Perceptual Area Regions Sky Face 2 Regions Handshake Face 1 Handshake Regions Region Object-part Object Perceptual Area

49 Experiment (II) Set III: Baseball Video Batter Regions Batting GroundPitcher GrassSand Regions Level 4: Region Level 2: Object-part Level 1: Object Level 3: Perceptual Area

50 Overall Performance
Classification results:
Image set  | Training set size | Test set | Accuracy | Precision | Recall
Baseball   | 60 video shots    | 316      | 92%      | 100%      | 64%
Handshakes | 80 images         | 733      | 94%      | 70%       | 74%
Skies      | 45 images         | 1300     | 94%      | 87%       | 50%

51 Outline
MPEG-7 Standard
– Structure, Semantics and Summarization Description Tools
AMOS (structure)
– Video Object Segmentation and Retrieval
Visual Apprentice (structure + semantics)
– Learning Object/Scene Detectors from User Input
IMKA (structure + semantics)
– Multimedia Knowledge Framework, Extraction and Application
KIA
– AV Scene Segmentation, Discovery and Summarization
Summary

52 IMKA
Explore new frontiers for Multimedia Information Systems (MISs):
– Construct intelligent and interactive MISs for analysis, retrieval, navigation, and synthesis of multimedia
– Paradigm shift towards more semantics-based, knowledge-driven MISs
– Anticipate impact of MPEG-7 on MISs
– Improve effectiveness and performance of MISs

53 The IMKA System
MediaNet: multimedia knowledge representation framework
– Extends traditional knowledge representations by incorporating perceptual and symbolic information
– Defines and illustrates concepts and relations using multi-modal content and descriptors
– Encoded using MPEG-7 description tools
Implementation:
– Semi-automatic construction of MediaNet knowledge base
– CBIR query expansion and multi-modal query translation
Experimentation:
– MPEG-7 color image test set (5466 images, 51 queries)
– Initial results show improved retrieval effectiveness

54 MediaNet Evolution WordNet Lexical concepts Semantic Word MediaNet Semantic concepts Semantic Word Perceptual concepts Perceptual Word AV content, descriptors, descriptor similarity MMT Semantic concepts Semantic Word AV content descriptors Mirror Thesaurus Perceptual concepts Word AV descriptors Semantic concepts Perceptual

55 MediaNet (Symbolic + Perceptual)
Novel multimedia representation of world concepts at symbolic and perceptual levels:
– Illustration of concepts using multimedia content.
– Perceptual feature-based relations.
– Weights, probabilities, and conceptual contexts.
(Example diagram: Human Concept and Hominid Concept linked by "specialization of" and "similar shape" (shape descriptor distance < T) relations, with weights and probabilities such as W = 0.5, P = 1.0; concepts illustrated by the words "man", "homo", "human", "hominid", the definition "a primate of the family Hominidae", and multimedia content with a shape descriptor.)

56 MediaNet Constructs
Concepts:
– Real world entities: rock, game.
– Abstract concepts: beauty.
– Unnamed objects: texture pattern.
Relations:
– Semantic relations (WordNet): generalization (animal, dog).
– Perceptual relations (CBR descriptor similarity): to have similar shape.
Content:
– Multimedia data (image, text), feature descriptors (color histogram), descriptor similarity (Euclidean).
– Some representations not relevant for some concepts (audio for concept Sky).

57 Mapping to MPEG-7 Semantic concepts Semantic Word Perceptual concepts Perceptual Word AV content, descriptors, descriptor similarity Semantic Entities Labels Segment Segment Descriptors Semantic Relations

58 Construction and Retrieval
MediaNet construction:
– Textual annotations, WordNet, image network of examples, automatic extraction tools, and human assistance.
Intelligent CBIR using MediaNet:
– Expand and translate queries across modalities.
(Diagram: a visual query such as "tapirs" goes through the Query Processor with the MediaNet KB, is issued as CB queries to the CB Search Engine over the Feature Database, and CB results are returned.)

59 MediaNet Construction Rock And Sky Sunset … Rock, stone Rock candy, rock Rock music, rock Rock, careen, sway Rock, sway, shake Cradle, rock Sky Flip, toss, sky, pitch Sunset …. WordNet: Senses WordNet: Hyper/hyponymy Mero/Holonymy Antonymy + features Feature centroids Automatic feature extraction tools: Color histogram Color coherence Tamura texture Wavelet texture stone rock sky sunset Rock and sky Sunset

60 Multimodal Query Translation and Expansion tapir Weighted minimum distance per image Feature Space Tapir Snake Tapir Snake Monkey Semantic Space Hypo/Hypernymy 1 Mero/Holonymy 2 Antonymy MAX Query CB Queries Expanded Query + features Feature centroids stone rock sky sunset
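The expansion step above can be sketched as follows. The concept graph, the per-relation weights, and the function names are illustrative assumptions, not the IMKA knowledge base; the idea is only that a query concept spreads through weighted semantic relations, and each reached concept contributes its feature centroid as a content-based sub-query.

```python
# Sketch of semantic query expansion over a small concept graph.
# Relation weights are invented for illustration.
RELATION_WEIGHT = {"hyponym": 1.0, "hypernym": 0.5, "meronym": 0.5}

def expand_query(concept, graph, max_hops=1):
    """Return {concept: weight} for concepts reachable within max_hops,
    multiplying relation weights along the path and keeping the maximum."""
    expanded = {concept: 1.0}
    frontier = [concept]
    for _ in range(max_hops):
        next_frontier = []
        for c in frontier:
            for relation, neighbor in graph.get(c, []):
                w = expanded[c] * RELATION_WEIGHT.get(relation, 0.0)
                if w > expanded.get(neighbor, 0.0):
                    expanded[neighbor] = w
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return expanded
```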

61 Evaluation
Evaluation criteria:
– Retrieval effectiveness (recall, precision)
– Additional functionality
Ground truth: 5281 images from MPEG-7 content set
– 50 queries by MPEG-7 for color descriptor evaluation.
– Semantic query "tapirs" by authors with relevance scores: tapir images 1.00, monkey images 0.75, snake images 0.50, butterfly and fish images 0.25.
MediaNet KB construction: 185 images in 50 classes with annotations
– Concepts: 96.
– Relations: 50 specialization, 34 composition, 1 opposite.
Experiments:
– Color histogram vs. several color and texture descriptors.
– Visual query vs. text query.

62 Experimental Results
(Plots: average precision vs. recall for the 50 color MPEG-7 queries, and precision vs. recall for the semantic query "tapirs"; curves compare visual w/o MediaNet, visual w/ MediaNet, and text w/ MediaNet.)

63 Experiment Conclusions
Summary retrieval effectiveness results: visual queries, color histogram.
Conclusions:
– Improved retrieval effectiveness: 100% improved performance for the semantic query "tapirs"; similar performance for the 50 visual/semantic queries; color histogram as relevant feature for retrieval.
– Additional functionality: multi-modal queries, visual and/or textual.
(Table: 3-point average precision, w/o vs. w/ MediaNet, for the 50 MPEG-7 queries and the semantic query "tapirs".)

64 Outline
MPEG-7 Standard
– Structure, Semantics and Summarization Description Tools
AMOS (structure)
– Video Object Segmentation and Retrieval
Visual Apprentice (structure + semantics)
– Learning Object/Scene Detectors from User Input
IMKA (structure + semantics)
– Multimedia Knowledge Framework, Extraction and Application
KIA (summary)
– AV Scene Segmentation, Discovery and Summarization
Summary

65 Summary Description Tools Audio Visual Context Hierarchical Summary DS Highlight Summary DS Highlight Segment DS Highlight Segment DS Highlight Summary DS Highlight Segment DS Highlight Segment DS

66 KIA
Visual skim generation:
– Fully automatic reduction of the duration of the original video, given a target time.
Constraints:
– Preserve the semantics
– Preserve the frame rate

67 Prior Work
Informedia project [CHI 1998]
Microsoft Research [ACM MM 1999]
MoCA [J. VCIR 1996]
Issues:
– Shots are considered to be indivisible.
– Little analysis of the effect of video syntax on semantics.

68 KIA Approach
1. Automatically determine computable scenes and structures.
2. Derive the relationship between a shot's minimum comprehension time and its visual complexity.
3. Apply rules of film syntax for shot removal.
4. Cast the problem as objective-function maximization subject to constraints.
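The steps above can be sketched as a toy optimization. This is a hedged illustration under our own assumptions, not the paper's algorithm: shots are first cut down to their minimum comprehension times, and then whole shots are greedily dropped (shortest originals first, as a crude stand-in for the film-syntax rules) until the skim fits the target duration.

```python
# Toy sketch of skim generation as constrained duration reduction.
# The greedy drop order and the per-shot minimum times are assumptions.
def make_skim(shot_durations, min_times, target):
    """Return (kept shot indices, total skim duration <= target when possible)."""
    # Rank shots by original duration, longest first (proxy for importance).
    kept = sorted(range(len(shot_durations)), key=lambda i: -shot_durations[i])
    # Constraint: every kept shot plays at least its minimum comprehension time.
    durations = dict(enumerate(min_times))
    while kept and sum(durations[i] for i in kept) > target:
        kept.pop()  # drop the least important remaining shot
    return sorted(kept), sum(durations[i] for i in kept)
```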

69 Summary Creation

70 Mapping to MPEG-7 Highlight Segments

71 Experiment Results
User studies validate our approach: all the skims tested are deemed coherent by the users.
Excellent results for compression rates of 70~80%.
Example: original 114 sec, skim 33 sec (70% data reduction).
[Sundaram and Chang, ACM MM 2000; Sundaram and Chang, ICME 2001]

72 Outline
MPEG-7 Standard
– Structure, Semantics and Summarization Description Tools
AMOS (structure)
– Video Object Segmentation and Retrieval
Visual Apprentice (structure + semantics)
– Learning Object/Scene Detectors from User Input
IMKA (structure + semantics)
– Multimedia Knowledge Framework, Extraction and Application
KIA (summary)
– AV Scene Segmentation, Discovery and Summarization
Summary

73 MPEG-7 Standard: Multimedia description
– Describes structure, semantics and summaries, among others
– Segmentation, searching, filtering, understanding and summarization of multimedia are still challenges
AMOS: Video object segmentation and retrieval
– Semi-automatic segmentation based on region tracking
– Retrieval based on visual features and spatio-temporal relations
Visual Apprentice: Learning of visual object/scene detectors
– Users define visual classes and provide training examples
– System combines features/learning algorithms at multiple levels
IMKA: Intelligent Multimedia Knowledge Application
– Multimedia to represent semantic/perceptual world knowledge
– Extracts multimedia knowledge for image retrieval
KIA: High-level audio-visual summaries
– Automatic AV scene segmentation and structure discovery
– Generates video skims preserving semantics

74 For More Info, Papers, …. (I) Columbia University: AMOS: Visual Apprentice: IMKA: KIA:

75 For More Info, Papers, …. (II) DVMM Group: ADVENT Project: MPEG Committee:

76 The End Thanks for your attention!

