Video Segmentation
Fengting Yang
Video Task Categories
Video object tracking
Video semantic segmentation
Video object segmentation
Video instance segmentation
Video Object Tracking
Input: video + bbox of the target(s) in the first frame
Output: bbox of the target(s) in the remaining frames
Example from the MOT dataset.
The most popular approach nowadays is tracking-by-detection; IoUTracker [1] and IoUTracker+ [2] provide very intuitive implementations of this paradigm (a simplified sketch follows).
[1] Bochinski, Erik, Volker Eiselein, and Thomas Sikora. "High-speed tracking-by-detection without using image information." AVSS, 2017.
[2] Bochinski, Erik, Tobias Senst, and Thomas Sikora. "Extending IOU based multi-object tracking by visual information." AVSS, 2018.
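A simplified sketch of the greedy IoU association that IoUTracker [1] builds on (my own illustration; function names and the threshold value are not from the authors' code):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def iou_track(detections_per_frame, sigma_iou=0.5):
    """Greedy tracking-by-detection: extend each active track with the
    detection that best overlaps its last box; leftovers start new tracks."""
    active, finished = [], []
    for dets in detections_per_frame:
        dets, extended = list(dets), []
        for track in active:
            best = max(dets, key=lambda d: iou(track[-1], d), default=None)
            if best is not None and iou(track[-1], best) >= sigma_iou:
                track.append(best)
                dets.remove(best)
                extended.append(track)
            else:
                finished.append(track)           # track could not be extended
        active = extended + [[d] for d in dets]  # unmatched boxes -> new tracks
    return finished + active
```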
Video Semantic Segmentation
Input: video
Output: per-frame semantic segmentation
Example from ICNet (ECCV'18): real-time, high-resolution, single-view based.
Optical flow is often adopted in video segmentation tasks [3]; a feature-warping sketch follows.
[3] Zhu, Xizhou, et al. "Deep feature flow for video recognition." CVPR, 2017.
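To make the flow idea concrete, here is a minimal PyTorch sketch (my own, not code from [3]) of the feature-warping step behind deep feature flow: expensive backbone features are computed once on a key frame and propagated to other frames with optical flow:

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp key-frame features to the current frame with optical flow.

    feat: (N, C, H, W) features computed once on the key frame
    flow: (N, 2, H, W) flow from the current frame back to the key frame,
          in pixel units (x-displacement first, then y)
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()  # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow            # sampling positions in key frame
    # normalize to [-1, 1] as required by grid_sample
    x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)           # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```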
Video Object Segmentation
Generic objects.
Semi-supervised:
Input: video + segment(s) in the first frame
Output: corresponding segment(s) in the remaining frames
Unsupervised:
Input: video
Output: segments of the objects that "consistently appear throughout the entire video and have predominant motion" [4]
Example from the DAVIS dataset (few instances per frame, no ID consistency).
[4] Caelles, Sergi, et al. "The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation." arXiv preprint, 2019.
Video Instance Segmentation
Input: video + the object categories present in the video
Output: label + ID + segments of all targeted objects
Example from the MOTS dataset.
Papers to Be Discussed
MaskTrack (single-object segmentation, semi-supervised)
FEELVOS (multi-object segmentation, semi-supervised)
Mask Track R-CNN & Track R-CNN (instance segmentation)
MaskTrack
Ref.: Perazzi, Federico, et al. "Learning video object segmentation from static images." CVPR, 2017.
Setting: single-object segmentation, semi-supervised
Main properties:
offline learning on a static-image dataset
online fine-tuning on the video sequence
good inclusivity (works with several kinds of input cues)
Offline Learning
Key idea: cast video object segmentation as mask refinement.
Assumption: small movement between adjacent frames, so the previous-frame mask is a rough estimate of the current one.
The rough estimate is simulated by deforming the GT label with:
Affine transformation (scaling + translation)
Non-rigid deformation
Dilation operation
Advantage: overcomes the data limitation (DAVIS only has 3,440 annotated images) by training on static-image datasets. A sketch of the deformation follows.
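A rough sketch of how a clean GT mask can be degraded into the coarse "previous-frame estimate" fed to the network (illustrative parameters; the paper's non-rigid, thin-plate-style warp is omitted here):

```python
import numpy as np
import cv2

def coarsen_mask(mask, max_shift=0.1, max_scale=0.1, dilate_px=5):
    """Simulate a rough previous-frame mask from a GT mask.

    mask: (H, W) binary uint8 mask; returns a shifted, scaled, dilated mask.
    """
    h, w = mask.shape
    s = 1.0 + np.random.uniform(-max_scale, max_scale)  # random scaling
    tx = np.random.uniform(-max_shift, max_shift) * w   # random translation
    ty = np.random.uniform(-max_shift, max_shift) * h
    M = np.float32([[s, 0, tx], [0, s, ty]])
    rough = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    rough = cv2.dilate(rough, np.ones((dilate_px, dilate_px), np.uint8))
    return rough
```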
Online Fine-Tuning
Method: apply affine and non-rigid deformations, flipping, and rotation to the given first-frame mask, gaining ~1,000 training samples for per-video fine-tuning.
Pro: closes the domain gap between the static training images and the test video.
Con: slow inference (12 s/frame).
Good Inclusivity
Box annotation: a bbox can replace the segmentation mask in the first frame by adding a bbox-to-segmentation ConvNet.
Optical flow: use the object flow magnitude as an additional input, pass it through the same network, and average the two outputs (from the RGB and flow inputs); a fusion sketch follows.
CRF: applied as post-processing to sharpen the edges.
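A minimal sketch of the RGB/flow fusion just described; `net` stands for the refinement ConvNet and all names are illustrative:

```python
import torch

def fused_prediction(net, rgb_in, flowmag_in):
    """Run the same network on the RGB input and on the flow-magnitude
    input, then average the two foreground probability maps (CRF
    post-processing would follow)."""
    p_rgb = torch.sigmoid(net(rgb_in))       # (N, 1, H, W)
    p_flow = torch.sigmoid(net(flowmag_in))  # same weights, flow input
    return 0.5 * (p_rgb + p_flow)
```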
Ablation Study
Ablation Study
The full method performs best under all challenge attributes except camera shake.
The more annotated segmentation frames provided, the better the performance.
Summary
Insights:
Enlarge the training set with offline learning
Close the domain gap with online fine-tuning
Optical flow and CRF help
Limitations:
Single object: cannot handle occlusion or association
Semi-supervised and mask-propagation based: cannot handle instances that first appear in a middle frame or that go out of view
Relies heavily on the last frame's prediction
Online fine-tuning makes processing slow
Fig. 3: the change of J-mean values over the length of the video sequences (from YouTube-VOS [4])
[4] Xu, Ning, et al. "YouTube-VOS: Sequence-to-sequence video object segmentation." ECCV, 2018.
FEELVOS
Ref.: Voigtlaender, Paul, et al. "FEELVOS: Fast end-to-end embedding learning for video object segmentation." CVPR, 2019.
Setting: multi-object segmentation, semi-supervised; uses only one ConvNet, no additional cues, and no first-frame fine-tuning
Main properties:
Feature embedding
Global matching + local matching
Dynamic segmentation head
Feature Embedding
Intuition: pixels of the same object have similar features; pixels of different objects have different features. Tracking which object a pixel belongs to thus reduces to matching features under a feature distance.
System Overview
Note: the authors claim that using the feature-matching result as a soft feature works better than treating it as hard, determinate evidence.
In training, they randomly choose one reference frame and two adjacent frames, and apply the loss only on the last frame.
Feature Matching
Global matching: match each current-frame pixel p to the first-frame pixels q.
Local matching: match each current-frame pixel p to the previous-frame pixels q, restricted to a neighborhood of p for computational efficiency.
Matching visualization (figure) and the distance function are shown on the slide; a sketch of both follows.
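FEELVOS turns squared embedding distances into a bounded score, d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)). Below is a sketch of global matching with that function (no local window and no tiling for memory efficiency; names are mine):

```python
import torch

def pixel_distance(e_p, e_q):
    """FEELVOS distance d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)), in [0, 1)."""
    sq_dist = ((e_p - e_q) ** 2).sum(dim=-1)
    return 1.0 - 2.0 / (1.0 + torch.exp(sq_dist))

def global_matching(cur_emb, obj_emb):
    """Distance map from current-frame pixels to one object's first-frame pixels.

    cur_emb: (H*W, D) current-frame embeddings
    obj_emb: (M, D) first-frame embeddings already filtered by the object mask
    Returns an (H*W,) map: the min distance per pixel, used as a soft cue.
    Local matching is the same computation against the previous frame,
    restricted to a spatial neighborhood of each pixel.
    """
    d = pixel_distance(cur_emb.unsqueeze(1), obj_emb.unsqueeze(0))  # (H*W, M)
    return d.min(dim=1).values
```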
Dynamic Segmentation Head
Handles the variation in the number of objects: the same segmentation head is applied once per object (with that object's matching cues), and the per-object outputs are combined.
Experiment
J: measures the mask IoU; F: measures the contour alignment;
J&F: mean of J and F; t: runtime in s/img
Experiment
FF: first frame; PF: previous frame
GM: global matching; LM: local matching; PFP: previous-frame prediction
Summary
Insight: association can be done by matching embedding features
Limitations:
relies on the quality of the first-frame annotation
cannot handle instances that first appear in a middle frame
MOTS
Ref.: Voigtlaender, Paul, et al. "MOTS: Multi-object tracking and segmentation." CVPR, 2019.
Main contributions:
A large video instance segmentation dataset
Evaluation metrics
A baseline network: Track R-CNN
MOTS Dataset
Annotation method: an iterative loop of manual annotation of samples, CNN-generated masks, and human correction.
Mask R-CNN
Evaluation Metrics
Intuition: encourage good mask IoU; punish missing instances and ID switches.
Refer to the paper for details; a hedged reconstruction of the headline metric follows.
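A hedged reconstruction of the headline metric (notation mine; see the paper for the authoritative definition). sMOTSA softens MOTSA's true-positive count with mask IoUs:

\[
\mathrm{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|},
\qquad
\widetilde{TP} = \sum_{h \in TP} \mathrm{IoU}\big(h, c(h)\big)
\]

where M is the set of ground-truth masks, FP the false positives, IDS the ID switches, and c(h) the ground-truth mask matched to hypothesis h.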
Mask R-CNN
[5] He, Kaiming, et al. "Mask R-CNN." ICCV, 2017.
Track R-CNN
Idea: add a tracking head on top of Mask R-CNN and incorporate multi-frame features into the feature extraction.
In training, they use 8 adjacent frames; the inference-time implementation is not described in the paper.
Tracking Head
Two fully connected layers produce a 128-d feature vector associated with each instance, trained with a feature embedding loss.
ID association: assign each current instance to the most similar instance in the previous β frames, using the Hungarian algorithm w.r.t. L1 distance and accepting only matches with L1 distance < δ; unassigned high-confidence instances are set as new instances. A sketch follows.
Experiment
maskprop: link Mask R-CNN results across frames with optical flow
box orig + MG: track with bboxes first, then segment inside each bbox
ours + MG: trained as ours, but masks come from Mask R-CNN's mask head via the maskprop mechanism
Even with GT bboxes, segmentation is not an easy task.
Experiment
ConvLSTM does not seem to help much.
Video Instance Segmentation
Ref.: Yang, Linjie, Yuchen Fan, and Ning Xu. "Video instance segmentation." ICCV, 2019.
Main contributions:
A large video instance segmentation dataset
Evaluation metrics
A baseline network: Mask Track R-CNN
Dataset: YouTube-VIS
Manually labeled.
Compared to MOTS: more videos (2,883 vs 25), more categories (40 vs 2), more instances (4,883 vs 977).
Mask Track R-CNN
Differences of the tracking head from Track R-CNN:
Segmentation and detection features come from a single view
A memory queue is used, and the stored feature is updated every frame when the same instance reappears
At test time, embedding feature similarity, semantic consistency, spatial correlation, and detection confidence are combined as association cues
Speaker's note: the every-frame feature update may not be a good idea; if mis-grouping happens, the stored feature gets contaminated (Fig. 4, last row). It is a trade-off between speed and performance.
Tracking Head
Define the embedding-feature-based association probability and a cross-entropy tracking loss (a hedged reconstruction follows).
ID association: v is the score for assigning instance i to stored ID n (n = 0 means a new ID); s is the classification score, b the bbox, c the classification category. Instances are associated according to the highest score, and instances within the same frame are never associated with each other.
One Problem of This Mechanism
It is not robust to intermediate mistakes: once an error (mis-grouping) happens, the stored feature gets contaminated and is hard to recover. This is the trade-off between speed and performance.
Experiments
AP: averaged over multiple intersection-over-union (IoU) thresholds.
AR: defined as the maximum recall given a fixed number of segmented instances per video.
Experiments
Image oracle (given GT bbox, segmentation, and category): helps a lot (the main future direction), but association is still not easy.
ID oracle (given GT association): helps a little, so there is little room for improvement from only modifying the current tracking mechanism.
Bbox and category consistency play an important role.
Summary
Insights:
use Mask R-CNN as the backbone for instance segmentation
use embedding features (and other cues) for association
Limitations:
temporal information does not seem to be well exploited in the CNN (e.g., via deep feature flow)
geometric cues are unused, e.g., relative depth and motion parallax
Appendix: J&F in DAVIS
J metric: region similarity (mask IoU). F metric: contour accuracy. J&F: their mean. Standard definitions are given below.
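Standard definitions, consistent with the DAVIS benchmark (M: predicted mask, G: ground-truth mask, P_c and R_c: contour precision and recall):

\[
\mathcal{J} = \frac{|M \cap G|}{|M \cup G|},
\qquad
\mathcal{F} = \frac{2 P_c R_c}{P_c + R_c},
\qquad
\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}
\]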
Appendix: Hungarian Algorithm
A good worked example is linked in the original slides; a minimal SciPy usage example is given after the CRF note below.
Appendix: CRF
An intuitive explanation is linked in the original slides (on Zhihu).
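A minimal worked example of the Hungarian algorithm via SciPy's linear_sum_assignment, as used for ID association earlier:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
print(rows, cols)                         # [0 1 2] [1 0 2]: row i -> col cols[i]
print(cost[rows, cols].sum())             # minimal total cost: 5
```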