Video Segmentation
Fengting Yang
Video Task Categories
Video object tracking
Video semantic segmentation
Video object segmentation
Video instance segmentation
Video Object Tracking
Input: video + bbox of the target(s) in the first frame
Output: bbox of the target(s) in the remaining frames
Example from the MOT dataset.
The most popular approach nowadays is tracking-by-detection; IoUTracker [1] and IoUTracker+ [2] provide very intuitive implementations of this paradigm (a simplified sketch follows).
[1] Bochinski, Erik, Volker Eiselein, and Thomas Sikora. "High-speed tracking-by-detection without using image information." AVSS, 2017.
[2] Bochinski, Erik, Tobias Senst, and Thomas Sikora. "Extending IOU based multi-object tracking by visual information." AVSS, 2018.
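A simplified sketch of the greedy IoU association that IoUTracker [1] builds on (my own illustration; function names and the threshold value are not from the authors' code):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def iou_track(detections_per_frame, sigma_iou=0.5):
    """Greedy tracking-by-detection: extend each active track with the
    detection that best overlaps its last box; leftovers start new tracks."""
    active, finished = [], []
    for dets in detections_per_frame:
        dets, extended = list(dets), []
        for track in active:
            best = max(dets, key=lambda d: iou(track[-1], d), default=None)
            if best is not None and iou(track[-1], best) >= sigma_iou:
                track.append(best)
                dets.remove(best)
                extended.append(track)
            else:
                finished.append(track)           # track could not be extended
        active = extended + [[d] for d in dets]  # unmatched boxes -> new tracks
    return finished + active
```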
Video Semantic Segmentation
Input: video
Output: per-frame semantic segmentation
Example from ICNet (ECCV'18): real-time, high-resolution, single-view based.
Optical flow is often adopted in video segmentation tasks [3]; a feature-warping sketch follows.
[3] Zhu, Xizhou, et al. "Deep feature flow for video recognition." CVPR, 2017.
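To make the flow idea concrete, here is a minimal PyTorch sketch (my own, not code from [3]) of the feature-warping step behind deep feature flow: expensive backbone features are computed once on a key frame and propagated to other frames with optical flow:

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp key-frame features to the current frame with optical flow.

    feat: (N, C, H, W) features computed once on the key frame
    flow: (N, 2, H, W) flow from the current frame back to the key frame,
          in pixel units (x-displacement first, then y)
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()  # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow            # sampling positions in key frame
    # normalize to [-1, 1] as required by grid_sample
    x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)           # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```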
Video Object Segmentation
Generic objects.
Semi-supervised:
Input: video + segment(s) in the first frame
Output: corresponding segment(s) in the remaining frames
Unsupervised:
Input: video
Output: segments of the objects that "consistently appear throughout the entire video and have predominant motion" [4]
Example from the DAVIS dataset (few instances per frame, no ID consistency).
[4] Caelles, Sergi, et al. "The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation." arXiv preprint, 2019.
Video Instance Segmentation
Input: video + the object categories present in the video
Output: label + ID + segments of all targeted objects
Example from the MOTS dataset.
Papers to Be Discussed
MaskTrack (single-object segmentation, semi-supervised)
FEELVOS (multi-object segmentation, semi-supervised)
Mask Track R-CNN & Track R-CNN (instance segmentation)
MaskTrack
Ref.: Perazzi, Federico, et al. "Learning video object segmentation from static images." CVPR, 2017.
Setting: single-object segmentation, semi-supervised
Main properties:
offline learning on a static-image dataset
online fine-tuning on the video sequence
good inclusivity (works with several kinds of input cues)
Offline Learning
Key idea: cast video object segmentation as mask refinement.
Assumption: small movement between adjacent frames, so the previous-frame mask is a rough estimate of the current one.
The rough estimate is simulated by deforming the GT label with:
Affine transformation (scaling + translation)
Non-rigid deformation
Dilation operation
Advantage: overcomes the data limitation (DAVIS only has 3,440 annotated images) by training on static-image datasets. A sketch of the deformation follows.
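A rough sketch of how a clean GT mask can be degraded into the coarse "previous-frame estimate" fed to the network (illustrative parameters; the paper's non-rigid, thin-plate-style warp is omitted here):

```python
import numpy as np
import cv2

def coarsen_mask(mask, max_shift=0.1, max_scale=0.1, dilate_px=5):
    """Simulate a rough previous-frame mask from a GT mask.

    mask: (H, W) binary uint8 mask; returns a shifted, scaled, dilated mask.
    """
    h, w = mask.shape
    s = 1.0 + np.random.uniform(-max_scale, max_scale)  # random scaling
    tx = np.random.uniform(-max_shift, max_shift) * w   # random translation
    ty = np.random.uniform(-max_shift, max_shift) * h
    M = np.float32([[s, 0, tx], [0, s, ty]])
    rough = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    rough = cv2.dilate(rough, np.ones((dilate_px, dilate_px), np.uint8))
    return rough
```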
Online Fine-Tuning
Method: apply affine and non-rigid deformations, flipping, and rotation to the given first-frame mask, gaining ~1,000 training samples for per-video fine-tuning.
Pro: closes the domain gap between the static training images and the test video.
Con: slow inference (12 s/frame).
Good Inclusivity
Box annotation: a bbox can replace the segmentation mask in the first frame by adding a bbox-to-segmentation ConvNet.
Optical flow: use the object flow magnitude as an additional input, pass it through the same network, and average the two outputs (from the RGB and flow inputs); a fusion sketch follows.
CRF: applied as post-processing to sharpen the edges.
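A minimal sketch of the RGB/flow fusion just described; `net` stands for the refinement ConvNet and all names are illustrative:

```python
import torch

def fused_prediction(net, rgb_in, flowmag_in):
    """Run the same network on the RGB input and on the flow-magnitude
    input, then average the two foreground probability maps (CRF
    post-processing would follow)."""
    p_rgb = torch.sigmoid(net(rgb_in))       # (N, 1, H, W)
    p_flow = torch.sigmoid(net(flowmag_in))  # same weights, flow input
    return 0.5 * (p_rgb + p_flow)
```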
Ablation Study
Ablation Study
The full method performs best under all challenge attributes except camera shake.
The more annotated segmentation frames provided, the better the performance.
Summary
Insights:
Enlarge the training set with offline learning
Close the domain gap with online fine-tuning
Optical flow and CRF help
Limitations:
Single object: cannot handle occlusion or association
Semi-supervised and mask-propagation based: cannot handle instances that first appear in a middle frame or that go out of view
Relies heavily on the last frame's prediction
Online fine-tuning makes processing slow
Fig. 3: the change of J-mean values over the length of the video sequences (from YouTube-VOS [4])
[4] Xu, Ning, et al. "YouTube-VOS: Sequence-to-sequence video object segmentation." ECCV, 2018.
FEELVOS
Ref.: Voigtlaender, Paul, et al. "FEELVOS: Fast end-to-end embedding learning for video object segmentation." CVPR, 2019.
Setting: multi-object segmentation, semi-supervised; uses only one ConvNet, no additional cues, and no first-frame fine-tuning
Main properties:
Feature embedding
Global matching + local matching
Dynamic segmentation head
Feature Embedding
Intuition: pixels of the same object have similar features; pixels of different objects have different features. Tracking which object a pixel belongs to thus reduces to matching features under a feature distance.
System Overview
Note: the authors claim that using the feature-matching result as a soft feature works better than treating it as hard, determinate evidence.
In training, they randomly choose one reference frame and two adjacent frames, and apply the loss only on the last frame.
Feature Matching
Global matching: match each current-frame pixel p to the first-frame pixels q.
Local matching: match each current-frame pixel p to the previous-frame pixels q, restricted to a neighborhood of p for computational efficiency.
Matching visualization (figure) and the distance function are shown on the slide; a sketch of both follows.
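FEELVOS turns squared embedding distances into a bounded score, d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)). Below is a sketch of global matching with that function (no local window and no tiling for memory efficiency; names are mine):

```python
import torch

def pixel_distance(e_p, e_q):
    """FEELVOS distance d(p, q) = 1 - 2 / (1 + exp(||e_p - e_q||^2)), in [0, 1)."""
    sq_dist = ((e_p - e_q) ** 2).sum(dim=-1)
    return 1.0 - 2.0 / (1.0 + torch.exp(sq_dist))

def global_matching(cur_emb, obj_emb):
    """Distance map from current-frame pixels to one object's first-frame pixels.

    cur_emb: (H*W, D) current-frame embeddings
    obj_emb: (M, D) first-frame embeddings already filtered by the object mask
    Returns an (H*W,) map: the min distance per pixel, used as a soft cue.
    Local matching is the same computation against the previous frame,
    restricted to a spatial neighborhood of each pixel.
    """
    d = pixel_distance(cur_emb.unsqueeze(1), obj_emb.unsqueeze(0))  # (H*W, M)
    return d.min(dim=1).values
```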
Dynamic Segmentation Head
Handles the variation in the number of objects: the same segmentation head is applied once per object (with that object's matching cues), and the per-object outputs are combined.
Experiment
J: measures the mask IoU; F: measures the contour alignment;
J&F: mean of J and F; t: runtime in s/img
Experiment
FF: first frame; PF: previous frame
GM: global matching; LM: local matching; PFP: previous-frame prediction
Summary
Insight: association can be done by matching embedding features
Limitations:
relies on the quality of the first-frame annotation
cannot handle instances that first appear in a middle frame
MOTS
Ref.: Voigtlaender, Paul, et al. "MOTS: Multi-object tracking and segmentation." CVPR, 2019.
Main contributions:
A large video instance segmentation dataset
Evaluation metrics
A baseline network: Track R-CNN
MOTS Dataset
Annotation method: an iterative loop of manual annotation of samples, CNN-generated masks, and human correction.
Mask R-CNN
Evaluation Metrics
Intuition: encourage good mask IoU; punish missing instances and ID switches.
Refer to the paper for details; a hedged reconstruction of the headline metric follows.
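A hedged reconstruction of the headline metric (notation mine; see the paper for the authoritative definition). sMOTSA softens MOTSA's true-positive count with mask IoUs:

\[
\mathrm{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|},
\qquad
\widetilde{TP} = \sum_{h \in TP} \mathrm{IoU}\big(h, c(h)\big)
\]

where M is the set of ground-truth masks, FP the false positives, IDS the ID switches, and c(h) the ground-truth mask matched to hypothesis h.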
Mask R-CNN
[5] He, Kaiming, et al. "Mask R-CNN." ICCV, 2017.
Track R-CNN
Idea: add a tracking head on top of Mask R-CNN and incorporate multi-frame features into the feature extraction.
In training, they use 8 adjacent frames; the inference-time implementation is not described in the paper.
Tracking Head
Two fully connected layers produce a 128-d feature vector associated with each instance, trained with a feature embedding loss.
ID association: assign each current instance to the most similar instance in the previous β frames, using the Hungarian algorithm w.r.t. L1 distance and accepting only matches with L1 distance < δ; unassigned high-confidence instances are set as new instances. A sketch follows.
Experiment
maskprop: link Mask R-CNN results across frames with optical flow
box orig + MG: track with bboxes first, then segment inside each bbox
ours + MG: trained as ours, but masks come from Mask R-CNN's mask head via the maskprop mechanism
Even with GT bboxes, segmentation is not an easy task.
Experiment
ConvLSTM does not seem to help much.
Video Instance Segmentation
Ref.: Yang, Linjie, Yuchen Fan, and Ning Xu. "Video instance segmentation." ICCV, 2019.
Main contributions:
A large video instance segmentation dataset
Evaluation metrics
A baseline network: Mask Track R-CNN
Dataset: YouTube-VIS
Manually labeled.
Compared to MOTS: more videos (2,883 vs 25), more categories (40 vs 2), more instances (4,883 vs 977).
Mask Track R-CNN
Differences of the tracking head from Track R-CNN:
Segmentation and detection features come from a single view
A memory queue is used, and the stored feature is updated every frame when the same instance reappears
At test time, embedding feature similarity, semantic consistency, spatial correlation, and detection confidence are combined as association cues
Speaker's note: the every-frame feature update may not be a good idea; if mis-grouping happens, the stored feature gets contaminated (Fig. 4, last row). It is a trade-off between speed and performance.
Tracking Head
Define the embedding-feature-based association probability and a cross-entropy tracking loss (a hedged reconstruction follows).
ID association: v is the score for assigning instance i to stored ID n (n = 0 means a new ID); s is the classification score, b the bbox, c the classification category. Instances are associated according to the highest score, and instances within the same frame are never associated with each other.
One Problem of This Mechanism
It is not robust to intermediate mistakes: once an error (mis-grouping) happens, the stored feature gets contaminated and is hard to recover. This is the trade-off between speed and performance.
Experiments
AP: averaged over multiple intersection-over-union (IoU) thresholds.
AR: defined as the maximum recall given a fixed number of segmented instances per video.
Experiments
Image oracle (given GT bbox, segmentation, and category): helps a lot (the main future direction), but association is still not easy.
ID oracle (given GT association): helps a little, so there is little room for improvement from only modifying the current tracking mechanism.
Bbox and category consistency play an important role.
Summary
Insights:
use Mask R-CNN as the backbone for instance segmentation
use embedding features (and other cues) for association
Limitations:
temporal information does not seem to be well exploited in the CNN (e.g., via deep feature flow)
geometric cues are unused, e.g., relative depth and motion parallax
Appendix: J&F in DAVIS
J metric: region similarity (mask IoU). F metric: contour accuracy. J&F: their mean. Standard definitions are given below.
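Standard definitions, consistent with the DAVIS benchmark (M: predicted mask, G: ground-truth mask, P_c and R_c: contour precision and recall):

\[
\mathcal{J} = \frac{|M \cap G|}{|M \cup G|},
\qquad
\mathcal{F} = \frac{2 P_c R_c}{P_c + R_c},
\qquad
\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}
\]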
Appendix: Hungarian Algorithm
A good worked example is linked in the original slides; a minimal SciPy usage example is given after the CRF note below.
Appendix: CRF
An intuitive explanation is linked in the original slides (on Zhihu).
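A minimal worked example of the Hungarian algorithm via SciPy's linear_sum_assignment, as used for ID association earlier:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
print(rows, cols)                         # [0 1 2] [1 0 2]: row i -> col cols[i]
print(cost[rows, cols].sum())             # minimal total cost: 5
```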