Reminds me of when people wanted to use the Raspberry Pi for ZoneMinder, but it turned out to be too slow for the traditional Motion application. They turned to the hardware H.264 encoder, which produces nice motion vectors, and got motion detection for free by leveraging the encoder: https://billw2.github.io/pikrellcam/pikrellcam.html
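For anyone curious what "motion detection from the encoder's vectors" amounts to, here's a minimal sketch (not pikrellcam's actual code): given the per-macroblock motion-vector field that a hardware H.264 encoder emits as a side channel, you threshold the vector magnitudes and count how many blocks exceed it.

    import numpy as np

    def detect_motion(mvx, mvy, mag_thresh=2.0, min_blocks=10):
        """Flag motion from a per-macroblock motion-vector field.

        mvx, mvy: 2D arrays of x/y vector components, one entry per
        macroblock (the kind of data the Pi's H.264 encoder exposes).
        """
        magnitude = np.hypot(mvx, mvy)             # per-block motion strength
        moving = magnitude > mag_thresh            # blocks with significant motion
        return moving.sum() >= min_blocks, moving  # (motion?, boolean block mask)

    # Example: a synthetic 30x40 macroblock grid with a small patch moving right
    mvx = np.zeros((30, 40)); mvy = np.zeros((30, 40))
    mvx[10:14, 20:25] = 5.0
    triggered, mask = detect_motion(mvx, mvy)
    print(triggered, mask.sum())                   # True, 20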
Amazing! I always thought it must be a huge disadvantage that computer vision only has one frame to work with (and only one camera vs. two eyes). Often as a human you stare at a scene for a while when trying to spot something tricky, like a bird in the trees, and then find it when it moves. Or you move your head side to side to create movement. It would be interesting to see a comparison with what the best single-frame segmentation could detect in one of these scenes.
If you check Table 3, you can see the comparison with some of the top unsupervised video segmentation models trained with supervised learning, e.g. COSNet and MATNet. They perform reasonably well on MoCA, but they were all trained with massive manual segmentation annotations, which is typically not scalable.
The proposed self-supervised approach is comparable to those top methods, even without using RGB or any manual annotations.
Related concept: event camera[0]. Instead of capturing light intensities at every pixel of the sensor every frame, every pixel captures changes in intensity as they occur.
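As a rough sketch of the idea (an emulation from ordinary frames, not how the sensor hardware actually works): each pixel keeps a reference log-intensity and emits a +1/-1 event whenever the current log-intensity drifts past a contrast threshold.

    import numpy as np

    def frames_to_events(frames, contrast=0.15):
        """Crude event-camera emulation from a stack of grayscale frames.

        frames: array of shape (T, H, W), values in [0, 1].
        Returns a list of (t, y, x, polarity) tuples.
        """
        log_ref = np.log(frames[0] + 1e-3)            # per-pixel reference level
        events = []
        for t in range(1, len(frames)):
            log_cur = np.log(frames[t] + 1e-3)
            diff = log_cur - log_ref
            fired = np.abs(diff) >= contrast          # pixels whose change crossed the threshold
            ys, xs = np.nonzero(fired)
            for y, x in zip(ys, xs):
                events.append((t, y, x, int(np.sign(diff[y, x]))))
            log_ref[fired] = log_cur[fired]           # reset reference only where events fired
        return events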
Does anyone in this field of "flow"-based segmentation know how the performance would be if there were multiple moving objects in the same scene? The examples in this paper seem limited to one moving object (a cluster of moving pixels) as opposed to a relatively static background.
In terms of accuracy, the authors mention it as a limitation, so it could well be a problem.
In terms of runtime, it should not matter. Generally speaking, though, the overhead of optical flow is often overlooked. For video DL applications, optical flow calculation often takes more time than inference itself. For academic purposes, datasets are often preprocessed ahead of time and the optical flow runtime is not reported. Doing real-time video analysis with optical flow is quite impractical, though.
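To make the overhead concrete, here's a quick timing sketch using OpenCV's classical dense Farneback flow (learned flow networks like RAFT are heavier still); the numbers obviously vary with resolution and hardware.

    import time
    import cv2
    import numpy as np

    # Two synthetic 720p grayscale frames stand in for consecutive video frames.
    prev = np.random.randint(0, 255, (720, 1280), dtype=np.uint8)
    curr = np.roll(prev, 4, axis=1)  # shift a few pixels to simulate motion

    start = time.perf_counter()
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    print(f"dense flow for one frame pair: {time.perf_counter() - start:.3f}s, "
          f"flow shape {flow.shape}")  # (720, 1280, 2)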
It's an active field of research. Optical flow algorithms are typically used to create vector fields and have shown promise for fluid flows (https://www.mdpi.com/2072-4292/13/4/690/htm). However, a deep learning version of this on real-life data has yet to be accomplished. If anyone reading this knows otherwise, plz link :D
I'm not sure what using the vectors for fluids has to do with this or what the parent asked. Fluids can use vectors, optical flow can produce vectors, but that is about it.
This person asked about clustering and segmenting into more than two separate groups.
Indeed, the only difference is that they work in RGB space, and the dataset is a bit toy-ish (no offence), as the networks simply need to separate the objects either by color or by a regular texture pattern.
What is proposed in this motion grouping paper is more at the idea level: objects in natural videos or images have very complicated textures, and there is no reason a network could group those pixels together if no supervision is provided.
However, in motion space, pixels moving together form a homogeneous field, and luckily, from psychology, we know that the parts of an object tend to move together.
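The paper itself uses a learned slot-attention-style model, but the underlying "common fate" intuition can be seen with something much dumber: cluster the per-pixel flow vectors, and the pixels that move together fall out as a group. A toy sketch, not the paper's method:

    import cv2
    import numpy as np

    def group_by_motion(flow, k=2):
        """Toy 'common fate' grouping: k-means on per-pixel flow vectors.

        flow: (H, W, 2) dense optical flow. Returns an (H, W) label map.
        Nothing like the paper's learned model -- just the raw intuition.
        """
        h, w, _ = flow.shape
        samples = flow.reshape(-1, 2).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.5)
        _, labels, _ = cv2.kmeans(samples, k, None, criteria, 5,
                                  cv2.KMEANS_PP_CENTERS)
        return labels.reshape(h, w)  # one cluster is "object", the other "background"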
Fantastic. The ability to train without laborious labelling/annotations can really help produce effective models appropriate for real-world use cases.
*Edit: It will be interesting to see if this transfers from video to images as well. Much of the current video AI work stems from the image foundation, i.e. split a video into its constituent frames and run detection models on those images. But image detection/segmentation models treat each image as unrelated to the next, so parsing video this way is unnecessarily complex: sequential video frames in the same scene are more alike than different.
If good segmentation models for video can be more easily trained using this method, it would be interesting to see whether they can also be applied accurately to still images, since a snapshot of a video is a single image anyway.
I thought that MP4 did something like this, but then I suppose (after working through that great image seam-carving article) that MP4 is doing inter-frame diff chunking. Hmm. So it "knows" that the pixels are grouped into regions that move or not -- but doesn't care. You can't ask an MP4 video codec to recognize objects.
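Right -- the encoder computes block motion vectors purely to minimize residual bits, so "objectness" never enters into it. You can still tap those vectors instead of recomputing flow. The sketch below assumes a PyAV/FFmpeg build that exposes them as frame side data when the export_mvs flag is set; the exact attribute names may differ across versions, so treat it as an outline rather than a guaranteed API.

    import av  # PyAV bindings for FFmpeg; assumes export_mvs support

    def iter_codec_motion_vectors(path):
        """Yield per-frame macroblock motion vectors straight from the decoder.

        Assumption: the decoder exports them as 'MOTION_VECTORS' side data
        when the export_mvs flag is set -- check your PyAV version's docs.
        """
        with av.open(path) as container:
            stream = container.streams.video[0]
            stream.codec_context.options = {"flags2": "+export_mvs"}
            for frame in container.decode(stream):
                mvs = frame.side_data.get("MOTION_VECTORS")
                if mvs is not None:
                    # structured array with motion_x, motion_y, block positions, ...
                    yield frame.pts, mvs.to_ndarray()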
Wow! The title is perfectly succinct. I have never heard of this concept before and upon reading the title realized that many powerful techniques will be based upon it. Perhaps several already are.
It seems such an “obvious” solution, but I can’t say that it would have ever occurred to me.
I don’t work in ML, but fascinating little gems like this are what keeps me coming back to HN
I’ve been looking (preliminarily) recently at using movement data to improve object detection performance on video frames.
The primary challenge I can think of is that the different network structure required makes it difficult to transfer-learn from a feature-extraction backbone trained on ImageNet, etc.
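One common workaround (not specific to this paper) is to keep the ImageNet backbone and just widen its first conv to accept the extra flow channels, initializing the new channels from the mean of the pretrained RGB filters. A rough PyTorch sketch, assuming RGB plus a 2-channel flow map as input:

    import torch
    import torch.nn as nn
    import torchvision

    def rgb_flow_backbone(extra_channels=2):
        """ResNet-18 widened from 3 to 3 + extra_channels input channels.

        Pretrained RGB filters are kept; the new flow channels start from the
        mean of the RGB filters so features aren't destroyed at initialization.
        """
        # older torchvision versions use pretrained=True instead of weights=...
        model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        old = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        new = nn.Conv2d(3 + extra_channels, old.out_channels,
                        kernel_size=old.kernel_size, stride=old.stride,
                        padding=old.padding, bias=False)
        with torch.no_grad():
            new.weight[:, :3] = old.weight                            # copy RGB filters
            new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # init flow channels
        model.conv1 = new
        return model

    # Usage: x = torch.cat([rgb, flow], dim=1), shape (N, 5, H, W)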
This looks interesting. Are there any other good works in this area?