Reminds me of when people wanted to use the Raspberry Pi for ZoneMinder, but it turned out to be too slow for the traditional Motion application. They turned to the hardware H.264 encoder, which produces nice motion vectors, and got motion detection for free by leveraging the encoder: https://billw2.github.io/pikrellcam/pikrellcam.html
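For anyone curious what "motion detection from the encoder's vectors" amounts to, here's a minimal sketch (not pikrellcam's actual code): given the per-macroblock motion-vector field that a hardware H.264 encoder emits as a side channel, you threshold the vector magnitudes and count how many blocks exceed it.

    import numpy as np

    def detect_motion(mvx, mvy, mag_thresh=2.0, min_blocks=10):
        """Flag motion from a per-macroblock motion-vector field.

        mvx, mvy: 2D arrays of x/y vector components, one entry per
        macroblock (the kind of data the Pi's H.264 encoder exposes).
        """
        magnitude = np.hypot(mvx, mvy)             # per-block motion strength
        moving = magnitude > mag_thresh            # blocks with significant motion
        return moving.sum() >= min_blocks, moving  # (motion?, boolean block mask)

    # Example: a synthetic 30x40 macroblock grid with a small patch moving right
    mvx = np.zeros((30, 40)); mvy = np.zeros((30, 40))
    mvx[10:14, 20:25] = 5.0
    triggered, mask = detect_motion(mvx, mvy)
    print(triggered, mask.sum())                   # True, 20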
Amazing! I always thought it must be a huge disadvantage that computer vision only has one frame to work with (and only one camera vs. two eyes). Often as a human you stare at a scene for a while when trying to spot something tricky, like a bird in the trees, and then find it when it moves. Or you move your head side to side to create movement. It would be interesting to see a comparison with what the best single-frame segmentation could detect in one of these scenes.
If you check Table 3, you can see the comparison with some of the top unsupervised video segmentation models trained with supervised learning, e.g. COSNet and MATNet. They perform reasonably well on MoCA, but they were all trained with massive manual segmentation annotations, which is typically not scalable.
The proposed self-supervised approach is comparable to those top methods, even without using RGB or any manual annotations.
Related concept: event camera[0]. Instead of capturing light intensities at every pixel of the sensor every frame, every pixel captures changes in intensity as they occur.
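As a rough sketch of the idea (an emulation from ordinary frames, not how the sensor hardware actually works): each pixel keeps a reference log-intensity and emits a +1/-1 event whenever the current log-intensity drifts past a contrast threshold.

    import numpy as np

    def frames_to_events(frames, contrast=0.15):
        """Crude event-camera emulation from a stack of grayscale frames.

        frames: array of shape (T, H, W), values in [0, 1].
        Returns a list of (t, y, x, polarity) tuples.
        """
        log_ref = np.log(frames[0] + 1e-3)            # per-pixel reference level
        events = []
        for t in range(1, len(frames)):
            log_cur = np.log(frames[t] + 1e-3)
            diff = log_cur - log_ref
            fired = np.abs(diff) >= contrast          # pixels whose change crossed the threshold
            ys, xs = np.nonzero(fired)
            for y, x in zip(ys, xs):
                events.append((t, y, x, int(np.sign(diff[y, x]))))
            log_ref[fired] = log_cur[fired]           # reset reference only where events fired
        return events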
Does anyone in this field of "flow"-based segmentation know how the performance would be if there were multiple moving objects in the same scene? The examples in this paper seem limited to one moving object (a cluster of moving pixels) as opposed to a relatively static background.
In terms of accuracy, the authors mention it as a limitation, so it could well be a problem.
In terms of runtime, it should not matter. Generally speaking, though, the overhead of optical flow is often overlooked. For video DL applications, optical flow calculation often takes more time than inference itself. For academic purposes, datasets are often preprocessed ahead of time and the optical flow runtime is not reported. Doing real-time video analysis with optical flow is quite impractical, though.
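To make the overhead concrete, here's a quick timing sketch using OpenCV's classical dense Farneback flow (learned flow networks like RAFT are heavier still); the numbers obviously vary with resolution and hardware.

    import time
    import cv2
    import numpy as np

    # Two synthetic 720p grayscale frames stand in for consecutive video frames.
    prev = np.random.randint(0, 255, (720, 1280), dtype=np.uint8)
    curr = np.roll(prev, 4, axis=1)  # shift a few pixels to simulate motion

    start = time.perf_counter()
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    print(f"dense flow for one frame pair: {time.perf_counter() - start:.3f}s, "
          f"flow shape {flow.shape}")  # (720, 1280, 2)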
It's an active field of research. Optical flow algorithms are typically used to create vector fields and have shown promise for fluid flows (https://www.mdpi.com/2072-4292/13/4/690/htm). However, a deep learning version of this on real-life data has yet to be accomplished. If anyone reading this knows otherwise, plz link :D
I'm not sure what using the vectors for fluids has to do with this or what the parent asked. Fluids can use vectors, optical flow can produce vectors, but that is about it.
This person asked about clustering and segmenting into more than two separate groups.
Indeed, the only difference is that they work in RGB space, and the dataset is a bit toy-ish (no offence), as the networks simply need to separate the objects either by color or by a regular texture pattern.
What is proposed in this motion grouping paper is more at the idea level: objects in natural videos or images have very complicated textures, and there is no reason a network could group those pixels together if no supervision is provided.
However, in motion space, pixels moving together form a homogeneous field, and luckily, from psychology, we know that the parts of an object tend to move together.
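The paper itself uses a learned slot-attention-style model, but the underlying "common fate" intuition can be seen with something much dumber: cluster the per-pixel flow vectors, and the pixels that move together fall out as a group. A toy sketch, not the paper's method:

    import cv2
    import numpy as np

    def group_by_motion(flow, k=2):
        """Toy 'common fate' grouping: k-means on per-pixel flow vectors.

        flow: (H, W, 2) dense optical flow. Returns an (H, W) label map.
        Nothing like the paper's learned model -- just the raw intuition.
        """
        h, w, _ = flow.shape
        samples = flow.reshape(-1, 2).astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.5)
        _, labels, _ = cv2.kmeans(samples, k, None, criteria, 5,
                                  cv2.KMEANS_PP_CENTERS)
        return labels.reshape(h, w)  # one cluster is "object", the other "background"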
Fantastic. The ability to train without laborious labelling/annotations can really help produce effective models appropriate for real-world use cases.
*Edit: It will be interesting to see if this transfers from video to images as well. Much of the current video AI work stems from the image foundation, i.e. split a video into its constituent frames and run detection models on those images. But image detection/segmentation models treat each image as unrelated to the next, so parsing video this way is unnecessarily complex: sequential video frames in the same scene are more alike than different.
If good segmentation models for video can be more easily trained using this method, it would be interesting to see whether they can also be applied accurately to still images, since a snapshot of a video is a single image anyway.
I thought that MP4 did something like this, but then I suppose (after working through that great image seam-carving article) that MP4 is doing inter-frame diff chunking. Hmm. So it "knows" that the pixels are grouped into regions that move or not -- but doesn't care. You can't ask an MP4 video codec to recognize objects.
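Right -- the encoder computes block motion vectors purely to minimize residual bits, so "objectness" never enters into it. You can still tap those vectors instead of recomputing flow. The sketch below assumes a PyAV/FFmpeg build that exposes them as frame side data when the export_mvs flag is set; the exact attribute names may differ across versions, so treat it as an outline rather than a guaranteed API.

    import av  # PyAV bindings for FFmpeg; assumes export_mvs support

    def iter_codec_motion_vectors(path):
        """Yield per-frame macroblock motion vectors straight from the decoder.

        Assumption: the decoder exports them as 'MOTION_VECTORS' side data
        when the export_mvs flag is set -- check your PyAV version's docs.
        """
        with av.open(path) as container:
            stream = container.streams.video[0]
            stream.codec_context.options = {"flags2": "+export_mvs"}
            for frame in container.decode(stream):
                mvs = frame.side_data.get("MOTION_VECTORS")
                if mvs is not None:
                    # structured array with motion_x, motion_y, block positions, ...
                    yield frame.pts, mvs.to_ndarray()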
Wow! The title is perfectly succinct. I have never heard of this concept before and upon reading the title realized that many powerful techniques will be based upon it. Perhaps several already are.
It seems such an “obvious” solution, but I can’t say that it would have ever occurred to me.
I don’t work in ML, but fascinating little gems like this are what keeps me coming back to HN
I’ve been looking (preliminarily) recently at using movement data to improve object detection performance on video frames.
The primary challenge I can think of is that the different network structure required makes it difficult to transfer-learn from a feature-extraction backbone trained on ImageNet, etc.
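One common workaround (not specific to this paper) is to keep the ImageNet backbone and just widen its first conv to accept the extra flow channels, initializing the new channels from the mean of the pretrained RGB filters. A rough PyTorch sketch, assuming RGB plus a 2-channel flow map as input:

    import torch
    import torch.nn as nn
    import torchvision

    def rgb_flow_backbone(extra_channels=2):
        """ResNet-18 widened from 3 to 3 + extra_channels input channels.

        Pretrained RGB filters are kept; the new flow channels start from the
        mean of the RGB filters so features aren't destroyed at initialization.
        """
        # older torchvision versions use pretrained=True instead of weights=...
        model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        old = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        new = nn.Conv2d(3 + extra_channels, old.out_channels,
                        kernel_size=old.kernel_size, stride=old.stride,
                        padding=old.padding, bias=False)
        with torch.no_grad():
            new.weight[:, :3] = old.weight                            # copy RGB filters
            new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # init flow channels
        model.conv1 = new
        return model

    # Usage: x = torch.cat([rgb, flow], dim=1), shape (N, 5, H, W)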
This looks interesting. Are there any other good works in this area?