How can iterative exchange between deep learning and spatiotemporal spectral clustering be used for unsupervised video segmentation?

Text|Qianbufan

Editor|Qianbufan

Introduction

Discovering objects without human supervision as they move and change appearance in space and time is one of the most challenging unsolved problems in computer vision. How can we best use the correlation between object motion and appearance to model the object discovery process mathematically, without human supervision?

Learning more efficiently from the vast amounts of data available in space and time, with minimal human intervention, is a central goal: visual grouping tasks are natural for humans but demanding for machines. In the context of unsupervised video segmentation, two fields are relevant: deep learning, with its strong supervised learning ability, and iterative graph algorithms, with their proven advantages in unsupervised clustering.

We introduce a method that automatically segments the main object of a video sequence in an unsupervised setting. While general 3D convolution-based approaches treat the temporal dimension as equivalent to the spatial dimensions, we propose a different way to couple motion and appearance.

Objects in the real world form clusters in their space-time neighborhoods: points belonging to the same object remain connected in space and time and share appearance and motion patterns that differ from those of the rest of the scene.

Scientific background

Video object segmentation is a rapidly evolving area of computer vision. Most solutions are essentially supervised, because they rely heavily on models pre-trained with human-labeled annotations. Although manual labeling is extremely expensive, there are few truly unsupervised methods.

Existing unsupervised methods exploit different heuristics and intrinsic properties of video at multiple scales to segment objects. Unlike that work, which uses embeddings pre-trained for saliency prediction, tracking, geometric transformation estimation, or video summarization, our approach bridges the gap between classical iterative graph algorithms and deep learning, leveraging the strengths of both to achieve self-supervision.

Figure 1 Architecture of our Iterative Knowledge Exchange (IKE) system. The graph module (left) and the network module (right) exchange information over multiple cycles until convergence.

Figure 2

A visual representation of the space-time graph structure, illustrating how the long-range edges that define the graph are created. The colored curves represent motion chains, formed forward and backward in time by following the optical flow vectors from one frame to the next. The black dashed curves correspond to graph edges, defined between nodes connected by at least one motion chain.

Figure 3

Collecting node features along the motion chains: for a node $j$, the features that make up its feature vector $\mathbf{f}_j$ are collected along its two outgoing motion chains (one forward and one backward), from the different features of the pixels associated with the nodes met along the chains.

Two key elements set our approach apart:

(1) We propose a compact mathematical model that couples motion and appearance, defining the main object in the video as the principal natural spectral cluster of our Feature-Motion matrix.

Figure 4

(2) Our space-time clusters are dense at the pixel level, so we can use all the information in the video without losing detail to early hard grouping decisions (e.g., computing superpixels).

Figure 5

Method

We propose a dual iterative knowledge exchange model that combines spatiotemporal spectral clustering with deep object segmentation and can learn without any human annotation. The graph module exploits the spatiotemporal consistency inherent in the video sequence, but does not have access to deep features.

The network module complements the graph module: with its powerful representation capabilities, it adds deep features to the clustering algorithm, and it learns to predict the output of the spatiotemporal clustering process from a single input frame.

Figure 6

Graph module

Given a sequence of $m$ video frames, the graph module finds the main object as the strongest natural cluster in the space-time graph and extracts a set of $m$ soft segmentation masks, one per frame, corresponding to the main object.

Space-time graph

We define the space-time graph $G = (V, E)$ with one node $a \in V$ for each pixel of the video, so that $|V| = n$, where $n = m \cdot h \cdot w$, $m$ is the number of frames, and $(h, w)$ is the frame size. $G$ is an undirected graph whose edge set is defined by the motion chains (Figure 2).
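As an illustration of the node layout and of how a motion chain links nodes across frames, here is a minimal Python sketch. The row-major `node_id` indexing, the `forward_chain` helper, the `(dx, dy)` flow layout, rounding to the nearest pixel, and stopping when a chain leaves the frame are all assumptions made for this example, not details given in the text.

```python
import numpy as np

def node_id(t, y, x, h, w):
    """Map a pixel (frame t, row y, column x) to a graph node index (assumed row-major)."""
    return (t * h + y) * w + x

def forward_chain(flows, t0, y0, x0):
    """Follow optical flow forward in time from pixel (y0, x0) of frame t0.

    flows[t] has shape (h, w, 2) and holds the assumed (dx, dy) displacement
    from frame t to frame t + 1. Returns the node ids visited by the chain.
    """
    h, w = flows[0].shape[:2]
    chain = [node_id(t0, y0, x0, h, w)]
    y, x = float(y0), float(x0)
    for t in range(t0, len(flows)):
        dx, dy = flows[t][int(round(y)), int(round(x))]
        x, y = x + dx, y + dy
        yi, xi = int(round(y)), int(round(x))
        if not (0 <= yi < h and 0 <= xi < w):
            break  # assumed stopping rule: the chain ends when it leaves the frame
        chain.append(node_id(t + 1, yi, xi, h, w))
    return chain

# Toy video: m = 4 frames of size 3 x 5 in which every pixel moves one column right.
m, h, w = 4, 3, 5
flows = [np.tile(np.array([1.0, 0.0]), (h, w, 1)) for _ in range(m - 1)]
chain = forward_chain(flows, t0=0, y0=1, x0=0)
print(chain)  # [5, 21, 37, 53]
```

Under the definition above, any two nodes that appear on the same chain would be joined by an edge of the graph. Backward chains could be built the same way from a backward flow field.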

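In the space-time graph, each node $a$ has an associated node-level feature vector $\mathbf{f}_a \in \mathbb{R}^{1 \times d}$, where $d$ denotes the number of features per node. Starting from the pixel associated with the node, the features are collected along the outgoing motion chains, from all the pixels that the chains pass through. Stacking the feature vectors of all nodes gives the feature matrix $F$.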

Spectral clustering formulation

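We define $P = F(F^\top F)^{-1}F^\top$ as the projection matrix that projects any vector onto the column space of the feature matrix $F$. The constraint on the vector $x$, namely that it should be a linear combination of the columns of $F$, can then be enforced by requiring $x = Px$. The quantity to maximize is $S = x^\top M x$, where $M$ is the matrix that encodes the motion-chain connectivity between nodes.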

The optimal solution $x^*$ that maximizes $x^\top M x$ under the constraints $x = Px$ and $\|x\|_2 = 1$ also maximizes $x^\top P M P x$ under the constraint $\|x\|_2 = 1$.

Proof sketch: since $x^*$ maximizes $x^\top M x$ under the constraints $x = Px$ and $\|x\|_2 = 1$, it also maximizes $(Px)^\top M P x$. As $P = P^\top$, it follows that $x^*$ maximizes $x^\top P M P x$ under the considered constraints.
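The equivalence stated in the proposition can be checked numerically on a toy problem. The sketch below uses random stand-ins for the feature matrix and the motion matrix (the sizes, the random data, and the ones column in `F` are assumptions chosen only to give a well-separated leading eigenvalue). It solves the constrained problem in an orthonormal basis of the column space of `F` and compares the result with the principal eigenvector of `P M P`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 4                      # toy sizes: n nodes, d features (illustrative only)
F = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # stand-in features
B = rng.random((n, n))
M = (B + B.T) / 2                 # stand-in symmetric, non-negative motion matrix

# P = F (F^T F)^{-1} F^T, computed with a linear solve instead of an explicit inverse.
P = F @ np.linalg.solve(F.T @ F, F.T)

# Constrained problem: x = Q z with Q an orthonormal basis of col(F), ||z||_2 = 1,
# so the maximizer of x^T M x is Q times the top eigenvector of Q^T M Q.
Q, _ = np.linalg.qr(F)
vals_c, vecs_c = np.linalg.eigh(Q.T @ M @ Q)
x_constrained = Q @ vecs_c[:, -1]

# Relaxed problem: maximize x^T (P M P) x with only ||x||_2 = 1.
vals_a, vecs_a = np.linalg.eigh(P @ M @ P)
x_star = vecs_a[:, -1]

print("P symmetric, idempotent:", np.allclose(P, P.T), np.allclose(P @ P, P))
print("|cosine| between the two maximizers:", abs(x_constrained @ x_star))
```

The two maximizers agree up to sign, and the solution of the relaxed problem already satisfies $x = Px$, which is what allows the constraint to be absorbed into the matrix.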

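The optimization problem can be defined as follows, where $A = PMP$ denotes the Feature-Motion matrix:

$$x^* = \arg\max_{x} \; x^\top A x \quad \text{subject to} \quad \|x\|_2 = 1$$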

Graph optimization algorithm

The principal eigenvector of the Feature-Motion matrix $A$ is the optimal solution of the problem defined above. We thus cast segmentation as a classical spectral clustering problem, which is also related to spectral methods for graph matching.

Since $A$ has non-negative elements, the Perron-Frobenius theorem implies that the optimal solution $x^*$ has positive values. Our algorithm is an efficient implementation of the power iteration method, which converges to the optimal solution $x^*$.

The main algorithmic steps of the graph module during iteration t

Propagation step

The propagation step is equivalent to each node $a$ updating its label as $x_a^{(t)} = \sum_b M_{a,b}\, x_b^{(t-1)}$.

Equivalently, each node $a$ propagates its own label to all the nodes it is connected to.

When passing through a node $b$ along a chain that starts at node $a$, we update the label of $b$ as $x_b \leftarrow x_b + M_{a,b}\, x_a$ and also the label of $a$ as $x_a \leftarrow x_a + M_{a,b}\, x_b$. In this way we jointly propagate information from all the nodes in one frame to all neighboring frames, in both the forward and backward directions.
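The equivalence between the edge-wise updates and the matrix form can be illustrated on a tiny graph. In this sketch the edge list, the weights, and the `propagate_edges` helper are hypothetical. The updates accumulate into a fresh array so that every update reads the labels of the previous iteration, which is one way (an assumption here) to match the sum over $b$ exactly.

```python
import numpy as np

def propagate_edges(edges, x):
    """One propagation step over an undirected edge list.

    edges: iterable of (a, b, weight), each undirected edge listed once, a != b.
    x:     node labels from the previous iteration, shape (n,).
    """
    new_x = np.zeros_like(x)
    for a, b, weight in edges:
        new_x[b] += weight * x[a]   # a sends its label to b
        new_x[a] += weight * x[b]   # b sends its label back to a
    return new_x

# Hypothetical 4-node graph with three weighted edges.
edges = [(0, 1, 0.5), (1, 2, 2.0), (0, 3, 1.0)]
x = np.array([1.0, 2.0, 3.0, 4.0])

# Dense symmetric matrix with the same weights, for comparison.
M = np.zeros((4, 4))
for a, b, weight in edges:
    M[a, b] = M[b, a] = weight

print(propagate_edges(edges, x))  # [5.  6.5 4.  1. ]
print(M @ x)                      # [5.  6.5 4.  1. ]
```

Working edge by edge avoids ever materializing the full $n \times n$ matrix, which matters when every pixel of the video is a node.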

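In each iteration, we then estimate the set of weights $w^*$ that best approximates the current node labels $x^{(t)}$ as a linear combination of the node-level features $F$, and we replace the labels with this approximation. The weights and the projected labels are computed as follows:

$$w^* = (F^\top F)^{-1} F^\top x^{(t)}$$

$$x^{(t)} \leftarrow F w^* = P x^{(t)}$$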
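This weight estimate is an ordinary least-squares fit of the labels on the features, so it can be sketched with a standard solver. The `project_labels` helper and the random stand-in data below are assumptions for illustration. The closed form is computed as well, to show that both routes give the same weights.

```python
import numpy as np

def project_labels(F, x):
    """Fit weights w* by least squares and return (w*, F w*)."""
    w_star, *_ = np.linalg.lstsq(F, x, rcond=None)
    return w_star, F @ w_star

rng = np.random.default_rng(1)
F = rng.normal(size=(50, 3))      # stand-in node features (n = 50, d = 3)
x = rng.random(50)                # stand-in labels after a propagation step

w_star, x_proj = project_labels(F, x)

# Same weights through the closed form w* = (F^T F)^{-1} F^T x.
w_closed = np.linalg.solve(F.T @ F, F.T @ x)
print(np.allclose(w_star, w_closed))

# The residual is orthogonal to every feature column, as expected of a projection.
print(np.abs(F.T @ (x - x_proj)).max())
```

Because only the $d \times d$ system $F^\top F$ has to be solved, the $n \times n$ matrix $P$ never needs to be formed explicitly.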

Theoretical analysis

Writing the steps of the algorithm as a single update yields a recursive relation that describes power iteration:

$$x^{(t)} = \frac{P M x^{(t-1)}}{\|P M x^{(t-1)}\|_2}$$

This means that the proposed algorithm is guaranteed to converge to the principal eigenvector of the matrix $PM$, from which it follows that $x^*$ maximizes the Rayleigh quotient $R(PM, x) = \dfrac{x^\top P M x}{x^\top x}$.

The optimal solution has unit L2 norm, $\|x^*\|_2 = 1$, and lies in the column space of $F$, meaning that $x^* = Px^*$. It follows immediately that $x^*$ also maximizes our objective $x^\top P M P x$.
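The argument can be checked numerically. After the first update the iterate lies in the column space of $F$, so $PMx = PMPx$ and the recursion behaves like power iteration on the symmetric matrix $PMP$. The sketch below uses a synthetic matrix with one planted cluster. The sizes, noise levels, features, and helper names are all assumptions made for the example.

```python
import numpy as np

def power_iteration(P, M, x0, num_iters=50):
    """Iterate x <- P M x / ||P M x||_2 and return the final x."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(num_iters):
        x = P @ (M @ x)
        x /= np.linalg.norm(x)
    return x

def rayleigh(A, x):
    """Rayleigh quotient R(A, x) = x^T A x / x^T x."""
    return (x @ A @ x) / (x @ x)

rng = np.random.default_rng(0)
n, k = 60, 20                                  # 60 nodes, the first 20 form the cluster
cluster = np.zeros(n)
cluster[:k] = 1.0

# Stand-in motion matrix: weak symmetric noise plus a strong block on the cluster.
noise = rng.random((n, n))
M = 0.05 * (noise + noise.T) / 2 + np.outer(cluster, cluster)

# Stand-in features: one noisy cluster indicator, two random columns, one constant.
F = np.column_stack([cluster + 0.1 * rng.normal(size=n),
                     rng.normal(size=(n, 2)),
                     np.ones(n)])
P = F @ np.linalg.solve(F.T @ F, F.T)

x = power_iteration(P, M, np.ones(n))

# Reference solution: principal eigenvector of the symmetric matrix P M P.
eigvals, eigvecs = np.linalg.eigh(P @ M @ P)
print("Rayleigh quotient:", rayleigh(P @ M, x), "top eigenvalue:", eigvals[-1])
print("|cosine| with principal eigenvector:", abs(x @ eigvecs[:, -1]))
```

On this toy problem the Rayleigh quotient of the iterate matches the largest eigenvalue of $PMP$, and the iterate satisfies $x = Px$ with unit norm.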

Network module

The network module (Figure 4) is a deep segmentation model that complements the space-time graph. In each cycle, the network is trained from scratch using only the output of the graph module as the supervisory signal, and its learned features are then passed back to the graph for the next clustering iteration.

The network module is trained on sample pairs $(I_i, x_i)$, where $I_i \in \mathbb{R}^{h \times w \times 3}$ is the $i$-th frame of the video sequence and $x_i \in [0,1]^{h \times w}$ is the supervisory signal for frame $i$, provided by the graph module.

The training objective penalizes errors more heavily in regions of high confidence and behaves more permissively in regions of uncertainty. In practice, we set $\lambda_1 = \lambda_2 = 0.5$ in the optimization task that the network module solves.

Convergence of optimization algorithms in practice

The segmentation process should converge to the same solution $x^*$ regardless of its initialization $x^{(0)}$: even if the initial solution is completely random, the algorithm should converge to the main object in the video. Using the manually labeled ground truth, we verify that the Feature-Motion matrix has one strong principal cluster, which indeed corresponds to the main object in the sequence.

To verify convergence to a unique solution, we examine the effect of the starting point in practice: keeping the Feature-Motion matrix fixed (it depends only on the optical flow module used, not on the initial solution $x^{(0)}$), we vary the initial starting point.
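The same kind of check can be reproduced in miniature. The sketch below keeps a synthetic matrix fixed and starts the iteration from several random soft masks. The planted-cluster data, the sizes, and the `segment` helper are assumptions for illustration and say nothing about the behavior on real videos.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 80, 25                                  # toy graph: 80 nodes, 25 on the object
obj = np.zeros(n)
obj[:k] = 1.0

# Synthetic stand-ins: weak symmetric noise plus a strong block on the object nodes.
noise = rng.random((n, n))
M = 0.05 * (noise + noise.T) / 2 + np.outer(obj, obj)
F = np.column_stack([obj + 0.1 * rng.normal(size=n),
                     rng.normal(size=(n, 2)),
                     np.ones(n)])
P = F @ np.linalg.solve(F.T @ F, F.T)

def segment(x0, num_iters=60):
    """Power iteration x <- P M x / ||P M x||_2 from the starting labels x0."""
    x = x0.copy()
    for _ in range(num_iters):
        x = P @ (M @ x)
        x /= np.linalg.norm(x)
    return x

# Five different random soft masks in [0, 1) as starting points.
solutions = [segment(rng.random(n)) for _ in range(5)]
agreement = [abs(solutions[0] @ s) for s in solutions[1:]]
print("pairwise |cosine| with the first run:", agreement)

# The converged labels should be concentrated on the planted object nodes.
mask = np.abs(solutions[0])
print("mean label on object / background:", mask[:k].mean(), mask[k:].mean())
```

When the matrix has a single dominant cluster, as assumed here, all starting points lead to the same vector up to numerical precision.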

Table 1 Performance of the unsupervised graph module (first cycle)

Unsupervised case: the effect of optical flow

Two nodes (pixels) connected by a motion chain are also connected in the graph, while nodes not connected by any motion chain are not connected in the graph. This connectivity is encoded in the matrix $M$ and carries over directly to the Feature-Motion matrix $A$, which plays the role of the adjacency matrix of the space-time graph.

Table 1 also reports a different experiment: for a given optical flow method used to build $M$, the motion structure of the graph, we concatenate the node-level feature vectors computed with two optical flow methods (RAFT and FlowNet 2.0) to construct $F$.

Figure 7

Spectral analysis of the Feature-Motion matrix

The Feature-Motion matrix $A$ is the key element of the proposed graph module. Our formulation treats segmentation as a spectral clustering problem, under the assumption that the pixels of the main object in a video sequence (when such an object exists) form a strong natural cluster in space and time.

Figure 8 The first six eigenvalues of $A$ for each configuration considered, in descending order.
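A spectrum of this kind is straightforward to compute for any symmetric matrix. The sketch below uses a synthetic stand-in for $A$ with one planted cluster (the data, the sizes, and the `top_eigenvalues` helper are assumptions) to show what a dominant cluster looks like in the leading eigenvalues.

```python
import numpy as np

def top_eigenvalues(A, k=6):
    """Return the k largest eigenvalues of a symmetric matrix, in descending order."""
    return np.linalg.eigvalsh(A)[::-1][:k]

rng = np.random.default_rng(3)
n, size = 100, 30                              # 100 nodes, 30 of them in the cluster
obj = np.zeros(n)
obj[:size] = 1.0

# Stand-in for A: weak symmetric noise plus one strong block (the planted cluster).
noise = rng.random((n, n))
A = 0.02 * (noise + noise.T) / 2 + np.outer(obj, obj)

vals = top_eigenvalues(A)
print("first six eigenvalues:", vals)
print("ratio of first to second:", vals[0] / vals[1])
```

A large gap between the first and second eigenvalues is the signature of a single strong cluster, and it is also what makes power iteration converge in few steps.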

Improvement over several graph-network cycles

We next assess the effectiveness of the iterative knowledge exchange system, in which the graph acts as the teacher of the network module and the network then provides more powerful features for the next clustering and learning cycle. In Table 3 and Figure 9, we detail how performance evolves on multiple datasets, in both the unsupervised and the supervised cases.

Table 3 Relative percentage change between cycles

Figure 9 shows the performance evolution of the system in the unsupervised case, where nodes use only optical flow features and the network module is always randomly initialized.

The unsupervised formulation of our system is the most valuable one, because the system benefits from both the clustering power of the space-time graph and the learning ability of the network, making learning possible without manual annotation at any step of the process.

Figure 9

Comparison with baselines and the state of the art

In Figure 10, we show qualitative results of the iterative knowledge exchange system, highlighting the agreement between its two components, the graph module and the network module.

We show qualitative results of our unsupervised system, for both the network and the graph modules, on all 4 datasets. The ground truth is sometimes coarse for YouTube-Objects and DAVSOD, and in these cases our results tend to be finer than the annotations, which underscores the difficulty of obtaining highly accurate human annotations.

In Figure 11, we show the final performance of the graph and network modules in the unsupervised setup (no human annotations at any level of training or pre-training). While the graph shows superior performance, the single-image network module is also competitive and outperforms most top methods at the same level of supervision.

Table 4 Quantitative comparison on the DAVSOD dataset for the video salient object detection task

Table 6 Quantitative comparison on the YouTube-Objects dataset for the zero-shot video object segmentation task

Computational complexity

Each cycle of the IKE system requires one pass through the graph module and one through the network module. Given the formulation of the space-time graph, with a one-to-one correspondence between video pixels and graph nodes, the spectral clustering problem may seem intractable.

The complexity of the entire system is linear in the number of frames, so we report the computational cost per frame. For the first cycle of the graph module, the implementation takes 0.8 seconds/frame: 0.04 seconds for optical flow + 0.18 seconds for graph data initialization + 0.58 seconds for 20 space-time graph iterations.

Figure 12

Initialization is needed only in the first cycle. The reported numbers correspond to the maximum number of features considered (26) and to FlowNet2.0 optical flow (the RAFT solution takes 0.33 sec/frame). The network module takes 1.64 sec/frame: 1.63 sec for 5 training epochs + 0.01 sec for inference.

The total time required by IKE is 5.24 seconds/frame at a resolution of 224 × 416. The graph module could also be parallelized, although our current implementation does not do so. In Figure 13, we examine how the computational cost of the first cycle of the graph module evolves with the number of features and the number of frames.

Figure 13

Discussion and conclusions

In the dual iterative knowledge exchange system, the unsupervised spatiotemporal clustering module provides a supervisory signal to the deep network module, which in turn passes its newly learned deep features back to the graph. The two complementary modules operate as a single self-supervised entity and exchange information over several cycles until consensus is reached.

IKE fits the current needs of video object segmentation well, because the unsupervised case is essential for developing robust methods for unseen data. By combining classical graph clustering with the complementary power of modern deep learning, we strike a balance between optimization-based and data-driven models, which may offer new ideas for unsupervised video segmentation research.

