A review of multimodal deep learning!

Author: Chinese Society of Artificial Intelligence

Reprinted from Data Analytics & Applications

1 Introduction

Our experience of the world is multimodal: we see objects, hear sounds, feel textures, smell odors, and taste flavors. A modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it involves several such modalities. For AI to make progress in understanding the world around us, it needs to be able to interpret these multimodal signals together.

For example, images are often associated with tags and textual explanations, and texts contain images to convey their central ideas more clearly. Different modalities have very different statistical properties. Such data are called multimodal big data; they contain rich multimodal and cross-modal information and pose a huge challenge to traditional data fusion methods.

In this review, we introduce some groundbreaking deep learning models for fusing such multimodal big data. As more and more multimodal big data are explored, a number of challenges remain to be solved. This paper therefore reviews deep learning for multimodal data fusion, aiming to give readers, regardless of their original community, the basic principles of multimodal deep learning fusion methods and to stimulate new multimodal data fusion techniques based on deep learning.

Combining different modalities or types of information to improve performance with multimodal deep learning is intuitively appealing, but in practice it is challenging to handle the differing noise levels of, and conflicts between, the modalities. Moreover, different modalities contribute to the prediction results to different degrees. In practice, the most common approach is to concatenate the high-level embeddings of the different inputs and then apply a softmax layer.
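The following sketch illustrates this baseline in PyTorch, assuming two pre-computed modality embeddings (for example, image and text). All dimensions, the class count, and the module name are illustrative choices, not details taken from any cited model.

```python
# Minimal sketch of the common "concatenate then softmax" fusion baseline.
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=300, num_classes=10):
        super().__init__()
        # A single linear head on the concatenated embeddings; softmax is
        # applied over the class logits at prediction time.
        self.head = nn.Linear(img_dim + txt_dim, num_classes)

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.head(fused)

model = ConcatFusionClassifier()
img_emb = torch.randn(4, 512)   # batch of image embeddings
txt_emb = torch.randn(4, 300)   # batch of text embeddings
probs = torch.softmax(model(img_emb, txt_emb), dim=-1)
print(probs.shape)              # torch.Size([4, 10])
```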

The problem with this approach is that it gives equal importance to all sub-networks/modalities, which is highly unlikely in real-world settings. What is needed instead is a weighted combination of the sub-networks, so that each input modality can make a learned contribution (θ) to the output prediction.
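As a rough sketch of such a weighted combination, the following module gives each modality a learnable scalar weight standing in for θ. The projection layers, dimensions, and the softmax normalization of the weights are illustrative assumptions, not a prescribed design.

```python
# Sketch of weighted fusion with a learnable contribution per modality.
import torch
import torch.nn as nn

class WeightedFusionClassifier(nn.Module):
    def __init__(self, dims=(512, 300), hidden=256, num_classes=10):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        # One learnable scalar weight per modality (theta).
        self.theta = nn.Parameter(torch.zeros(len(dims)))
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, *modalities):
        weights = torch.softmax(self.theta, dim=0)   # positive, sums to one
        fused = sum(w * torch.relu(p(x))
                    for w, p, x in zip(weights, self.proj, modalities))
        return self.head(fused)

model = WeightedFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 10])
```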

2 Typical deep learning architectures

In this section, we introduce the representative deep learning architectures that underlie multimodal data fusion models. Specifically, we give the definition of each deep architecture, its feedforward and backpropagation computations, and its typical variants. Table 1 summarizes the representative models.

Table 1: Summary of representative deep learning models.

2.1 Deep Belief Network (DBN)

The restricted Boltzmann machine (RBM) is the fundamental building block of the deep belief network (Zhang, Ding, Zhang, & Xue, 2018; Bengio, 2009). The RBM is a special variant of the Boltzmann machine (see Figure 1). It consists of a visible layer and a hidden layer; the units of the visible layer are fully connected to the units of the hidden layer, but there are no connections between units within the same layer. The RBM is a generative graphical model that uses an energy function to capture the probability distribution over the visible and hidden units.
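To make the energy-based formulation concrete, here is a minimal RBM sketch with one step of contrastive divergence (CD-1) for binary units. It is an illustrative implementation under assumed sizes, not the training code of any model cited in this section.

```python
# Compact binary RBM with one step of contrastive divergence (CD-1).
import torch

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.01):
        self.W = torch.randn(n_visible, n_hidden) * 0.01
        self.a = torch.zeros(n_visible)   # visible biases
        self.b = torch.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def sample_h(self, v):
        p = torch.sigmoid(v @ self.W + self.b)
        return p, torch.bernoulli(p)

    def sample_v(self, h):
        p = torch.sigmoid(h @ self.W.t() + self.a)
        return p, torch.bernoulli(p)

    def energy(self, v, h):
        # E(v, h) = -a^T v - b^T h - v^T W h
        return -(v @ self.a) - (h @ self.b) - ((v @ self.W) * h).sum(dim=1)

    def cd1_step(self, v0):
        ph0, h0 = self.sample_h(v0)
        pv1, v1 = self.sample_v(h0)
        ph1, _ = self.sample_h(v1)
        # Log-likelihood gradient approximated by CD-1.
        batch = v0.shape[0]
        self.W += self.lr * (v0.t() @ ph0 - v1.t() @ ph1) / batch
        self.a += self.lr * (v0 - v1).mean(dim=0)
        self.b += self.lr * (ph0 - ph1).mean(dim=0)

rbm = RBM(n_visible=784, n_hidden=128)
v = torch.bernoulli(torch.rand(32, 784))  # a batch of binary "images"
rbm.cd1_step(v)
```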

Recently, some advanced RBMs have been proposed to improve performance. For example, to avoid network overfitting, Chen, Zhang, Yeung, and Chen (2017) designed a sparse Boltzmann machine that learns the network structure based on a hierarchical latent tree. Ning, Pittman, and Shen (2018) introduced a fast contrastive divergence algorithm for RBMs, in which bounds-based filtering and delta products are used to reduce redundant dot-product computations. To preserve the internal structure of multidimensional data, Ju et al. (2019) proposed the tensor RBM, which learns the high-level distribution hidden in multidimensional data and uses tensor decomposition to avoid the curse of dimensionality.

The DBN is a typical deep architecture that consists of multiple RBMs stacked on top of each other (Hinton & Salakhutdinov, 2006). It is a generative model trained with a pre-training and fine-tuning strategy, and it uses energy functions to capture the joint distribution between the visible units and the corresponding labels. During pre-training, each hidden layer is greedily modeled as an RBM trained in an unsupervised fashion. Afterwards, each hidden layer is further trained with the discriminative information of the training labels in a supervised fashion. DBNs have been used to solve problems in many areas, such as dimensionality reduction, representation learning, and semantic hashing. A representative DBN is shown in Figure 1.

Figure 1: A representative DBN.
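As a rough sketch of the fine-tuning phase described above, a stack of layer-wise pretrained RBM weights can be unrolled into a feedforward network and refined with supervised backpropagation. The random tensors standing in for pretrained weights and all layer sizes below are illustrative assumptions.

```python
# Sketch of DBN fine-tuning: unroll (pretrained) RBM weights into a
# feedforward network and refine them with supervised backprop.
import torch
import torch.nn as nn

layer_sizes = [784, 256, 64]
# Random tensors stand in for weights that would come from greedy
# layer-wise CD pretraining (see the RBM sketch above).
pretrained = [(torch.randn(m, n) * 0.01, torch.zeros(n))
              for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

layers = []
for W, b in pretrained:
    linear = nn.Linear(W.shape[0], W.shape[1])
    with torch.no_grad():
        linear.weight.copy_(W.t())   # nn.Linear stores weight as (out, in)
        linear.bias.copy_(b)
    layers += [linear, nn.Sigmoid()]
layers.append(nn.Linear(layer_sizes[-1], 10))   # classification head
dbn = nn.Sequential(*layers)

# Supervised fine-tuning with label information (discriminative phase).
opt = torch.optim.SGD(dbn.parameters(), lr=0.1)
x, y = torch.rand(32, 784), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(dbn(x), y)
loss.backward()
opt.step()
```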

2.2 Stacked Autoencoder (SAE)

The stacked autoencoder (SAE) is a typical deep learning model with an encoder-decoder architecture (Michael, Olivier, & Mario, 2018; Weng, Lu, Tan, & Zhou, 2016). It captures concise features of the input by converting the original input into an intermediate representation in an unsupervised manner. SAEs have been widely used in many fields, including dimensionality reduction (Wang, Yao, & Zhao, 2016), image recognition (Jia, Shao, Li, Zhao, & Fu, 2018), and text classification (Chen et al., 2017). Figure 2 illustrates a representative SAE.

Figure 2: A representative SAE.
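Below is a minimal SAE-style sketch: an encoder compresses the input into an intermediate code and a decoder reconstructs it, trained with an unsupervised reconstruction loss. Layer sizes and the choice of activations are illustrative assumptions.

```python
# Minimal stacked autoencoder trained by unsupervised reconstruction.
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    def __init__(self, in_dim=784, hidden=(256, 64)):
        super().__init__()
        h1, h2 = hidden
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, in_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)            # concise intermediate representation
        return self.decoder(z), z

sae = StackedAutoencoder()
x = torch.rand(32, 784)
recon, code = sae(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction objective
loss.backward()
```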

2.3 Convolutional Neural Network (CNN)

DBN and SAE are fully connected neural networks. In both networks, each neuron in a hidden layer is connected to every neuron in the previous layer, and this topology produces a large number of connections. To train the weights of these connections, a fully connected neural network needs a large number of training samples to avoid overfitting and underfitting, which is computationally expensive. In addition, the fully connected topology does not take into account the spatial locality of the features represented by the neurons. As a result, fully connected deep neural networks (DBN, SAE, and their variants) are ill-suited to high-dimensional data, especially large images and large audio data.

Convolutional neural networks are a special type of deep network that takes the local topology of the data into account (Li, Xia, Du, Lin, & Samat, 2017; Sze, Chen, Yang, & Emer, 2017). A convolutional neural network consists of constrained parts, namely convolutional and pooling layers, together with fully connected layers. The convolution and pooling operations implement local receptive fields and reduce the number of parameters. Like the DBN and SAE, convolutional neural networks are trained with stochastic gradient descent. They have made great strides in image recognition (Maggiori, Tarabalka, Charpiat, & Alliez, 2017) and semantic analysis (Hu, Lu, Li, & Chen, 2014). A representative CNN is shown in Figure 3.

Figure 3: A representative CNN.
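A small CNN sketch in this spirit is given below: convolution and pooling layers provide local receptive fields and parameter sharing, followed by a fully connected classifier. The input size, channel counts, and class count are illustrative assumptions.

```python
# Small convolutional network: conv + pooling layers, then a linear head.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local receptive fields
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling reduces resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully connected classifier
)

x = torch.randn(4, 3, 32, 32)                    # batch of 32x32 RGB images
logits = cnn(x)
print(logits.shape)                              # torch.Size([4, 10])
```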

2.4 Recurrent Neural Networks (RNNs)

A recurrent neural network is a neural computing architecture for processing sequential data (Martens & Sutskever, 2011; Sutskever, Martens, & Hinton, 2011). Unlike the deep feedforward architectures (i.e., DBN, SAE, and CNN), it not only maps input patterns to output results but also carries the hidden state across time steps through connections between hidden units (Graves & Schmidhuber, 2008). By using these hidden connections, the RNN models temporal dependencies and thus shares parameters across the temporal dimension. It has been applied with excellent performance in fields such as speech analysis (Mulder, Bethard, & Moens, 2015), image captioning (Xu et al., 2015), and language translation (Graves & Jaitly, 2014). Similar to the deep feedforward architectures, its computation consists of forward and backward passes. In the forward pass, the RNN takes both the input and the hidden state. In the backward pass, it uses the backpropagation through time algorithm to propagate the loss across time steps. Figure 4 shows a representative RNN.

Figure 4: A representative RNN.
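The sketch below illustrates this: the hidden state is carried across time steps in the forward pass, and calling backward() propagates the loss through time. All sizes are illustrative assumptions.

```python
# Recurrent processing: the hidden state links time steps; loss.backward()
# performs backpropagation through time (BPTT).
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=32, batch_first=True)
head = nn.Linear(32, 5)

x = torch.randn(4, 20, 10)          # batch of 4 sequences, 20 steps each
outputs, h_last = rnn(x)            # outputs: (4, 20, 32), h_last: (1, 4, 32)
logits = head(outputs[:, -1])       # predict from the final hidden state
loss = nn.functional.cross_entropy(logits, torch.randint(0, 5, (4,)))
loss.backward()                     # backpropagation through time
```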

3 Deep learning for multimodal data fusion

In this section, we review the most representative multimodal data fusion deep learning models from the perspectives of model tasks, model frameworks, and evaluation datasets. Depending on the deep learning architecture used, they fall into four categories. Table 2 summarizes representative multimodal deep learning models.

Table 2: Summary of representative multimodal deep learning models.

3.1 Multimodal data fusion based on deep belief networks

3.1.1 Example 1

Srivastava and Salakhutdinov (2012) proposed a multimodal generative model based on the deep Boltzmann machine, which learns multimodal representations by fitting the joint distribution of multimodal data over modalities such as images, text, and audio.

Each module of the proposed multimodal DBN is initialized in an unsupervised, layer-by-layer manner, and the model is trained with an approximation method based on Markov chain Monte Carlo (MCMC).

To evaluate the learned multimodal representations, a variety of tasks are performed, such as generating a missing modality, inferring joint representations, and discriminative tasks. The experiments verify that the learned multimodal representations satisfy the required properties.

3.1.2 Example 2

To effectively diagnose Alzheimer's disease at an early stage, Suk, Lee, Shen, and the Alzheimer's Disease Neuroimaging Initiative (2014) proposed a multimodal Boltzmann model that can incorporate complementary knowledge from multimodal data. Specifically, to address the limitations of shallow feature learning approaches, a DBN is used to learn a deep representation of each modality by transforming domain-specific representations into hierarchical abstract representations. A single-layer RBM is then built on a concatenation vector formed from the hierarchical abstract representations of each modality; it learns the multimodal representation by modeling the joint distribution of the different modality features. Finally, the proposed model was extensively evaluated on the ADNI dataset for three typical diagnostic tasks, achieving state-of-the-art diagnostic accuracy.
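A rough sketch of this fusion pattern is given below: modality-specific encoders produce high-level features, the features are concatenated, and a joint RBM hidden layer sits on top of the concatenation. The encoders, dimensions, and randomly initialized weights are illustrative stand-ins, not the trained components of the cited model.

```python
# Fusion pattern sketch: per-modality encoders -> concatenation -> joint RBM layer.
import torch
import torch.nn as nn

feat_a = nn.Sequential(nn.Linear(100, 32), nn.Sigmoid())   # modality A encoder
feat_b = nn.Sequential(nn.Linear(80, 32), nn.Sigmoid())    # modality B encoder

x_a, x_b = torch.rand(16, 100), torch.rand(16, 80)
joint_visible = torch.cat([feat_a(x_a), feat_b(x_b)], dim=1)   # (16, 64)

# Joint RBM hidden layer over the concatenated vector; in practice these
# weights would be learned with contrastive divergence, as in the RBM
# sketch in Section 2.1.
W = torch.randn(64, 48) * 0.01
b = torch.zeros(48)
p_hidden = torch.sigmoid(joint_visible @ W + b)    # shared multimodal code
h = torch.bernoulli(p_hidden)
print(h.shape)   # torch.Size([16, 48])
```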

3.1.3 Example 3

To accurately estimate human pose, Ouyang, Chu, and Wang (2014) designed a multi-source deep learning model that learns multimodal representations from mixture types, appearance scores, and deformations by extracting the joint distribution of body modes in a higher-order space. In this multi-source deep model, the three widely used modalities are extracted from a pictorial structure model, which combines the parts of the body based on conditional random field theory. To obtain the multimodal data, the pictorial structure model is trained with a linear support vector machine. Each of the three features is then fed into a two-layer restricted Boltzmann machine to capture an abstract representation of the higher-order pose space from the feature-specific representations. With unsupervised initialization, the restricted Boltzmann machine for each specific modality captures an intrinsic representation of the global space. An RBM is then used to further learn the human-pose representation from the concatenation of the high-level mixture-type, appearance-score, and deformation representations. To train the proposed multi-source deep learning model, a task-specific objective function that considers both body locations and human detection was designed. The proposed model was validated on the LSP, PARSE, and UIUC datasets and yielded improvements of up to 8.6%.

Recently, some new DBN-based multimodal feature learning models have been proposed. For example, Amer, Shields, Siddiquie, and Tamrakar (2018) proposed a hybrid approach for sequential event detection, in which conditional RBMs are employed to extract modality-specific and cross-modal features with additional discriminative label information. Al-Waisy, Qahwaji, Ipson, and Al-Fahdawi (2018) introduced a multimodal approach to face recognition, in which a DBN-based model fuses the multimodal distribution of local handcrafted features captured by the Curvelet transform, combining the advantages of local and deep features (Al-Waisy et al., 2018).

3.1.4 Summary

These DBN-based multimodal models use probabilistic graphical networks to map modality-specific representations into semantic features in a shared space, and the joint distribution over the modalities is then modeled from the features of that shared space. DBN-based multimodal models are flexible and robust under unsupervised, semi-supervised, and supervised learning strategies, and they are well suited to capturing the informative features of the input data. However, they ignore the spatial and temporal topology of multimodal data.

3.2 Multimodal data fusion based on stacked autoencoder

3.2.1 Example 4

The multimodal deep learning model proposed by Ngiam et al. (2011) is the most representative stacked autoencoder (SAE)-based model for multimodal data fusion. It is designed to solve two data fusion problems: cross-modal and shared-modal representation learning. The former aims to capture better unimodal representations by leveraging knowledge from other modalities, while the latter learns complex correlations between modalities at an intermediate level. To achieve these goals, three learning scenarios were designed: multimodal, cross-modal, and shared-modal learning, as shown in Table 3 and Figure 6.

Figure 6: Architectures for multimodal, cross-modal, and shared modal learning.

Table 3: Setup for multimodal learning.

In the multimodal learning scenario, the audio spectrogram and the video frames are concatenated into a single vector. The concatenated vectors are fed into a sparse restricted Boltzmann machine (sparse RBM) to learn the correlation between audio and video. This model can only learn shallow joint representations of the modalities, because the correlations are implicit in the high-dimensional, raw-level representations and a single-layer sparse RBM cannot model them. Motivated by this, the concatenated vectors of intermediate representations are instead fed into the sparse RBM to model the correlations between the modalities, which yields better performance.

In the cross-modal learning scenario, a deep stacked multimodal autoencoder is proposed to explicitly learn the correlation between modalities. Specifically, both audio and video are presented as inputs during feature learning, but only one of them is fed into the model during supervised training and testing. The model is initialized in the multimodal learning manner, which models cross-modal relationships well.

In the shared-modal representation scenario, modality-specific deep stacked multimodal autoencoders are introduced, inspired by denoising autoencoders, to explore joint representations between modalities, especially when one modality is missing. The training dataset, augmented by replacing one of the modalities with zeros, is fed into the feature learning model.
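The sketch below captures this idea in a bimodal autoencoder: a shared code reconstructs both modalities, and training cases with one modality zeroed out push the shared code to recover the missing input. The dimensions and the exact augmentation scheme are illustrative assumptions rather than the configuration used by Ngiam et al.

```python
# Bimodal autoencoder with a shared code and modality-zeroing augmentation.
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, shared=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(video_dim, 128), nn.ReLU())
        self.shared = nn.Linear(256, shared)           # joint representation
        self.dec_a = nn.Linear(shared, audio_dim)
        self.dec_v = nn.Linear(shared, video_dim)

    def forward(self, a, v):
        z = torch.relu(self.shared(
            torch.cat([self.enc_a(a), self.enc_v(v)], dim=1)))
        return self.dec_a(z), self.dec_v(z)

model = BimodalAutoencoder()
audio, video = torch.rand(8, 100), torch.rand(8, 300)

# Augment the batch with "audio only" and "video only" inputs (the other
# modality zeroed), while always reconstructing both original signals.
inputs = [(audio, video),
          (audio, torch.zeros_like(video)),
          (torch.zeros_like(audio), video)]
loss = 0.0
for a_in, v_in in inputs:
    rec_a, rec_v = model(a_in, v_in)
    loss = loss + nn.functional.mse_loss(rec_a, audio) \
                + nn.functional.mse_loss(rec_v, video)
loss.backward()
```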

Finally, detailed experiments were carried out on the CUAVE and AVLetters datasets to evaluate the performance of multimodal deep learning in task-specific feature learning.

3.2.2 Example 5

To generate visually and semantically valid human skeletons from a series of images, especially videos, Hong, Yu, Wan, Tao, and Wang (2015) proposed a multimodal deep autoencoder to capture the fusion relationship between images and poses. In particular, the proposed multimodal deep autoencoder is trained with a three-stage strategy to construct a nonlinear mapping between two-dimensional images and three-dimensional poses. In the feature fusion stage, a multi-view hypergraph low-rank representation is used to construct an intrinsic two-dimensional representation from a series of image features, such as histograms of oriented gradients and shape context, based on manifold learning. In the second stage, a single-layer autoencoder is trained to learn an abstract representation that is used to recover the three-dimensional pose by reconstructing the two-dimensional image features. At the same time, a single-layer autoencoder is trained in a similar way to learn abstract representations of the three-dimensional poses. After obtaining an abstract representation of each single modality, a neural network is used to learn the multimodal correlation between the 2D images and the 3D poses by minimizing the squared Euclidean distance between the mutual representations of the two modalities. The learning of the proposed multimodal deep autoencoder consists of an initialization stage and a fine-tuning stage. During initialization, the parameters of each sub-part of the multimodal deep autoencoder are copied from the corresponding autoencoders and the neural network. The parameters of the whole model are then further fine-tuned by stochastic gradient descent to reconstruct the three-dimensional pose from the corresponding two-dimensional image.

3.2.3 Summary

SAE-based multimodal models adopt an encoder-decoder architecture to extract intrinsic modality-specific features and cross-modal features through unsupervised reconstruction. Because they are built on the SAE, which is a fully connected model, they have a large number of parameters to train. In addition, they ignore the spatial and temporal topology of multimodal data.

3.3 Multimodal data fusion based on convolutional neural network

3.3.1 Example 6

To model the semantic mapping distribution between images and sentences, Ma, Lu, Shang, and Li (2015) proposed a multimodal convolutional neural network. To fully capture the semantic relevance, a three-level fusion strategy was designed in an end-to-end architecture: the word level, the phrase level, and the sentence level. The architecture consists of an image subnet, a matching subnet, and a multimodal subnet. The image subnet is a representative deep convolutional neural network, such as AlexNet or Inception, that encodes the image input into a concise representation. The matching subnet models a joint representation that associates the image content with word fragments of the sentence in a semantic space.

3.3.2 Example 7

To extend visual recognition systems to an unbounded number of discrete categories, Frome et al. (2013) proposed a multimodal convolutional neural network that exploits the semantic information in text data. The network consists of a language sub-model and a vision sub-model. The language sub-model is a skip-gram model that maps textual information into dense representations in a semantic space. The vision sub-model is a representative convolutional neural network, such as AlexNet, pre-trained on the 1000-class ImageNet dataset to capture visual features. To model the semantic relationship between images and text, the language and vision sub-models are combined through a linear projection layer, with each sub-model initialized from the parameters learned on its own modality. A new loss function is then used to train this visual-semantic multimodal model; it produces high similarity scores for correct image-label pairs by combining dot-product similarity with a hinge rank loss. The model achieves state-of-the-art performance on the ImageNet dataset while avoiding semantically unreasonable predictions.
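A rough sketch of such a hinge rank loss with dot-product similarity follows. The projection layer, the margin value, and the random tensors standing in for CNN features and skip-gram label embeddings are illustrative assumptions.

```python
# Hinge rank loss sketch: the projected image should score higher (dot
# product) with its correct label embedding than with wrong labels.
import torch
import torch.nn as nn

margin = 0.1
visual_dim, text_dim, batch = 512, 300, 8

project = nn.Linear(visual_dim, text_dim)   # maps visual features into the text space
img_feats = torch.randn(batch, visual_dim)  # stand-in for pretrained CNN features
label_embs = torch.randn(batch, text_dim)   # stand-in for skip-gram label vectors

v = project(img_feats)                      # (batch, text_dim)
scores = v @ label_embs.t()                 # dot-product similarities (batch, batch)
correct = scores.diag().unsqueeze(1)        # similarity with the true label

# Penalize wrong labels that come within `margin` of the true one.
losses = torch.clamp(margin - correct + scores, min=0)
mask = 1.0 - torch.eye(batch)               # exclude the correct pair itself
loss = (losses * mask).sum(dim=1).mean()
loss.backward()
```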

3.3.3 Summary

CNN-based multimodal models can learn local multimodal features across modalities through local receptive fields and pooling operations. They explicitly model the spatial topology of multimodal data, and because they are not fully connected, the number of parameters is greatly reduced.

3.4 Multimodal data fusion based on recurrent neural networks

3.4.1 Example 8

To generate captions for images, Mao et al. (2014) proposed a multimodal recurrent neural architecture. This multimodal recurrent neural network (m-RNN) bridges the probabilistic correlation between images and sentences. It overcomes a limitation of previous work, which could not generate novel image captions because it retrieved matching captions from a sentence database based on a learned image-text mapping. In contrast, the m-RNN learns a joint distribution over the semantic space conditioned on the given words and the image. Given an image, it generates a sentence word by word according to the captured joint distribution. Specifically, the multimodal recurrent neural network consists of a language subnet, a vision subnet, and a multimodal subnet, as shown in Figure 7. The language subnet consists of two word-embedding layers, which capture effective task-specific representations, and a single-layer recurrent part, which models the temporal dependencies of the sentence. The vision subnet is essentially a deep convolutional neural network, such as AlexNet, ResNet, or Inception, that encodes the high-dimensional image into a compact representation. Finally, the multimodal subnet is a hidden layer that models the joint semantic distribution of the learned language and visual representations.

Figure 7: The multimodal recurrent neural network, consisting of a language subnet, a vision subnet, and a multimodal subnet.
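The following sketch mimics this three-subnet layout for a single decoding pass: word embeddings and a recurrent state are fused with a projected image feature in a multimodal layer that predicts the next word. The vocabulary size, dimensions, and the random tensor standing in for CNN features are illustrative assumptions.

```python
# m-RNN-style sketch: language subnet (embedding + RNN), vision subnet
# (projected CNN feature), and a multimodal layer predicting the next word.
import torch
import torch.nn as nn

class MRNNSketch(nn.Module):
    def __init__(self, vocab=1000, emb=128, hidden=256, img_dim=512, mm=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)       # word embedding layer
        self.rnn = nn.RNN(emb, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, mm)      # vision subnet projection
        self.word_proj = nn.Linear(emb, mm)
        self.rnn_proj = nn.Linear(hidden, mm)
        self.out = nn.Linear(mm, vocab)             # next-word distribution

    def forward(self, words, img_feat):
        e = self.embed(words)                       # (batch, T, emb)
        r, _ = self.rnn(e)                          # (batch, T, hidden)
        # Multimodal layer: fuse word, recurrent, and image information.
        m = torch.tanh(self.word_proj(e) + self.rnn_proj(r)
                       + self.img_proj(img_feat).unsqueeze(1))
        return self.out(m)                          # logits over the vocabulary

model = MRNNSketch()
words = torch.randint(0, 1000, (4, 12))             # token ids for 4 captions
img_feat = torch.randn(4, 512)                       # stand-in CNN features
logits = model(words, img_feat)
print(logits.shape)                                  # torch.Size([4, 12, 1000])
```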

3.4.2 Example 9

To address the limitation that current visual recognition systems cannot generate rich descriptions of images, a multimodal alignment model was proposed that bridges the modal relationship between visual and textual data (Karpathy & Li, 2017). To achieve this, a two-part scheme is proposed. First, a visual-semantic embedding model is designed to generate a multimodal training dataset. Then, a multimodal RNN is trained on this dataset to generate rich descriptions of images.

In the visual-semantic embedding model, a region-based convolutional neural network is used to obtain rich representations of the image that contain enough information about the content corresponding to the sentence. Each sentence is then encoded, with a bidirectional RNN, into a dense vector of the same dimension as the image representation. In addition, a multimodal scoring function is defined to measure the semantic similarity between images and sentences. Finally, a Markov random field method is used to generate the multimodal dataset.

In the multimodal RNN, a more efficient extended model conditioned on the textual content and the image input is proposed. The multimodal model consists of a convolutional neural network that encodes the image input and an RNN that encodes the image features and the sentence. The model is likewise trained with stochastic gradient descent. Both multimodal models were extensively evaluated on the Flickr and MSCOCO datasets and achieved state-of-the-art performance.

3.4.3 Summary

RNN-based multimodal models can analyze the temporal dependencies hidden in multimodal data with the help of the explicit state transitions in the hidden-unit computation, and they train their parameters with the backpropagation through time algorithm. However, because the computation proceeds through sequential hidden-state transitions, it is difficult to parallelize on high-performance hardware.

4 Summary and outlook

We have grouped the models into four categories of deep learning models for multimodal data fusion, based on the DBN, SAE, CNN, and RNN. Some progress has already been made with these pioneering models. However, they are still in their infancy, so challenges remain.

First, there are a large number of free weights in multimodal data fusion deep learning models, especially redundant parameters that contribute little to the target task. Training these parameters, which capture the feature structure of the data, requires feeding a large amount of data through the backpropagation algorithm, which is computationally intensive and time-consuming. Therefore, how to design new compression methods for multimodal deep learning, building on existing compression strategies, is a potential research direction.

Second, multimodal data contains not only rich modality-specific information but also rich cross-modal information. Therefore, combining deep learning with semantic fusion strategies may be one way to address the challenges posed by exploring multimodal data.

Third, multimodal data collected from dynamic environments is inherently uncertain. Therefore, with the explosive growth of dynamic multimodal data, the design of online and incremental multimodal deep learning models for data fusion is a problem that must be solved.
