Deep Thinking: Are Bigger Visual Deep Learning Models Always Better?

author:3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant on WeChat: dddvision, with a note in the format "direction + school/company + nickname", and you will be added to the group. Industry subdivisions are listed at the end of the article.

Scaling up models has been one of the key drivers of progress across many areas of AI in recent years, including language modeling, image and video generation, and more. Similarly, for visual understanding, larger models consistently show improvements given sufficient pre-training data. This trend has made the pursuit of giant models with billions of parameters the default strategy for obtaining more powerful visual representations and better performance on downstream tasks.

In this work, the authors revisit the question: is a larger model necessarily needed for better visual understanding? Rather than scaling up the model size, the authors scale along the image-scale dimension, which they call Scaling on Scales (S2). With S2, a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L) is run on multiple image scales to produce a multiscale representation. Starting from a model pre-trained at a single image scale, the input image is interpolated to multiple scales; at each scale, the larger image is split into sub-images of the model's native size (224^2), each sub-image is processed separately, and the features of all scales are pooled and concatenated with the features of the original representation.

Surprisingly, by evaluating the visual representations of various pre-trained models (e.g., ViT, DINOv2, OpenCLIP, MVP), the authors find that smaller models with S2 scaling consistently outperform larger models on classification, semantic segmentation, depth estimation, MLLM benchmarks, and robot manipulation, while using significantly fewer parameters (0.28× to 0.07×) and comparable GFLOPs. Notably, by scaling the image scale up to 1008^2, S2 achieves state-of-the-art MLLM visual detail understanding on the V* benchmark, surpassing open-source and even commercial MLLMs such as Gemini Pro and GPT-4V.

The authors further investigate when S2 is the preferred scaling method compared with scaling up model size. They find that although smaller models with S2 achieve better downstream performance than larger models in many cases, larger models can still generalize better on hard examples. This raises the question of whether smaller models can reach the same generalization ability as larger models. Surprisingly, the authors find that the features of a larger model can be well approximated by those of a multiscale smaller model through a single linear transformation, implying that the smaller model should have at least a similar learning capacity to its larger counterpart. The authors hypothesize that the weaker generalization stems from pre-training at only a single image scale. Through experiments with ImageNet-21k pre-training on ViT, they show that pre-training with S2 scaling improves the generalization of smaller models, enabling them to match or even exceed the advantage of larger models.

Let's read about this work together~

Title: When Do We Not Need Larger Vision Models?

By Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell

Institutions: UC Berkeley, Microsoft

Original link: https://arxiv.org/abs/2403.13043

Code link: https://github.com/bfshi/scaling_on_scales

Scaling up vision models has become the de facto standard for obtaining more powerful visual representations. In this work, we discuss the point beyond which larger vision models are no longer necessary. First, we demonstrate the power of Scaling on Scales (S2), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run on multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, multimodal LLM (MLLM) benchmarks, and robot manipulation. Notably, S2 achieves state-of-the-art performance on MLLM visual detail understanding on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S2 is the preferred scaling approach compared with scaling model size. While larger models generalize better on hard examples, we show that the features of larger vision models can be well approximated by those of multiscale smaller models. This suggests that most, if not all, of what current large pre-trained models learn can also be obtained from multiscale smaller models. Our results show that a multiscale smaller model has learning capacity comparable to a larger model, and that pre-training smaller models with S2 can match or even exceed the advantage of larger models. We have released a Python package that can apply S2 to any vision model with a single line of code.

S2-Wrapper is a simple mechanism that extends any pre-trained vision model to multiple image scales in a parameter-free manner. Taking ViT-B as an example, S2-Wrapper first interpolates the input image to different scales (e.g., 224^2 and 448^2) and splits each scaled image into sub-images of the model's default input size (448^2 → 4×224^2). For each scale, all sub-images are fed into the same model, and the outputs (e.g., 4×16^2) are merged into a feature map of the whole image (32^2). The feature maps of the different scales are then pooled back to the original spatial size (16^2) and concatenated along the channel dimension. The final multiscale feature has the same spatial shape as the single-scale feature but a higher channel dimension (e.g., 1536 vs. 768).
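To make the mechanism concrete, here is a minimal PyTorch sketch of the multiscale feature extraction just described. It is an illustration only: the `backbone` callable, its assumed output layout (a square token grid with a trailing channel dimension), and the helper name `s2_features` are placeholders rather than the official API; the released single-line wrapper is available in the repository linked above.

```python
import torch
import torch.nn.functional as F


def s2_features(backbone, x, scales=(224, 448), out_tokens=16):
    """backbone: maps a (B, 3, 224, 224) batch to a (B, t, t, C) token grid.
    x: input images of shape (B, 3, H, W).
    Returns a (B, out_tokens, out_tokens, C * len(scales)) multiscale feature map."""
    base = scales[0]
    feats = []
    for s in scales:
        xs = F.interpolate(x, size=(s, s), mode="bicubic", align_corners=False)
        n = s // base                                   # sub-images per side (1, 2, ...)
        # Split the resized image into n*n sub-images of the backbone's native size.
        subs = xs.unfold(2, base, base).unfold(3, base, base)
        subs = subs.permute(0, 2, 3, 1, 4, 5).reshape(-1, x.shape[1], base, base)
        f = backbone(subs)                              # (B*n*n, t, t, C)
        t, c = f.shape[1], f.shape[-1]
        # Stitch sub-image features back into one large feature map per image ...
        f = f.reshape(x.shape[0], n, n, t, t, c).permute(0, 1, 3, 2, 4, 5)
        f = f.reshape(x.shape[0], n * t, n * t, c)
        # ... then pool it back to the original spatial size (e.g., 32^2 -> 16^2).
        f = F.adaptive_avg_pool2d(f.permute(0, 3, 1, 2), out_tokens).permute(0, 2, 3, 1)
        feats.append(f)
    return torch.cat(feats, dim=-1)                     # channels grow with the number of scales
```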

S2 scaling and model-size scaling are compared across three models (ViT, DINOv2, and OpenCLIP) and three tasks (ImageNet classification, semantic segmentation, and depth estimation).

For each model and task, model-size scaling is tested on the base, large, and huge/giant models (plotted as gray curves). For S2 scaling (plotted as green curves), three sets of scales are tested, from single scale (1x) up to multiple scales (up to 3x), with each set chosen to match the GFLOPs of the corresponding model size. For certain models and tasks, S2 scaling is additionally tested on both the base and the large model (plotted as light-green and dark-green curves, respectively). In (a), (d), (e), (f), (g), and (i), the base model with S2 scaling achieves performance comparable to or better than larger models, with similar GFLOPs and a much smaller model size. In (b) and (h), S2 scaling from the large model is comparable to the giant model, again with similar GFLOPs and fewer parameters. The only failure case is (c), where S2 scaling on either the base or the large model cannot compete with model-size scaling.

Comparison of S2 scaling and model-size scaling on MLLMs. S2 scaling shows comparable or better scaling curves than model-size scaling on all three types of benchmarks. Using larger image scales generally improves performance, while using a larger model degrades performance in some cases.

Three types of models were evaluated: (i) ViT pre-trained on ImageNet-21k, (ii) OpenCLIP pre-trained on LAION-2B, and (iii) MAE pre-trained on ImageNet-1k. The reconstruction loss is averaged over all output tokens and evaluated on ImageNet-1k. The results are shown in Table 2. Compared to the base model, the multiscale base model consistently achieves a lower loss and reconstructs more of the information represented by the larger model (e.g., 0.521 vs. 0.440 for ViT). More interestingly, the amount of information reconstructed from multiscale base models is often close to that reconstructed by the huge/giant models themselves, sometimes slightly lower but never significantly higher. For example, while OpenCLIP-Base can reconstruct 92.7% of the information, the multiscale base model can reconstruct 99.9%. For the other models, the reconstruction ratio of the Base-S2 model is usually close to 100% and never exceeds it by more than 0.5%. This implies that (i) the huge/giant model is indeed an effective upper bound for feature reconstruction, and (ii) most of what the larger model learns is also learned by the multiscale smaller model. The only exception is reconstructing OpenCLIP-Huge features, where the reconstruction ratio is 88.9%. Although not close to 100%, this is still significantly better than the base-size model, which means that at least a large part of the huge model's features are still multiscale features. These results imply that smaller models with S2 scaling should have at least a similar capacity to learn what larger models learn. On the other hand, there is a gap between the training and test sets: the reconstruction ratio on the test set can be lower than on the training set (e.g., 96.3% vs. 99.9% for OpenCLIP-L). The authors hypothesize that this is because base-model features pre-trained at only a single image scale generalize poorly when multi-scale is applied only after pre-training.
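As a rough illustration of this reconstruction probe, the sketch below fits a single linear transformation that maps small-model features to large-model features and reports the resulting reconstruction error. The closed-form least-squares fit, the bias handling, and the function name are illustrative assumptions; the authors' exact training procedure and loss normalization may differ.

```python
import torch


def reconstruction_loss(feats_small, feats_large):
    """feats_small: (N, d_small), feats_large: (N, d_large).
    Fits one linear map (with bias) from small to large features by least squares
    and returns the mean-squared reconstruction error."""
    ones = torch.ones(feats_small.shape[0], 1, dtype=feats_small.dtype)
    X = torch.cat([feats_small, ones], dim=1)          # append a bias column
    W = torch.linalg.lstsq(X, feats_large).solution    # argmin ||X W - feats_large||^2
    return ((X @ W - feats_large) ** 2).mean().item()


# Example comparison (features are assumed to be precomputed):
# loss_base    = reconstruction_loss(base_feats,    large_feats)
# loss_base_s2 = reconstruction_loss(base_s2_feats, large_feats)
```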

For memory capacity, given a dataset (e.g., ImageNet), the authors treat each image as a separate category and train the model to classify individual images, which requires it to memorize every image. The classification loss reflects how well each instance is memorized and thus the capacity of the model. For training loss, the authors report the classification loss on the ImageNet-1k training set for DINOv2 and OpenCLIP; a lower loss means the model fits the training data better, implying a larger model capacity. The results are shown in Table 3. For instance memorization, ViT-B with S2 scaling (224^2 and 448^2) has a loss similar to ViT-L. For ImageNet classification, the training loss of ViT-B-S2 is similar to that of ViT-L for OpenCLIP, and even lower for DINOv2. These results suggest that the multiscale smaller model has at least comparable capacity to the larger model.

Pre-training with S2 can make smaller models better. The authors evaluate ImageNet classification for base models with S2 scaling applied either during or only after pre-training. The models were pre-trained on ImageNet-21k, using either supervised ViT image classification or DINOv2 as the pre-training objective, and compared with and without S2 during pre-training, as well as against single-scale base and large models. The results are shown in Table 4. When the base models are trained at a single image scale and scaled to multiple image scales only after pre-training, their performance is suboptimal compared with large models. However, when S2 scaling is included during pre-training, the multiscale base model outperforms the large model for ViT. For DINOv2, the base model pre-trained with S2 performs significantly better than the base model without S2 and is more comparable to the large model; although still slightly behind, pre-training the large model with S2 as well may yield a better scaling curve. This demonstrates that a smaller model pre-trained with S2 can compete with a larger model.
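Below is a hedged sketch of the instance-memorization probe described above: every image becomes its own class, a classifier is trained to identify individual images, and the final training loss indicates how well each instance is memorized. Using a linear head on precomputed backbone features, and the specific optimizer settings, are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn


def instance_memorization_loss(features, epochs=100, lr=1e-2):
    """features: (N, d) precomputed feature vectors, one per image (frozen backbone).
    Trains a linear head whose i-th class is simply the i-th image and returns the
    final cross-entropy loss; lower loss = better memorization = larger capacity."""
    n, d = features.shape
    labels = torch.arange(n)                  # image index doubles as class label
    head = nn.Linear(d, n)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    features = features.detach()              # keep the backbone frozen
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(features), labels)
        loss.backward()
        opt.step()
    return loss.item()
```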

In this work, the authors ask: is a larger model always needed for better visual understanding? They find that scaling along the image-scale dimension (Scaling on Scales, S2), rather than the model size, generally yields better performance across a wide range of downstream tasks. They further show that smaller models with S2 can learn most of what larger models learn, and that pre-training smaller models with S2 can match, or even exceed, the advantage of larger models.

S2 has several implications for future work, including (i) scale-selective processing: not every scale at every image location contains equally useful features, so depending on the image content and the high-level task, it would be more efficient to select and process only certain scales for each region, similar to the bottom-up and top-down selection mechanisms of human visual attention; and (ii) parallel processing of sub-images: unlike a conventional ViT, S2 processes each sub-image independently, so the sub-images of a single image can be processed in parallel, which is particularly helpful in latency-critical scenarios involving large images.

Readers interested in more experimental results and details can read the original paper~

This article is for academic sharing only. If there is any infringement, please contact us to delete it.

3D Vision Workshop Exchange Group

At present, we have established multiple communities covering 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: Image Classification/Segmentation, Object Detection, Medical Imaging, GAN, OCR, 2D Defect Detection, Remote Sensing Mapping, Super-Resolution, Face Detection, Behavior Recognition, Model Quantization and Pruning, Transfer Learning, Human Pose Estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, array cameras, photometric stereo, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note in the format "research direction + school/company + nickname" (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflectometry, Robotic Arm Grasping, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Line and Surface Structured Light, Hardware Structured Light Scanners, Drones, etc.
