ComBo

Overview

Many vision foundation models are available, each trained with different objectives and data. These differences result in diverse representations, making certain models and layers more effective than others for specific tasks. We explore the benefits of probing features from multiple frozen foundation models and layers, proposing ComBo, a scalable adapter to Combine backBones.

Different foundation models learn different representations throughout their layers

Foundation models are trained using various objectives and data, which leads to diverse feature representations across their layers. The table below summarises some popular models and their training regimes.

Below, we visualise these differences through images optimised to maximally activate neurons in different layers of popular foundation models. These images were generated using the technique from Ghiasi et al. (using the code released here).

The best model and layer depend on the task

These differences in representations affect model performance on downstream tasks. Using linear probing to evaluate the performance of different models and layers on the VTAB-1k benchmark, we observe that the best model and layer vary depending on the task.

We propose ComBo, an architecture to jointly Combine backBones by probing features from their different layers

Probing is a practical and efficient way to leverage features produced across the layers of the diverse foundation models available. It operates on frozen features, avoiding the need for potentially expensive backpropagation through the models, while scaling to many models and layers. To this end, we propose ComBo, a simple and scalable probing-based adapter that can Combine backBones by effectively integrating features from multiple models and layers. ComBo's architecture is straightforward, consisting of two trained modules:

(1) A linear layer that compresses concatenated activations from different layers into compact token-wise representations, while preserving spatial information.
(2) A lightweight transformer that processes these compressed tokens to predict task-specific outputs.

We can also use ComBo to identify and keep only the most task-relevant models

However, not all models are equally relevant for every task. We can employ ComBo to identify and retain only the most task-relevant models. We achieve this by using the norm of the learned linear layer weights associated with each model to measure its importance. The steps are as follows, leading to the displayed importances for the different tasks of the VTAB-1k benchmark:

Train ComBo while minimising the importance of each model.
Inspect the weights to measure task relevance.
Retrain using only the most relevant backbones.

Results

Single Model Performance

When probing a single foundation model, ComBo achieves state-of-the-art probing performance on the VTAB-1k benchmark compared to other methods such as Head2Toe and SMP. Critically, ComBo achieves this without requiring dataset-specific hyperparameter tuning, which is necessary for these prior methods. While it does not perform as well as powerful parameter-efficient fine-tuning methods like Adapter+, it offers important benefits in computational efficiency, enabling it to scale to probing multiple layers of multiple models.

Multi-Model Performance

Leveraging ComBo's ability to probe multiple models and layers jointly, we find that combining multiple models consistently improves upon the best individual model, validating the value of combining representations from diverse models. Performance improves further when probing only the most task-relevant models, as identified by ComBo's model importance mechanism.

BibTeX

@inproceedings{ramtoula2025combo,
  author    = {Ramtoula, Benjamin and Lajoie, Pierre-Yves and Newman, Paul and De Martini, Daniele},
  title     = {Fantastic Features and Where to Find Them: A Probing Method to combine Features from Multiple Foundation Models},
  booktitle = {NeurIPS},
  year      = {2025},
}