Large-scale Vision–Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval- and correlation-based metrics and is prone to taxonomy dependence and ambiguous negatives.
To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based Probabilistic Entailment Protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. Together, these yield a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.
We address two key limitations of existing hyperbolic VLMs:
1. Adaptive Entailment (AdaEnt) Loss. Prior entailment losses suffer from cone collapse—when parent embeddings contract toward the origin, their entailment cones degenerate into half-spaces, destroying the hierarchy. Our AdaEnt loss directly minimizes the exterior angle with an adaptive weight that prevents near-identical pairs from being pushed apart, providing a smoother and more stable training signal without relying on problematic aperture calculations.
2. Probabilistic Entailment Protocol (PEP). We identify critical flaws in existing hierarchical evaluation: ranking metrics are insensitive to error magnitude, and retrieval metrics suffer from false negatives in the candidate pool. Our PEP treats the angle between embeddings as a direct proxy for entailment probability and uses AUC-ROC and Average Precision for robust, fine-grained evaluation.
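To make the AdaEnt idea concrete, here is a minimal sketch of an angle-based entailment loss with an adaptive weight. This is an illustrative Euclidean stand-in: the paper operates in hyperbolic space, where the exterior angle is computed along geodesics, and the exact weighting function is not specified here, so `np.tanh` of the pair distance is a hypothetical choice that merely satisfies the stated property (near-identical pairs receive near-zero weight and are not pushed apart).

```python
import numpy as np

def exterior_angle(parent, child, eps=1e-6):
    # Angle at `parent` between the ray origin->parent and the ray
    # parent->child. Zero when the child lies "outward" along the parent's
    # direction, i.e. inside an ideal entailment cone. Euclidean stand-in
    # for the hyperbolic (geodesic) angle used in the paper.
    u = parent
    v = child - parent
    cos = (u * v).sum(-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps
    )
    return np.arccos(np.clip(cos, -1.0 + eps, 1.0 - eps))

def adaent_loss(parent, child):
    # Hypothetical adaptive weight: vanishes as child -> parent, so
    # near-identical pairs contribute no repulsive gradient. The loss then
    # directly minimizes the exterior angle, with no aperture computation.
    w = np.tanh(np.linalg.norm(child - parent, axis=-1))
    return (w * exterior_angle(parent, child)).mean()
```

For a parent embedding and a child placed farther along the same direction, the exterior angle (and hence the loss) is near zero; a child in an orthogonal direction incurs a large angle and a correspondingly larger loss.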
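The PEP scoring pipeline can be sketched as follows. The angle computation is again a Euclidean stand-in for the hyperbolic angle, and the hand-rolled AUC-ROC (rank statistic, assuming distinct scores) and Average Precision are included only to make the block self-contained; in practice one would use a metrics library.

```python
import numpy as np

def pep_scores(parents, children, eps=1e-9):
    # Angle between embeddings as a proxy for entailment probability:
    # smaller angle => parent more likely entails child, so negate it.
    u = parents
    v = children - parents
    cos = (u * v).sum(-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps
    )
    return -np.arccos(np.clip(cos, -1.0, 1.0))

def auc_roc(scores, labels):
    # Wilcoxon rank-sum form of AUC-ROC; assumes no tied scores.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(labels) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    # Mean of precision values at the rank of each true positive.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    prec = hits / np.arange(1, len(labels) + 1)
    return (prec * labels).sum() / labels.sum()
```

Because both metrics integrate over all score thresholds, they are sensitive to error magnitude, unlike rank-only retrieval metrics, and they require no fixed candidate pool that could contain false negatives.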
ARGENT consistently outperforms prior baselines across all model scales (ViT-S, ViT-B, ViT-L) on classification, retrieval, and hierarchical metrics.
| Method | Cls. INet | Cls. Avg. | COCO T2I R@5 | COCO T2I R@10 | Flickr T2I R@5 | Flickr T2I R@10 | PEP AUC | PEP AP |
|---|---|---|---|---|---|---|---|---|
| CLIP-L | 39.9 | 40.6 | 57.7 | 69.2 | 70.5 | 80.5 | – | – |
| MERU-L | 39.6 | 40.2 | 57.7 | 68.8 | 70.9 | 81.2 | 60.4 | 23.8 |
| HyCoCLIP-L | 43.9 | 44.4 | 57.5 | 68.6 | 70.7 | 80.2 | 98.0 | 89.5 |
| ARGENT-L (Ours) | 45.6 | 45.1 | 58.6 | 69.7 | 71.7 | 81.7 | 99.5 | 90.3 |
| Δ vs. HyCoCLIP | +1.7 | +0.7 | +1.1 | +1.1 | +1.0 | +1.5 | +1.5 | +0.8 |
Results shown for ViT-L backbone. ARGENT improves across all metrics. See the paper for ViT-S and ViT-B results.
HoroPCA projections reveal that ARGENT learns a more clearly separated hierarchical structure compared to HyCoCLIP.
ARGENT achieves better separation between hierarchical levels (Level 1 = most generic, Level 4 = most specific), with more general concepts positioned closer to the origin.
ARGENT produces cleaner modality separation with distinct clusters for images, image boxes, texts, and text boxes in the hyperbolic space.
ARGENT exhibits well-separated norm distributions for different modality components, reflecting a more organized hierarchical embedding space.
```bibtex
@article{huynh2026argent,
  title   = {ARGENT: Adaptive Hierarchical Image-Text Representations},
  author  = {Huynh, Chuong and Souri, Hossein and Kumar, Abhinav and
             Petsiuk, Vitali and Mohan, Deen Dayal and Kumar, Suren},
  journal = {arXiv preprint arXiv:2603.23311},
  year    = {2026}
}
```