ARGENT: Adaptive Hierarchical
Image-Text Representations

1University of Maryland, College Park    2Samsung Research America, AI Center
Work done while at Samsung Research America

ARGENT improves both hierarchical training and evaluation. Our new Probabilistic Entailment Score (pEnt) offers a more discriminative evaluation: while both models may achieve 100% correlation, our metric correctly identifies the superior model.
+2.4%
Classification Accuracy
(vs. HyCoCLIP)
+1.4%
Retrieval Recall@5
(vs. HyCoCLIP)
+2.1%
Hierarchical PEP AUC
(vs. HyCoCLIP)

Abstract

Large-scale Vision–Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval- and correlation-based metrics that are prone to taxonomy dependence and ambiguous negatives.


To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based Probabilistic Entailment Protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. Together, these contributions yield ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation), a stronger hyperbolic VLM baseline. ARGENT improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.

Method

Adaptive Entailment Loss vs Standard Entailment Loss
Behavior of Adaptive Entailment and Standard Entailment Loss. Left (inside norm boundary): The standard entailment loss collapses to zero for all points in the non-origin half-space, even when the exterior angle is large. Our adaptive loss remains active, preventing vanishing gradients. Right (outside norm boundary): The standard loss penalizes noisy positives and true negatives equally. Our adaptive loss assigns lower loss to likely positives while strongly penalizing negatives.

We address two key limitations of existing hyperbolic VLMs:


1. Adaptive Entailment (AdaEnt) Loss. Prior entailment losses suffer from cone collapse—when parent embeddings contract toward the origin, their entailment cones degenerate into half-spaces, destroying the hierarchy. Our AdaEnt loss directly minimizes the exterior angle with an adaptive weight that prevents near-identical pairs from being pushed apart, providing a smoother and more stable training signal without relying on problematic aperture calculations.
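The failure mode and the adaptive fix can be sketched in a few lines of NumPy. The half-aperture and exterior-angle formulas below follow the standard Poincaré-ball entailment-cone formulation; the distance-based weight in `adaptive_entailment_loss` is a hypothetical illustration of "down-weight near-identical pairs", not the paper's exact loss.

```python
import numpy as np

K = 0.1  # cone-width constant from the standard entailment-cone formulation

def half_aperture(x):
    """Half-aperture of the entailment cone at x in the Poincare ball.

    Near the origin the arcsin argument exceeds 1 and must be clipped,
    so the cone widens to a half-space -- the "cone collapse" failure.
    """
    nx = np.linalg.norm(x)
    return np.arcsin(np.clip(K * (1.0 - nx**2) / nx, -1.0, 1.0))

def exterior_angle(x, y):
    """Exterior angle between the cone axis at x and the geodesic toward y."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    dot = float(np.dot(x, y))
    num = dot * (1.0 + nx**2) - nx**2 * (1.0 + ny**2)
    den = nx * np.linalg.norm(x - y) * np.sqrt(1.0 + nx**2 * ny**2 - 2.0 * dot)
    return float(np.arccos(np.clip(num / den, -1.0, 1.0)))

def standard_entailment_loss(x, y):
    # Zero whenever y lies inside the (possibly collapsed) cone at x:
    # once the aperture saturates at pi/2, the entire half-space gets zero loss.
    return max(0.0, exterior_angle(x, y) - half_aperture(x))

def adaptive_entailment_loss(x, y, tau=0.5):
    # Hypothetical sketch: penalize the exterior angle directly, with a weight
    # that vanishes for near-identical pairs so they are not pushed apart.
    w = 1.0 - np.exp(-np.linalg.norm(x - y) / tau)
    return w * exterior_angle(x, y)

# Parent near the origin: its cone saturates to a half-space.
parent = np.array([0.01, 0.0])
child = np.array([0.3, 0.4])  # ~54 degrees off-axis, yet zero standard loss
```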


2. Probabilistic Entailment Protocol (PEP). We identify critical flaws in existing hierarchical evaluation: ranking metrics are insensitive to error magnitude, and retrieval metrics suffer from false negatives in the candidate pool. Our PEP treats the angle between embeddings as a direct proxy for entailment probability and uses AUC-ROC and Average Precision for robust, fine-grained evaluation.
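As a concrete sketch, PEP-style scoring can be reproduced with plain NumPy. The angle-to-probability map and the toy concept pairs below are illustrative assumptions, not the paper's data; the point is that smaller exterior angles rank as more probable entailments, and AUC-ROC/AP then measure ranking quality without a candidate pool.

```python
import numpy as np

def auc_roc(scores, labels):
    """AUC-ROC = probability that a positive pair outscores a negative pair."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def average_precision(scores, labels):
    """AP: precision averaged over the ranks of the true positive pairs."""
    order = np.argsort(-scores, kind="stable")
    ranked = labels[order]
    cum_pos = np.cumsum(ranked)
    precision = cum_pos / (np.arange(len(ranked)) + 1)
    return float((precision * ranked).sum() / ranked.sum())

# Hypothetical exterior angles (radians) for five concept pairs, with binary
# ground truth (1 = parent entails child, 0 = unrelated pair).
angles = np.array([0.3, 0.6, 1.0, 1.9, 2.6])
labels = np.array([1, 0, 1, 1, 0])

# Smaller angle -> higher entailment probability; any monotone decreasing map
# preserves the ranking, so this linear choice is just an assumption.
scores = 1.0 - angles / np.pi
```

Because both metrics depend only on the ranking induced by the angle, they grade every pair by error magnitude rather than by position in a retrieval list, which is the robustness PEP is after.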

Results

ARGENT consistently outperforms prior baselines across all model scales (ViT-S, ViT-B, ViT-L) on classification, retrieval, and hierarchical metrics.

| Method | Cls. INet | Cls. Avg. | COCO T2I R@5 | COCO T2I R@10 | Flickr T2I R@5 | Flickr T2I R@10 | PEP AUC | PEP AP |
|---|---|---|---|---|---|---|---|---|
| CLIP-L | 39.9 | 40.6 | 57.7 | 69.2 | 70.5 | 80.5 | – | – |
| MERU-L | 39.6 | 40.2 | 57.7 | 68.8 | 70.9 | 81.2 | 60.4 | 23.8 |
| HyCoCLIP-L | 43.9 | 44.4 | 57.5 | 68.6 | 70.7 | 80.2 | 98.0 | 89.5 |
| ARGENT-L (Ours) | 45.6 | 45.1 | 58.6 | 69.7 | 71.7 | 81.7 | 99.5 | 90.3 |
| Δ vs. HyCoCLIP | +1.7 | +0.7 | +1.1 | +1.1 | +1.0 | +1.5 | +1.5 | +0.8 |

Results shown for ViT-L backbone. ARGENT improves across all metrics. See the paper for ViT-S and ViT-B results.

Embedding Space Visualization

HoroPCA projections reveal that ARGENT learns a more clearly separated hierarchical structure compared to HyCoCLIP.

HierarCaps Text Embeddings

HoroPCA projections of HierarCaps text embeddings: HyCoCLIP vs. ARGENT (Ours).

ARGENT achieves better separation between hierarchical levels (Level 1: most generic to Level 4: most specific), with more general concepts positioned closer to the origin.

CC3M Component Embeddings

HoroPCA projections of CC3M component embeddings: HyCoCLIP vs. ARGENT (Ours).

ARGENT produces cleaner modality separation with distinct clusters for images, image boxes, texts, and text boxes in the hyperbolic space.

Norm Distributions

Norm distributions: HyCoCLIP vs. ARGENT (Ours).

ARGENT exhibits well-separated norm distributions for different modality components, reflecting a more organized hierarchical embedding space.

BibTeX

@article{huynh2026argent,
  title     = {ARGENT: Adaptive Hierarchical Image-Text Representations},
  author    = {Huynh, Chuong and Souri, Hossein and Kumar, Abhinav and
               Petsiuk, Vitali and Mohan, Deen Dayal and Kumar, Suren},
  journal   = {arXiv preprint arXiv:2603.23311},
  year      = {2026}
}