Large-scale Vision–Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval- and correlation-based metrics and is prone to taxonomy dependence and ambiguous negatives.
To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based Probabilistic Entailment Protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. Together, these yield a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.
We address two key limitations of existing hyperbolic VLMs:
1. Adaptive Entailment (AdaEnt) Loss. Prior entailment losses suffer from cone collapse—when parent embeddings contract toward the origin, their entailment cones degenerate into half-spaces, destroying the hierarchy. Our AdaEnt loss directly minimizes the exterior angle with an adaptive weight that prevents near-identical pairs from being pushed apart, providing a smoother and more stable training signal without relying on problematic aperture calculations.
2. Probabilistic Entailment Protocol (PEP). We identify critical flaws in existing hierarchical evaluation: ranking metrics are insensitive to error magnitude, and retrieval metrics suffer from false negatives in the candidate pool. Our PEP treats the angle between embeddings as a direct proxy for entailment probability and uses AUC-ROC and Average Precision for robust, fine-grained evaluation.
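To make the AdaEnt idea concrete, here is a minimal sketch of an angle-based entailment loss with an adaptive weight. This is an illustrative Euclidean stand-in: the paper operates in hyperbolic space, where the exterior angle is computed along geodesics, and the exact weighting function is not specified here, so `np.tanh` of the pair distance is a hypothetical choice that merely satisfies the stated property (near-identical pairs receive near-zero weight and are not pushed apart).

```python
import numpy as np

def exterior_angle(parent, child, eps=1e-6):
    # Angle at `parent` between the ray origin->parent and the ray
    # parent->child. Zero when the child lies "outward" along the parent's
    # direction, i.e. inside an ideal entailment cone. Euclidean stand-in
    # for the hyperbolic (geodesic) angle used in the paper.
    u = parent
    v = child - parent
    cos = (u * v).sum(-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps
    )
    return np.arccos(np.clip(cos, -1.0 + eps, 1.0 - eps))

def adaent_loss(parent, child):
    # Hypothetical adaptive weight: vanishes as child -> parent, so
    # near-identical pairs contribute no repulsive gradient. The loss then
    # directly minimizes the exterior angle, with no aperture computation.
    w = np.tanh(np.linalg.norm(child - parent, axis=-1))
    return (w * exterior_angle(parent, child)).mean()
```

For a parent embedding and a child placed farther along the same direction, the exterior angle (and hence the loss) is near zero; a child in an orthogonal direction incurs a large angle and a correspondingly larger loss.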
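The PEP scoring pipeline can be sketched as follows. The angle computation is again a Euclidean stand-in for the hyperbolic angle, and the hand-rolled AUC-ROC (rank statistic, assuming distinct scores) and Average Precision are included only to make the block self-contained; in practice one would use a metrics library.

```python
import numpy as np

def pep_scores(parents, children, eps=1e-9):
    # Angle between embeddings as a proxy for entailment probability:
    # smaller angle => parent more likely entails child, so negate it.
    u = parents
    v = children - parents
    cos = (u * v).sum(-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps
    )
    return -np.arccos(np.clip(cos, -1.0, 1.0))

def auc_roc(scores, labels):
    # Wilcoxon rank-sum form of AUC-ROC; assumes no tied scores.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(labels) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    # Mean of precision values at the rank of each true positive.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    prec = hits / np.arange(1, len(labels) + 1)
    return (prec * labels).sum() / labels.sum()
```

Because both metrics integrate over all score thresholds, they are sensitive to error magnitude, unlike rank-only retrieval metrics, and they require no fixed candidate pool that could contain false negatives.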
ARGENT consistently outperforms prior baselines across all model scales (ViT-S, ViT-B, ViT-L) on classification, retrieval, and hierarchical metrics.
| Method | Cls. INet | Cls. Avg. | COCO T2I R@5 | COCO T2I R@10 | Flickr T2I R@5 | Flickr T2I R@10 | PEP AUC | PEP AP |
|---|---|---|---|---|---|---|---|---|
| CLIP-L | 39.9 | 40.6 | 57.7 | 69.2 | 70.5 | 80.5 | – | – |
| MERU-L | 39.6 | 40.2 | 57.7 | 68.8 | 70.9 | 81.2 | 60.4 | 23.8 |
| HyCoCLIP-L | 43.9 | 44.4 | 57.5 | 68.6 | 70.7 | 80.2 | 98.0 | 89.5 |
| ARGENT-L (Ours) | 45.6 | 45.1 | 58.6 | 69.7 | 71.7 | 81.7 | 99.5 | 90.3 |
| Δ vs. HyCoCLIP | +1.7 | +0.7 | +1.1 | +1.1 | +1.0 | +1.5 | +1.5 | +0.8 |
Results shown for ViT-L backbone. ARGENT improves across all metrics. See the paper for ViT-S and ViT-B results.
HoroPCA projections reveal that ARGENT learns a more clearly separated hierarchical structure compared to HyCoCLIP.
ARGENT achieves better separation between hierarchical levels (Level 1 = most generic, Level 4 = most specific), with more general concepts positioned closer to the origin.
ARGENT produces cleaner modality separation with distinct clusters for images, image boxes, texts, and text boxes in the hyperbolic space.
ARGENT exhibits well-separated norm distributions for different modality components, reflecting a more organized hierarchical embedding space.
```bibtex
@article{huynh2026argent,
  title   = {ARGENT: Adaptive Hierarchical Image-Text Representations},
  author  = {Huynh, Chuong and Souri, Hossein and Kumar, Abhinav and
             Petsiuk, Vitali and Mohan, Deen Dayal and Kumar, Suren},
  journal = {arXiv preprint arXiv:2603.23311},
  year    = {2026}
}
```