ROBIN, A TYPE OF CAT: Investigating Hypernymy in Unimodal and Multimodal Models with Contrastive Learning

Zhirui Chen

Abstract: Visual information is commonly assumed to complement distributional semantics in achieving human-like concept understanding, motivating the development and evaluation of various vision-language models (VLMs). However, findings on when and how VLMs outperform unimodal LMs have been mixed. One key challenge lies in the representation of abstract words with low perceptibility. In this work, we focus on contrastive VLMs, whose text encoders produce visual-semantic representations for text-only input and have been reported to outperform unimodal LMs across several word-, phrase-, and sentence-level understanding tasks. We propose a novel approach to the lexical relation hypernymy (IS_A) based on synthetic concepts (“q, a type of p”), and accordingly conduct an intrinsic evaluation of the text encoders of contrastive VLMs against unimodal counterparts trained with contrastive learning. We find that contrastive VLMs, though generally outperformed by unimodal sentence transformers, possibly due to the absence of unimodal language modeling, achieve competitive performance on traditional hypernymy benchmarks. We further argue that contrastive VLMs hold an inherent advantage in distinguishing hypernymy from one particular distractor relation, coordination (co-hyponymy), and suggest that further research is needed to better complement contrastive VLMs with textual distributional information. Moreover, we examine the impact of word concreteness on model behaviour using a newly constructed dataset, and argue that abstractness does not necessarily pose a greater challenge to the text encoders of contrastive VLMs than to unimodal LMs. We also highlight the importance of exploring more systematic evaluation protocols for abstract concept representation.
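To make the synthetic-concept idea concrete, the sketch below illustrates one way such a probe could be implemented: form phrases of the shape “q, a type of p” and score them with the text encoder of a contrastive VLM (CLIP) and with a unimodal sentence transformer. This is a minimal illustration, not the evaluation protocol of the paper; the model checkpoints and the similarity-based scoring heuristic are assumptions chosen for the example.

```python
# Illustrative sketch (not the paper's released code): score synthetic
# "q, a type of p" phrases with a contrastive VLM text encoder (CLIP)
# and with a unimodal sentence transformer, then compare candidate hypernyms.
import torch
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoints; any contrastive VLM / sentence transformer pair would do.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
st_model = SentenceTransformer("all-MiniLM-L6-v2")

def clip_text_embed(texts):
    """L2-normalised CLIP text embeddings for a list of strings."""
    inputs = clip_proc(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

hyponym = "robin"
candidates = ["bird", "cat"]  # true hypernym vs. a distractor
phrases = [f"{hyponym}, a type of {p}" for p in candidates]

# One possible scoring choice: cosine similarity between the bare hyponym and
# each synthetic phrase; a higher score marks the preferred hypernym.
clip_scores = (clip_text_embed([hyponym]) @ clip_text_embed(phrases).T).squeeze(0)
st_scores = util.cos_sim(
    st_model.encode(hyponym, convert_to_tensor=True),
    st_model.encode(phrases, convert_to_tensor=True),
).squeeze(0)

for p, c, s in zip(candidates, clip_scores.tolist(), st_scores.tolist()):
    print(f"{hyponym} -> {p}:  CLIP={c:.3f}  SBERT={s:.3f}")
```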