Visually Grounded Text Embeddings

A short project on the effect of visual grounding on the semantic structure of text embeddings.

TL;DR: This mini-project was done in 4 days as part of the Interpretability and Explainability in AI course and earned my group the "best project award" out of ~30 projects (find the poster here). Given the short duration of the project, the results here should be taken as very preliminary (tiny dataset, no tests for statistical significance, etc.). I still want to include it, since the project produced some cool results and inspired a bigger individual follow-up project in which I compared the effects of visual grounding in text-based and speech-based language encoders.

Overview of different types of embeddings. Classic word embeddings are learned from text only. VG-BERT combines a language stream and a visual stream to obtain visually grounded word embeddings. The bilinear relation module further enhances these embeddings, resulting in relationally grounded word embeddings.


Background

This project is based on the paper "Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning" (Zhang et al., 2021). The authors introduce VG-BERT, a vision-language model that grounds language learning in vision. The approach is inspired by the fact that humans learn language by grounding concepts in perception and action, and follows other work on vision-language models such as CLIP (Radford et al., 2021). The model consists of a visual stream, based on a VGG model (Simonyan & Zisserman, 2014), and a language stream, based on a BERT model (Devlin et al., 2019). The model learns to align visual and language representations via cross-modal contrastive learning. After training, the language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. For more details on the architecture of VG-BERT, refer to the original paper (Zhang et al., 2021).
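To give a feel for the training signal, here is a minimal sketch of a symmetric InfoNCE-style cross-modal contrastive loss over a batch of paired image/caption embeddings. This is only an illustration of the general technique, not the exact VG-BERT objective (which also involves a bilinear relation module and other details described in the paper); the function name and temperature value are my own choices.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) tensors from the visual and language streams.
    Matched pairs share the same row index; all other rows act as negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # caption -> matching image
    return (loss_i2t + loss_t2i) / 2
```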


Research Questions

Inspired by the finding that VG-BERT's representational space is semantically more meaningful than BERT's representational space (Zhang et al., 2021), we pose the following research questions to further examine the impact of visual grounding:

  • Does visual grounding equally impact concrete and abstract concepts?
  • Does visual grounding help with resolving lexical ambiguities?
  • Are the visually grounded embeddings more cleanly clustered?
  • How does VG-BERT compare to BERT on probing tasks?

Cosine similarities between BERT and VG-BERT embeddings for abstract and concrete words. Higher scores mean that the embeddings remain more similar after grounding.

To answer the first question, we compare the cosine similarity between the embeddings of concrete and abstract words in the ungrounded BERT model and the visually grounded VG-BERT model. In this work, the concreteness of a word is defined as the degree to which its referent is a perceptible entity. Note that while cosine similarity can be misleading when comparing different models (their representational spaces do not necessarily align), it can be used here because VG-BERT uses the pretrained BERT as its backbone and is simply fine-tuned with additional visual data. As can be seen in the plot above, the visually grounded embeddings of abstract words (freedom, justice, love, etc.) show a higher cosine similarity to their ungrounded counterparts than the embeddings of concrete words (apple, car, house, etc.), which means that visual grounding affects the embeddings of concrete words more than those of abstract words. This result should come as little surprise: concrete words such as apple or car are very likely to appear in an image-caption dataset such as MS COCO, which was used to visually ground the BERT embeddings (Zhang et al., 2021), whereas abstract words like freedom or justice are much harder to find distinctive visualizations for.
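A minimal sketch of this comparison, assuming BERT is loaded via Hugging Face transformers and the VG-BERT language stream has been exported to a compatible checkpoint; the path `./vg_bert_checkpoint` is a placeholder, not an official release, and the word lists are just examples from the text above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
# Placeholder path: VG-BERT language stream exported in Hugging Face format.
vg_bert = AutoModel.from_pretrained("./vg_bert_checkpoint")

def word_embedding(model, word: str) -> torch.Tensor:
    """Mean-pool the last hidden state over the word's subword tokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    # Drop [CLS]/[SEP] and average the remaining subword embeddings.
    return hidden[0, 1:-1].mean(dim=0)

concrete = ["apple", "car", "house"]
abstract = ["freedom", "justice", "love"]

for word in concrete + abstract:
    sim = torch.cosine_similarity(word_embedding(bert, word),
                                  word_embedding(vg_bert, word), dim=0)
    print(f"{word:>10s}: cosine(BERT, VG-BERT) = {sim.item():.3f}")
```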


The second question asks about the effect of visual grounding on the model's ability to handle lexical ambiguities and is aimed at its "real-world understanding". Specifically, we test whether the visually grounded model is better at disambiguating homonyms (words with the same spelling but different meanings). Take for example the word "trunk", which can refer either to the trunk of an elephant or the trunk of a car; which of the two is meant only becomes clear from the surrounding context (e.g., "The elephant has a long trunk." vs. "The luggage is in the trunk."). To evaluate this, we separately pass the two sentences through BERT and VG-BERT and extract the embeddings of the word "trunk". Comparing their cosine similarities, we find that the two embeddings remain more similar for BERT than for VG-BERT, i.e., the grounded model separates the two senses more strongly. We repeat this experiment with a total of 20 different homonyms (see the full list below; a code sketch of the comparison follows the table).

| Meaning 1 | Meaning 2 |
|-----------|-----------|
| The luggage is in the trunk | The elephant has a long trunk |
| I need to deposit money at the bank | We had a picnic by the bank |
| He hit the ball with a bat | A cave is home to a bat |
| She wrote a note with a pen | The farmer built a pen |
| She gave a friendly wave | He surfed on a wave |
| The baseball player is a pitcher | She poured tea from a pitcher |
| The musician played the bass | The fisherman caught a big bass |
| We watched the bird crane | The construction site used a crane |
| She spread the toast with jam | We got stuck in a traffic jam |
| I broke my arm | He fired the arm |
| They scheduled a date | He ate a sweet date |
| The coil has a spring | Flowers bloom in spring |
| The athlete broke the record | The musician listened to a record |
| This person is of a different race | The runner finished the race |
| She did not go to the fair | The grade was not fair |
| This paint contains lead | The detective has no lead |
| He gave her a diamond ring | I heard the phone ring |
| He joined a local club | He hit the ball with a club |
| She had to pay a fine | The weather today is fine |
| We danced at the ball | The player kicked the ball |
| The home team won the match | He lit the match |
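As referenced above, the homonym comparison can be sketched roughly as follows. The VG-BERT checkpoint path is again a placeholder, and the subword matching is a simplification of how one might locate the target word in the tokenized sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def contextual_embedding(model, sentence: str, target: str) -> torch.Tensor:
    """Return the contextual embedding of the target word within the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, dim)
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    tokens = inputs["input_ids"][0].tolist()
    # Find the first occurrence of the target's subword span and mean-pool over it.
    for i in range(len(tokens) - len(target_ids) + 1):
        if tokens[i:i + len(target_ids)] == target_ids:
            return hidden[i:i + len(target_ids)].mean(dim=0)
    raise ValueError(f"'{target}' not found in '{sentence}'")

def homonym_similarity(model, sent_a: str, sent_b: str, word: str) -> float:
    emb_a = contextual_embedding(model, sent_a, word)
    emb_b = contextual_embedding(model, sent_b, word)
    return torch.cosine_similarity(emb_a, emb_b, dim=0).item()

bert = AutoModel.from_pretrained("bert-base-uncased")
vg_bert = AutoModel.from_pretrained("./vg_bert_checkpoint")  # placeholder export

for name, model in [("BERT", bert), ("VG-BERT", vg_bert)]:
    sim = homonym_similarity(model,
                             "The elephant has a long trunk.",
                             "The luggage is in the trunk.",
                             "trunk")
    print(f"{name:>8s}: cosine similarity between the two 'trunk' embeddings = {sim:.3f}")
```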


Conclusion

Grounding language models in vision reshapes their representations of concrete concepts the most, helps them disambiguate homonyms, and improves semantic clustering, without disrupting their syntactic representations.

References

1. Zhang, Y., Choi, M., Han, K., et al. (2021). Explainable semantic space by grounding language to vision with cross-modal contrastive learning. In Advances in Neural Information Processing Systems (NeurIPS).
2. Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML).
3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
4. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.