> **A vision-language fusion module** maps the encoded image and text into vectors in the same semantic space so that their semantic similarity can be computed using cosine distance of their vectors.
VinVL: Advancing the state of the art for vision-language models
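
The comparison step described in the quote can be illustrated with a minimal sketch. It assumes the fusion module has already produced one vector per image and per caption; the three-dimensional vectors and the names `image_vec`, `matching_text_vec`, and `unrelated_text_vec` are hypothetical values chosen for illustration, not VinVL outputs. Cosine similarity is the normalized dot product of the two vectors, and cosine distance is commonly defined as one minus that value, so a higher similarity means a smaller distance.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between u and v (range -1.0 to 1.0)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity, so 0.0 means the same direction."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical embeddings in a shared semantic space (illustrative values only).
image_vec = [0.9, 0.1, 0.3]
matching_text_vec = [0.8, 0.2, 0.4]
unrelated_text_vec = [-0.1, 0.9, -0.2]

sim_match = cosine_similarity(image_vec, matching_text_vec)
sim_unrelated = cosine_similarity(image_vec, unrelated_text_vec)

# The caption whose vector points in nearly the same direction scores higher.
print(f"matching caption:  {sim_match:.3f}")
print(f"unrelated caption: {sim_unrelated:.3f}")
```

Because the score depends only on the angle between the vectors, rescaling either embedding leaves it unchanged, which is one reason cosine-based measures are a common choice for comparing embeddings from different encoders.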