1611698277 - Meta-Guide.com

> **A vision-language fusion module** maps the encoded image and text into vectors in the same semantic space so that their semantic similarity can be computed using cosine distance of their vectors.

VinVL: Advancing the state of the art for vision-language models
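
The comparison step described in the quote above can be illustrated with a small sketch. It assumes the fusion module has already produced one vector for the image and one for each candidate text; the embedding values and the names `image_vec`, `caption_match`, and `caption_other` are made up for illustration and do not come from VinVL. Note that cosine distance is conventionally one minus cosine similarity, so a smaller distance means a closer semantic match.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (1.0 means same direction)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical embeddings in a shared semantic space (illustrative numbers
# only; a real model would produce vectors with hundreds of dimensions).
image_vec = [0.9, 0.1, 0.3]
caption_match = [0.8, 0.2, 0.4]    # text assumed to describe the image
caption_other = [-0.1, 0.9, -0.2]  # text assumed to be unrelated

sim_match = cosine_similarity(image_vec, caption_match)
sim_other = cosine_similarity(image_vec, caption_other)

# Rank candidate texts by similarity to the image, best match first.
ranked = sorted(
    [("caption_match", sim_match), ("caption_other", sim_other)],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])
```

Because cosine similarity depends only on the angle between vectors and not on their lengths, it is a common choice when embeddings from different encoders may differ in scale.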