Meta-Guide.com

1611698277

Posted on 2021/02/09 by mendicott

> **A vision-language fusion module** maps the encoded image and text into vectors in the same semantic space so that their semantic similarity can be computed using cosine distance of their vectors.

VinVL: Advancing the state of the art for vision-language models
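The comparison step described in the quote can be illustrated with a minimal sketch. It assumes the fusion module has already produced an image vector and text vectors of the same dimension; the three-dimensional embeddings and the function names below are hypothetical and are not taken from VinVL. Note that cosine distance is commonly defined as one minus cosine similarity, so a smaller distance means a closer semantic match.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical embeddings in a shared semantic space (illustrative values only).
image_vec = [0.9, 0.1, 0.3]
caption_match = [0.8, 0.2, 0.4]   # assumed to describe the image
caption_other = [-0.1, 0.9, 0.0]  # assumed to describe something else

if __name__ == "__main__":
    print("matching caption distance:", cosine_distance(image_vec, caption_match))
    print("other caption distance:   ", cosine_distance(image_vec, caption_other))
```

With these made-up vectors, the matching caption ends up at a smaller cosine distance from the image than the unrelated one, which is the property the shared semantic space is meant to provide.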
