If all sentences like “how to describe” are extracted from the Internet and their answers are embedded with the corresponding video frames/text, isn’t this enough to develop “auto-generating text” from video?

For details, see my Quora answers to:
1. How can one start to develop a system that can learn the subject by itself through large amounts of data from the Internet and create automatic short animated videos to explain things to humans?
2. If an artificial intelligence could take in a well-written screenplay and turn that into a simulated movie, would that be an effective test of human-level AI and natural language processing (NLP)?
3. Is there any website that auto-generates video from a story-based text?

… such a reverse generation system could theoretically be built by crunching all of YouTube (say, frame by frame) coupled with its audio-to-text transcripts…. But why would such a system be significant? Conceivably, the human mind converts language to image, and vice versa. This process is likely the origin of metaphor, and metaphor is likely the basis of dreams. So such a reverse generation system could probably be used not only to generate metaphors, but also to translate or decode both metaphors and dreams.
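
To make the "frame by frame, coupled with transcripts" idea a little more concrete, here is a minimal sketch of the pairing step only. It is not an actual system: the `Cue` type, the function names, and the word-overlap lookup are all hypothetical, and it assumes the transcript has already been split into non-overlapping timed cues and that frame timestamps are already known.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Cue:
    """One transcript segment (hypothetical type for this sketch)."""
    start: float  # seconds
    end: float    # seconds, exclusive
    text: str


def pair_frames_with_cues(frame_times, cues):
    """Return (frame_time, text) pairs for frames that fall inside a cue.

    Assumes the cues do not overlap in time.
    """
    cues = sorted(cues, key=lambda c: c.start)
    starts = [c.start for c in cues]
    pairs = []
    for t in frame_times:
        # Index of the last cue that starts at or before time t.
        i = bisect_right(starts, t) - 1
        if i >= 0 and t < cues[i].end:
            pairs.append((t, cues[i].text))
    return pairs


def frames_for_text(query, pairs):
    """Naive reverse lookup: frame times whose cue text shares a word with the query."""
    words = set(query.lower().split())
    return [t for t, text in pairs if words & set(text.lower().split())]
```

A real system would presumably replace the word-overlap lookup with learned text and image embeddings; the sketch only shows how frames and transcript text might be lined up by time so that text can be mapped back to frames.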
