There are several levels and aspects to this. On the image side, there is text-to-image conversion (or text-to-image generation), along with text-to-image systems. On the scene side, there is text-to-scene conversion and text-to-scene generation, as well as text-to-scene systems. As far as 3D is concerned, there have been attempts at text-to-3D for virtual worlds such as Second Life.
For details, see my quick and dirty webpages: