Natural Language Image Recognition


Natural language image recognition is a type of artificial intelligence technology that combines computer vision with natural language processing (NLP) so that computers can relate the content of images to natural language. A user describes the contents of an image in ordinary language, and the system attempts to recognize the objects, people, or scenes depicted in the image and match them to that description.

Natural language image recognition can be used in a variety of applications, including image search, image annotation, and image classification. For example, a user might search for images of “dogs playing fetch,” and the system would return images that depict dogs playing fetch. Alternatively, given the description “a group of people sitting at a table having a meeting,” the system would classify a matching image as depicting a meeting.
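The search scenario can be illustrated with a minimal sketch. It assumes each image has already been tagged with labels by some upstream recognizer; the index `IMAGE_LABELS`, the file names, the stopword list, and the crude plural stripping are all hypothetical choices made for illustration, not part of any particular system.

```python
import re

# Hypothetical index: file name -> labels assumed to come from an
# upstream recognizer (object detector, scene classifier, and so on).
IMAGE_LABELS = {
    "park.jpg": {"dog", "ball", "grass", "fetch", "playing"},
    "office.jpg": {"people", "table", "meeting", "laptop"},
    "beach.jpg": {"sea", "sand", "dog"},
}

STOPWORDS = {"a", "an", "the", "of", "at", "in", "on", "and"}


def tokenize(text):
    """Lowercase, keep alphabetic words, strip a plural 's', drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = set()
    for w in words:
        # Crude plural handling (an assumption): "dogs" -> "dog".
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        stems.add(w)
    return stems - STOPWORDS


def search(query, index=IMAGE_LABELS):
    """Rank images by Jaccard overlap between query words and image labels."""
    q = tokenize(query)
    scored = []
    for name, labels in index.items():
        union = q | labels
        score = len(q & labels) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically by file name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


if __name__ == "__main__":
    print(search("dogs playing fetch"))
```

Word overlap is the simplest possible matching rule. A real system would more likely compare learned representations of the query and the image, but the ranking step would look much the same.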

Natural language image recognition systems typically use machine learning algorithms and large datasets of labeled images to learn how to recognize and classify the content of images. They may also make use of techniques such as object detection and image segmentation to identify and understand the various components of an image.
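As a rough illustration of learning from labeled images, the sketch below trains a nearest-centroid classifier. The three-number feature vectors and the labels in `TRAINING` are invented toy data standing in for features that a real system would extract from pixels, and nearest-centroid is just one simple learning method chosen for brevity.

```python
import math

# Toy labeled "images": each is a hypothetical 3-number feature vector.
TRAINING = [
    ([0.9, 0.1, 0.0], "dog"),
    ([0.8, 0.2, 0.1], "dog"),
    ([0.1, 0.9, 0.2], "meeting"),
    ([0.2, 0.8, 0.1], "meeting"),
]


def train(examples):
    """Return one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [value / counts[label] for value in acc]
        for label, acc in sums.items()
    }


def classify(centroids, features):
    """Pick the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, [0.85, 0.15, 0.0]))
```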

  • Image sentence mapping, also known as image-sentence alignment, is the process of associating a natural language description with the image it describes. It maps the words and phrases in the description to the objects, people, and scenes depicted in the image, which allows a computer to link the content of the image to the text.
  • Visual-semantic alignments, also known as visual-linguistic alignments, are the correspondences between the visual content of an image or video and the words and phrases used to describe it. For example, a visual-semantic alignment might match each object identified in an image with the label or phrase that refers to it in a text description. Such alignments can improve the performance of natural language image recognition systems and help computers better understand the content of images and videos.
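The two ideas above can be sketched together. The example assumes that words and image regions have already been embedded in a shared vector space; the vectors in `WORDS` and `REGIONS`, the region names, and the averaging rule in `image_sentence_score` are hypothetical and chosen only to make the mechanics visible. Each word is aligned to its most similar region, and the image-sentence score is the mean of those best similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical embeddings in a shared 3-dimensional space.
REGIONS = {
    "region_dog": [0.9, 0.1, 0.0],
    "region_ball": [0.1, 0.9, 0.1],
    "region_grass": [0.0, 0.2, 0.9],
}
WORDS = {
    "dog": [1.0, 0.0, 0.1],
    "ball": [0.0, 1.0, 0.0],
    "grass": [0.1, 0.1, 1.0],
}


def align(words, regions):
    """Map each word to its most similar region and the similarity value."""
    alignment = {}
    for word, word_vec in words.items():
        best = max(regions, key=lambda name: cosine(word_vec, regions[name]))
        alignment[word] = (best, cosine(word_vec, regions[best]))
    return alignment


def image_sentence_score(words, regions):
    """Mean best-match similarity; higher means the sentence fits the image."""
    if not words or not regions:
        return 0.0
    matches = align(words, regions).values()
    return sum(score for _, score in matches) / len(words)


if __name__ == "__main__":
    for word, (region, score) in align(WORDS, REGIONS).items():
        print(f"{word} -> {region} ({score:.2f})")
    print(f"score: {image_sentence_score(WORDS, REGIONS):.2f}")
```

Scoring an image against a sentence this way supports retrieval in both directions: rank images for a given sentence, or rank sentences for a given image.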



See also:

Scene Understanding 2013 | SceneMaker | Text-to-Image Systems | TTSCS (Text-to-scene Conversion Systems)
