SMART IMAGE CAPTION GENERATOR

SUGAT BHAGAT
Jan 7, 2022

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. Being able to automatically describe the content of an image using properly formed English sentences is a very challenging task, but it could have great impact, for instance by helping visually impaired people better understand the content of images on the web.

The task is significantly harder than, for example, the well-studied image classification or object recognition tasks, which have been a main focus in the computer vision community. Indeed, a description must capture not only the objects contained in an image, but also how these objects relate to each other, their attributes, and the activities they are involved in. Moreover, this semantic knowledge has to be expressed in a natural language like English, which means that a language model is needed in addition to visual understanding.

Convolutional Neural Networks were designed to map image data to an output variable. They have proven so effective that they are the go-to method for any type of prediction problem involving image data as an input.

Recurrent Neural Networks, or RNNs, were designed to work with sequence prediction problems. Some of these sequence prediction problems include one-to-many, many-to-one, and many-to-many.

LSTM networks are perhaps the most successful RNNs, as they allow the model to retain information across longer sequences of words or sentences when making predictions.

Can we model this as a one-to-many sequence prediction task?

Yes, but how would the LSTM, or any other sequence prediction model, understand the input image? We cannot feed in the raw RGB image tensor directly, as such models are ill-equipped to work with that kind of input. Input with spatial structure, like images, cannot be modeled easily with the standard vanilla LSTM.

Can we extract some features from the input image?

Yes, this is precisely what we need to do in order to use the LSTM architecture for our purpose. We can use a deep CNN to extract features from the image, which are then fed into the LSTM to generate the caption, as sketched below.
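To make this pipeline concrete, here is a minimal sketch of the resulting generation loop, assuming a trained Keras captioning model `model` and a fitted `tokenizer` (both sketched in the phases below); the `startseq`/`endseq` tokens and the function name are illustrative assumptions, not details fixed by this post.

```python
# A minimal sketch of the CNN-to-LSTM generation loop. `model`, `tokenizer`,
# and the startseq/endseq tokens are assumptions, sketched later in this post.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    caption = "startseq"  # assumed start-of-sequence token
    for _ in range(max_length):
        # Encode the words generated so far and pad to a fixed length.
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        # Predict the next word from the image features plus the partial caption.
        probs = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":  # assumed end-of-sequence token
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```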

The Smart AI-infused image caption generator is packed with deep learning neural networks, namely Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM):

1) CNNs are deployed for extracting spatial information from the images

2) RNNs are harnessed for generating sequential data of words

3) LSTM is good at remembering lengthy sequences of words

Figure: A functional CNN-RNN model (image source: ResearchGate).

The Smart Image Caption Generator works in three phases:

1) Feature Extraction

The first move is made by the CNN, which extracts distinct features from an image based on its spatial context. The CNN produces a dense feature vector, also called an embedding, that is used as the input to the following RNN.

The CNN is fed images as inputs in different formats, including PNG, JPG, and others. The network compresses the large amount of information in the original image into a smaller, RNN-compatible feature vector. This is why the CNN is also referred to as the 'encoder'.
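As a concrete example, below is a minimal feature-extraction sketch using a pretrained InceptionV3 encoder in Keras. The choice of InceptionV3, the 299x299 input size, and the 2048-dimensional embedding are illustrative assumptions; any deep CNN with its classification head removed plays the same role.

```python
# A minimal feature-extraction sketch; InceptionV3 is an assumed encoder choice.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load InceptionV3 pretrained on ImageNet and drop the classification head,
# keeping the 2048-dimensional penultimate layer as the image embedding.
base = InceptionV3(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(img_path):
    # InceptionV3 expects 299x299 RGB inputs.
    img = image.load_img(img_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x)  # shape: (1, 2048)
```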

2) Tokenization

The second phase brings the RNN into the picture for 'decoding' the feature vectors produced by the CNN module. To generate captions, the RNN model needs to be trained on a relevant dataset: it must learn to predict the next word in a sentence. However, training the model on raw strings is ineffective; the words must first be mapped to definite numerical values.

For this purpose, the image captions are converted into lists of tokenized words, as shown below.
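Here is a minimal sketch of that tokenization step using the Keras Tokenizer; the sample captions and the startseq/endseq boundary markers are illustrative assumptions.

```python
# A minimal tokenization sketch; the captions below are hypothetical examples.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    "startseq a dog runs across the grass endseq",
    "startseq two children play football on a field endseq",
]

tokenizer = Tokenizer()                      # builds a word -> integer index map
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1   # +1: index 0 is reserved for padding

sequences = tokenizer.texts_to_sequences(captions)  # words -> integer lists
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")

print(vocab_size)   # vocabulary size, e.g. 15 for the two captions above
print(padded[0])    # first caption as a fixed-length integer sequence
```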

3) Text Prediction

Post tokenization, the last phase of the model is triggered using the LSTM. This step uses an embedding layer to transform each word into the desired vector, which is then passed to the decoder. With the LSTM, the model can retain information from the input feature vector and from the words generated so far, and predict the next word in the caption, one step at a time, until an end-of-sequence token is produced.
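Below is a minimal sketch of such a decoder, following the widely used 'merge' CNN-LSTM captioning architecture; the layer sizes and the vocab_size/max_length values are illustrative assumptions carried over from the sketches above.

```python
# A minimal 'merge' decoder sketch; all sizes below are illustrative assumptions.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_length = 5000, 34                # assumed from the tokenization step
feature_dim, embed_dim, units = 2048, 256, 256   # assumed layer sizes

# Image branch: project the CNN feature vector into the decoder's hidden space.
img_input = Input(shape=(feature_dim,))
img_dense = Dense(units, activation="relu")(Dropout(0.5)(img_input))

# Text branch: embed the partial caption and summarize it with an LSTM.
seq_input = Input(shape=(max_length,))
seq_embed = Embedding(vocab_size, embed_dim, mask_zero=True)(seq_input)
seq_state = LSTM(units)(Dropout(0.5)(seq_embed))

# Merge both branches and predict a distribution over the next word.
merged = Dense(units, activation="relu")(add([img_dense, seq_state]))
output = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_input, seq_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

The model is trained on (image features, partial caption) to next-word pairs, which is why both inputs appear above; at inference time the same model drives the generation loop sketched earlier.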

Applications of Smart Image Captioning

The AI-powered image captioning model is an automated tool that generates concise and meaningful captions for prodigious volumes of images efficiently. The model employs techniques from computer vision and Natural Language Processing (NLP) to extract comprehensive textual information about the given images.

  1. Recommendations in Editing Applications
  2. Assistance for Visually Impaired
  3. Media and Publishing Houses
  4. Social Media Posts

Model Evaluation — Bilingual Evaluation Understudy (BLEU) Score:

BLEU is a metric for evaluating a generated sentence against a reference sentence. The score was developed for evaluating the predictions made by automatic machine translation systems. A perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.0.

The actual and predicted descriptions are collected and evaluated using the corpus BLEU score, which summarizes how close the generated text is to the expected text.
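As an illustration, the corpus BLEU score can be computed with NLTK's corpus_bleu; the reference and predicted captions below are toy examples.

```python
# A minimal BLEU evaluation sketch with NLTK; the captions are toy examples.
from nltk.translate.bleu_score import corpus_bleu

# Each generated caption is scored against one or more reference captions.
references = [
    [["a", "dog", "runs", "across", "the", "grass"]],  # references for image 1
    [["two", "children", "play", "football"]],         # references for image 2
]
predictions = [
    ["a", "dog", "runs", "on", "the", "grass"],
    ["two", "kids", "play", "football"],
]

# Weighted n-gram overlap: 1.0 is a perfect match, 0.0 a complete mismatch.
print("BLEU-1: %f" % corpus_bleu(references, predictions, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(references, predictions, weights=(0.5, 0.5, 0, 0)))
```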

CONCLUSION

A CNN-LSTM architecture has wide-ranging applications, as it sits at the intersection of Computer Vision and Natural Language Processing. It allows us to apply state-of-the-art NLP models, such as the Transformer, to sequential image and video data. At the same time, extremely powerful CNNs can be applied to sequential data such as natural language. Hence, it lets us leverage the useful aspects of powerful models in tasks they were never used for before.
