
Image2Tweet
Image2Caption          Image2ConceptualCaption          Image2Tweet
New: The CodaLab page is now live.

ABSTRACT

Image Captioning for the English language is a well-studied paradigm. Over the last decade many large datasets have been released and numerous efforts can be found in the literature on Image Captioning for English. However, Image Captioning for Indian languages has not been explored properly. Image Captioning has various applications, such as recommendations in editing applications, usage in virtual assistants, image indexing, assistance for visually impaired persons, social media, and several other natural language processing applications.
Image2Tweet is an endeavour to ignite Image Captioning research for Indian languages. As this is the first iteration of the task, we will be releasing large-scale Hindi data, with the expectation that in coming editions we will be able to add more Indian languages.
Image2Tweet consists of 3 sub-tasks with increasing levels of complexity, from the relatively easy to the toughest. To understand each of the sub-tasks, please look at the following descriptions and examples -



  • Image2Caption - The task is to generate a descriptive caption for a given image.
    दो आदमी हाथ मिला रहे हैं।
    Eng. Trans. - Two men are shaking hands.

  • Image2ConceptualCaption - The task is to generate a vivid and conceptually richer linguistic caption for a given image.
    सूट में दो आदमी प्यार से हाथ मिला रहे हैं और मुस्कुरा रहे हैं।
    Eng. Trans. - Two men in suits are shaking hands warmly and smiling.

  • Image2Tweet - The task is to generate a tweet for a given image, much like a reporter or a human user would.
    मोदी और ट्रंप ने भारत और अमेरिका के बीच दोस्ती के नए युग की शुरुआत की। #राउडीमोदी
    Eng. Trans. - Modi and Trump start a new era of friendship between India and the USA. #rowdymodi

For each sub-task we will be releasing a large dataset - 3 large datasets in total, each consisting of images and the corresponding captions/labels.

IMAGE CAPTIONING FOR INDIAN LANGUAGES - DEMANDS EXPLORATION

Generating a description of an image is called image captioning. Image captioning requires recognizing the important objects, their attributes, and their relationships in an image. It also requires generating syntactically and semantically correct sentences. The task therefore involves knowledge of both computer vision and natural language processing.
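The dominant modern approach couples a CNN image encoder with a recurrent language decoder. Below is a minimal PyTorch sketch of such an encoder-decoder captioner; the choice of ResNet-50 and the layer sizes are illustrative assumptions, not the task baseline.

# A minimal sketch of the standard encoder-decoder captioning setup:
# a pretrained CNN encodes the image, an LSTM decodes a caption token by token.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(pretrained=True)                     # image encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # drop the final fc layer
        self.img_proj = nn.Linear(2048, embed_dim)                 # project pooled features
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids, used with teacher forcing
        with torch.no_grad():
            feats = self.encoder(images).flatten(1)                # (B, 2048)
        img_emb = self.img_proj(feats).unsqueeze(1)                # image acts as the first "word"
        word_emb = self.embed(captions[:, :-1])
        hidden, _ = self.lstm(torch.cat([img_emb, word_emb], dim=1))
        return self.out(hidden)                                    # (B, T, vocab_size) logits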

Image captioning is important for many reasons. For example, captions can be used for automatic image indexing. Image indexing is important for Content-Based Image Retrieval (CBIR) and can therefore be applied to many areas, including biomedicine, commerce, education, digital libraries, and web searching.

Image Captioning has been a very popular research area over the last decade. Even before the boom of neural network based techniques, people tried various hand-crafted features such as Local Binary Patterns (LBP) (Ojala et al., 2000), Scale-Invariant Feature Transform (SIFT) (Lowe, 2004), and the Histogram of Oriented Gradients (HOG) (de Marneffe et al., 2006), along with classical ML methods like SVMs, for Image Captioning. In deep learning based techniques, on the other hand, features are learned automatically from training data, and such methods can handle a large and diverse set of images. In the last 5 years, a large number of articles have been published on image captioning, with deep learning being popularly used. Moreover, the availability of large new datasets has made learning-based image captioning an interesting research area. The popular datasets for English Image Captioning are the Flickr30K dataset (Plummer et al., 2015), MS COCO (Lin et al., 2014), and the recently released Conceptual Captions dataset from Google (Sharma et al., 2018). Image captioning has been explored for some of the most widely spoken languages in the world - Chinese, Spanish, and English - but there is almost no research on image captioning in Hindi or other Indian languages.
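For illustration, a minimal sketch of the pre-neural recipe mentioned above - hand-crafted HOG features fed to a classical SVM - is shown below, used here for a toy image-tag prediction step rather than full caption generation; the image size and HOG settings are illustrative assumptions.

# Hand-crafted HOG features + SVM: a sketch of the classical (pre-deep-learning) pipeline.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def hog_features(image):
    # image: HxWx3 array; resize to a fixed size and compute a fixed-length HOG descriptor
    gray = resize(image, (128, 128)).mean(axis=2)
    return hog(gray, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

def train_tag_classifier(images, tags):
    # images: list of HxWx3 arrays; tags: object/keyword labels (assumed available)
    X = np.stack([hog_features(img) for img in images])
    return SVC(kernel="rbf").fit(X, tags)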



WHY IMAGE2TWEET?


Image Captioning has several applications, but the Image2Tweet task has been formulated keeping some specific needs in mind.

Limitations of COCO-style and Conceptual Captions - although the candidate captions filtered by the above pipelines are often good Alt-text image descriptions, most of them use proper nouns (characters, places, locations, organizations, etc.). This poses a problem because it is difficult for an image captioning model to learn such fine-grained proper-noun inference from the input image pixels alone.

Visual Fact Verification - Fake news identification and verification is one of the hottest research topics in recent times. Fake news verification often requires fact verification. Although textual fact verification has received significant research attention recently, visual fact verification - fact verification from images and/or videos - has not received adequate attention. Fact verification from text demands textual entailment based methods to support or refute a claim. Now, let us say we have the following image, which demands fact verification -


COCO Style: Two men are shaking hands.
Conceptual Caption: Two men in black dress are shaking hands in a meeting.
Expected Image2Tweet: Donald Trump and Osama Bin Laden are shaking hands.

Visual entailment has recently received some research attention. However, the proposed techniques rely on text as a pivot for the entailment process. Consider the following example from the SNLI-VE dataset (Figure below), where the given image is treated as the premise and a text caption as the hypothesis; the system then has to predict whether the hypothesis is entailed, neutral, or contradictory with respect to the image.



A visual entailment example from the SNLI-VE dataset: given an image premise, three different text hypotheses lead to different labels.


If textual captions are to be used further for fact verification, then COCO-style and Google Conceptual-style captions are too generic to be useful for visual entailment, whereas in the Image2Tweet task the system should be able to identify the people in the image and generate a more specific textual description, which could then be used successfully for claim verification.

Now, the counter-argument could be that recognizing (many) people from faces is yet another problem at scale. We argue that the number of popular faces appearing in typical news in a country is unlikely to exceed about 1,000. Therefore, given a huge amount of parallel image-tweet data, a system should be able to learn to identify people from images. In recent times, DALL-E by OpenAI has shown significant success on text-to-image generation. Let us look at another example of image-based fake news. The following image went viral on social media with an attached claim that the popular cricketer Sachin Tendulkar had decided to join the nationalist party of India, the BJP.



COCO Style: A man in front of a crowd.
Conceptual Caption: A closeup of a mid-aged man, and a parade.
Expected Image2Tweet: Sachin Tendulkar and BJP parade.


This specific tweet generation from the image input requires - 1) person identification, 2) BJP logo identification, and 3) event identification.

Face recognition at scale: Face recognition at scale is a well-studied problem, and quite a few large datasets are available. Among them, the following two are the most widely used: LFW (Labeled Faces in the Wild) and MS-Celeb-1M. Face verification is a relatively easy task with the help of discriminative features from deep neural networks. However, it is still a challenge to recognize faces across millions of identities while keeping high performance and efficiency. Because the number of identities is so large, it is not elegant to treat the task as a plain image classification problem. Instead, the classification task is treated as similarity search, and different similarity search strategies are experimented with; a good similarity search strategy accelerates the search and boosts the accuracy of the final results.
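A minimal sketch of the similarity-search idea is shown below: assuming a face-embedding network is already available (any model mapping an aligned face crop to a fixed-size vector), a query face is matched against a gallery of known identities by cosine similarity; the threshold value is an illustrative assumption.

# Embedding-based similarity search for face identification (sketch).
import numpy as np

def identify(query_emb, gallery_embs, gallery_names, threshold=0.5):
    # query_emb: (D,) embedding of the query face
    # gallery_embs: (N, D) embeddings of known identities; gallery_names: list of N names
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                                   # cosine similarity to every known identity
    best = int(np.argmax(sims))
    return gallery_names[best] if sims[best] >= threshold else "unknown"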

Object detection at scale: Object detection is the task of detecting instances of objects of a certain class within an image. The state-of-the-art methods can be categorized into two main types: one-stage methods and two-stage methods. One-stage methods prioritize inference speed; example models include YOLO, SSD, and RetinaNet. Two-stage methods prioritize detection accuracy; example models include Faster R-CNN, Mask R-CNN, and Cascade R-CNN. The most popular benchmark is the MS COCO dataset, which is manually segmented and annotated for object regions in each image.
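For illustration, the sketch below runs torchvision's pretrained Faster R-CNN (trained on MS COCO) on a single image and keeps only confident detections; the 0.8 score cut-off and the random placeholder image are illustrative assumptions.

# Off-the-shelf two-stage object detection with a pretrained Faster R-CNN.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)                    # placeholder; use a real image tensor in [0, 1]
with torch.no_grad():
    detections = model([image])[0]                 # dict with 'boxes', 'labels', 'scores'
keep = detections["scores"] > 0.8                  # keep confident detections only
print(detections["labels"][keep], detections["boxes"][keep])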

We have argued previously that, given a large amount of data, a system should learn face-to-name and object mappings automatically. Among many recent papers we would like to mention (Hong et al., 2016), whose authors presented experiments to cluster and annotate a set of input images jointly, where the images are clustered into several discriminative groups and each group is automatically identified with representative labels. The output can be seen in the Figure below.


Joint clustering results by (Hong et al., 2016) on CIFAR-100 dataset.


Event detection at scale: Quite similar to face recognition and object detection at scale, event detection from images is a well-studied paradigm. The Web Image Dataset for Event Recognition (WIDER) is the popular dataset for this task. An example from the WIDER dataset (Xiong et al., 2015) is shown in the Figure below.




Examples of several categories in the WIDER (Xiong et al., 2015) dataset, which exhibit diverse visual patterns.


Here we extend a similar argument - given a large amount of data, a system should be able to learn visual events automatically.
Therefore, in order to solve Image2Tweet, researchers need to solve various problems like face recognition at scale, object detection at scale, and event detection at scale - but in an unsupervised way, given a plethora of parallel image and textual description data. Social media is the perfect candidate to supply such an amount of data with adequate variety.

DATA AND RESOURCES - TO BE RELEASED


A number of datasets are available for English Image Captioning. These datasets differ in various respects such as the number of images, the number of captions per image, the format of the captions, and the image size. We chose three datasets - Flickr30k, MS COCO, and Google Conceptual Captions - to translate into Hindi.

  • Image2Caption: Image Captioning demands a large curated dataset either to train a system or to evaluate it, since each image can correspond to many possible captions. For this shared task we have translated the two most popular English datasets to Hindi -

    Flickr30K Dataset - Flickr30K is a dataset for automatic image description and grounded language understanding. It contains 30k images collected from Flickr with 158k captions provided by human annotators. It does not provide any fixed split of images for training, testing, and validation; researchers can choose their own splits. For this task we chose a 70-30 split for training and testing. Image captions are automatically translated from English to Hindi.

    MS COCO Dataset - The Microsoft COCO dataset is a very large dataset, popularly used for Image Captioning, containing more than 300,000 images, more than 2 million instances, and 5 captions per image. Many image captioning methods use this dataset in their experiments. All the English captions are translated automatically and then manually checked.



  • Image2ConceptualCaption: Image captions are normally added by website authors (using the Alt-text HTML attribute); this is one way to make content more accessible to the end user through search engines. The term Conceptual Captions was coined by Google in their paper. Flickr- and COCO-style captions are merely a textual description generated from a given image, whereas conceptual captions represent richer concepts about an image. For example -

    COCO Style: a group of men standing in front of a building
    Conceptual Caption: graduates line up for the commencement ceremony


    Google Conceptual Captions Dataset - Conceptual Captions is a dataset containing ~3.3 million image-caption pairs, created by automatically extracting and filtering image caption annotations from billions of web pages. It is roughly an order of magnitude larger than the human-curated MS-COCO dataset. The machine-curated Conceptual Captions has an accuracy of ~90%. Because the images in Conceptual Captions are pulled from across the web, the dataset represents a much wider variety of image-caption styles than previous datasets, allowing for better training of image captioning models.



  • Image2Tweet: Since Image2Tweet is a new task even for English, we will be releasing two datasets - one for English and one for Hindi.

    English dataset: We have identified two very popular English newspapers and collected data from their Twitter handles.

          
    Hindi Dataset: -
          


    Additional Corpus for Language Modeling - It has been observed in many Image Captioning papers that researchers approach this task in two steps. In the first step, a CNN-based architecture is used to generate textual concepts from the given image; in the second step, a language model is used to generate a set of candidate captions. This method works well and gives better scores. Therefore, for the Image2Tweet task we have decided to release an additional corpus that can be used to re-train or fine-tune modern transformer-based language models like BERT or GPT. A sample tweet from the corpus looks like the Figure below.


    Example of a sample tweet from the corpus.


    Hindi Twitter and NEWS Corpus - We have extracted the anchored news URLs from all the tweets and extracted the main news text from the corresponding HTML pages, discarding all other unnecessary parts. While calculating the image-similarity-based clustering (see the Evaluation Strategy section), many candidate images are discarded because they do not have much overlap with any other image in the corpus; however, we kept all the text in the corpus. Participants can use this corpus to fine-tune their language models (a minimal fine-tuning sketch is shown below).
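    For illustration, here is a minimal sketch of fine-tuning a causal language model on such a corpus with the Hugging Face transformers and datasets libraries; the GPT-2 checkpoint, the file name corpus.txt, and the training hyper-parameters are illustrative assumptions (for Hindi, a multilingual or Hindi-pretrained checkpoint would be substituted).

    # Fine-tune a causal LM on one-tweet-per-line text (sketch).
    from transformers import (AutoTokenizer, AutoModelForCausalLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ds = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
    ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tweet-lm", num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()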


EVALUATION STRATEGY & RANKING


For Image Captioning, the most widely used metrics are n-gram matching metrics such as BLEU, ROUGE, METEOR, and CIDEr.



BLEU - BLEU (Bilingual evaluation understudy) is a metric that is used to measure the quality of machine generated text. Individual text segments are compared with a set of reference texts and scores are computed for each of them.
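For illustration, the sketch below computes a smoothed sentence-level BLEU-4 with NLTK against multiple references; the example sentences are adapted from the sub-task examples above.

# Sentence-level BLEU-4 against multiple references (sketch).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "दो आदमी हाथ मिला रहे हैं".split(),
    "सूट में दो आदमी हाथ मिला रहे हैं".split(),
]
hypothesis = "दो आदमी हाथ मिला रहे हैं".split()

score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),          # BLEU-4
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")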



ROUGE - ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for measuring the quality of a text summary. It compares word sequences, word pairs, and n-grams with a set of reference summaries created by humans. Different variants such as ROUGE-1, ROUGE-2, ROUGE-W, and ROUGE-SU4 are used for different tasks.



METEOR - METEOR (Metric for Evaluation of Translation with Explicit ORdering) is another metric used to evaluate machine-translated text. Word segments are compared with the reference texts; in addition, word stems and synonyms are also considered for matching.

CIDEr - CIDEr (Consensus-based Image Description Evaluation) is an automatic consensus metric for evaluating image descriptions. Most existing datasets have only five captions per image, and with such a small number of references previous evaluation metrics struggle to measure the consensus between generated captions and human judgement. CIDEr measures this consensus using term frequency-inverse document frequency (TF-IDF) weighting of n-grams.
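For illustration, the sketch below computes CIDEr with the pycocoevalcap package (one commonly used implementation of the COCO caption metrics); both dictionaries map an image id to a list of strings, with several references but exactly one system output per id. The example strings are illustrative.

# CIDEr scoring with pycocoevalcap (sketch).
from pycocoevalcap.cider.cider import Cider

gts = {"img1": ["two men are shaking hands", "two men in suits shake hands warmly"]}   # references
res = {"img1": ["two men shake hands"]}                                                 # system output

score, per_image = Cider().compute_score(gts, res)
print("CIDEr:", score)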



For simplicity of ranking we will primarily use BLEU-4; other evaluation metrics like ROUGE, METEOR, and CIDEr will be provided post-assessment, mainly for report writing.



  • Image2Caption: For this task we will rank systems based on BLEU-4, as this metric is the one most commonly used in the community for comparison.

  • Image2ConceptualCaption: For this task we will use CIDEr to rank participants, as the original Google Conceptual Captions work used this metric for reporting and comparison.

  • Image2Tweet: This is the toughest task, and it is being introduced here for the first time. We use a CIDEr-based evaluation setup for Image2Tweet.



    Challenges of Image2Tweet evaluation: Popular Image Captioning datasets like Flickr, COCO, and Google Conceptual Captions provide multiple captions per image, as the same image can be described in many different ways. Therefore, while calculating evaluation scores like BLEU-4 or ROUGE, the system-generated caption is compared with all the reference captions in the gold data. For example, look at the following Figure:

    An example of Image Captioning evaluation - how the BLEU score is calculated from the system-generated caption vs. all the reference gold captions.


    But having multiple tweets for a given image is practically impossible to collect, and having only one reference tweet will affect the evaluation score.
    The proposed setup for Image2Tweet evaluation: Since collecting multiple tweets for an image is not feasible, we observed that similar images tend to end up with similar tweets. With this philosophy in mind, we applied a content-based similarity match on the collected data and kept all similar images in one cluster (a sketch of one possible clustering implementation is shown below, after the baseline note). The released data is pre-processed accordingly, and all the clusters are marked along with image ids. For this task we will use CIDEr, but the score will be calculated between the system-generated tweet and all the tweets belonging to the similar-image cluster provided in the dataset.





    Baselines: A baseline system with essential details will be made available through a GitHub repository. The task will be hosted on Codalab.
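    A minimal sketch of one way such content-based image similarity clustering could be implemented is shown below: pooled ResNet-50 features compared by cosine similarity with a greedy threshold grouping. The feature extractor, the threshold, and the greedy strategy are illustrative assumptions; the released clusters may have been built differently.

    # Content-based image similarity clustering (sketch).
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    cnn = models.resnet50(pretrained=True)
    cnn.fc = torch.nn.Identity()                     # use the pooled 2048-d features
    cnn.eval()

    prep = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                      T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

    def embed(path):
        with torch.no_grad():
            v = cnn(prep(Image.open(path).convert("RGB")).unsqueeze(0)).squeeze(0)
        return v / v.norm()                          # unit-normalised feature vector

    def cluster(paths, threshold=0.85):
        # Greedy clustering: an image joins the first cluster whose seed image is similar enough.
        clusters = []                                # list of (seed embedding, [image paths])
        for p in paths:
            v = embed(p)
            for seed, members in clusters:
                if torch.dot(v, seed) >= threshold:
                    members.append(p)
                    break
            else:
                clusters.append((v, [p]))
        return [members for _, members in clusters]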


WINNER SELECTION


The team doing best across all the languages using only our data (constrained setting) will be the winner. All unconstrained submissions will be used for academic discussion during the session.


** Note: teams can use other additional resources, but only systems trained exclusively on the provided data will be considered for the official ranking.


DATA


We will release a Hindi and an English dataset consisting of 50k instances each.


New: Data Release

Task                        English                                   Hindi
Image2Caption               Flickr30K - download                      Flickr30K Hindi - download [*translated]
                            MS COCO - download                        MS COCO Hindi - download [*translated]
Image2ConceptualCaption     Google Conceptual Caption - download      Google Conceptual Caption Hindi - download [*translated]
Image2Tweet                 Corpus - Images, Tweets                   Corpus - Images, Tweets



IMPORTANT DATES

* (all dates are tentative)

Registration for the task begins: 1 October 2021

Training/Dev data release: 3 October 2021

Test Set release: 15 Nov 2021

Test submission deadline: 18 Nov 2021

Results announced: 21 Nov 2021

Working Notes submission deadline: 28 Nov 2021

Working Notes reviews: 5 Dec 2021

Working Notes final versions due: 10 Dec 2021



CONTACT


Image2Tweet {AT} googlegroups {DOT} com


Organizers

Dr. Amitava Das is a Lead Scientist at Wipro AI Labs, Bangalore, India. Currently, Dr. Das is actively working on Code-Mixing and Social Computing.

Dr. Santanu Pal is a Lead Scientist at Wipro AI Labs, Bangalore, India. Research interests: Machine Translation.

Parth Patwa is an MS student at UCLA. Research interests: Natural Language Processing, Machine Learning, Social Computing, and Computer Vision.

 

Student Organizers

Rishabh Jha is a 3rd Year Undergraduate student at IIIT Sricity. Research interests: Computer Vision, Natural Language Processing.
Varshith Kaki is a 3rd Year Undergraduate student at IIITS. Research interests: Machine Learning, Computer Vision, Software Engineering and Programming Languages.
Shubham Bhagat is a 4th Year Undergraduate student at IIITS. He qualified for Google Summer of Code in 2020.
Abhinav Talari is a 4th Year Undergraduate student at IIITS.

 

REFERENCES