
Task 9: Sentiment Analysis for Code-Mixed social media text @ SemEval 2020

CodaLab Page: Link

RATIONALE

Mixing languages, also known as code-mixing, is the norm in multilingual societies. Multilingual people who are non-native English speakers tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. In addition to mixing languages at the sentence level, it is fairly common to find code-mixing behavior at the word level. This linguistic phenomenon poses a great challenge to conventional NLP systems, which currently rely on monolingual resources to handle the combination of multiple languages. The objective of this proposal is to bring the attention of the research community to the task of sentiment analysis in code-mixed social media text. Specifically, we focus on the combination of English with Spanish (Spanglish) and Hindi (Hinglish), which are the 3rd and 4th most spoken languages in the world, respectively.

 

Hinglish and Spanglish - the Modern Urban Languages 


The evolution of social media texts such as blogs, micro-blogs (e.g., Twitter), and chats (e.g., WhatsApp and Facebook messages) has created many new opportunities for information access and language technology, but it has also posed many new challenges, making it one of the prime current research areas. Although current language technologies are primarily built for English, non-native English speakers combine English and other languages when they use social media. In fact, statistics show that half of the messages on Twitter are in a language other than English (Schroeder, 2010). This evidence suggests that the NLP community needs to consider other languages as well, including multilinguality and code-mixing.


Code-mixing poses several unseen difficulties for NLP tasks such as word-level language identification, part-of-speech tagging, dependency parsing, machine translation, and semantic processing. Conventional NLP systems rely heavily on monolingual resources to address code-mixed text, which limits their ability to properly handle issues like English-based phonetic typing, word-level code-mixing, and others. The following two phrases are examples of code-mixing in Spanglish and Hinglish. In the Spanglish example, in addition to the code-mixing at the sentence level, the word pushes conjugates the English word push according to Spanish grammar rules, showing that code-mixing can also happen at the word level. In the Hinglish example, only one English word (enjoy) is used, but, more noticeably, the Hindi words are written in English phonetic typing rather than the Devanagari script, a popular practice in India.


No me pushes, please
Eng. Trans.: Do not push me, please

Aye aur enjoy kare
Eng. Trans.: come and enjoy



Additionally, code-mixing frequently occurs in informal settings like social media platforms, which brings more challenges such as flexible grammar, creative spelling, arbitrary punctuation, slang, genre-specific terminology and abbreviations, among others. This whole scenario opens up new research lines where the focus goes beyond simply combining monolingual resources to address the multilingual code-mixing phenomenon in social media environments.


Naturally, code-mixing is more common in geographical regions with a high percentage of bi- or multilingual speakers, such as Texas and California in the US, Hong Kong and Macao in China, many European and African countries, and the countries of South-East Asia. Multilinguality and code-mixing are also very common in India. Here we propose a sentiment analysis shared task on code-mixed social media text in Spanish and Hindi, the 3rd and 4th most widely spoken languages in the world, mixed with English.


Although code-mixing has received considerable attention recently, properly annotated data is still scarce. In this shared task, we will release 20K annotated tweets with word-level language and tweet-level sentiment labels. We believe that the wide interest in the task of sentiment analysis can attract the attention of the NLP community to the code-mixing phenomenon.


The SentiMix task - A summary 

The task is to predict the sentiment of a given code-mixed tweet. The sentiment labels are positive, negative, or neutral, and the code-mixed language pairs are English-Hindi and English-Spanish. Besides the sentiment labels, we will also provide language labels at the word level. The word-level language tags are en (English), spa (Spanish), hi (Hindi), mixed, and univ (e.g., symbols, @ mentions, hashtags). Table 1 shows examples of annotated tweets. Participants will be provided with training, development, and test data to report the performance of their sentiment analysis systems. Performance will be measured in terms of Precision, Recall, and F-measure.
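As an illustration of the two levels of annotation, the tags could be serialized in a simple CoNLL-style layout: a meta line carrying the tweet id and sentiment, followed by one token/language-tag pair per line. The layout and tag placement below are a hypothetical sketch, not the official release format:

```python
# Hypothetical CoNLL-style serialization of annotated tweets
# (the actual data release format may differ).
SAMPLE = """\
meta\t1\tpositive
Aye\thi
aur\thi
enjoy\ten
kare\thi

meta\t2\tnegative
No\tspa
me\tspa
pushes\tmixed
,\tuniv
please\ten
"""

def read_tweets(text):
    """Parse blank-line-separated tweet blocks into dicts with
    a tweet id, a sentiment label, and (token, lang) pairs."""
    tweets = []
    for block in text.strip().split("\n\n"):
        lines = [line.split("\t") for line in block.splitlines()]
        _, tweet_id, sentiment = lines[0]          # meta line
        tokens = [(tok, lang) for tok, lang in lines[1:]]
        tweets.append({"id": tweet_id, "sentiment": sentiment, "tokens": tokens})
    return tweets
```

For example, `read_tweets(SAMPLE)` yields two tweets, the second carrying the word-level tag mixed for pushes, reflecting the word-level code-mixing discussed earlier.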

 

 

Data & Resources 

The organizing team has collected and annotated the corpus. The source of the corpus is social media, specifically Twitter. To collect code-mixed data, an extensive list of Twitter handles and pages exhibiting substantial code-mixing was prepared. For both language pairs, Spanglish and Hinglish, a corpus of 20,000 annotated tweets will be released. The data is annotated with tweet-level sentiment and word-level language.

To ensure the quality of the annotation, the data was annotated semi-automatically. A baseline word-level language identifier and a tweet-level sentiment analyzer were used to obtain initial annotations, which were then crowdsourced for correction. Finally, manual quality checking was performed for each crowdsource annotator, and annotators whose quality fell below a specified threshold were discarded. Each tweet was annotated by at least two crowdsource annotators and further manually checked. Only those tweets were chosen for which the inter-annotator agreement (kappa score) is above 0.8.
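The agreement threshold above can be computed with Cohen's kappa over the two annotators' label sequences; a minimal self-contained sketch (the function name is our own):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa between two annotators' label sequences:
    observed agreement corrected for chance agreement."""
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    # observed agreement: fraction of items the annotators label identically
    po = sum(a == b for a, b in zip(ann1, ann2)) / n
    # expected agreement from each annotator's marginal label distribution
    c1, c2 = Counter(ann1), Counter(ann2)
    pe = sum((c1[label] / n) * (c2[label] / n) for label in set(ann1) | set(ann2))
    return (po - pe) / (1 - pe)
```

With this definition, perfect agreement yields a kappa of 1.0, so the 0.8 cutoff retains only tweets whose labels the annotators largely agree on beyond chance.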

 

 

Pilot Task 
Code-mixing has received significant research attention in the last few years. There have been three successful editions of the workshop on Computational Approaches to Linguistic Code-Switching (CALCS). At EMNLP 2014, the first CALCS workshop (Solorio et al., 2014) received a total of 18 regular workshop submissions, of which 8 were accepted; additionally, 7 teams participated in the shared task on language identification. At EMNLP 2016, the second CALCS workshop (Molina et al., 2016) received 19 regular workshop submissions, of which 17 were accepted, and 9 teams participated in the shared task. At ACL 2018, the third CALCS workshop (Aguilar et al., 2018) received 19 regular workshop submissions, of which 11 were accepted; the shared task on named entity recognition attracted 9 teams. Thamar, one of the proposers of the current task, is the organizer of the CALCS workshop series.

Four successful editions of the Mixed Script Information Retrieval (MSIR) shared task (Saha Roy et al., 2013; Choudhury et al., 2014; Sequiera et al., 2015; Banerjee et al., 2016) have been organized with the Forum for Information Retrieval Evaluation (FIRE). The tasks addressed a range of issues: word-level language identification, IR for code-mixed languages, and question answering for code-mixed languages. Amitava, one of the proposers of this task, was one of the organizers. In all the successive years, 10+ teams participated in the task series.

Two successful shared tasks on POS tagging for code-mixed languages have been organized with the International Conference on Natural Language Processing (ICON) in 2015 and 2016 (Das, 2015-2016). Amitava, one of the proposers of this task, was the organizer. Altogether, 5 and 7 teams participated in 2015 and 2016, respectively.

Despite these successful workshops and events, we feel that more efforts are needed, and SemEval is the ideal forum to organize the sentiment analysis task on code-mixed language.

 

Expected Impacts 

Although code-mixing has received considerable attention in recent years, properly annotated data is still scarce. In this shared task, we will release 20K annotated tweets with word-level language labels and tweet-level sentiment labels. Although the task mainly focuses on the sentiment analysis problem, the data will also serve the broader NLP community interested in the code-mixing problem for these two language pairs.

 

Evaluation Ranking and the Baseline 

The metric for evaluating the participating systems will be as follows: we will use F1 averaged across the positive, negative, and neutral classes. The final ranking will be based on this average F1 score.

For further theoretical discussion, we will also release macro-averaged recall (recall averaged across the three classes), since it has better theoretical properties than macro-averaged F1 (Esuli and Sebastiani, 2015) and provides better consistency.
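The two ranking quantities can be computed from per-class precision, recall, and F1 and then averaged over the three sentiment classes; a minimal reference sketch (the function names are our own):

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for one sentiment class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_scores(gold, pred, labels=("positive", "negative", "neutral")):
    """Macro-averaged F1 (the ranking metric) and macro-averaged recall."""
    scores = [per_class_prf(gold, pred, label) for label in labels]
    macro_recall = sum(r for _, r, _ in scores) / len(labels)
    macro_f1 = sum(f for _, _, f in scores) / len(labels)
    return macro_f1, macro_recall
```

Note that macro averaging weights the three classes equally regardless of how many tweets each contains, which is why it behaves better than accuracy on imbalanced sentiment distributions.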

Each participating team will initially have access to the training data only. Later, the unlabelled test data will be released. After SemEval-2020, the labels for the test data will be released as well. We will ask the participants to submit their predictions in a specified format (within 24 hours), and the organizers will calculate the results for each participant. We will make no distinction between constrained and unconstrained systems, but the participants will be asked to report what additional resources they have used for each submitted run.



Baseline: Two baseline systems will be released, for Spanglish and Hinglish sentiment analysis. A clear description of the systems (ML model details, features, etc.) will be made available through a GitHub repository. The systems will be available in Python. We will host the task on CodaLab.
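Until the official baselines are released, the simplest possible floor for comparison is a majority-class predictor; the sketch below is our own illustration, not the baseline the organizers will publish:

```python
from collections import Counter

class MajorityBaseline:
    """Predicts the most frequent training label for every tweet;
    a floor that any real Spanglish/Hinglish system should beat."""

    def fit(self, labels):
        # remember the single most common sentiment label in the training set
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, tweets):
        # ignore the tweet content entirely and always output the majority label
        return [self.majority for _ in tweets]
```

Because it always predicts one class, such a baseline scores zero F1 on the other two classes, so its macro-averaged F1 is low even when the label distribution is skewed.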

 

Organizers

Dr. Amitava Das is a Lead Scientist at Wipro AI Labs, Bangalore, India. Currently, Dr. Das is actively working on Code-Mixing and Social Computing.

Thamar Solorio is an Associate Professor at the Computer Science Department of the University of Houston. Her research interests span several key NLP areas, including code-mixing, authorship attribution, and plagiarism detection.

Gustavo Aguilar is a PhD student in Computer Science at the University of Houston. His research interests lie in the fields of Natural Language Processing and Machine Learning, particularly the computational analysis of linguistic code-switching.

Prof. Tanmoy Chakraborty is an Assistant Professor and a Ramanujan Fellow at Indraprastha Institute of Information Technology Delhi (IIIT-D). His broad research interests include Data Mining, Complex Networks, Social Computing, and NLP.

Prof. Bjorn Gamback is a Professor at NTNU, Norway, and also a Senior Advisor at RISE SICS AB in Stockholm, Sweden. He has been actively working on Machine Translation, Code-Mixing, and Sentiment Analysis.

Dr. Dan Garrette is a Research Scientist at Google Research in New York. His research broadly focuses on Natural Language Processing and Machine Learning. He is currently actively working on code-mixing.

Srinivas PYKI is a PhD student in Computer Science at the IIIT Sri City. His research interests lie in the field of Natural Language Processing and Social Computing.

 

Student Organizers

Parth Patwa is an undergraduate student at IIIT Sri City.
Suraj Pandey is a Master's student at IIIT Delhi.

 

References

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, and Thamar Solorio. 2018. Overview of the CALCS 2018 Shared Task: Named Entity Recognition on Code-switched Data. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia. Association for Computational Linguistics.

 

Somnath Banerjee, Kunal Chakma, Sudip Kumar Naskar, Amitava Das, Paolo Rosso, Sivaji Bandyopadhyay, and Monojit Choudhury. 2016. Overview of the mixed script information retrieval (MSIR). In Proceedings of FIRE 2016. FIRE.

 

Monojit Choudhury, Gokul Chittaranjan, Parth Gupta, and Amitava Das. 2014. Overview of the FIRE 2014 track on transliterated search. In Proceedings of FIRE 2014. FIRE.

 

Amitava Das. 2015-2016. Tool contest on POS tagging for code-mixed Indian social media (Facebook, Twitter, and WhatsApp).

 

Andrea Esuli and Fabrizio Sebastiani. 2015. Optimizing text quantifiers for multivariate loss functions. ACM Trans. Knowl. Discov. Data, 9(4):27:1–27:27.

 

Giovanni Molina, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, Nicolas Rey-Villamizar, Mona Diab, and Thamar Solorio. 2016. Overview for the second shared task on language identification in code-switched data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 40–49. ACL.

 

Rishiraj Saha Roy, Monojit Choudhury, Prasenjit Majumder, and Komal Agarwal. 2013. Overview and datasets of the FIRE 2013 track on transliterated search. In Proceedings of FIRE 2013. FIRE.

 

Stan Schroeder. 2010. Half of messages on Twitter aren’t in English [STATS]. http://mashable.com/2010/02/24/half-messages-twitter-english/.

 

Royal Sequiera, Monojit Choudhury, Parth Gupta, Paolo Rosso, Shubham Kumar, Somnath Banerjee, Sudip Kumar Naskar, Sivaji Bandyopadhyay, Gokul Chittaranjan, Amitava Das, and Kunal Chakma. 2015. Overview of the FIRE 2015 shared task on mixed script information retrieval. In Proceedings of FIRE 2015. FIRE.

 

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 62–72. ACL.

 

IMPORTANT DATES

Trial data ready July 31, 2019

Training data ready September 4, 2019

Test data ready December 3, 2019

Evaluation start January 10, 2020

Evaluation end January 31, 2020

Paper submission due February 23, 2020

Notification to authors March 29, 2020

Camera ready due April 5, 2020

SemEval workshop Summer 2020