
Corpus


Training data (gold standard version only)

Test data

README


Our DETESTS dataset is made up of two parts: one from the NewsCom-TOX corpus (Taulé et al., 2021) and the other from the StereoCom corpus. Both corpora consist of comments published in response to articles extracted from Spanish online newspapers (ABC, elDiario.es, El Mundo, NIUS, etc.) and discussion forums (such as Menéame). In the case of NewsCom-TOX, the dates of the articles range from August 2017 to August 2020; in StereoCom, they range from June 2020 to November 2021.


NewsCom-TOX articles were manually selected based on their controversial subject matter, potential toxicity and the number of published comments (a minimum of 50 comments). A keyword-based approach was used to search for articles mainly related to racism and xenophobia. Since the NewsCom-TOX corpus was designed primarily to study toxicity rather than stereotypes, we used only the part of the corpus with the highest percentage of stereotypes. In order to obtain a sufficient data volume, balanced in terms of the presence or absence of stereotypes, the StereoCom corpus was also collected, containing the same type of content (i.e., comments in response to immigration-related news items in Spanish digital media) and selected by subject matter on the basis of a keyword search.


The comments were selected in the order in which they appear in the temporal web thread, while also taking the conversational thread into account. Each comment was segmented into sentences, and for every sentence the comment to which it belongs and its position within that comment are indicated.

 

The DETESTS dataset consists of 5,629 sentences, of which 3,306 correspond to NewsCom-TOX and 2,323 to StereoCom. Overall, 24% of the sentences contain a stereotype.


For each sentence, several features have been annotated: 1) Target, distinguishing between "Racial_target" (understanding racial as a group defined by origin, race, ethnicity or religion) and "Other_target" (any other minority or oppressed collective); 2) Presence or absence of a stereotype; 3) Implicitness of the message, i.e. whether the stereotype is expressed explicitly or implicitly; and 4) Classification: assignment of the sentences previously annotated as containing a stereotype to one of ten predefined subtypes. All the features are annotated with binary values (0 = absence of the feature, 1 = presence of the feature), with each of the ten "Classification" subtypes annotated as a separate binary label.
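To illustrate the annotation scheme, here is a minimal sketch of what one annotated sentence might look like, along with a simple consistency check. The field names (`comment_id`, `subtype_1`, etc.) are hypothetical placeholders; the official data files may use different column names.

```python
# Hypothetical record for one annotated sentence (field names are placeholders).
sentence = {
    "comment_id": "c042",   # comment the sentence belongs to
    "position": 2,          # position of the sentence within the comment
    "racial_target": 1,     # 1 = group defined by origin, race, ethnicity or religion
    "other_target": 0,      # 1 = any other minority or oppressed collective
    "stereotype": 1,        # 1 = the sentence contains a stereotype
    "implicit": 0,          # 1 = the stereotype is expressed implicitly
    # One binary flag per predefined subtype (ten in total; two shown here).
    "subtype_1": 1,
    "subtype_2": 0,
}

def is_consistent(s: dict) -> bool:
    """Implicitness and subtype flags only apply when a stereotype is present."""
    subtype_on = any(v == 1 for k, v in s.items() if k.startswith("subtype_"))
    return s["stereotype"] == 1 or (not subtype_on and s["implicit"] == 0)
```

A check like `is_consistent` can help catch loading errors, since subtype and implicitness labels are only meaningful for sentences annotated as containing a stereotype.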


Each sentence is annotated in parallel by three annotators, and an inter-annotator agreement test is carried out once all the sentences in each article have been annotated. Disagreements are then discussed by the annotators and a senior annotator until an agreement is reached. The annotation team consists of two expert linguists and two trained annotators, who are students of linguistics.

 

We will provide participants with 70% of the dataset to train their models, while the remaining 30% will be used to test their models.


To avoid any conflict with the sources of the comments regarding their intellectual property rights (IPR), a password to access the data will be sent privately to each participant interested in the task after filling in the registration form (check our Google Groups). The corpus is made available for research purposes only. The default dataset includes the gold standard annotation. If you wish to apply learning-with-disagreements methods, we will provide, upon request, the pre-aggregated annotations, that is, the individual labels from each annotator.


References


Taulé, Mariona, Alejandro Ariza, Montserrat Nofre, Enrique Amigó, Paolo Rosso (2021). ‘Overview of the DETOXIS Task at IberLEF-2021: DEtection of TOXicity in comments In Spanish’, Procesamiento del Lenguaje Natural, 67: 209-221.

