
Evaluation 


Subtask 1 will be evaluated in two ways, following the SemEval-2021 Task 12 proposal on learning with disagreements (Uma et al. 2021). First, with the standard classification metric F1, computed after resolving the disagreement cases in the gold standard. Second, with the cross-entropy between the system output values and the soft labels generated by a probabilistic normalization procedure.
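As a concrete illustration of the second metric, the sketch below computes the cross-entropy between a soft gold label and a system's output distribution. It assumes the soft labels come from normalizing annotator votes, which is one possible probabilistic normalization; the numbers are illustrative and this is not the official scorer.

```python
# Illustrative sketch (not the official scorer) of the soft-label
# evaluation for subtask 1: cross-entropy between the gold soft
# label and the system's output distribution.
import math

def cross_entropy(gold_soft, system_probs, eps=1e-12):
    """Cross-entropy H(gold, system) over the class distribution.

    Lower is better; it is minimized when the system reproduces
    the gold soft label exactly.
    """
    return -sum(g * math.log(s + eps)
                for g, s in zip(gold_soft, system_probs))

# Hypothetical example: 4 of 6 annotators chose the positive class,
# giving the soft label (2/6, 4/6) after vote normalization.
gold = [2 / 6, 4 / 6]
system = [0.25, 0.75]
print(round(cross_entropy(gold, system), 4))  # 0.6539
```

Note that, unlike F1, this metric rewards systems whose output probabilities track the level of annotator disagreement rather than a single hard decision.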


Subtask 2 is a multi-label hierarchical classification problem. Since the appropriateness of evaluation metrics for such a task is still an open issue, we will consider three metrics:


  • Hierarchical F-measure (Costa et al. 2007), which combines, for each sample (sentence), the precision and recall computed between the predicted and gold label sets after augmenting each with its ancestors in the hierarchy.


  • Propensity F-measure (Jain et al. 2016), which combines, for each sample (sentence), the precision and recall of the label sets (without considering ancestors), with an additional propensity factor that gives more weight to infrequent, i.e. more specific, classes.


  • The ICM metric (submitted paper), an information-theoretic metric that takes into account both the hierarchical structure and the specificity of the classes.
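Of the three, the hierarchical F-measure is the simplest to make concrete. The sketch below augments each label set with its ancestors before computing precision and recall, so that predicting a sibling of the gold class still earns partial credit for the shared ancestor. The two-level taxonomy, class names, and parent map are made up for illustration and are not the task's actual label set.

```python
# Illustrative sketch of the hierarchical F-measure (Costa et al. 2007)
# for one sample. The taxonomy below is hypothetical.
PARENT = {
    "misogyny": None,            # top-level class (root omitted)
    "objectification": "misogyny",
    "dominance": "misogyny",
}

def with_ancestors(labels):
    """Close a label set under the ancestor relation."""
    closed = set()
    for lab in labels:
        while lab is not None:
            closed.add(lab)
            lab = PARENT[lab]
    return closed

def hierarchical_f(gold, pred):
    """Hierarchical F1 between one sample's gold and predicted label sets."""
    g, p = with_ancestors(gold), with_ancestors(pred)
    inter = len(g & p)
    hp = inter / len(p) if p else 0.0   # hierarchical precision
    hr = inter / len(g) if g else 0.0   # hierarchical recall
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0

# Predicting a sibling of the gold leaf still scores 0.5,
# because both closed sets share the ancestor "misogyny".
print(hierarchical_f({"objectification"}, {"dominance"}))  # 0.5
```

The propensity F-measure and ICM replace this ancestor-based partial credit with class-frequency and information-content weighting, respectively, as described in the cited papers.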


The global results for all submitted runs are included in the following Excel files:

Results - Task 1

Results - Task 2


Next, a ranking per team (selecting each team's best-scoring run) is shown for both subtasks:


Table A. Ranking of results for subtask 1


Table B. Ranking of results for subtask 2

References

 

Costa, E. P., Lorena, A. C., Carvalho, A. C. P. L. F. & Freitas, A. (2007). A review of performance evaluation measures for hierarchical classifiers. In AAAI Workshop - Technical Report.


Jain, H., Prabhu, Y. & Varma, M. (2016). Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 935–944, New York, NY, USA. Association for Computing Machinery.


Uma, A., Fornaciari, T., Dumitrache, A., Miller, T., Chamberlain, J., Plank, B., Simpson, E. & Poesio, M. (2021). SemEval-2021 Task 12: Learning with Disagreements. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pp. 338–347. Association for Computational Linguistics. DOI: 10.18653/v1/2021.semeval-1.41
