Every year the Metrics shared task pushes for better automatic MT metrics, and in the last few years we have seen great progress, with metrics achieving much higher correlations with human judgements (Mathur et al., 2020; Freitag et al., 2021; Freitag et al., 2022). Yet, while the limitations of metrics such as BLEU are well known in the MT community, we still do not know the limitations that new metrics (especially neural ones) might have. For this reason we created the metric challenge sets subtask.
Inspired by Build it, Break it: The Language Edition, participants in the challenge sets subtask (breakers) are asked to build challenging examples that target specific phenomena currently not addressed by reference-based or reference-less MT evaluation metrics. In addition, we also encourage paper submissions on metrics analysis.
For examples of challenge sets developed to test metrics in 2022, please check the Proceedings of WMT 2022, which contain all the submissions to the challenge sets subtask!
This shared task will have the following three rounds:
1) Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting “challenge sets” to the organizers.
2) Scoring Round: The challenge sets created by Breakers will be randomized and sent to all Metrics participants (Builders) to score. Also, the organizers will score all the data with baseline metrics such as BLEU, chrF, BERTScore, COMET, BLEURT, Prism and YiSi-1.
3) Analysis Round: Breakers will receive their data with all the metrics scores for analysis.
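Challenge sets of this contrastive form are usually evaluated by checking whether a metric prefers the good translation: an example counts as passed when the metric scores the good translation above the incorrect one. A minimal sketch of this accuracy computation, using a hypothetical word-overlap function as a stand-in for a real metric (it is not one of the baseline metrics above):

```python
def toy_metric(translation: str, reference: str) -> float:
    """Hypothetical stand-in metric: fraction of reference words that
    appear in the translation. Illustrative only, not a real MT metric."""
    ref_words = reference.lower().split()
    trans_words = set(translation.lower().split())
    return sum(w in trans_words for w in ref_words) / len(ref_words)

def challenge_accuracy(examples, metric=toy_metric):
    """An example is passed when the metric scores the good translation
    strictly higher than the incorrect translation."""
    passed = sum(
        metric(ex["good-translation"], ex["reference"])
        > metric(ex["incorrect-translation"], ex["reference"])
        for ex in examples
    )
    return passed / len(examples)

# The lexical-ambiguity example from the format table below:
examples = [{
    "good-translation": "The shampoo helps against dandruff.",
    "incorrect-translation": "The shampoo helps against flakes.",
    "reference": "The shampoo helps fight dandruff.",
}]
print(challenge_accuracy(examples))
```

Breakers can run the same computation over the metric scores returned in the analysis round to see which phenomena a given metric fails on.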
Please register your submission with contacts and a short description here.
We expect submissions as a tab-separated values (TSV) file with the following format:
translation-direction | source | good-translation | incorrect-translation | reference | phenomena |
---|---|---|---|---|---|
de-en | Das Shampoo hilft gegen Schuppen. | The shampoo helps against dandruff. | The shampoo helps against flakes. | The shampoo helps fight dandruff. | lexical-ambiguity |
where the translation-direction column contains the translation direction of the entry, the source column contains the original segment, the good-translation column contains a translation that handles the phenomenon correctly (in the above case, a lexical ambiguity), the incorrect-translation column contains a translation that is incorrect with respect to that phenomenon, the reference column contains (ideally) a human translation, and the phenomena column is an identifier for the phenomenon being tested.

Further details about formatting the segments will be updated later.
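Before submitting, the expected layout can be checked with a short script. A sketch assuming the tab-separated format described above, with the six column names from the table as a header row (the organizers' actual validation may differ):

```python
import csv

# Column names as given in the format table above.
EXPECTED_COLUMNS = [
    "translation-direction", "source", "good-translation",
    "incorrect-translation", "reference", "phenomena",
]

def validate_challenge_set(path: str) -> int:
    """Check that the TSV has the expected header and that every row has
    all six fields filled in; return the number of valid rows.
    Illustrative sketch only."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        rows = 0
        for line_no, row in enumerate(reader, start=2):
            if any(not (row[c] or "").strip() for c in EXPECTED_COLUMNS):
                raise ValueError(f"empty field on line {line_no}")
            rows += 1
        return rows
```

Running this on a file containing the dandruff example above should report one valid row.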
Please submit your challenge set(s) here, after having completed the registration above.
For every challenge set, please use the challenge set name that you gave in the registration form, and name the file “challenge_set_name.tsv”. It is also possible to submit a compressed file/archive.
 | Date |
---|---|
Breaking Round registration & submission deadline | |
Scoring Round begins | 10th August, 2023 |
Scoring Round submission deadline | 17th August, 2023 |
Analysis Round begins | 24th August, 2023 |
Paper submission deadline to WMT | 5th September, 2023 |
WMT Notification of acceptance | 6th October, 2023 |
WMT Camera-ready deadline | 18th October, 2023 |
Conference | 6th - 7th December, 2023 |