Challenge Sets Subtask

Every year, the Metrics shared task pushes for better automatic MT metrics, and in the last few years we have seen great progress, with metrics achieving much higher correlations with human judgements (Mathur et al., 2020; Freitag et al., 2021; Freitag et al., 2022). Yet, while the limitations of metrics such as BLEU are well known in the MT community, we still do not know what limitations newer metrics (especially neural ones) might have. For this reason, we created the metrics challenge sets subtask.

Inspired by Build it, Break it: The Language Edition, the challenge sets subtask asks participants (Breakers) to build challenging examples that target specific phenomena not currently addressed by reference-based or reference-less MT evaluation metrics. We also encourage paper submissions on metrics analysis.

Examples of challenge sets developed to test metrics in 2022:

Please check the Proceedings of WMT 2022 to find all the submissions to the challenge sets subtask!

This shared task will have the following three rounds:

1) Breaking Round: Challenge set participants (Breakers) create challenging examples for metrics. They must send the resulting “challenge sets” to the organizers.

2) Scoring Round: The challenge sets created by Breakers will be randomized and sent to all Metrics participants (Builders) to score. Also, the organizers will score all the data with baseline metrics such as BLEU, chrF, BERTScore, COMET, BLEURT, Prism and YiSi-1.

3) Analysis Round: Breakers will receive their data with all the metrics scores for analysis.
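
As an illustration of what the Analysis Round might involve, the sketch below computes, for each phenomenon, how often a metric scores the good translation higher than the incorrect one. This is one possible analysis, not a procedure prescribed by the organizers. The field names good_score and incorrect_score are hypothetical, since the format in which scores are returned is not specified here, and the code assumes that a higher score means a better translation.

```python
from collections import defaultdict


def pairwise_accuracy(scored_rows):
    """Per-phenomenon share of examples where the good translation wins.

    Each row is a dict with a "phenomena" identifier plus two hypothetical
    fields, "good_score" and "incorrect_score", holding one metric's scores
    for the good and the incorrect translation (assumed: higher is better).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for row in scored_rows:
        phenomenon = row["phenomena"]
        totals[phenomenon] += 1
        # Ties count as failures: the metric did not prefer the good translation.
        if row["good_score"] > row["incorrect_score"]:
            wins[phenomenon] += 1
    return {p: wins[p] / totals[p] for p in totals}
```

A low value for a phenomenon would suggest that the metric struggles to separate correct from incorrect translations for that phenomenon.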

Registration

Please register your submission with contacts and a short description here.

Submission format

We expect submissions as a tab-separated values (TSV) file with the following format:

translation-direction	source	good-translation	incorrect-translation	reference	phenomena
de-en	Das Shampoo hilft gegen Schuppen.	The shampoo helps against dandruff.	The shampoo helps against flakes.	The shampoo helps fight dandruff.	lexical-ambiguity

Here, the translation-direction column contains the translation direction of the entry; source contains the original segment; good-translation contains a translation that handles the phenomenon correctly (in the example above, a lexical ambiguity); incorrect-translation contains a translation that is incorrect with respect to that phenomenon; reference contains a reference translation, ideally produced by a human; and phenomena contains an identifier for the phenomenon being tested.
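
The column layout above can be sketched in Python as follows. This is a minimal, unofficial sketch: the column names come from the format line above, while the checks in validate_rows (no empty fields, no tabs or newlines inside a field, good and incorrect translations must differ) and the choice to write a header line are our assumptions for illustration, not submission rules.

```python
# Column names taken from the format line of the submission format.
COLUMNS = [
    "translation-direction",
    "source",
    "good-translation",
    "incorrect-translation",
    "reference",
    "phenomena",
]


def validate_rows(rows):
    """Return a list of problems found in the rows (empty if none).

    The checks are illustrative assumptions, not official requirements.
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        values = {c: (row.get(c) or "") for c in COLUMNS}
        empty = [c for c in COLUMNS if not values[c].strip()]
        if empty:
            problems.append(f"row {i}: empty or missing field(s) {empty}")
        if any("\t" in v or "\n" in v for v in values.values()):
            problems.append(f"row {i}: a field contains a tab or newline")
        good = values["good-translation"]
        if good and good == values["incorrect-translation"]:
            problems.append(f"row {i}: good and incorrect translations are identical")
    return problems


def to_tsv(rows):
    """Serialize rows as TSV text, one entry per line.

    Writing the column names as a first line is an assumption.
    """
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        lines.append("\t".join(row[c] for c in COLUMNS))
    return "\n".join(lines) + "\n"
```

Calling validate_rows before to_tsv catches fields that would otherwise shift the columns, such as a translation containing a tab character.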

Further details about formatting the segments will be added later.

Please submit your challenge set(s) here, after having completed the registration above.

For every challenge set file, please use the challenge set name that you gave in the registration form and name the file “challenge_set_name.tsv”. It is also possible to submit a compressed file or archive.

Important Dates:

Event                                               Date
Breaking Round registration & submission deadline   25th July, 2023 (extended from 20th July) ❗
Scoring Round begins                                10th August, 2023
Scoring Round submission deadline                   17th August, 2023
Analysis Round begins                               24th August, 2023
Paper submission deadline to WMT                    5th September, 2023
WMT Notification of acceptance                      6th October, 2023
WMT Camera-ready deadline                           18th October, 2023
Conference                                          6th - 7th December, 2023