Welcome to the WMT 2022 Metrics Shared Task!

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with MT system outputs along with the source text and the human reference translations. We are looking for automatic metric scores for translations at the system level and the segment level. We will calculate the system-level and segment-level correlations of your scores with human judgements.

We invite submissions of reference-free metrics in addition to reference-based metrics.

Have questions or suggestions? Feel free to Contact Us!

❗ System outputs are already available to score! Please download them from here. We included a README and scripts to help you score the data in the correct format. Please adapt these scripts to your metric and send us an email if you have questions.

Metrics Task Important Dates

Event | Date
System outputs ready to download | 16th August, 2022
Submission deadline for metrics task | 23rd August, 2022
Paper submission deadline to WMT | 7th September, 2022
WMT Notification of acceptance | 9th October, 2022
WMT Camera-ready deadline | 16th October, 2022
Conference | 7th - 8th December, 2022

Goals

The goals of the shared metrics task are:

Task Description

We will provide you with the source sentences, output of machine translation systems and reference translations.

  1. Official results: Correlation with MQM scores at the sentence and system level for the following language pairs:
    • Chinese-English
    • English-Russian
    • English-German
  2. Secondary Evaluation: Correlation with official WMT Direct Assessment (DA) scores at the sentence and system level.
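As a rough illustration of what "correlation at the system level" means, the sketch below compares one metric score per system against one human score per system. The choice of Pearson and Kendall tau is an assumption made for illustration, since this page does not name the exact statistics used for the official results, and all scores in the example are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: one metric score and one human (MQM-like) score per system.
metric_scores = [0.71, 0.64, 0.80, 0.55]
human_scores = [-2.1, -3.0, -1.2, -4.4]

print(f"system-level Pearson:     {pearson(metric_scores, human_scores):.3f}")
print(f"system-level Kendall tau: {kendall_tau(metric_scores, human_scores):.3f}")
```

In this toy example the metric ranks the four systems in the same order as the human scores, so Kendall tau is 1. The same functions can be applied to segment-level scores by passing one score per segment instead of one per system.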

Subtasks:

  1. QE as a Metric: In this subtask, participants have to score machine translation systems without access to reference translations.
  2. Challenge Sets: While other participants are busy building stronger and better metrics, participants in this subtask have to build challenge sets that identify where metrics fail!

Paper Describing Your Metric

You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Shared task submission description papers are non-archival, and you are not required to submit a paper if you do not want to. If you don’t, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.

Training Data

❗ Since data from previous WMT editions can be difficult to navigate, we are adding a table with links to download data from previous years. The new links are in the New: Download links section.

The WMT Metrics shared task has taken place yearly since 2008. You may want to use data from previous editions to tune or train your metric. The following table provides links to the descriptions, the raw data, and the findings papers of the previous editions:

year | MQM | DA system level | DA segment level | relative ranking | paper | .bib
2021 | 🔗 | 🔗 | 🔗 | - | 🔗 | 🔗
2020 | 🔗 | 🔗 | 🔗 | - | 🔗 | 🔗
2019 | - | 🔗 | 🔗 | - | 🔗 | 🔗
2018 | - | 🔗 | 🔗 | - | 🔗 | 🔗
2017 | - | 🔗 | 🔗 | - | 🔗 | 🔗
2016 | - | 🔗 | 🔗 | - | 🔗 | 🔗
2015 | - | - | - | 🔗 | 🔗 | 🔗
2014 | - | - | - | 🔗 | 🔗 | 🔗
2013 | - | - | - | 🔗 | 🔗 | 🔗
2012 | - | - | - | 🔗 | 🔗 | 🔗
2011 | - | - | - | 🔗 | 🔗 | 🔗
2010 | - | - | - | 🔗 | 🔗 | 🔗
2009 | - | - | - | 🔗 | 🔗 | 🔗
2008 | - | - | - | 🔗 | 🔗 | 🔗

You can use any past year’s data to tune the free parameters of your metric, if it has any, for this year’s submission. Additionally, you can use any past data as a test set to compare the performance of your metric against the published results of past years’ participants.

Also, to measure the quality of metrics, especially new ones, we encourage you to use the mt-metrics-eval repo developed by George Foster.

DA data:
year | DA | relative ranks | paper
2017 | 🔗 | 🔗 | Results of the WMT17 Metrics Shared Task
2018 | 🔗 | 🔗 | Results of the WMT18 Metrics Shared Task
2019 | 🔗 | 🔗 | Results of the WMT19 Metrics Shared Task
2020 | 🔗 | 🔗 | Results of the WMT20 Metrics Shared Task

❗: We are not providing links to the Direct Assessments from 2021 because we found bugs in the scores. We advise participants to avoid using that data. For 2021 you can rely on the MQM annotations below 👇.

MQM data:
year | LP | testset | paper
2020 | en-de | 🔗 Newstest2020 | A Large-Scale Study of Human Evaluation for Machine Translation
2020 | zh-en | 🔗 Newstest2020 | A Large-Scale Study of Human Evaluation for Machine Translation
2021 | en-ru | 🔗 Newstest2021 | Results of the WMT21 Metrics Shared Task
2021 | en-de | 🔗 Newstest2021 | Results of the WMT21 Metrics Shared Task
2021 | zh-en | 🔗 Newstest2021 | Results of the WMT21 Metrics Shared Task
2021 | en-ru | 🔗 Ted Talks | Results of the WMT21 Metrics Shared Task
2021 | en-de | 🔗 Ted Talks | Results of the WMT21 Metrics Shared Task
2021 | zh-en | 🔗 Ted Talks | Results of the WMT21 Metrics Shared Task

❗: MQM data for en-de and zh-en was mostly annotated by Google and ranges from -25 to 0, where 0 is a perfect translation and -25 is the worst possible score. In contrast, en-ru data was annotated by Unbabel and ranges from -inf to 100, where 100 is a perfect translation and a score below 0 indicates a bad translation. You can find the original data here, with more information about raters, etc.
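Because the two annotation sources use different scales, it may help to bring the scores onto a comparable scale before tuning a metric on the combined data. The sketch below shows one possible approach, standardizing scores to z-scores within each language pair. This is an illustrative assumption, not a procedure prescribed by the task, and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_normalize_by_group(records):
    """Return (group, z-score) pairs, standardizing scores within each group.

    records: list of (group, score) pairs, e.g. group = language pair.
    """
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)

    # Mean and population standard deviation per group.
    stats = {g: (mean(s), pstdev(s)) for g, s in by_group.items()}

    normalized = []
    for group, score in records:
        mu, sigma = stats[group]
        # Guard against a group whose scores are all identical.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        normalized.append((group, z))
    return normalized


# Hypothetical segment scores on the two scales described above.
records = [
    ("en-de", -25.0), ("en-de", -5.0), ("en-de", 0.0),
    ("en-ru", -40.0), ("en-ru", 60.0), ("en-ru", 100.0),
]
for group, z in z_normalize_by_group(records):
    print(f"{group}\t{z:+.3f}")
```

Both scales are already oriented so that higher is better, so no sign flip is needed. Only the location and spread differ.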

Test Sets (Evaluation Data)

You can download the System outputs from here

Submission Format

Your software should produce scores for the translations at the system level, at the segment level, or preferably both.
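If your metric natively produces only segment-level scores, one common convention is to average them per system to obtain a system-level score. The sketch below assumes that convention. It is not mandated by the task, it does not apply to corpus-level metrics such as BLEU, and the system names and scores are hypothetical.

```python
from collections import defaultdict


def system_scores_from_segments(segment_scores):
    """Average segment-level scores per system.

    segment_scores: iterable of (system_id, segment_score) pairs.
    Returns a dict mapping system_id to the mean segment score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for system_id, score in segment_scores:
        totals[system_id] += score
        counts[system_id] += 1
    return {system_id: totals[system_id] / counts[system_id] for system_id in totals}


# Hypothetical system names and segment scores.
segments = [
    ("systemA", 0.5), ("systemA", 1.0),
    ("systemB", 0.25), ("systemB", 0.75), ("systemB", 0.5),
]
print(system_scores_from_segments(segments))
```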

Along with the data, we release two Python scripts to help you score it. The scripts should be easy to modify to run your metrics, and we advise you to use them.

We also provide four examples of scored data, using BLEU, chrF, BLEURT, and COMET-QE (for QE-as-a-metric), available here.

Output file format for system-level rankings

The output files for system-level rankings should be called YOURMETRIC.sys.score.gz and formatted as tab-separated values (TSV) in the following way:

METRIC-NAME\tLANG-PAIR\tTESTSET\tDOMAIN\tREFERENCE\tSYSTEM-ID\tSYSTEM-SCORE

The output files for segment-level scores should be called YOURMETRIC.seg.score.gz and formatted as tab-separated values (TSV) in the following way:

METRIC-NAME\tLANG-PAIR\tTESTSET\tDOMAIN\tDOCUMENT\tREFERENCE\tSYSTEM-ID\tSEGMENT-NUMBER\tSEGMENT-SCORE

Each field should be delimited by a single tab character.
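The following is a minimal sketch of writing both files with Python's standard library. The helper names `write_scores` and `read_rows`, the absence of a header row, and every field value are assumptions made for illustration. The README and scripts released with the data remain the authoritative reference for the actual values.

```python
import gzip
import os
import tempfile


def write_scores(path, rows, num_fields):
    """Write rows to a gzip-compressed TSV file, one record per line."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            fields = [str(field) for field in row]
            if len(fields) != num_fields:
                raise ValueError(f"expected {num_fields} fields, got {len(fields)}")
            if any("\t" in field or "\n" in field for field in fields):
                raise ValueError("fields must not contain tabs or newlines")
            # Exactly one tab between fields.
            handle.write("\t".join(fields) + "\n")


def read_rows(path):
    """Read a score file back as a list of field lists (handy for checking)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n").split("\t") for line in handle]


# All values below are hypothetical placeholders.
out_dir = tempfile.mkdtemp()
sys_path = os.path.join(out_dir, "MYMETRIC.sys.score.gz")
seg_path = os.path.join(out_dir, "MYMETRIC.seg.score.gz")

# METRIC-NAME, LANG-PAIR, TESTSET, DOMAIN, REFERENCE, SYSTEM-ID, SYSTEM-SCORE
write_scores(sys_path, [
    ("MYMETRIC", "en-de", "testset", "news", "refA", "systemA", 0.75),
], num_fields=7)

# METRIC-NAME, LANG-PAIR, TESTSET, DOMAIN, DOCUMENT, REFERENCE, SYSTEM-ID,
# SEGMENT-NUMBER, SEGMENT-SCORE
write_scores(seg_path, [
    ("MYMETRIC", "en-de", "testset", "news", "doc1", "refA", "systemA", 1, 0.5),
    ("MYMETRIC", "en-de", "testset", "news", "doc1", "refA", "systemA", 2, 1.0),
], num_fields=9)

print(read_rows(sys_path))
```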

Where:

How to submit:

Before you submit, please run your score files through the validation script, which is now available here. You can try it on the BLEU or COMET-QE sys and seg score files in the baselines folder.
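For a quick local sanity check before running the official validation script, something like the sketch below can catch basic formatting problems. It is not the official validator and checks only two things assumed from the format described above: the number of tab-separated fields (7 for system-level files, 9 for segment-level files) and that the last field parses as a number. The function name and sample data are hypothetical.

```python
import gzip
import os
import tempfile

# Field counts taken from the two format lines in the Submission Format section.
EXPECTED_FIELDS = {"sys": 7, "seg": 9}


def check_score_file(path, level):
    """Return a list of problems found in a gzip-compressed TSV score file."""
    expected = EXPECTED_FIELDS[level]
    problems = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != expected:
                problems.append(
                    f"line {line_number}: expected {expected} fields, found {len(fields)}"
                )
                continue
            try:
                float(fields[-1])
            except ValueError:
                problems.append(
                    f"line {line_number}: score {fields[-1]!r} is not a number"
                )
    return problems


# Hypothetical sample file: one good line, one short line, one bad score.
sample_path = os.path.join(tempfile.mkdtemp(), "MYMETRIC.sys.score.gz")
with gzip.open(sample_path, "wt", encoding="utf-8", newline="\n") as handle:
    handle.write("MYMETRIC\ten-de\ttestset\tnews\trefA\tsystemA\t0.5\n")
    handle.write("MYMETRIC\ten-de\ttestset\tnews\trefA\tsystemA\n")
    handle.write("MYMETRIC\ten-de\ttestset\tnews\trefA\tsystemA\tn/a\n")

for problem in check_score_file(sample_path, "sys"):
    print(problem)
```

A file that passes this check may still be rejected by the official script, which should be treated as the final word.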

Please add yourself to this shared spreadsheet so we can keep track of your submissions.

Submissions should be sent to wmt22-metric@googlegroups.com with the subject “WMT Metrics submission”.

You are allowed to submit multiple metrics, but we need you to indicate the primary metric in the email. If submitting more than one metric, please share a folder with all your metrics, for example on Google Drive or Dropbox.

Before August 30th (AOE), please send us an email with:

Organization:

Sponsors