Welcome to the WMT 2023 Metrics Shared Task!

This shared task will examine automatic evaluation metrics for machine translation. We will provide you with MT system outputs along with the source text and human reference translations. We are looking for automatic metric scores for translations at the system level and the segment level. We will calculate the system- and segment-level correlations of your scores with human judgements.

We invite submissions of reference-free metrics in addition to reference-based metrics.

Have questions or suggestions? Feel free to Contact Us!

NEW: Metric inputs and Codalab release:

  1. Register your metric here, if you haven’t already.
  2. Create an account on Codalab.
    • You’re allowed one primary submission for a reference-based metric and one primary submission for a reference-free metric. If you are submitting two metrics with widely different approaches (for example, one LLM-based metric and one lexical metric), then create two accounts on Codalab.
  3. Download the data (link; link also available on Codalab)
  4. Prepare your scores:
    • Please follow the guidelines on submission format as described below. The metric inputs download includes sample metrics as well as helper scripts to prepare your scores.
  5. Submit your scores via Codalab:
    • When you submit your metric, Codalab might require some time to process your submission. We’ve observed processing times between a few minutes and two hours during testing. Codalab keeps track of the submission time, so don’t panic if your last-minute submission wasn’t processed before the deadline! Please contact us if it has been longer than 3 hours.
    • After uploading your submission, check its status (under Submit / View Results). It will return an error if there’s an issue with your submission, such as formatting problems.

The deadline for submissions is 17th August, 2023 ❗. Please check the dates below.

mt-metrics-eval: the tool for calculating correlation numbers, aka the sacreBLEU for metric developers. You can also use it to dump the most recent test sets.

NEW: Codalab submission platform

Important Dates

Breaking round for challenge sets: 25th July, 2023
System outputs ready to download: 10th August, 2023
Submission deadline for metrics task: 17th August, 2023 ❗
Paper submission deadline to WMT: 5th September, 2023
WMT notification of acceptance: 6th October, 2023
WMT camera-ready deadline: 18th October, 2023
Conference: 6th–7th December, 2023


Task Description

We will provide you with the source sentences, the output of machine translation systems, and reference translations. The goals of the shared metrics task are:

  1. Official results: Correlation with MQM scores at the sentence and system level for the following language pairs:
    • Hebrew-English (NEW!)
    • Chinese-English
    • English-German (this will be a paragraph-level task!)
  2. Secondary Evaluation: Correlation with official WMT Human Evaluation at the sentence and system level.


There are two additional subtasks:

  1. QE as a Metric: In this subtask, participants score machine translation systems without access to reference translations.
  2. Challenge Sets: While other participants work on building stronger and better metrics, participants in this subtask build challenge sets that identify where metrics fail!

How to participate?

Please fill in the following registration form so we can keep track of participants. Since we will be using Codalab to handle all submissions, you will also have to create an account on Codalab and enroll in the competition.

Paper Describing Your Metric

You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Shared task submission description papers are non-archival, and you are not required to submit a paper if you do not want to. If you do not submit a paper, we ask that you provide an appropriate reference describing your metric that we can cite in the overview paper.

Submission Format

Your software should produce scores for the translations at either the system level or the segment level (or, preferably, both).

Output file format for system-level rankings

The output files for system-level scores should be called YOURMETRIC.sys.score and formatted as tab-separated values (TSV) in the following way:


Output file format for segment-level scores

The output files for segment-level scores should be called YOURMETRIC.seg.score and formatted as tab-separated values (TSV) in the following way:


Each field should be delimited by a single tab character.



This year we will be using Codalab to handle all submissions.

Create a zip archive containing YOURMETRIC.sys.score and/or YOURMETRIC.seg.score, and upload it using the Submit / View Results tab under Participate. Fill in the meta-information for your team and metric, as prompted.
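
The zip step can be scripted; a minimal sketch (with "MyMetric" as a placeholder name, and placeholder files created only so the sketch runs on its own):

```python
# Sketch: bundle the score files into the zip archive Codalab expects.
# "MyMetric" is a placeholder; use your actual metric name.
import zipfile
from pathlib import Path

score_files = ["MyMetric.sys.score", "MyMetric.seg.score"]
for name in score_files:
    Path(name).touch()  # placeholder so the sketch runs; yours already exist

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in score_files:
        zf.write(name)

print(zipfile.ZipFile("submission.zip").namelist())
```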

Each submission should contain scores for only one metric. The official metric name is the one that appears in the METRIC-NAME field in the score files.

Reference-free (aka QE) metrics must have src in the REFERENCE field; if a metric is reference-free, it must be reference-free for both system- and segment-level scores.

If your submission contains only segment-level scores, we will fill in system-level scores by averaging.
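
The averaging fallback is simply the mean of each system's segment scores; a sketch with illustrative system names and scores:

```python
# Sketch: derive system-level scores by averaging segment-level scores,
# as done when only segment scores are submitted. Data is illustrative.
seg_scores = {
    "ONLINE-A": [0.73, 0.61, 0.80],  # per-segment metric scores
    "ONLINE-B": [0.55, 0.58, 0.60],
}

sys_scores = {name: sum(s) / len(s) for name, s in seg_scores.items()}
print(sys_scores)
```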

You can make multiple submissions. Each new submission must have a different metric name, and one submission must be designated as primary. Only this submission will participate in the official evaluation (final metric ranking). To designate a submission as primary, include PRIMARY in the Description field. You can update this field to change your primary submission at any time before the evaluation ends.

Primary submissions must include, at minimum, segment-level scores for all official language pairs (Hebrew-English, Chinese-English, English-German).

Verify submission:

After uploading your submission, check that its status (under Submit / View Results) shows Finished and that a numerical score appears in the Score column. This can take some time. Use the Results tab to check your correlation scores on the leaderboard. These are correlations with an automatic metric and will not reflect your final correlations with human MQM scores, nor your true ranking compared to other submissions. However, if they are very low or negative, it could indicate a problem with your scores.

If there is a problem uploading your submission, its status will be Failed, and an Error display will show the reasons for the failure. You can get other information from the links under the Error panel. You can also test your submission offline by running the scoring script yourself.

There is currently no way to remove failed submissions or replace existing valid submissions on Codalab. To indicate that you do not wish us to use a submission, include DISCARD in its Description field.

Training Data

Since data from previous WMT editions might be difficult to navigate, we have uploaded previous years’ data to Hugging Face Datasets. You can find DA, MQM, and SQM annotations from previous years at the following links: wmt-da-human-evaluation, wmt-mqm-human-evaluation, wmt-sqm-human-evaluation.

If you wish to find the original data, please check the previous editions tab; in the results section you can find the original DAs. For MQM, you can find the data here.