The deadline for submissions is 17th August, 2023 ❗. Please check the dates below.
mt-metrics-eval: the tool for calculating correlation numbers, i.e. the sacreBLEU for metric developers. You can also use it to dump the most recent test sets.
NEW: Codalab submission platform
| Milestone | Date |
| --- | --- |
| Breaking round for challenge sets | 25th July, 2023 |
| System outputs ready to download | 10th August, 2023 |
| Submission deadline for metrics task | 17th August, 2023 ❗ |
| Paper submission deadline to WMT | 5th September, 2023 |
| WMT notification of acceptance | 6th October, 2023 |
| WMT camera-ready deadline | 18th October, 2023 |
| Conference | 6th - 7th December, 2023 |
The goals of the shared metrics task are:
We will provide you with the source sentences, the outputs of machine translation systems, and reference translations.
Please fill in the following registration form so we can keep track of participants. Since we will be using Codalab to handle all submissions, you will also have to create an account on Codalab and enroll in the competition.
You are invited to submit a short paper (4 to 6 pages) to WMT describing your automatic evaluation metric. Shared task description papers are non-archival, and you are not required to submit a paper if you do not want to. If you do not, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
Your software should produce scores for the translations at the system level, at the segment level, or preferably both.
Output file format for system-level rankings
The output files for system-level scores should be called YOURMETRIC.sys.score and formatted as tab-separated values (TSV) in the following way:
METRIC-NAME\tLANG-PAIR\tTESTSET\tDOMAIN\tREFERENCE\tSYSTEM-ID\tSYSTEM-SCORE
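For illustration, a minimal Python sketch that writes a file in this layout is shown below; the metric name, test set, domain, reference, system identifiers, and scores are all hypothetical placeholders.

```python
import csv

# Minimal sketch: write a system-level score file in the required TSV layout.
# All field values here are hypothetical placeholders.
rows = [
    ["MyMetric", "en-de", "generaltest2023", "news", "refA", "SYSTEM-1", 0.712],
    ["MyMetric", "en-de", "generaltest2023", "news", "refA", "SYSTEM-2", 0.695],
]
with open("MyMetric.sys.score", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(rows)
```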
Output file format for segment-level scores
The output files for segment-level scores should be called YOURMETRIC.seg.score and formatted as tab-separated values (TSV) in the following way:
METRIC-NAME\tLANG-PAIR\tTESTSET\tDOMAIN\tDOCUMENT\tREFERENCE\tSYSTEM-ID\tSEGMENT-NUMBER\tSEGMENT-SCORE
Each field should be delimited by a single tab character.
Where:

* METRIC-NAME is the name of your metric (this must match the official metric name of your submission).
* LANG-PAIR is the language pair, e.g. en-de.
* TESTSET is the name of the test set.
* DOMAIN is the domain being scored.
* DOCUMENT is the document identifier (segment-level files only).
* REFERENCE is the identifier of the reference translation used, or src for reference-free (QE) metrics.
* SYSTEM-ID is the identifier of the scored MT system.
* SEGMENT-NUMBER is the index of the scored segment (segment-level files only).
* SYSTEM-SCORE / SEGMENT-SCORE is the score your metric assigns.
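A segment-level file follows the same pattern with the two extra fields. For illustration, a sketch analogous to the one above (all values hypothetical; src in the REFERENCE field marks a reference-free metric, see below):

```python
# Minimal sketch: write a segment-level score file in the required TSV layout.
# Values are hypothetical; "src" in the REFERENCE field marks a reference-free metric.
rows = [
    ("MyMetric-QE", "en-de", "generaltest2023", "news", "doc1", "src", "SYSTEM-1", "1", "0.81"),
    ("MyMetric-QE", "en-de", "generaltest2023", "news", "doc1", "src", "SYSTEM-1", "2", "0.64"),
]
with open("MyMetric-QE.seg.score", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")
```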
This year we will be using Codalab to handle all submissions.
Create a zip archive containing YOURMETRIC.sys.score and/or YOURMETRIC.seg.score, and upload it using the Submit / View Results tab under Participate. Fill in the meta-information for your team and metric, as prompted.
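For example, the archive can be created with a few lines of Python; the file and archive names below are placeholders.

```python
import zipfile

# Package the score files for upload to Codalab.
# Replace the file names with your own metric's files.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("MyMetric.sys.score")
    zf.write("MyMetric.seg.score")
```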
Each submission should contain scores for only one metric. The official metric name is the one that appears in the METRIC-NAME field in the score files.
Reference-free (aka QE) metrics must have src in the REFERENCE field; if a metric is reference-free, it must be reference-free for both system- and segment-level scores.
If your submission contains only segment-level scores, we will fill in system-level scores by averaging.
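As an illustration, averaging segment scores into system scores could look like the sketch below; the grouping keys and file names are placeholders, not the exact procedure used in the official computation.

```python
from collections import defaultdict

# Sketch: derive system-level scores from a segment-level file by averaging
# segment scores. Grouping by every field except DOCUMENT and SEGMENT-NUMBER
# is an illustrative choice; file names are placeholders.
totals, counts = defaultdict(float), defaultdict(int)
with open("MyMetric.seg.score", encoding="utf-8") as f:
    for line in f:
        metric, lp, testset, domain, _doc, ref, system, _seg, score = line.rstrip("\n").split("\t")
        key = (metric, lp, testset, domain, ref, system)
        totals[key] += float(score)
        counts[key] += 1

with open("MyMetric.sys.score", "w", encoding="utf-8") as f:
    for key, total in totals.items():
        f.write("\t".join(key) + f"\t{total / counts[key]:.6f}\n")
```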
You can make multiple submissions. Each new submission must have a different metric name, and one submission must be designated as primary. Only this submission will participate in the official evaluation (final metric ranking). To designate a submission as primary, include PRIMARY in the Description field. You can update this field to change your primary submission at any time before the evaluation ends.
Primary submissions must include, at minimum, segment-level scores for all official language pairs (Hebrew-English, Chinese-English, English-German).
Verify submission:
After uploading your submission, check that its status (under Submit / View Results) shows Finished and that a numerical score appears in the Score column. This can take some time. Use the Results tab to check your correlation scores on the leaderboard. These are correlations with an automatic metric, and will not reflect your final correlations with human MQM scores, nor your true ranking compared to other submissions. However, if they are very low or negative, it could indicate a problem with your scores.
If there is a problem uploading your submission, its status will be Failed, and an Error display will show the reasons for the failure. You can get other information from the links under the Error panel. You can also test your submission offline by running the scoring script yourself.
There is currently no way to remove failed submissions or to replace existing valid submissions on Codalab. To indicate that you do not wish us to use a submission, include DISCARD in its Description field.
Since data from previous WMT editions might be difficult to navigate, we have uploaded previous years' data to Hugging Face Datasets. You can find DA, MQM and SQM annotations from previous years at the following links: wmt-da-human-evaluation, wmt-mqm-human-evaluation, wmt-sqm-human-evaluation.
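For example, the DA annotations can be loaded with the datasets library; the repository path below, including the user/organization prefix, is an assumption, so check the links above for the exact identifier.

```python
from datasets import load_dataset

# Load previous years' DA annotations from Hugging Face.
# The repository ID (including any user/organization prefix) is assumed here;
# check the dataset links above for the exact path.
da = load_dataset("RicardoRei/wmt-da-human-evaluation", split="train")
print(da[0])
```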
If you wish to find the original data, please check the previous editions tab; the original DAs can be found in the results section. For MQM, you can find the data here.