
Microsoft Speech Language Translation (MSLT) Corpus

The Microsoft Speech Language Translation Corpus release contains conversational, bilingual speech test and tuning data for English, French, and German collected by Microsoft Research. The package includes audio data, transcripts, and translations and allows end-to-end testing of spoken language translation systems on real-world data.


Download
  • Version: 1.0
  • Date Published: 15/07/2024
  • File Name: MSLT_Corpus.zip
  • File Size: 2.0 GB

    All data contained in this release has been created using a non-public version of Skype Translator. NO PRIVATE USER DATA HAS BEEN COLLECTED OR RELEASED. Instead, we hired consultants to have loosely constrained conversations, giving them a list of predefined topics to talk about and a few related questions to start the conversations. Topical constraints were loosely enforced so as to ensure free-form conversations. See the IWSLT paper for more details.

    We release two sets, one containing Test data and the other containing Dev data. Each set contains data for three languages: English, French, and German. For every utterance, we include the audio file in WAVE format, the disfluent transcript, a cleaned-up, segmented, and fluent version of the transcript, and the translation from English into French or German, or vice versa.
  • Supported Operating Systems: Windows 7, Windows 8, Windows 10

    • Windows 8, Windows 10, Android, Apple Mac OS X
    • Click Download and follow the instructions.
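
Once the archive is extracted, the per-utterance files described above (audio, disfluent transcript, fluent transcript, translation) can be collected together for inspection. The following is a minimal Python sketch, not part of the release. It assumes that all files belonging to one utterance share the part of the file name before the first dot. That naming convention and the helper names `group_by_utterance` and `wave_duration_seconds` are hypothetical and should be checked against the actual package contents.

```python
import wave
from collections import defaultdict
from pathlib import Path


def group_by_utterance(root):
    """Group every file under `root` by the part of its name before the first dot.

    ASSUMPTION: files for one utterance share that prefix. This is not
    confirmed by the download page; adjust to the real layout of the corpus.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            utterance_id = path.name.split(".", 1)[0]
            groups[utterance_id].append(path)
    return dict(groups)


def wave_duration_seconds(path):
    """Return the duration in seconds of a PCM WAVE file, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

For example, `group_by_utterance("MSLT_Corpus")` (the directory name is a placeholder for wherever the archive was extracted) returns a dictionary mapping each utterance prefix to its files, and `wave_duration_seconds` can then be called on any `.wav` entry to check audio length before running a speech translation system on it.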