NVIDIA releases 1M-hour speech dataset for 25 European languages

NVIDIA has released Granary, an open-source multilingual speech dataset containing roughly one million hours of audio, alongside two new AI models for transcription and translation across 25 European languages. The release addresses a critical gap in speech AI development: only a tiny fraction of the world's roughly 7,000 languages are currently supported by AI language models, and the effort puts particular focus on underrepresented European languages such as Croatian, Estonian, and Maltese.

What you should know: The Granary dataset represents a massive leap forward in multilingual speech AI training data, providing developers with ready-to-use resources for production-scale applications.

  • The dataset includes nearly 650,000 hours for speech recognition and over 350,000 hours for speech translation.
  • It covers nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian.
  • The dataset was developed through collaboration between NVIDIA’s speech AI team, Carnegie Mellon University researchers, and Fondazione Bruno Kessler, an Italian research institute.

The big picture: NVIDIA’s approach tackles data scarcity through an innovative processing pipeline that transforms unlabeled audio into structured, high-quality training data without requiring resource-intensive human annotation.

  • The team demonstrated that it takes around half as much Granary training data to achieve target accuracy levels compared to other popular datasets.
  • The processing pipeline is available open source on GitHub through the NVIDIA NeMo Speech Data Processor toolkit.
  • This methodology can be adapted by developers for other automatic speech recognition (ASR) or automatic speech translation (AST) models.

In plain English: Traditional speech AI development requires humans to manually label hours of audio recordings—a time-consuming and expensive process. NVIDIA’s new approach uses AI to automatically process raw audio files and turn them into clean, structured training data that other AI models can learn from, much like having a smart assistant organize messy files into neat folders.
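The pseudo-labeling idea described above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not NVIDIA's actual pipeline: `run_asr` is a stub standing in for a real model call (e.g., via NeMo), and the confidence threshold and filtering rules are assumptions chosen for the example.

```python
# Hypothetical sketch of a pseudo-labeling pipeline in the spirit of
# Granary's approach: run an ASR model over unlabeled audio, keep only
# high-confidence transcripts, and lightly normalize the text.
# `run_asr` is a stand-in for a real model call; it is stubbed here so
# the filtering logic can be shown end to end.

from dataclasses import dataclass

@dataclass
class PseudoLabel:
    audio_id: str
    text: str
    confidence: float

def run_asr(audio_id: str) -> PseudoLabel:
    # Stub: a real pipeline would decode the audio with an ASR model.
    fake_outputs = {
        "clip_001": ("hello world", 0.97),
        "clip_002": ("[noise]", 0.41),
        "clip_003": ("  Guten Tag  ", 0.88),
    }
    text, conf = fake_outputs[audio_id]
    return PseudoLabel(audio_id, text, conf)

def build_training_pairs(audio_ids, min_confidence=0.8):
    """Keep only confident, non-empty transcripts; normalize whitespace."""
    pairs = []
    for audio_id in audio_ids:
        label = run_asr(audio_id)
        text = " ".join(label.text.split())
        if label.confidence >= min_confidence and text and not text.startswith("["):
            pairs.append((audio_id, text))
    return pairs

pairs = build_training_pairs(["clip_001", "clip_002", "clip_003"])
print(pairs)  # the low-confidence clip_002 is filtered out
```

The key design point is that no human ever labels the audio: the model's own confidence score decides which machine-generated transcripts are clean enough to become training data.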

Key models released: NVIDIA introduced two complementary AI models trained on the Granary dataset, each optimized for different use cases.

  • NVIDIA Canary-1b-v2: A billion-parameter model optimized for high-quality transcription and translation between English and two dozen supported languages, offering comparable quality to models 3x larger while running inference up to 10x faster.
  • NVIDIA Parakeet-tdt-0.6b-v3: A streamlined 600-million-parameter model designed for real-time transcription, capable of processing 24-minute audio segments in a single inference pass with automatic language detection.
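Parakeet's 24-minute single-pass window implies a simple chunking strategy for longer recordings. The sketch below uses the 24-minute figure from the announcement, but the chunking math itself is a generic illustration rather than NVIDIA's implementation (a production system would typically also overlap chunks or cut at silence boundaries).

```python
# Split a long recording into segments no longer than Parakeet's stated
# 24-minute single-pass window. Purely illustrative chunking math.

MAX_SEGMENT_S = 24 * 60  # 24 minutes, per the Parakeet-tdt-0.6b-v3 release

def segment_bounds(duration_s: float, max_len_s: float = MAX_SEGMENT_S):
    """Return (start, end) times in seconds covering the whole recording."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# A 50-minute recording needs three passes: 24 + 24 + 2 minutes.
bounds = segment_bounds(50 * 60.0)
print(bounds)
```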

How it works: Both models leverage NVIDIA’s NeMo software suite to deliver enhanced functionality for enterprise applications.

  • NeMo Curator filtered out synthetic examples from source data to ensure only high-quality samples were used for training.
  • Both models provide accurate punctuation, capitalization, and word-level timestamps in their outputs.
  • The models support multilingual chatbots, customer service voice agents, and near-real-time translation services.
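The word-level timestamps mentioned above make downstream features like captioning straightforward. Here is a minimal sketch that packs timestamped words into SRT-style caption blocks; the `(word, start_s, end_s)` input format is an assumed stand-in, not the models' actual output schema.

```python
# Group hypothetical word-level timestamps (word, start_s, end_s) into
# SRT-style caption blocks of up to `max_words` words each.

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=4):
    """Pack up to `max_words` timestamped words into each caption block."""
    blocks = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w for w, _, _ in chunk)
        blocks.append(
            f"{len(blocks) + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}"
        )
    return "\n\n".join(blocks)

words = [("Hello,", 0.0, 0.4), ("world.", 0.5, 0.9),
         ("How", 1.2, 1.4), ("are", 1.45, 1.6), ("you?", 1.65, 2.0)]
srt = words_to_srt(words, max_words=4)
print(srt)
```

Because the models already emit punctuation and capitalization, the caption text needs no post-processing beyond grouping.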

Why this matters: The release democratizes access to high-quality multilingual speech AI technology, particularly benefiting languages with limited available training data.

  • European languages underrepresented in human-annotated datasets now have access to critical resources for developing more inclusive speech technologies.
  • The permissive licensing of Canary-1b-v2 enables widespread adoption and customization by developers.
  • The methodology can accelerate speech AI innovation by providing a replicable framework for other languages and applications.

What’s next: The research behind Granary will be presented at Interspeech, a language processing conference taking place in the Netherlands from August 17-21, with all resources now available on Hugging Face for immediate developer access.

Source: “Now We’re Talking: NVIDIA Releases Open Dataset, Models for Multilingual Speech AI” (NVIDIA blog)
