# Step-by-Step Guide: How the Speech-to-Text-Finetune Blueprint Works
This Blueprint enables you to fine-tune a Speech-to-Text (STT) model using either your own data or the Common Voice dataset. This step-by-step guide walks you through the end-to-end process of finetuning an STT model to fit your needs.
## Overview
This blueprint consists of three independent, yet complementary, components:
- Transcription app 🎙️📝: A simple UI that lets you record your voice, pick any HF STT/ASR model, and get an instant transcription.
- Dataset maker app 📂🎤: A UI app that enables you to easily and quickly create your own Speech-to-Text dataset.
- Finetuning script 🛠️🤖: A script to finetune your own STT model, either using Common Voice data or your own custom data created by the Dataset maker app.
## Prerequisites

- Python dependencies installed: `pip install -e .`
- `ffmpeg` installed:
  - [Ubuntu]: `sudo apt install ffmpeg`
  - [Mac]: `brew install ffmpeg`
- [Optional] Hugging Face login if you plan to track your models: `huggingface-cli login`
- MDC access for Common Voice via the Python SDK:
  - Create an account and get an API key from https://datacollective.mozillafoundation.org/api-reference
  - Create a local `.env` with your MDC API key: `cp example_data/.env.example src/speech_to_text_finetune/.env`
  - Edit `.env` and set `MDC_API_KEY=<your_api_key>`
Visit Getting Started for initial project setup.
## Step-by-Step Guide
### Step 1 - Initial transcription testing
Initially, you can test the quality of the Speech-to-Text models available on Hugging Face by running the Transcription app.
- Run the Transcription app (see the command sketch after this list).
- Select or add the HF model id of your choice.
- Record a sample and inspect the transcription. You may find occasional inaccuracies for your voice, accent, or chosen language, indicating the model could benefit from finetuning on additional data.
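The exact launch command depends on your checkout; as a sketch, assuming the Transcription app ships as a script at `demo/transcribe_app.py` (this path is an assumption, adjust to your repo layout):

```bash
# Launch the Transcription app UI (path is an assumption)
python demo/transcribe_app.py
```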
### Step 2 - Make your Custom Dataset for STT finetuning
- Create your own custom dataset by running the Dataset maker app (command sketched below).
- Follow the instructions in the app to create at least 10 audio samples, which will be saved locally.
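A sketch of the launch command; the script name `demo/make_custom_dataset_app.py` is an assumption, so check the `demo/` directory for the actual filename:

```bash
# Launch the Dataset maker UI to record and save audio samples (script name is an assumption)
python demo/make_custom_dataset_app.py
```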
### Step 3 - Create a finetuned STT model using your custom data
- Configure `config.yaml` (example):

  ```yaml
  model_id: openai/whisper-tiny
  dataset_id: example_data/custom
  language: English  # Set to None for multilingual training or if your language is not supported by Whisper
  repo_name: default
  download_directory: ""  # Ignored for local datasets
  test_size: null  # Ignored here because example_data/custom already provides train/test
  training_hp:
    push_to_hub: False
    hub_private_repo: True
    ...
  ```
Note that if you set `push_to_hub: True`, you need to have an HF account and be logged in locally (see Prerequisites).
- Finetune:
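A sketch of the finetuning invocation, assuming the entry point is `src/speech_to_text_finetune/finetune_whisper.py` and that it reads `config.yaml` from the working directory (both assumptions):

```bash
# Start finetuning with the settings defined in config.yaml (path is an assumption)
python src/speech_to_text_finetune/finetune_whisper.py
```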
> [!TIP]
> You can gracefully stop finetuning with CTRL+C; the evaluation and upload steps will still run.
### Step 4 - Create a finetuned STT model using Common Voice
Pick one of the following:
- Option A: Mozilla Data Collective Python SDK
  - Ensure `.env` contains a valid `MDC_API_KEY` under the `src/speech_to_text_finetune` directory.
  - Find the MDC dataset id for your language (Scripted or Spontaneous).
  - If you want an interactive notebook walkthrough for an MDC dataset, open `demo/mdc_khmer.ipynb` and run the cells in order.
  - Configure `config.yaml` with the MDC dataset id and finetune, as sketched below.
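A sketch of the Option A configuration, reusing the `config.yaml` keys from Step 3; `<mdc_dataset_id>` is a placeholder for the id you copied from the MDC catalog, and the `download_directory` value is an assumption:

```yaml
model_id: openai/whisper-tiny
dataset_id: <mdc_dataset_id>  # placeholder: dataset id from the MDC catalog
language: Khmer               # example: match the language of the MDC dataset
download_directory: mdc_data  # assumption: where the SDK download is stored
```

Then run the same finetuning command as in Step 3.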
- Option B: Local Common Voice download
  - Download from https://datacollective.mozillafoundation.org/datasets and extract locally.
  - Configure `config.yaml` with the local path and finetune, as sketched below.
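Similarly for Option B, pointing `dataset_id` at the extracted directory (the path below is a placeholder):

```yaml
model_id: openai/whisper-tiny
dataset_id: <path/to/extracted/common_voice>  # placeholder: local Common Voice directory
language: English                             # example: match the downloaded language
```

Then run the same finetuning command as in Step 3.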
> [!NOTE]
> The first time a dataset is used, it is processed and cached locally. Subsequent runs reuse the processed version to save time and compute.
### Step 5 - Evaluate transcription accuracy with your finetuned STT model
- Start the Transcription app (see the command sketch after this list).
- Select your HF model id (if pushed) or provide a local model path.
- Record a sample and compare results.
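As in Step 1, a sketch assuming the app script path `demo/transcribe_app.py`:

```bash
# Relaunch the Transcription app and point it at your finetuned model (path is an assumption)
python demo/transcribe_app.py
```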
### Step 6 - Compare transcription performance between two models
- Start the Model Comparison app (see the command sketch after this list).
- Select a baseline model and a comparison model (e.g., your finetuned model).
- Record a sample and review both transcriptions side-by-side.
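A sketch, assuming the comparison UI lives alongside the other demo apps; the filename `demo/model_comparison_app.py` is an assumption:

```bash
# Launch the side-by-side Model Comparison app (script name is an assumption)
python demo/model_comparison_app.py
```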
### Step 7 - Evaluate a model on the Fleurs dataset for a specific language
- Configure the arguments and run:
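A hypothetical invocation; the script name `src/speech_to_text_finetune/evaluate_whisper_fleurs.py` is an assumption, and the evaluation arguments (model id, language) are configured as described in the repo:

```bash
# Evaluate a model on the Fleurs dataset for one language (script name is an assumption)
python src/speech_to_text_finetune/evaluate_whisper_fleurs.py
```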
## 🎨 Customizing the Blueprint
To better understand how you can tailor this Blueprint to suit your specific needs, please visit the Customization Guide.
## 🤝 Contributing to the Blueprint
Want to help improve or extend this Blueprint? Check out the Future Features & Contributions Guide to see how you can contribute your ideas, code, or feedback to make this Blueprint even better!
## 📖 Resources & References
If you are interested in learning more about this topic, you might find the following resources helpful:

- Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers (Blog post by HuggingFace which inspired the implementation of this Blueprint!)
- Automatic Speech Recognition Course from HuggingFace (Series of blog posts)
- Fine-Tuning ASR Models: Key Definitions, Mechanics, and Use Cases (Blog post by Gladia)
- Active Learning Approach for Fine-Tuning Pre-Trained ASR Model for a low-resourced Language (Paper)
- Exploration of Whisper fine-tuning strategies for low-resource ASR (Paper)