# Step-by-Step Guide: How the Speech-to-Text-Finetune Blueprint Works
This Blueprint enables you to fine-tune a Speech-to-Text (STT) model using either your own data or the Common Voice dataset. This step-by-step guide walks you through the end-to-end process of finetuning an STT model to fit your needs.
## Overview
This blueprint consists of three independent, yet complementary, components:
- Transcription app 🎙️📝: A simple UI that lets you record your voice, pick any HF STT/ASR model, and get an instant transcription.
- Dataset maker app 📂🎤: A UI app that enables you to easily and quickly create your own Speech-to-Text dataset.
- Finetuning script 🛠️🤖: A script to finetune your own STT model, either using Common Voice data or your own custom data created by the Dataset maker app.
## Prerequisites

- Python dependencies installed: `pip install -e .`
- `ffmpeg` installed:
  - [Ubuntu]: `sudo apt install ffmpeg`
  - [Mac]: `brew install ffmpeg`
- [Optional] Hugging Face login if you plan to track your models: `huggingface-cli login`
- MDC access for Common Voice via the Python SDK:
  - Create an account and get an API key from https://datacollective.mozillafoundation.org/api-reference
  - Create a local `.env` with your MDC API key: `cp example_data/.env.example src/speech_to_text_finetune/.env`
  - Edit `.env` and set `MDC_API_KEY=<your_api_key>`
Visit Getting Started for initial project setup.
## Step-by-Step Guide
### Step 1 - Initial transcription testing
Initially, you can test the quality of the Speech-to-Text models available on Hugging Face by running the Transcription app.
- Run the Transcription app (see the command sketch after this list).
- Select or add the HF model id of your choice.
- Record a sample and inspect the transcription. You may find occasional inaccuracies for your voice, accent, or chosen language, indicating the model could benefit from finetuning on additional data.
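The exact launch command depends on your checkout; as a sketch, assuming the Transcription app ships as a script at `demo/transcribe_app.py` (this path is an assumption, adjust to your repo layout):

```bash
# Launch the Transcription app UI (path is an assumption)
python demo/transcribe_app.py
```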
### Step 2 - Make your Custom Dataset for STT finetuning
- Create your own custom dataset by running the Dataset maker app (command sketched below).
- Follow the instructions in the app to create at least 10 audio samples, which will be saved locally.
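A sketch of the launch command; the script name `demo/make_custom_dataset_app.py` is an assumption, so check the `demo/` directory for the actual filename:

```bash
# Launch the Dataset maker UI to record and save audio samples (script name is an assumption)
python demo/make_custom_dataset_app.py
```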
### Step 3 - Create a finetuned STT model using your custom data
- Configure `config.yaml` (example):

  ```yaml
  model_id: openai/whisper-tiny
  dataset_id: example_data/custom
  language: English  # Set to None for multilingual training or if your language is not supported by Whisper
  repo_name: default
  download_directory: ""  # Ignored for local datasets
  test_size: null  # Ignored here because example_data/custom already provides train/test
  training_hp:
    push_to_hub: False
    hub_private_repo: True
    ...
  ```
Note that if you set `push_to_hub: True`, you need to have an HF account and be logged in locally (see Prerequisites).
- Finetune:
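A sketch of the finetuning invocation, assuming the entry point is `src/speech_to_text_finetune/finetune_whisper.py` and that it reads `config.yaml` from the working directory (both assumptions):

```bash
# Start finetuning with the settings defined in config.yaml (path is an assumption)
python src/speech_to_text_finetune/finetune_whisper.py
```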
> [!TIP]
> You can gracefully stop finetuning with CTRL+C; the evaluation and upload steps will still run.
### Step 4 - Create a finetuned STT model using Common Voice
Pick one of the following:
- Option A: Mozilla Data Collective Python SDK
  - Ensure `.env` contains a valid `MDC_API_KEY` under the `src/speech_to_text_finetune` directory.
  - Find the MDC dataset id for your language (Scripted or Spontaneous).
  - If you want an interactive notebook walkthrough for an MDC dataset, open `demo/mdc_khmer.ipynb` and run the cells in order.
  - Configure `config.yaml` with the MDC dataset id and finetune, as sketched below.
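A sketch of the Option A configuration, reusing the `config.yaml` keys from Step 3; `<mdc_dataset_id>` is a placeholder for the id you copied from the MDC catalog, and the `download_directory` value is an assumption:

```yaml
model_id: openai/whisper-tiny
dataset_id: <mdc_dataset_id>  # placeholder: dataset id from the MDC catalog
language: Khmer               # example: match the language of the MDC dataset
download_directory: mdc_data  # assumption: where the SDK download is stored
```

Then run the same finetuning command as in Step 3.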
- Option B: Local Common Voice download
  - Download from https://datacollective.mozillafoundation.org/datasets and extract locally.
  - Configure `config.yaml` with the local path and finetune, as sketched below.
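Similarly for Option B, pointing `dataset_id` at the extracted directory (the path below is a placeholder):

```yaml
model_id: openai/whisper-tiny
dataset_id: <path/to/extracted/common_voice>  # placeholder: local Common Voice directory
language: English                             # example: match the downloaded language
```

Then run the same finetuning command as in Step 3.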
> [!NOTE]
> The first time a dataset is used, it is processed and cached locally. Subsequent runs reuse the processed version to save time and compute.
### Step 5 - Evaluate transcription accuracy with your finetuned STT model
- Start the Transcription app (see the command sketch after this list).
- Select your HF model id (if pushed) or provide a local model path.
- Record a sample and compare results.
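As in Step 1, a sketch assuming the app script path `demo/transcribe_app.py`:

```bash
# Relaunch the Transcription app and point it at your finetuned model (path is an assumption)
python demo/transcribe_app.py
```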
### Step 6 - Compare transcription performance between two models
- Start the Model Comparison app (see the command sketch after this list).
- Select a baseline model and a comparison model (e.g., your finetuned model).
- Record a sample and review both transcriptions side-by-side.
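A sketch, assuming the comparison UI lives alongside the other demo apps; the filename `demo/model_comparison_app.py` is an assumption:

```bash
# Launch the side-by-side Model Comparison app (script name is an assumption)
python demo/model_comparison_app.py
```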
### Step 7 - Evaluate a model on the Fleurs dataset for a specific language
- Configure the arguments and run:
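A hypothetical invocation; the script name `src/speech_to_text_finetune/evaluate_whisper_fleurs.py` is an assumption, and the evaluation arguments (model id, language) are configured as described in the repo:

```bash
# Evaluate a model on the Fleurs dataset for one language (script name is an assumption)
python src/speech_to_text_finetune/evaluate_whisper_fleurs.py
```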
## 🎨 Customizing the Blueprint
To better understand how you can tailor this Blueprint to suit your specific needs, please visit the Customization Guide.
## 🤝 Contributing to the Blueprint
Want to help improve or extend this Blueprint? Check out the Future Features & Contributions Guide to see how you can contribute your ideas, code, or feedback to make this Blueprint even better!
## 📖 Resources & References
If you are interested in learning more about this topic, you might find the following resources helpful:

- Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers (Blog post by HuggingFace which inspired the implementation of this Blueprint!)
- Automatic Speech Recognition Course from HuggingFace (Series of blog posts)
- Fine-Tuning ASR Models: Key Definitions, Mechanics, and Use Cases (Blog post by Gladia)
- Active Learning Approach for Fine-Tuning Pre-Trained ASR Model for a low-resourced Language (Paper)
- Exploration of Whisper fine-tuning strategies for low-resource ASR (Paper)