🎨 Customization Guide
This Blueprint is designed to be flexible and easily adaptable to your specific needs. This guide will walk you through some key areas you can customize to make the Blueprint your own.
BYOD: Bring Your Own Dataset
But I already have my own speech-text dateset! I don't want to create a new one from scratch or use Common Voice! Does this Blueprint have anything to offer me?
But of course!
This guide will walk you through how to use the existing codebase to adapt it to your own unique dataset with minimal effort.
The idea is to load and pre-process your own dataset in the same format as the existing datasets, allowing you to seamlessly integrate with the finetune_whisper.py script.
Step 1: Understand your Dataset
Before creating your custom dataset loading function, it's essential to understand the data format that the finetune_whisper.py script expects. Typically, the dataset should have a structure that looks a bit like this:
{
"train": [
{
"audio": "path/to/audio_file.wav",
"text": "The transcribed text of the audio"
},
# More examples...
],
"test": [
{
"audio": "path/to/audio_file_2.wav",
"text": "Another transcribed text"
},
# More examples...
]
}
Notably, there should be a pair of transcribed text and an audio clip (usually in the form of a path to the audio file, either .mp3 or .wav)
Step 2: Use the built-in tabular ASR dataset loader
If your dataset is a local .csv, .tsv, or .parquet file (or a directory containing one) with the columns audio_path and transcription, you do not need to write a custom loader anymore. The built-in loader in data_process.py will:
- keep only the audio path and transcription columns
- ignore extra metadata columns such as
topic,speaker_id, orparagraph_id - expect
audio_pathto already contain an absolute path to the audio file - preserve a
splitcolumn if it already defines both train and test data - otherwise create a fresh train/test split using
sklearn.model_selection.train_test_split
As an example, lets consider that you have a directory with a csv file and all the audio clips like this:
datasets/
├── my_dataset/
│ ├── dataset.csv
│ └── clips/
│ ├── audio_1.wav
│ ├── audio_2.wav
│ ├── audio_3.wav
│ └── ...
csv my_dataset/dataset.csv
audio_path,transcription,topic,speaker_id
/home/user/datasets/my_dataset/clips/example_1.mp3,"This is an example",culture,speaker_1
/home/user/datasets/my_dataset/clips/example_2.mp3,"This is another example",culture,speaker_2
...
/home/user/datasets/my_dataset/clips/example_n.mp3,"This is yet another example",culture,speaker_n
Optionally, you can also provide a split column with values like train, dev, val, valid, validation, test, eval, or evaluation.
How test_size is applied
- If your dataset already defines both train and test rows, that split is preserved and
test_sizeis ignored. dev,val,valid, andvalidationare treated as training data.test,eval, andevaluationare treated as test data.- If the
splitcolumn is missing, empty, or only defines one side of the split, the loader creates a fresh train/test split over the full dataset. - In that case,
test_sizeis passed directly tosklearn.model_selection.train_test_split. - If
test_sizeisnull, scikit-learn's default split size is used. - For datasets that already come with fixed splits, such as Common Voice or the legacy
train/+test/custom dataset layout, the existing split is preserved andtest_sizeis ignored.
Step 3: Update your config file
Point dataset_id to either the dataset directory or directly to the dataset file. test_size is only used when the loader needs to create a train/test split. If your dataset already defines both train and test rows, that split is preserved and test_size is ignored. If your dataset does not define a complete split, a fresh split is created from all rows using test_size or, when test_size: null, scikit-learn's default behavior. If you are using an MDC dataset id instead of a local path, you can also set download_directory to choose where the raw dataset should be downloaded.
model_id: openai/whisper-tiny
dataset_id: /home/user/datasets/my_dataset
language: English
repo_name: default
download_directory: "" # Only used for MDC dataset ids
n_train_samples: -1
n_test_samples: -1
test_size: 0.2 # Only used when the loader needs to create a test split
training_hp:
push_to_hub: False
hub_private_repo: True
...
Step 4: Fine-Tune the model with your own dataset!
Finally, simply run the finetune_whisper.py script to fine-tune the Whisper model using your custom dataset.
🤝 Contributing to the Blueprint
Want to help improve or extend this Blueprint? Check out the Future Features & Contributions Guide to see how you can contribute your ideas, code, or feedback to make this Blueprint even better!