ASR Loader
Loader for ASR (Automatic Speech Recognition) datasets.
There are two parsing strategies for ASR datasets, controlled by the root_strategy field in the schema.
Strategies
Index-based (default)
A CSV or TSV index file maps audio paths to transcriptions and other metadata columns.
Controlled by:
| Field | Required | Description |
|---|---|---|
index_file |
✓ | Path to the metadata file, relative to the dataset root. |
columns |
✓ | Mapping of logical column names to source columns and dtypes. |
base_audio_path |
✗ | (optional) Directory prefix or list of directories used to resolve file_path dtype columns. |
format |
✗ | Optional file format hint ("csv", "tsv", "pipe"). When omitted, the loader infers it from index_file where possible. |
Multi-split strategy (root_strategy: "multi_split")
Each split (e.g. train, dev, test) is stored in a separate file. The loader locates all matching files, filters by the configured split names, adds a split column to each, applies column mappings, and concatenates all frames into a single DataFrame. The split value is taken from the file stem.
Controlled by:
| Field | Required | Description |
|---|---|---|
splits |
✓ | List of split names to load (e.g. ["train", "dev", "test"]). |
splits_file_pattern |
✗ | (optional) Glob pattern to locate split files (default: "**/*.tsv"). |
columns |
✗ | (optional) Column mappings applied to every split frame. |
base_audio_path |
✗ | (optional) Directory prefix or list of directories used to resolve file_path dtype columns. |
Examples
Index-based schema
dataset_id: "cmj8u48g4005lnxzp98cpr7b2"
task: "ASR"
format: "tsv"
index_file: "ss-corpus-shi.tsv"
base_audio_path: "audios/"
columns:
audio_path:
source_column: "audio_file"
dtype: "file_path"
transcription:
source_column: "transcription"
dtype: "string"
speaker_id:
source_column: "client_id"
dtype: "category"
optional: true
audio_id:
source_column: "audio_id"
dtype: "string"
optional: true
duration_ms:
source_column: "duration_ms"
dtype: "int"
optional: true
prompt_id:
source_column: "prompt_id"
dtype: "string"
optional: true
prompt:
source_column: "prompt"
dtype: "string"
optional: true
votes:
source_column: "votes"
dtype: "int"
optional: true
age:
source_column: "age"
dtype: "category"
optional: true
gender:
source_column: "gender"
dtype: "category"
optional: true
language:
source_column: "language"
dtype: "category"
optional: true
split:
source_column: "split"
dtype: "category"
optional: true
char_per_sec:
source_column: "char_per_sec"
dtype: "float"
optional: true
quality_tags:
source_column: "quality_tags"
dtype: "string"
optional: true
Search-based audio resolution
When the metadata stores an ID or partial filename instead of a directly
joinable relative path, file_path columns can search within one or more
audio roots:
dataset_id: "example-asr"
task: "ASR"
index_file: "data/metadata.csv"
base_audio_path:
- "data/recipes/"
- "data/giving_gift/"
columns:
audio_path:
source_column: "Sentence ID"
dtype: "file_path"
path_match_strategy: "exact" # or "contains"
file_extension: ".wav"
transcription:
source_column: "Sentences"
dtype: "string"
path_match_strategy: "direct" remains the default and preserves the existing
extract_dir / base_audio_path / value behavior. The loader also trims BOMs
and surrounding header whitespace, and can retry common delimiters
automatically when a file initially parses as a single column.
If the true audio filename is composed from multiple metadata columns, use
path_template instead of a fuzzy search:
dataset_id: "khmer-asr-cultural-dataset-4e33cd05"
task: "ASR"
index_file: "data/metadata.csv"
base_audio_path:
- "data/recipes/"
- "data/giving_gift/"
columns:
audio_path:
source_column: "Sentence ID"
dtype: "file_path"
file_extension: ".wav"
path_template: "${Speaker ID}_khm_${Sentence ID}.wav"
transcription:
source_column: "Sentences"
dtype: "string"
Template placeholders reference raw metadata column names exactly, and
${value} refers to the current source_column value. Relative paths are
resolved from the dataset root inferred from the resolved index_file.
If the audio directory itself varies per row, base_audio_path can use the
same placeholder syntax:
dataset_id: "khmer-asr-cultural-dataset-4e33cd05"
task: "ASR"
index_file: "data/metadata.csv"
base_audio_path: "data/${Split}/"
columns:
audio_path:
source_column: "Sentence ID"
dtype: "file_path"
file_extension: ".wav"
path_template: "${Speaker ID}_khm_${value}"
transcription:
source_column: "Sentences"
dtype: "string"
That resolves each row as
dataset_root / data/<Split>/<Speaker ID>_khm_<Sentence ID>.wav.
File-content dtype
When the index file stores paths to transcription files instead of inline text,
use dtype: "file_content" to read the file contents into the DataFrame:
dataset_id: "speech-data-nupe"
task: "ASR"
index_file: "Metadata.csv"
base_audio_path:
- "Speaker_id_1"
- "Speaker_id_2"
columns:
audio_path:
source_column: "Audio_File_Path"
dtype: "file_path"
file_extension: ".wav"
transcription:
source_column: "Transcript_File_Path"
dtype: "file_content"
file_extension: ".txt"
speaker_id:
source_column: "Speaker_ID"
The file_content dtype reuses the same path resolution as file_path
(base_audio_path, file_extension, path_match_strategy, path_template)
but returns the file's text content instead of the resolved path.
Multi-split schema
dataset_id: "cmj8u3okr0001nxxbeshupy5k"
task: "ASR"
root_strategy: "multi_split"
splits:
- dev
- invalidated
- other
- reported
- test
- train
- validated
splits_file_pattern: "**/*.tsv"
base_audio_path: "clips/"
columns:
audio_path:
source_column: "path"
dtype: "file_path"
transcription:
source_column: "sentence"
dtype: "string"
speaker_id:
source_column: "client_id"
dtype: "category"
optional: true
sentence_id:
source_column: "sentence_id"
dtype: "string"
optional: true
sentence_domain:
source_column: "sentence_domain"
dtype: "category"
optional: true
up_votes:
source_column: "up_votes"
dtype: "int"
optional: true
down_votes:
source_column: "down_votes"
dtype: "int"
optional: true
age:
source_column: "age"
dtype: "category"
optional: true
gender:
source_column: "gender"
dtype: "category"
optional: true
accents:
source_column: "accents"
dtype: "category"
optional: true
variant:
source_column: "variant"
dtype: "category"
optional: true
locale:
source_column: "locale"
dtype: "category"
optional: true