Skip to content

ASR Loader

Loader for ASR (Automatic Speech Recognition) datasets.

There are two parsing strategies for ASR datasets, controlled by the root_strategy field in the schema.

Strategies

Index-based (default)

A CSV or TSV index file maps audio paths to transcriptions and other metadata columns.

Controlled by:

Field Required Description
format File format ("csv" or "tsv").
index_file Path to the metadata file, relative to the dataset root.
columns Mapping of logical column names to source columns and dtypes.
base_audio_path (optional) Directory prefix prepended to file_path dtype columns.

Multi-split strategy (root_strategy: "multi_split")

Each split (e.g. train, dev, test) is stored in a separate file. The loader locates all matching files, filters by the configured split names, adds a split column to each, applies column mappings, and concatenates all frames into a single DataFrame. The split value is taken from the file stem.

Controlled by:

Field Required Description
splits List of split names to load (e.g. ["train", "dev", "test"]).
splits_file_pattern (optional) Glob pattern to locate split files (default: "**/*.tsv").
columns (optional) Column mappings applied to every split frame.
base_audio_path (optional) Directory prefix prepended to file_path dtype columns.

Examples

Index-based schema

dataset_id: "cmj8u48g4005lnxzp98cpr7b2"
task: "ASR"
format: "tsv"

index_file: "ss-corpus-shi.tsv"
base_audio_path: "audios/"

columns:
  audio_path:
    source_column: "audio_file"
    dtype: "file_path"
  transcription:
    source_column: "transcription"
    dtype: "string"
  speaker_id:
    source_column: "client_id"
    dtype: "category"
    optional: true
  audio_id:
    source_column: "audio_id"
    dtype: "string"
    optional: true
  duration_ms:
    source_column: "duration_ms"
    dtype: "int"
    optional: true
  prompt_id:
    source_column: "prompt_id"
    dtype: "string"
    optional: true
  prompt:
    source_column: "prompt"
    dtype: "string"
    optional: true
  votes:
    source_column: "votes"
    dtype: "int"
    optional: true
  age:
    source_column: "age"
    dtype: "category"
    optional: true
  gender:
    source_column: "gender"
    dtype: "category"
    optional: true
  language:
    source_column: "language"
    dtype: "category"
    optional: true
  split:
    source_column: "split"
    dtype: "category"
    optional: true
  char_per_sec:
    source_column: "char_per_sec"
    dtype: "float"
    optional: true
  quality_tags:
    source_column: "quality_tags"
    dtype: "string"
    optional: true

Multi-split schema

dataset_id: "cmj8u3okr0001nxxbeshupy5k"
task: "ASR"
root_strategy: "multi_split"

splits:
  - dev
  - invalidated
  - other
  - reported
  - test
  - train
  - validated

splits_file_pattern: "**/*.tsv"
base_audio_path: "clips/"

columns:
  audio_path:
    source_column: "path"
    dtype: "file_path"
  transcription:
    source_column: "sentence"
    dtype: "string"
  speaker_id:
    source_column: "client_id"
    dtype: "category"
    optional: true
  sentence_id:
    source_column: "sentence_id"
    dtype: "string"
    optional: true
  sentence_domain:
    source_column: "sentence_domain"
    dtype: "category"
    optional: true
  up_votes:
    source_column: "up_votes"
    dtype: "int"
    optional: true
  down_votes:
    source_column: "down_votes"
    dtype: "int"
    optional: true
  age:
    source_column: "age"
    dtype: "category"
    optional: true
  gender:
    source_column: "gender"
    dtype: "category"
    optional: true
  accents:
    source_column: "accents"
    dtype: "category"
    optional: true
  variant:
    source_column: "variant"
    dtype: "category"
    optional: true
  locale:
    source_column: "locale"
    dtype: "category"
    optional: true