# API Reference

## datacollective.datasets

### get_dataset_details(dataset_id)
Return dataset details from the MDC API as a dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_id` | `str` | The dataset ID (as shown in the MDC platform). | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | A dict with dataset details as returned by the API. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `dataset_id` is empty. |
| `FileNotFoundError` | If the dataset does not exist (404). |
| `PermissionError` | If access is denied (403). |
| `RuntimeError` | If the rate limit is exceeded (429). |
| `HTTPError` | For other non-2xx responses. |
Source code in src/datacollective/datasets.py
### load_dataset(dataset_id, download_directory=None, show_progress=True, overwrite_existing=False, overwrite_extracted=False)

Download (if needed), extract (if not already extracted), and load the dataset into a pandas DataFrame.

If the dataset archive already exists in the download directory, it is not re-downloaded unless overwrite_existing=True.
If a directory exists with the same name as the archive file minus its suffix, it is assumed to have already been extracted and is not re-extracted unless overwrite_extracted=True.

Uses the dataset schema to determine task-specific loading logic.
Automatically resumes interrupted downloads if a .checksum file exists from a previous attempt.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_id` | `str` | The dataset ID (as shown in the MDC platform). | *required* |
| `download_directory` | `str \| None` | Directory in which to save the downloaded archive file. If None or empty, falls back to the env var MDC_DOWNLOAD_PATH or the default. | `None` |
| `show_progress` | `bool` | Whether to show a progress bar during download. | `True` |
| `overwrite_existing` | `bool` | Whether to overwrite an existing archive. | `False` |
| `overwrite_extracted` | `bool` | Whether to overwrite existing extracted files by re-extracting the archive. Only meaningful when overwrite_existing is False. Checks the download directory for extracted files under the default folder name. | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with the loaded dataset. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If `dataset_id` is empty or the schema is unsupported. |
| `FileNotFoundError` | If the dataset does not exist (404). |
| `PermissionError` | If access is denied (403) or the download directory is not writable. |
| `RuntimeError` | If the rate limit is exceeded (429) or the response format is unexpected. |
| `HTTPError` | For other non-2xx responses. |
Source code in src/datacollective/datasets.py
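The caching rules above can be sketched as follows. This is a simplified illustration, not the library's actual code; the extracted-directory naming (archive name minus its suffix) is taken from the description above.

```python
from pathlib import Path


def needs_download(archive: Path, overwrite_existing: bool = False) -> bool:
    # Re-download only if the archive is missing or the caller forces it.
    return overwrite_existing or not archive.exists()


def needs_extraction(archive: Path, overwrite_extracted: bool = False) -> bool:
    # A directory named like the archive minus its suffix is assumed
    # to be a previous extraction of that archive.
    extracted = archive.with_suffix("")
    return overwrite_extracted or not extracted.is_dir()
```

With this logic, a second `load_dataset` call on the same dataset touches neither the network nor the archive unless one of the overwrite flags is set.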
### save_dataset_to_disk(dataset_id, download_directory=None, show_progress=True, overwrite_existing=False)
Download the dataset archive to a local directory and return the archive path.
Skips download if the target file already exists (unless overwrite_existing=True).
Automatically resumes interrupted downloads if a matching .checksum file exists from a previous attempt.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_id` | `str` | The dataset ID (as shown in the MDC platform). | *required* |
| `download_directory` | `str \| None` | Directory in which to save the downloaded archive file. If None or empty, falls back to the env var MDC_DOWNLOAD_PATH or the default. | `None` |
| `show_progress` | `bool` | Whether to show a progress bar during download. | `True` |
| `overwrite_existing` | `bool` | Whether to overwrite the existing archive file. | `False` |
Returns:

| Type | Description |
|---|---|
| `Path` | Path to the downloaded dataset archive. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `dataset_id` is empty. |
| `FileNotFoundError` | If the dataset does not exist (404). |
| `PermissionError` | If access is denied (403) or the download directory is not writable. |
| `RuntimeError` | If the rate limit is exceeded (429) or the response format is unexpected. |
| `HTTPError` | For other non-2xx responses. |
Source code in src/datacollective/datasets.py
## datacollective.download

### cleanup_partial_download(tmp_filepath, checksum_filepath)
Remove partial download files (.part and .checksum).
Source code in src/datacollective/download.py
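A minimal sketch of the cleanup behaviour, assuming the two arguments point at the `.part` and `.checksum` files (`missing_ok=True` makes the call safe to repeat):

```python
from pathlib import Path


def cleanup_partial(tmp_filepath: Path, checksum_filepath: Path) -> None:
    # Delete the partial archive and its checksum marker; ignore missing files.
    tmp_filepath.unlink(missing_ok=True)
    checksum_filepath.unlink(missing_ok=True)
```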
### determine_resume_state(download_plan)
Determine whether to resume a download based on existing files.
Cases handled:

- Case 1: .checksum and .part exist, checksum matches -> resume download.
- Case 2: .checksum and .part exist, checksum does NOT match -> start fresh.
- Case 3: .part exists but no .checksum -> start fresh (cannot safely resume).
- Case 4: .checksum exists but no .part -> start fresh (orphaned checksum).
- Case 5: Neither .checksum nor .part exists -> start fresh.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `download_plan` | `DownloadPlan` | The DownloadPlan object with download details. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `resume_checksum` | `str \| None` | The checksum to use for resumption, or None if starting fresh. |
Source code in src/datacollective/download.py
### execute_download_plan(download_plan, resume_download_checksum, show_progress)
Execute the download plan, downloading the dataset to a temporary path.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `download_plan` | `DownloadPlan` | The DownloadPlan object with download details. | *required* |
| `resume_download_checksum` | `str \| None` | Provide the checksum to resume a previously interrupted download. | *required* |
| `show_progress` | `bool` | Whether to show a progress bar during download. | *required* |

Raises:

| Type | Description |
|---|---|
| `DownloadError` | If the download fails or is interrupted. |
Source code in src/datacollective/download.py
### get_download_plan(dataset_id, download_directory)
Send a POST request to the API to receive the download session details for a dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_id` | `str` | The dataset ID (as shown in the MDC platform). | *required* |
| `download_directory` | `str \| None` | Directory in which to save the downloaded dataset. If None or empty, falls back to the env var MDC_DOWNLOAD_PATH or the default. | *required* |

Returns:

| Type | Description |
|---|---|
| `DownloadPlan` | A DownloadPlan containing the download session details. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If `dataset_id` is empty. |
| `FileNotFoundError` | If the dataset does not exist (404). |
| `PermissionError` | If access is denied (403) or the download directory is not writable. |
| `RuntimeError` | If the rate limit is exceeded (429) or the response format is unexpected. |
| `HTTPError` | For other non-2xx responses. |
Source code in src/datacollective/download.py
### resolve_download_dir(download_directory)
Resolve and ensure the download directory exists and is writable.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `download_directory` | `str \| None` | User-specified download directory. If None or empty, falls back to the env var MDC_DOWNLOAD_PATH or the default. | *required* |

Returns:

| Type | Description |
|---|---|
| `Path` | The resolved Path object for the download directory. |
Source code in src/datacollective/download.py
### write_checksum_file(checksum_filepath, checksum)

## datacollective.api_utils

### send_api_request(method, url, stream=False, extra_headers=None, timeout=HTTP_TIMEOUT, include_auth_headers=True)
Send an HTTP request to the MDC API with appropriate headers and error handling.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `str` | HTTP method (e.g., 'GET', 'POST'). | *required* |
| `url` | `str` | Full URL for the API endpoint. | *required* |
| `stream` | `bool` | Whether to stream the response. | `False` |
| `extra_headers` | `dict[str, str] \| None` | Additional headers to include in the request, e.g. for resuming a download. | `None` |
| `timeout` | `tuple[int, int] \| None` | A tuple specifying (connect timeout, read timeout) in seconds. | `HTTP_TIMEOUT` |
| `include_auth_headers` | `bool` | Whether to include authentication (API key) headers. | `True` |
Returns:

| Type | Description |
|---|---|
| `Response` | The HTTP response object. |

Raises:

| Type | Description |
|---|---|
| `FileNotFoundError` | If the resource is not found (404). |
| `PermissionError` | If access is denied (403). |
| `RuntimeError` | If the rate limit is exceeded (429). |
| `ValueError` | If the API key is missing when authentication is required. |
| `HTTPError` | For other non-2xx responses. |
Source code in src/datacollective/api_utils.py
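The status-to-exception mapping documented above could be sketched as follows. This is a simplified illustration only; the real function also handles streaming, headers, and authentication, and raises `requests.HTTPError` for other non-2xx statuses via `response.raise_for_status()`.

```python
def raise_for_mdc_status(status_code: int) -> None:
    # Translate MDC API error statuses into the documented Python exceptions.
    if status_code == 404:
        raise FileNotFoundError("resource not found (404)")
    if status_code == 403:
        raise PermissionError("access denied (403)")
    if status_code == 429:
        raise RuntimeError("rate limit exceeded (429)")
```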
## datacollective.schema

### ColumnMapping
Bases: BaseModel
A single column mapping entry inside a schema.
Used by index-based tasks to describe how columns in the index file map to logical fields and their data types.
Source code in src/datacollective/schema.py
### ContentMapping
Bases: BaseModel
Describes how file contents / metadata map to DataFrame columns.
Used by glob-based tasks (e.g. LM) to specify how to extract text and metadata from files found via glob patterns. For example, the text content might come from the file contents, while metadata (e.g. language code) might come from the file name or parent directory.
Source code in src/datacollective/schema.py
### DatasetSchema
Bases: BaseModel
Task-agnostic representation of a dataset schema, as defined by a schema.yaml file.
Every schema must have dataset_id and task. The remaining fields depend on the task type and the root_strategy ("index" vs "glob").
New task types only need to populate the fields they care about; the loader registered for that task will decide which fields are required at load time.
Source code in src/datacollective/schema.py
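A sketch of the required-field rule stated above (a hypothetical helper; the real validation is performed by the Pydantic model itself):

```python
def check_required_schema_fields(data: dict) -> None:
    # Every schema must provide dataset_id and task; all other fields
    # are task-specific and validated by the loader at load time.
    missing = [field for field in ("dataset_id", "task") if field not in data]
    if missing:
        raise ValueError(f"schema is missing required fields: {missing}")
```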
#### to_yaml_dict()

Serialise the schema to a plain dict suitable for YAML output.

Excludes fields that are at their default values so that the generated schema.yaml stays compact and readable. The extra dict is merged into the top level.
Source code in src/datacollective/schema.py
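The compaction behaviour can be illustrated with a standalone sketch (a hypothetical helper, not the model's actual method):

```python
def compact_dict(values: dict, defaults: dict, extra: dict) -> dict:
    # Drop fields still at their default value, then merge `extra`
    # into the top level, mirroring to_yaml_dict's described output.
    out = {k: v for k, v in values.items() if defaults.get(k) != v}
    out.update(extra)
    return out
```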
### get_dataset_schema(dataset_id)
Download and return the schema.yaml content for dataset_id.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_id` | `str` | The registry dataset ID (the folder name under /registry/). | *required* |

Returns:

| Type | Description |
|---|---|
| `DatasetSchema \| None` | A fully-populated DatasetSchema, or None if the dataset is not found in the registry (HTTP 404). |

Raises:

| Type | Description |
|---|---|
| `RuntimeError` | For any other network / HTTP error. |
Source code in src/datacollective/schema.py
### parse_schema(raw)
Parse a schema from a YAML string, a dict, or a file path.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw` | `str \| dict[str, Any] \| Path` | A YAML string, an already-parsed dict, or a path to a YAML file. | *required* |

Returns:

| Type | Description |
|---|---|
| `DatasetSchema` | A fully-populated DatasetSchema. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If required fields are missing or the input cannot be parsed. |
Source code in src/datacollective/schema.py
## datacollective.schema_loaders.base

### BaseSchemaLoader

Bases: ABC
Interface that every task-specific loader must implement.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `DatasetSchema` | The parsed schema for the dataset. | *required* |
| `extract_dir` | `Path` | The directory where the dataset files have been extracted. | *required* |
Source code in src/datacollective/schema_loaders/base.py
## datacollective.schema_loaders.registry

### get_task_loader(task)

Return the loader class for `task`.
Raises:

| Type | Description |
|---|---|
| `ValueError` | If no loader is registered for the given task. |
Source code in src/datacollective/schema_loaders/registry.py
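The registry pattern can be sketched as a task-to-class mapping (an illustration of the pattern, not the module's actual implementation):

```python
from typing import Callable

_LOADERS: dict[str, type] = {}


def register_loader(task: str) -> Callable[[type], type]:
    # Decorator: associate a loader class with a task name.
    def wrap(cls: type) -> type:
        _LOADERS[task] = cls
        return cls
    return wrap


def get_task_loader(task: str) -> type:
    # Look up the loader class, failing loudly for unknown tasks.
    try:
        return _LOADERS[task]
    except KeyError:
        raise ValueError(f"no loader registered for task {task!r}") from None
```

New task types then only need to define a loader class and register it under their task name.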
### load_dataset_from_schema(schema, extract_dir)

Instantiate the appropriate loader for schema.task and return the loaded pandas DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `DatasetSchema` | Parsed dataset schema. | *required* |
| `extract_dir` | `Path` | Root directory where the dataset archive was extracted. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame with the loaded dataset. |
Source code in src/datacollective/schema_loaders/registry.py
## datacollective.schema_loaders.cache_schema

## datacollective.schema_loaders.tasks.asr

### ASRLoader

Bases: BaseSchemaLoader
Load an ASR dataset described by a DatasetSchema.
Source code in src/datacollective/schema_loaders/tasks/asr.py
## datacollective.schema_loaders.tasks.tts

### TTSLoader

Bases: BaseSchemaLoader
Load a TTS dataset described by a DatasetSchema.
See docs/loaders/tts.md for details on supported loading strategies and schema fields.