Programmatic Uploads
This guide explains how to programmatically upload datasets to the Mozilla Data Collective using the datacollective Python SDK.
Overview
The SDK provides a complete workflow for uploading datasets:
- Create a draft submission - Initialize a new dataset submission
- Upload the dataset file - Upload your archive using resumable multipart uploads
- Update submission metadata - Add required metadata fields to the submission
- Submit for review - Finalize the submission for review
The file-upload step can also be used on its own to upload a new archive version to an already approved and published dataset submission. See the Upload a New File Version to an Approved Dataset section for more details.
The SDK also supports resumable uploads, meaning if an upload is interrupted (network error, system shutdown, etc.), you can resume from where it left off.
Prerequisites
Before uploading, ensure you have:
- An API key from the Mozilla Data Collective dashboard
- Your dataset packaged as an archive file (`.tar.gz`; uploads use `application/gzip`)
- All the required metadata for the dataset submission
- Dataset archives must be 80 GB or less
Configuration
Set your API key as an environment variable:
Or create a .env file in your project directory:
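The two options above might look like this. Note that the exact environment variable name is an assumption for illustration; confirm it in the SDK's configuration documentation:

```shell
# Option 1: export the key in your shell session
# (DATACOLLECTIVE_API_KEY is an assumed name -- check the SDK docs)
export DATACOLLECTIVE_API_KEY="your-api-key-here"

# Option 2: persist it in a .env file in your project directory
echo 'DATACOLLECTIVE_API_KEY=your-api-key-here' > .env
```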
Quick Start: One-Step Upload
The simplest way to upload a dataset is using create_submission_with_upload, which handles the entire workflow in a single call:
```python
from datacollective import DatasetSubmission, License, Task, create_submission_with_upload

submission = DatasetSubmission(
    name="Dataset Name",
    longDescription="A detailed description of the dataset.",
    shortDescription="A brief description of the dataset.",
    locale="en-US",
    task=Task.ASR,
    format="TSV",
    licenseAbbreviation=License.CC_BY_4_0,
    other="This text should provide a detailed description of the dataset, "
          "including its contents, structure, and any relevant information "
          "that would help users understand what the dataset is about "
          "and how it can be used.",
    restrictions="Any restrictions you want to impose on the dataset",
    forbiddenUsage="Use cases that are not allowed with this dataset",
    additionalConditions="Any additional conditions for using the dataset",
    pointOfContactFullName="Jane Doe",
    pointOfContactEmail="jane@example.com",
    fundedByFullName="Funder Name",
    fundedByEmail="funder@example.com",
    legalContactFullName="Legal Name",
    legalContactEmail="legal@example.com",
    createdByFullName="Creator Name",
    createdByEmail="creator@example.com",
    intendedUsage="Describe the intended usage of the dataset, including "
                  "potential applications and use cases.",
    ethicalReviewProcess="Describe the ethical review process that was "
                         "followed for this dataset, including any approvals "
                         "or considerations related to data collection and usage.",
    exclusivityOptOut=False,  # True = This dataset is non-exclusive to Mozilla Data Collective,
                              # False = Dataset is exclusively hosted in Mozilla Data Collective
    agreeToSubmit=True,  # True = You confirm that you have the right to submit this dataset and
                         # that all information provided in the datasheet is accurate.
                         # Required to be True to complete the submission process
)

response = create_submission_with_upload(
    file_path="/path/to/dataset.tar.gz",
    submission=submission,
)
print(response)
```
For predefined licenses, pass `licenseAbbreviation=License.<VALUE>` and leave `licenseUrl` and `license` unset. For a custom license, pass a custom string to `license` and optionally include `licenseUrl` and `licenseAbbreviation`.
Upload a New File Version to an Approved Dataset
Use `upload_dataset_file` when the dataset already exists on the platform and is already in the Published/Approved state.

- Go to Profile → Uploads on the platform.
- Click on the dataset submission you want to upload a new version for (it must be in an Approved state).
- Copy the ID from the URL, for example: `https://mozilladatacollective.com/profile/submissions/<ID>`
- Pass that value to `upload_dataset_file` as `submission_id`.

> [!IMPORTANT]
> The value after `/profile/submissions/` is the submission ID, not the dataset ID.
```python
from datacollective import upload_dataset_file

approved_submission_id = "XXXXXXXXXXXXXXXXX"  # submission ID, not dataset ID

upload_state = upload_dataset_file(
    file_path="/path/to/new-dataset-version.tar.gz",
    submission_id=approved_submission_id,
)
print(f"Version upload complete! File Upload ID: {upload_state.fileUploadId}")
```
If the upload is interrupted, rerun the same call and the SDK will resume from the saved state file.
Required Submission Fields
For a detailed explanation of the required fields in the DatasetSubmission model, see the API Reference section.
To complete the submission process, the submission must include all of the following fields:

- `name`
- `longDescription`
- `task`
- `locale`
- `format`
- `licenseAbbreviation` or `license`
- `restrictions`
- `forbiddenUsage`
- `pointOfContactFullName`
- `pointOfContactEmail`
- `agreeToSubmit=True`
- `fileUploadId`
The fileUploadId is only available after a successful file upload and is required to link the uploaded archive to the submission.
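As a pre-flight sanity check before submitting, you could verify that your metadata covers the documented required fields. The `REQUIRED_FIELDS` list mirrors the documentation above, but `check_required` itself is an illustrative helper, not part of the SDK:

```python
# Illustrative pre-flight check (not an SDK function): report which of the
# documented required fields are missing or empty in a metadata dict.
REQUIRED_FIELDS = [
    "name", "longDescription", "task", "locale", "format",
    "restrictions", "forbiddenUsage",
    "pointOfContactFullName", "pointOfContactEmail",
    "agreeToSubmit", "fileUploadId",
]

def check_required(metadata: dict) -> list[str]:
    """Return the names of required fields that are missing or falsy."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    # Either a predefined license abbreviation or a custom license is required.
    if not (metadata.get("licenseAbbreviation") or metadata.get("license")):
        missing.append("licenseAbbreviation or license")
    return missing

print(check_required({"name": "Dataset Name", "agreeToSubmit": True}))
```

Running the checker on an incomplete dict lists every field you still need, including `fileUploadId`, before you spend a round trip on the API.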
Step-by-Step Upload
For more control over the upload process, you can use the individual functions:
Step 1: Create a Draft Submission
```python
from datacollective import DatasetSubmission, create_submission_draft

submission = DatasetSubmission(
    name="Dataset Name",
    longDescription="A detailed description of the dataset.",
)

draft = create_submission_draft(submission)
submission_id = draft["submission"]["id"]
print(f"Created draft submission: {submission_id}")
```
Step 2: Upload the Dataset File
Then, you can use the submission ID above to upload the dataset file:
```python
from datacollective import upload_dataset_file

upload_state = upload_dataset_file(
    file_path="/path/to/your/dataset.tar.gz",
    submission_id=submission_id,
)
print(f"Upload complete! File Upload ID: {upload_state.fileUploadId}")
```
> [!TIP]
> You can also find your submission ID by going to Uploads in your profile and clicking on the dataset submission of your choice; the URL will contain the submission ID (e.g., `https://mozilladatacollective.com/submissions/cmmjpewijXXXXXXXXX`).
Step 3: Update Submission Metadata
For this step, you will need the fileUploadId from the upload response above, which is required to link the uploaded file to your submission. Without this ID, you won't be able to proceed to the submission step. If you no longer have access to it, you will need to re-upload the file to get a new fileUploadId.
At this step, you can also update any other metadata fields.
```python
from datacollective import DatasetSubmission, License, Task, update_submission

update_fields = DatasetSubmission(
    task=Task.ASR,
    licenseAbbreviation=License.CC_BY_4_0,
    locale="en-US",
    format="TSV",
    restrictions="No restrictions.",
    forbiddenUsage="Do not use for unlawful purposes.",
    pointOfContactFullName="Jane Doe",
    pointOfContactEmail="jane@example.com",
    fileUploadId=upload_state.fileUploadId,
    # ... other metadata fields ...
)

response = update_submission(
    submission_id=submission_id,
    submission=update_fields,
)
print(f"Metadata updated: {response}")
```
Step 4: Submit for Review
```python
from datacollective import DatasetSubmission, submit_submission

response = submit_submission(
    submission_id=submission_id,
    submission=DatasetSubmission(agreeToSubmit=True),
)
submission = response["submission"]
print(f"Submission status: {submission['status']}")
```
Resumable Uploads
The SDK automatically handles interrupted uploads using a state file.
How It Works
- When an upload starts, the SDK creates a state file (`.mdc-upload.json`) alongside your archive
- The state file tracks which parts have been successfully uploaded
- If the upload is interrupted, rerunning the same upload call will resume from where it left off
- Once the upload completes successfully, the state file is removed automatically
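The bookkeeping described above can be sketched in a few lines. This is an illustrative stand-in for the SDK's internal logic, not its actual state-file format; `upload_parts` and `send` are hypothetical names:

```python
import json
from pathlib import Path

# Sketch of resumable multipart bookkeeping: a JSON state file records which
# parts finished, so a rerun uploads only the remaining parts.
def upload_parts(parts: list[bytes], state_path: Path, send) -> None:
    state = {"completed": []}
    if state_path.exists():
        # Resume: load the part numbers completed in an earlier run.
        state = json.loads(state_path.read_text())
    for number, data in enumerate(parts):
        if number in state["completed"]:
            continue  # already uploaded before the interruption
        send(number, data)
        state["completed"].append(number)
        state_path.write_text(json.dumps(state))  # persist after each part
    state_path.unlink(missing_ok=True)  # all parts done: remove the state file
```

Because the state is flushed after every part, a crash mid-upload loses at most the part in flight; the next run skips everything already recorded.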
Automatic Resume
Simply rerun the same upload call after an interruption.
Using create_submission_with_upload
```python
from datacollective import create_submission_with_upload

# First attempt (interrupted)
response = create_submission_with_upload(
    file_path="/path/to/dataset.tar.gz",
    submission=submission,
)

# Second attempt (resumes automatically)
response = create_submission_with_upload(
    file_path="/path/to/dataset.tar.gz",
    submission=submission,
)
```
Using upload_dataset_file
```python
from datacollective import upload_dataset_file

# First attempt (interrupted)
upload_state = upload_dataset_file(
    file_path="/path/to/your/dataset.tar.gz",
    submission_id=submission_id,
)

# Second attempt (resumes automatically)
upload_state = upload_dataset_file(
    file_path="/path/to/your/dataset.tar.gz",
    submission_id=submission_id,
)
```
Custom State File Location
You can specify a custom location for the state file:
```python
response = create_submission_with_upload(
    file_path="/path/to/dataset.tar.gz",
    submission=submission,
    state_path="/custom/path/upload-state.json",
)
```
Disabling Resume
To force a fresh upload (ignoring any existing state), delete the state file (`.mdc-upload.json` next to your archive, or the file at your custom `state_path`) before rerunning the upload call.
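For example, assuming the default state-file location described above (the exact path on your system may differ):

```shell
# Remove the saved upload state so the next run starts from scratch.
rm /path/to/.mdc-upload.json
```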
Error Handling
The SDK raises specific exceptions for common error cases:
| Exception | Cause |
|---|---|
| `FileNotFoundError` | The specified file path does not exist |
| `ValidationError` | Invalid `DatasetSubmission` or required string inputs |
| `ValueError` | Missing or invalid required parameter |
| `PermissionError` | API key is invalid or lacks permissions |
| `RuntimeError` | Rate limit exceeded or upload failed |
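A defensive wrapper around an upload call might map those exceptions to actionable messages. Here `safe_upload` is an illustrative pattern, not an SDK function, and `do_upload` stands in for a call like `create_submission_with_upload`:

```python
# Illustrative error-handling pattern around an upload call. The exception
# types match the table above; do_upload is a stand-in for an SDK call.
def safe_upload(do_upload) -> str:
    try:
        do_upload()
        return "ok"
    except FileNotFoundError:
        return "file path does not exist"
    except PermissionError:
        return "check your API key and its permissions"
    except ValueError:
        return "missing or invalid required parameter"
    except RuntimeError:
        return "rate limited or upload failed; rerun to resume"
```

`ValidationError` (raised when constructing `DatasetSubmission`) can be caught the same way if you build the model inside the wrapped call.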
Using the DatasetSubmission Model
All submission inputs use the DatasetSubmission Pydantic model, so validation happens
as soon as you construct the model (before any network calls are made).
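The fail-fast behavior can be seen with a toy Pydantic model. `MiniSubmission` is an illustrative stand-in, not the SDK's actual `DatasetSubmission` schema:

```python
from pydantic import BaseModel, ValidationError

# Toy stand-in for DatasetSubmission: bad inputs raise at construction
# time, before any HTTP request could be made.
class MiniSubmission(BaseModel):
    name: str
    agreeToSubmit: bool

try:
    MiniSubmission(name=None, agreeToSubmit="not-a-bool")
except ValidationError as exc:
    print(f"caught {len(exc.errors())} validation errors before any network call")
```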
API Reference
For detailed API documentation, see the API Reference section.
Key Functions
- `create_submission_with_upload` - One-step submission and upload
- `create_submission_draft` - Create a draft submission
- `update_submission` - Update submission metadata
- `upload_dataset_file` - Upload a file to a submission
- `submit_submission` - Submit a draft for review