Mozilla Data Collective Python SDK Library
Welcome to the documentation for the datacollective Python client for the
Mozilla Data Collective REST API.
This library helps you:
- Authenticate with the Mozilla Data Collective.
- Download datasets to local storage.
- Load supported datasets into AI-friendly formats, such as pandas DataFrames.
Installation
Install from PyPI:
Getting an API Key
To use the Mozilla Data Collective API, you need an API key:
- Sign up to the Mozilla Data Collective platform.
- Create or retrieve an API key from your Account -> Credentials page.
- Store your key secret in a
.envfile and do not commit it to version control (git).
Configuration
The client reads configuration from environment variables and .env files.
Environment variables
Required:
MDC_API_KEY- Your Mozilla Data Collective API key.
Optional:
MDC_API_URL- API endpoint (defaults to the production URL).MDC_DOWNLOAD_PATH- Local directory where datasets will be downloaded (defaults to~/.mozdata/datasets).
Example using environment variables directly:
export MDC_API_KEY=your-api-key-here
export MDC_API_URL=https://datacollective.mozillafoundation.org/api
export MDC_DOWNLOAD_PATH=~/.mozdata/datasets
.env file
The client will automatically load configuration from a .env file in your
project root or present working directory.
Create a file named .env:
# MDC API Configuration
MDC_API_KEY=your-api-key-here
MDC_API_URL=https://datacollective.mozillafoundation.org/api
MDC_DOWNLOAD_PATH=~/.mozdata/datasets
Security note: do not commit
.envfiles to version control, as they contain secrets.
Basic Usage
IMPORTANT NOTE: Before trying to access any dataset, make sure you have thoroughly read and agreed to the specific dataset's conditions & licensing terms.
[!TIP] You can find the
dataset-idby looking at the URL of the dataset's page on MDC platform. The ID is the unique string of characters located at the very end of the URL, after the/datasets/path. For example, for URLhttps://datacollective.mozillafoundation.org/datasets/cmflnuzw6lrt9e6ui4kwcshvndataset id will becmflnuzw6lrt9e6ui4kwcshvn.
Download a dataset
Use save_dataset_to_disk to download a dataset to the configured download path:
from datacollective import save_dataset_to_disk
dataset = save_dataset_to_disk("your-dataset-id")
# Depending on the implementation, `dataset` may contain metadata
# about the downloaded files or a higher-level dataset object.
The files will be stored under MDC_DOWNLOAD_PATH (default ~/.mozdata/datasets).
Loading and Querying Datasets
Note: in-memory dataset loading is currently supported only for certain datasets.
You can load supported datasets into memory as a pandas DataFrame for
analysis:
from datacollective import load_dataset
df = load_dataset("your-dataset-id")
# Inspect the loaded DataFrame
print(df.head())
Once loaded into a DataFrame, you can use standard pandas operations
to filter, aggregate, and analyze the data.
For details on how
schema.yamlfiles drive the loading process, see Schema-Based Dataset Loading.
Get dataset details
You can retrieve info from the datasheet of a dataset without downloading it:
from datacollective import get_dataset_details
info = get_dataset_details("your-dataset-id")
print(info)
Automatic Download Resume
The SDK automatically handles interrupted downloads. If a download is interrupted
for any reason (network error, user cancellation, system shutdown, etc.), the SDK
will automatically resume from where it left off when you call save_dataset_to_disk
or load_dataset again.
How it works:
- When a download starts, the SDK creates a
.checksumfile alongside the partial download (.partfile) to track the download state. - If the download is interrupted, both files are preserved.
- On the next download attempt, the SDK detects the partial download and resumes from the last byte received.
- Once the download completes successfully, the temporary files are automatically cleaned up.
[!TIP] You don't need to do anything special to enable resume functionality, it works automatically. Just call the same function again after an interruption.
Edge cases handled:
- If the dataset has been updated since the interrupted download, the SDK detects the checksum mismatch and starts a fresh download.
- If only partial files exist without proper tracking data, the SDK safely starts a fresh download.
Automatically check for extracted archives
The load_dataset function avoids redundant extraction by automatically detecting existing files.
It checks if the dataset archive is already downloaded and the folder is extracted under the same name.
This behavior applies when overwrite_existing=False and overwrite_extracted=False.
The SDK identifies the data if the extracted folder name matches the archive name without the extension.
API Reference
For a detailed API reference, see the API Reference section of the documentation.
[!NOTE] This section is intended for maintainers of the
datacollectivelibrary.
Tests
Run the full test suite:
Note that the e2e tests require a valid MDC_TEST_API_KEY and a MDC_TEST_API_URL key set in your environment. Pytest will skip the live e2e tests automatically if either is missing.
Release Workflow
Check out the Release Workflow document for details on how to publish new versions of the library to PyPI using GitHub Actions.