
In this blog post I introduce pydargs as a helpful tool for configuration management in Python projects.

The Evolution of Configuration in a Machine Learning Project

In my career as a data scientist I have seen many projects start as simple proof-of-concept scripts or notebooks, get consolidated into minimum viable products, and eventually grow into large codebases as they are improved and expanded. If no care is taken, configuration values for the models, data sources, and so on end up scattered all over the codebase.

Eventually, part of this configuration is consolidated into a single place, for example in a constants.py file with global variables, or in a dataclass. Often, large parts remain behind because they "never change anyway", while other parts are added to the argument parser because they change so often.

The end result is configuration that is partly consolidated, partly configurable, and partly hidden away in the code. This makes it difficult to get a good overview of all configuration, and often hinders running quick experiments where a small part of the configuration is changed.

The Dataclass as Configuration

The first step to preventing spaghetti-configuration is to store all configuration in a centralized place from the beginning, preferably in the form of a dataclass or something similar. For a small machine learning project using a gradient boosting classifier, such a configuration object could look similar to this:

from dataclasses import dataclass
from pathlib import Path
from typing import Literal

@dataclass
class Configuration:
    mode: Literal["train", "predict"]

    input_path: Path = Path("input_data/")
    output_path: Path = Path("output_data/")
    model_path: Path = Path("model/model.joblib")

    cv: bool = True
    cv_folds: int = 10

    hp_n_estimators: int = 100
    hp_learning_rate: float = 0.5
    hp_max_depth: int = 8

This contains all of the parameters one might want to change: the mode (whether to train or predict), the paths to data and model artifacts, the cross-validation settings, and the hyperparameters. In a real project there will likely be more parameters, up to hundreds for large projects.

A dataclass provides many benefits over a plain dictionary as a configuration object, the most important being support for static code analysis tools such as ruff and mypy, which help you catch bugs before you introduce them.
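To illustrate this benefit, consider a misspelled field name (the cv_fodls typo below is intentional). With a dictionary the typo silently yields None; with a dataclass it fails loudly at runtime, and mypy flags it before the code is ever run:

```python
from dataclasses import dataclass

@dataclass
class Config:
    cv_folds: int = 10

config_dict = {"cv_folds": 10}
config = Config()

# With a dict, a misspelled key silently returns None:
misspelled = config_dict.get("cv_fodls")

# With a dataclass, the same typo raises an AttributeError,
# and mypy reports it during static analysis:
try:
    config.cv_fodls  # type: ignore[attr-defined]
    typo_detected = False
except AttributeError:
    typo_detected = True
```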

Using a Configuration Dataclass

Below is a simple example for usage of such a configuration dataclass. The mode field determines which function is called, and the predict function in turn uses other fields to load the data and the model, and to write the predictions in the desired location.

from sys import argv

def main():
    mode = argv[1]
    config = Configuration(mode=mode)
    run(config)

def run(config: Configuration) -> None:
    if config.mode == "train":
        train(config)
    else:
        predict(config)

def predict(config: Configuration) -> None:
    input_data = load_prediction_data(config.input_path)
    model = load_model(config.model_path)
    predictions = model.predict(input_data)
    store_predictions(predictions, config.output_path)

def train(config: Configuration) -> None:
    ...

The function main can be registered as a script in the package metadata (see my post on setuptools) to allow easy access from the command line. The example above allows (requires, even) providing the mode as a command line argument, but all the other parameters can only be changed by changing the code. Another issue here is that any value is accepted for mode, possibly resulting in unexpected behaviour.
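For reference, registering main as a console script in pyproject.toml (assuming a setuptools build and a hypothetical package name my_project) could look roughly like this:

```toml
[project.scripts]
my-project = "my_project.main:main"
```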

A far better solution is to use ArgumentParser:

from argparse import ArgumentParser

def main():
    parser = ArgumentParser()
    parser.add_argument("mode", type=str, choices=["train", "predict"])
    parser.add_argument("--input-path", type=Path, default=Path("input_data/"))
    parser.add_argument("--output-path", type=Path, default=Path("output_data/"))
    parser.add_argument("--model-path", type=Path, default=Path("model/model.joblib"))
    namespace = parser.parse_args()
    config = Configuration(
        mode=namespace.mode,
        input_path=namespace.input_path,
        output_path=namespace.output_path,
        model_path=namespace.model_path,
    )
    run(config)

The code in this example verifies that the provided mode is valid, allows changing the input and output paths through command line options, and provides helpful error messages when the input is not correct. Note however that not all parameters are added as arguments, and that the defaults are duplicated here. This means that a new parameter in the configuration would have to be added in three places (in the dataclass, in the call to add_argument and where the dataclass is instantiated) and defaults are stored in the parser as well as the dataclass. All this is more work than necessary and the duplication of defaults is error-prone.

Pydargs

This is where pydargs comes to the rescue. Pydargs configures the ArgumentParser based on the fields in your dataclass and instantiates your dataclass based on the command line arguments. Using pydargs, the above function can be reduced to:

from pydargs import parse

def main():
    config = parse(Configuration)
    run(config)
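To give a feel for what happens under the hood, here is a minimal sketch of the idea: generate add_argument calls from the dataclass fields, then instantiate the dataclass from the parsed namespace. This is an illustration of the mechanism, not pydargs' actual implementation (which handles many more field types):

```python
from argparse import ArgumentParser
from dataclasses import MISSING, dataclass, fields
from pathlib import Path
from typing import Literal, get_args, get_origin

@dataclass
class Configuration:
    mode: Literal["train", "predict"]
    input_path: Path = Path("input_data/")
    cv_folds: int = 10

def parse_dataclass(cls, args=None):
    """Build an ArgumentParser from a dataclass and return an instance of it."""
    parser = ArgumentParser()
    for f in fields(cls):
        if get_origin(f.type) is Literal:
            # Literal fields become restricted choices
            kwargs = {"type": str, "choices": get_args(f.type)}
        else:
            kwargs = {"type": f.type}
        if f.default is MISSING:
            # Fields without a default become required positional arguments
            parser.add_argument(f.name, **kwargs)
        else:
            # Fields with a default become optional --flags
            parser.add_argument(f"--{f.name.replace('_', '-')}", default=f.default, **kwargs)
    return cls(**vars(parser.parse_args(args)))

config = parse_dataclass(Configuration, ["predict", "--cv-folds", "3"])
```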

As a bonus, pydargs supports:

  • a wide variety of input types, such as literals, lists, dates, and more,
  • nested dataclasses, allowing you to separate parts of your configuration (for example, the hyperparameters) into a sub-configuration dataclass, and
  • pydantic dataclasses for extra validation of the input.
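As an example of the nested layout, the hyperparameters from the earlier Configuration could be factored out like this (how pydargs names the resulting command line options is best checked in its documentation):

```python
from dataclasses import dataclass, field

@dataclass
class Hyperparameters:
    n_estimators: int = 100
    learning_rate: float = 0.5
    max_depth: int = 8

@dataclass
class Configuration:
    mode: str = "train"
    # A mutable default requires a default_factory:
    hp: Hyperparameters = field(default_factory=Hyperparameters)

config = Configuration()
```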

Check it out here and clean up your project’s configuration spaghetti!
