In this blog post I introduce pydargs as a helpful tool for configuration management in Python projects.
The Evolution of Configuration in a Machine Learning Project
In my career as a data scientist I have seen many projects start as simple proof-of-concept scripts or notebooks, be consolidated into minimum viable products, and eventually grow into large codebases as they were improved and expanded. If no care is taken, configuration values for the models, data sources, and so on end up scattered all over the codebase.
Eventually, a part of this configuration is consolidated into a single place, for example in a constants.py file with global variables, or in a dataclass. Often large parts remain behind as they never change anyway, and other parts are added to the argument parser as they change so often.
The end result is configuration that is partly consolidated, partly configurable, and partly hidden away in the code. This makes it difficult to get a good overview of all configuration, and often hinders running quick experiments where a small part of the configuration is changed.
The Dataclass as Configuration
The first step to preventing spaghetti-configuration is to store all configuration in a centralized place from the beginning, preferably in the form of a dataclass or something similar. For a small machine learning project using a gradient boosting classifier, such a configuration object could look similar to this:
from dataclasses import dataclass
from pathlib import Path
from typing import Literal


@dataclass
class Configuration:
    mode: Literal["train", "predict"]
    input_path: Path = Path("input_data/")
    output_path: Path = Path("output_data/")
    model_path: Path = Path("model/model.joblib")
    cv: bool = True
    cv_folds: int = 10
    hp_n_estimators: int = 100
    hp_learning_rate: float = 0.5
    hp_max_depth: int = 8
This contains, to some extent, all of the parameters one might want to change: from mode (whether to train or predict), to paths to data and model objects, and from cross-validation settings to hyperparameters. In a real project there will likely be more parameters, up to hundreds for large projects.
A dataclass provides you with many benefits over using a dictionary as a configuration object, the most important being support for static code analysis tools such as ruff and mypy to help you catch bugs before you introduce them.
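To make the difference with a dictionary concrete, here is a minimal sketch. The names `SmallConfiguration` and `dict_config` are invented for this illustration and hold only two of the fields from the example above.

```python
from dataclasses import dataclass


# Hypothetical, trimmed-down configuration used only for this illustration.
@dataclass
class SmallConfiguration:
    cv_folds: int = 10
    hp_learning_rate: float = 0.5


dict_config = {"cv_folds": 10, "hp_learning_rate": 0.5}
config = SmallConfiguration()

# A misspelled dictionary key is only discovered when this line actually runs.
try:
    dict_config["cv_fold"]
    typo_detected_at_runtime = False
except KeyError:
    typo_detected_at_runtime = True

# The equivalent typo on the dataclass, `config.cv_fold`, is something a
# static type checker can report without running the code at all.
folds = config.cv_folds
```

With the dictionary, the typo surfaces as a `KeyError` somewhere in a running job; with the dataclass, a type checker can flag the misspelled attribute in the editor or in CI.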
Using a Configuration Dataclass
Below is a simple example for usage of such a configuration dataclass. The mode field determines which function is called, and the predict function in turn uses other fields to load the data and the model, and to write the predictions in the desired location.
from sys import argv


def main():
    mode = argv[1]
    config = Configuration(mode=mode)
    run(config)


def run(config: Configuration) -> None:
    if config.mode == "train":
        train(config)
    else:
        predict(config)


def predict(config: Configuration) -> None:
    input_data = load_prediction_data(config.input_path)
    model = load_model(config.model_path)
    predictions = model.predict(input_data)
    store_predictions(predictions, config.output_path)


def train(config: Configuration) -> None:
    ...
The function main can be registered as a script in the package metadata (see my post on setuptools) to allow easy access from the command line. The example above allows (requires, even) providing the mode as a command line argument, but all the other parameters can only be changed by changing the code. Another issue here is that any value is accepted for mode, possibly resulting in unexpected behaviour.
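The last point is easy to demonstrate: a standard library dataclass does not check its type annotations at runtime. The sketch below uses a stripped-down `Configuration` with only the mode field, and a `run` function that returns the name of the selected branch instead of calling train or predict; both simplifications are mine.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Configuration:
    mode: Literal["train", "predict"]


# The Literal annotation is not enforced at runtime, so this does not raise.
config = Configuration(mode="evaluate")  # type: ignore[arg-type]


def run(config: Configuration) -> str:
    # Mirrors the if/else dispatch above: anything that is not "train"
    # silently falls through to the predict branch.
    if config.mode == "train":
        return "train"
    return "predict"


selected = run(config)
```

Here a mistyped mode quietly ends up in the predict branch rather than producing an error.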
A far better solution is to use ArgumentParser:
from argparse import ArgumentParser
from pathlib import Path


def main():
    parser = ArgumentParser()
    parser.add_argument("mode", type=str, choices=["train", "predict"])
    # argparse stores "--input-path" as the attribute "input_path", and so on.
    parser.add_argument("--input-path", type=Path, default=Path("input_data/"))
    parser.add_argument("--output-path", type=Path, default=Path("output_data/"))
    parser.add_argument("--model-path", type=Path, default=Path("model/model.joblib"))
    namespace = parser.parse_args()

    # A Namespace is not subscriptable; parsed values are read as attributes.
    config = Configuration(
        mode=namespace.mode,
        input_path=namespace.input_path,
        output_path=namespace.output_path,
        model_path=namespace.model_path,
    )
    run(config)
The code in this example verifies that the provided mode is valid, allows changing the input and output paths through command line options, and provides helpful error messages when the input is not correct. Note however that not all parameters are added as arguments, and that the defaults are duplicated here. This means that a new parameter in the configuration would have to be added in three places (in the dataclass, in the call to add_argument and where the dataclass is instantiated) and defaults are stored in the parser as well as the dataclass. All this is more work than necessary and the duplication of defaults is error-prone.
Pydargs
This is where pydargs comes to the rescue. Pydargs configures the ArgumentParser based on the fields in your dataclass and instantiates your dataclass based on the command line arguments. Using pydargs, the above function can be reduced to:
from pydargs import parse


def main():
    config = parse(Configuration)
    run(config)
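To give a feel for what deriving a parser from a dataclass involves, here is a simplified sketch using only the standard library. It is not how pydargs is implemented, and it only handles fields whose type can be called on a string (such as str, int or Path). The names `SmallConfiguration` and `parse_into` are invented for this illustration.

```python
from argparse import ArgumentParser
from dataclasses import MISSING, dataclass, fields
from pathlib import Path


# Hypothetical, trimmed-down configuration; mode is a plain str here because
# this sketch does not handle Literal types.
@dataclass
class SmallConfiguration:
    mode: str
    input_path: Path = Path("input_data/")
    cv_folds: int = 10


def parse_into(config_class, args=None):
    """Build a parser from the dataclass fields and return an instance."""
    parser = ArgumentParser()
    for field in fields(config_class):
        if field.default is MISSING:
            # No default in the dataclass: treat it as a required positional.
            # (Fields with a default_factory are not handled in this sketch.)
            parser.add_argument(field.name, type=field.type)
        else:
            # The default lives only in the dataclass and is reused here.
            flag = "--" + field.name.replace("_", "-")
            parser.add_argument(flag, type=field.type, default=field.default)
    namespace = parser.parse_args(args)
    return config_class(**vars(namespace))


config = parse_into(SmallConfiguration, ["train", "--cv-folds", "5"])
```

Adding a field to the dataclass is now enough to make it available on the command line, and each default is defined in exactly one place. A library like pydargs takes care of this, along with the many cases the sketch ignores.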
As a bonus, pydargs supports:
- a wide variety of input types, such as literals, lists, dates, and more,
- nested dataclasses, allowing you to separate parts of your configuration (for example hyperparameters) into a sub-configuration dataclass, and
- pydantic dataclasses for extra validation of the input.
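As an illustration of the nested layout mentioned in the list above, the hyperparameters from the earlier example could be moved into their own dataclass. This sketch uses only standard library dataclasses; the class names are mine, and how pydargs exposes the nested fields on the command line is described in its documentation rather than shown here.

```python
from dataclasses import dataclass, field
from typing import Literal


# Hypothetical sub-configuration grouping the hp_* fields from before.
@dataclass
class Hyperparameters:
    n_estimators: int = 100
    learning_rate: float = 0.5
    max_depth: int = 8


@dataclass
class NestedConfiguration:
    mode: Literal["train", "predict"]
    # default_factory gives every configuration its own Hyperparameters object.
    hyperparameters: Hyperparameters = field(default_factory=Hyperparameters)


config = NestedConfiguration(mode="train")
max_depth = config.hyperparameters.max_depth
```

Functions that only need the hyperparameters can then accept `config.hyperparameters` instead of the full configuration object.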
Check it out here and clean up your project’s configuration spaghetti!