Workflows
LipiDetective supports four workflows, all controlled via the workflow
section of the YAML config. Only one workflow runs per invocation. If
multiple flags are set to True, the first match wins in this order:
tune → train (with or without validate) → test → predict
The validate flag is not a standalone workflow — it controls whether
training includes validation (k-fold or custom split).
Note
When model is set to random_forest, the workflow flags are ignored
entirely. The random forest runs its own self-contained pipeline via
run_random_forest().
workflow:
train: True
validate: True # not a standalone workflow, modifies train
test: False
tune: False
predict: False
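The first-match-wins rule above can be sketched as a small dispatch function. This is an illustrative helper, not part of the LipiDetective API; it mirrors the config example, where train wins even though validate is also True.

```python
def resolve_workflow(config):
    """Return the single workflow to run, using the documented
    first-match-wins order: tune, train, test, predict."""
    flags = config["workflow"]
    for name in ("tune", "train", "test", "predict"):
        if flags.get(name):
            return name
    return None

# Mirrors the YAML example above: train and validate are both True.
config = {"workflow": {"train": True, "validate": True,
                       "test": False, "tune": False, "predict": False}}
print(resolve_workflow(config))  # prints "train"
```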
Training
Set workflow.train: True to train a model from scratch.
Training reads spectra from the HDF5 file at files.train_input and runs for
training.epochs epochs with the specified learning rate and batch size.
When workflow.validate: True, the validation split strategy depends on
which config fields are set (checked in this order):
- Separate validation file: if files.val_input is set, it is used as the validation set directly. Both splitting_instructions and k-fold splitting are ignored.
- Custom split: if files.splitting_instructions is set (and val_input is empty), the referenced YAML file defines which lipid species go into the validation set. Data is read from train_input only.
- K-fold cross-validation (default): if neither is set, train_input is split into training.k folds (default: 6) by lipid species. Each fold trains an independent model and reports per-fold metrics.
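The precedence between the three strategies can be sketched as follows. The helper name and return values are hypothetical; only the checking order (val_input, then splitting_instructions, then k-fold) comes from the documentation above.

```python
def select_split_strategy(files, k=6):
    """Pick the validation split strategy in the documented order:
    val_input > splitting_instructions > k-fold (default)."""
    if files.get("val_input"):
        return "separate_file"
    if files.get("splitting_instructions"):
        return "custom_split"
    return f"{k}-fold"

print(select_split_strategy({"val_input": "val.h5"}))               # separate_file
print(select_split_strategy({"splitting_instructions": "s.yaml"}))  # custom_split
print(select_split_strategy({}))                                    # 6-fold
```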
After training, set workflow.save_model: True to save model weights.
The save location depends on the training mode:
- Single-split training (no validation or custom split): <files.output>/LipiDetective_Output_<timestamp>/lipidetective_model.pth
- K-fold cross-validation: one model per fold, e.g. <files.output>/LipiDetective_Output_<timestamp>/fold_1/lipidetective_model.pth
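The save layout above can be sketched in a few lines. This is illustrative only: the timestamp format is an assumption, and the actual path construction in LipiDetective may differ.

```python
from datetime import datetime
from pathlib import Path

def model_save_path(output_dir, fold=None):
    """Build a save path following the documented layout (sketch)."""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # assumed format
    run_dir = Path(output_dir) / f"LipiDetective_Output_{timestamp}"
    if fold is not None:
        # K-fold training saves one model per fold in a fold_<n> subdirectory.
        run_dir = run_dir / f"fold_{fold}"
    return run_dir / "lipidetective_model.pth"
```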
Validation
Validation is not a standalone workflow — it is enabled by setting
workflow.validate: True together with workflow.train: True.
Setting validate: True without train: True has no effect.
The validation data source depends on the split strategy described above
(val_input → splitting_instructions → k-fold). Metrics include loss,
accuracy, and lipid-class-wise performance.
Testing
Set workflow.test: True to evaluate on the test set.
Testing uses the dataset at files.test_input and produces detailed metrics
including:
- Overall accuracy
- Per-lipid-class confusion matrices
- Confidence scores (when test.confidence_score: True)
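A minimal test-run configuration using the fields described above (the input path is illustrative):

```yaml
workflow:
  test: True
  load_model: True   # evaluate a previously trained model
files:
  test_input: data/test_spectra.h5   # illustrative path
test:
  confidence_score: True   # also report per-spectrum confidence
```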
Prediction
Set workflow.predict: True to identify lipids in new mzML spectra.
Prediction reads mzML files from files.predict_input and outputs
identifications to the files.output directory. Key prediction settings:
- predict.output: "best_prediction" for top-1 or "top3" for top-3
- predict.confidence_threshold: minimum confidence to report (default: 0.98)
- predict.keep_empty: whether to include unidentified spectra
- predict.keep_wrong_polarity_preds: whether to keep polarity-mismatched results
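How the confidence threshold and keep_empty settings might interact can be sketched as below. This is a hypothetical helper for illustration, not the LipiDetective implementation; each prediction is represented as a (lipid-or-None, confidence) pair.

```python
def filter_predictions(predictions, confidence_threshold=0.98, keep_empty=False):
    """Keep confident identifications; optionally keep unidentified
    spectra as empty entries (sketch of the documented behavior)."""
    kept = []
    for lipid, confidence in predictions:
        if lipid is None or confidence < confidence_threshold:
            # Unidentified or below-threshold spectrum.
            if keep_empty:
                kept.append((None, confidence))
        else:
            kept.append((lipid, confidence))
    return kept

preds = [("PC 34:1", 0.99), ("PE 36:2", 0.80), (None, 0.0)]
print(filter_predictions(preds))                   # only PC 34:1 passes
print(filter_predictions(preds, keep_empty=True))  # low-confidence entries kept as empty
```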
A pre-trained model must be loaded (workflow.load_model: True).
Hyperparameter Tuning
Set workflow.tune: True to run automated hyperparameter search using
Ray Tune.
Tuning uses the ASHA scheduler for early stopping and supports WandB logging.
Configure via the tune section:
- tune.nr_trials: number of hyperparameter combinations to try
- tune.grace_period: minimum epochs before a trial can be stopped
- tune.resources_per_trial: CPU/GPU allocation (null for auto-detect)
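Putting the three settings together, a tune section could look like this (the values are illustrative, not recommended defaults):

```yaml
tune:
  nr_trials: 20              # try 20 hyperparameter combinations
  grace_period: 3            # let each trial run at least 3 epochs
  resources_per_trial: null  # auto-detect CPU/GPU allocation
```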
Trainer API
The Trainer class orchestrates all workflows:
- class Trainer(config: dict[str, Any])
The Trainer class creates the Lightning module and executes the specified workflow. It handles processing and splitting of the dataset.
- test() → None
This loop analyzes the model's performance on a previously unseen labeled test dataset.
- tune_model(config: dict[str, Any], num_epochs: int, train_loader: DataLoader[Any], val_loader: DataLoader[Any], trainset_lipids: list[str], valset_lipids: list[str]) → None
- perform_data_split() → DataSplit
This method extracts the names of all datasets in the HDF5 input file and saves them in separate lists for the training and validation sets. These lists can then be used to iterate over the dataset with lazy loading if the whole dataset is too big to be loaded at once.
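The fold construction that perform_data_split enables can be sketched as a group-wise split by lipid species. This is illustrative only, not the actual implementation: each species appears in the validation set of exactly one fold.

```python
def kfold_by_species(species_names, k=6):
    """Split dataset names into k folds by lipid species (sketch).
    Returns a list of (train_names, val_names) pairs, one per fold."""
    folds = [species_names[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, val))
    return splits

species = [f"lipid_{n}" for n in range(12)]
splits = kfold_by_species(species, k=6)
print(len(splits), len(splits[0][1]))  # 6 folds, 2 validation species each
```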