Workflows

LipiDetective supports four workflows, all controlled via the workflow section of the YAML config. Only one workflow runs per invocation. If multiple flags are set to True, the first match wins in this order:

  1. tune

  2. train (with or without validate)

  3. test

  4. predict

The validate flag is not a standalone workflow — it controls whether training includes validation (k-fold or custom split).

Note

When model is set to random_forest, the workflow flags are ignored entirely. The random forest runs its own self-contained pipeline via run_random_forest().

workflow:
  train: True
  validate: True    # not a standalone workflow, modifies train
  test: False
  tune: False
  predict: False

Training

Set workflow.train: True to train a model from scratch.

Training reads spectra from the HDF5 file at files.train_input and runs for training.epochs epochs with the specified learning rate and batch size.
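A minimal training configuration might look like the following. The file path and hyperparameter values are placeholders, and the key names for learning rate and batch size are assumptions — check them against your config template:

```yaml
workflow:
  train: True
files:
  train_input: data/train_spectra.h5   # placeholder path
training:
  epochs: 50            # placeholder value
  learning_rate: 0.001  # key names for learning rate and batch size
  batch_size: 64        # are assumptions; check your config template
```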

When workflow.validate: True, the validation split strategy depends on which config fields are set (checked in this order):

  1. Separate validation file — If files.val_input is set, it is used as the validation set directly. Both splitting_instructions and k-fold splitting are ignored.

  2. Custom split — If files.splitting_instructions is set (and val_input is empty), the referenced YAML file defines which lipid species go into the validation set. Data is read from train_input only.

  3. K-fold cross-validation (default) — If neither is set, train_input is split into training.k folds (default: 6) by lipid species. Each fold trains an independent model and reports per-fold metrics.
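The selection order above can be sketched as follows (a hypothetical illustration of the documented precedence, not the actual implementation):

```python
# Hypothetical sketch of the documented split-strategy precedence.
def choose_split_strategy(files_cfg: dict) -> str:
    if files_cfg.get("val_input"):               # 1. separate validation file
        return "separate_file"
    if files_cfg.get("splitting_instructions"):  # 2. custom split
        return "custom_split"
    return "k_fold"                              # 3. default
```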

Set workflow.save_model: True to save the model weights after training. The save location depends on the training mode:

  • Single-split training (no validation or custom split): <files.output>/LipiDetective_Output_<timestamp>/lipidetective_model.pth

  • K-fold cross-validation: one model per fold, e.g. <files.output>/LipiDetective_Output_<timestamp>/fold_1/lipidetective_model.pth

Validation

Validation is not a standalone workflow — it is enabled by setting workflow.validate: True together with workflow.train: True. Setting validate: True without train: True has no effect.

The validation data source depends on the split strategy described above (val_input → splitting_instructions → k-fold). Metrics include loss, accuracy, and lipid-class-wise performance.

Testing

Set workflow.test: True to evaluate on the test set.

Testing uses the dataset at files.test_input and produces detailed metrics including:

  • Overall accuracy

  • Per-lipid-class confusion matrices

  • Confidence scores (when test.confidence_score: True)
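A minimal test configuration using the keys above (the input path is a placeholder):

```yaml
workflow:
  test: True
files:
  test_input: data/test_spectra.h5   # placeholder path
test:
  confidence_score: True   # also report confidence scores
```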

Prediction

Set workflow.predict: True to identify lipids in new mzML spectra.

Prediction reads mzML files from files.predict_input and outputs identifications to the files.output directory. Key prediction settings:

  • predict.output — "best_prediction" for top-1 or "top3" for top-3

  • predict.confidence_threshold — Minimum confidence to report (default: 0.98)

  • predict.keep_empty — Whether to include unidentified spectra

  • predict.keep_wrong_polarity_preds — Whether to keep polarity-mismatched results

A pre-trained model must be loaded (workflow.load_model: True).
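The confidence_threshold and keep_empty options correspond roughly to a post-filtering step like the following sketch. Representing each prediction as a (label, confidence) pair is an assumption about the output format, made here only for illustration:

```python
def filter_predictions(preds, threshold=0.98, keep_empty=False):
    """Keep predictions at or above the confidence threshold.

    `preds` is assumed to be a list of (label, confidence) pairs, one
    per spectrum; a spectrum whose prediction falls below the threshold
    is treated as unidentified.
    """
    results = []
    for label, confidence in preds:
        if confidence >= threshold:
            results.append((label, confidence))
        elif keep_empty:
            results.append((None, confidence))  # unidentified spectrum
    return results
```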

Hyperparameter Tuning

Set workflow.tune: True to run automated hyperparameter search using Ray Tune.

Tuning uses the ASHA scheduler for early stopping and supports WandB logging. Configure via the tune section:

  • tune.nr_trials — Number of hyperparameter combinations to try

  • tune.grace_period — Minimum epochs before a trial can be stopped

  • tune.resources_per_trial — CPU/GPU allocation (null for auto-detect)
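An example tune section using the keys above (trial count and grace period are placeholder values):

```yaml
workflow:
  tune: True
tune:
  nr_trials: 20              # placeholder value
  grace_period: 3            # placeholder value
  resources_per_trial: null  # auto-detect CPUs/GPUs
```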

Trainer API

The Trainer class orchestrates all workflows:

class Trainer(config: dict[str, Any])

The Trainer class creates the Lightning module and executes the specified workflow. It handles processing and splitting of the dataset.

train_with_validation() → None
train_without_validation() → None
test() → None

This loop analyzes the model's performance on a previously unseen labeled test dataset.

predict() → None

Predicts lipid species for unlabeled data. Input is one or more mzML files.

get_pred_files() → list[str]
schedule_tuning() → None
tune_model(config: dict[str, Any], num_epochs: int, train_loader: DataLoader[Any], val_loader: DataLoader[Any], trainset_lipids: list[str], valset_lipids: list[str]) → None
prepare_tune_config() → None
check_parameter_for_tuning(config_section: str, parameter: str) → None
parse_lipid_dataset_name(lipid_name: str) → str
get_unique_lipids(dataset_list: list[str]) → list[str]
perform_data_split() → DataSplit

This method extracts the names of all datasets in the HDF5 input file and saves them in separate lists for the training and validation sets. These lists can then be used to iterate over the dataset with lazy loading when the whole dataset is too large to load at once.

split_data_via_instructions(dataset_names: list[str]) → DataSplit
split_data_by_lipid_species(dataset_names: list[str]) → DataSplit

Sorts the data by lipid type, mode, and collision energy so that the training and validation sets generated during splitting can be balanced.

run_random_forest() → None
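A species-level split like the one performed by split_data_by_lipid_species can be sketched as follows. This is hypothetical: it only illustrates splitting by lipid species rather than by individual spectra, and omits the actual sorting by mode and collision energy:

```python
def split_species_fold(species: list[str], k: int = 6, fold: int = 0):
    # Assign each unique lipid species to a fold round-robin, then hold
    # out one fold for validation. Splitting at the species level ensures
    # no species appears in both the training and validation sets.
    unique = sorted(set(species))
    val = [s for i, s in enumerate(unique) if i % k == fold]
    train = [s for i, s in enumerate(unique) if i % k != fold]
    return train, val
```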