Configuration

LipiDetective is configured entirely through a YAML file passed via the --config flag. A template is provided at config/config_templates/config_transformer.yaml.

$ uv run lipidetective --config path/to/config.yaml

Model Selection

model: 'transformer'

Choose the model architecture. Options:

  • transformer — Encoder-decoder transformer for sequence generation (recommended)

  • convolutional — 3-layer CNN for regression

  • feedforward — Fully connected network

  • random_forest — Scikit-learn random forest classifier

CUDA

cuda:
  gpu_nr: [1]

Select which GPUs to use. Pass a list of GPU indices, e.g. [0, 1] for multi-GPU training.

File Paths

files:
  train_input: 'processed/train_dataset.hdf5'
  val_input: 'processed/val_dataset.hdf5'
  test_input: 'processed/test_dataset.hdf5'
  predict_input: 'raw/sample.mzML'
  saved_model: 'lipidetective_model.pth'
  output: 'output'
  splitting_instructions: 'validation_splits/train_val_split.yaml'

All paths can be relative or absolute. Relative paths are resolved from default base directories:

Path type                                                         Base directory   Examples
Data paths (train_input, val_input, test_input, predict_input)    data/            HDF5 or mzML files
Model paths (saved_model)                                         models/          Saved model weights
Output paths (output)                                             experiments/     Experiment results
Config paths (splitting_instructions)                             config/          Validation split definitions

Absolute paths are used as-is.

Validation Split Precedence

When training with validation (workflow.validate: True), the data split strategy is determined by which fields are set, checked in this order:

  1. val_input — If set, training and validation use separate HDF5 files. Both splitting_instructions and k-fold splitting are ignored.

  2. splitting_instructions — If set (and val_input is empty), the referenced YAML file defines which lipid species go into the validation set. Data is read from train_input only.

  3. K-fold (default) — If neither is set, train_input is split into training.k folds by lipid species for cross-validation.
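The three-step precedence above can be sketched as a small helper (a hypothetical function for illustration, not part of the LipiDetective API):

```python
def pick_split_strategy(files: dict, k: int) -> str:
    """Return which validation split strategy applies, mirroring the
    precedence described above. Purely illustrative."""
    if files.get('val_input'):
        return 'separate_files'      # 1. dedicated validation HDF5 file
    if files.get('splitting_instructions'):
        return 'split_instructions'  # 2. species list from a YAML file
    return f'{k}-fold'               # 3. default k-fold cross-validation

print(pick_split_strategy({'val_input': '', 'splitting_instructions': ''}, 6))  # -> '6-fold'
```

Note that a non-empty val_input wins even when splitting_instructions is also set; clear val_input if you want the YAML-defined split to take effect.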

Environment Variable Overrides

Override default base directories using environment variables:

Variable                    Description
LIPIDETECTIVE_DATA_DIR      Base directory for data files
LIPIDETECTIVE_MODELS_DIR    Base directory for model files
LIPIDETECTIVE_OUTPUT_DIR    Base directory for experiment outputs
LIPIDETECTIVE_CONFIG_DIR    Base directory for config files (e.g. splitting instructions)

Example:

export LIPIDETECTIVE_DATA_DIR=/mnt/data/lipidomics
uv run lipidetective --config config.yaml

Programmatic Access

Path resolution functions can be used directly in Python:

from lipidetective import (
    get_project_root,
    resolve_data_path,
    resolve_model_path,
    resolve_output_path,
    resolve_config_path,
)

data_file = resolve_data_path('processed/dataset.hdf5')
model_file = resolve_model_path('lipidetective_model.pth')
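The resolution rules described above (absolute paths pass through; relative paths get the environment-variable base if set, else the default base directory) presumably reduce to the following pattern. This is an illustrative re-implementation, not the actual resolve_*_path source:

```python
import os
from pathlib import Path

def resolve_path_sketch(path: str, env_var: str, default_base: str) -> Path:
    """Sketch of the path resolution rules described above; the real
    resolve_*_path functions may differ in detail."""
    p = Path(path)
    if p.is_absolute():
        return p                                     # absolute paths are used as-is
    base = Path(os.environ.get(env_var, default_base))
    return base / p                                  # prepend env-var or default base

print(resolve_path_sketch('processed/dataset.hdf5', 'LIPIDETECTIVE_DATA_DIR', 'data'))
```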

Workflow

workflow:
  train: False
  validate: False
  test: False
  tune: False
  predict: True
  save_model: False
  load_model: True
  log_every_n_steps: 10

Key                  Default   Description
train                False     Train the model
validate             False     Enable validation during training (requires train: True)
test                 False     Evaluate on the test set
tune                 False     Run hyperparameter tuning with Ray Tune
predict              True      Run prediction on mzML input
save_model           False     Save model weights to the output directory after training
load_model           True      Load pre-trained model weights from files.saved_model
log_every_n_steps    10        PyTorch Lightning logging frequency

Warning

When loading a pre-trained model (load_model: True), the transformer, input_embedding, and model settings in your config must exactly match the config used to train that model. The saved .pth file contains only weight tensors — no architecture metadata. If any parameter differs (e.g. d_model, num_heads, n_peaks), PyTorch will raise a RuntimeError due to mismatched tensor shapes.

Always keep the config file that was used for training alongside the saved model.
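To illustrate why the configs must match (a torch-free sketch): loading compares each saved tensor's shape against the tensors of the freshly built model, name by name. Simulating state dicts as name-to-shape mappings:

```python
def check_state_dict(model_shapes: dict, saved_shapes: dict) -> list:
    """Collect the shape mismatches that torch.nn.Module.load_state_dict
    would report before raising RuntimeError. Illustrative only."""
    return [name for name, shape in model_shapes.items()
            if saved_shapes.get(name) != shape]

# Model built with d_model=32 vs. weights trained with d_model=64
# (layer name and vocabulary size are made up for the example):
model = {'encoder.embed.weight': (16000, 32)}
saved = {'encoder.embed.weight': (16000, 64)}
print(check_state_dict(model, saved))  # ['encoder.embed.weight'] -> would fail to load
```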

Training

training:
  k: 6
  learning_rate: 0.004
  lr_step: 2
  epochs: 15
  batch: 512
  nr_workers: 0

Key              Default   Description
k                6         Number of folds for k-fold cross-validation
learning_rate    0.004     Initial learning rate
lr_step          2         Step size for learning rate scheduler
epochs           15        Number of training epochs
batch            512       Training batch size
nr_workers       0         DataLoader worker processes (0 = main process)
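The lr_step setting presumably drives a StepLR-style scheduler, which multiplies the learning rate by a decay factor every lr_step epochs. A minimal sketch of the resulting schedule (the decay factor gamma is hypothetical; the config above does not expose it):

```python
def stepped_lr(initial_lr: float, lr_step: int, epoch: int, gamma: float = 0.1) -> float:
    """Learning rate at a given epoch under a StepLR-style schedule:
    multiplied by gamma once every lr_step epochs. gamma is a made-up
    default for illustration."""
    return initial_lr * gamma ** (epoch // lr_step)

# With the defaults above (learning_rate=0.004, lr_step=2):
print([stepped_lr(0.004, 2, e) for e in range(6)])
```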

Test

test:
  batch: 512
  confidence_score: True

Key                 Default   Description
batch               512       Test batch size
confidence_score    True      Compute confidence scores for predictions

Prediction

predict:
  output: "best_prediction"
  batch: 512
  save_spectrum: False
  confidence_threshold: 0.98
  keep_empty: False
  keep_wrong_polarity_preds: False

Key                          Default             Description
output                       "best_prediction"   "best_prediction" for top-1 or "top3" for top-3 results
batch                        512                 Prediction batch size
save_spectrum                False               Include raw spectrum data in output
confidence_threshold         0.98                Minimum confidence to report a prediction
keep_empty                   False               Include spectra with no confident prediction
keep_wrong_polarity_preds    False               Keep predictions with mismatched polarity
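The interaction of confidence_threshold and keep_empty can be sketched as a filter over per-spectrum results. The tuple layout (spectrum id, predicted label or None, confidence) is invented for the example; the actual output format differs:

```python
def filter_predictions(preds, threshold=0.98, keep_empty=False):
    """Apply confidence_threshold / keep_empty as described above.
    preds: list of (spectrum_id, label_or_None, confidence). Illustrative only."""
    out = []
    for spectrum_id, label, conf in preds:
        confident = label is not None and conf >= threshold
        if confident:
            out.append((spectrum_id, label, conf))
        elif keep_empty:
            out.append((spectrum_id, None, conf))  # reported as an empty prediction
    return out

preds = [('s1', 'PC 34:1', 0.99), ('s2', 'PE 36:2', 0.50), ('s3', None, 0.0)]
print(filter_predictions(preds))                   # only the confident hit survives
print(filter_predictions(preds, keep_empty=True))  # low-confidence rows kept, but emptied
```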

Hyperparameter Tuning

tune:
  nr_trials: 1
  grace_period: 2
  resources_per_trial: null

Key                    Default   Description
nr_trials              1         Number of Ray Tune trials
grace_period           2         Minimum epochs before early stopping (ASHA scheduler)
resources_per_trial    null      CPU/GPU allocation per trial (null = auto-detect)

Input Embedding

input_embedding:
  n_peaks: 30
  max_mz: 1600
  decimal_accuracy: 1

Key                 Default   Description
n_peaks             30        Number of highest-intensity peaks to retain
max_mz              1600      Maximum m/z value for spectrum binning
decimal_accuracy    1         Decimal precision for m/z values
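Taken together, the three settings describe a peak-selection step: discard peaks above max_mz, keep the n_peaks most intense ones, and round m/z values to decimal_accuracy digits. A sketch of that step under these assumptions (the real preprocessing may differ in order or detail):

```python
def select_peaks(peaks, n_peaks=30, max_mz=1600, decimal_accuracy=1):
    """Sketch of the preprocessing implied by the input_embedding settings.
    peaks: list of (mz, intensity) pairs. Illustrative only."""
    peaks = [(mz, inten) for mz, inten in peaks if mz <= max_mz]        # cap at max_mz
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]   # top-n by intensity
    return [(round(mz, decimal_accuracy), inten) for mz, inten in peaks]

spectrum = [(760.585, 100.0), (183.066, 40.0), (1700.2, 5.0)]
print(select_peaks(spectrum, n_peaks=2))
```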

Transformer

transformer:
  d_model: 32
  num_heads: 4
  dropout: 0.1
  ffn_hidden: 256
  num_layers: 2
  output_seq_length: 11

Key                  Default   Description
d_model              32        Embedding dimension (must be divisible by num_heads)
num_heads            4         Number of attention heads
dropout              0.1       Dropout rate
ffn_hidden           256       Hidden dimension of the feed-forward layers
num_layers           2         Number of encoder/decoder layers
output_seq_length    11        Maximum output token sequence length
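The divisibility constraint on d_model exists because multi-head attention splits the embedding evenly across heads. A quick check you can apply to your own values (hypothetical helper, not part of LipiDetective):

```python
def check_transformer_config(d_model: int, num_heads: int) -> int:
    """Validate the constraint noted above and return the per-head dimension."""
    if d_model % num_heads != 0:
        raise ValueError(f'd_model={d_model} is not divisible by num_heads={num_heads}')
    return d_model // num_heads

print(check_transformer_config(32, 4))  # 8 dimensions per attention head
```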

WandB Integration

wandb:
  group: 'Debugging'

Uncomment the wandb section in the config to enable Weights & Biases experiment tracking. The group field organizes runs within a WandB project.

Comment

comment: 'Info on the purpose of the current run'

A free-text field to annotate the purpose of the current experiment. Logged with the run metadata.