Configuration

LipiDetective is configured entirely through a YAML file passed via the --config flag. A template is provided at config/config_templates/config_transformer.yaml.

$ uv run lipidetective --config path/to/config.yaml

Model Selection

model: 'transformer'

Choose the model architecture. Options:

  • transformer — Encoder-decoder transformer for sequence generation (recommended)

  • convolutional — 3-layer CNN for regression

  • feedforward — Fully connected network

  • random_forest — Scikit-learn random forest classifier

CUDA

cuda:
  gpu_nr: [1]

Select which GPUs to use. Pass a list of GPU indices, e.g. [0, 1] for multi-GPU training.

File Paths

files:
  train_input: 'processed/train_dataset.hdf5'
  val_input: 'processed/val_dataset.hdf5'
  test_input: 'processed/test_dataset.hdf5'
  predict_input: 'raw/sample.mzML'
  saved_model: 'lipidetective_model.pth'
  output: 'output'
  splitting_instructions: 'validation_splits/train_val_split.yaml'

All paths can be relative or absolute. Relative paths are resolved from default base directories:

Path type                                                         Base directory   Examples
Data paths (train_input, val_input, test_input, predict_input)    data/            HDF5 or mzML files
Model paths (saved_model)                                         models/          Saved model weights
Output paths (output)                                             experiments/     Experiment results
Config paths (splitting_instructions)                             config/          Validation split definitions

Absolute paths are used as-is.

Validation Split Precedence

When training with validation (workflow.validate: True), the data split strategy is determined by which fields are set, checked in this order:

  1. val_input — If set, training and validation use separate HDF5 files. Both splitting_instructions and k-fold splitting are ignored.

  2. splitting_instructions — If set (and val_input is empty), the referenced YAML file defines which lipid species go into the validation set. Data is read from train_input only.

  3. K-fold (default) — If neither is set, train_input is split into training.k folds by lipid species for cross-validation.
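The three-step precedence above can be sketched as a small helper (a hypothetical function for illustration, not part of the LipiDetective API):

```python
def pick_split_strategy(files: dict, k: int) -> str:
    """Return which validation split strategy applies, mirroring the
    precedence described above. Purely illustrative."""
    if files.get('val_input'):
        return 'separate_files'      # 1. dedicated validation HDF5 file
    if files.get('splitting_instructions'):
        return 'split_instructions'  # 2. species list from a YAML file
    return f'{k}-fold'               # 3. default k-fold cross-validation

print(pick_split_strategy({'val_input': '', 'splitting_instructions': ''}, 6))  # -> '6-fold'
```

Note that a non-empty val_input wins even when splitting_instructions is also set; clear val_input if you want the YAML-defined split to take effect.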

Environment Variable Overrides

Override default base directories using environment variables:

Variable                    Description
LIPIDETECTIVE_DATA_DIR      Base directory for data files
LIPIDETECTIVE_MODELS_DIR    Base directory for model files
LIPIDETECTIVE_OUTPUT_DIR    Base directory for experiment outputs
LIPIDETECTIVE_CONFIG_DIR    Base directory for config files (e.g. splitting instructions)

Example:

export LIPIDETECTIVE_DATA_DIR=/mnt/data/lipidomics
uv run lipidetective --config config.yaml

Programmatic Access

Path resolution functions can be used directly in Python:

from lipidetective import (
    get_project_root,
    resolve_data_path,
    resolve_model_path,
    resolve_output_path,
    resolve_config_path,
)

data_file = resolve_data_path('processed/dataset.hdf5')
model_file = resolve_model_path('lipidetective_model.pth')
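The resolution rules described above (absolute paths pass through; relative paths get the environment-variable base if set, else the default base directory) presumably reduce to the following pattern. This is an illustrative re-implementation, not the actual resolve_*_path source:

```python
import os
from pathlib import Path

def resolve_path_sketch(path: str, env_var: str, default_base: str) -> Path:
    """Sketch of the path resolution rules described above; the real
    resolve_*_path functions may differ in detail."""
    p = Path(path)
    if p.is_absolute():
        return p                                     # absolute paths are used as-is
    base = Path(os.environ.get(env_var, default_base))
    return base / p                                  # prepend env-var or default base

print(resolve_path_sketch('processed/dataset.hdf5', 'LIPIDETECTIVE_DATA_DIR', 'data'))
```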

Workflow

workflow:
  train: False
  validate: False
  test: False
  tune: False
  predict: True
  save_model: False
  load_model: True
  log_every_n_steps: 10

Key                  Default   Description
train                False     Train the model
validate             False     Enable validation during training (requires train: True)
test                 False     Evaluate on the test set
tune                 False     Run hyperparameter tuning with Ray Tune
predict              True      Run prediction on mzML input
save_model           False     Save model weights to the output directory after training
load_model           True      Load pre-trained model weights from files.saved_model
log_every_n_steps    10        PyTorch Lightning logging frequency

Warning

When loading a pre-trained model (load_model: True), the transformer, input_embedding, and model settings in your config must exactly match the config used to train that model. The saved .pth file contains only weight tensors — no architecture metadata. If any parameter differs (e.g. d_model, num_heads, n_peaks), PyTorch will raise a RuntimeError due to mismatched tensor shapes.

Always keep the config file that was used for training alongside the saved model.
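To illustrate why the configs must match (a torch-free sketch): loading compares each saved tensor's shape against the tensors of the freshly built model, name by name. Simulating state dicts as name-to-shape mappings:

```python
def check_state_dict(model_shapes: dict, saved_shapes: dict) -> list:
    """Collect the shape mismatches that torch.nn.Module.load_state_dict
    would report before raising RuntimeError. Illustrative only."""
    return [name for name, shape in model_shapes.items()
            if saved_shapes.get(name) != shape]

# Model built with d_model=32 vs. weights trained with d_model=64
# (layer name and vocabulary size are made up for the example):
model = {'encoder.embed.weight': (16000, 32)}
saved = {'encoder.embed.weight': (16000, 64)}
print(check_state_dict(model, saved))  # ['encoder.embed.weight'] -> would fail to load
```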

Training

training:
  k: 6
  learning_rate: 0.004
  lr_step: 2
  epochs: 15
  batch: 512
  nr_workers: 0

Key              Default   Description
k                6         Number of folds for k-fold cross-validation
learning_rate    0.004     Initial learning rate
lr_step          2         Step size for learning rate scheduler
epochs           15        Number of training epochs
batch            512       Training batch size
nr_workers       0         DataLoader worker processes (0 = main process)
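The lr_step setting presumably drives a StepLR-style scheduler, which multiplies the learning rate by a decay factor every lr_step epochs. A minimal sketch of the resulting schedule (the decay factor gamma is hypothetical; the config above does not expose it):

```python
def stepped_lr(initial_lr: float, lr_step: int, epoch: int, gamma: float = 0.1) -> float:
    """Learning rate at a given epoch under a StepLR-style schedule:
    multiplied by gamma once every lr_step epochs. gamma is a made-up
    default for illustration."""
    return initial_lr * gamma ** (epoch // lr_step)

# With the defaults above (learning_rate=0.004, lr_step=2):
print([stepped_lr(0.004, 2, e) for e in range(6)])
```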

Test

test:
  batch: 512
  confidence_score: True

Key                 Default   Description
batch               512       Test batch size
confidence_score    True      Compute confidence scores for predictions

Prediction

predict:
  output: "best_prediction"
  batch: 512
  save_spectrum: False
  confidence_threshold: 0.98
  keep_empty: False
  keep_wrong_polarity_preds: False

Key                          Default             Description
output                       "best_prediction"   "best_prediction" for top-1 or "top3" for top-3 results
batch                        512                 Prediction batch size
save_spectrum                False               Include raw spectrum data in output
confidence_threshold         0.98                Minimum confidence to report a prediction
keep_empty                   False               Include spectra with no confident prediction
keep_wrong_polarity_preds    False               Keep predictions with mismatched polarity
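The interaction of confidence_threshold and keep_empty can be sketched as a filter over per-spectrum results. The tuple layout (spectrum id, predicted label or None, confidence) is invented for the example; the actual output format differs:

```python
def filter_predictions(preds, threshold=0.98, keep_empty=False):
    """Apply confidence_threshold / keep_empty as described above.
    preds: list of (spectrum_id, label_or_None, confidence). Illustrative only."""
    out = []
    for spectrum_id, label, conf in preds:
        confident = label is not None and conf >= threshold
        if confident:
            out.append((spectrum_id, label, conf))
        elif keep_empty:
            out.append((spectrum_id, None, conf))  # reported as an empty prediction
    return out

preds = [('s1', 'PC 34:1', 0.99), ('s2', 'PE 36:2', 0.50), ('s3', None, 0.0)]
print(filter_predictions(preds))                   # only the confident hit survives
print(filter_predictions(preds, keep_empty=True))  # low-confidence rows kept, but emptied
```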

Hyperparameter Tuning

tune:
  nr_trials: 1
  grace_period: 2
  resources_per_trial: null

Key                    Default   Description
nr_trials              1         Number of Ray Tune trials
grace_period           2         Minimum epochs before early stopping (ASHA scheduler)
resources_per_trial    null      CPU/GPU allocation per trial (null = auto-detect)

Input Embedding

input_embedding:
  n_peaks: 30
  max_mz: 1600
  decimal_accuracy: 1

Key                 Default   Description
n_peaks             30        Number of highest-intensity peaks to retain
max_mz              1600      Maximum m/z value for spectrum binning
decimal_accuracy    1         Decimal precision for m/z values
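Taken together, the three settings describe a peak-selection step: discard peaks above max_mz, keep the n_peaks most intense ones, and round m/z values to decimal_accuracy digits. A sketch of that step under these assumptions (the real preprocessing may differ in order or detail):

```python
def select_peaks(peaks, n_peaks=30, max_mz=1600, decimal_accuracy=1):
    """Sketch of the preprocessing implied by the input_embedding settings.
    peaks: list of (mz, intensity) pairs. Illustrative only."""
    peaks = [(mz, inten) for mz, inten in peaks if mz <= max_mz]        # cap at max_mz
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]   # top-n by intensity
    return [(round(mz, decimal_accuracy), inten) for mz, inten in peaks]

spectrum = [(760.585, 100.0), (183.066, 40.0), (1700.2, 5.0)]
print(select_peaks(spectrum, n_peaks=2))
```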

Transformer

transformer:
  d_model: 32
  num_heads: 4
  dropout: 0.1
  ffn_hidden: 256
  num_layers: 2
  output_seq_length: 11

Key                  Default   Description
d_model              32        Embedding dimension (must be divisible by num_heads)
num_heads            4         Number of attention heads
dropout              0.1       Dropout rate
ffn_hidden           256       Hidden dimension of the feed-forward layers
num_layers           2         Number of encoder/decoder layers
output_seq_length    11        Maximum output token sequence length
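The divisibility constraint on d_model exists because multi-head attention splits the embedding evenly across heads. A quick check you can apply to your own values (hypothetical helper, not part of LipiDetective):

```python
def check_transformer_config(d_model: int, num_heads: int) -> int:
    """Validate the constraint noted above and return the per-head dimension."""
    if d_model % num_heads != 0:
        raise ValueError(f'd_model={d_model} is not divisible by num_heads={num_heads}')
    return d_model // num_heads

print(check_transformer_config(32, 4))  # 8 dimensions per attention head
```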

WandB Integration

wandb:
  group: 'Debugging'

Uncomment the wandb section in the config to enable Weights & Biases experiment tracking. The group field organizes runs within a WandB project.

Comment

comment: 'Info on the purpose of the current run'

A free-text field to annotate the purpose of the current experiment. Logged with the run metadata.