Configuration
LipiDetective is configured entirely through a YAML file passed via the
`--config` flag. A template is provided at
`config/config_templates/config_transformer.yaml`.

```shell
$ uv run lipidetective --config path/to/config.yaml
```
Model Selection
```yaml
model: 'transformer'
```
Choose the model architecture. Options:
- `transformer`: Encoder-decoder transformer for sequence generation (recommended)
- `convolutional`: 3-layer CNN for regression
- `feedforward`: Fully connected network
- `random_forest`: Scikit-learn random forest classifier
CUDA
```yaml
cuda:
  gpu_nr: [1]
```

Select which GPUs to use. Pass a list of GPU indices, e.g. `[0, 1]` for
multi-GPU training.
File Paths
```yaml
files:
  train_input: 'processed/train_dataset.hdf5'
  val_input: 'processed/val_dataset.hdf5'
  test_input: 'processed/test_dataset.hdf5'
  predict_input: 'raw/sample.mzML'
  saved_model: 'lipidetective_model.pth'
  output: 'output'
  splitting_instructions: 'validation_splits/train_val_split.yaml'
```
All paths can be relative or absolute. Relative paths are resolved against a per-type default base directory (each overridable via an environment variable; see below):

| Path type | Examples |
|---|---|
| Data paths (`train_input`, `val_input`, `test_input`, `predict_input`) | HDF5 or mzML files |
| Model paths (`saved_model`) | Saved model weights |
| Output paths (`output`) | Experiment results |
| Config paths (`splitting_instructions`) | Validation split definitions |
Absolute paths are used as-is.
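For illustration, the exported `resolve_data_path` helper (see Programmatic Access below) presumably behaves along these lines; the `default_base="data"` fallback is an assumption for the sketch, not the library's actual default:

```python
import os
from pathlib import Path

def resolve_data_path(path: str, default_base: str = "data") -> Path:
    """Sketch of the resolution rule: absolute paths pass through
    unchanged; relative paths are joined to a base directory that the
    LIPIDETECTIVE_DATA_DIR environment variable can override."""
    p = Path(path)
    if p.is_absolute():
        return p
    base = os.environ.get("LIPIDETECTIVE_DATA_DIR", default_base)
    return Path(base) / p
```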
Validation Split Precedence
When training with validation (`workflow.validate: True`), the data split
strategy is determined by which fields are set, checked in this order:

1. `val_input`: If set, training and validation use separate HDF5 files. Both `splitting_instructions` and k-fold splitting are ignored.
2. `splitting_instructions`: If set (and `val_input` is empty), the referenced YAML file defines which lipid species go into the validation set. Data is read from `train_input` only.
3. K-fold (default): If neither is set, `train_input` is split into `training.k` folds by lipid species for cross-validation.
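The precedence above can be sketched as a small helper (hypothetical; the real implementation may differ):

```python
def choose_split_strategy(files: dict, training: dict) -> str:
    """Pick the validation-split strategy from the parsed config
    sections, following the precedence order described above."""
    if files.get("val_input"):
        # 1. Separate HDF5 files for training and validation.
        return "separate_files"
    if files.get("splitting_instructions"):
        # 2. A YAML file lists the lipid species held out for validation.
        return "splitting_instructions"
    # 3. Default: k-fold cross-validation by lipid species.
    return f"kfold(k={training.get('k', 6)})"
```

With both fields empty, the strategy falls back to `training.k` folds.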
Environment Variable Overrides
Override the default base directories using environment variables, one per
path type: the base directories for data files, model files, experiment
outputs, and config files (e.g. splitting instructions) can each be
overridden. The data-file variable is `LIPIDETECTIVE_DATA_DIR`, as in the
example below.
Example:
```shell
export LIPIDETECTIVE_DATA_DIR=/mnt/data/lipidomics
uv run lipidetective --config config.yaml
```
Programmatic Access
Path resolution functions can be used directly in Python:
```python
from lipidetective import (
    get_project_root,
    resolve_data_path,
    resolve_model_path,
    resolve_output_path,
    resolve_config_path,
)

data_file = resolve_data_path('processed/dataset.hdf5')
model_file = resolve_model_path('lipidetective_model.pth')
```
Workflow
```yaml
workflow:
  train: False
  validate: False
  test: False
  tune: False
  predict: True
  save_model: False
  load_model: True
  log_every_n_steps: 10
```
| Key | Default | Description |
|---|---|---|
| `train` | `False` | Train the model |
| `validate` | `False` | Enable validation during training (requires `train: True`) |
| `test` | `False` | Evaluate on the test set |
| `tune` | `False` | Run hyperparameter tuning with Ray Tune |
| `predict` | `True` | Run prediction on mzML input |
| `save_model` | `False` | Save model weights to the output directory after training |
| `load_model` | `True` | Load pre-trained model weights from `saved_model` |
| `log_every_n_steps` | `10` | PyTorch Lightning logging frequency |
Warning

When loading a pre-trained model (`load_model: True`), the
`transformer`, `input_embedding`, and `model` settings in your config
must exactly match the config used to train that model. The saved
`.pth` file contains only weight tensors, with no architecture metadata. If
any parameter differs (e.g. `d_model`, `num_heads`, `n_peaks`),
PyTorch will raise a `RuntimeError` due to mismatched tensor shapes.
Always keep the config file that was used for training alongside the saved model.
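As a defensive pattern (not part of LipiDetective itself), you can compare the architecture-relevant sections of two parsed configs before loading weights:

```python
# Config sections that define the architecture, per the warning above.
ARCH_SECTIONS = ("transformer", "input_embedding", "model")

def configs_compatible(current: dict, training_cfg: dict) -> bool:
    """True if the architecture-defining config sections match exactly.
    A section missing from both configs counts as matching."""
    return all(current.get(s) == training_cfg.get(s) for s in ARCH_SECTIONS)

cfg_a = {"model": "transformer", "transformer": {"d_model": 32, "num_heads": 4}}
cfg_b = {"model": "transformer", "transformer": {"d_model": 64, "num_heads": 4}}
```

Here `configs_compatible(cfg_a, cfg_b)` is `False` because `d_model` differs.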
Training
```yaml
training:
  k: 6
  learning_rate: 0.004
  lr_step: 2
  epochs: 15
  batch: 512
  nr_workers: 0
```
| Key | Default | Description |
|---|---|---|
| `k` | `6` | Number of folds for k-fold cross-validation |
| `learning_rate` | `0.004` | Initial learning rate |
| `lr_step` | `2` | Step size for the learning rate scheduler |
| `epochs` | `15` | Number of training epochs |
| `batch` | `512` | Training batch size |
| `nr_workers` | `0` | DataLoader worker processes (`0` loads data in the main process) |
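How `learning_rate` and `lr_step` combine depends on the scheduler; assuming a plain step decay (as in PyTorch's `StepLR`, with an illustrative decay factor `gamma=0.1` that is not taken from LipiDetective), the effective rate at a given epoch would be:

```python
def step_decay_lr(epoch: int, base_lr: float = 0.004,
                  lr_step: int = 2, gamma: float = 0.1) -> float:
    # The learning rate is multiplied by `gamma` every `lr_step` epochs.
    # `gamma` here is an assumed value for illustration.
    return base_lr * gamma ** (epoch // lr_step)
```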
Test
```yaml
test:
  batch: 512
  confidence_score: True
```
| Key | Default | Description |
|---|---|---|
| `batch` | `512` | Test batch size |
| `confidence_score` | `True` | Compute confidence scores for predictions |
Prediction
```yaml
predict:
  output: "best_prediction"
  batch: 512
  save_spectrum: False
  confidence_threshold: 0.98
  keep_empty: False
  keep_wrong_polarity_preds: False
```
| Key | Default | Description |
|---|---|---|
| `output` | `"best_prediction"` | Prediction output mode |
| `batch` | `512` | Prediction batch size |
| `save_spectrum` | `False` | Include raw spectrum data in output |
| `confidence_threshold` | `0.98` | Minimum confidence to report a prediction |
| `keep_empty` | `False` | Include spectra with no confident prediction |
| `keep_wrong_polarity_preds` | `False` | Keep predictions with mismatched polarity |
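Taken together, these options imply a post-processing filter roughly like the following sketch (the `confidence` and `polarity_match` fields are illustrative names, not LipiDetective's actual output schema):

```python
def filter_predictions(preds, confidence_threshold=0.98,
                       keep_empty=False, keep_wrong_polarity_preds=False):
    """Sketch of the filtering implied by the predict options.
    Each prediction is a dict with illustrative keys:
    'confidence' (float) and 'polarity_match' (bool)."""
    kept = []
    for p in preds:
        confident = p["confidence"] >= confidence_threshold
        if not confident and not keep_empty:
            continue  # drop spectra with no confident prediction
        if not p["polarity_match"] and not keep_wrong_polarity_preds:
            continue  # drop predictions with mismatched polarity
        kept.append(p)
    return kept
```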
Hyperparameter Tuning
```yaml
tune:
  nr_trials: 1
  grace_period: 2
  resources_per_trial: null
```
| Key | Default | Description |
|---|---|---|
| `nr_trials` | `1` | Number of Ray Tune trials |
| `grace_period` | `2` | Minimum epochs before early stopping (ASHA scheduler) |
| `resources_per_trial` | `null` | CPU/GPU allocation per trial (`null` for automatic allocation) |
Input Embedding
```yaml
input_embedding:
  n_peaks: 30
  max_mz: 1600
  decimal_accuracy: 1
```
| Key | Default | Description |
|---|---|---|
| `n_peaks` | `30` | Number of highest-intensity peaks to retain |
| `max_mz` | `1600` | Maximum m/z value for spectrum binning |
| `decimal_accuracy` | `1` | Decimal precision for m/z values |
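A minimal sketch of the preprocessing these settings imply (the actual implementation may differ, e.g. in how binning is applied):

```python
def preprocess_spectrum(peaks, n_peaks=30, max_mz=1600, decimal_accuracy=1):
    """Keep the n_peaks most intense peaks at or below max_mz,
    rounding m/z values to decimal_accuracy decimals.
    `peaks` is a list of (mz, intensity) tuples."""
    in_range = [(mz, i) for mz, i in peaks if mz <= max_mz]
    top = sorted(in_range, key=lambda p: p[1], reverse=True)[:n_peaks]
    return [(round(mz, decimal_accuracy), i) for mz, i in top]
```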
Transformer
```yaml
transformer:
  d_model: 32
  num_heads: 4
  dropout: 0.1
  ffn_hidden: 256
  num_layers: 2
  output_seq_length: 11
```
| Key | Default | Description |
|---|---|---|
| `d_model` | `32` | Embedding dimension (must be divisible by `num_heads`) |
| `num_heads` | `4` | Number of attention heads |
| `dropout` | `0.1` | Dropout rate |
| `ffn_hidden` | `256` | Hidden dimension of the feed-forward layers |
| `num_layers` | `2` | Number of encoder/decoder layers |
| `output_seq_length` | `11` | Maximum output token sequence length |
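The divisibility constraint exists because multi-head attention splits the embedding across heads: each head works on a `d_model / num_heads`-dimensional slice. A quick check:

```python
def head_dim(d_model: int = 32, num_heads: int = 4) -> int:
    # Each attention head attends over d_model // num_heads dimensions,
    # so d_model must divide evenly by num_heads.
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model={d_model} is not divisible by num_heads={num_heads}"
        )
    return d_model // num_heads
```

With the defaults above, each head works in 8 dimensions; `d_model: 30` with 4 heads would be rejected.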
WandB Integration
```yaml
wandb:
  group: 'Debugging'
```

Uncomment the `wandb` section in the config to enable
Weights & Biases experiment tracking. The `group`
field organizes runs within a WandB project.
Comment
```yaml
comment: 'Info on the purpose of the current run'
```

A free-text field to annotate the purpose of the current experiment. Logged with the run metadata.