Configuration
=============

LipiDetective is configured entirely through a YAML file passed via the ``--config`` flag. A template is provided at ``config/config_templates/config_transformer.yaml``.

.. code-block:: console

   $ uv run lipidetective --config path/to/config.yaml

Model Selection
---------------

.. code-block:: yaml

   model: 'transformer'

Choose the model architecture. Options:

- ``transformer`` — Encoder-decoder transformer for sequence generation (recommended)
- ``convolutional`` — 3-layer CNN for regression
- ``feedforward`` — Fully connected network
- ``random_forest`` — Scikit-learn random forest classifier

CUDA
----

.. code-block:: yaml

   cuda:
     gpu_nr: [1]

Select which GPUs to use. Pass a list of GPU indices, e.g. ``[0, 1]`` for multi-GPU training.

File Paths
----------

.. code-block:: yaml

   files:
     train_input: 'processed/train_dataset.hdf5'
     val_input: 'processed/val_dataset.hdf5'
     test_input: 'processed/test_dataset.hdf5'
     predict_input: 'raw/sample.mzML'
     saved_model: 'lipidetective_model.pth'
     output: 'output'
     splitting_instructions: 'validation_splits/train_val_split.yaml'

All paths can be **relative** or **absolute**. Relative paths are resolved from default base directories:

.. list-table::
   :header-rows: 1
   :widths: 35 35 30

   * - Path type
     - Base directory
     - Examples
   * - Data paths (``train_input``, ``val_input``, ``test_input``, ``predict_input``)
     - ``data/``
     - HDF5 or mzML files
   * - Model paths (``saved_model``)
     - ``models/``
     - Saved model weights
   * - Output paths (``output``)
     - ``experiments/``
     - Experiment results
   * - Config paths (``splitting_instructions``)
     - ``config/``
     - Validation split definitions

Absolute paths are used as-is.

Validation Split Precedence
^^^^^^^^^^^^^^^^^^^^^^^^^^^

When training with validation (``workflow.validate: True``), the data split strategy is determined by which fields are set, checked in this order:

1. ``val_input`` — If set, training and validation use separate HDF5 files.
   Both ``splitting_instructions`` and k-fold splitting are ignored.
2. ``splitting_instructions`` — If set (and ``val_input`` is empty), the referenced YAML file defines which lipid species go into the validation set. Data is read from ``train_input`` only.
3. **K-fold** (default) — If neither is set, ``train_input`` is split into ``training.k`` folds by lipid species for cross-validation.

Environment Variable Overrides
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Override the default base directories using environment variables:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Variable
     - Description
   * - ``LIPIDETECTIVE_DATA_DIR``
     - Base directory for data files
   * - ``LIPIDETECTIVE_MODELS_DIR``
     - Base directory for model files
   * - ``LIPIDETECTIVE_OUTPUT_DIR``
     - Base directory for experiment outputs
   * - ``LIPIDETECTIVE_CONFIG_DIR``
     - Base directory for config files (e.g. splitting instructions)

Example:

.. code-block:: bash

   export LIPIDETECTIVE_DATA_DIR=/mnt/data/lipidomics
   uv run lipidetective --config config.yaml

Programmatic Access
^^^^^^^^^^^^^^^^^^^

Path resolution functions can be used directly in Python:

.. code-block:: python

   from lipidetective import (
       get_project_root,
       resolve_data_path,
       resolve_model_path,
       resolve_output_path,
       resolve_config_path,
   )

   data_file = resolve_data_path('processed/dataset.hdf5')
   model_file = resolve_model_path('lipidetective_model.pth')

Workflow
--------

.. code-block:: yaml

   workflow:
     train: False
     validate: False
     test: False
     tune: False
     predict: True
     save_model: False
     load_model: True
     log_every_n_steps: 10
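The toggles shown above correspond to a prediction run with a pre-trained model. For contrast, a fresh training run with validation that saves its weights afterwards could set the same toggles as follows (an illustrative combination, not an additional shipped template):

.. code-block:: yaml

   workflow:
     train: True
     validate: True
     test: False
     tune: False
     predict: False
     save_model: True
     load_model: False
     log_every_n_steps: 10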
.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Key
     - Default
     - Description
   * - ``train``
     - ``False``
     - Train the model
   * - ``validate``
     - ``False``
     - Enable validation during training (requires ``train: True``)
   * - ``test``
     - ``False``
     - Evaluate on the test set
   * - ``tune``
     - ``False``
     - Run hyperparameter tuning with Ray Tune
   * - ``predict``
     - ``True``
     - Run prediction on mzML input
   * - ``save_model``
     - ``False``
     - Save model weights to the output directory after training
   * - ``load_model``
     - ``True``
     - Load pre-trained model weights from ``files.saved_model``
   * - ``log_every_n_steps``
     - ``10``
     - PyTorch Lightning logging frequency

.. warning::

   When loading a pre-trained model (``load_model: True``), the ``transformer``, ``input_embedding``, and ``model`` settings in your config **must exactly match** the config used to train that model. The saved ``.pth`` file contains only weight tensors — no architecture metadata. If any parameter differs (e.g. ``d_model``, ``num_heads``, ``n_peaks``), PyTorch will raise a ``RuntimeError`` due to mismatched tensor shapes. Always keep the config file that was used for training alongside the saved model.

Training
--------

.. code-block:: yaml

   training:
     k: 6
     learning_rate: 0.004
     lr_step: 2
     epochs: 15
     batch: 512
     nr_workers: 0

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Key
     - Default
     - Description
   * - ``k``
     - ``6``
     - Number of folds for k-fold cross-validation
   * - ``learning_rate``
     - ``0.004``
     - Initial learning rate
   * - ``lr_step``
     - ``2``
     - Step size for the learning rate scheduler
   * - ``epochs``
     - ``15``
     - Number of training epochs
   * - ``batch``
     - ``512``
     - Training batch size
   * - ``nr_workers``
     - ``0``
     - DataLoader worker processes (``0`` = main process)

Test
----

.. code-block:: yaml

   test:
     batch: 512
     confidence_score: True
.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Key
     - Default
     - Description
   * - ``batch``
     - ``512``
     - Test batch size
   * - ``confidence_score``
     - ``True``
     - Compute confidence scores for predictions

Prediction
----------

.. code-block:: yaml

   predict:
     output: "best_prediction"
     batch: 512
     save_spectrum: False
     confidence_threshold: 0.98
     keep_empty: False
     keep_wrong_polarity_preds: False

.. list-table::
   :header-rows: 1
   :widths: 35 20 45

   * - Key
     - Default
     - Description
   * - ``output``
     - ``"best_prediction"``
     - ``"best_prediction"`` for top-1 or ``"top3"`` for top-3 results
   * - ``batch``
     - ``512``
     - Prediction batch size
   * - ``save_spectrum``
     - ``False``
     - Include raw spectrum data in the output
   * - ``confidence_threshold``
     - ``0.98``
     - Minimum confidence required to report a prediction
   * - ``keep_empty``
     - ``False``
     - Include spectra with no confident prediction
   * - ``keep_wrong_polarity_preds``
     - ``False``
     - Keep predictions with mismatched polarity

Hyperparameter Tuning
---------------------

.. code-block:: yaml

   tune:
     nr_trials: 1
     grace_period: 2
     resources_per_trial: null

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Key
     - Default
     - Description
   * - ``nr_trials``
     - ``1``
     - Number of Ray Tune trials
   * - ``grace_period``
     - ``2``
     - Minimum number of epochs before early stopping (ASHA scheduler)
   * - ``resources_per_trial``
     - ``null``
     - CPU/GPU allocation per trial (``null`` = auto-detect)

Input Embedding
---------------

.. code-block:: yaml

   input_embedding:
     n_peaks: 30
     max_mz: 1600
     decimal_accuracy: 1

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Key
     - Default
     - Description
   * - ``n_peaks``
     - ``30``
     - Number of highest-intensity peaks to retain
   * - ``max_mz``
     - ``1600``
     - Maximum m/z value for spectrum binning
   * - ``decimal_accuracy``
     - ``1``
     - Decimal precision for m/z values

Transformer
-----------

.. code-block:: yaml

   transformer:
     d_model: 32
     num_heads: 4
     dropout: 0.1
     ffn_hidden: 256
     num_layers: 2
     output_seq_length: 11
.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Key
     - Default
     - Description
   * - ``d_model``
     - ``32``
     - Embedding dimension (must be divisible by ``num_heads``)
   * - ``num_heads``
     - ``4``
     - Number of attention heads
   * - ``dropout``
     - ``0.1``
     - Dropout rate
   * - ``ffn_hidden``
     - ``256``
     - Hidden dimension of the feed-forward layers
   * - ``num_layers``
     - ``2``
     - Number of encoder/decoder layers
   * - ``output_seq_length``
     - ``11``
     - Maximum output token sequence length

WandB Integration
-----------------

.. code-block:: yaml

   wandb:
     group: 'Debugging'

Uncomment the ``wandb`` section in the config to enable `Weights & Biases <https://wandb.ai>`_ experiment tracking. The ``group`` field organizes runs within a WandB project.

Comment
-------

.. code-block:: yaml

   comment: 'Info on the purpose of the current run'

A free-text field annotating the purpose of the current experiment. It is logged with the run metadata.
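Putting the sections together, a minimal prediction config might look like the sketch below. The values are illustrative, and it assumes that omitted workflow toggles fall back to the defaults documented above; in practice, start from the shipped template ``config/config_templates/config_transformer.yaml`` rather than writing a file from scratch, so that the ``transformer`` and ``input_embedding`` settings match the saved model.

.. code-block:: yaml

   model: 'transformer'

   files:
     predict_input: 'raw/sample.mzML'
     saved_model: 'lipidetective_model.pth'
     output: 'output'

   workflow:
     predict: True
     load_model: True

   predict:
     output: "best_prediction"
     batch: 512
     confidence_threshold: 0.98

   input_embedding:
     n_peaks: 30
     max_mz: 1600
     decimal_accuracy: 1

   transformer:
     d_model: 32
     num_heads: 4
     dropout: 0.1
     ffn_hidden: 256
     num_layers: 2
     output_seq_length: 11

   comment: 'Prediction run on sample.mzML'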