Models

LipiDetective supports four model architectures. Select the model via the model field in the config:

model: 'transformer'   # or 'convolutional', 'feedforward', 'random_forest'
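
The mapping from the model field to a model class can be pictured as a simple lookup. The sketch below is illustrative only: the registry dict and the resolve_model_name helper are hypothetical names, and only the four option strings and the class names come from this page. How LipiDetective actually dispatches on the field is not shown here.

```python
# Hypothetical registry: maps the documented `model` values to the class
# names listed on this page. The real dispatch code may look different.
MODEL_CLASSES = {
    "transformer": "TransformerNetwork",
    "convolutional": "ConvolutionalNetwork",
    "feedforward": "FeedForwardNetwork",
    "random_forest": "RandomForest",
}


def resolve_model_name(config: dict) -> str:
    """Return the class name for config['model'], or raise on unknown values."""
    name = config.get("model")
    if name not in MODEL_CLASSES:
        valid = ", ".join(sorted(MODEL_CLASSES))
        raise ValueError(f"Unknown model {name!r}; expected one of: {valid}")
    return MODEL_CLASSES[name]


print(resolve_model_name({"model": "transformer"}))
```

Failing early on an unrecognized value keeps a typo in the config from surfacing later as a less obvious error.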

Transformer (Recommended)

The primary model. Uses an encoder-decoder transformer architecture to generate lipid nomenclature as a token sequence from an input spectrum. The encoder processes the spectrum embedding, and the decoder autoregressively predicts lipid tokens (headgroup, fatty acid chains, etc.).
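
Autoregressive prediction, as described above, means each token is chosen given the tokens generated so far. The following is a minimal sketch of greedy decoding in plain Python, not LipiDetective's implementation. The score_next callback stands in for one decoder step, and the special tokens, the toy vocabulary, and the toy_scores function are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical special tokens, for illustration only.
SOS, EOS = "<sos>", "<eos>"


def greedy_decode(
    score_next: Callable[[Sequence[str]], dict[str, float]],
    max_length: int,
) -> list[str]:
    """Append the highest-scoring next token until EOS or max_length is reached."""
    tokens = [SOS]
    while len(tokens) - 1 < max_length:
        scores = score_next(tokens)         # stand-in for one decoder step
        best = max(scores, key=scores.get)  # greedy choice: take the argmax
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token


def toy_scores(prefix: Sequence[str]) -> dict[str, float]:
    """Toy stand-in for a trained decoder: prefers a fixed token order."""
    target = ["PC", "16:0", "_", "18:1", EOS]
    step = len(prefix) - 1
    wanted = target[min(step, len(target) - 1)]
    return {tok: (1.0 if tok == wanted else 0.1) for tok in set(target)}


print(greedy_decode(toy_scores, max_length=11))
```

Beam search follows the same loop but keeps several candidate prefixes per step instead of only the single best one.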

Configure via the transformer section:

transformer:
  d_model: 32       # Embedding dimension (must be divisible by num_heads)
  num_heads: 4      # Attention heads
  dropout: 0.1
  ffn_hidden: 256   # Feed-forward hidden dimension
  num_layers: 2     # Encoder/decoder layers
  output_seq_length: 11
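
The divisibility constraint on d_model exists because multi-head attention splits the embedding evenly across heads. A small check like the one below, written as a sketch with a hypothetical head_dimension helper, can catch an invalid combination before a model is built. The config values are copied from the example above.

```python
# Values copied from the transformer section above.
transformer_config = {
    "d_model": 32,
    "num_heads": 4,
    "dropout": 0.1,
    "ffn_hidden": 256,
    "num_layers": 2,
    "output_seq_length": 11,
}


def head_dimension(config: dict) -> int:
    """Return the per-head embedding size, or raise if d_model does not split evenly."""
    d_model, num_heads = config["d_model"], config["num_heads"]
    if d_model % num_heads != 0:
        raise ValueError(
            f"d_model ({d_model}) must be divisible by num_heads ({num_heads})"
        )
    return d_model // num_heads


print(head_dimension(transformer_config))  # 32 // 4 = 8
```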

class TransformerNetwork(config: dict[str, Any], output_attentions: bool = False)

forward(src: Tensor, tgt: Tensor) → Tensor | tuple[Tensor, list[Tensor]]

Defines the computation performed on every call, taking the source spectrum batch src and the target token sequence tgt.

Note

Call the model instance itself, model(src, tgt), rather than forward() directly. Calling the instance runs any registered hooks, whereas calling forward() directly silently skips them.

generate_mask(src: Tensor, tgt: Tensor) → tuple[Tensor, Tensor, Tensor]

predict(src: Tensor) → Tensor

predict_top_3(src: Tensor) → tuple[Tensor, Tensor]

predict_beam_decode(src: Tensor) → Tensor

predict_greedy(src: Tensor) → Tensor

return_encoder_embedding(src: Tensor) → Tensor

greedy_decode(encoder_output: Tensor, tgt: Tensor, memory_padding_mask: Tensor) → None

beam_decode(encoder_output: Tensor, tgt: Tensor, memory_padding_mask: Tensor) → None

get_attention_layers(src: Tensor, src_key_padding_mask: Tensor) → list[Tensor]

Convolutional Neural Network

A CNN with three convolutional layers followed by three fully connected layers, used for regression on spectral data. Useful as a baseline or for simpler prediction tasks.

class ConvolutionalNetwork(config: dict[str, Any])

forward(x: Tensor) → Tensor

Forward pass of the convolutional network with three convolutional layers and pooling layers, followed by three fully connected linear layers.

Parameters: x (torch.Tensor) – input tensor of features with shape (batch_size, 2, n_peaks + 1). Dimension 1 has size 2 because the tensor contains the m/z and intensity values of each peak. Dimension 2 has size n_peaks + 1 because the measurement mode (-1 for negative, +1 for positive) and the precursor mass are added.
Returns: output of the convolutional network with shape (batch_size, 3). The three values correspond to the masses of the lipid components to be predicted (headgroup and two side chains).
Return type: torch.Tensor
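
To make the input shape concrete, the sketch below assembles one spectrum as a nested list and checks its dimensions. It is an illustration under a stated assumption: the documentation does not say where the two extra values go, so the layout here (precursor mass appended to the m/z row, measurement mode appended to the intensity row) is a guess. The helper name build_cnn_input and all peak values are made up.

```python
def build_cnn_input(mz, intensity, precursor_mass, mode):
    """Return a 2 x (n_peaks + 1) nested list for a single spectrum.

    Assumed layout (not confirmed by the documentation): the precursor mass
    is appended to the m/z row and the measurement mode to the intensity row.
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    if mode not in (-1, 1):
        raise ValueError("mode must be -1 (negative) or +1 (positive)")
    return [list(mz) + [precursor_mass], list(intensity) + [mode]]


# One spectrum with n_peaks = 3 (made-up values), wrapped in a batch of size 1.
spectrum = build_cnn_input(
    mz=[184.07, 478.33, 504.34],
    intensity=[1.0, 0.25, 0.4],
    precursor_mass=760.58,
    mode=1,
)
batch = [spectrum]
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (1, 2, 4), i.e. (batch_size, 2, n_peaks + 1)
```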

calculate_fc1_size(len_input_spectrum: int) → int

Calculates the size of the output after the convolutional layers so that the size of the first fully connected layer can be set accordingly.

Parameters: len_input_spectrum (int) – length of the input spectrum
Returns: size of the in_features of the first fully connected layer
Return type: int
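
The kind of calculation involved can be sketched with the standard output-length formula for 1D convolution and pooling layers. This is not the library's code: the kernel sizes, strides, and channel count below are hypothetical defaults, and only the structure of three convolution and pooling stages follows the description of forward above.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution or pooling layer (PyTorch convention)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def flattened_size(len_input_spectrum, out_channels=16, conv_kernel=3,
                   pool_kernel=2, n_blocks=3):
    """Flattened feature count after n_blocks of (convolution -> pooling).

    All layer hyperparameters here are hypothetical, not LipiDetective's.
    """
    length = len_input_spectrum
    for _ in range(n_blocks):
        # Convolution with stride 1 and no padding.
        length = conv1d_output_length(length, conv_kernel)
        # Pooling with stride equal to the pooling kernel size.
        length = conv1d_output_length(length, pool_kernel, stride=pool_kernel)
    if length <= 0:
        raise ValueError("input spectrum is too short for this layer stack")
    return out_channels * length


print(flattened_size(101))  # lengths: 101 -> 99 -> 49 -> 47 -> 23 -> 21 -> 10
```

Computing this value from the input length avoids hard-coding in_features, which would otherwise break whenever the number of peaks per spectrum changes.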

Feed-Forward Network

A simple fully connected network. Serves as a minimal baseline architecture.

class FeedForwardNetwork(config: dict[str, Any])

forward(x: Tensor) → Tensor

Defines the computation performed on every call for the input tensor x.

Note

Call the model instance itself, model(x), rather than forward() directly. Calling the instance runs any registered hooks, whereas calling forward() directly silently skips them.

Random Forest

A scikit-learn RandomForestClassifier wrapper. Operates outside the PyTorch Lightning pipeline and handles its own data loading from HDF5 files. Useful for comparison against deep learning approaches.

class RandomForest(config: dict[str, Any])

run() → None

get_spectrum_data(spectrum: Dataset) → list[Any]

use_single_classifier(train_features: list[Any], train_labels: list[Any], test_features: list[Any]) → tuple[Any, RandomForestClassifier]

plot_decision_tree(decision_tree: Any, name_file: str) → None

use_triple_classifier(train_features: list[Any], train_labels: list[Any], test_features: list[Any]) → tuple[list[list[Any]], RandomForestClassifier, RandomForestClassifier, RandomForestClassifier]

use_triple_regressor(train_features: list[Any], train_labels: list[Any], test_features: list[Any]) → tuple[list[list[Any]], RandomForestRegressor, RandomForestRegressor, RandomForestRegressor]

calculate_accuracy(prediction: Any, labels: list[Any], model: str, task: str) → str

check_classification_accuracy(prediction: object, label: object) → bool

check_regression_accuracy(prediction: Any, label: Any) → bool

write_output_to_file(statistics: dict[str, str]) → None

extract_info_dataset(group: Group | Dataset, val_lipids: list[str], train_set: list[Any], test_set: list[Any]) → None

extract_info_dataset_no_split(group: Group | Dataset, dataset: list[Any]) → None

extract_features_and_labels(dataset: list[Any]) → tuple[Any, Any]

prepare_data() → tuple[list[Any], list[Any], list[Any], list[Any]]