Data Preparation

LipiDetective works with two file formats: HDF5 for training and evaluation, and mzML for prediction on new spectra.

Supported Formats

HDF5 (Training & Evaluation)

HDF5 files store pre-processed tandem mass spectra under a single group /all_datasets/. Each spectrum is stored as a dataset (not a group) with a 2×N array where row 0 is m/z values and row 1 is intensities.

Metadata is stored as HDF5 attributes on each dataset:

Attribute       Type     Description
lipid_species   string   Lipid species name (empty string for unlabeled data)
adduct          string   Adduct type (e.g. [M+H]+)
precursor       float    Precursor m/z value
polarity        string   "pos" or "neg"
scan_index      int      Scan index from the source file
level           int      MS level (2 for MS/MS)
source          string   Source file name

Example HDF5 structure:

/all_datasets/
├── "PC 34:1 | pos | nist | 042"    # dataset: float64 array (2, N)
│   ├── attrs["lipid_species"] = "PC 34:1"
│   ├── attrs["adduct"] = "[M+H]+"
│   ├── attrs["precursor"] = 760.585
│   ├── attrs["polarity"] = "pos"
│   └── ...
├── "PE 36:2 | neg | nist | 117"
│   └── ...
└── ...
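As a minimal sketch of this layout, the following h5py snippet writes one spectrum in the format described above and reads it back. The file path and dataset name are illustrative, not part of LipiDetective:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("all_datasets")
    # Shape (2, N): row 0 holds m/z values, row 1 holds intensities
    spectrum = np.array([[184.073, 577.519, 760.585],
                         [8.2e5,   3.1e3,   1.0e4]])
    ds = grp.create_dataset("PC 34:1 | pos | nist | 042", data=spectrum)
    # Metadata lives in HDF5 attributes on the dataset
    ds.attrs["lipid_species"] = "PC 34:1"
    ds.attrs["adduct"] = "[M+H]+"
    ds.attrs["precursor"] = 760.585
    ds.attrs["polarity"] = "pos"
    ds.attrs["level"] = 2

with h5py.File(path, "r") as f:
    ds = f["all_datasets/PC 34:1 | pos | nist | 042"]
    species = ds.attrs["lipid_species"]
    mz, intensity = ds[0], ds[1]
```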

Use H5Dataset to load HDF5 files in your training pipeline:

class H5Dataset(config: dict[str, Any], dataset_names: list[str], lipid_librarian: LipidLibrary, file_path: str)[source]

This class implements a custom PyTorch dataset that reads the training data from HDF5 files. It overrides the __getitem__ method to return the processed spectrum and its metadata for a sample at a given index.

file_path (str)
    The path to the HDF5 file

dataset_names (list)
    List of dataset names in the HDF5 file group "all_datasets", used to access samples by index

dataset_len (int)
    Number of samples in the HDF5 file

hdf5_file (h5py.File)
    The opened HDF5 file; set to None during initialization and opened once the first sample is requested

config (dict)
    Dictionary containing the information from the config.yaml file

network_type (str)
    Network type specified in the config.yaml file

decimal_accuracy (int)
    Decimal accuracy to which the mass spectra should be binned

lipid_librarian (LipidLibrary)
    LipidLibrarian instance used to generate the label for a sample

get_n_highest_peaks(spectrum: ndarray, n_peaks: int) ndarray[source]

Processes the spectrum for a sample in the H5Dataset to prepare it as input for the model.

Parameters:
  • spectrum (np.ndarray) – the spectrum extracted from an HDF5 dataset, containing m/z and intensity arrays

  • n_peaks (int) – maximum number of peaks to feed into the neural network

Returns:

the spectrum reduced to at most n_peaks peaks, sorted by descending intensity

Return type:

np.ndarray
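A minimal NumPy sketch of this peak selection, illustrating the described behavior rather than the actual implementation:

```python
import numpy as np

def get_n_highest_peaks(spectrum: np.ndarray, n_peaks: int) -> np.ndarray:
    """Keep the n_peaks most intense peaks, sorted by descending intensity."""
    # Indices of peaks ranked by intensity (row 1), highest first
    order = np.argsort(spectrum[1])[::-1][:n_peaks]
    return spectrum[:, order]

spectrum = np.array([[100.1, 200.2, 300.3, 400.4],   # m/z
                     [5.0,   50.0,  20.0,  80.0]])   # intensity
top2 = get_n_highest_peaks(spectrum, 2)
# top2 m/z: [400.4, 200.2], intensities: [80.0, 50.0]
```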

bin_spectrum(spectrum: ndarray) ndarray[source]

Truncates the m/z values at the decimal position defined in config.yaml and sums the intensities within each bin.

Parameters:

spectrum (np.ndarray) – the m/z and intensity values of a dataset from the HDF5 file

Returns:

the binned spectrum, ordered by descending intensity

Return type:

np.ndarray
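The binning can be sketched as follows, assuming m/z values are truncated at the configured decimal place and intensities falling into the same bin are summed (a simplified illustration, not the actual implementation):

```python
import numpy as np

def bin_spectrum(spectrum: np.ndarray, decimal_accuracy: int = 1) -> np.ndarray:
    """Truncate m/z to decimal_accuracy places and sum intensities per bin."""
    factor = 10 ** decimal_accuracy
    binned_mz = np.trunc(spectrum[0] * factor) / factor
    # Group identical truncated m/z values and sum their intensities
    unique_mz, inverse = np.unique(binned_mz, return_inverse=True)
    summed = np.bincount(inverse, weights=spectrum[1])
    order = np.argsort(summed)[::-1]  # descending intensity
    return np.vstack([unique_mz[order], summed[order]])

spectrum = np.array([[184.04, 184.07, 760.58],
                     [100.0,  50.0,   400.0]])
binned = bin_spectrum(spectrum, decimal_accuracy=1)
# 184.04 and 184.07 both truncate to 184.0, so their intensities merge
```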

mzML (Prediction)

For prediction, LipiDetective reads raw mzML files produced by mass spectrometry instruments. The PredictionDataset handles reading, filtering, and converting mzML spectra:

class PredictionDataset(file_path: str, config: dict[str, Any])[source]
process_input() list[dict[str, Any]][source]
process_mzml() list[dict[str, Any]][source]
process_json() list[dict[str, Any]][source]
get_n_highest_peaks(mz_array: ndarray, intensity_array: ndarray, precursor: float) Tensor[source]

Converting mzML to HDF5

To prepare mzML files for training, use the included conversion script:

$ python -m lipidetective.helpers.mzml_to_hdf5 \
    -i path/to/mzml_folder/ \
    -o path/to/output_folder/

This reads all mzML files in the input directory and produces a single HDF5 file suitable for training.

Spectrum Processing

During loading, spectra are processed as follows:

  1. Peak selection — The n_peaks highest-intensity peaks are retained (default: 30, set via input_embedding.n_peaks)

  2. m/z cutoff — Peaks with m/z above max_mz are discarded (default: 1600, set via input_embedding.max_mz)

  3. Decimal rounding — m/z values are rounded to decimal_accuracy decimal places (default: 1)

These parameters are configured in the input_embedding section of the config file. See Configuration for details.
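The three steps above can be sketched as a small NumPy pipeline. The defaults and the order of operations are illustrative; the actual pipeline applies whatever is configured under input_embedding:

```python
import numpy as np

def process_spectrum(spectrum: np.ndarray,
                     n_peaks: int = 30,
                     max_mz: float = 1600.0,
                     decimal_accuracy: int = 1) -> np.ndarray:
    """Illustrative sketch of the spectrum-loading steps described above."""
    # m/z cutoff: discard peaks above max_mz
    spectrum = spectrum[:, spectrum[0] <= max_mz]
    # Peak selection: keep the n_peaks most intense peaks
    order = np.argsort(spectrum[1])[::-1][:n_peaks]
    spectrum = spectrum[:, order]
    # Decimal rounding: round m/z to decimal_accuracy places
    spectrum[0] = np.round(spectrum[0], decimal_accuracy)
    return spectrum

raw = np.array([[100.12, 1700.0, 500.57],
                [10.0,   999.0,  20.0]])
out = process_spectrum(raw)
# The 1700.0 peak is discarded; the rest are sorted by intensity and rounded
```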