Data Preparation

LipiDetective works with two file formats: HDF5 for training and evaluation, and mzML for prediction on new spectra.

Supported Formats

HDF5 (Training & Evaluation)

HDF5 files store pre-processed tandem mass spectra under a single group /all_datasets/. Each spectrum is stored as a dataset (not a group) with a 2×N array where row 0 is m/z values and row 1 is intensities.

Metadata is stored as HDF5 attributes on each dataset:

Attribute	Type	Description
`lipid_species`	string	Lipid species name (empty string for unlabeled data)
`adduct`	string	Adduct type (e.g. `[M+H]+`)
`precursor`	float	Precursor m/z value
`polarity`	string	`"pos"` or `"neg"`
`scan_index`	int	Scan index from the source file
`level`	int	MS level (2 for MS/MS)
`source`	string	Source file name

Example HDF5 structure:

/all_datasets/
├── "PC 34:1 | pos | nist | 042"    # dataset: float64 array (2, N)
│   ├── attrs["lipid_species"] = "PC 34:1"
│   ├── attrs["adduct"] = "[M+H]+"
│   ├── attrs["precursor"] = 760.585
│   ├── attrs["polarity"] = "pos"
│   └── ...
├── "PE 36:2 | neg | nist | 117"
│   └── ...
└── ...
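The layout above can be reproduced and inspected with `h5py`. The following is a minimal sketch, not LipiDetective's own code: the helper names `write_example_file` and `read_spectra`, the peak values, and the `scan_index` and `source` values are made up for illustration. Only the group name, dataset shape, and attribute names come from the description above.

```python
import os
import tempfile

import h5py
import numpy as np


def write_example_file(path):
    """Write one labeled spectrum using the layout described above."""
    spectrum = np.array(
        [
            [184.07, 496.34, 760.58],  # row 0: m/z values (illustrative)
            [100.0, 30.0, 55.0],       # row 1: intensities (illustrative)
        ],
        dtype=np.float64,
    )
    with h5py.File(path, "w") as h5_file:
        group = h5_file.create_group("all_datasets")
        dataset = group.create_dataset("PC 34:1 | pos | nist | 042", data=spectrum)
        dataset.attrs["lipid_species"] = "PC 34:1"
        dataset.attrs["adduct"] = "[M+H]+"
        dataset.attrs["precursor"] = 760.585
        dataset.attrs["polarity"] = "pos"
        dataset.attrs["scan_index"] = 42          # illustrative value
        dataset.attrs["level"] = 2
        dataset.attrs["source"] = "example.mzML"  # illustrative value


def read_spectra(path):
    """Return one dict per spectrum with its name, arrays, and attributes."""
    records = []
    with h5py.File(path, "r") as h5_file:
        for name, dataset in h5_file["all_datasets"].items():
            data = dataset[()]  # load the full (2, N) array into memory
            records.append(
                {
                    "name": name,
                    "mz": data[0],
                    "intensity": data[1],
                    "attrs": dict(dataset.attrs),
                }
            )
    return records


with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.h5")
    write_example_file(example_path)
    records = read_spectra(example_path)
    for record in records:
        print(record["name"], record["attrs"]["adduct"], record["mz"].shape)
```

Reading a file back this way is a quick check that every dataset has the expected shape and attributes before training starts.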

Use H5Dataset to load HDF5 files in your training pipeline:

class H5Dataset(config: dict[str, Any], dataset_names: list[str], lipid_librarian: LipidLibrary, file_path: str)

This class implements a custom PyTorch dataset. It handles reading in the training data from HDF5 files. It overrides the __getitem__ method to return the processed spectrum and its metadata for a sample at a given index.

file_path

The path to the HDF5 file

Type: str

dataset_names

List of dataset names in the HDF5 group "all_datasets", used to access samples by index

Type: list

dataset_len

An integer count of the samples in the HDF5 file

Type: int

hdf5_file

The opened HDF5 file; set to None during initialization and opened when the first sample is requested

Type: h5py.File

config

Dictionary containing the information from the config.yaml file

Type: dict

network_type

String of the network type specified in the config.yaml file

Type: str

decimal_accuracy

Integer indicating the decimal accuracy to which the mass spectra should be binned

Type: int

lipid_librarian

LipidLibrary instance used to generate the label for a sample

Type: LipidLibrary

get_n_highest_peaks(spectrum: ndarray, n_peaks: int) → ndarray

Processes the spectrum for a sample in the H5Dataset to prepare it as input for the model.

Parameters:

spectrum (np.ndarray) – the spectrum extracted from an HDF5 dataset containing m/z and intensity arrays
n_peaks (int) – maximum number of peaks to be fed into neural network

Returns:

the spectrum containing the number of peaks specified in n_peaks and sorted by descending intensity

Return type:

np.ndarray
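As an illustration of the behaviour described for get_n_highest_peaks, peak selection on a 2×N array can be sketched with NumPy as follows. This is an assumption-based sketch, not the method's actual implementation: the function name `select_top_peaks` is hypothetical, and any padding or normalisation the real method may apply is not shown.

```python
import numpy as np


def select_top_peaks(spectrum, n_peaks):
    """Keep at most n_peaks columns of a (2, N) spectrum, highest intensity first."""
    intensities = spectrum[1]
    # A stable sort on the negated intensities gives descending order and
    # keeps the original order among peaks with equal intensity.
    order = np.argsort(-intensities, kind="stable")[:n_peaks]
    return spectrum[:, order]


example = np.array(
    [
        [184.07, 496.34, 760.58, 300.12],  # row 0: m/z values (illustrative)
        [100.0, 30.0, 55.0, 1.0],          # row 1: intensities (illustrative)
    ]
)
top_two = select_top_peaks(example, n_peaks=2)
print(top_two)
```

If the spectrum has fewer than n_peaks peaks, the slice simply returns all of them, still sorted by descending intensity.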

bin_spectrum(spectrum: ndarray) → ndarray

Truncates the m/z values at the decimal position defined in the config.yaml and sums up the intensities.

Parameters: spectrum (np.ndarray) – the m/z and intensity values of a dataset from the HDF5 file
Returns: the binned spectrum ordered by intensity
Return type: np.ndarray
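The binning described for bin_spectrum (truncate each m/z value at a fixed decimal position, then sum the intensities that fall into the same bin) might look roughly like the sketch below. The function name `bin_by_truncation` is hypothetical, and ordering the result by descending intensity is an assumption; the description above only says "ordered by intensity".

```python
import numpy as np


def bin_by_truncation(spectrum, decimal_accuracy):
    """Truncate m/z values of a (2, N) spectrum and sum intensities per bin."""
    scale = 10 ** decimal_accuracy
    # Integer bin keys avoid comparing floats for equality. Values that sit
    # extremely close to a bin edge can still be affected by float rounding.
    keys = np.floor(spectrum[0] * scale).astype(np.int64)
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    summed = np.bincount(inverse, weights=spectrum[1])
    binned = np.vstack([unique_keys / scale, summed])
    # Assumption: highest summed intensity first.
    order = np.argsort(-summed, kind="stable")
    return binned[:, order]


example = np.array(
    [
        [184.07, 184.02, 760.58, 760.51, 496.34],  # row 0: m/z values (illustrative)
        [10.0, 5.0, 20.0, 1.0, 8.0],               # row 1: intensities (illustrative)
    ]
)
binned = bin_by_truncation(example, decimal_accuracy=1)
print(binned)
```

With a decimal accuracy of 1, the two peaks near 184.0 merge into one bin and the two peaks near 760.5 merge into another, so five input peaks become three.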

mzML (Prediction)

For prediction, LipiDetective reads raw mzML files produced by mass spectrometry instruments. The PredictionDataset handles reading, filtering, and converting mzML spectra:

class PredictionDataset(file_path: str, config: dict[str, Any])

process_input() → list[dict[str, Any]]

process_mzml() → list[dict[str, Any]]

process_json() → list[dict[str, Any]]

get_n_highest_peaks(mz_array: ndarray, intensity_array: ndarray, precursor: float) → Tensor

Converting mzML to HDF5

To prepare mzML files for training, use the included conversion script:

$ python -m lipidetective.helpers.mzml_to_hdf5 \
    -i path/to/mzml_folder/ \
    -o path/to/output_folder/

This reads all mzML files in the input directory and produces a single HDF5 file suitable for training.

Spectrum Processing

During loading, spectra are processed as follows:

Peak selection — The n_peaks highest-intensity peaks are retained (default: 30, set via input_embedding.n_peaks)
m/z truncation — Peaks above max_mz are discarded (default: 1600, set via input_embedding.max_mz)
Decimal rounding — m/z values are rounded to decimal_accuracy decimal places (default: 1)

These parameters are configured in the input_embedding section of the config file. See Configuration for details.
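To make the three steps concrete, the sketch below chains them on a small 2×N array. It is only an illustration under stated assumptions: the function name `preprocess` and the flat settings dict are hypothetical, the steps are applied in the order they are listed above, and the last step uses rounding as stated in the list. The actual loader may order or implement the steps differently.

```python
import numpy as np

# Defaults quoted in the list above. Representing the input_embedding
# section as a flat dict is an assumption made for this sketch.
input_embedding = {"n_peaks": 30, "max_mz": 1600, "decimal_accuracy": 1}


def preprocess(spectrum, settings):
    """Apply peak selection, the max_mz cutoff, and decimal rounding in turn."""
    # 1. Peak selection: keep the n_peaks most intense peaks.
    order = np.argsort(-spectrum[1], kind="stable")[: settings["n_peaks"]]
    selected = spectrum[:, order]
    # 2. m/z cutoff: discard peaks above max_mz.
    kept = selected[:, selected[0] <= settings["max_mz"]]
    # 3. Decimal rounding of the remaining m/z values.
    rounded = kept.copy()
    rounded[0] = np.round(rounded[0], settings["decimal_accuracy"])
    return rounded


example = np.array(
    [
        [184.07, 496.34, 760.58, 1700.0, 300.12],  # row 0: m/z values (illustrative)
        [100.0, 30.0, 55.0, 80.0, 1.0],            # row 1: intensities (illustrative)
    ]
)
# Use a small n_peaks so that the effect is visible on five peaks.
processed = preprocess(example, {**input_embedding, "n_peaks": 3})
print(processed)
```

Because the cutoff runs after peak selection in this sketch, a strong peak above max_mz takes one of the n_peaks slots before being discarded, so fewer than n_peaks peaks can remain.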