Data Preparation
LipiDetective works with two file formats: HDF5 for training and evaluation, and mzML for prediction on new spectra.
Supported Formats
HDF5 (Training & Evaluation)
HDF5 files store pre-processed tandem mass spectra under a single group
/all_datasets/. Each spectrum is stored as a dataset (not a group)
with a 2×N array where row 0 is m/z values and row 1 is intensities.
Metadata is stored as HDF5 attributes on each dataset:
| Attribute | Type | Description |
|---|---|---|
| lipid_species | string | Lipid species name (empty string for unlabeled data) |
| adduct | string | Adduct type (e.g. [M+H]+) |
| precursor | float | Precursor m/z value |
| polarity | string | |
| | int | Scan index from the source file |
| | int | MS level (2 for MS/MS) |
| | string | Source file name |
Example HDF5 structure:
/all_datasets/
├── "PC 34:1 | pos | nist | 042" # dataset: float64 array (2, N)
│ ├── attrs["lipid_species"] = "PC 34:1"
│ ├── attrs["adduct"] = "[M+H]+"
│ ├── attrs["precursor"] = 760.585
│ ├── attrs["polarity"] = "pos"
│ └── ...
├── "PE 36:2 | neg | nist | 117"
│ └── ...
└── ...
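As a rough illustration of the layout above, the following sketch writes one spectrum into an `all_datasets` group with `h5py` and reads it back. The file name, peak values, and the subset of attributes are made up for the example; only the group name, the (2, N) array shape, and the attribute keys come from the structure shown above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical spectrum: row 0 holds m/z values, row 1 holds intensities.
spectrum = np.array(
    [[184.073, 478.329, 504.345, 760.585],
     [1000.0, 120.0, 85.0, 430.0]],
    dtype=np.float64,
)

# Throwaway location for the example file (assumption, not a LipiDetective path).
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")

# Write the spectrum as a dataset (not a group) inside /all_datasets/.
with h5py.File(path, "w") as f:
    group = f.create_group("all_datasets")
    dataset = group.create_dataset("PC 34:1 | pos | nist | 042", data=spectrum)
    dataset.attrs["lipid_species"] = "PC 34:1"
    dataset.attrs["adduct"] = "[M+H]+"
    dataset.attrs["precursor"] = 760.585
    dataset.attrs["polarity"] = "pos"

# Read it back: list the dataset names, then split the array into its two rows.
with h5py.File(path, "r") as f:
    group = f["all_datasets"]
    dataset_names = list(group.keys())
    first = group[dataset_names[0]]
    mz, intensity = first[0, :], first[1, :]
    species = first.attrs["lipid_species"]
    precursor = float(first.attrs["precursor"])

print(dataset_names, species, precursor, mz.shape)
```

A list of names like `dataset_names` is what the `H5Dataset` constructor documented below takes as its `dataset_names` argument.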
Use H5Dataset to load HDF5 files in your training pipeline:
- class H5Dataset(config: dict[str, Any], dataset_names: list[str], lipid_librarian: LipidLibrary, file_path: str)
This class implements a custom PyTorch dataset. It handles reading in the training data from HDF5 files. It overrides the __getitem__ method to return the processed spectrum and its metadata for a sample at a given index.
- dataset_names
list of dataset names in the HDF5 file group “all_datasets”, used to access samples by index
- Type:
list[str]
- hdf5_file
the opened HDF5 file, set to None during initialization and set once the first sample is requested
- Type:
h5py.File
- decimal_accuracy
Integer indicating the decimal accuracy to which the mass spectra should be binned
- Type:
int
- lipid_librarian
LipidLibrarian instance used for generating the label for a sample
- Type:
LipidLibrary
- get_n_highest_peaks(spectrum: ndarray, n_peaks: int) → ndarray
Processes the spectrum for a sample in the H5Dataset to prepare it as input for the model.
- Parameters:
spectrum (np.ndarray) – the spectrum extracted from an HDF5 dataset containing m/z and intensity arrays
n_peaks (int) – maximum number of peaks to be fed into neural network
- Returns:
the spectrum containing the number of peaks specified in n_peaks and sorted by descending intensity
- Return type:
np.ndarray
- bin_spectrum(spectrum: ndarray) → ndarray
Truncates the m/z values at the decimal position defined in the config.yaml and sums up the intensities.
- Parameters:
spectrum (np.ndarray) – the m/z and intensity values of a dataset from the HDF5 file
- Returns:
the binned spectrum ordered by intensity
- Return type:
np.ndarray
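The two methods documented above can be approximated with plain NumPy. This is a minimal sketch written from the docstrings only, not the library's actual implementation; the function names `n_highest_peaks` and `bin_by_truncation` are hypothetical, and the decimal accuracy is passed as an argument here rather than read from the config.

```python
import numpy as np


def n_highest_peaks(spectrum: np.ndarray, n_peaks: int) -> np.ndarray:
    """Keep at most n_peaks peaks, sorted by descending intensity."""
    order = np.argsort(spectrum[1])[::-1]  # indices from highest to lowest intensity
    return spectrum[:, order[:n_peaks]]


def bin_by_truncation(spectrum: np.ndarray, decimal_accuracy: int) -> np.ndarray:
    """Truncate m/z to decimal_accuracy decimals and sum the intensities per bin."""
    factor = 10 ** decimal_accuracy
    # Integer bin keys avoid grouping on floating-point values directly.
    keys = np.floor(spectrum[0] * factor).astype(np.int64)
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    summed = np.bincount(inverse, weights=spectrum[1])
    binned = np.vstack([unique_keys / factor, summed])
    # Order the bins by descending summed intensity.
    return binned[:, np.argsort(binned[1])[::-1]]


# Hypothetical spectrum: 184.07 and 184.09 fall into the same 0.1-wide bin.
spectrum = np.array(
    [[184.07, 184.09, 760.58, 478.33],
     [100.0, 50.0, 430.0, 120.0]]
)

top_two = n_highest_peaks(spectrum, n_peaks=2)
binned = bin_by_truncation(spectrum, decimal_accuracy=1)
print(top_two)
print(binned)
```

Truncating with `np.floor` on scaled values can be off by one bin when an m/z value sits on a bin edge and is not exactly representable as a float, so a real implementation may handle that case differently.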
mzML (Prediction)
For prediction, LipiDetective reads raw mzML files produced by mass
spectrometry instruments. The PredictionDataset handles reading,
filtering, and converting mzML spectra.
Converting mzML to HDF5
To prepare mzML files for training, use the included conversion script:
$ python -m lipidetective.helpers.mzml_to_hdf5 \
-i path/to/mzml_folder/ \
-o path/to/output_folder/
This reads all mzML files in the input directory and produces a single HDF5 file suitable for training.
Spectrum Processing
During loading, spectra are processed as follows:
1. Peak selection: the n_peaks highest-intensity peaks are retained (default: 30, set via input_embedding.n_peaks).
2. m/z truncation: peaks above max_mz are discarded (default: 1600, set via input_embedding.max_mz).
3. Decimal rounding: m/z values are rounded to decimal_accuracy decimal places (default: 1).
These parameters are configured in the input_embedding section of the
config file. See Configuration for details.
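The processing steps listed in this section can be sketched as a single function. This is an illustration under stated assumptions, not LipiDetective's code: the function name `process_spectrum` is hypothetical, the config is shown as a plain dict holding only the three `input_embedding` keys named above, and the steps are applied in the order in which they are listed.

```python
import numpy as np

# Assumed shape of the relevant config section, using the documented defaults.
default_config = {
    "input_embedding": {"n_peaks": 30, "max_mz": 1600, "decimal_accuracy": 1}
}


def process_spectrum(spectrum: np.ndarray, config: dict) -> np.ndarray:
    """Apply peak selection, m/z cut-off, and decimal rounding to a (2, N) array."""
    params = config["input_embedding"]

    # Peak selection: keep the n_peaks most intense peaks.
    order = np.argsort(spectrum[1])[::-1][: params["n_peaks"]]
    selected = spectrum[:, order]

    # m/z truncation: discard peaks above max_mz (boolean indexing returns a copy).
    selected = selected[:, selected[0] <= params["max_mz"]]

    # Decimal rounding: round m/z values to decimal_accuracy decimal places.
    selected[0] = np.round(selected[0], params["decimal_accuracy"])
    return selected


# Hypothetical spectrum with one peak above the default max_mz of 1600.
spectrum = np.array(
    [[184.073, 760.585, 1650.2, 478.329],
     [1000.0, 430.0, 900.0, 120.0]]
)

# Override n_peaks to 3 so the effect of peak selection is visible on four peaks.
config = {"input_embedding": {**default_config["input_embedding"], "n_peaks": 3}}
processed = process_spectrum(spectrum, config)
print(processed)
```

With these inputs, the weakest peak is dropped by peak selection, the peak at m/z 1650.2 is dropped by the cut-off, and the remaining m/z values are rounded to one decimal place.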