Data Preparation
================

LipiDetective works with two file formats: **HDF5** for training and evaluation, and **mzML** for prediction on new spectra.

Supported Formats
-----------------

HDF5 (Training & Evaluation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

HDF5 files store pre-processed tandem mass spectra under a single group ``/all_datasets/``. Each spectrum is stored as a **dataset** (not a group) holding a 2×N array, where row 0 contains the m/z values and row 1 the intensities. Metadata is stored as **HDF5 attributes** on each dataset:

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Attribute
     - Type
     - Description
   * - ``lipid_species``
     - string
     - Lipid species name (empty string for unlabeled data)
   * - ``adduct``
     - string
     - Adduct type (e.g. ``[M+H]+``)
   * - ``precursor``
     - float
     - Precursor m/z value
   * - ``polarity``
     - string
     - ``"pos"`` or ``"neg"``
   * - ``scan_index``
     - int
     - Scan index from the source file
   * - ``level``
     - int
     - MS level (2 for MS/MS)
   * - ``source``
     - string
     - Source file name

Example HDF5 structure:

.. code-block:: text

   /all_datasets/
   ├── "PC 34:1 | pos | nist | 042"    # dataset: float64 array (2, N)
   │   ├── attrs["lipid_species"] = "PC 34:1"
   │   ├── attrs["adduct"] = "[M+H]+"
   │   ├── attrs["precursor"] = 760.585
   │   ├── attrs["polarity"] = "pos"
   │   └── ...
   ├── "PE 36:2 | neg | nist | 117"
   │   └── ...
   └── ...

Use ``H5Dataset`` to load HDF5 files in your training pipeline:

.. autoclass:: lipidetective.workflow.h5_dataset.H5Dataset
   :members:
   :undoc-members:

mzML (Prediction)
^^^^^^^^^^^^^^^^^^

For prediction, LipiDetective reads raw mzML files produced by mass spectrometry instruments. The ``PredictionDataset`` handles reading, filtering, and converting mzML spectra:

.. autoclass:: lipidetective.workflow.prediction_dataset.PredictionDataset
   :members:
   :undoc-members:

Converting mzML to HDF5
------------------------

To prepare mzML files for training, use the included conversion script:

.. code-block:: console

   $ python -m lipidetective.helpers.mzml_to_hdf5 \
       -i path/to/mzml_folder/ \
       -o path/to/output_folder/

This reads all mzML files in the input directory and produces a single HDF5 file suitable for training.

Spectrum Processing
-------------------

During loading, spectra are processed as follows:

1. **Peak selection**: the ``n_peaks`` highest-intensity peaks are retained
   (default: 30, set via ``input_embedding.n_peaks``).
2. **m/z truncation**: peaks above ``max_mz`` are discarded
   (default: 1600, set via ``input_embedding.max_mz``).
3. **Decimal rounding**: m/z values are rounded to ``decimal_accuracy`` decimal places
   (default: 1).

These parameters are configured in the ``input_embedding`` section of the config file. See :doc:`configuration` for details.
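
The three processing steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not LipiDetective's actual implementation: the function name ``preprocess_spectrum`` is hypothetical, the steps are assumed to run in the order listed, and the input is assumed to be a 2×N array laid out like the HDF5 datasets (row 0 m/z, row 1 intensity).

.. code-block:: python

   import numpy as np


   def preprocess_spectrum(spectrum, n_peaks=30, max_mz=1600.0, decimal_accuracy=1):
       """Hypothetical helper mirroring the documented processing steps.

       `spectrum` is a (2, N) array: row 0 holds m/z values, row 1 intensities.
       """
       mz, intensity = spectrum[0], spectrum[1]

       # 1. Peak selection: indices of the n_peaks most intense peaks,
       #    re-sorted so the surviving peaks keep their original order.
       keep = np.argsort(-intensity, kind="stable")[:n_peaks]
       keep = np.sort(keep)
       mz, intensity = mz[keep], intensity[keep]

       # 2. m/z truncation: discard peaks above max_mz.
       mask = mz <= max_mz
       mz, intensity = mz[mask], intensity[mask]

       # 3. Decimal rounding: round m/z to the configured number of decimals.
       mz = np.round(mz, decimal_accuracy)

       return np.vstack([mz, intensity])


   # Toy spectrum with four peaks; keep the three most intense.
   example = np.array([
       [100.04, 250.16, 1700.5, 500.26],  # m/z
       [10.0, 50.0, 99.0, 5.0],           # intensity
   ])
   processed = preprocess_spectrum(example, n_peaks=3)

In this toy case the weakest peak (m/z 500.26) is dropped by peak selection and the peak at m/z 1700.5 by truncation, leaving two peaks with m/z values rounded to one decimal place. If the real pipeline truncates before selecting peaks, the retained set can differ, so treat the ordering here as an assumption.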