boltzkit.utils.dataloader
Functions
cache_load_sample_derived_data: Load or compute derived data for a set of samples with optional caching.
load_from_file: Load data from a file and validate/reshape it according to its type.
Classes
CacheLoadingArgs: Configuration options controlling dataset cache loading and automatic generation of sample-related quantities (log-probs/energies, scores/forces).
- boltzkit.utils.dataloader.load_from_file(path: str | pathlib.Path, data_type: typing.Literal['log_probs', 'samples'], n_samples: int | None = None, dtype: type | numpy.dtype = numpy.float32) -> ndarray
Load data from a file and validate/reshape it according to its type.
This function supports PyTorch (.pt, .pth) and NumPy (.npy, .npz) files. The loaded data is converted to a NumPy array and reshaped based on data_type.
- Parameters:
path (str or pathlib.Path) – Path to the file to load. Must exist and have a supported extension.
data_type ({"log_probs", "samples"}) –
Specifies the type of data being loaded, which determines shape validation:
- "log_probs": expects data of shape (batch,) or (batch, 1) and flattens it to (batch,).
- "samples": expects data of shape (batch,), (batch, dim), or (batch, n_nodes, 3). 3D molecular data is flattened to (batch, n_nodes*3).
dtype (type or np.dtype, optional) – Desired floating-point type for the loaded data. The data is converted to this type after loading. If not specified, the default given in the signature above (np.float32) is used.
- Returns:
Loaded data as a NumPy array with appropriate shape for the given data_type.
- Return type:
np.ndarray
- Raises:
FileNotFoundError – If path does not exist.
ImportError – If the file format requires PyTorch or NumPy and the library is not installed.
RuntimeError – If the file could not be loaded.
TypeError – If the loaded object is not of the expected type (torch.Tensor for PyTorch, np.ndarray for NumPy).
ValueError – If the file extension is unsupported or if the loaded data has an invalid shape for the specified data_type.
Examples
>>> from pathlib import Path
>>> from boltzkit.utils.dataloader import load_from_file
>>> data = load_from_file(Path("predictions.npy"), data_type="log_probs")
>>> print(data.shape)
(1000,)
>>> data = load_from_file(Path("samples.pt"), data_type="samples")
>>> print(data.shape)  # if original shape was (1000, 66, 3) for molecular coordinates
(1000, 198)
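The shape rules listed under data_type can be illustrated without any files on disk. The following is a minimal sketch, not the library's implementation: reshape_by_type is a hypothetical helper name, and returning 1D "samples" input unchanged is an assumption, since the documentation only says that shape is accepted.

```python
import numpy as np


def reshape_by_type(arr, data_type, dtype=np.float32):
    """Hypothetical helper mirroring the documented shape rules."""
    arr = np.asarray(arr, dtype=dtype)
    if data_type == "log_probs":
        # (batch,) or (batch, 1) is accepted and flattened to (batch,).
        if arr.ndim == 1 or (arr.ndim == 2 and arr.shape[1] == 1):
            return arr.reshape(-1)
        raise ValueError(f"invalid log_probs shape: {arr.shape}")
    if data_type == "samples":
        # (batch,) and (batch, dim) are accepted; returned unchanged here (assumption).
        if arr.ndim in (1, 2):
            return arr
        # (batch, n_nodes, 3) molecular coordinates are flattened to (batch, n_nodes*3).
        if arr.ndim == 3 and arr.shape[2] == 3:
            return arr.reshape(arr.shape[0], -1)
        raise ValueError(f"invalid samples shape: {arr.shape}")
    raise ValueError(f"unknown data_type: {data_type!r}")


coords = np.zeros((1000, 66, 3))
flat = reshape_by_type(coords, "samples")
print(flat.shape)  # (1000, 198)
```

Raising ValueError for any other shape matches the Raises section above; the exact error messages are illustrative.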
- boltzkit.utils.dataloader.cache_load_sample_derived_data(samples: ndarray, data_fpath: Path | None, data_cache_fpath: Path | None = None, data_eval_fn: Callable[[ndarray], ndarray] | None = None, allow_autogen: bool = False, cache_data: bool = False) -> ndarray
Load or compute derived data for a set of samples with optional caching.
The function attempts, in order, to load data from a primary file, fall back to a cache file, or generate missing data using a provided evaluation function. Generated data can optionally be cached.
Logic priority:
1. Load from the primary data_fpath if it exists.
2. Load from data_cache_fpath if it exists (requires cache_data to be True).
3. If allow_autogen is True, compute the missing data using data_eval_fn.
4. If cache_data is True, save the computed results to data_cache_fpath.
- Parameters:
samples (numpy.ndarray) – Input samples of shape (n_samples, …).
data_fpath (pathlib.Path or None) – Path to the primary data file to load.
data_cache_fpath (pathlib.Path or None) – Path to the cache file for loading/saving data.
data_eval_fn (Callable[[numpy.ndarray], numpy.ndarray] or None) – Function to compute derived data from samples (e.g., log_probs or scores).
allow_autogen (bool) – If True, compute missing data when not available.
cache_data (bool) – If True, enable loading from and saving to cache.
- Returns:
Array of derived data aligned with samples.
- Return type:
numpy.ndarray
- Raises:
ValueError – If autogeneration is enabled but no evaluation function is provided.
RuntimeError – If data cannot be loaded or generated.
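The load/cache/generate priority described above can be sketched in a few lines. This is a simplified stand-in under stated assumptions, not the library's code: load_or_compute is a hypothetical name, files are read and written with np.load and np.save only (the real function may go through load_from_file and support PyTorch files), and the quadratic energy function is purely illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(samples, data_fpath, data_cache_fpath=None,
                    data_eval_fn=None, allow_autogen=False, cache_data=False):
    """Hypothetical re-implementation of the documented priority order."""
    # 1. The primary file wins if it exists.
    if data_fpath is not None and data_fpath.exists():
        return np.load(data_fpath)
    # 2. Fall back to the cache file, but only when caching is enabled.
    if cache_data and data_cache_fpath is not None and data_cache_fpath.exists():
        return np.load(data_cache_fpath)
    # 3. Otherwise compute the data, if that is allowed.
    if allow_autogen:
        if data_eval_fn is None:
            raise ValueError("allow_autogen=True requires data_eval_fn")
        data = data_eval_fn(samples)
        # 4. Optionally persist the freshly computed result.
        if cache_data and data_cache_fpath is not None:
            np.save(data_cache_fpath, data)
        return data
    raise RuntimeError("data could not be loaded or generated")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "samples.npy_log_probs.npy"
    x = np.random.default_rng(0).normal(size=(4, 2))
    energy = lambda s: -0.5 * (s ** 2).sum(axis=1)  # illustrative only
    # First call computes and caches; second call is served from the cache file.
    first = load_or_compute(x, None, cache, energy, allow_autogen=True, cache_data=True)
    second = load_or_compute(x, None, cache, None, cache_data=True)
```

Note that in this sketch the second call succeeds without an evaluation function because step 2 returns before step 3 is reached.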
- class boltzkit.utils.dataloader.DatasetLoader
Bases: ABC
- load_dataset(type: Literal['train', 'val', 'test'], length: int, *, include_samples: bool = True, include_log_probs: bool = False, include_scores: bool = False, **kwargs) -> Dataset
Load the dataset of the specified split.
This method retrieves samples and optionally associated log_probs/energies and scores/forces.
- Parameters:
type (Literal["train", "val", "test"]) – Dataset split to load.
length (int) – Maximum number of samples to load. If -1, all available samples are used.
T (float | int | None) – Temperature (in Kelvin) identifying the dataset. Integers are cast to float. If None, the target’s temperature is used.
include_samples (bool, default=True) – Whether to return samples.
include_log_probs (bool, default=False) – Whether to include energy values for each sample. Fails if no energies are available and allow_autogen is False.
include_scores (bool, default=False) – Whether to include force values for each sample. Fails if no forces are available and allow_autogen is False.
- Return type:
Dataset
- Raises:
ValueError | NotImplementedError | Exception – If the dataset configuration is missing or the requested data cannot be computed or retrieved.
- boltzkit.utils.dataloader._get_cache_path(samples_fpath: Path, cache_data_type: Literal['log_probs', 'scores'] | str) -> Path
Creates a cache path next to the samples file, e.g., 'samples.npy' -> 'samples.npy_log_probs.npy'.
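The naming scheme in the example above can be reproduced with pathlib alone. A minimal sketch, assuming the cache name is always the full samples file name followed by "_<type>.npy" (generalized from the single documented example); cache_path_for is a hypothetical name, not the library function.

```python
from pathlib import Path


def cache_path_for(samples_fpath: Path, cache_data_type: str) -> Path:
    """Hypothetical stand-in: append '_<type>.npy' to the full file name."""
    # with_name keeps the parent directory, so the cache sits next to the samples file.
    return samples_fpath.with_name(f"{samples_fpath.name}_{cache_data_type}.npy")


path = cache_path_for(Path("data") / "samples.npy", "log_probs")
print(path.name)  # samples.npy_log_probs.npy
```

Keeping the original extension inside the cache name means samples.npy and samples.pt in the same directory would get distinct cache files.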
- class boltzkit.utils.dataloader.CacheLoadingArgs
Bases: object
Configuration options controlling dataset cache loading and automatic generation of sample-related quantities (log-probs/energies, scores/forces).
- Parameters:
allow_autogen (bool, optional, default=True) – If True, missing quantities (e.g., log-probs/energies, scores/forces) may be computed automatically online if possible.
cache_log_probs (bool, optional, default=True) – Whether log-probs/energies can be cached after online computation (allow_autogen=True) or loaded from cache files if available.
cache_scores (bool, optional, default=False) – Whether scores/forces can be cached after online-computation (allow_autogen=True) or loaded from cache files if available.
- allow_autogen: bool = True
- cache_log_probs: bool = True
- cache_scores: bool = False
- __init__(allow_autogen: bool = True, cache_log_probs: bool = True, cache_scores: bool = False) None
- class boltzkit.utils.dataloader.CachedRepoDatasetLoader
Bases: DatasetLoader
- __init__(kB_T: float, cached_repo: CachedRepo, T: float, log_prob_fn: Callable[[ndarray], ndarray] | None, score_fn: Callable[[ndarray], ndarray] | None, caching_args: CacheLoadingArgs | dict | None = None)
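The constructor accepts caching_args as a CacheLoadingArgs instance, a dict, or None. How these three forms are normalized is not documented; the sketch below shows one plausible approach. CacheLoadingArgsSketch is a stand-in dataclass that copies the documented fields and defaults, and coerce_caching_args is a hypothetical helper. Neither is part of boltzkit.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class CacheLoadingArgsSketch:
    """Stand-in with the documented fields and defaults of CacheLoadingArgs."""
    allow_autogen: bool = True
    cache_log_probs: bool = True
    cache_scores: bool = False


def coerce_caching_args(
    value: Optional[Union[CacheLoadingArgsSketch, dict]],
) -> CacheLoadingArgsSketch:
    """Hypothetical normalization of the three accepted input forms."""
    if value is None:
        # Assumption: None means "use the documented defaults".
        return CacheLoadingArgsSketch()
    if isinstance(value, dict):
        # Assumption: dict keys map one-to-one onto field names.
        return CacheLoadingArgsSketch(**value)
    return value


args = coerce_caching_args({"cache_scores": True})
print(args)
```

With this approach an unknown dict key fails immediately with a TypeError from the dataclass constructor, which surfaces typos in configuration early.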
- class boltzkit.utils.dataloader.DomainScaledDatasetLoader
Bases: DatasetLoader
- __init__(dataset_loader: DatasetLoader, length_scale: float)
- load_dataset(type, length, *, include_samples=True, include_log_probs=False, include_scores=False, **kwargs)
Load the dataset of the specified split.
This method retrieves samples and optionally associated log_probs/energies and scores/forces.
- Parameters:
type (Literal["train", "val", "test"]) – Dataset split to load.
length (int) – Maximum number of samples to load. If -1, all available samples are used.
T (float | int | None) – Temperature (in Kelvin) identifying the dataset. Integers are cast to float. If None, the target’s temperature is used.
include_samples (bool, default=True) – Whether to return samples.
include_log_probs (bool, default=False) – Whether to include energy values for each sample. Fails if no energies are available and allow_autogen is False.
include_scores (bool, default=False) – Whether to include force values for each sample. Fails if no forces are available and allow_autogen is False.
- Return type:
Dataset
- Raises:
ValueError | NotImplementedError | Exception – If the dataset configuration is missing or the requested data cannot be computed or retrieved.