Molecular Boltzmann Targets

TODO: Update the API so that MolecularBoltzmann targets are created via classmethods.

TODO: The API for creating MolecularBoltzmann targets will likely change soon.

Overview

The boltzkit.targets.boltzmann.MolecularBoltzmann class defines a family of molecular energy-based targets representing Boltzmann distributions over molecular conformations.

Targets can be initialized from multiple sources:

  • Hugging Face datasets (predefined systems such as alanine dipeptide, tetrapeptide, and hexapeptide)

  • Local directories

  • PDB files

Instantiation

A MolecularBoltzmann target can be created as follows:

from boltzkit.targets.boltzmann import MolecularBoltzmann

target = MolecularBoltzmann("datasets/chrklitz99/alanine_dipeptide")

Supported input sources

Hugging Face datasets

Predefined molecular systems are available via Hugging Face:

  • datasets/chrklitz99/alanine_dipeptide

  • datasets/chrklitz99/alanine_tetrapeptide

  • datasets/chrklitz99/alanine_hexapeptide

These datasets provide both the force-field specification and ready-to-use, high-quality MD data.

Local directory

A target can also be initialized from a local dataset directory:

target = MolecularBoltzmann("/path/to/local/dataset")

The directory must contain the required files in the expected format, mirroring the layout of the Hugging Face repositories.

PDB file

For quick setup from a single molecular structure, a PDB file can be used:

target = MolecularBoltzmann.create_from_pdb("custom_target_name", "structure.pdb")

Common operations

Once initialized, the target provides dataset utilities and molecular metadata:

val_dataset = target.load_dataset(T=300.0, type="val") # if available
topology = target.get_mdtraj_topology()
tica_model = target.get_tica_model() # if available

Evaluation functions:

target.get_log_prob(samples)
target.get_score(samples)
target.get_log_prob_and_score(samples)
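
To make the relationship between these three calls concrete, here is a minimal sketch on a toy harmonic energy instead of a molecular force field. It assumes that the log-probability is the unnormalized Boltzmann log-density, -U(x) / (k_B T), and that the score is its gradient with respect to the coordinates. These conventions are assumptions, and every function name below is a hypothetical stand-in, not part of boltzkit.

```python
import numpy as np

KB = 0.0083144626  # Boltzmann constant in kJ/(mol*K)

def toy_energy(x, k=100.0):
    """Harmonic toy energy in kJ/mol; x has shape (batch, dim)."""
    return 0.5 * k * np.sum(x**2, axis=-1)

def toy_log_prob(x, T=300.0):
    """Unnormalized Boltzmann log-density: -U(x) / (k_B T)."""
    return -toy_energy(x) / (KB * T)

def toy_score(x, T=300.0, k=100.0):
    """Gradient of toy_log_prob with respect to x."""
    return -k * x / (KB * T)

def toy_log_prob_and_score(x, T=300.0):
    """Return both quantities in one call, mirroring the combined method."""
    return toy_log_prob(x, T), toy_score(x, T)

rng = np.random.default_rng(0)
samples = rng.normal(scale=0.1, size=(4, 6))
log_prob, score = toy_log_prob_and_score(samples)

# Central finite difference along the first coordinate as a sanity check
eps = 1e-6
shift = np.zeros_like(samples)
shift[:, 0] = eps
fd = (toy_log_prob(samples + shift) - toy_log_prob(samples - shift)) / (2 * eps)
```

A finite-difference check like the one at the end is a cheap way to confirm that a log-probability and a score are mutually consistent before using them in training or evaluation.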

Length scale configuration

Internally, all MolecularBoltzmann targets operate in nanometers. The length unit used for inputs and outputs can be set at initialization:

target = MolecularBoltzmann(
    "datasets/chrklitz99/alanine_dipeptide",
    length_unit="angstrom"  # default: "nanometer"
)

Unit handling

When a length unit is specified, all inputs and outputs (coordinates, forces, scores) are automatically converted to and from that unit.

This ensures consistent usage across the entire API without manual conversions.

We strongly recommend using a consistent unit system across all evaluations involving MolecularBoltzmann targets.

Alternative specification

Instead of a unit name, a scalar scaling factor can be provided. The named units correspond to the following factors:

  • "nanometer": 1.0

  • "angstrom": 0.1
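
The following sketch illustrates what such a scaling factor implies, assuming it is the size of one user unit in nanometers (which matches "angstrom" corresponding to 0.1). Coordinates are multiplied by the factor on the way in and divided by it on the way out, while gradients such as scores pick up the factor via the chain rule. The helper names are hypothetical, and whether boltzkit also adds a Jacobian term to the log-probability is not stated here, so that is left out.

```python
import numpy as np

# Assumed meaning of the scaling factor: nanometers per user unit
NM_PER_UNIT = {"nanometer": 1.0, "angstrom": 0.1}

def to_internal(coords, length_unit):
    """Convert coordinates from user units to nanometers."""
    return np.asarray(coords) * NM_PER_UNIT[length_unit]

def from_internal(coords_nm, length_unit):
    """Convert coordinates from nanometers back to user units."""
    return np.asarray(coords_nm) / NM_PER_UNIT[length_unit]

def score_to_user_units(score_nm, length_unit):
    """Chain rule: d/dx_user = (dx_nm / dx_user) * d/dx_nm."""
    return np.asarray(score_nm) * NM_PER_UNIT[length_unit]

coords_angstrom = np.array([[1.5, 0.0, -2.0]])
coords_nm = to_internal(coords_angstrom, "angstrom")
round_trip = from_internal(coords_nm, "angstrom")
score_angstrom = score_to_user_units(np.array([[10.0, 0.0, -5.0]]), "angstrom")
```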

Molecular dynamics trajectories

Note

This section applies only when new datasets need to be generated. For the systems already included in this repository, most of the corresponding trajectories have already been generated at 300 K. This section requires the dev requirements to be installed (see development setup).

Equilibrium-distribution trajectories can be generated using the tools/run_simulation.py script. For each system, we generate two independent trajectories of 10⁷ samples each. Alternatively, to increase parallelization, we generate four trajectories of 5 × 10⁶ samples each and concatenate them pairwise, which again yields two buffers of 10⁷ samples each.

  • Trajectory 1 is used to fit the TICA model and, after a random permutation, serves as the test dataset.

  • Trajectory 2 is subsampled without replacement to construct the training and validation datasets, each containing 10⁶ samples.
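
The split described above can be sketched with NumPy on small stand-in arrays. This is only an illustration of the scheme, not the code in the tools/ scripts. The array shapes and the 1:10 subset ratio mirror the text, while the choice to make the training and validation subsets disjoint from each other is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small stand-ins for four 5e6-frame trajectories: (frames, atoms, 3)
n_frames, n_atoms = 500, 5
parts = [rng.normal(size=(n_frames, n_atoms, 3)) for _ in range(4)]

# Pairwise concatenation yields the two buffers
traj_1 = np.concatenate(parts[:2], axis=0)
traj_2 = np.concatenate(parts[2:], axis=0)

# Test dataset: a random permutation of trajectory 1
test = traj_1[rng.permutation(len(traj_1))]

# Train/val: subsample trajectory 2 without replacement
# (each subset is 1/10 of the buffer, mirroring 1e6 out of 1e7)
n_subset = len(traj_2) // 10
idx = rng.choice(len(traj_2), size=2 * n_subset, replace=False)
train = traj_2[idx[:n_subset]]
val = traj_2[idx[n_subset:]]
```

Drawing both index sets from a single call with replace=False guarantees that no frame appears in both the training and the validation subset.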

Commands

Alanine Dipeptide

The two trajectories of 10⁷ samples each for Alanine Dipeptide were generated by running the following command twice:

python tools/run_simulation.py --system datasets/chrklitz99/alanine_dipeptide --temps 300.0 --time_step 1.0 --rec_interval 0.5 --pre_eq_time 200.0 --simu_time 5000.0 --integrator LangevinMiddle --write_checkpoint_every_ns 100

Alanine Tetrapeptide

The four trajectories of 5 × 10⁶ samples each for Alanine Tetrapeptide were generated by running the following command four times:

python tools/run_simulation.py --system datasets/chrklitz99/alanine_tetrapeptide --temps 300.0:500.0:6 --time_step 1.0 --rec_interval 0.2 --pre_eq_time 200.0 --simu_time 1200.0 --integrator LangevinMiddle --write_checkpoint_every_ns 100 --save_traj_of_replicas 0

Alanine Hexapeptide

The four trajectories of 5 × 10⁶ samples each for Alanine Hexapeptide were generated by running the following command four times:

python tools/run_simulation.py --system datasets/chrklitz99/alanine_hexapeptide --temps 300.0:500.0:6 --time_step 1.0 --rec_interval 0.2 --pre_eq_time 200.0 --simu_time 1200.0 --integrator LangevinMiddle --write_checkpoint_every_ns 100 --save_traj_of_replicas 0
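
As a sanity check, the stated sample counts can be reproduced from the command-line values. The sketch below assumes that --simu_time is given in nanoseconds and --rec_interval in picoseconds; these units are inferred, not confirmed by the script's documentation. For the REMD runs it additionally assumes that the first 1,000,000 frames (200 ns) are discarded via --skipN, as described for Alanine Hexapeptide in the workflow section. The helper function is hypothetical.

```python
def recorded_frames(simu_time_ns, rec_interval_ps, skip=0):
    """Number of frames kept, assuming ns for simu_time and ps for rec_interval."""
    total = round(simu_time_ns * 1000.0 / rec_interval_ps)
    return total - skip

# Alanine Dipeptide: 5000 ns recorded every 0.5 ps
dipeptide_frames = recorded_frames(5000.0, 0.5)  # 10_000_000

# Tetra-/Hexapeptide: 1200 ns recorded every 0.2 ps, first 200 ns skipped
remd_frames = recorded_frames(1200.0, 0.2, skip=1_000_000)  # 5_000_000
```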

Workflow to create TICA model and datasets

  1. Convert to NumPy: Convert the raw .h5 output to NumPy format. For larger systems, use the --skipN flag to discard the initial equilibration phase (e.g., --skipN 1000000 for Alanine Hexapeptide to remove the first 200 ns of REMD).

    python tools/extract_trajectory_as_numpy.py path/to/traj.h5 --skipN <N>
    
  2. Create TICA Model: Run the model creation script on the trajectory designated for the test dataset. This identifies slow degrees of freedom and offers options for plotting lag-times or Ramachandran correspondences.

    python tools/create_tica_model.py --traj_path path/to/traj.npy --traj_total_sim_time_ns <sim_time> --system_name <system_path> --lag_time_ps 100
    
  3. Generate Test Dataset: Permute the first trajectory. If multiple parallel trajectories were generated, concatenate them before running this command:

    python tools/permute_trajectory.py path/to/traj_1.npy
    
  4. Generate Train/Val Datasets: Create random subsets without replacement from the second trajectory:

    python tools/split_trajectory.py path/to/traj_2.npy
    
  5. Upload everything to Hugging Face (see alanine dipeptide for reference) and update the info.yaml file.
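
For readers unfamiliar with step 2, the idea behind TICA can be sketched in a few lines of NumPy: estimate an instantaneous and a time-lagged covariance matrix, then solve the resulting generalized eigenproblem so that the leading components are the most slowly decorrelating linear combinations of the input features. This is a generic textbook-style sketch on synthetic data. It is not the implementation in tools/create_tica_model.py, whose featurization and backend are not described here, and the function name fit_tica is hypothetical.

```python
import numpy as np

def fit_tica(features, lag, n_components=2):
    """Minimal TICA. features: (n_frames, n_features); lag is in frames."""
    mean = features.mean(axis=0)
    x = features - mean
    x0, xt = x[:-lag], x[lag:]
    n = len(x0)
    # Symmetrized instantaneous and time-lagged covariance estimates
    c0 = (x0.T @ x0 + xt.T @ xt) / (2 * n)
    ct = (x0.T @ xt + xt.T @ x0) / (2 * n)
    # Whiten with c0, then diagonalize the whitened lagged covariance
    s, u = np.linalg.eigh(c0)
    keep = s > 1e-10 * s.max()
    w = u[:, keep] / np.sqrt(s[keep])
    evals, evecs = np.linalg.eigh(w.T @ ct @ w)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], w @ evecs[:, order], mean

# Synthetic data: one slow AR(1) process mixed with fast white noise
rng = np.random.default_rng(0)
n_frames = 20000
slow = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + rng.normal(scale=0.1)
fast = rng.normal(size=n_frames)
features = np.stack([slow + 0.1 * fast, fast], axis=1)

evals, projection, mean = fit_tica(features, lag=10)
tics = (features - mean) @ projection  # projected trajectory
```

In this toy setup the first component recovers the slow process. With real trajectories the lag in frames would follow from the chosen lag time and the recording interval, for example 100 ps at a 0.5 ps interval would correspond to 200 frames under the unit assumptions above.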