Molecular Boltzmann Targets
===========================

.. note::

   The API for creating ``MolecularBoltzmann`` targets is likely to change soon. Construction is planned to move to classmethods.

Overview
--------

The :class:`boltzkit.targets.boltzmann.MolecularBoltzmann` class defines a family of molecular energy-based targets representing Boltzmann distributions over molecular conformations.

Targets can be initialized from multiple sources:

- Hugging Face datasets (predefined systems such as alanine dipeptide, tetrapeptide, and hexapeptide)
- Local directories
- PDB files

Instantiation
-------------

A ``MolecularBoltzmann`` target can be created as follows:

.. code-block:: python

   from boltzkit.targets.boltzmann import MolecularBoltzmann

   target = MolecularBoltzmann("datasets/chrklitz99/alanine_dipeptide")


Supported input sources
-----------------------

Hugging Face datasets
~~~~~~~~~~~~~~~~~~~~~

Predefined molecular systems are available via Hugging Face:

- ``datasets/chrklitz99/alanine_dipeptide``
- ``datasets/chrklitz99/alanine_tetrapeptide``
- ``datasets/chrklitz99/alanine_hexapeptide``

These datasets provide both the force-field specification and ready-to-use, high-quality MD data.

Local directory
~~~~~~~~~~~~~~~~

A target can also be initialized from a local dataset directory:

.. code-block:: python

   target = MolecularBoltzmann("/path/to/local/dataset")

The directory must contain the required files, laid out in the same format as the Hugging Face repositories listed above.

PDB file
~~~~~~~~

For quick setup from a single molecular structure, a PDB file can be used:

.. code-block:: python

   target = MolecularBoltzmann.create_from_pdb("custom_target_name", "structure.pdb")


Common operations
-----------------

Once initialized, the target provides dataset utilities and molecular metadata:

.. code-block:: python

   val_dataset = target.load_dataset(T=300.0, type="val")  # if available
   topology = target.get_mdtraj_topology()
   tica_model = target.get_tica_model()  # if available


The target also exposes evaluation functions:

.. code-block:: python

   target.get_log_prob(samples)
   target.get_score(samples)
   target.get_log_prob_and_score(samples)
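
To make the relationship between these quantities concrete, the following sketch uses the textbook definition of a Boltzmann density, where the log-density is ``-U(x) / (k_B T)`` up to a constant and the score is its gradient with respect to the coordinates. It does not call boltzkit: the harmonic energy and the names ``toy_energy``, ``toy_log_prob``, and ``toy_score`` are hypothetical stand-ins, and whether ``get_log_prob`` returns a normalized value is not specified here.

.. code-block:: python

   import numpy as np

   KB = 0.0083144626  # Boltzmann constant in kJ/(mol*K)

   def toy_energy(x, k=100.0):
       """Harmonic energy U(x) = 0.5 * k * |x|^2 per sample (toy stand-in)."""
       return 0.5 * k * np.sum(x**2, axis=-1)

   def toy_log_prob(x, T=300.0, k=100.0):
       """Unnormalized Boltzmann log-density: -U(x) / (kB * T)."""
       return -toy_energy(x, k) / (KB * T)

   def toy_score(x, T=300.0, k=100.0):
       """Score = gradient of the log-density = -grad U(x) / (kB * T)."""
       return -k * x / (KB * T)

   rng = np.random.default_rng(0)
   samples = rng.normal(scale=0.1, size=(4, 6))  # 4 samples, 6 coordinates

   # Check the analytic score against a central finite difference
   # along the first coordinate.
   eps = 1e-6
   step = np.zeros_like(samples)
   step[:, 0] = eps
   fd = (toy_log_prob(samples + step) - toy_log_prob(samples - step)) / (2 * eps)
   assert np.allclose(fd, toy_score(samples)[:, 0], rtol=1e-5)

A finite-difference check of this kind is a cheap way to confirm that a log-density and its score agree before using them in training or evaluation.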


Length scale configuration
--------------------------

Internally, all ``MolecularBoltzmann`` targets operate in nanometers. The length unit used for inputs and outputs can be configured at initialization:

.. code-block:: python

   target = MolecularBoltzmann(
       "datasets/chrklitz99/alanine_dipeptide",
       length_unit="angstrom"  # default: "nanometer"
   )


Unit handling
~~~~~~~~~~~~~~

When a length unit is specified, all inputs and outputs (coordinates, forces, scores) are automatically converted to and from that unit.

This ensures consistent usage across the entire API without manual conversions.

We strongly recommend using a consistent unit system across all evaluations involving MolecularBoltzmann targets.


Alternative specification
~~~~~~~~~~~~~~~~~~~~~~~~~

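Instead of a unit name, a scalar scaling factor can be provided. The named units correspond to the following factors (the length of one unit in nanometers):

- ``"nanometer"`` → ``1.0``
- ``"angstrom"`` → ``0.1``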
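
The effect of such a factor can be sketched with plain NumPy. The example assumes that the factor is the length of one user unit in nanometers, which matches the mapping above; the helper names ``to_nm``, ``score_to_user``, and ``toy_score_nm`` are hypothetical and not part of boltzkit. It only covers coordinates and scores, not how the library treats the log-density itself.

.. code-block:: python

   import numpy as np

   UNIT_TO_NM = {"nanometer": 1.0, "angstrom": 0.1}  # length of one unit in nm

   def to_nm(x_user, scale):
       """Convert coordinates from the user unit to nanometers."""
       return np.asarray(x_user) * scale

   def score_to_user(score_nm, scale):
       """Chain rule: d/dx_user = scale * d/dx_nm, since x_nm = scale * x_user."""
       return np.asarray(score_nm) * scale

   def toy_score_nm(x_nm):
       """Score of the toy log-density log p(x) = -0.5 * |x|^2, with x in nm."""
       return -x_nm

   scale = UNIT_TO_NM["angstrom"]
   x_angstrom = np.array([[1.5, -2.0, 0.5]])
   x_nm = to_nm(x_angstrom, scale)  # coordinates seen by the internal model
   score_angstrom = score_to_user(toy_score_nm(x_nm), scale)

Because gradients pick up one factor of the scale, mixing units between samples and scores silently changes their magnitude, which is why a single unit system should be used throughout.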


Molecular dynamics trajectories
===============================

.. note::

   This section only applies when new datasets need to be generated.
   For the systems already included in this repository, most of the corresponding trajectories have already been generated at 300 K. This chapter requires the ``dev`` requirements to be installed (see :ref:`development setup <development-setup>`).

Equilibrium-distribution trajectories can be generated using the ``tools/run_simulation.py`` script. For each system, we generate **two independent trajectories**, each containing **10⁷ samples**. Alternatively, to increase parallelization, we generate four trajectories of 5 × 10⁶ samples each and concatenate them pairwise, which again yields two buffers of **10⁷** samples each.

- **Trajectory 1** is used for **training the TICA model** and for building the **test dataset**, which is a random permutation of this trajectory.
- **Trajectory 2** is subsampled without replacement to construct the **training** and **validation** datasets, each containing **10⁶ samples**.
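
The data flow described above can be sketched with NumPy on toy-sized arrays. This is an illustration only, not the code in ``tools/``: the function names are hypothetical, and the sketch assumes that the training and validation sets are disjoint, which the description above does not state explicitly.

.. code-block:: python

   import numpy as np

   def concat_pairwise(parts):
       """Join consecutive pairs of trajectory chunks along the frame axis."""
       return [np.concatenate(parts[i:i + 2], axis=0) for i in range(0, len(parts), 2)]

   def make_test_set(traj, rng):
       """Return a random permutation of all frames."""
       return traj[rng.permutation(len(traj))]

   def make_train_val(traj, n_each, rng):
       """Draw 2 * n_each distinct frames and split them into train and val."""
       idx = rng.choice(len(traj), size=2 * n_each, replace=False)
       return traj[idx[:n_each]], traj[idx[n_each:]]

   rng = np.random.default_rng(0)
   # Four toy chunks of 50 frames with 2 coordinates each
   # (stand-ins for the 5e6-frame runs).
   parts = [np.arange(100 * i, 100 * (i + 1), dtype=float).reshape(50, 2) for i in range(4)]
   buffers = concat_pairwise(parts)            # two buffers of 100 frames each
   test_set = make_test_set(buffers[0], rng)   # trajectory 1 -> test data
   train, val = make_train_val(buffers[1], n_each=20, rng=rng)  # trajectory 2 -> train/val

Drawing one index set without replacement and splitting it afterwards guarantees that no frame appears in both subsets.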

Commands
--------

Alanine Dipeptide
~~~~~~~~~~~~~~~~~~

The two trajectories of 10⁷ samples each for **Alanine Dipeptide** were generated by running the following command twice:

.. code-block:: bash

   python tools/run_simulation.py --system datasets/chrklitz99/alanine_dipeptide --temps 300.0 --time_step 1.0 --rec_interval 0.5 --pre_eq_time 200.0 --simu_time 5000.0 --integrator LangevinMiddle --write_checkpoint_every_ns 100

Alanine Tetrapeptide
~~~~~~~~~~~~~~~~~~~~~

The four trajectories of 5 × 10⁶ samples each for **Alanine Tetrapeptide** were generated by running the following command four times:

.. code-block:: bash

   python tools/run_simulation.py --system datasets/chrklitz99/alanine_tetrapeptide --temps 300.0:500.0:6 --time_step 1.0 --rec_interval 0.2 --pre_eq_time 200.0 --simu_time 1200.0 --integrator LangevinMiddle --write_checkpoint_every_ns 100 --save_traj_of_replicas 0

Alanine Hexapeptide
~~~~~~~~~~~~~~~~~~~

The four trajectories of 5 × 10⁶ samples each for **Alanine Hexapeptide** were generated by running the following command four times:

.. code-block:: bash

   python tools/run_simulation.py --system datasets/chrklitz99/alanine_hexapeptide --temps 300.0:500.0:6 --time_step 1.0 --rec_interval 0.2 --pre_eq_time 200.0 --simu_time 1200.0 --integrator LangevinMiddle --write_checkpoint_every_ns 100 --save_traj_of_replicas 0
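
The sample counts quoted in this section can be cross-checked against the command-line arguments. The arithmetic below assumes that ``--simu_time`` is given in nanoseconds and ``--rec_interval`` in picoseconds; these units are inferred from the stated trajectory sizes and are not confirmed here. ``n_frames`` is a hypothetical helper, not part of the tools.

.. code-block:: python

   def n_frames(time_ns, rec_interval_ps):
       """Number of recorded frames for a given duration and recording interval."""
       return round(time_ns * 1000.0 / rec_interval_ps)

   # Alanine dipeptide: 5000 ns recorded every 0.5 ps.
   dipeptide_frames = n_frames(5000.0, 0.5)   # 10_000_000

   # Tetra- and hexapeptide: 1200 ns recorded every 0.2 ps.
   peptide_raw_frames = n_frames(1200.0, 0.2)  # 6_000_000

   # Discarding the first 200 ns corresponds to the --skipN value
   # mentioned in the workflow section.
   skip_frames = n_frames(200.0, 0.2)          # 1_000_000
   peptide_kept_frames = peptide_raw_frames - skip_frames  # 5_000_000

Under these assumptions, each peptide run contributes 5 × 10⁶ frames after the skip, so two concatenated runs give one buffer of 10⁷ frames.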


Workflow to create TICA model and datasets
------------------------------------------

1. **Convert to NumPy:** Convert the raw ``.h5`` output to NumPy format. For larger systems, use the ``--skipN`` flag to discard the initial equilibration phase (e.g., ``--skipN 1000000`` for Alanine Hexapeptide removes the first 200 ns of REMD).

   .. code-block:: bash

      python tools/extract_trajectory_as_numpy.py path/to/traj.h5 --skipN <N>

2. **Create TICA Model:** Run the model creation script on the trajectory designated for the test dataset. This identifies slow degrees of freedom and offers options for plotting lag-times or Ramachandran correspondences.

   .. code-block:: bash

      python tools/create_tica_model.py --traj_path path/to/traj.npy --traj_total_sim_time_ns <sim_time> --system_name <system_path> --lag_time_ps 100

3. **Generate Test Dataset:** Permute the first trajectory. If multiple parallel trajectories were generated, concatenate them before running this command:

   .. code-block:: bash

      python tools/permute_trajectory.py path/to/traj_1.npy

4. **Generate Train/Val Datasets:** Create random subsets without replacement from the second trajectory:

   .. code-block:: bash

      python tools/split_trajectory.py path/to/traj_2.npy

5. **Upload:** Upload the resulting files to Hugging Face (see alanine dipeptide for reference) and update the ``info.yaml`` file.
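
For readers unfamiliar with TICA (step 2), the following is a minimal NumPy sketch of time-lagged independent component analysis on synthetic data. It is not the implementation in ``tools/create_tica_model.py``, whose internals are not shown here; ``fit_tica`` and the toy features are hypothetical. On real data, the lag in frames would be the lag time divided by the recording interval (for example, 100 ps at an assumed 0.5 ps interval would be 200 frames).

.. code-block:: python

   import numpy as np

   def fit_tica(features, lag, n_components=2, eps=1e-10):
       """Return (mean, projection, eigenvalues) of a basic TICA model."""
       mean = features.mean(axis=0)
       x0 = features[:-lag] - mean
       xt = features[lag:] - mean
       n = len(x0)
       # Symmetrized instantaneous and time-lagged covariance estimates.
       c0 = (x0.T @ x0 + xt.T @ xt) / (2 * n)
       ct = (x0.T @ xt + xt.T @ x0) / (2 * n)
       # Whiten with C0, then diagonalize the whitened lagged covariance.
       s, u = np.linalg.eigh(c0)
       keep = s > eps
       whiten = u[:, keep] / np.sqrt(s[keep])
       evals, evecs = np.linalg.eigh(whiten.T @ ct @ whiten)
       order = np.argsort(evals)[::-1][:n_components]
       return mean, whiten @ evecs[:, order], evals[order]

   # Toy data: one slow AR(1) process mixed with fast white noise.
   rng = np.random.default_rng(0)
   n = 20000
   noise = rng.normal(scale=0.1, size=n)
   slow = np.zeros(n)
   for t in range(1, n):
       slow[t] = 0.99 * slow[t - 1] + noise[t]
   fast = rng.normal(size=n)
   features = np.stack([slow + fast, slow - fast], axis=1)

   mean, projection, evals = fit_tica(features, lag=10, n_components=1)
   tic1 = (features - mean) @ projection[:, 0]
   corr = np.corrcoef(tic1, slow)[0, 1]  # sign is arbitrary

The leading component should align with the slow process up to sign, and its eigenvalue approximates the autocorrelation of that process at the chosen lag.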