Evaluation
==========

Overview
--------

The evaluation system is modular and operates on an ``EvalData`` object. All
fields in ``EvalData`` are optional; which ones are required depends on the
selected evaluation modules.

Evaluation results are returned as a flat ``metrics`` dictionary. Depending on
the selected modules, values may include scalars, histograms, or binary
artifacts such as PDF files.

The ``boltzkit`` evaluation is optimized for every stage of the paper-writing
process:

- **Experiments:** Low-quality visualizations can easily be logged to
  ``wandb`` to keep storage use low.
- **Review version:** High-quality visualizations (PDFs) can be saved
  separately.
- **Camera-ready:** Raw data can be saved for later (custom) high-quality
  visualizations.

Data conventions
----------------

- Ground-truth samples (``true``) and predicted samples (``pred``) have shape
  ``(batch, dim)``.
- Ground-truth density evaluations (``target_log_prob``) may be unnormalized
  and have shape ``(batch,)``.
- Model density evaluations (``model_log_prob``) must be normalized and have
  shape ``(batch,)``.

Building and running an evaluation pipeline
-------------------------------------------

Evaluation is performed by composing a list of modular evaluators.

Example
~~~~~~~

.. code-block:: python

    from boltzkit.evaluation import run_eval, EvalData, EnergyHistEval
    from boltzkit.evaluation.molecular_eval import TorsionMarginalEval

    data = EvalData(...)
    topology = ...

    torsion_eval = TorsionMarginalEval(topology)
    energy_eval = EnergyHistEval()
    eval_pipeline = [energy_eval, torsion_eval]

    metrics = run_eval(data, evals=eval_pipeline)

Missing data behavior
~~~~~~~~~~~~~~~~~~~~~

By default, evaluation modules that cannot be computed due to missing fields
in ``EvalData`` are skipped with a warning. To enforce strict execution
(raising an error instead), disable skipping:

.. code-block:: python

    metrics = run_eval(data, skip_on_missing_data=False)

In this mode, a ``ValueError`` is raised if any required inputs for an
evaluation module are missing.

Output format
~~~~~~~~~~~~~

The returned ``metrics`` object is a flat dictionary:

- Keys: metric names (strings)
- Values: one of

  - scalars (``float``, ``int``)
  - structured data (e.g., histograms)
  - binary artifacts (e.g., PDF files stored as byte buffers)

Utility functions are available to extract or filter subsets of metrics
depending on downstream use, as shown in the sections below.

Logging and export
------------------

Weights & Biases logging
~~~~~~~~~~~~~~~~~~~~~~~~

``wandb`` is optional and must be installed separately. To log evaluation
results:

.. code-block:: python

    from boltzkit.evaluation import make_wandb_compatible
    import wandb

    # Transforms all metrics into wandb-compatible ones, e.g., converts
    # PDFs into low-resolution images. Drops raw data like histograms to
    # avoid cluttering the wandb server.
    wandb_metrics = make_wandb_compatible(metrics)
    wandb.log(wandb_metrics)

PDF artifacts (binary buffers)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some evaluations produce visualizations as PDF files stored in-memory as
binary buffers. These can be written directly to disk:

.. code-block:: python

    from boltzkit.evaluation import get_pdfs
    from boltzkit.utils.pdf import save_pdfs

    dir_path = ""
    pdfs = get_pdfs(metrics)
    save_pdfs(pdfs, dir_path)  # uses keys of dict as filenames
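Scalar metrics
~~~~~~~~~~~~~~

Scalar metrics need no dedicated helper: since ``metrics`` is a flat ``dict``,
plain Python suffices. Below is a minimal sketch that collects the scalar
entries and writes them to JSON, e.g., for a results table; the filename is a
placeholder, not a ``boltzkit`` convention:

.. code-block:: python

    import json

    # Keep only scalar entries; structured data and binary artifacts
    # are handled by the dedicated helpers shown in this section.
    scalars = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}

    with open("scalar_metrics.json", "w") as f:
        json.dump(scalars, f, indent=2)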
Histograms
~~~~~~~~~~

The density counts of histograms can be exported for custom downstream
analysis (e.g., visualizations):

.. code-block:: python

    from boltzkit.evaluation import get_histograms
    from boltzkit.utils.histogram import save_histograms

    dir_path = ""
    hists = get_histograms(metrics)
    save_histograms(hists, dir_path)  # uses keys of dict as filenames
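As an example of such downstream use, here is a minimal sketch that re-plots
one exported histogram with ``matplotlib``. It assumes each entry in ``hists``
is a ``(counts, bin_edges)`` pair in the style of ``numpy.histogram``, and the
key ``"energy_hist"`` is a placeholder; check the actual structure returned by
``get_histograms`` before relying on this:

.. code-block:: python

    import matplotlib.pyplot as plt

    # Assumption: each histogram is a (counts, bin_edges) pair, as
    # returned by numpy.histogram. "energy_hist" is a placeholder key.
    counts, bin_edges = hists["energy_hist"]

    plt.stairs(counts, bin_edges)
    plt.xlabel("energy")
    plt.ylabel("density")
    plt.savefig("energy_hist_custom.pdf")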