mllf.cb.deepset_pretraining_dataset

Dataset generation for DeepSet autoencoder pretraining.

This module handles Step 1 of the 4-step pretraining process: it iterates through substituent PDB files, calculates atomic environment vectors (AEVs), concatenates them with partial charges, and generates training tensors.
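The concatenation step can be illustrated with a minimal sketch. It assumes that each atom contributes one feature row made of its AEV followed by its partial charge. The helper name `concat_aev_and_charges` and the toy values are hypothetical and not part of this module; the real code presumably builds tensors rather than plain lists.

```python
def concat_aev_and_charges(aevs, charges):
    """Append each atom's partial charge to its AEV, giving one row per atom.

    Hypothetical helper for illustration only; not part of this module's API.
    """
    if len(aevs) != len(charges):
        raise ValueError("need exactly one charge per atom")
    return [list(aev) + [float(q)] for aev, q in zip(aevs, charges)]


# Two atoms with toy 3-dimensional AEVs (real AEVs are typically much longer).
aevs = [[0.1, 0.0, 0.3], [0.2, 0.5, 0.0]]
charges = [-0.115, 0.115]

features = concat_aev_and_charges(aevs, charges)
print(len(features), len(features[0]))  # atoms, features per atom
```

Placing the charge in the last column keeps the AEV layout unchanged, so a downstream model can treat the charge as one extra per-atom feature.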

Functions

detect_core_pdb(prep_dir)

Detect the core PDB file in a prep directory.

detect_protein_pdb(prep_dir)

Detect the protein PDB file in a prep directory.

extract_charges_from_rtf(rtf_path, pdb_name)

Extract partial charges for a substituent from an RTF file.
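As a rough illustration of charge extraction, the sketch below assumes that "RTF" refers to a CHARMM-style residue topology file, in which atom records read `ATOM <name> <type> <charge>` and `!` starts a comment. The function `parse_atom_charges` is hypothetical and takes file contents rather than a path; the real `extract_charges_from_rtf` may select records by `pdb_name` and behave differently.

```python
def parse_atom_charges(rtf_text):
    """Return {atom_name: partial_charge} from CHARMM-style ATOM records.

    Hypothetical sketch; the file layout is an assumption, not taken
    from this module.
    """
    charges = {}
    for raw_line in rtf_text.splitlines():
        # Drop trailing '!' comments, then split on whitespace.
        fields = raw_line.split("!", 1)[0].split()
        # An ATOM record reads: ATOM <name> <type> <charge>
        if len(fields) >= 4 and fields[0].upper() == "ATOM":
            charges[fields[1]] = float(fields[3])
    return charges


# Made-up topology fragment for demonstration.
example = """\
RESI LIG 0.000
ATOM C1 CG2R61 -0.115 ! aromatic carbon
ATOM H1 HGR61 0.115
"""

print(parse_atom_charges(example))
```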

generate_all_bond_pretraining_datasets(...)

Generate bond-topology training datasets for all pretraining systems.

generate_all_pretraining_datasets(...[, ...])

Generate training datasets for all pretraining systems.

generate_bond_training_data_for_system(...)

Generate per-substituent bond-topology training data for AtomBondGNN pretraining.

generate_training_data_for_system(...[, ...])

Generate training data for one pretraining system.

load_system_metadata(system_dir)

Load metadata from a pretraining system directory.