File Handling
Overview
The mllf.file_handling module provides parsers and writers for the various file formats
used in multisite λ-dynamics simulations. These utilities handle:
RTF files: CHARMM topology fragments with atom types and charges
PDB files: Protein Data Bank structure files with 3D coordinates
Bias coefficient files: Variables files containing bias matrices (b, c, x, s)
Simulation output: Parsing population counts, transitions, and rates from MSLD output
The RTF and PDB files are typically generated by the upstream tool msld-py-prep, which handles ligand alignment, charge renormalization, and fragment decomposition. See the File Preparation Workflow section below for details on this upstream process.
File Preparation Workflow
Before using mllf for bias coefficient prediction, ligand structures must be prepared
using msld-py-prep. This tool performs:
Alignment: Aligns ligands to a representative structure
Maximum Common Substructure (MCS) search: Identifies common core atoms
Charge Renormalization: Adjusts partial charges for consistent sampling
Fragment Generation: Creates core and site-specific substituent files
Important Validation:
Check charge deviation: msld-py-prep flags >5% charge changes
Verify total charge: Sum of core + all site charges should equal ligand charge
Visual inspection: Use PyMOL’s
vis_check.pyto verify core/fragment assignment
Common Pitfalls:
Chiral groups: MCS may incorrectly merge chiral atoms. Workaround: temporarily rename chiral atom types before MCS, then restore afterward.
Large charge deviations: If renormalization causes >5% charge change, consider moving problematic core atoms to fragment definitions in MCS_for_MSLD.txt.
Lone pairs: msld-py-prep may include virtual sites (LP atom types). The
mllfRTF parser automatically filters these out.
For More Information:
See the msld-py-prep examples for detailed tutorials. Key reference:
Vilseck et al. “Optimizing Multisite λ-Dynamics Throughput with Charge Renormalization” J. Chem. Inf. Model. 2022, 62(6), 1479-1488. DOI: 10.1021/acs.jcim.2c00047
RTF File Parsing
RTF (Residue Topology File) fragments contain atom type and charge information for
substituents in MSLD simulations. Each substituent has a site#_sub#_pres.rtf file
generated by msld-py-prep’s charge renormalization algorithm.
Charge Renormalization Background:
The charge renormalization process ensures that:
All substituents at a site have identical atom types for corresponding atoms
All substituents at a site have similar partial charges (within tolerance)
The sum of charges equals the original ligand’s total charge
This allows smooth λ-dynamics transitions between chemically diverse substituents.
RTF File Format:
PRES SUB1
ATOM C1 CG2R61 -0.115
ATOM H1 HGR61 0.115
ATOM N1 NG2R60 -0.570
ATOM H2 HGR61 0.115
...
The parser extracts:
Atom name: C1, H1, etc. (column 2)
Atom type: CG2R61, HGR61, etc. (column 3) - used for CGenFF force field
Charge: -0.115, 0.115, etc. (column 4) - partial charges in electron units
PDB File Parsing
PDB files contain 3D atomic coordinates and are used for:
Computing AEVs (Atomic Environment Vectors) for DeepSet embeddings
Spatial filtering in multi-site systems (finding nearby substituents)
Validating element identification against CGenFF atom types
Fragment Structure:
msld-py-prep generates fragment PDB files (site#_sub#_frag.pdb) that include:
Anchor atoms: First atoms connecting to core (inherited from core structure)
Fragment atoms: Unique atoms specific to this substituent
The anchor atoms ensure proper bonding when fragments are combined with the core during MSLD simulations. These atoms appear in both the core and fragment PDB files, so duplicate detection is important when combining structures for AEV computation.
Three-Tier Parsing Strategy:
RDKit parser (primary): Handles standard PDB formats from databases like RCSB
Flexible whitespace parsing (fallback): For CHARMM-style non-standard formats
Fixed-column parsing (last resort): Standard PDB column positions
Spatial Filtering for Multi-Site Systems:
For accurate AEV computation, identify nearby substituents from other sites:
from mllf.file_handling.read_pdb import (
find_reference_subs_from_other_sites,
calculate_min_distance,
find_nearby_pdbs
)
# Find reference substituents (sub1) from other sites within 5.1 Å
nearby_refs = find_reference_subs_from_other_sites(
target_pdb='site1_sub2_frag.pdb',
prep_dir='prep',
cutoff=5.1 # ANI-2x radial cutoff
)
# Returns: ['site2_sub1_frag.pdb', 'site5_sub1_frag.pdb', ...]
# Calculate minimum distance between two structures
coords1, _ = parse_pdb_file('core.pdb')
coords2, _ = parse_pdb_file('site1_sub2_frag.pdb')
min_dist = calculate_min_distance(coords1, coords2)
print(f"Minimum distance: {min_dist:.2f} Å")
Duplicate Detection (for core/substituent overlap):
from mllf.file_handling.read_pdb import (
find_duplicate_atoms,
remove_duplicate_atoms
)
core_coords, core_elements = parse_pdb_file('core.pdb')
sub_coords, sub_elements = parse_pdb_file('site1_sub2_frag.pdb')
# Find atoms in sub that duplicate atoms in core
duplicate_indices = find_duplicate_atoms(core_coords, sub_coords, tolerance=1e-4)
# Remove duplicates
filtered_coords, filtered_elements = remove_duplicate_atoms(
sub_coords, sub_elements, core_coords, tolerance=1e-4
)
Bias Coefficient Files
Bias coefficients are stored in variables.py files containing YAML-formatted
bias matrices (b, c, x, s). These files are read by the MSLD simulator.
Reading Bias Coefficients:
from mllf.file_handling.read_bias_coeff import parse_bias_file
# Parse a variables.py file
bias_data = parse_bias_file('variables.py')
# {
# 'b': [0.1, 0.2, -0.05], # Linear bias vector (length N)
# 'c': [[0.0, 0.3, -0.1], # Quadratic bias matrix (N x N)
# [0.0, 0.0, 0.2],
# [0.0, 0.0, 0.0]],
# 'x': [[...], ...], # Skew bias matrix
# 's': [[...], ...] # End bias matrix
# }
Writing Bias Coefficients:
from mllf.file_handling.write_bias_coeff import write_bias_inp_from_graph
# Write from a Graph object
write_bias_inp_from_graph(
graph=graph,
filename='combo_dir/variables.py',
sub_counts=[3, 2], # 3 subs at site 1, 2 subs at site 2
mapping={'cs': 'quadratic', 'ss': 'end', 'xs': 'skew'}
)
Matrix Conventions:
b (linear): 1D vector, one value per substituent
c (quadratic): Antisymmetric matrix where
c[i][j] = -c[j][i]x (skew): Full matrix, forward/backward directions independent
s (end): Full matrix, forward/backward directions independent
Simulation Output Parsing
Parse MSLD simulation output to extract population counts, transitions, transition rates, and per-pair relative free-energy differences (\(\Delta\Delta G_{i \to j}\)) between substituents.
Basic Usage:
from mllf.file_handling.read_output import (
terminated_normally,
parse_single_population,
parse_transitions_and_rates,
parse_single_ddg,
)
# Read simulation output
with open('output.txt', 'r') as f:
output_text = f.read()
# Check termination status
success = terminated_normally(output_text)
# Parse population counts
populations = parse_single_population(output_text)
# {
# 1: {'counts': {0.95: 1250, 0.99: 1180}, 'site': 1},
# 2: {'counts': {0.95: 1320, 0.99: 1290}, 'site': 1},
# 3: {'counts': {0.95: 950, 0.99: 890}, 'site': 2},
# ...
# }
# Parse transitions and rates
transitions, rates = parse_transitions_and_rates(output_text)
# transitions: {site: {lambda: count}}
# rates: {site: {lambda: rate}}
# Parse per-pair biased ΔΔG (relative free-energy difference between substituents)
ddg = parse_single_ddg(output_text)
# {(2, 3): -1.23, (2, 4): None, (3, 4): None, ...}
# float : ΔΔG_{i→j} at highest λ — round-trip transitions observed
# None : NaN or ±Inf in output — no round-trip crossings, ΔΔG undefined
Output Format Examples:
Population table from MSLD output:
Total Population Count
BLOCK> 0.95 0.99
SINGLE POPULATION> 1 1250 1180
SINGLE POPULATION> 2 1320 1290
SINGLE POPULATION> 3 950 890
Transitions table:
SINGLE TRANSITIONS> 1 245 238
SINGLE TRANSITIONS> 2 198 192
Transition rates:
SINGLE TRANS RATES> 1 0.196 0.190
SINGLE TRANS RATES> 2 0.158 0.154
ΔΔG table (biased relative free-energy difference \(\Delta\Delta G_{i \to j}\) per substituent pair):
BLK(I)..BLK(J).....> 0.950 ....> 0.990 .....> 0.950 ....> 0.990
SINGLE DDG> 2 3 -Infinity -Infinity -Infinity -Infinity
SINGLE DDG> 2 4 NaN NaN NaN NaN
SINGLE DDG> 3 4 Infinity Infinity Infinity Infinity
Columns are: nobias_λ1, nobias_λ2, bias_λ1, bias_λ2.
parse_single_ddg returns the last bias column (highest λ, e.g. 0.990) per pair as
the \(\Delta\Delta G_{i \to j}\) estimate — the biased relative free-energy difference
between substituents i and j at that λ cutoff.
NaN means zero crossings between that pair; ±Infinity means only one crossing
direction occurred (no round-trip estimate). Both are returned as None because no
\(\Delta\Delta G\) can be computed without at least one complete round-trip transition.
A finite value means round-trip transitions were observed and the \(\Delta\Delta G\)
estimate is considered reliable.
Block IDs start at 2 in CHARMM MSLD output (block 1 = reference). For a 0-based node
index i, block_id = i + 2.
Dynamic Lambda Detection:
The parser automatically detects lambda values from header lines:
Searches for
BLOCK>orSITE>headers containing lambda valuesFalls back to default
[0.95, 0.99]if not foundHandles variable numbers of lambda windows
Combination Generation
During contextual bandit training, the policy network learns by running MSLD simulations on different substituent combinations and using the results as rewards. Rather than evaluating all possible combinations upfront, combinations are generated dynamically during training to provide diverse molecular contexts for policy optimization.
Purpose of Combinations:
A combination represents a specific selection of substituents across all λ-sites that will be included in a single MSLD simulation. For example, in a 2-site system with 3 substituents at site 1 and 2 substituents at site 2, a combination might select substituents [1, 2] from site 1 and substituent [1] from site 2, creating a 3-substituent simulation (2 at site 1, 1 at site 2).
Combination Strategy:
The combination generator uses a rotating anchor strategy to ensure comprehensive coverage of the substituent space:
Within-site combinations: Each substituent serves as an “anchor” with additional substituents from the same site forming the tail. For example, with substituents [1,2,3]:
Anchor 1 generates: [1,2], [1,3], [1,2,3]
Anchor 2 generates: [2,1], [2,3], [2,1,3]
Anchor 3 generates: [3,1], [3,2], [3,1,2]
This ensures all pairwise and multi-substituent interactions are explored with each substituent taking a turn as the “reference” state.
Cross-site combinations: For multi-site systems, within-site selections are combined via Cartesian product. Each site contributes at least 2 substituents, ensuring sufficient alchemical transitions for meaningful simulation statistics. For example, if site 1 has 75 valid selections and site 2 has 186 valid selections, this generates 13,950 cross-site combinations.
Size limits: Combinations are limited to a maximum number of substituents per site (default: 10) to prevent computational expense from large simulations. All substituents can still participate across different combinations.
File Renumbering:
Since each combination includes only a subset of substituents, files are renumbered
to ensure consistent indexing within each combination directory. For example, if a
combination selects original substituents [2, 5] from site 1, these are renumbered
as [1, 2] in the combination directory. The mapping.json file records the original
file paths and new names for traceability.
RTF Topology Updates:
RTF files contain PRES tokens that reference site and substituent numbers. When files are copied and renumbered, these tokens are automatically updated to match the new indices, ensuring CHARMM can correctly parse the topology.
Single-Site Handling:
For testing single-site combinations (pairwise interactions within one site), the generator automatically augments the core structure with the first substituent from any excluded sites. This maintains molecular completeness—if you’re only testing site 1 pairs but the system also has site 2, the core.pdb and core.rtf are augmented with site 2’s sub1 atoms so the full molecular structure is preserved.
See Also
Contextual Bandit Setup - CB infrastructure and graph construction
AtomBondGNN Pretraining - AEV computation and atom features
Workflow System - Complete training workflow
mllf API - API reference for file handling modules
External Tools:
msld-py-prep - Upstream file preparation
ParamChem - CGenFF parameterization server
TorchANI - ANI-2x AEV computation