File Handling

Overview

The mllf.file_handling module provides parsers and writers for the various file formats used in multisite λ-dynamics simulations. These utilities handle:

  • RTF files: CHARMM topology fragments with atom types and charges

  • PDB files: Protein Data Bank structure files with 3D coordinates

  • Bias coefficient files: Variables files containing bias matrices (b, c, x, s)

  • Simulation output: Parsing population counts, transitions, and rates from MSLD output

The RTF and PDB files are typically generated by the upstream tool msld-py-prep, which handles ligand alignment, charge renormalization, and fragment decomposition. See the File Preparation Workflow section below for details on this upstream process.

File Preparation Workflow

Before using mllf for bias coefficient prediction, ligand structures must be prepared using msld-py-prep. This tool performs:

  1. Alignment: Aligns ligands to a representative structure

  2. Maximum Common Substructure (MCS) search: Identifies common core atoms

  3. Charge Renormalization: Adjusts partial charges for consistent sampling

  4. Fragment Generation: Creates core and site-specific substituent files

Important Validation:

  • Check charge deviation: msld-py-prep flags >5% charge changes

  • Verify total charge: Sum of core + all site charges should equal ligand charge

  • Visual inspection: Use PyMOL’s vis_check.py to verify core/fragment assignment

Common Pitfalls:

  1. Chiral groups: MCS may incorrectly merge chiral atoms. Workaround: temporarily rename chiral atom types before MCS, then restore afterward.

  2. Large charge deviations: If renormalization causes >5% charge change, consider moving problematic core atoms to fragment definitions in MCS_for_MSLD.txt.

  3. Lone pairs: msld-py-prep may include virtual sites (LP atom types). The mllf RTF parser automatically filters these out.

For More Information:

See the msld-py-prep examples for detailed tutorials. Key reference:

  • Vilseck et al. “Optimizing Multisite λ-Dynamics Throughput with Charge Renormalization” J. Chem. Inf. Model. 2022, 62(6), 1479-1488. DOI: 10.1021/acs.jcim.2c00047

RTF File Parsing

RTF (Residue Topology File) fragments contain atom type and charge information for substituents in MSLD simulations. Each substituent has a site#_sub#_pres.rtf file generated by msld-py-prep’s charge renormalization algorithm.

Charge Renormalization Background:

The charge renormalization process ensures that:

  1. All substituents at a site have identical atom types for corresponding atoms

  2. All substituents at a site have similar partial charges (within tolerance)

  3. The sum of charges equals the original ligand’s total charge

This allows smooth λ-dynamics transitions between chemically diverse substituents.

RTF File Format:

PRES SUB1
ATOM C1   CG2R61  -0.115
ATOM H1   HGR61    0.115
ATOM N1   NG2R60  -0.570
ATOM H2   HGR61    0.115
...

The parser extracts:

  • Atom name: C1, H1, etc. (column 2)

  • Atom type: CG2R61, HGR61, etc. (column 3) - used for CGenFF force field

  • Charge: -0.115, 0.115, etc. (column 4) - partial charges in electron units

PDB File Parsing

PDB files contain 3D atomic coordinates and are used for:

  • Computing AEVs (Atomic Environment Vectors) for DeepSet embeddings

  • Spatial filtering in multi-site systems (finding nearby substituents)

  • Validating element identification against CGenFF atom types

Fragment Structure:

msld-py-prep generates fragment PDB files (site#_sub#_frag.pdb) that include:

  1. Anchor atoms: First atoms connecting to core (inherited from core structure)

  2. Fragment atoms: Unique atoms specific to this substituent

The anchor atoms ensure proper bonding when fragments are combined with the core during MSLD simulations. These atoms appear in both the core and fragment PDB files, so duplicate detection is important when combining structures for AEV computation.

Three-Tier Parsing Strategy:

  1. RDKit parser (primary): Handles standard PDB formats from databases like RCSB

  2. Flexible whitespace parsing (fallback): For CHARMM-style non-standard formats

  3. Fixed-column parsing (last resort): Standard PDB column positions

Spatial Filtering for Multi-Site Systems:

For accurate AEV computation, identify nearby substituents from other sites:

from mllf.file_handling.read_pdb import (
    find_reference_subs_from_other_sites,
    calculate_min_distance,
    find_nearby_pdbs
)

# Find reference substituents (sub1) from other sites within 5.1 Å
nearby_refs = find_reference_subs_from_other_sites(
    target_pdb='site1_sub2_frag.pdb',
    prep_dir='prep',
    cutoff=5.1  # ANI-2x radial cutoff
)
# Returns: ['site2_sub1_frag.pdb', 'site5_sub1_frag.pdb', ...]

# Calculate minimum distance between two structures
coords1, _ = parse_pdb_file('core.pdb')
coords2, _ = parse_pdb_file('site1_sub2_frag.pdb')
min_dist = calculate_min_distance(coords1, coords2)
print(f"Minimum distance: {min_dist:.2f} Å")

Duplicate Detection (for core/substituent overlap):

from mllf.file_handling.read_pdb import (
    find_duplicate_atoms,
    remove_duplicate_atoms
)

core_coords, core_elements = parse_pdb_file('core.pdb')
sub_coords, sub_elements = parse_pdb_file('site1_sub2_frag.pdb')

# Find atoms in sub that duplicate atoms in core
duplicate_indices = find_duplicate_atoms(core_coords, sub_coords, tolerance=1e-4)

# Remove duplicates
filtered_coords, filtered_elements = remove_duplicate_atoms(
    sub_coords, sub_elements, core_coords, tolerance=1e-4
)

Bias Coefficient Files

Bias coefficients are stored in variables.py files containing YAML-formatted bias matrices (b, c, x, s). These files are read by the MSLD simulator.

Reading Bias Coefficients:

from mllf.file_handling.read_bias_coeff import parse_bias_file

# Parse a variables.py file
bias_data = parse_bias_file('variables.py')
# {
#   'b': [0.1, 0.2, -0.05],  # Linear bias vector (length N)
#   'c': [[0.0, 0.3, -0.1],  # Quadratic bias matrix (N x N)
#         [0.0, 0.0, 0.2],
#         [0.0, 0.0, 0.0]],
#   'x': [[...], ...],       # Skew bias matrix
#   's': [[...], ...]        # End bias matrix
# }

Writing Bias Coefficients:

from mllf.file_handling.write_bias_coeff import write_bias_inp_from_graph

# Write from a Graph object
write_bias_inp_from_graph(
    graph=graph,
    filename='combo_dir/variables.py',
    sub_counts=[3, 2],  # 3 subs at site 1, 2 subs at site 2
    mapping={'cs': 'quadratic', 'ss': 'end', 'xs': 'skew'}
)

Matrix Conventions:

  • b (linear): 1D vector, one value per substituent

  • c (quadratic): Antisymmetric matrix where c[i][j] = -c[j][i]

  • x (skew): Full matrix, forward/backward directions independent

  • s (end): Full matrix, forward/backward directions independent

Simulation Output Parsing

Parse MSLD simulation output to extract population counts, transitions, transition rates, and per-pair relative free-energy differences (\(\Delta\Delta G_{i \to j}\)) between substituents.

Basic Usage:

from mllf.file_handling.read_output import (
    terminated_normally,
    parse_single_population,
    parse_transitions_and_rates,
    parse_single_ddg,
)

# Read simulation output
with open('output.txt', 'r') as f:
    output_text = f.read()

# Check termination status
success = terminated_normally(output_text)

# Parse population counts
populations = parse_single_population(output_text)
# {
#   1: {'counts': {0.95: 1250, 0.99: 1180}, 'site': 1},
#   2: {'counts': {0.95: 1320, 0.99: 1290}, 'site': 1},
#   3: {'counts': {0.95: 950, 0.99: 890}, 'site': 2},
#   ...
# }

# Parse transitions and rates
transitions, rates = parse_transitions_and_rates(output_text)
# transitions: {site: {lambda: count}}
# rates: {site: {lambda: rate}}

# Parse per-pair biased ΔΔG (relative free-energy difference between substituents)
ddg = parse_single_ddg(output_text)
# {(2, 3): -1.23, (2, 4): None, (3, 4): None, ...}
# float : ΔΔG_{i→j} at highest λ — round-trip transitions observed
# None  : NaN or ±Inf in output — no round-trip crossings, ΔΔG undefined

Output Format Examples:

Population table from MSLD output:

Total Population Count
BLOCK> 0.95 0.99
SINGLE POPULATION>   1      1250    1180
SINGLE POPULATION>   2      1320    1290
SINGLE POPULATION>   3       950     890

Transitions table:

SINGLE TRANSITIONS>   1      245     238
SINGLE TRANSITIONS>   2      198     192

Transition rates:

SINGLE TRANS RATES>   1    0.196   0.190
SINGLE TRANS RATES>   2    0.158   0.154

ΔΔG table (biased relative free-energy difference \(\Delta\Delta G_{i \to j}\) per substituent pair):

             BLK(I)..BLK(J).....> 0.950 ....> 0.990 .....> 0.950 ....> 0.990
SINGLE DDG>       2      3   -Infinity   -Infinity   -Infinity   -Infinity
SINGLE DDG>       2      4         NaN         NaN         NaN         NaN
SINGLE DDG>       3      4    Infinity    Infinity    Infinity    Infinity

Columns are: nobias_λ1, nobias_λ2, bias_λ1, bias_λ2. parse_single_ddg returns the last bias column (highest λ, e.g. 0.990) per pair as the \(\Delta\Delta G_{i \to j}\) estimate — the biased relative free-energy difference between substituents i and j at that λ cutoff. NaN means zero crossings between that pair; ±Infinity means only one crossing direction occurred (no round-trip estimate). Both are returned as None because no \(\Delta\Delta G\) can be computed without at least one complete round-trip transition. A finite value means round-trip transitions were observed and the \(\Delta\Delta G\) estimate is considered reliable.

Block IDs start at 2 in CHARMM MSLD output (block 1 = reference). For a 0-based node index i, block_id = i + 2.

Dynamic Lambda Detection:

The parser automatically detects lambda values from header lines:

  • Searches for BLOCK> or SITE> headers containing lambda values

  • Falls back to default [0.95, 0.99] if not found

  • Handles variable numbers of lambda windows

Combination Generation

During contextual bandit training, the policy network learns by running MSLD simulations on different substituent combinations and using the results as rewards. Rather than evaluating all possible combinations upfront, combinations are generated dynamically during training to provide diverse molecular contexts for policy optimization.

Purpose of Combinations:

A combination represents a specific selection of substituents across all λ-sites that will be included in a single MSLD simulation. For example, in a 2-site system with 3 substituents at site 1 and 2 substituents at site 2, a combination might select substituents [1, 2] from site 1 and substituent [1] from site 2, creating a 3-substituent simulation (2 at site 1, 1 at site 2).

Combination Strategy:

The combination generator uses a rotating anchor strategy to ensure comprehensive coverage of the substituent space:

  • Within-site combinations: Each substituent serves as an “anchor” with additional substituents from the same site forming the tail. For example, with substituents [1,2,3]:

    • Anchor 1 generates: [1,2], [1,3], [1,2,3]

    • Anchor 2 generates: [2,1], [2,3], [2,1,3]

    • Anchor 3 generates: [3,1], [3,2], [3,1,2]

    This ensures all pairwise and multi-substituent interactions are explored with each substituent taking a turn as the “reference” state.

  • Cross-site combinations: For multi-site systems, within-site selections are combined via Cartesian product. Each site contributes at least 2 substituents, ensuring sufficient alchemical transitions for meaningful simulation statistics. For example, if site 1 has 75 valid selections and site 2 has 186 valid selections, this generates 13,950 cross-site combinations.

  • Size limits: Combinations are limited to a maximum number of substituents per site (default: 10) to prevent computational expense from large simulations. All substituents can still participate across different combinations.

File Renumbering:

Since each combination includes only a subset of substituents, files are renumbered to ensure consistent indexing within each combination directory. For example, if a combination selects original substituents [2, 5] from site 1, these are renumbered as [1, 2] in the combination directory. The mapping.json file records the original file paths and new names for traceability.

RTF Topology Updates:

RTF files contain PRES tokens that reference site and substituent numbers. When files are copied and renumbered, these tokens are automatically updated to match the new indices, ensuring CHARMM can correctly parse the topology.

Single-Site Handling:

For testing single-site combinations (pairwise interactions within one site), the generator automatically augments the core structure with the first substituent from any excluded sites. This maintains molecular completeness—if you’re only testing site 1 pairs but the system also has site 2, the core.pdb and core.rtf are augmented with site 2’s sub1 atoms so the full molecular structure is preserved.

See Also

External Tools: