File Handling ============= Overview -------- The ``mllf.file_handling`` module provides parsers and writers for the various file formats used in multisite λ-dynamics simulations. These utilities handle: * **RTF files**: CHARMM topology fragments with atom types and charges * **PDB files**: Protein Data Bank structure files with 3D coordinates * **Bias coefficient files**: Variables files containing bias matrices (b, c, x, s) * **Simulation output**: Parsing population counts, transitions, and rates from MSLD output The RTF and PDB files are typically generated by the upstream tool `msld-py-prep `_, which handles ligand alignment, charge renormalization, and fragment decomposition. See the `File Preparation Workflow`_ section below for details on this upstream process. .. _File Preparation Workflow: File Preparation Workflow -------------------------- Before using ``mllf`` for bias coefficient prediction, ligand structures must be prepared using `msld-py-prep `_. This tool performs: 1. **Alignment**: Aligns ligands to a representative structure 2. **Maximum Common Substructure (MCS) search**: Identifies common core atoms 3. **Charge Renormalization**: Adjusts partial charges for consistent sampling 4. **Fragment Generation**: Creates core and site-specific substituent files **Important Validation**: * Check charge deviation: msld-py-prep flags >5% charge changes * Verify total charge: Sum of core + all site charges should equal ligand charge * Visual inspection: Use PyMOL's ``vis_check.py`` to verify core/fragment assignment **Common Pitfalls**: 1. **Chiral groups**: MCS may incorrectly merge chiral atoms. Workaround: temporarily rename chiral atom types before MCS, then restore afterward. 2. **Large charge deviations**: If renormalization causes >5% charge change, consider moving problematic core atoms to fragment definitions in MCS_for_MSLD.txt. 3. **Lone pairs**: msld-py-prep may include virtual sites (LP atom types). The ``mllf`` RTF parser automatically filters these out. **For More Information**: See the `msld-py-prep examples `_ for detailed tutorials. Key reference: * Vilseck et al. "Optimizing Multisite λ-Dynamics Throughput with Charge Renormalization" *J. Chem. Inf. Model.* 2022, 62(6), 1479-1488. DOI: 10.1021/acs.jcim.2c00047 RTF File Parsing ---------------- RTF (Residue Topology File) fragments contain atom type and charge information for substituents in MSLD simulations. Each substituent has a ``site#_sub#_pres.rtf`` file generated by msld-py-prep's charge renormalization algorithm. **Charge Renormalization Background**: The charge renormalization process ensures that: 1. All substituents at a site have **identical atom types** for corresponding atoms 2. All substituents at a site have **similar partial charges** (within tolerance) 3. The sum of charges equals the original ligand's total charge This allows smooth λ-dynamics transitions between chemically diverse substituents. **RTF File Format**: .. code-block:: text PRES SUB1 ATOM C1 CG2R61 -0.115 ATOM H1 HGR61 0.115 ATOM N1 NG2R60 -0.570 ATOM H2 HGR61 0.115 ... The parser extracts: * **Atom name**: C1, H1, etc. (column 2) * **Atom type**: CG2R61, HGR61, etc. (column 3) - used for CGenFF force field * **Charge**: -0.115, 0.115, etc. (column 4) - partial charges in electron units PDB File Parsing ---------------- PDB files contain 3D atomic coordinates and are used for: * Computing AEVs (Atomic Environment Vectors) for DeepSet embeddings * Spatial filtering in multi-site systems (finding nearby substituents) * Validating element identification against CGenFF atom types **Fragment Structure**: msld-py-prep generates fragment PDB files (``site#_sub#_frag.pdb``) that include: 1. **Anchor atoms**: First atoms connecting to core (inherited from core structure) 2. **Fragment atoms**: Unique atoms specific to this substituent The anchor atoms ensure proper bonding when fragments are combined with the core during MSLD simulations. These atoms appear in both the core and fragment PDB files, so duplicate detection is important when combining structures for AEV computation. **Three-Tier Parsing Strategy**: 1. **RDKit parser** (primary): Handles standard PDB formats from databases like RCSB 2. **Flexible whitespace parsing** (fallback): For CHARMM-style non-standard formats 3. **Fixed-column parsing** (last resort): Standard PDB column positions **Spatial Filtering for Multi-Site Systems**: For accurate AEV computation, identify nearby substituents from other sites: .. code-block:: python from mllf.file_handling.read_pdb import ( find_reference_subs_from_other_sites, calculate_min_distance, find_nearby_pdbs ) # Find reference substituents (sub1) from other sites within 5.1 Å nearby_refs = find_reference_subs_from_other_sites( target_pdb='site1_sub2_frag.pdb', prep_dir='prep', cutoff=5.1 # ANI-2x radial cutoff ) # Returns: ['site2_sub1_frag.pdb', 'site5_sub1_frag.pdb', ...] # Calculate minimum distance between two structures coords1, _ = parse_pdb_file('core.pdb') coords2, _ = parse_pdb_file('site1_sub2_frag.pdb') min_dist = calculate_min_distance(coords1, coords2) print(f"Minimum distance: {min_dist:.2f} Å") **Duplicate Detection** (for core/substituent overlap): .. code-block:: python from mllf.file_handling.read_pdb import ( find_duplicate_atoms, remove_duplicate_atoms ) core_coords, core_elements = parse_pdb_file('core.pdb') sub_coords, sub_elements = parse_pdb_file('site1_sub2_frag.pdb') # Find atoms in sub that duplicate atoms in core duplicate_indices = find_duplicate_atoms(core_coords, sub_coords, tolerance=1e-4) # Remove duplicates filtered_coords, filtered_elements = remove_duplicate_atoms( sub_coords, sub_elements, core_coords, tolerance=1e-4 ) Bias Coefficient Files ----------------------- Bias coefficients are stored in ``variables.py`` files containing YAML-formatted bias matrices (b, c, x, s). These files are read by the MSLD simulator. **Reading Bias Coefficients**: .. code-block:: python from mllf.file_handling.read_bias_coeff import parse_bias_file # Parse a variables.py file bias_data = parse_bias_file('variables.py') # { # 'b': [0.1, 0.2, -0.05], # Linear bias vector (length N) # 'c': [[0.0, 0.3, -0.1], # Quadratic bias matrix (N x N) # [0.0, 0.0, 0.2], # [0.0, 0.0, 0.0]], # 'x': [[...], ...], # Skew bias matrix # 's': [[...], ...] # End bias matrix # } **Writing Bias Coefficients**: .. code-block:: python from mllf.file_handling.write_bias_coeff import write_bias_inp_from_graph # Write from a Graph object write_bias_inp_from_graph( graph=graph, filename='combo_dir/variables.py', sub_counts=[3, 2], # 3 subs at site 1, 2 subs at site 2 mapping={'cs': 'quadratic', 'ss': 'end', 'xs': 'skew'} ) **Matrix Conventions**: * **b (linear)**: 1D vector, one value per substituent * **c (quadratic)**: Antisymmetric matrix where ``c[i][j] = -c[j][i]`` * **x (skew)**: Full matrix, forward/backward directions independent * **s (end)**: Full matrix, forward/backward directions independent Simulation Output Parsing -------------------------- Parse MSLD simulation output to extract population counts, transitions, transition rates, and per-pair relative free-energy differences (:math:`\Delta\Delta G_{i \to j}`) between substituents. **Basic Usage**: .. code-block:: python from mllf.file_handling.read_output import ( terminated_normally, parse_single_population, parse_transitions_and_rates, parse_single_ddg, ) # Read simulation output with open('output.txt', 'r') as f: output_text = f.read() # Check termination status success = terminated_normally(output_text) # Parse population counts populations = parse_single_population(output_text) # { # 1: {'counts': {0.95: 1250, 0.99: 1180}, 'site': 1}, # 2: {'counts': {0.95: 1320, 0.99: 1290}, 'site': 1}, # 3: {'counts': {0.95: 950, 0.99: 890}, 'site': 2}, # ... # } # Parse transitions and rates transitions, rates = parse_transitions_and_rates(output_text) # transitions: {site: {lambda: count}} # rates: {site: {lambda: rate}} # Parse per-pair biased ΔΔG (relative free-energy difference between substituents) ddg = parse_single_ddg(output_text) # {(2, 3): -1.23, (2, 4): None, (3, 4): None, ...} # float : ΔΔG_{i→j} at highest λ — round-trip transitions observed # None : NaN or ±Inf in output — no round-trip crossings, ΔΔG undefined **Output Format Examples**: Population table from MSLD output: .. code-block:: text Total Population Count BLOCK> 0.95 0.99 SINGLE POPULATION> 1 1250 1180 SINGLE POPULATION> 2 1320 1290 SINGLE POPULATION> 3 950 890 Transitions table: .. code-block:: text SINGLE TRANSITIONS> 1 245 238 SINGLE TRANSITIONS> 2 198 192 Transition rates: .. code-block:: text SINGLE TRANS RATES> 1 0.196 0.190 SINGLE TRANS RATES> 2 0.158 0.154 ΔΔG table (biased relative free-energy difference :math:`\Delta\Delta G_{i \to j}` per substituent pair): .. code-block:: text BLK(I)..BLK(J).....> 0.950 ....> 0.990 .....> 0.950 ....> 0.990 SINGLE DDG> 2 3 -Infinity -Infinity -Infinity -Infinity SINGLE DDG> 2 4 NaN NaN NaN NaN SINGLE DDG> 3 4 Infinity Infinity Infinity Infinity Columns are: ``nobias_λ1, nobias_λ2, bias_λ1, bias_λ2``. ``parse_single_ddg`` returns the last bias column (highest λ, e.g. 0.990) per pair as the :math:`\Delta\Delta G_{i \to j}` estimate — the biased relative free-energy difference between substituents *i* and *j* at that λ cutoff. ``NaN`` means zero crossings between that pair; ``±Infinity`` means only one crossing direction occurred (no round-trip estimate). Both are returned as ``None`` because no :math:`\Delta\Delta G` can be computed without at least one complete round-trip transition. A finite value means round-trip transitions were observed and the :math:`\Delta\Delta G` estimate is considered reliable. Block IDs start at 2 in CHARMM MSLD output (block 1 = reference). For a 0-based node index ``i``, ``block_id = i + 2``. **Dynamic Lambda Detection**: The parser automatically detects lambda values from header lines: * Searches for ``BLOCK>`` or ``SITE>`` headers containing lambda values * Falls back to default ``[0.95, 0.99]`` if not found * Handles variable numbers of lambda windows Combination Generation ---------------------- During contextual bandit training, the policy network learns by running MSLD simulations on different substituent combinations and using the results as rewards. Rather than evaluating all possible combinations upfront, combinations are generated dynamically during training to provide diverse molecular contexts for policy optimization. **Purpose of Combinations**: A combination represents a specific selection of substituents across all λ-sites that will be included in a single MSLD simulation. For example, in a 2-site system with 3 substituents at site 1 and 2 substituents at site 2, a combination might select substituents [1, 2] from site 1 and substituent [1] from site 2, creating a 3-substituent simulation (2 at site 1, 1 at site 2). **Combination Strategy**: The combination generator uses a **rotating anchor strategy** to ensure comprehensive coverage of the substituent space: * **Within-site combinations**: Each substituent serves as an "anchor" with additional substituents from the same site forming the tail. For example, with substituents [1,2,3]: - Anchor 1 generates: [1,2], [1,3], [1,2,3] - Anchor 2 generates: [2,1], [2,3], [2,1,3] - Anchor 3 generates: [3,1], [3,2], [3,1,2] This ensures all pairwise and multi-substituent interactions are explored with each substituent taking a turn as the "reference" state. * **Cross-site combinations**: For multi-site systems, within-site selections are combined via Cartesian product. Each site contributes at least 2 substituents, ensuring sufficient alchemical transitions for meaningful simulation statistics. For example, if site 1 has 75 valid selections and site 2 has 186 valid selections, this generates 13,950 cross-site combinations. * **Size limits**: Combinations are limited to a maximum number of substituents per site (default: 10) to prevent computational expense from large simulations. All substituents can still participate across different combinations. **File Renumbering**: Since each combination includes only a subset of substituents, files are renumbered to ensure consistent indexing within each combination directory. For example, if a combination selects original substituents [2, 5] from site 1, these are renumbered as [1, 2] in the combination directory. The ``mapping.json`` file records the original file paths and new names for traceability. **RTF Topology Updates**: RTF files contain PRES tokens that reference site and substituent numbers. When files are copied and renumbered, these tokens are automatically updated to match the new indices, ensuring CHARMM can correctly parse the topology. **Single-Site Handling**: For testing single-site combinations (pairwise interactions within one site), the generator automatically augments the core structure with the first substituent from any excluded sites. This maintains molecular completeness—if you're only testing site 1 pairs but the system also has site 2, the core.pdb and core.rtf are augmented with site 2's sub1 atoms so the full molecular structure is preserved. See Also -------- * :doc:`cb_setup` - CB infrastructure and graph construction * :doc:`deepset_pretraining` - AEV computation and atom features * :doc:`workflow` - Complete training workflow * :doc:`api` - API reference for file handling modules **External Tools**: * `msld-py-prep `_ - Upstream file preparation * `ParamChem `_ - CGenFF parameterization server * `TorchANI `_ - ANI-2x AEV computation