File Handling
=============

Overview
--------

The ``mllf.file_handling`` module provides parsers and writers for the various file formats
used in multisite λ-dynamics simulations. These utilities handle:

* **RTF files**: CHARMM topology fragments with atom types and charges
* **PDB files**: Protein Data Bank structure files with 3D coordinates
* **Bias coefficient files**: Variables files containing bias matrices (b, c, x, s)
* **Simulation output**: Parsing population counts, transitions, and rates from MSLD output

The RTF and PDB files are typically generated by the upstream tool `msld-py-prep 
<https://github.com/Vilseck-Lab/msld-py-prep>`_, which handles ligand alignment, charge 
renormalization, and fragment decomposition. See the `File Preparation Workflow`_ section 
below for details on this upstream process.

.. _File Preparation Workflow:

File Preparation Workflow
--------------------------

Before using ``mllf`` for bias coefficient prediction, ligand structures must be prepared
using `msld-py-prep <https://github.com/Vilseck-Lab/msld-py-prep>`_. This tool performs:

1. **Alignment**: Aligns ligands to a representative structure
2. **Maximum Common Substructure (MCS) search**: Identifies common core atoms
3. **Charge Renormalization**: Adjusts partial charges for consistent sampling
4. **Fragment Generation**: Creates core and site-specific substituent files

**Important Validation**:

* Check charge deviation: msld-py-prep flags >5% charge changes
* Verify total charge: Sum of core + all site charges should equal ligand charge
* Visual inspection: Use PyMOL's ``vis_check.py`` to verify core/fragment assignment

**Common Pitfalls**:

1. **Chiral groups**: MCS may incorrectly merge chiral atoms. Workaround: temporarily
   rename chiral atom types before MCS, then restore afterward.
2. **Large charge deviations**: If renormalization causes >5% charge change, consider
   moving problematic core atoms to fragment definitions in MCS_for_MSLD.txt.
3. **Lone pairs**: msld-py-prep may include virtual sites (LP atom types). The ``mllf``
   RTF parser automatically filters these out.

**For More Information**:

See the `msld-py-prep examples <https://github.com/Vilseck-Lab/msld-py-prep/tree/main/examples>`_
for detailed tutorials. Key reference:

* Vilseck et al. "Optimizing Multisite λ-Dynamics Throughput with Charge Renormalization"
  *J. Chem. Inf. Model.* 2022, 62(6), 1479-1488. DOI: 10.1021/acs.jcim.2c00047

RTF File Parsing
----------------

RTF (Residue Topology File) fragments contain atom type and charge information for
substituents in MSLD simulations. Each substituent has a ``site#_sub#_pres.rtf`` file
generated by msld-py-prep's charge renormalization algorithm.

**Charge Renormalization Background**:

The charge renormalization process ensures that:

1. All substituents at a site have **identical atom types** for corresponding atoms
2. All substituents at a site have **similar partial charges** (within tolerance)
3. The sum of charges equals the original ligand's total charge

This allows smooth λ-dynamics transitions between chemically diverse substituents.

**RTF File Format**:

.. code-block:: text

   PRES SUB1
   ATOM C1   CG2R61  -0.115
   ATOM H1   HGR61    0.115
   ATOM N1   NG2R60  -0.570
   ATOM H2   HGR61    0.115
   ...

The parser extracts:

* **Atom name**: C1, H1, etc. (column 2)
* **Atom type**: CG2R61, HGR61, etc. (column 3) - used for CGenFF force field
* **Charge**: -0.115, 0.115, etc. (column 4) - partial charges in electron units

PDB File Parsing
----------------

PDB files contain 3D atomic coordinates and are used for:

* Computing AEVs (Atomic Environment Vectors) for DeepSet embeddings
* Spatial filtering in multi-site systems (finding nearby substituents)
* Validating element identification against CGenFF atom types

**Fragment Structure**:

msld-py-prep generates fragment PDB files (``site#_sub#_frag.pdb``) that include:

1. **Anchor atoms**: First atoms connecting to core (inherited from core structure)
2. **Fragment atoms**: Unique atoms specific to this substituent

The anchor atoms ensure proper bonding when fragments are combined with the core during
MSLD simulations. These atoms appear in both the core and fragment PDB files, so
duplicate detection is important when combining structures for AEV computation.

**Three-Tier Parsing Strategy**:

1. **RDKit parser** (primary): Handles standard PDB formats from databases like RCSB
2. **Flexible whitespace parsing** (fallback): For CHARMM-style non-standard formats
3. **Fixed-column parsing** (last resort): Standard PDB column positions

**Spatial Filtering for Multi-Site Systems**:

For accurate AEV computation, identify nearby substituents from other sites:

.. code-block:: python

   from mllf.file_handling.read_pdb import (
       find_reference_subs_from_other_sites,
       calculate_min_distance,
       find_nearby_pdbs
   )
   
   # Find reference substituents (sub1) from other sites within 5.1 Å
   nearby_refs = find_reference_subs_from_other_sites(
       target_pdb='site1_sub2_frag.pdb',
       prep_dir='prep',
       cutoff=5.1  # ANI-2x radial cutoff
   )
   # Returns: ['site2_sub1_frag.pdb', 'site5_sub1_frag.pdb', ...]
   
   # Calculate minimum distance between two structures
   coords1, _ = parse_pdb_file('core.pdb')
   coords2, _ = parse_pdb_file('site1_sub2_frag.pdb')
   min_dist = calculate_min_distance(coords1, coords2)
   print(f"Minimum distance: {min_dist:.2f} Å")

**Duplicate Detection** (for core/substituent overlap):

.. code-block:: python

   from mllf.file_handling.read_pdb import (
       find_duplicate_atoms,
       remove_duplicate_atoms
   )
   
   core_coords, core_elements = parse_pdb_file('core.pdb')
   sub_coords, sub_elements = parse_pdb_file('site1_sub2_frag.pdb')
   
   # Find atoms in sub that duplicate atoms in core
   duplicate_indices = find_duplicate_atoms(core_coords, sub_coords, tolerance=1e-4)
   
   # Remove duplicates
   filtered_coords, filtered_elements = remove_duplicate_atoms(
       sub_coords, sub_elements, core_coords, tolerance=1e-4
   )


Bias Coefficient Files
-----------------------

Bias coefficients are stored in ``variables.py`` files containing YAML-formatted
bias matrices (b, c, x, s). These files are read by the MSLD simulator.

**Reading Bias Coefficients**:

.. code-block:: python

   from mllf.file_handling.read_bias_coeff import parse_bias_file
   
   # Parse a variables.py file
   bias_data = parse_bias_file('variables.py')
   # {
   #   'b': [0.1, 0.2, -0.05],  # Linear bias vector (length N)
   #   'c': [[0.0, 0.3, -0.1],  # Quadratic bias matrix (N x N)
   #         [0.0, 0.0, 0.2],
   #         [0.0, 0.0, 0.0]],
   #   'x': [[...], ...],       # Skew bias matrix
   #   's': [[...], ...]        # End bias matrix
   # }

**Writing Bias Coefficients**:

.. code-block:: python

   from mllf.file_handling.write_bias_coeff import write_bias_inp_from_graph
   
   # Write from a Graph object
   write_bias_inp_from_graph(
       graph=graph,
       filename='combo_dir/variables.py',
       sub_counts=[3, 2],  # 3 subs at site 1, 2 subs at site 2
       mapping={'cs': 'quadratic', 'ss': 'end', 'xs': 'skew'}
   )

**Matrix Conventions**:

* **b (linear)**: 1D vector, one value per substituent
* **c (quadratic)**: Antisymmetric matrix where ``c[i][j] = -c[j][i]``
* **x (skew)**: Full matrix, forward/backward directions independent
* **s (end)**: Full matrix, forward/backward directions independent

Simulation Output Parsing
--------------------------

Parse MSLD simulation output to extract population counts, transitions, transition rates,
and per-pair relative free-energy differences (:math:`\Delta\Delta G_{i \to j}`) between
substituents.

**Basic Usage**:

.. code-block:: python

   from mllf.file_handling.read_output import (
       terminated_normally,
       parse_single_population,
       parse_transitions_and_rates,
       parse_single_ddg,
   )
   
   # Read simulation output
   with open('output.txt', 'r') as f:
       output_text = f.read()
   
   # Check termination status
   success = terminated_normally(output_text)
   
   # Parse population counts
   populations = parse_single_population(output_text)
   # {
   #   1: {'counts': {0.95: 1250, 0.99: 1180}, 'site': 1},
   #   2: {'counts': {0.95: 1320, 0.99: 1290}, 'site': 1},
   #   3: {'counts': {0.95: 950, 0.99: 890}, 'site': 2},
   #   ...
   # }
   
   # Parse transitions and rates
   transitions, rates = parse_transitions_and_rates(output_text)
   # transitions: {site: {lambda: count}}
   # rates: {site: {lambda: rate}}
   
   # Parse per-pair biased ΔΔG (relative free-energy difference between substituents)
   ddg = parse_single_ddg(output_text)
   # {(2, 3): -1.23, (2, 4): None, (3, 4): None, ...}
   # float : ΔΔG_{i→j} at highest λ — round-trip transitions observed
   # None  : NaN or ±Inf in output — no round-trip crossings, ΔΔG undefined

**Output Format Examples**:

Population table from MSLD output:

.. code-block:: text

   Total Population Count
   BLOCK> 0.95 0.99
   SINGLE POPULATION>   1      1250    1180
   SINGLE POPULATION>   2      1320    1290
   SINGLE POPULATION>   3       950     890

Transitions table:

.. code-block:: text

   SINGLE TRANSITIONS>   1      245     238
   SINGLE TRANSITIONS>   2      198     192

Transition rates:

.. code-block:: text

   SINGLE TRANS RATES>   1    0.196   0.190
   SINGLE TRANS RATES>   2    0.158   0.154

ΔΔG table (biased relative free-energy difference :math:`\Delta\Delta G_{i \to j}` per substituent pair):

.. code-block:: text

                BLK(I)..BLK(J).....> 0.950 ....> 0.990 .....> 0.950 ....> 0.990
   SINGLE DDG>       2      3   -Infinity   -Infinity   -Infinity   -Infinity
   SINGLE DDG>       2      4         NaN         NaN         NaN         NaN
   SINGLE DDG>       3      4    Infinity    Infinity    Infinity    Infinity

Columns are: ``nobias_λ1, nobias_λ2, bias_λ1, bias_λ2``.
``parse_single_ddg`` returns the last bias column (highest λ, e.g. 0.990) per pair as
the :math:`\Delta\Delta G_{i \to j}` estimate — the biased relative free-energy difference
between substituents *i* and *j* at that λ cutoff.
``NaN`` means zero crossings between that pair; ``±Infinity`` means only one crossing
direction occurred (no round-trip estimate). Both are returned as ``None`` because no
:math:`\Delta\Delta G` can be computed without at least one complete round-trip transition.
A finite value means round-trip transitions were observed and the :math:`\Delta\Delta G`
estimate is considered reliable.

Block IDs start at 2 in CHARMM MSLD output (block 1 = reference). For a 0-based node
index ``i``, ``block_id = i + 2``.

**Dynamic Lambda Detection**:

The parser automatically detects lambda values from header lines:

* Searches for ``BLOCK>`` or ``SITE>`` headers containing lambda values
* Falls back to default ``[0.95, 0.99]`` if not found
* Handles variable numbers of lambda windows

Combination Generation
----------------------

During contextual bandit training, the policy network learns by running MSLD simulations 
on different substituent combinations and using the results as rewards. Rather than 
evaluating all possible combinations upfront, combinations are generated dynamically 
during training to provide diverse molecular contexts for policy optimization.

**Purpose of Combinations**:

A combination represents a specific selection of substituents across all λ-sites that 
will be included in a single MSLD simulation. For example, in a 2-site system with 
3 substituents at site 1 and 2 substituents at site 2, a combination might select 
substituents [1, 2] from site 1 and substituent [1] from site 2, creating a 3-substituent 
simulation (2 at site 1, 1 at site 2).

**Combination Strategy**:

The combination generator uses a **rotating anchor strategy** to ensure comprehensive 
coverage of the substituent space:

* **Within-site combinations**: Each substituent serves as an "anchor" with additional 
  substituents from the same site forming the tail. For example, with substituents [1,2,3]:
  
  - Anchor 1 generates: [1,2], [1,3], [1,2,3]
  - Anchor 2 generates: [2,1], [2,3], [2,1,3]
  - Anchor 3 generates: [3,1], [3,2], [3,1,2]
  
  This ensures all pairwise and multi-substituent interactions are explored with each 
  substituent taking a turn as the "reference" state.

* **Cross-site combinations**: For multi-site systems, within-site selections are 
  combined via Cartesian product. Each site contributes at least 2 substituents, ensuring 
  sufficient alchemical transitions for meaningful simulation statistics. For example, 
  if site 1 has 75 valid selections and site 2 has 186 valid selections, this generates 
  13,950 cross-site combinations.

* **Size limits**: Combinations are limited to a maximum number of substituents per site 
  (default: 10) to prevent computational expense from large simulations. All substituents 
  can still participate across different combinations.

**File Renumbering**:

Since each combination includes only a subset of substituents, files are renumbered 
to ensure consistent indexing within each combination directory. For example, if a 
combination selects original substituents [2, 5] from site 1, these are renumbered 
as [1, 2] in the combination directory. The ``mapping.json`` file records the original 
file paths and new names for traceability.

**RTF Topology Updates**:

RTF files contain PRES tokens that reference site and substituent numbers. When files 
are copied and renumbered, these tokens are automatically updated to match the new 
indices, ensuring CHARMM can correctly parse the topology.

**Single-Site Handling**:

For testing single-site combinations (pairwise interactions within one site), 
the generator automatically augments the core structure with the first substituent 
from any excluded sites. This maintains molecular completeness—if you're only testing 
site 1 pairs but the system also has site 2, the core.pdb and core.rtf are augmented 
with site 2's sub1 atoms so the full molecular structure is preserved.


See Also
--------

* :doc:`cb_setup` - CB infrastructure and graph construction
* :doc:`deepset_pretraining` - AEV computation and atom features
* :doc:`workflow` - Complete training workflow
* :doc:`api` - API reference for file handling modules

**External Tools**:

* `msld-py-prep <https://github.com/Vilseck-Lab/msld-py-prep>`_ - Upstream file preparation
* `ParamChem <https://cgenff.umaryland.edu/>`_ - CGenFF parameterization server
* `TorchANI <https://aiqm.github.io/torchani/>`_ - ANI-2x AEV computation