Workflow System

Overview

The workflow system automates the complete pipeline for multisite λ-dynamics simulations with contextual bandit training:

Combination Generation: Create all valid site/substituent combinations
Splitting: Divide combinations into training/validation/test sets
Training: Train CB policy on graph structures
Simulation: Run MD simulations with optimized bias coefficients
Compression: Archive simulation outputs for storage

Running the Workflow

Basic Usage

The workflow is driven by a YAML configuration file:

python -m mllf.cli.workflow --config examples/workflow_14benz.yaml

Or using the convenience wrapper:

cd examples
python run_workflow_deepset.py workflow_14benz.yaml
python run_workflow_deepset.py my_config.yaml  # Use custom config

Configuration Format

A workflow config is a YAML file specifying which operations to run and their parameters. Key sections include:

system: Environment type (solvent, gas, protein)
create_combos: Generate combinations from fragment files
split: Divide combinations into train/val/test sets
pretrain: Optional pretraining from existing simulations
curriculum: Progressive training stages (see Curriculum Learning)
training: Model architecture and hyperparameters
reward: Reward function weights and thresholds
output: Checkpointing and output organization
archive: Automatic compression of completed runs

See the Configuration File for a full annotated YAML file.

Combination Generation

Principles

Combinations are generated from site/substituent fragment files:

Input files: site{N}_sub{M}_{label}.{rtf,pdb} files in the input directory
Sites: Identified by the site number (N)
Substituents: Identified by the sub number (M) within each site

Warning

Minimum Substituents Required: Each site must have at least 2 substituents. MSLD simulations will not run correctly with only a single substituent at a site. If any site has only 1 substituent, combination generation will fail with an error. To resolve this, either add more substituents to the site or add the site information to your core structure files (e.g., core.pdb and core.rtf if using msld-py-prep).
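The naming convention and the two-substituent rule can be checked before generation runs. The following is a minimal sketch, assuming fragment files follow the site{N}_sub{M}_{label} pattern above; the helper names `group_substituents` and `check_min_substituents` and the use of `ValueError` are illustrative and not part of mllf.

```python
import re
from collections import defaultdict

# Matches names such as site2_sub1_pres.rtf or site2_sub1_frag.pdb
FRAGMENT_RE = re.compile(r"^site(\d+)_sub(\d+)_(.+)\.(rtf|pdb)$")


def group_substituents(filenames):
    """Map each site number to the set of substituent numbers found."""
    sites = defaultdict(set)
    for name in filenames:
        match = FRAGMENT_RE.match(name)
        if match:  # files such as core.rtf do not match and are skipped
            sites[int(match.group(1))].add(int(match.group(2)))
    return dict(sites)


def check_min_substituents(filenames, minimum=2):
    """Raise ValueError if any site has fewer than `minimum` substituents."""
    sites = group_substituents(filenames)
    short = sorted(site for site, subs in sites.items() if len(subs) < minimum)
    if short:
        raise ValueError(f"Sites with fewer than {minimum} substituents: {short}")
    return sites
```

Running a check like this on the input directory listing reports every under-populated site at once, before any combinations are enumerated.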

The generator creates two types of combinations:

Within-site combinations: Multiple substituents from a single site
Cross-site combinations: Substituents from multiple sites simultaneously

Note

Combination Size Limit: By default, each combination is limited to at most 10 substituents per site (max_subs_per_site=10). This prevents combinatorial explosion while still allowing all substituents to participate across different combinations. For example, with 50 substituents at a site, the generator will create combinations like [1,2,...,10], [1,2,...,9,11], etc., but not [1,2,...,11]. This limit can be increased via the --max-subs command-line option or the max_subs_per_site parameter in the API.
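The effect of the size cap can be illustrated with a short sketch. It assumes within-site combinations are plain subsets with at least two members (following the warning above); the function name `within_site_combos` is hypothetical, and the real generator may enumerate or sample combinations differently.

```python
from itertools import combinations


def within_site_combos(subs, max_subs_per_site=10, min_size=2):
    """Yield every subset of `subs` with min_size..max_subs_per_site members."""
    subs = sorted(subs)
    largest = min(max_subs_per_site, len(subs))
    for size in range(min_size, largest + 1):
        yield from combinations(subs, size)


# Four substituents with a cap of three: 6 pairs + 4 triplets, no 4-member set
combos = list(within_site_combos([1, 2, 3, 4], max_subs_per_site=3))
print(len(combos))  # 10
```

Every substituent still appears in some combination, but no single combination exceeds the cap.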

Lazy Directory Creation

For systems with large combination spaces (e.g., 14,211 total combinations), creating all directories upfront is inefficient—most will never be used in training. The workflow implements lazy (on-demand) directory creation:

Metadata Generation: During the combination generation phase, the system:

Lists all possible combinations without creating directories
Saves metadata to combo_metadata.json with:
- Combination name (e.g., comb_0001_site2_1__site2_2)
- Path where directory will be created
- Sites and substituents included
- Counter for ordering
Writes manifest files listing all possible combinations

On-Demand Creation: Directories are created only when needed:

During training/validation splits, combinations are selected but not created
When a combination is accessed for training, the workflow:
1. Checks if the directory exists
2. If not, loads metadata from combo_metadata.json
3. Creates the directory with all required files
4. Continues with training
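
The four steps above can be sketched as follows. This is a minimal illustration, assuming combo_metadata.json is a JSON object keyed by combination name (the actual schema may differ); `ensure_combo_dir` is a hypothetical helper, and copying the fragment files is left as a comment.

```python
import json
from pathlib import Path


def ensure_combo_dir(out_dir, name):
    """Return the directory for `name`, creating it only if it is missing."""
    out_dir = Path(out_dir)
    combo_dir = out_dir / name
    if combo_dir.is_dir():                      # step 1: already created
        return combo_dir
    with open(out_dir / "combo_metadata.json") as fh:
        metadata = json.load(fh)                # step 2: load metadata
    if name not in metadata:                    # assumed schema: keyed by name
        raise KeyError(f"Unknown combination: {name}")
    (combo_dir / "prep").mkdir(parents=True)    # step 3: create the directory tree
    # The real workflow would also copy fragment files and write info.py here.
    return combo_dir                            # step 4: caller continues training
```

A second call for the same combination returns immediately at step 1, so the metadata file is read only on first access.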

Benefits:

Disk space efficiency: Only about 2% of possible combinations are created (e.g., 142 training + 142 validation = 284 out of 14,211 total)
Faster initialization: Split generation completes in seconds instead of hours
Filesystem efficiency: Avoid creating thousands of unused directories
Scalability: Handle massive combination spaces (100K+ combinations)

Directory Structure

Each combination directory (created on-demand) has a standardized structure:

generated_combos/
├── combo_metadata.json                  # Metadata for all combinations
├── manifest.txt                         # List of all combination names
├── train_manifest.txt                   # Training combination names
├── val_manifest.txt                     # Validation combination names
├── test_manifest.txt                    # Test combination names
├── comb_0001_site2_1__site2_2/          # Created on-demand
│   ├── info.py                          # System configuration
│   ├── mapping.json                     # File renumbering mapping
│   ├── msld_flat.py                     # Simulation script (copied)
│   └── prep/
│       ├── site2_sub1_pres.rtf
│       ├── site2_sub1_frag.pdb
│       ├── site2_sub2_pres.rtf
│       ├── site2_sub2_frag.pdb
│       ├── core.rtf
│       ├── core.prm
│       └── other_support_files...
├── comb_0262_site1_1__site1_2__site2_1__site2_2/  # Cross-site
│   ├── info.py
│   ├── mapping.json
│   ├── msld_flat.py
│   └── prep/
│       ├── site1_sub1_pres.rtf          # Preserves site numbering
│       ├── site1_sub2_pres.rtf
│       ├── site2_sub1_pres.rtf          # Site 2 keeps site2_ prefix
│       ├── site2_sub2_pres.rtf
│       └── ...
└── ...

File Naming Convention: The renaming preserves site identity:

Files maintain their site number (site1_*, site2_*, etc.)
Substituents are renumbered sequentially within each site
Original site/sub mapping is preserved in mapping.json

Combination Metadata Files

Each combination directory contains standardized metadata files:

info.py: System configuration loaded by simulation scripts

import numpy as np
import os

info = {}
info['name'] = 'comb_0262_site1_1__site1_2__site2_1__site2_2'
info['nsubs'] = [2, 2]              # Substituents per site [site1, site2]
info['nblocks'] = np.sum(info['nsubs'])  # Total substituents (4)
info['ncentral'] = 0                # Central replica for replica exchange
info['nreps'] = 1                   # Number of replicas
info['nnodes'] = 1                  # MPI nodes
info['enginepath'] = os.environ.get('CHARMMEXEC', '')
info['temp'] = 298.15               # Temperature in Kelvin

mapping.json: File renumbering information

[
  {
    "original": "/path/to/site1_sub2_pres.rtf",
    "new_name": "site1_sub1_pres.rtf",
    "original_site": 1,
    "original_sub": 2,
    "new_site": 1,
    "new_sub": 1
  },
  {
    "original": "/path/to/site2_sub5_pres.rtf",
    "new_name": "site2_sub1_pres.rtf",
    "original_site": 2,
    "original_sub": 5,
    "new_site": 2,
    "new_sub": 1
  }
]

This tracks how original fragment files were renumbered during combination creation, enabling traceability back to source files.
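
The renumbering recorded in mapping.json can be sketched as below. This assumes substituents are renumbered in ascending order of their original numbers within each site; the function name `build_mapping` and its `label` and `source_dir` parameters are hypothetical.

```python
from collections import defaultdict


def build_mapping(selected, label="pres.rtf", source_dir="/path/to"):
    """Renumber substituents sequentially within each site.

    `selected` is a list of (original_site, original_sub) tuples.
    """
    next_sub = defaultdict(int)  # last substituent index assigned per site
    mapping = []
    for site, sub in sorted(selected):
        next_sub[site] += 1
        mapping.append({
            "original": f"{source_dir}/site{site}_sub{sub}_{label}",
            "new_name": f"site{site}_sub{next_sub[site]}_{label}",
            "original_site": site,
            "original_sub": sub,
            "new_site": site,        # site numbers are never changed
            "new_sub": next_sub[site],
        })
    return mapping
```

Calling `build_mapping([(1, 2), (2, 5)])` yields entries with the same fields as the example above: site numbers are kept, and each site's substituents restart at 1.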

Manifest Files

Manifest files list combination names (one per line):

comb_0001_site2_1__site2_2
comb_0002_site2_1__site2_3
comb_0003_site2_1__site2_4
comb_0075_site1_5__site1_1__site1_2
comb_0262_site1_1__site1_2__site2_1__site2_2
...

Manifest files enable reproducible splits and batch operations. The full paths are constructed by prepending the out_dir from the configuration: {out_dir}/{combo_name}.
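
Reading a manifest and building full paths takes only a few lines. A minimal sketch, in which `read_manifest` is a hypothetical helper rather than an mllf function:

```python
from pathlib import Path


def read_manifest(manifest_path, out_dir):
    """Return one full combination path per non-empty manifest line."""
    out_dir = Path(out_dir)
    lines = Path(manifest_path).read_text().splitlines()
    return [out_dir / line.strip() for line in lines if line.strip()]
```

Because directories are created lazily, the returned paths may not exist on disk yet.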

Graph Construction

During training, molecular graphs are constructed from combination directories to provide input for the policy network. Graphs are built from RTF topology files with DeepSet embeddings as node features, representing each substituent’s 3D structure and chemistry as learned 64-dimensional vectors.

For complete details on graph construction, node features, edge expansion, and the RGCN/policy architecture, see Contextual Bandit Setup.

Training Pipeline

System Configuration

The system section specifies environment-level parameters that affect how molecular structures are processed during training:

system:
  solvent_state: solv  # Environment type

Solvent State:

Specifies the simulation environment, which determines which atoms are included as context when atomic environment vectors (AEVs) are computed for the DeepSet embeddings:

solv or solvent: Includes core structure and nearby substituents from other sites (within 5.1 Å)
gas or vacuum: Includes core structure and nearby substituents (without solvent effects)
protein: Includes core structure, nearby substituents, AND nearby protein atoms (within 5.1 Å)

The environment type determines what molecular context the DeepSet encoder “sees” when computing atomic environment vectors. For protein systems, including nearby protein atoms in the AEV computation naturally encodes protein-specific interactions into the learned embeddings. See AtomBondGNN Pretraining for technical details on context-aware AEV computation.

The solvent state is also preserved in graph_info.json for metadata tracking.

Auto-Detection (legacy):

Previously, the system attempted to auto-detect solvent state from directory names (e.g., 14benz_solv → solv). This is now deprecated in favor of explicit configuration for clarity and reliability.

Reward Function

Pretraining

Before training begins, the policy can be pretrained using behavior cloning (supervised learning with MSE loss) to imitate successful bias coefficients from completed simulations. For complete details on pretraining loss, data organization, and transfer learning strategies, see CB Behavior Cloning.

Training Reward

During training, the policy is optimized using REINFORCE with rewards computed from simulation trajectories. The reward function prevents degenerate solutions (e.g., convergence to single-substituent states) through multiple components:

\[R_{\text{total}} = coverage\_factor \times (R_P + R_T + R_{\text{entropy}}) + R_{\text{penalties}}\]

where \(coverage\_factor = \left(\frac{N_{\text{visited}}}{N_{\text{subs}}}\right)^2\) is a smooth quadratic multiplier that scales all positive reward components by coverage. At 100% coverage it is 1.0; at 50% it is 0.25; at 0% it is 0.0. This replaces the earlier hard completeness gate (which clipped all positive reward to −0.01 when any substituent was unvisited) with a smooth gradient signal that rewards partial progress.

This eliminates \(R_U\) (the explicit uniformity term) and the adaptive coverage penalty \(P_{\text{cov}}\) — both are now subsumed by \(coverage\_factor\).
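
The coverage multiplier and the total reward follow directly from the formula above. A minimal sketch with hypothetical function names, taking the individual reward components as precomputed inputs:

```python
def coverage_factor(n_visited, n_subs):
    """Quadratic coverage multiplier (N_visited / N_subs)^2."""
    if n_subs <= 0:
        raise ValueError("n_subs must be positive")
    return (n_visited / n_subs) ** 2


def total_reward(r_p, r_t, r_entropy, r_penalties, n_visited, n_subs):
    """R_total = coverage_factor * (R_P + R_T + R_entropy) + R_penalties."""
    positive = r_p + r_t + r_entropy
    return coverage_factor(n_visited, n_subs) * positive + r_penalties
```

Note that the penalties are added outside the multiplier, so a run with zero coverage still receives the full penalty.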

Population Balance Reward \(R_P\):

Rewards the total population sampled across visited substituents, scaled by a confidence factor:

\[R_P = w_P \cdot \frac{\sum_{k \in \text{visited}} p_k}{P_{\text{baseline}}} \cdot C_F\]

where:

\(w_P\) is the population weight (default: 0.5)
\(p_k\) is the population count for visited substituent \(k\)
\(P_{\text{baseline}}\) is the normalization constant (default: 500.0)
\(C_F = \min(1.0, T_{\min} / (2 \times N_{\text{req}}))\) is the confidence factor
\(T_{\min}\) is the minimum transitions across all sites
\(N_{\text{req}}\) is the minimum required transitions per site (default: 10)

The confidence factor scales population rewards based on data reliability, reducing false rewards from low-transition runs with unreliable population distributions. Within-visited uniformity is now captured entirely by \(R_{\text{entropy}}\) (see below) rather than by the balance factor \(e^{-CV}\), which has been removed.
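
As a worked example of the formula, here is a minimal sketch using the documented defaults; the function name `population_reward` is hypothetical.

```python
def population_reward(visited_pops, t_min, w_p=0.5, p_baseline=500.0, n_req=10):
    """R_P = w_P * (sum of visited populations / P_baseline) * C_F."""
    confidence = min(1.0, t_min / (2 * n_req))  # C_F saturates at 2 * N_req
    return w_p * sum(visited_pops) / p_baseline * confidence
```

With populations of 250 and 250 and at least 20 transitions at every site, the reward equals w_P (0.5); with only 5 transitions at the worst site, the confidence factor drops to 0.25 and the reward to 0.125.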

Transition Reward \(R_T\):

Rewards frequent transitions between substituents, with a bonus for high transition counts:

\[\begin{split}R_T = \begin{cases} w_T \cdot \frac{\sum_{s=1}^{N_{\text{sites}}} T_s}{T_{\text{baseline}}} & \text{if all sites have } \geq \text{min\_transitions\_per\_site} \\ w_T \cdot \frac{\sum_{s=1}^{N_{\text{sites}}} T_s}{T_{\text{baseline}}} \times 1.5 & \text{if avg. trans/site} > 2 \times \text{min\_transitions\_per\_site} \\ 0 & \text{otherwise (sites below threshold)} \end{cases}\end{split}\]

where:

\(w_T\) is the transition weight (default: 0.75)
\(T_s\) is the transition count for site \(s\)
\(T_{\text{baseline}}\) is the normalization constant (default: 50.0)
The 1.5× bonus applies when the average number of transitions per site exceeds 20 (2× the default minimum of 10)
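
The piecewise definition can be sketched as follows. The cases above overlap, so this sketch makes one interpretation explicit as an assumption: the reward is zero if any site is below the threshold, and the 1.5× bonus is applied on top of the base reward only when every site meets the threshold. The function name `transition_reward` is hypothetical.

```python
def transition_reward(transitions, w_t=0.75, t_baseline=50.0,
                      min_transitions_per_site=10):
    """R_T for a list of per-site transition counts T_s."""
    if not transitions or min(transitions) < min_transitions_per_site:
        return 0.0  # at least one site is below the threshold
    reward = w_t * sum(transitions) / t_baseline
    average = sum(transitions) / len(transitions)
    if average > 2 * min_transitions_per_site:
        reward *= 1.5  # bonus for high transition counts
    return reward
```

For two sites with 10 transitions each the reward is 0.75 × 20 / 50 = 0.3; with 30 transitions each it is 0.75 × 60 / 50 × 1.5 = 1.35.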

Entropy Bonus \(R_{\text{entropy}}\):

Rewards uniform population distributions using normalized Shannon entropy:

\[R_{\text{entropy}} = \alpha_{\text{entropy}} \cdot \frac{H(\mathbf{p})}{H_{\max}}\]

where \(H(\mathbf{p}) = -\sum_k \frac{p_k}{P_{\text{total}}} \log \frac{p_k}{P_{\text{total}}}\) is Shannon entropy and \(H_{\max} = \log(N_{\text{subs}})\) is maximum possible entropy.
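
A minimal sketch of the entropy bonus, using the default coefficient of 8.0. The function name and the handling of edge cases (empty populations, a single substituent) are assumptions of this sketch.

```python
import math


def entropy_bonus(populations, alpha=8.0):
    """alpha * H(p) / H_max with natural-log Shannon entropy."""
    total = sum(populations)
    n_subs = len(populations)
    if total <= 0 or n_subs < 2:
        return 0.0  # H_max = log(1) = 0, so the ratio is undefined
    # Zero populations contribute nothing (0 * log 0 is taken as 0)
    fractions = [p / total for p in populations if p > 0]
    entropy = -sum(f * math.log(f) for f in fractions)
    return alpha * entropy / math.log(n_subs)
```

A perfectly uniform distribution earns the full coefficient, while a run that visits a single substituent earns nothing. Because the ratio is normalized, the base of the logarithm does not matter.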

Tiered Transition Penalties \(R_{\text{penalties}}\):

The penalty system uses three tiers based on the worst-performing site, with multi-site awareness to fairly handle systems with multiple λ-sites:

Base Penalty (determined by \(T_{\min}\), the minimum transitions across all sites):

\[\begin{split}P_{\text{base}} = \begin{cases} 40.0 & \text{if } T_{\min} = 0 \quad \text{(Tier 1: "Death Floor")} \\ 32.0 & \text{if } T_{\min} = 1 \\ 24.0 & \text{if } T_{\min} = 2 \\ 2.0 + 2.0(N_{\text{req}} - T_{\min}) & \text{if } 3 \leq T_{\min} < N_{\text{req}} \quad \text{(Tier 2: "Climbing Ramp")} \\ 0.0 & \text{if } T_{\min} \geq N_{\text{req}} \quad \text{(Tier 3: "Success Zone")} \end{cases}\end{split}\]

Multi-Site Degradation (incremental penalty for multiple failing sites):

\[\begin{split}P_{\text{trans}} = \begin{cases} P_{\text{base}} + 4.0(n_{\text{bad}} - 1) & \text{if } n_{\text{bad}} > 1 \\ P_{\text{base}} & \text{if } n_{\text{bad}} = 1 \\ 0 & \text{if } n_{\text{bad}} = 0 \end{cases}\end{split}\]

where \(n_{\text{bad}} = |\{s : T_s < N_{\text{req}}\}|\) counts sites below threshold.

Concentration Penalty (per-site check for single-substituent dominance):

\[P_{\text{conc}} = \sum_{s=1}^{N_{\text{sites}}} \mathbb{1}\left[\frac{\max_k p_{s,k}}{\sum_k p_{s,k}} > 0.8\right] \cdot \gamma \cdot 5.0 \cdot \left(\frac{\max_k p_{s,k}}{\sum_k p_{s,k}} - 0.8\right)\]

Total penalties are summed and clamped: \(R_{\text{penalties}} = -\min(60.0, P_{\text{trans}} + P_{\text{conc}})\)
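
The three penalty formulas and the final clamp can be combined in a short sketch. Function names are hypothetical; defaults follow the documented values (N_req = 10, γ = 4.0, threshold 0.8, cap 60.0).

```python
def base_penalty(t_min, n_req=10):
    """Tiered penalty determined by the worst-performing site."""
    if t_min >= n_req:
        return 0.0                          # Tier 3: success zone
    fixed = {0: 40.0, 1: 32.0, 2: 24.0}     # Tier 1 values
    if t_min in fixed:
        return fixed[t_min]
    return 2.0 + 2.0 * (n_req - t_min)      # Tier 2: climbing ramp


def transition_penalty(transitions, n_req=10):
    """P_trans: base penalty plus 4.0 per additional failing site."""
    n_bad = sum(1 for t in transitions if t < n_req)
    if n_bad == 0:
        return 0.0
    return base_penalty(min(transitions), n_req) + 4.0 * (n_bad - 1)


def concentration_penalty(site_pops, gamma=4.0, threshold=0.8):
    """P_conc: penalize sites where one substituent dominates."""
    penalty = 0.0
    for pops in site_pops:
        total = sum(pops)
        if total <= 0:
            continue
        top = max(pops) / total
        if top > threshold:
            penalty += gamma * 5.0 * (top - threshold)
    return penalty


def penalty_reward(transitions, site_pops, n_req=10, gamma=4.0, cap=60.0):
    """R_penalties = -min(cap, P_trans + P_conc)."""
    total = transition_penalty(transitions, n_req)
    total += concentration_penalty(site_pops, gamma)
    return -min(cap, total)
```

For example, two sites with zero transitions give P_trans = 40 + 4 = 44, and a site with a 90/10 population split adds 4.0 × 5.0 × 0.1 = 2.0.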

Default Hyperparameters:

reward:
  w_P: 0.5                                # Population weight
  w_T: 0.75                               # Transition weight
  w_U: 0.3                                # Accepted for API compatibility; coverage handled by coverage_factor
  gamma: 4.0                              # Base penalty coefficient
  P_baseline: 500.0                       # Population normalization
  T_baseline: 50.0                        # Transition normalization
  min_transitions_per_site: 10            # Tier 3 threshold
  min_coverage_ratio: 0.5                 # Accepted for API compatibility; coverage handled by coverage_factor
  entropy_bonus: 8.0                      # Entropy bonus coefficient
  concentration_penalty_threshold: 0.8    # Single-substituent dominance threshold

Policy Gradient Training:

The policy is optimized using an Actor-Critic architecture where the policy network (actor) predicts bias coefficients and a value network (critic) provides state-dependent baselines for variance reduction. This approach prevents catastrophic forgetting of pretrained weights and enables more stable learning.

For architectural details on the RGCN encoder, policy network, and value network, see Contextual Bandit Setup.

Simulation Execution

Launching Simulations

Simulations are launched via subprocess, running CHARMM with bias coefficients written to variables.py from the policy’s sampled actions. The simulator outputs transition counts and population distributions for reward computation.

Output Parsing

After simulation completes, the framework parses output.txt from the output directory to extract:

Total transitions per site \(T_s\) for each λ-site
Per-substituent populations \(p_{s,k}\) at each site
Coverage ratio (fraction of substituents visited)
Per-site concentration (maximum population fraction at each site)

These metrics feed directly into the reward function components described in the Reward Function section above.
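
The two derived metrics can be computed from the parsed populations as sketched below. The format of output.txt is not described here, so this sketch starts from already-parsed per-site population lists and assumes that a substituent counts as visited when its population is greater than zero; both function names are hypothetical.

```python
def coverage_ratio(site_pops):
    """Fraction of substituents (across all sites) with nonzero population."""
    visited = sum(1 for pops in site_pops for p in pops if p > 0)
    total = sum(len(pops) for pops in site_pops)
    return visited / total if total else 0.0


def site_concentrations(site_pops):
    """Maximum population fraction at each site."""
    result = []
    for pops in site_pops:
        total = sum(pops)
        result.append(max(pops) / total if total > 0 else 0.0)
    return result
```

For populations [[90, 10], [50, 0]], three of four substituents are visited (coverage 0.75) and the per-site concentrations are 0.9 and 1.0.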

Curriculum Learning

Curriculum learning progressively trains the policy on increasingly complex combinations, similar to how students learn from simple to complex problems. Instead of training on all possible combinations at once, the policy masters simpler tasks before advancing to harder ones.

Why Curriculum Learning for MSLD

MSLD bias coefficient optimization has a natural difficulty hierarchy:

Easy: Single-site pairs (2 substituents, 1 site)

Simplest edge interactions to learn
Clear cause-and-effect relationships
Provides foundation for pairwise biases

Medium: Single-site triplets (3 substituents, 1 site)

Introduces crowding/density effects
More complex interaction patterns
Tests generalization from pairs

Hard: Multi-site combinations (2+ sites with multiple substituents each)

Cross-site interaction effects
Exponentially larger search space
Requires composition of learned patterns

Training directly on hard combinations often fails because:

Reward signals are noisy and unclear
Policy has no foundation to build upon
Pretrained weights get overwhelmed by complex gradients

Curriculum learning solves this by building skills incrementally.

Configuration

Enable curriculum learning in your workflow YAML:

curriculum:
  enabled: true
  max_train_combos_per_stage: 100  # Optional: limit combinations per stage

  stages:
    # Stage 1: Pairs at single sites
    - name: pairs_single_site_easy
      min_subs: 2
      max_subs: 2
      min_sites: 1
      max_sites: 1
      epochs: 50

    # Stage 2: Triplets at single sites
    - name: triplets_single_site
      min_subs: 3
      max_subs: 3
      min_sites: 1
      max_sites: 1
      epochs: 50

    # Stage 3: Cross-site combinations
    - name: pairs_two_sites
      min_subs: 4  # 2 per site
      max_subs: 4
      min_sites: 2
      max_sites: 2
      epochs: 50

  # Progression criteria
  progression:
    type: epoch  # Advance after completing stage epochs

Stage Configuration

Each stage specifies:

Combination Filters:

min_subs, max_subs: Total substituents in combination
min_sites, max_sites: Number of sites represented

Training Duration:

epochs: Number of training epochs for this stage

Optional Settings:

max_train_combos: Stage-specific limit on training combinations (overrides global setting)
reward_override: Modify reward weights for this stage (e.g., emphasize transitions early)

Combination Selection

Filtering Process:

For each stage, the workflow:

Filters all training combinations by stage criteria (min/max subs/sites)
If filtered count exceeds max_train_combos_per_stage, randomly selects subset
Uses reproducible random selection (seeded by split.seed + stage_index)

Important: Random selection is uniform across all matching combinations. If a stage allows both pairs (2 subs) and triplets (3 subs) via min_subs: 2, max_subs: 3, the 100 selected combinations will be a random mix with no preference for either size.

Reproducibility: Same seed produces same combination selection across runs.
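
The filtering and seeded selection can be sketched as follows. This assumes each combination is described by a dict with `n_subs` and `n_sites` keys (a hypothetical representation) and that a stage without site limits accepts any number of sites; `select_stage_combos` is not an mllf function.

```python
import random


def select_stage_combos(combos, stage, stage_index, split_seed, max_combos=None):
    """Filter combinations by stage criteria, then subsample reproducibly."""
    min_sites = stage.get("min_sites", 1)
    max_sites = stage.get("max_sites", float("inf"))
    matching = [
        c for c in combos
        if stage["min_subs"] <= c["n_subs"] <= stage["max_subs"]
        and min_sites <= c["n_sites"] <= max_sites
    ]
    if max_combos is not None and len(matching) > max_combos:
        # A dedicated generator keeps the selection independent of global state
        rng = random.Random(split_seed + stage_index)
        matching = rng.sample(matching, max_combos)
    return matching
```

Seeding a separate `random.Random` instance per stage means the subset for stage 2 does not depend on how many random draws stage 1 consumed.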

Progression Criteria

Stages advance based on progression criteria:

Epoch-based (default):

progression:
  type: epoch

Advances after completing the specified number of epochs for current stage.

Reward-based (experimental):

progression:
  type: reward
  reward_threshold: 10.0  # Minimum average reward to advance

Advances only if average reward over last 5 epochs exceeds threshold.

Combined:

progression:
  type: both
  reward_threshold: 10.0

Must complete all epochs AND meet reward threshold.
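
The three progression types reduce to a small decision function. A minimal sketch, assuming the reward check averages the last five epoch rewards as described above; `should_advance` and its arguments are hypothetical names.

```python
def should_advance(progression, epochs_done, stage_epochs, rewards, window=5):
    """Decide whether the curriculum moves to the next stage."""
    kind = progression.get("type", "epoch")
    epochs_ok = epochs_done >= stage_epochs
    recent = rewards[-window:]
    threshold = progression.get("reward_threshold", float("inf"))
    reward_ok = bool(recent) and (sum(recent) / len(recent)) > threshold
    if kind == "epoch":
        return epochs_ok
    if kind == "reward":
        return reward_ok
    if kind == "both":
        return epochs_ok and reward_ok
    raise ValueError(f"Unknown progression type: {kind}")
```

In this sketch a missing `reward_threshold` makes the reward check fail rather than pass, which is a conservative choice and not necessarily what the workflow does.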

Training Flow Example

=== Training with Curriculum ===

Stage 1: pairs_single_site_easy (epochs 1-50)
├── Filtered: 41 combinations (2 subs, 1 site)
├── Training on all 41 combinations
└── Epoch 50 completes → Advance to Stage 2

Stage 2: triplets_single_site (epochs 51-100)
├── Filtered: 186 combinations (3 subs, 1 site)
├── Limited to 100 random combinations
└── Epoch 100 completes → Advance to Stage 3

Stage 3: pairs_two_sites (epochs 101-150)
├── Filtered: 1,681 combinations (4 subs, 2 sites)
├── Limited to 100 random combinations
└── Epoch 150 completes → Training complete

Training Output:

=== Starting Stage 1/3: pairs_single_site_easy ===
Filtered to 41 training combinations for this stage

--- Epoch 1/150 - Stage 1/3: pairs_single_site_easy (epoch 1/50) ---
Epoch 1 Stats:
  Loss: 12.3456
  Value Loss: 45.6789
  Avg Reward: -28.5432

[... epochs 2-50 ...]

============================================================
=== Advancing to Stage 2/3: triplets_single_site ===
============================================================
Filtered to 186 training combinations for this stage
Limiting to 100 random training combos (from 186 available)

--- Epoch 51/150 - Stage 2/3: triplets_single_site (epoch 1/50) ---

Stage-Specific Reward Tuning

Advanced users can override reward parameters per stage:

stages:
  - name: pairs_single_site_easy
    min_subs: 2
    max_subs: 2
    min_sites: 1
    max_sites: 1
    epochs: 50
    reward_override:
      w_T: 0.9              # Emphasize transitions early
      min_transitions_per_site: 5  # Lower threshold for easier combinations

This allows fine-tuning the reward function to match stage difficulty.
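
One plausible way to apply such an override is a shallow merge of the stage's `reward_override` into the global reward settings. This is a sketch of that idea, not a description of how mllf resolves overrides; `stage_reward_config` is a hypothetical helper.

```python
def stage_reward_config(base_reward, stage):
    """Return the global reward settings with the stage's overrides applied."""
    return {**base_reward, **stage.get("reward_override", {})}
```

The merge builds a new dict, so the global settings are left untouched when the next stage starts.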

Checkpointing and Resume

Long-running training jobs (e.g., 50 epochs) can be interrupted by SLURM time limits, system maintenance, or manual cancellation. The workflow implements two-level checkpointing to enable automatic resume without losing progress.

Configuration

Enable checkpointing in your workflow YAML:

output:
  base_dir: /path/to/training_output
  save_checkpoints: true    # Enable checkpoint saving
  checkpoint_freq: 5         # Save every N epochs

Training-Level Checkpoints

Location: {base_dir}/checkpoint_epoch_XXX.pt

Saved every checkpoint_freq epochs, containing:

epoch: Completed epoch number
encoder_state: Full RGCN encoder state dict
policy_state: Full edge policy state dict
optimizer_state: Optimizer state (momentum, learning rates, etc.)
stats: Training statistics (loss, average reward)

Automatic Resume

When training restarts, the workflow:

Scans for checkpoint_epoch_*.pt files
Loads the latest checkpoint (highest epoch number)
Restores model and optimizer state
Continues from the next epoch
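
Finding the latest checkpoint can be sketched with the standard library alone. The helper name `find_latest_checkpoint` is hypothetical; restoring the model and optimizer state from the returned file is left to the training code.

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint_epoch_(\d+)\.pt$")


def find_latest_checkpoint(base_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    found = []
    for path in Path(base_dir).glob("checkpoint_epoch_*.pt"):
        match = CHECKPOINT_RE.match(path.name)
        if match:  # ignore files whose suffix is not a number
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None
```

Comparing the parsed integers rather than the file names avoids ordering mistakes when epoch numbers have different widths. Training would then resume at `epoch + 1`.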

Combination-level checkpoints provide the second level. For each combination in each epoch, the workflow:

Checks for epoch_results.pt in the combination’s directory
If found: loads cached reward/actions/logp, skips simulation
If not found: runs simulation, computes reward, saves checkpoint

Archiving Combinations

Combination directories can be automatically archived to save disk space using two strategies: per-stage archiving (during curriculum training) or post-training archiving (after all training completes). Each combination directory is compressed into a .tar.gz file, optionally removing the original.

Configuration

Enable archiving in your workflow YAML:

archive:
  enabled: true               # Enable archiving
  per_stage: true             # Archive after each curriculum stage (or false for post-training)
  pattern: 'comb_*'           # Glob pattern for directories to archive (post-training only)
  remove_after: false         # Remove originals after successful archiving
  archive_dir: /path/to/archives  # Where to store .tar.gz files

Per-Stage Archiving (Curriculum Training)

Best for: Long curriculum training runs where disk space is limited.

When per_stage: true, the workflow archives combinations at the end of each curriculum stage in the background while the next stage’s simulations begin. This provides:

Immediate space recovery: Free up disk as soon as each stage completes
No training delays: Archiving runs concurrently with next stage setup
Stage-specific organization: Each stage gets its own archive directory

Behavior:

After a curriculum stage completes (e.g., after epoch 50 of stage 1)
Archive job launches in background (bash script with tar commands)
Next stage begins immediately (simulations submit while archiving runs)
After training completes, workflow waits for any remaining archive jobs
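
The background pattern can be illustrated in pure Python. The workflow itself launches a bash script with tar commands; this sketch swaps in the standard `tarfile` module and a thread pool to show the same submit-now, wait-later structure. Both function names are hypothetical.

```python
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def archive_one(combo_dir, archive_root):
    """Compress one combination directory into <archive_root>/<name>.tar.gz."""
    combo_dir = Path(combo_dir)
    archive_root = Path(archive_root)
    archive_root.mkdir(parents=True, exist_ok=True)
    target = archive_root / f"{combo_dir.name}.tar.gz"
    with tarfile.open(target, "w:gz") as tar:
        tar.add(combo_dir, arcname=combo_dir.name)
    return target


def archive_stage_in_background(executor, combo_dirs, archive_root):
    """Submit one archive task per directory and return the futures."""
    return [executor.submit(archive_one, d, archive_root) for d in combo_dirs]
```

The caller keeps the returned futures, starts the next stage right away, and calls `result()` on each future after training to wait for any archives still in progress.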

Configuration Example:

curriculum:
  enabled: true
  stages:
    - name: pairs_single_site_easy
      min_subs: 2
      max_subs: 2
      epochs: 50
    - name: pairs_single_site_full
      min_subs: 2
      max_subs: 2
      epochs: 50

archive:
  enabled: true
  per_stage: true              # Archive after each stage
  remove_after: false
  archive_dir: /path/to/archives

Timeline:

Epoch 1-50 (Stage 1) → Stage 1 completes → Archive job starts in background
                                          ↓
Epoch 51 begins (Stage 2) ← Simulations submit while Stage 1 archives

Epoch 51-100 (Stage 2) → Stage 2 completes → Archive job starts in background
                                            ↓
Training completes ← Workflow waits for the Stage 2 archive job to finish

Post-Training Archiving

Best for: Non-curriculum training or when you want to keep all data until the end.

When per_stage: false (or not specified), the workflow archives combinations once after all training completes.

Behavior:

After training completes successfully, all directories matching pattern are compressed into individual .tar.gz archives
Archives are moved to archive_dir (if different from source)
Original directories are removed if remove_after is true

Configuration Example:

archive:
  enabled: true
  per_stage: false             # Archive once at the end (default)
  pattern: 'comb_*'            # Directories to archive
  remove_after: false
  archive_dir: /path/to/archives

Manual Archiving

You can also archive combinations manually:

from mllf.file_handling.generate_combinations import archive_combo_dirs
from pathlib import Path

# Archive all comb_* directories
archived = archive_combo_dirs(
    out_dir=Path('generated_combos'),
    pattern='comb_*',
    remove=False  # Keep originals
)

print(f"Created {len(archived)} archive files")

Complete Workflow Example

Full Pipeline Script

The main training workflow is implemented in examples/run_workflow_deepset.py:

cd examples
python run_workflow_deepset.py workflow_14benz.yaml

This executes:

Combination generation (if create_combos specified)
Train/val/test split based on split configuration
Model initialization (RGCN encoder + edge policy)
Checkpoint detection and resume (if checkpoints exist)
Training loop with SLURM job submission
Checkpoint saving at checkpoint_freq intervals
Archiving combinations (if archive.enabled is true)

Configuration File

A complete workflow configuration (workflow_14benz.yaml) includes:

# System environment
system:
  solvent_state: solv

# Generate combinations
create_combos:
  input_dir: /path/to/14benz
  out_dir: /path/to/generated_combos
  include_patterns: [msld_flat.py]

# Data splitting
split:
  train_frac: 0.9
  val_frac: 0.1
  seed: 42

# Pretraining (optional but recommended)
pretrain:
  model_path: models/pretrained_policy.pt

# Curriculum learning
curriculum:
  enabled: true
  max_train_combos_per_stage: 100
  stages:
    - name: pairs_single_site
      min_subs: 2
      max_subs: 2
      epochs: 50
    - name: triplets_single_site
      min_subs: 3
      max_subs: 3
      epochs: 50
  progression:
    type: epoch

# Model architecture
training:
  encoder:
    hidden_dims: [64, 64]
    out_dim: 32
  policy:
    mlp_hidden: 64
  value_network:
    hidden_dims: [64, 32]
    lr: 0.001
  optimizer:
    lr: 0.0001

# Simulation settings
run_sims: true
max_concurrent_jobs: 60
timeout: 1200

# Reward function
reward:
  w_P: 0.5
  w_T: 0.75
  w_U: 0.3
  gamma: 4.0
  entropy_bonus: 8.0

# Checkpointing
output:
  base_dir: /path/to/training_output
  save_checkpoints: true
  checkpoint_freq: 5

# Per-stage archiving
archive:
  enabled: true
  per_stage: true
  archive_dir: /path/to/archives
