mllf.cb.value_net
Value network for baseline estimation in REINFORCE.
The value network learns to predict the expected reward for a given combination, providing a state-dependent baseline that reduces the variance of policy gradient updates without biasing them. Learned baselines of this kind are a standard component of actor-critic methods such as A2C and PPO.
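To illustrate how a learned baseline is used, here is a minimal sketch in plain Python. It is not this module's implementation: `LinearBaseline`, `true_reward`, and `advantage` are hypothetical names, and a linear model stands in for the graph-encoding network. The idea shown is only that REINFORCE weights the log-probability gradient by `reward - V(s)` instead of the raw reward, while the baseline itself is fit by regression on observed rewards.

```python
import random


class LinearBaseline:
    """Hypothetical stand-in for a value network: V(s) = w . s + b."""

    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, state):
        return sum(wi * si for wi, si in zip(self.w, state)) + self.b

    def update(self, state, reward):
        # One SGD step on the squared error 0.5 * (V(s) - reward)^2.
        err = self.predict(state) - reward
        self.w = [wi - self.lr * err * si for wi, si in zip(self.w, state)]
        self.b -= self.lr * err
        return err


def advantage(baseline, state, reward):
    # REINFORCE with a baseline scales grad log pi(a|s) by this quantity.
    return reward - baseline.predict(state)


def true_reward(state):
    # Assumed toy reward, used only to generate training data for the sketch.
    return 3.0 * state[0] - 1.0 * state[1] + 2.0


rng = random.Random(0)
baseline = LinearBaseline(dim=2)

for _ in range(2000):
    s = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    r = true_reward(s) + rng.gauss(0, 0.1)  # noisy observed reward
    baseline.update(s, r)

probe = [0.5, -0.5]
raw = true_reward(probe)
adv = advantage(baseline, probe, raw)
print(f"raw reward: {raw:.2f}, advantage after baseline: {adv:.3f}")
```

Once the baseline has been fit, the advantage at the probe state is much smaller in magnitude than the raw reward, which is the source of the variance reduction in the policy gradient estimate.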
Classes
Per-edge Q(s, a) critic for per-pair credit assignment.
Value network that predicts expected reward from graph encoding.