mllf.cb.value_net

Value network for baseline estimation in REINFORCE.

The value network learns to predict the expected reward for a given combination, providing a state-dependent baseline that reduces the variance of policy gradient updates. A learned baseline of this kind is a standard component of actor-critic methods (A2C, PPO, etc.).
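The baseline idea described above can be illustrated with a minimal, standard-library sketch. It is not the mllf implementation. `LinearValueBaseline`, `advantage`, and the toy reward function are hypothetical names introduced here, and a linear model stands in for the neural network. The sketch assumes each combination is summarized by a small feature vector, fits the baseline to observed rewards by squared-error regression, and then compares the variance of raw rewards with the variance of the advantages `reward - V(state)` that would scale the policy gradient.

```python
import random
import statistics


class LinearValueBaseline:
    """Hypothetical stand-in for a value network: V(s) = w . s + b."""

    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, state):
        return sum(wi * si for wi, si in zip(self.w, state)) + self.b

    def update(self, state, reward):
        # One SGD step on the squared error 0.5 * (V(s) - reward) ** 2.
        err = self.predict(state) - reward
        self.w = [wi - self.lr * err * si for wi, si in zip(self.w, state)]
        self.b -= self.lr * err


def advantage(baseline, state, reward):
    # The quantity that multiplies grad log pi(a | s) in REINFORCE with a baseline.
    return reward - baseline.predict(state)


def sample(rng):
    # Toy environment (an assumption for illustration): reward is linear in the
    # state features plus Gaussian noise.
    state = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    reward = 2.0 * state[0] - 1.0 * state[1] + 1.0 + rng.gauss(0.0, 0.1)
    return state, reward


rng = random.Random(0)
baseline = LinearValueBaseline(dim=2)

# Fit the baseline to observed rewards.
for _ in range(2000):
    s, r = sample(rng)
    baseline.update(s, r)

# Compare the spread of raw rewards and of advantages on fresh samples.
eval_batch = [sample(rng) for _ in range(500)]
raw_var = statistics.pvariance([r for _, r in eval_batch])
adv_var = statistics.pvariance([advantage(baseline, s, r) for s, r in eval_batch])
print(f"reward variance: {raw_var:.3f}, advantage variance: {adv_var:.3f}")
```

In this toy setting the advantages vary far less than the raw rewards, which is the effect the baseline is meant to have on the gradient estimate. A real value network would replace the linear model with a learned function of the graph encoding.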

Classes

QNetwork(in_dim[, action_dim, hidden_dims])

Per-edge Q(s, a) critic for per-pair credit assignment.

ValueNetwork(emb_dim[, hidden_dims])

Value network that predicts expected reward from graph encoding.