mllf.cb.policy

Edge value policy built on top of a node encoder.

For each directed edge, this policy produces a vector of means and a vector of log-standard-deviations, with one entry of each per predicted coefficient. The MLP outputs a single concatenated vector [mu_1, …, mu_D, logsigma_1, …, logsigma_D], which is split into its two halves and used to parameterize an independent Gaussian distribution for each edge. The agent samples continuous actions v_ij ~ N(mu_ij, sigma_ij^2) for every directed edge and returns the sampled values together with the per-edge log-probabilities.
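The split-and-sample step described above can be illustrated for a single edge with a minimal standard-library sketch. It is not the module's implementation: the helper names `split_params`, `gaussian_log_prob`, and `sample_edge_action` are hypothetical, and summing the log-density over the D coefficients to obtain one per-edge log-probability is an assumption that follows from the stated independence.

```python
import math
import random


def split_params(out, d):
    """Split a flat output [mu_1..mu_D, logsigma_1..logsigma_D] into two lists."""
    if len(out) != 2 * d:
        raise ValueError("expected 2 * D outputs")
    return out[:d], out[d:]


def gaussian_log_prob(values, mu, log_sigma):
    """Log-density of independent Gaussians, summed over the D coefficients.

    Assumption: the per-edge log-probability is the sum over coefficients.
    """
    total = 0.0
    for v, m, ls in zip(values, mu, log_sigma):
        z = (v - m) / math.exp(ls)  # standardized residual
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def sample_edge_action(mu, log_sigma, rng):
    """Sample v ~ N(mu, sigma^2) per coefficient; return (values, log_prob)."""
    values = [rng.gauss(m, math.exp(ls)) for m, ls in zip(mu, log_sigma)]
    return values, gaussian_log_prob(values, mu, log_sigma)


# Hypothetical MLP output for one edge with D = 2 coefficients.
mu, log_sigma = split_params([0.5, -1.0, 0.0, -0.5], 2)
values, log_prob = sample_edge_action(mu, log_sigma, random.Random(0))
```

Predicting log sigma rather than sigma keeps the standard deviation positive after exponentiation without constraining the network output.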

Classes

`EdgePolicy`(encoder, emb_dim[, ...])
`EdgeValueMLP`(in_dim[, hidden, ...])