Algorithm: online actor-critic
For t = 1 to T − 1 do:
1) In s_t, take action a_t ~ π_θ(a_t | s_t), get (s_t, a_t, r_t, s_{t+1})
2) update V̂ with target r_t + γ V̂(s_{t+1})
3) evaluate Â(s_t, a_t) = r_t + γ V̂(s_{t+1}) − V̂(s_t)
4) ∇_θ J(θ) ≈ ∇_θ log π_θ(a_t | s_t) Â(s_t, a_t)
5) θ ← θ + α ∇_θ J(θ)
end for
where α is the learning rate and γ is a discount factor.
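The loop above can be sketched in code. The following is a minimal tabular illustration, not a definitive implementation: the 3-state chain MDP, the step sizes, and the softmax-over-logits actor are all assumptions chosen to keep the example self-contained. The critic update (step 2), TD-error advantage (step 3), and policy-gradient step (steps 4–5) map one-to-one onto the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP for illustration: a 3-state chain.
# Action 0 moves left, action 1 moves right; reaching the last state
# gives reward 1 and restarts the episode at state 0.
n_states, n_actions = 3, 2
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.1  # discount and learning rates

def env_step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

V = np.zeros(n_states)                  # tabular critic V̂(s)
theta = np.zeros((n_states, n_actions)) # actor logits θ[s, a]

def policy(s):
    # softmax policy π_θ(· | s), shifted for numerical stability
    z = theta[s] - theta[s].max()
    return np.exp(z) / np.exp(z).sum()

s = 0
for t in range(2000):
    # 1) take a_t ~ π_θ(· | s_t), observe (s_t, a_t, r_t, s_{t+1})
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    r, s_next = env_step(s, a)
    # 2)-3) TD error doubles as the advantage estimate:
    #        Â(s_t, a_t) = r_t + γ V̂(s_{t+1}) − V̂(s_t)
    td = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td            # move V̂(s_t) toward the target
    # 4)-5) for tabular softmax, ∇_θ log π(a|s) = one_hot(a) − p
    grad_logp = -p
    grad_logp[a] += 1.0
    theta[s] += alpha_pi * td * grad_logp
    # restart at state 0 when the goal is reached
    s = 0 if s_next == n_states - 1 else s_next

# The learned policy should prefer moving right in the non-terminal states.
print([int(policy(si).argmax()) for si in range(n_states - 1)])
```

Computing a single TD error and reusing it as the advantage is a common simplification; step 3 of the pseudocode makes the same choice by reusing the critic target from step 2.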