Algorithm: online actor-critic
For t = 1 to T − 1 do:
1) In s_t, take action a_t ~ π_θ(a_t | s_t), get (s_t, a_t, r_t, s_{t+1})
2) update V̂ with target r_t + γ V̂(s_{t+1})
3) evaluate Â(s_t, a_t) = r_t + γ V̂(s_{t+1}) − V̂(s_t)
4) ∇_θ J(θ) ≈ ∇_θ log π_θ(a_t | s_t) Â(s_t, a_t)
5) θ ← θ + α ∇_θ J(θ)
end for
where α is the learning rate and γ is a discount factor.
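The loop above can be sketched in code. The following is a minimal tabular illustration, not a definitive implementation: the 3-state chain MDP, the step sizes, and the softmax-over-logits actor are all assumptions chosen to keep the example self-contained. The critic update (step 2), TD-error advantage (step 3), and policy-gradient step (steps 4–5) map one-to-one onto the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP for illustration: a 3-state chain.
# Action 0 moves left, action 1 moves right; reaching the last state
# gives reward 1 and restarts the episode at state 0.
n_states, n_actions = 3, 2
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.1  # discount and learning rates

def env_step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

V = np.zeros(n_states)                  # tabular critic V̂(s)
theta = np.zeros((n_states, n_actions)) # actor logits θ[s, a]

def policy(s):
    # softmax policy π_θ(· | s), shifted for numerical stability
    z = theta[s] - theta[s].max()
    return np.exp(z) / np.exp(z).sum()

s = 0
for t in range(2000):
    # 1) take a_t ~ π_θ(· | s_t), observe (s_t, a_t, r_t, s_{t+1})
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    r, s_next = env_step(s, a)
    # 2)-3) TD error doubles as the advantage estimate:
    #        Â(s_t, a_t) = r_t + γ V̂(s_{t+1}) − V̂(s_t)
    td = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td            # move V̂(s_t) toward the target
    # 4)-5) for tabular softmax, ∇_θ log π(a|s) = one_hot(a) − p
    grad_logp = -p
    grad_logp[a] += 1.0
    theta[s] += alpha_pi * td * grad_logp
    # restart at state 0 when the goal is reached
    s = 0 if s_next == n_states - 1 else s_next

# The learned policy should prefer moving right in the non-terminal states.
print([int(policy(si).argmax()) for si in range(n_states - 1)])
```

Computing a single TD error and reusing it as the advantage is a common simplification; step 3 of the pseudocode makes the same choice by reusing the critic target from step 2.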