Algorithm: online actor-critic

For t = 1 to T − 1 do:

1) In s_t, take action a_t ~ π_θ(a_t | s_t), get (s_t, a_t, r_t, s_{t+1})

2) update V with target r_t + γ V(s_{t+1})

3) evaluate A(s_t, a_t) = r_t + γ V(s_{t+1}) − V(s_t)

4) ∇_θ J(θ) ≈ ∇_θ log π_θ(a_t | s_t) A(s_t, a_t)

5) θ ← θ + α ∇_θ J(θ)

end for

where α is the learning rate and γ is the discount factor. Note that step 3 reuses the critic's target: the advantage estimate is just the TD error, target − V(s_t).
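The five steps above can be sketched in code. This is a minimal illustration, not a definitive implementation: it assumes a tabular critic V, a softmax policy with logits θ, and a small hypothetical deterministic MDP (the `step` function, state/action counts, and learning rates are all made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2          # illustrative toy sizes
gamma, alpha, alpha_v = 0.99, 0.1, 0.1  # discount, actor lr, critic lr

V = np.zeros(n_states)                   # critic: V(s)
theta = np.zeros((n_states, n_actions))  # actor: softmax logits

def policy(s):
    """π_θ(a | s): softmax over the logits for state s."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def step(s, a):
    """Hypothetical toy MDP: action 1 moves right; the rightmost state pays reward 1."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

s = 0
for t in range(500):
    # 1) in s_t, sample a_t ~ π_θ, observe (s_t, a_t, r_t, s_{t+1})
    a = rng.choice(n_actions, p=policy(s))
    r, s_next = step(s, a)

    # 2) update V with target r_t + γ V(s_{t+1})
    target = r + gamma * V[s_next]
    # 3) A(s_t, a_t) = r_t + γ V(s_{t+1}) − V(s_t), i.e. the TD error
    A = target - V[s]
    V[s] += alpha_v * A

    # 4) ∇_θ log π_θ(a_t|s_t) · A(s_t, a_t); for a softmax the
    #    log-prob gradient w.r.t. the logits of s is (one_hot(a) − π(·|s))
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    # 5) gradient ascent: θ ← θ + α ∇_θ J(θ)
    theta[s] += alpha * grad_log_pi * A

    s = s_next
```

After training, `policy(0)` should put most of its probability on the rightward action, since positive TD errors along paths toward the rewarding state reinforce it.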