1: Initialize arbitrarily
2: for episode = 1, M do
3:     Initialize s
4:     Repeat
5:         Choose where using probability
6:         Choose a from s using policy derived from Q (e.g., ε-greedy)
7:         Take action a, observe r, s'
8:
9:
10:        Train network using
11:
12:    until s is terminal
13: end for
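
The loop structure of the listing (episodes, ε-greedy action selection, acting, updating until a terminal state) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the method of the listing itself: the contents of steps 5, 8, 9 and 11 and the training target in step 10 are not preserved, so they are not reproduced here. A standard tabular Q-learning update is swapped in where the listing trains a network, and the chain environment, the names `N_STATES`, `step`, `epsilon_greedy` and `run`, and the parameters `alpha` and `gamma` are all hypothetical.

```python
import random

N_STATES = 5        # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def step(s, a):
    """Toy environment (assumption): reward 1 on reaching the last state."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    r = 1.0 if done else 0.0
    return r, s_next, done


def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    # Break ties at random so an all-zero table still explores.
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])


def run(M=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the value estimates (here: a table of zeros).
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for episode in range(M):                 # step 2
        s = 0                                # step 3: initialize s
        done = False
        while not done:                      # steps 4 and 12
            a = epsilon_greedy(Q, s, epsilon, rng)   # step 6
            r, s_next, done = step(s, a)             # step 7
            # Assumed stand-in for the network training step:
            # the standard tabular Q-learning update.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```

Calling `run()` returns the learned table. On this toy chain, moving right should end up with the higher value in every non-terminal state. Replacing the table with a function approximator would bring the sketch closer to the "Train network" step, but the form of that training target would have to come from the original source.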