| 1: | Initialize |
| 2: | for episode = 1, M do |
| 3: | Initialize s |
| 4: | Repeat |
| 5: | Choose |
| 6: | Choose a from s using policy derived from |
| 7: | Take action a, observe |
| 8: | |
| 9: | |
| 10: | Train network using |
| 11: | |
| 12: | until |
| 13: | end for |
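
The loop structure of the listing above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a reconstruction of the steps whose expressions are missing from the listing: the five-state chain environment, the helper names `step`, `q_value`, `choose_action` and `train`, the epsilon-greedy policy, the one-step Q-learning target, and the use of a table of weights in place of a real network are all hypothetical choices made for this example.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1   # assumed discount, step size, exploration rate


def step(s, a):
    """Hypothetical environment: walk along the chain, reward 1 at the right end."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done


def q_value(w, s, a):
    # Stand-in for the network: a linear model over one-hot (state, action)
    # features, which reduces to one weight per pair.
    return w[s][a]


def choose_action(w, s, rng):
    # Epsilon-greedy policy derived from the current value estimates (assumed).
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q_value(w, s, a) for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q_value(w, s, a) == best])


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    w = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]  # initialize
    for _ in range(episodes):                              # for episode = 1, M do
        s, done = 0, False                                 # initialize s
        while not done:                                    # repeat ... until terminal
            a = choose_action(w, s, rng)                   # choose a from s
            s_next, r, done = step(s, a)                   # take action a, observe
            # Assumed target: one-step Q-learning bootstrap from the next state.
            if done:
                target = r
            else:
                target = r + GAMMA * max(q_value(w, s_next, b) for b in ACTIONS)
            # "Train network": one gradient step on the squared error.
            w[s][a] += ALPHA * (target - q_value(w, s, a))
            s = s_next                                     # move to the next state
    return w
```

Calling `train()` returns the learned weights. In this toy chain the weight for moving right ends up larger than the weight for moving left in every non-terminal state, which is the behaviour one would expect if the loop is wired correctly.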