1: | Initialize Q(s, a) arbitrarily |
2: | for episode = 1, M do |
3: | Initialize s |
4: | Repeat |
5: | Choose where using probability |
6: | Choose a from s using policy derived from Q (e.g., ε-greedy) |
7: | Take action a, observe r, s' |
8: |
|
9: |
|
10: |
|
11: | until s is terminal |
12: | end for |
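
The ε-greedy action choice named in the listing above can be sketched as follows. This is only a minimal illustration, assuming tabular action values: the function name `epsilon_greedy`, the list layout `q_row` (one value estimate per action for the current state), and the `rng` parameter are hypothetical and do not come from the listing. It covers the action-selection step only, not the listing's update steps.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one state's action-value estimates.

    q_row   -- list of value estimates, one per action (assumed layout)
    epsilon -- exploration probability in [0, 1]
    rng     -- source of randomness; the random module or a random.Random
    """
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_row))
    # Exploit: choose the action with the highest estimate
    # (ties are broken in favour of the lowest index).
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

With `epsilon = 0` the choice is purely greedy, and with `epsilon = 1` it is uniformly random. Passing a seeded `random.Random` instance as `rng` makes runs reproducible.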