1:	Initialize Q(s, a) arbitrarily
2:	for episode = 1, M do
3:	Initialize s
4:	Repeat
5:	Choose where using probability
6:	Choose a from s using policy derived from Q (e.g., ε-greedy)
7:	Take action a, observe r, s'
8:
9:
10:
11:	until s is terminal
12:	end for
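
The episode loop in this listing can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the listing's own method: the selection in line 5 and the update in lines 8-10 are not specified in the listing, so the sketch omits the former and substitutes the standard tabular Q-learning update for the latter. The toy environment `chain_step`, the function names, and the parameters `alpha`, `gamma`, and `eps` are all hypothetical.

```python
import random


def epsilon_greedy(Q, s, n_actions, eps, rng):
    """With probability eps pick a random action, else a greedy one (ties broken at random)."""
    if rng.random() < eps:
        return rng.randrange(n_actions)
    best = max(Q[s])
    return rng.choice([a for a in range(n_actions) if Q[s][a] == best])


def chain_step(s, a, n_states):
    """Hypothetical toy environment: a corridor of n_states cells.

    Action 1 moves right, action 0 moves left (staying put at cell 0).
    Reaching the last cell ends the episode with reward 1; all other steps give 0.
    """
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return (1.0 if done else 0.0), s_next, done


def train(n_states=5, n_actions=2, M=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    # "Initialize arbitrarily": zeros are used here as one arbitrary choice.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(M):                       # for episode = 1, M do
        s, done = 0, False                   # initialize s
        while not done:                      # repeat ... until s is terminal
            a = epsilon_greedy(Q, s, n_actions, eps, rng)
            r, s_next, done = chain_step(s, a, n_states)
            # ASSUMPTION: standard Q-learning target, standing in for the
            # unspecified update steps of the listing.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


if __name__ == "__main__":
    for state, values in enumerate(train()):
        print(state, [round(v, 3) for v in values])
```

On this toy corridor the learned table should prefer moving right in every non-terminal cell, which gives a quick sanity check that the loop and the assumed update are wired together correctly.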