| Initialize Q values |
| Repeat t times (t = number of learning episodes) |
| Select a random state s |
| Repeat until the end of the learning episode |
| Select an action a |
| Receive an immediate reward r |
| Observe the next state s' |
| Update the Q table according to the update rule |
| Set s to the observed next state |
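
The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions that the listing does not state: Q values are initialized to zero, actions are chosen epsilon-greedily, and the update rule is the standard one-step Q-learning update Q(s, a) ← Q(s, a) + α [r + γ max Q(s', ·) − Q(s, a)]. The names `q_learning`, `chain_step`, `alpha`, `gamma`, `epsilon`, and `max_steps`, as well as the 5-state chain environment, are hypothetical and serve only to make the example runnable.

```python
import random


def q_learning(n_states, n_actions, step, is_terminal, episodes,
               alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning sketch; returns the Q table as a list of lists."""
    rng = random.Random(seed)
    # Initialize Q values (zeros are an assumption, other choices are possible).
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):                 # repeat t times
        s = rng.randrange(n_states)           # select a random state s
        for _ in range(max_steps):            # repeat until the episode ends
            if is_terminal(s):
                break
            # Select an action a (epsilon-greedy is an assumption).
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: q[s][i])
            # Receive the immediate reward r and observe the next state.
            r, s_next = step(s, a)
            # Update the Q table (standard one-step Q-learning rule, assumed).
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            # The next state becomes the current state.
            s = s_next
    return q


def chain_step(s, a):
    """Hypothetical 5-state chain: action 1 moves right, action 0 moves left.

    Entering state 4 (the terminal state) yields reward 1, otherwise 0.
    """
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == 4 else 0.0), s_next


q = q_learning(5, 2, chain_step, lambda s: s == 4, episodes=2000, epsilon=0.5)
# Greedy action for each non-terminal state.
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
```

On this toy chain the greedy policy read from the learned table moves right in every non-terminal state. The exploration rate is set high in the call because Q-learning learns the greedy values regardless of how actions are selected, provided every state-action pair keeps being visited.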