HDQN-based algorithm for routing

Input: status of the node
Output: routing policy

Initialize the evaluation and target networks with parameters δ;
Initialize the experience replay memory D;
for episode = 1, 2, ..., N_eps do
    Initialize state s_t;
    for TS t = 1, 2, ..., T do
        Obtain s_t;
        Select a_t = argmax_a Q(s_t, a; δ) with probability 1 − ε;
        Randomly select a_t with probability ε;
        Forward the data to the next node, and obtain the corresponding reward r_t from the reward formula and the next state s_{t+1};
        Update the current state to s_{t+1} to get the new network input;
        Store the transition {s_t, a_t, r_t, s_{t+1}} in the experience replay memory D;
        if the learning process has started then
            Randomly sample M transitions from the experience replay memory D;
            Update the evaluation network from the corresponding formula;
            Calculate the target Q-value for the current state;
            Update the target network periodically;
        end if
    end for
end for
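
The control flow of the algorithm (ε-greedy action selection, experience replay, and a periodically synchronized target network) can be illustrated with a minimal Python sketch. This is not the authors' HDQN implementation: the five-node topology `GRAPH`, the reward values (+10 on reaching the destination, -1 per hop), all hyperparameters, and every function name are assumptions made for illustration, and a small Q-table stands in for the evaluation and target networks so that the example runs with the standard library alone.

```python
import random
from collections import deque

# Hypothetical toy topology: node -> list of next-hop neighbors.
GRAPH = {0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [4], 4: []}
DEST = 4  # assumed destination node


def select_action(q, state, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    n = len(GRAPH[state])
    if rng.random() < epsilon:
        return rng.randrange(n)
    return max(range(n), key=lambda a: q[state][a])


def step(state, action):
    """Forward the data to the chosen neighbor; return (next state, reward, done)."""
    nxt = GRAPH[state][action]
    reward = 10.0 if nxt == DEST else -1.0  # assumed reward, not the paper's formula
    return nxt, reward, nxt == DEST


def train(episodes=500, max_steps=10, batch=8, gamma=0.9, lr=0.1,
          epsilon=0.2, sync_every=20, seed=0):
    rng = random.Random(seed)
    # Q-tables stand in for the evaluation and target networks.
    q_eval = {s: [0.0] * len(nbrs) for s, nbrs in GRAPH.items()}
    q_target = {s: list(v) for s, v in q_eval.items()}
    memory = deque(maxlen=500)  # experience replay memory D
    updates = 0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = select_action(q_eval, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            memory.append((state, action, reward, nxt, done))
            state = nxt
            if len(memory) >= batch:  # learning starts once M samples exist
                for s, a, r, s2, d in rng.sample(list(memory), batch):
                    # Target Q-value computed from the target table.
                    target = r if d else r + gamma * max(q_target[s2])
                    q_eval[s][a] += lr * (target - q_eval[s][a])
                updates += 1
                if updates % sync_every == 0:  # periodic target update
                    q_target = {k: list(v) for k, v in q_eval.items()}
            if done:
                break
    return q_eval


def greedy_route(q, start=0):
    """Follow the learned greedy policy from start until the destination."""
    path = [start]
    while path[-1] != DEST and len(path) <= len(GRAPH):
        s = path[-1]
        a = max(range(len(GRAPH[s])), key=lambda i: q[s][i])
        path.append(GRAPH[s][a])
    return path
```

Calling `greedy_route(train())` returns the route the learned policy follows from node 0. Swapping the tables for neural networks changes only how `q_eval` and `q_target` are stored and updated; the loop structure stays the same.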