HDQN-based Algorithm for routing

Input: status of the node

Output: routing policy

Initialize the evaluation and target networks with parameters δ;

Initialize experience replay memory D;

for Episode $= 1, 2, \ldots, N_{eps}$ do

Initialize state $s_t$;

for TS $t = 1, 2, \ldots, T$ do

Obtain $s_t$;

Select $a_t = \arg\max_{a_t} Q(s_t, a_t)$ with probability $\varepsilon$;

Randomly select $a_t$ with probability $1 - \varepsilon$;

Forward the data to the next node, then obtain the corresponding reward $r_t$ from the reward formula and the next state $s_{t+1}$;

Set the current state to $s_{t+1}$, which becomes the new network input;

Store transition $\{s_t, a_t, r_t, s_{t+1}\}$ in experience replay memory $D$;

if the learning process has started then

Randomly sample M transitions from experience replay memory;

Update evaluation network from formula;

Calculate the target Q-value for the current state:

$$
y_j =
\begin{cases}
r_j, & \text{if data is successfully forwarded to BS} \\
r_j + \gamma \max_{a_{t+1}} \hat{Q}(\phi_{j+1}, a_{t+1}; \theta), & \text{otherwise}
\end{cases}
$$

Update target network periodically;

end if

end for

end for
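
The three mechanical steps of the listing (action selection, experience replay, and the target Q-value) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ReplayMemory`, `select_action`, `td_target`, and the `done` flag (standing in for "data is successfully forwarded to BS") are hypothetical, and Q-values are passed as plain lists rather than produced by a neural network. The sketch follows the listing's convention that the greedy action is chosen with probability ε and a random action with probability 1 − ε.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity experience replay memory D (hypothetical helper)."""

    def __init__(self, capacity):
        # The oldest transitions are dropped automatically once capacity is hit.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # 'done' is an assumed flag: True when the data reached the BS.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample M transitions without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def select_action(q_values, epsilon):
    """Greedy action with probability epsilon, random action otherwise."""
    if random.random() < epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return random.randrange(len(q_values))


def td_target(reward, next_q_values, gamma, done):
    """y_j = r_j if done, else r_j + gamma * max over next-state Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a full training loop, `next_q_values` would come from the target network evaluated on the next state, and the evaluation network would be fitted toward `td_target` over each sampled minibatch.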