HDQN-based Algorithm for routing

Input: status of the node,

Output: routing policy

Initialize the evaluation and target networks with parameters $θ$ and $θ^{-}$;

Initialize experience replay memory D;

for Episode = 1, 2, ..., N^eps do

Initialize state s_t;

for TS t = 1, 2, ..., T do

Obtain s_t;

Select $a_{t} = \arg \max_{a} Q (s_{t}, a)$ with probability $ε$ ;

Randomly select a_t with probability $1 - ε$ ;

Forward the data to the next node, obtain the corresponding reward from formula and s_t₊₁;

Set the current state to s_t₊₁ to obtain the new network input;

Store transition {s_t, a_t, r_t, s_t₊₁} into experience replay memory;

if the learning process starts then

Randomly sample M transitions from experience replay memory;

Update evaluation network from formula;

Calculate the target Q-value for each sampled transition $j$: $y_{j} = \begin{cases} r_{j}, & \text{if data is successfully forwarded to the BS} \\ r_{j} + γ \max_{a'} \hat{Q} (ϕ_{j + 1}, a'; θ^{-}), & \text{otherwise} \end{cases}$

Update target network periodically;

end if

end for
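
The three mechanical steps of the listing (action selection, experience replay, and the target Q-value) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_action`, `td_target`, and `ReplayMemory` are hypothetical, the Q-values are passed in as plain lists rather than produced by a neural network, and each stored transition carries an extra `reached_bs` flag (not part of the listing's transition tuple) so the terminal case of the target can be evaluated later. The selection rule follows the listing's convention that $ε$ is the probability of the greedy choice.

```python
import random
from collections import deque


def select_action(q_values, epsilon, rng=random):
    """Pick the greedy action with probability epsilon, a random one otherwise.

    This mirrors the convention used in the listing above, where epsilon is
    the probability of exploiting (not of exploring).
    """
    if rng.random() < epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))


def td_target(reward, next_q_values, reached_bs, gamma):
    """Target Q-value for one sampled transition.

    reached_bs is an assumed flag: True when the data was successfully
    forwarded to the BS, in which case the target is the reward alone.
    next_q_values stands in for the target network's outputs at the next state.
    """
    if reached_bs:
        return reward
    return reward + gamma * max(next_q_values)


class ReplayMemory:
    """Fixed-capacity experience replay memory (oldest transitions dropped first)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, reached_bs):
        self.buffer.append((s, a, r, s_next, reached_bs))

    def sample(self, m, rng=random):
        # Uniform sampling without replacement of m stored transitions.
        return rng.sample(list(self.buffer), m)

    def __len__(self):
        return len(self.buffer)


# Illustrative usage with made-up states, rewards, and Q-values.
memory = ReplayMemory(capacity=1000)
memory.store(s=0, a=1, r=-1.0, s_next=2, reached_bs=False)
memory.store(s=2, a=0, r=10.0, s_next=3, reached_bs=True)
batch = memory.sample(2)
targets = [td_target(r, [0.5, 2.0], done, gamma=0.9) for (_, _, r, _, done) in batch]
```

In a full implementation the sampled targets would feed the loss used to update the evaluation network, and the target network's parameters would be copied from the evaluation network at a fixed interval.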