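Data: Layer $f_i$, output gradients $\frac{\delta L}{\delta z_i}$
CPU pinned memory buffer $P_{i-1}$
CPU thread $T_{comp}$
CUDA events $E_{data}^{i}$, $E_{data}^{i+1}$, $E_{comp}^{i}$
CUDA streams $S_{data}$, $S_{comp}$
Result: $\frac{\delta L}{\delta z_{i-1}}$, $\frac{\delta L}{\delta \theta_i}$
Allocate($z_{i-1}$);
$S_{data} \Leftarrow z_{i-1} \leftarrow P_{i-1}$;
$S_{data} \Leftarrow E_{data}^{i}$;
Wait($E_{data}^{i+1}$);
Allocate($\frac{\delta L}{\delta z_{i-1}}$, $\frac{\delta L}{\delta \theta_i}$);
$S_{comp} \Leftarrow \frac{\delta L}{\delta z_{i-1}} \leftarrow \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta z_{i-1}}$;
$S_{comp} \Leftarrow \frac{\delta L}{\delta \theta_i} \leftarrow \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta \theta_i}$.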
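
The ordering of the steps above can be illustrated with a minimal CPU-only sketch. It is not the authors' implementation and uses no CUDA: the `Stream` class is a hypothetical stand-in (one worker thread with a FIFO queue), `threading.Event` plays the role of a CUDA event, and the layer is assumed to be linear, $z_i = W z_{i-1}$, purely for illustration. The names `backward_step`, `pinned_prev`, `e_data_i`, and `e_data_next` are also assumptions. The extra wait on `e_data_i` before the compute tasks is this sketch's own addition so that the toy example is race-free in isolation; the listing itself only waits on $E_{data}^{i+1}$.

```python
import queue
import threading


class Stream:
    """Toy stand-in for a CUDA stream: one worker thread running tasks in FIFO order."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            task()
            self._tasks.task_done()

    def enqueue(self, task):
        self._tasks.put(task)

    def record(self, event):
        # The event is set only after all earlier tasks on this stream have run.
        self._tasks.put(event.set)

    def wait_event(self, event):
        # Later tasks on this stream do not start until the event is set.
        self._tasks.put(event.wait)

    def synchronize(self):
        # Block the caller until every enqueued task has finished.
        self._tasks.join()


def backward_step(W, grad_z_i, pinned_prev, s_data, s_comp, e_data_i, e_data_next):
    """One backward step for an assumed linear layer z_i = W z_{i-1}."""
    rows, cols = len(W), len(W[0])

    z_prev = [0.0] * cols                     # Allocate(z_{i-1})

    def copy_in():
        z_prev[:] = pinned_prev               # z_{i-1} <- P_{i-1}

    s_data.enqueue(copy_in)                   # copy is issued on S_data
    s_data.record(e_data_i)                   # S_data <= E_data^i

    e_data_next.wait()                        # Wait(E_data^{i+1}) on the host thread

    grad_z_prev = [0.0] * cols                # Allocate(dL/dz_{i-1}, dL/dtheta_i)
    grad_W = [[0.0] * cols for _ in range(rows)]

    # Assumption of this sketch only: S_comp waits for the copy recorded above.
    s_comp.wait_event(e_data_i)

    def input_grad():                         # dL/dz_{i-1} = W^T (dL/dz_i)
        for j in range(cols):
            grad_z_prev[j] = sum(grad_z_i[k] * W[k][j] for k in range(rows))

    def weight_grad():                        # dL/dW = (dL/dz_i) outer z_{i-1}
        for k in range(rows):
            for j in range(cols):
                grad_W[k][j] = grad_z_i[k] * z_prev[j]

    s_comp.enqueue(input_grad)
    s_comp.enqueue(weight_grad)
    return grad_z_prev, grad_W


s_data, s_comp = Stream(), Stream()
e_data_i, e_data_next = threading.Event(), threading.Event()
e_data_next.set()  # pretend an earlier step already recorded E_data^{i+1}

grad_z_prev, grad_W = backward_step(
    [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [5.0, 6.0],
    s_data, s_comp, e_data_i, e_data_next,
)
s_comp.synchronize()  # results are valid only after the compute stream drains
print(grad_z_prev, grad_W)  # prints [4.0, 6.0] [[5.0, 6.0], [5.0, 6.0]]
```

As in the listing, `backward_step` returns as soon as the work is enqueued; the caller has to synchronize with the compute stream before reading the gradients. In a real GPU setting the copy and the two gradient products would be asynchronous device operations rather than Python callables.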