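Data: Layer $f_i$, output gradients $\frac{\delta L}{\delta z_i}$

CPU pinned memory buffer $P_{i-1}$

CPU thread $T_{comp}$

CUDA events $E_{data}^{i}$, $E_{data}^{i+1}$, $E_{comp}^{i}$

CUDA streams $S_{data}$, $S_{comp}$

Result: $\frac{\delta L}{\delta z_{i-1}}$, $\frac{\delta L}{\delta \theta_i}$

Allocate($z_{i-1}$);

$S_{data}$: $z_{i-1} \leftarrow P_{i-1}$;

$S_{data}$: record $E_{data}^{i}$;

Wait($E_{data}^{i+1}$);

Allocate($\frac{\delta L}{\delta z_{i-1}}$, $\frac{\delta L}{\delta \theta_i}$);

$S_{comp}$: $\frac{\delta L}{\delta z_{i-1}} \leftarrow \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta z_{i-1}}$;

$S_{comp}$: $\frac{\delta L}{\delta \theta_i} \leftarrow \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta \theta_i}$.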
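The ordering in the listing above can be illustrated with a small CPU-only sketch. It is an analogy under stated assumptions, not the authors' implementation: the hypothetical `Stream` class stands in for a CUDA stream (a FIFO drained by one worker thread), `threading.Event` stands in for a CUDA event, plain dictionaries stand in for pinned host memory and device memory, and each layer is assumed to be the scalar map $z_i = \theta_i z_{i-1}$ so the local derivatives are easy to check. For simplicity the compute stream here waits on the event recorded right after the copy it depends on; how that maps onto the listing's event indices is an assumption of this sketch.

```python
import queue
import threading


class Stream:
    """Toy stand-in for a CUDA stream: a FIFO of tasks run by one worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            task()
            self._tasks.task_done()

    def enqueue(self, task):
        self._tasks.put(task)

    def record(self, event):
        # Analogue of cudaEventRecord: the event fires after all earlier tasks finish.
        self._tasks.put(event.set)

    def wait_event(self, event):
        # Analogue of cudaStreamWaitEvent: blocks later tasks in this stream, not the caller.
        self._tasks.put(event.wait)

    def synchronize(self):
        self._tasks.join()


def backward(theta, pinned, grad_out):
    """Backward pass for the assumed scalar chain z_i = theta[i] * z_{i-1}, i = 1..n."""
    n = len(theta)
    s_data, s_comp = Stream(), Stream()
    device = {}             # activations currently resident on the "GPU"
    grad_z = {n: grad_out}  # dL/dz_i
    grad_theta = {}         # dL/dtheta_i

    for i in range(n, 0, -1):
        e_data = threading.Event()

        def copy(i=i):
            device[i - 1] = pinned[i - 1]  # z_{i-1} <- P_{i-1}

        def compute(i=i):
            grad_z[i - 1] = grad_z[i] * theta[i]           # dz_i/dz_{i-1} = theta_i
            grad_theta[i] = grad_z[i] * device.pop(i - 1)  # dz_i/dtheta_i = z_{i-1}

        # The host thread only enqueues work, so the copy for layer i can
        # overlap with gradient computation still running for layer i + 1.
        s_data.enqueue(copy)
        s_data.record(e_data)
        s_comp.wait_event(e_data)
        s_comp.enqueue(compute)

    s_comp.synchronize()
    return grad_z[0], grad_theta


theta = {1: 2.0, 2: 3.0, 3: 5.0}
pinned = {0: 1.5, 1: 3.0, 2: 9.0}  # z_0, z_1, z_2 saved during the forward pass
grad_z0, grad_theta = backward(theta, pinned, grad_out=1.0)
print(grad_z0, grad_theta)
```

With real CUDA, the copy would typically be a `cudaMemcpyAsync` from the pinned buffer issued on the data stream, and the two gradient products would be kernels launched on the compute stream. The sketch only mirrors the dependency structure between the two streams.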