Data: Layer $f_i$, input activation $z_{i-1}$

CPU pinned memory buffer $P_{i-1}$

CPU thread $T_{data}$

CUDA events $E^{data}_i$, $E^{comp}_i$

CUDA streams $S_{data}$, $S_{comp}$

Result: $z_i$

Allocate($z_i$);

$S_{comp} \leftarrow z_i \leftarrow f_i(z_{i-1})$;

$S_{comp} \leftarrow E^{comp}_i$;

In thread $T_{data}$:

$S_{data} \leftarrow P_{i-1} \leftarrow z_{i-1}$;

$S_{data} \leftarrow E^{data}_i$;

Wait($E^{data}_i$, $E^{comp}_i$);

Free($z_{i-1}$).
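The ordering constraints in the listing above can be illustrated with a small CPU-only simulation. This is a sketch under stated assumptions, not the authors' implementation: the `Stream` class and `forward_with_offload` function are hypothetical names, worker threads stand in for CUDA streams, `threading.Event` stands in for CUDA events, and Python lists stand in for device and pinned host buffers. It also assumes that the wait and the free both run inside the data thread, so the calling thread is never blocked.

```python
import queue
import threading


class Stream:
    """Toy stand-in for a CUDA stream: runs enqueued tasks in FIFO order."""

    def __init__(self, name):
        self.name = name
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            task()

    def enqueue(self, task):
        self._tasks.put(task)

    def record(self, event):
        # The event is set only after every earlier task on this stream ran.
        self._tasks.put(event.set)


def forward_with_offload(f_i, z_prev, pinned, s_comp, s_data):
    """Compute z_i on s_comp while copying z_{i-1} to `pinned` on s_data."""
    out = {}  # stands in for Allocate(z_i)
    e_comp, e_data = threading.Event(), threading.Event()

    s_comp.enqueue(lambda: out.update(z=f_i(z_prev)))  # z_i <- f_i(z_{i-1})
    s_comp.record(e_comp)

    def offload():  # body of thread T_data (assumed scope, see lead-in)
        s_data.enqueue(lambda: pinned.extend(z_prev))  # P_{i-1} <- z_{i-1}
        s_data.record(e_data)
        e_data.wait()   # Wait(E_data, E_comp): both the copy and the
        e_comp.wait()   # computation must finish before z_{i-1} is released
        z_prev.clear()  # Free(z_{i-1})

    t_data = threading.Thread(target=offload)
    t_data.start()
    return out, e_comp, t_data


s_comp, s_data = Stream("comp"), Stream("data")
z0, p0 = [1.0, 2.0, 3.0], []
out, e_comp, t_data = forward_with_offload(
    lambda z: [2 * x for x in z], z0, p0, s_comp, s_data
)
t_data.join()
print(out["z"], p0, z0)  # [2.0, 4.0, 6.0] [1.0, 2.0, 3.0] []
```

The point of waiting on both events is that $z_{i-1}$ has two readers, the layer computation and the host copy, and it may only be freed once both have finished. In a real CUDA program the same roles would be played by asynchronous kernel launches and copies on two streams, with events recorded on each.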