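Data: layer $f_i$; output gradients $\frac{\delta L}{\delta z_i}$; CPU pinned memory buffer $P_{i-1}$; CPU thread $T_{comp}$; CUDA events $E_i^{data}$, $E_{i+1}^{data}$, $E_i^{comp}$; CUDA streams $S_{data}$, $S_{comp}$
Result: $\frac{\delta L}{\delta z_{i-1}}$, $\frac{\delta L}{\delta \theta_i}$

Allocate($z_{i-1}$);
$S_{data} \Leftarrow z_{i-1} \leftarrow P_{i-1}$;
$S_{data} \Leftarrow E_i^{data}$;
Wait($E_{i+1}^{data}$);
Allocate($\frac{\delta L}{\delta z_{i-1}}$, $\frac{\delta L}{\delta \theta_i}$);
$S_{comp} \Leftarrow \frac{\delta L}{\delta z_{i-1}} \leftarrow \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta z_{i-1}}$;
$S_{comp} \Leftarrow \frac{\delta L}{\delta \theta_i} \leftarrow \frac{\delta L}{\delta z_i} \times \frac{\delta z_i}{\delta \theta_i}$.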
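
The last two steps of the algorithm are plain chain-rule products. As a minimal illustration, the sketch below evaluates them on the CPU for a hypothetical bias-free dense layer $z_i = \theta_i z_{i-1}$. The layer type, the function names `forward` and `backward`, and the numeric values are all assumptions made for this example; stream scheduling, events, and the pinned-buffer copy are deliberately left out.

```python
# Toy illustration of the two gradient products, assuming a dense layer
# z_i = theta_i @ z_prev without bias. Names and values are hypothetical.

def forward(theta, z_prev):
    """Compute z_i = theta @ z_prev, with theta given as a list of rows."""
    return [sum(w * x for w, x in zip(row, z_prev)) for row in theta]

def backward(theta, z_prev, grad_z):
    """Return (dL/dz_prev, dL/dtheta) given grad_z = dL/dz_i.

    dL/dz_prev[j]   = sum_k grad_z[k] * theta[k][j]
    dL/dtheta[k][j] = grad_z[k] * z_prev[j]
    """
    grad_z_prev = [sum(g * row[j] for g, row in zip(grad_z, theta))
                   for j in range(len(z_prev))]
    grad_theta = [[g * x for x in z_prev] for g in grad_z]
    return grad_z_prev, grad_theta

theta = [[1.0, 2.0, 3.0],
         [0.0, -1.0, 4.0]]
z_prev = [0.5, -2.0, 1.0]   # stands in for z_{i-1} restored from the pinned buffer
grad_z = [2.0, -1.0]        # stands in for dL/dz_i received from layer i+1

grad_z_prev, grad_theta = backward(theta, z_prev, grad_z)
print(grad_z_prev)   # [2.0, 5.0, 2.0]
print(grad_theta)    # [[1.0, -4.0, 2.0], [-0.5, 2.0, -1.0]]
```

For this assumed layer, the weight gradient reads the layer input $z_{i-1}$, which is why that tensor has to be back in device memory before the products on $S_{comp}$ can run.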