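| Layer         | Layer Context     | Total Context | Input × Output |
|---------------|-------------------|---------------|----------------|
| frame 1       | [t − 2, t + 2]    | 5             | 120 × 512      |
| frame 2       | {t − 2, t, t + 2} | 9             | 1536 × 512     |
| frame 3       | {t − 3, t, t + 3} | 15            | 1536 × 512     |
| frame 4       | {t}               | 15            | 512 × 512      |
| frame 5       | {t}               | 15            | 512 × 1500     |
| stats pooling | [0, T)            | T             | 1500T × 3000   |
| segment 6     | {0}               | T             | 3000 × 512     |
| segment 7     | {0}               | T             | 512 × 512      |
| softmax       | {0}               | T             | 512 × N        |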
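
The Total Context and Input columns follow from the Layer Context column by simple arithmetic, and the short Python sketch below reproduces them as a consistency check. It is an illustration under stated assumptions, not the authors' implementation. The per-frame feature dimension of 24 is inferred from 120 / 5 and is not stated in the table. The factor of 2 in the pooling layer assumes that two statistics per dimension (for example, mean and standard deviation) are concatenated, which matches 1500 → 3000. All function and variable names are hypothetical.

```python
# Frame offsets per layer, copied from the "Layer Context" column.
LAYER_OFFSETS = [
    ("frame 1", [-2, -1, 0, 1, 2]),
    ("frame 2", [-2, 0, 2]),
    ("frame 3", [-3, 0, 3]),
    ("frame 4", [0]),
    ("frame 5", [0]),
]
# Output widths per layer, copied from the "Input × Output" column.
LAYER_OUTPUT_DIMS = [512, 512, 512, 512, 1500]

# Assumption: 120 inputs to frame 1 = 5 spliced frames × 24 features each.
FEATURE_DIM = 24


def total_contexts(layers):
    """Cumulative span of input frames covered by each layer's output."""
    left = right = 0
    totals = []
    for _, offsets in layers:
        left += -min(offsets)   # extra frames reached into the past
        right += max(offsets)   # extra frames reached into the future
        totals.append(left + right + 1)
    return totals


def input_dims(layers, output_dims, feature_dim):
    """Input width of each layer: spliced frame count × previous layer width."""
    dims = []
    prev_width = feature_dim
    for (_, offsets), out_width in zip(layers, output_dims):
        dims.append(len(offsets) * prev_width)
        prev_width = out_width
    return dims


def pooled_dim(frame_dim, num_stats=2):
    """Width after pooling over time, assuming num_stats statistics per dimension."""
    return num_stats * frame_dim


if __name__ == "__main__":
    print(total_contexts(LAYER_OFFSETS))
    print(input_dims(LAYER_OFFSETS, LAYER_OUTPUT_DIMS, FEATURE_DIM))
    print(pooled_dim(LAYER_OUTPUT_DIMS[-1]))
```

Under these assumptions the three functions return the values listed in the table for the five frame-level layers and for the pooling output. The layers after pooling operate on a single segment-level vector, so their context stays at T.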