Layer | Layer Context | Total Context | Input × Output |
--- | --- | --- | --- |
frame 1 | [t − 2, t + 2] | 5 | 120 × 512 |
frame 2 | {t − 2, t, t + 2} | 9 | 1536 × 512 |
frame 3 | {t − 3, t, t + 3} | 15 | 1536 × 512 |
frame 4 | {t} | 15 | 512 × 512 |
frame 5 | {t} | 15 | 512 × 1500 |
stats pooling | [0, T) | T | 1500T × 3000 |
segment 6 | {0} | T | 3000 × 512 |
segment 7 | {0} | T | 512 × 512 |
softmax | {0} | T | 512 × N |
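The Total Context and Input columns follow from simple arithmetic. Each frame-level layer widens the receptive field by its smallest and largest offset, and its input width is the number of spliced frames times the previous layer's output width. Stats pooling then concatenates the per-dimension mean and standard deviation over all T frames, which turns 1500 dimensions into 3000. The pure-Python sketch below reproduces those numbers. It is an illustration, not a trainable network. Two details are assumptions: the 24-dimensional input features are inferred from 120 / 5, and pooling uses the population standard deviation (divide by T).

```python
import math

# Frame offsets relative to t and output width for each frame-level layer,
# copied from the table. frame 1's [t - 2, t + 2] is expanded to every offset.
FRAME_LAYERS = [
    ("frame 1", [-2, -1, 0, 1, 2], 512),
    ("frame 2", [-2, 0, 2], 512),
    ("frame 3", [-3, 0, 3], 512),
    ("frame 4", [0], 512),
    ("frame 5", [0], 1500),
]

# Assumption: 120 inputs / 5 spliced frames => 24-dimensional input features.
FEAT_DIM = 24


def frame_layer_summary(feat_dim=FEAT_DIM):
    """Return (name, total_context, input_dim, output_dim) for each frame layer."""
    rows = []
    lo, hi = 0, 0              # cumulative receptive field relative to t
    in_per_frame = feat_dim    # width of one frame fed into the current layer
    for name, offsets, out_dim in FRAME_LAYERS:
        lo += min(offsets)
        hi += max(offsets)
        total_context = hi - lo + 1
        input_dim = len(offsets) * in_per_frame  # spliced frames concatenated
        rows.append((name, total_context, input_dim, out_dim))
        in_per_frame = out_dim
    return rows


def stats_pooling(frames):
    """Concatenate per-dimension mean and (population) std over all T frames."""
    T = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / T for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / T)
        for d in range(dim)
    ]
    return means + stds


if __name__ == "__main__":
    for row in frame_layer_summary():
        print(row)
    # A 1500-dim frame sequence pools to a 3000-dim segment vector,
    # which segment 6 maps to 512 dimensions.
    print(len(stats_pooling([[0.0] * 1500 for _ in range(10)])))
```

Running it prints one row per frame layer, ending with `('frame 5', 15, 512, 1500)`, followed by `3000`. The segment-level layers need no context arithmetic because they see a single pooled vector per segment, which is why their context is listed as {0} and their total context as T.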