| Layer | Layer Context | Total Context | Input × Output |
| --- | --- | --- | --- |
| frame 1 | [t − 2, t + 2] | 5 | 120 × 512 |
| frame 2 | {t − 2, t, t + 2} | 9 | 1536 × 512 |
| frame 3 | {t − 3, t, t + 3} | 15 | 1536 × 512 |
| frame 4 | {t} | 15 | 512 × 512 |
| frame 5 | {t} | 15 | 512 × 1500 |
| stats pooling | [0, T) | T | 1500T × 3000 |
| segment 6 | {0} | T | 3000 × 512 |
| segment 7 | {0} | T | 512 × 512 |
| softmax | {0} | T | 512 × N |
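
The total-context and input-size columns follow from the layer contexts by simple arithmetic, which the minimal sketch below reproduces for the five frame-level layers. The names `LAYER_OFFSETS`, `OUTPUT_DIMS`, `FEATURE_DIM`, `total_contexts`, and `input_dims` are hypothetical and introduced only for illustration. The per-frame feature dimension of 24 is an assumption inferred from 120 = 5 × 24; the table itself does not state it. The sketch also assumes that each layer concatenates the previous layer's outputs at the listed offsets, and that stats pooling concatenates a mean and a standard deviation per dimension (1500 → 3000).

```python
# Frame offsets relative to the current frame t, read off the table.
LAYER_OFFSETS = [
    ("frame 1", [-2, -1, 0, 1, 2]),  # [t - 2, t + 2]
    ("frame 2", [-2, 0, 2]),         # {t - 2, t, t + 2}
    ("frame 3", [-3, 0, 3]),         # {t - 3, t, t + 3}
    ("frame 4", [0]),                # {t}
    ("frame 5", [0]),                # {t}
]
OUTPUT_DIMS = [512, 512, 512, 512, 1500]  # output sizes from the table
FEATURE_DIM = 24  # assumption: inferred from 120 / 5 spliced frames


def total_contexts(layers):
    """Accumulate left and right context layer by layer."""
    left, right = 0, 0
    result = []
    for name, offsets in layers:
        left += -min(offsets)   # frames reached further into the past
        right += max(offsets)   # frames reached further into the future
        result.append((name, left + right + 1))
    return result


def input_dims(layers, output_dims, feature_dim):
    """Input size = previous output size times number of spliced offsets."""
    dims = []
    prev = feature_dim
    for (name, offsets), out in zip(layers, output_dims):
        dims.append((name, prev * len(offsets), out))
        prev = out
    return dims


if __name__ == "__main__":
    for (name, ctx), (_, n_in, n_out) in zip(
        total_contexts(LAYER_OFFSETS),
        input_dims(LAYER_OFFSETS, OUTPUT_DIMS, FEATURE_DIM),
    ):
        print(f"{name}: total context {ctx}, {n_in} x {n_out}")
    # Stats pooling: mean and standard deviation of the frame 5 outputs.
    print("stats pooling output:", 2 * OUTPUT_DIMS[-1])
```

Under these assumptions the computed values match the table: the total context grows as 5, 9, 15 and then stays at 15 for the two layers that see only {t}, while the spliced input sizes come out as 120, 1536, 1536, 512, and 512.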