The word doesn't get pushed into the FIFO until it's "full", where "full" can be set anywhere from 1 to 32 bits. I was figuring on leaving it at 32 bits, so a 64-byte scanline (512 bits) would reach the ARM core as 16 words.
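To sanity-check that arithmetic, here's a quick sketch of how the "full" threshold changes the number of words the core has to pull per scanline. The function name and parameters are made up for illustration; only the 64-byte scanline and the 1 to 32 bit range come from the post above.

```python
def words_per_scanline(scanline_bytes: int, push_threshold_bits: int = 32) -> int:
    """Count the FIFO words needed to carry one scanline (hypothetical helper)."""
    if not 1 <= push_threshold_bits <= 32:
        raise ValueError("push threshold must be between 1 and 32 bits")
    total_bits = scanline_bytes * 8
    # Ceiling division: leftover bits are counted as one more word.
    # Whether the hardware would actually push a partial word is outside this sketch.
    return -(-total_bits // push_threshold_bits)


# Compare a few thresholds for the 64-byte scanline mentioned above.
for threshold in (8, 16, 32):
    print(threshold, words_per_scanline(64, threshold))
# prints:
# 8 64
# 16 32
# 32 16
```

Leaving the threshold at 32 bits gives the fewest FIFO reads per scanline, which is the reason for not going smaller.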
A slight PLL tweak should provide an overclock to 125.33333 (off the shelf Pi...