It shouldn't change in operation so once the timing is figured out it should be fine.
Someone with a faster scope could try changing the "side 1" on the next-to-last line to "side 0". That changes the output to a fast square wave at 2x pixel clock that goes high when sampling, so you can see how it lines up with a real signal.
That’s the thing: it will be hard to constrain the alignment properly, and you will certainly have to re-decide on a line-by-line basis whether to do the extra clock for alignment. At 2x pixel clock it’s maybe doable, but there are many gotchas.
The way the system clock division works is very jittery, for starters. The integer part of the division is accomplished by skipping that number of system clock cycles, only enabling the PIO clock once in every N system clock cycles. The fractional part works by accumulating the fractional part of the divisor on every PIO cycle and then skipping one additional system clock whenever the accumulated fraction overflows past 1. For a 133 MHz system clock dividing into 31.3344 MHz, the division factor is approximately 4.245. That means the PIO will usually skip 4, 4, 4, then 5 system clock cycles. But every once in a while it will skip 4 four times in a row before the 5, since the division factor is slightly under 4.25.
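To make the skip pattern concrete, here is a rough simulation of that accumulator. It is only a sketch, not a cycle-exact model of the PIO divider: it assumes the fraction is held in units of 1/256 and that the accumulator starts at zero, and `divider_gaps` is just an illustrative name. The divisor is computed from the 133 MHz and 31.3344 MHz figures above.

```python
def divider_gaps(int_part, frac_256, n_cycles):
    """Return the system-clock count between successive PIO clock enables.

    Simplified model: the fractional part (in 1/256 units, an assumption)
    is accumulated every PIO cycle, and one extra system clock is skipped
    whenever the accumulator overflows.
    """
    gaps = []
    acc = 0  # assumed to start at zero
    for _ in range(n_cycles):
        acc += frac_256
        if acc >= 256:
            acc -= 256
            gaps.append(int_part + 1)  # overflow: skip one extra clock
        else:
            gaps.append(int_part)
    return gaps


SYS_MHZ = 133.0
PIO_MHZ = 31.3344

divisor = SYS_MHZ / PIO_MHZ
int_part = int(divisor)
frac_256 = round((divisor - int_part) * 256)

gaps = divider_gaps(int_part, frac_256, 256)
print(f"divisor ~ {divisor:.4f} -> {int_part} + {frac_256}/256")
print(gaps[:32])
```

Change `SYS_MHZ` or `PIO_MHZ` to see how the pattern moves around. The point is that the PIO has no say over where in this sequence it happens to be when HSYNC arrives.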
With the jittery division issue in mind, consider what happens when HSYNC goes low and we enter a line. First of all, the HSYNC transition may land inside the sampling window around the Pi Pico’s clock edge. In that case the Pico will see random data. That is to say, although HSYNC has gone low, it may have been slightly too late for the Pico to see it, so the PIO may randomly take an additional PIO clock to see HSYNC low. Since the sample window around the Pico’s clock has some finite nonzero width, that translates to ever so slightly more than one PIO clock cycle of inaccuracy in sensing the HSYNC transition.
Moreover, you cannot control the state of the fractional clock division accumulator at the moment the HSYNC transition is noticed by the PIO. If the HSYNC low transition could force the accumulator to reset to zero, there would be no issue, but unfortunately at the moment of HSYNC we don’t know whether the PIO will skip 4 or 5 system clock cycles next. Although small compared to the previous jitter effect of one whole PIO cycle, the fractional clock jitter adds directly to the HSYNC detection inaccuracy.
So running at 2x pixel clock and 133 MHz system clock, we have 7 system clocks of inaccuracy sensing HSYNC (5 from the HSYNC sampling uncertainty, 2 from the divider jitter), or about 52.6 nanoseconds, compared to a 63.8 ns pixel clock period.
Then we have to add in the additional skew caused by the difference between our clock and the Mac’s. Doubling the clock speed you mentioned earlier gives 31.32648 MHz, which makes for 44,946 nanoseconds per line. The Mac takes 44,934.64 ns per line, so there’s a difference of a bit over 11 nanoseconds. Therefore the Pico’s sample clock will start out at a particular alignment with the pixels from the Mac and then shift by about 11 nanoseconds by the time the line is done.
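Here is that per-line drift worked out, as a sanity check on the numbers. The only thing assumed beyond the figures in this thread is that the sample count per line is the Mac line time multiplied by twice the 15.6672 MHz pixel clock, rounded to a whole number.

```python
MAC_PIXEL_MHZ = 15.6672     # Mac pixel clock
MAC_LINE_NS = 44934.64      # Mac line time quoted above
PICO_SAMPLE_MHZ = 31.32648  # achievable 2x sample clock quoted above

# Samples per line at 2x pixel clock (assumed to be a whole number).
samples_per_line = round(MAC_LINE_NS * 2 * MAC_PIXEL_MHZ / 1000)

# Time the Pico needs to take that many samples at its own clock rate.
pico_line_ns = samples_per_line / PICO_SAMPLE_MHZ * 1000

# How far the sample point slides relative to the Mac's pixels over one line.
drift_ns = pico_line_ns - MAC_LINE_NS

print(f"samples per line: {samples_per_line}")
print(f"Pico line time:   {pico_line_ns:.2f} ns")
print(f"drift per line:   {drift_ns:.2f} ns")
```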
And the alignment also isn’t fixed; it will change each line because both oscillators are free-running.
Adding it all up, the inaccuracy is actually greater than a pixel clock period. So it may be hard to select the 512 active pixels in the middle of the line and it may also be hard to avoid capturing repeated pixels, etc.
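For anyone who wants to poke at the totals, this is the same back-of-the-envelope budget as a few lines of Python. The 5 + 2 system-clock split is my estimate from above, not a measured value, so treat the result as a rough bound.

```python
SYS_MHZ = 133.0
MAC_PIXEL_MHZ = 15.6672

sys_clock_ns = 1000 / SYS_MHZ

# Estimated HSYNC sensing uncertainty in system clocks (an estimate, not measured):
# 5 for the sampling-window effect, 2 for the fractional divider jitter.
hsync_uncertainty_clocks = 5 + 2
hsync_ns = hsync_uncertainty_clocks * sys_clock_ns

# Skew accumulated over one line: Pico line time minus Mac line time.
drift_ns = 44946.0 - 44934.64

total_ns = hsync_ns + drift_ns
pixel_period_ns = 1000 / MAC_PIXEL_MHZ

print(f"HSYNC uncertainty: {hsync_ns:.2f} ns")
print(f"drift per line:    {drift_ns:.2f} ns")
print(f"total:             {total_ns:.2f} ns")
print(f"pixel period:      {pixel_period_ns:.2f} ns")
print("exceeds one pixel period:", total_ns > pixel_period_ns)
```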
Please do try the experiment outputting the pixel clock though! I am curious to see such a direct representation of the issues I’m referring to.
edit: oh I forgot, you said you don’t have a fast enough scope. I can try it eventually. Gotta order a Pi Pico though.
Edit2:
So the solution is to run at a faster PIO clock frequency. The way you are doing it works for this, just do more nops or whatever between taking samples. Too bad you can only input one bit at a time this way, so the overhead is abysmal. I wish you could instruct the PIO input shifter to take in a single bit without sending the whole word to the FIFO yet. We can DMA from the FIFO into main RAM, but the storage overhead is 8x or 32x (not sure if we can do bytes), so we want the ARM to process it quickly into the proper packed format. The loop to do this has to run at 15.6672 M iterations/sec, or about 8.5 ARM clocks per word processed from the DMA destination at 133 MHz. Probably doable but tight.
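Roughly what the ARM-side loop has to do, sketched in Python just to show the logic (the real thing would be C or assembly on the Pico). Assumptions on my part: each DMA'd word carries its sample in bit 0, and pixels are packed most significant bit first. `pack_samples` is a made-up name.

```python
def pack_samples(words):
    """Pack one-sample-per-word data into bytes, 8 samples per byte.

    Assumes the sample sits in bit 0 of each word and that the first
    sample of each group becomes the most significant bit of the byte.
    """
    out = bytearray()
    acc = 0
    count = 0
    for w in words:
        acc = (acc << 1) | (w & 1)
        count += 1
        if count == 8:
            out.append(acc)
            acc = 0
            count = 0
    if count:
        # Left-align any leftover samples in a final partial byte.
        out.append(acc << (8 - count))
    return bytes(out)


samples = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
packed = pack_samples(samples)
print(packed.hex())

# Cycle budget per sample if the ARM runs at 133 MHz.
clocks_per_word = 133.0 / 15.6672
print(f"{clocks_per_word:.2f} ARM clocks per word")
```

A shift, an OR, and a store every eighth sample is not much work, but at about 8.5 clocks per word the loads from the DMA buffer eat a good chunk of the budget.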
Hmmmmm oh I guess we’re overclocking basically 2x though to do DVI. Hahah then the loop on the ARM will be much less constrained.