WarpSE: 25 MHz 68HC000-based accelerator for Mac SE

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
371
608
93
Columbus, Ohio, USA
I think I just finished the sound fix! It's a bit heavy-handed but it should work well. Every time the sound buffer is written to, the system slows down a bunch for the next 30-40 microseconds. This prevents the Mac from writing sound samples into the buffer too quickly and overwriting the data currently being played. Overall, the Mac performs at an average of about 23 MHz or so while playing single-voice sound and about 16 MHz while four-voice sound is playing. Apps usually only briefly play a single-voice sound and rarely ever use four-voice sound so this shouldn’t affect performance in the usual workloads.

Something I wondered about is the obsession for timing-accurate recreation in the Amiga world. Do they need it that badly? I'm fairly certain Macs don't. To create an accelerator, a '030-compatible design that's functionally correct is likely enough, instruction timings are unlikely to matter much. Only the bus needs timing accuracy, and that can be tested in a PDS device as when supporting bus mastering they basically reimplement the '030 bus.
Yeah, perfect timing accuracy is not really needed on the Mac like it is on the Apple II. I dunno about the Atari and Amiga. I think there are different levels of “perfect timing accuracy” in the context of an accelerator.

All the Mac needs to work mostly correctly is for the machine to maintain some minimum performance over 16-32 microseconds or so while doing floppy operations, otherwise the floppy data won’t be read/written fast enough from the IWM/SWIM.

But there’s the sound issue too on the Mac. The cause of the sound corruption is that the accelerated CPU writes into the sound buffer too quickly and overwrites the data currently being played, causing a skip to occur once per frame as the sound buffer is filled. Solution here is to just provide a crude slowdown mechanism. So after writing to the sound buffer, the WarpSE adds like 16 wait states to all RAM reads and ROM reads in the sound driver area of ROM. The wait states continue for 30-40 microseconds so that repeated writes to the sound buffer trigger the slowdown continuously but it doesn’t persist for long after leaving the sound driver’s inner loop.

This slowdown mechanism is kinda crude. It would be more accurate to slow the fast 68k’s clock to 7.8336 MHz during sound slowdown but the current hardware can’t do that. So I impose the wait states on reads from any place the sound driver could reside: anywhere in RAM or in a particular part of ROM. This approach is successful in slowing the sound driver instruction fetch rate but it has some problems.

Even if we perfectly limited the memory access speed to match that of the Mac SE, the faster clock makes the overall CPU speed faster in the cases where the 68k “thinks for a few clocks” between RAM accesses. So during sound slowdown, the WarpSE decreases the RAM speed to less than Mac SE equivalent. Sort of. It’s hard to exactly specify the Mac Plus/SE RAM speed because the 68k can sort of get in or out of phase with the RAM state machine depending on whether it takes a multiple of four clocks between accesses. But the takeaway is that since the sound driver does a lot of longword operations in between which the 68k “thinks” for two clocks, and these will be faster on the WarpSE, we need to slow the RAM and ROM speed even more to compensate. Works fine since you never get to the timing-critical floppy inner loop within 40 microseconds of leaving the sound inner loop. But still a bit sloppy.
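To make the retriggering behavior concrete, here is a minimal behavioral sketch in Python. It is not the CPLD firmware; the class name, method names, and the 35 microsecond hold time (picked from the middle of the 30-40 microsecond range above) are assumptions for illustration only.

```python
SLOWDOWN_US = 35.0       # assumed hold time; the post only says 30-40 microseconds
EXTRA_WAIT_STATES = 16   # "like 16" wait states, per the post


class SoundSlowdown:
    """Retriggerable one-shot: every sound-buffer write restarts the window."""

    def __init__(self):
        self.slow_until_us = float("-inf")   # no slowdown until the first write

    def on_sound_buffer_write(self, now_us):
        # Each write pushes the end of the slowdown window further out.
        self.slow_until_us = now_us + SLOWDOWN_US

    def wait_states(self, now_us, is_slowed_region):
        """Extra wait states for a read at time now_us.

        is_slowed_region: True for reads from RAM or from the sound-driver
        part of ROM, the places the post says get slowed.
        """
        if is_slowed_region and now_us < self.slow_until_us:
            return EXTRA_WAIT_STATES
        return 0


s = SoundSlowdown()
s.on_sound_buffer_write(0.0)
print(s.wait_states(10.0, True))    # inside the window
print(s.wait_states(40.0, True))    # window has expired
```

Because writes in the sound inner loop arrive much more often than every 35 microseconds, the window never expires while the loop runs, and it lapses shortly after the loop exits.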

So then the next level of accelerator timing accuracy is perfect cycle-accuracy, where some event triggers a slowdown that’s cycle-accurate, as opposed to what the WarpSE does. The Apple II needs this because its floppy routines have exact timing. They just blindly read/write bytes to the Disk II card in a precisely timed loop to access the floppy. On the Mac, there’s some polling of whether the IWM can accept/supply new data so you just have to be fast enough during the floppy inner loop.

And of course when doing your own 68k core, its instruction execution timing must match the original 68k to achieve perfect cycle-accuracy.

Could the Vampire 68080 core be licensed at all?
Yeah, I will ask the Vampire people again sometime. I asked Gunnar, the project leader, in 2015 but he never got back to me. I’ll try again eventually but first things first so I’d like to do some hard ‘030-based accelerators for the SE/30 and LC first.

My usual m.o. with products is to do the revision A first as a design that sort of punches high for its weight but using legacy chips. Then for rev. B I redo it all with modern parts. These often have more latency than legacy parts so rev. B doesn’t usually improve performance, just manufacturability. It’s kinda like Intel’s old tick/tock. You can see this approach in our (still mostly out of stock) Apple II cards. Rev. A of many of them had big old chips and then we redid them with modern chips, maybe increased the RAM capacity, but didn’t add any new capabilities.

My point being, rev. B of the WarpSE will have a 68000 in an FPGA with only slight performance improvements. Maybe we can figure out how to do 8 MB RAM without the formal MMU. Then I’ll focus on another accelerator architecture applicable to the SE/30, LC/LCII/CC, IIsi/IIci. It’ll have a hard ‘030 but I can pursue the Vampire core licensing for the revision B of those products. Eventually I can bring that architecture (with either hard or FPGA ‘030) back to the SE as a “WarpSE Pro” but that’s a long long way off, like multiple years.
 

Zane Kaminski

Still hoping for a Plus version I can pop my 16MHz 68HC000 into. :)
Someone should redo the WarpSE’s layout for the Plus. Basically just the shape of the board needs changed. A few CPLD firmware changes are required too but nothing major. I can do that part. It can go in the Classic too. All of our products are totally open-source, even for commercial use. We are trying to encourage further development of our work and obviously nobody can afford to make these cards for others without getting paid. Anyone interested? @Bolle @Kai Robinson ? It’s mainly just layout work that’s required.
 

max1zzz

Moderator
Staff member
Sep 23, 2021
233
564
93
27
I could do, would just need to find time between all the other projects!
Though now might not actually be a bad time as I have pretty much finished off the two big projects I have been working on (LCIII DIIMO and the TechStep)
hmmm.....
 

Zane Kaminski

I could do, would just need to find time between all the other projects!
Though now might not actually be a bad time as I have pretty much finished off the two big projects I have been working on (LCIII DIIMO and the TechStep)
hmmm.....
Free to be forked at any time: https://github.com/garrettsworkshop/Warp-SE

I’ll put our usual license in later (which is some poorly written document which I have no idea if it’s enforceable) but basically you can do any commercial stuff with it as long as you remove the Garrett’s Workshop logos from the board.
 

JeffC

Tinkerer
Sep 26, 2021
122
79
28
Seattle, WA
Someone should redo the WarpSE’s layout for the Plus. Basically just the shape of the board needs changed. A few CPLD firmware changes are required too but nothing major. I can do that part. It can go in the Classic too. All of our products are totally open-source, even for commercial use. We are trying to encourage further development of our work and obviously nobody can afford to make these cards for others without getting paid. Anyone interested? @Bolle @Kai Robinson ? It’s mainly just layout work that’s required.
Does someone still make something like a Killy Clip that would connect the accelerator to the Plus CPU?
 

Zane Kaminski

Does someone still make something like a Killy Clip that would connect the accelerator to the Plus CPU?
I don’t think so. That’s the reason I don’t wanna sell a Plus accelerator. I usually want my gizmos to be really easy to install and use. I feel like having to solder it on or whatever just makes it too hard for a lot of people and also hard to support paying customers. Like, Big Mess O’ Wires says it’s a pain supporting customers of his ROM SIMM with its fitment issues, so a solder-on accelerator would be even worse. If only someone could make a really reliable and easy-to-install clip as you mentioned…
 

retr01

Senior Tinkerer
Jun 6, 2022
2,473
1
793
113
Utah, USA
retr01.com
Big Mess O’ Wires says it’s a pain supporting customers of his ROM SIMM with its fitment issues, so a solder-on accelerator would be even worse. If only someone could make a really reliable and easy-to-install clip as you mentioned…

Yep. My BMOW Rominator II SIMM wouldn't work in my recapped SE/30 with new SIMM sockets. See my thread all about that. I think Steve at BMOW became somewhat frustrated. Then later, I learned from Kay in Japan that SE/30 boards can warp over time. Oh boy. :)
 

JeffC

I don’t think so. That’s the reason I don’t wanna sell a Plus accelerator. I usually want my gizmos to be really easy to install and use. I feel like having to solder it on or whatever just makes it too hard for a lot of people and also hard to support paying customers. Like, Big Mess O’ Wires says it’s a pain supporting customers of his ROM SIMM with its fitment issues, so a solder-on accelerator would be even worse. If only someone could make a really reliable and easy-to-install clip as you mentioned…
I had a guy solder this male-to-male adapter onto my Plus CPU to mount an accelerator, unfortunately that accelerator seems to have died. It should work pretty well, but still not a simple install since you need to solder to each leg of the CPU, but it does allow the accelerator to be easily removed.

1681164856720.png
 

Zane Kaminski

I had a guy solder this male-to-male adapter onto my Plus CPU to mount an accelerator, unfortunately that accelerator seems to have died. It should work pretty well, but still not a simple install since you need to solder to each leg of the CPU, but it does allow the accelerator to be easily removed.

View attachment 11785
Yeah that looks better. That way you can easily swap a defective accelerator. We could do this I guess but I think it would be better for someone else to tackle the project. I should do a whole new architecture of accelerator for the SE/30 or something rather than re-laying out this thing for the Plus.
 

Zane Kaminski

I've got a fairly complete description of the four-voice sound issue for anyone who's interested in the technical details. I keep talking about it but I never show exactly what I mean with diagrams and stuff.

As we all know, video generation in the classic Mac is based on the 15.6672 MHz system clock. This serves as the video pixel clock. Each line of video is composed of a total of 704 pixel time periods, 512 active and 192 where the display is blanked. Similarly, each frame of video is composed of 370 scanned lines, of which 342 are active and 28 where the display is blanked. We can visualize the timespan of each video frame with a diagram like this:
Code:
                   time --->                 
-----------------------------------------------
|             active video            | blank |
-----------------------------------------------
 ^                                   ^ ^ ^ ^ ^
 |                                   | | | | |
 | first line of frame (line 0)      | | | | | last line of frame (line 369)

                                     | | | |
                                     | | | | VBL (vsync, vblank) ends
                                     | | |
                                     | | | VBL starts
                                     | |
                                     | | first inactive line (line 342)
                                     |
                                     | last active line (line 341)
So to be clear, the diagram above represents one frame of time or about 16.6 ms and some of the video-related things that happen in the course of a frame.
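As a quick sanity check on those numbers, the line and frame durations follow directly from the pixel clock and the pixel/line counts above. This is just arithmetic on the quoted figures; the variable names are my own.

```python
# Figures quoted in the post; names are illustrative only.
PIXEL_CLOCK_HZ = 15_667_200      # 15.6672 MHz system / pixel clock
PIXELS_PER_LINE = 704            # 512 active + 192 blanked
LINES_PER_FRAME = 370            # 342 active + 28 blanked

line_time_us = PIXELS_PER_LINE / PIXEL_CLOCK_HZ * 1e6
frame_time_ms = line_time_us * LINES_PER_FRAME / 1e3
frame_rate_hz = 1e3 / frame_time_ms

print(f"line:  {line_time_us:.2f} us")
print(f"frame: {frame_time_ms:.2f} ms ({frame_rate_hz:.2f} Hz)")
```

That works out to roughly 44.9 microseconds per line and 16.6 ms per frame, which is where the "about 16.6 ms" figure comes from.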

Sound generation is synchronized to the video cycle and a new sound sample from the sound buffer is read from memory and output after each line. The sound buffer has 370 entries (numbered 0-369), exactly the same as the number of video lines. The sound buffer is interleaved every other byte with the disk speed buffer so although it has 370 entries, the whole sound buffer spans 740 bytes since every other byte in the buffer is for the (unused on 800k/1.44M drives) disk speed buffer. So the sound buffer must be updated once per frame with 370 new samples. This occurs as a “VBL task” after the vertical blanking interrupt. Let’s look at the code in the Mac ROM for generating 370 samples of four-voice sound:
Screenshot 2023-04-08 at 8.32.04 PM.png

The significant thing here is that the update starts at byte index 370 in the sound buffer, halfway into it. 185 samples are generated (half of the buffer), and then the driver goes back and generates another 185 samples starting at the beginning of the buffer. So the buffer is first filled up from the middle to the end, and then from the beginning to the middle.

Now let’s look at the beginning of the code for generating single-voice sound:
Screenshot 2023-04-08 at 8.42.40 PM.png

I’ve omitted the loop to compute a sample but the significant thing to notice here is that in this case, the sound driver is not adding 370 to the sound buffer pointer in order to skip 185 samples into the sound buffer. In the simpler single-voice mode, the sound driver adds 64 to the pointer in order to skip 32 samples into the buffer.

So the takeaway here is that in four-voice mode, sound generation starts at position 185 in the sound buffer, halfway into it, whereas in single-voice mode, sound generation starts at position 32 in the sound buffer, less than one tenth of the way into the buffer.
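Here is a small sketch of the fill order implied by those two starting offsets. The helper names are hypothetical, and I am assuming single-voice mode wraps around to the start of the buffer the same way four-voice mode does; the ROM excerpt above only shows its starting offset.

```python
SAMPLES = 370          # one sound sample per video line
BYTES_PER_ENTRY = 2    # sound byte interleaved with a disk-speed byte


def byte_offset(sample_index):
    """Byte offset of a sample within the 740-byte interleaved buffer."""
    return sample_index * BYTES_PER_ENTRY


def fill_order(start_sample):
    """Sample indices in the order they get written, wrapping at the end."""
    return [(start_sample + i) % SAMPLES for i in range(SAMPLES)]


four_voice = fill_order(370 // BYTES_PER_ENTRY)    # pointer += 370 bytes -> sample 185
single_voice = fill_order(64 // BYTES_PER_ENTRY)   # pointer += 64 bytes  -> sample 32

print(four_voice[:3], "...", four_voice[183:187])
print(single_voice[:3])
```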

Now we know from Andy Hertzfeld’s famous story “Sound by Monday,” that the Mac spends almost 50% of its CPU time generating four-voice sound. This matches what we are seeing in the four-voice loop, which takes 160 clock cycles or 20+ microseconds. A sound sample is output every 45 microseconds so yeah, that’s about half the CPU time. Now let’s make some more of those time diagrams but this time indicating the progress of the CPU outputting the sound buffer and the Mac sending out the samples.
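The "about half" estimate can be checked from the figures already given: 160 clocks per sample at the stock 7.8336 MHz against one sample output per video line. A minimal calculation, with names of my own choosing:

```python
CPU_CLOCK_MHZ = 7.8336            # stock 68000 clock
LOOP_CLOCKS = 160                 # four-voice inner loop per sample, per the post
LINE_TIME_US = 704 / 15.6672      # one sample is output per video line

loop_time_us = LOOP_CLOCKS / CPU_CLOCK_MHZ
cpu_share = loop_time_us / LINE_TIME_US

print(f"loop: {loop_time_us:.1f} us, share of CPU: {cpu_share:.0%}")
```

The inner loop alone comes to a bit over 45% of each line time, so "almost 50%" is plausible once loop overhead outside the inner loop is included.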

Here’s a diagram showing what’s happening in the video cycle when the Mac begins generating sound samples in the problematic four-voice mode:
Code:
-----------------------------------------------
|             active video            | blank |
-----------------------------------------------
  ####                 ^
  ^  ^                 ^
  |  |                 |
  |  |                 | sample gen start
  |  |
  |  | max output position in buffer
  |
  | min output position in buffer
I’m showing here the range of possible positions of the sound currently being output when sound samples start to be generated. We don’t know exactly how long it will take to do other VBL stuff before sound generation begins, so we don’t know exactly where in the video cycle the Mac will be when sound generation starts. Therefore the diagram indicates a fairly broad range of where in the frame (and therefore where in the sound buffer) the Mac is currently outputting from.

Now I’m gonna sort of move time forward and do the diagram again. We know that samples are generated in four-voice mode at about 50% of the speed that they are output. So let’s do the diagram again for the future point in time where sample generation is half complete. This time I’m just gonna show the range of sound buffer output positions with hash marks and the sample generation pointer with the ^ symbol:
Code:
-----------------------------------------------
|             active video            | blank |
-----------------------------------------------
             ####                            ^

And now as the final sample is generated:
Code:
-----------------------------------------------
|             active video            | blank |
-----------------------------------------------
                      ^ ####

All these diagrams were scaled for regular 7.8336 MHz CPU speed. Were the processor just a little bit faster, the diagram for when the final sample is generated would look like this:
Code:
-----------------------------------------------
|             active video            | blank |
-----------------------------------------------
                  #### ^
This is the cause of the sound problem! If the CPU generates the samples too quickly, generated samples overwrite samples which have yet to be played. This causes the sound to sort of skip slightly 60 times per second. Since it's happening so fast, it doesn't just sound like a CD or record skipping, you get this 60 Hz content and it sounds like a deep groan superimposed on the music. Even a slight increase in speed during sound generation will cause the sound generation pointer to cross over the sound playback pointer, corrupting the audio data as it is played.
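The crossover can be reproduced with a toy model of the two pointers. This is a rough sketch under stated assumptions, not a cycle-accurate simulation: `speedup` is a hypothetical factor applied to the whole generation loop (a real accelerator is partly limited by motherboard RAM speed), and `start_line=20` is a guess taken from inside the range of playback positions shown in the first diagram.

```python
SAMPLES = 370                     # sound buffer entries, one per video line
HALF = SAMPLES // 2               # four-voice generation starts at sample 185
LINE_US = 704 / 15.6672           # playback advances one sample per line
STOCK_GEN_US = 160 / 7.8336       # stock four-voice loop time per sample


def overwritten_unplayed(speedup, start_line=20):
    """Count samples written ahead of the playback position during the
    second half of the fill (buffer positions 0..184).

    speedup:    hypothetical factor by which sample generation runs faster
    start_line: assumed playback position when generation begins
    """
    gen_us = STOCK_GEN_US / speedup
    clobbered = 0
    for k in range(HALF, SAMPLES):             # second pass over positions 0..184
        write_pos = k - HALF
        play_pos = start_line + (k * gen_us) / LINE_US
        if write_pos > play_pos:               # old sample here was not played yet
            clobbered += 1
    return clobbered


for speedup in (1.0, 1.1, 25 / 7.8336):
    print(f"{speedup:.2f}x -> {overwritten_unplayed(speedup)} samples clobbered")
```

With these assumptions the stock speed finishes just behind the playback pointer with no overlap, while a 10% speedup already overwrites some unplayed samples near the end of the fill, and the overlap grows with the speedup.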

So there is a simple way to fix it. It’s a bit coarse in terms of the speed impact but it works. Whenever the sound buffer is written to, slow the CPU way down for the next 30 microseconds or so. During sound generation the slowdown will be continuously triggered, so sound generation will go at 8 MHz equivalent speed and everything else will go at 25 MHz. If no sounds ever play, the CPU never slows down and you get the full 25 MHz. If there are just one-voice sounds playing, that takes maybe 12% CPU time or so and the impact on speed is not so bad. 12% at 8 MHz and 88% at 25 MHz so that averages out to 23 MHz. But with four-voice sounds, you take a big speed hit. 50% of the CPU time is spent at 8 MHz equivalent speed and the other half at 25 MHz, averaging out to 16 MHz or so.
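The averages above are simple time-weighted means of the two speeds. A minimal sketch, using the rounded 8 MHz and 25 MHz figures from the paragraph above (the function name is my own):

```python
def average_mhz(sound_fraction, slow_mhz=8.0, fast_mhz=25.0):
    """Time-weighted average of the slowed and full-speed equivalent clocks."""
    return sound_fraction * slow_mhz + (1.0 - sound_fraction) * fast_mhz


print(f"no sound:     {average_mhz(0.00):.1f} MHz")
print(f"single-voice: {average_mhz(0.12):.1f} MHz")
print(f"four-voice:   {average_mhz(0.50):.1f} MHz")
```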

The fix works, basically confirming that we have the correct cause of the issue. It was also interesting to find the perfect slowdown speed. The closer I approached the right (slow) speed, the better the sound quality as the amount of overlap in the sound buffer decreased. We will have to test all the various programs using four-voice sound to make sure that they all sound good on the WarpSE.

The "sound slowdown" solution is okay and I think I’m going to use it in the final WarpSE but I would like a better way to fix the problem. Maybe I will write about my better idea another day... Sorry for any typos or wordiness, I was just trying to get this out quick in case anyone is interested in exactly the reason for the sound problem.
 

retr01

This is the cause of the sound problem! If the CPU generates the samples too quickly, generated samples overwrite samples which have yet to be played. This causes the sound to sort of skip slightly 60 times per second. Since it's happening so fast, it doesn't just sound like a CD or record skipping, you get this 60 Hz content and it sounds like a deep groan superimposed on the music. Even a slight increase in speed during sound generation will cause the sound generation pointer to cross over the sound playback pointer, corrupting the audio data as it is played.

Makes sense! :) I can visualize that already in my head. The CPU is getting low on "gas," and the sound suffers. It works fine until "more" sound is needed, so processing time is necessary—no sound processor to take the load off the CPU.

The fix works, basically confirming that we have the correct cause of the issue. It was also interesting to find the perfect slowdown speed. The closer I approached the right (slow) speed, the better the sound quality as the amount of overlap in the sound buffer decreased. We will have to test all the various programs using four-voice sound to make sure that they all sound good on the WarpSE.

The "sound slowdown" solution is okay and I think I’m going to use it in the final WarpSE but I would like a better way to fix the problem. Maybe I will write about my better idea another day... Sorry for any typos or wordiness, I was just trying to get this out quick in case anyone is interested in exactly the reason for the sound problem.

What if you add a sound processor and route sound to that? Yet, isn't there already a chip on the board for sound?
 

lilliputian

Tinkerer
Mar 6, 2022
231
96
28
Los Angeles, California, USA
I have the soldering experience to be able to do something along the lines of JeffC's solution, but forgive my ignorance, is it not possible to simply remove the original processor and install a socket for the accelerator, or is it required as a sort of pass-through to the logic board? I do know that a Processor Direct Slot like on the SE is just what it sounds like: a direct connection to the processor.
 

Zane Kaminski

I have the soldering experience to be able to do something along the lines of JeffC's solution, but forgive my ignorance, is it not possible to simply remove the original processor and install a socket for the accelerator, or is it required as a sort of pass-through to the logic board? I do know that a Processor Direct Slot like on the SE is just what it sounds like: a direct connection to the processor.
Yeah, it can be done, but then the installation instructions have to have a “desolder your CPU and solder in a socket” step, which is fairly difficult to do with the large DIP-64 package of the MC68000. It’s also pretty easy to destroy the through hole plating on the motherboard if you pull the chip out too hard.
 

Zane Kaminski

What if you add a sound processor and route sound to that? Yet, isn't there already a chip on the board for sound?
The key difficulty is that the algorithm used to generate the sound is particular to the CPU speed and video/sound output timing. We could fix it in software but I prefer not to patch the ROM/OS. The issue is that the sound generation algorithm runs too fast and the sound buffer gets corrupted. So any solution that doesn’t involve taking a while between committing successive samples to sound RAM won’t work.

But what we can do in the future hardware with the big FPGA is have a queue that stores sound samples and writes them out to motherboard RAM with a delay. That way the CPU doesn’t need slowed down and the FPGA does the work of delaying the sample writeback instead. We need another RAM buffer in the FPGA for that though so it would be a good use of a large FPGA’s block RAM. Can’t fit in the current CPLD.
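A rough behavioral sketch of that queue idea, in Python rather than HDL. Everything here is hypothetical: the class and method names are made up, and the fixed per-write delay is an assumption, since how the writeback would actually be paced (for example against the video line counter) is not decided here.

```python
from collections import deque


class DelayedSampleQueue:
    """FIFO that holds sound-buffer writes and releases them after a delay."""

    def __init__(self, delay_us):
        self.delay_us = delay_us
        self.pending = deque()       # entries are (release_time_us, address, value)

    def cpu_write(self, now_us, address, value):
        # The CPU write completes immediately from the CPU's point of view.
        self.pending.append((now_us + self.delay_us, address, value))

    def drain(self, now_us, ram):
        """Commit every queued write whose delay has elapsed to `ram`."""
        while self.pending and self.pending[0][0] <= now_us:
            _, address, value = self.pending.popleft()
            ram[address] = value


queue = DelayedSampleQueue(delay_us=100.0)
motherboard_ram = {}
queue.cpu_write(0.0, 370, 0x80)
queue.drain(50.0, motherboard_ram)    # too early, nothing committed yet
queue.drain(100.0, motherboard_ram)   # delay elapsed, sample reaches RAM
print(motherboard_ram)
```

The queue depth is what needs the extra block RAM: up to a full buffer's worth of samples could be in flight when the CPU runs far ahead of playback.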
 

Melkhior

Tinkerer
Jan 9, 2022
98
49
18
All the Mac needs to work mostly correctly is for the machine to maintain some minimum performance over 16-32 microseconds or so while doing floppy operations
That's probably not a big ask. If an 8 MHz 68000 can do it, I'd expect any half-decent softcore in a non-trivial FPGA to be able to do it just as well. Delays related to e.g. DDR refresh can be non-trivial and create big latency spikes, but averaged over such a long period I think they're unlikely to be an issue. And that's going for large amounts of memory; for the amount of memory in SE-era machines, SRAM might be cheap enough to be an option, at least for prototyping.
(I'm currently thinking how to patch the ROM to add an extra entry in the chunk table in 32-bits-booting systems, so that MacOS can use extra RAM 'somehow' connected to the PDS... I won't be mentioning again a specific brand of FPGAs ;-) ).

But there’s the sound issue too on the Mac (...)
Isn't that just on the SE (and maybe earlier) ?

I haven't looked at how this is implemented in hardware on later machines, but I have implemented an audio component for the NuBusFPGA (sound from my HDMI monitor on the Q650!), and with SoundManager 3.0 it can be interrupt-driven (and double-buffered). I would assume post-SE machines would use a similar approach and not care about timing anymore?

Yeah, I will ask the Vampire people again sometime.
Purely an opinion, I'm not fond of proprietary stuff creeping in open-source projects. Unless they're willing to go open-source, I'm not sure there's a point. Maybe ask the Suska people instead/as well? They already share the 68K30L (and others), they might be willing to open the 68K30 (w/ MMU) if it brings them extra help. And it will likely be an easier learning curve if you're looking at the 68K00/68K10 for the WarpSE/FPGA you've mentioned.

At some point someone will advocate just using a Zynq or similar and emulating the 68K by software on the Arm core, Buffee-like... But then why not also software-emulate the peripherals and just run Qemu or some other emulators ?

Edit: I linked the Buffee but I was really thinking about the PiStorm I think. Both nice ideas, but FPGAs are much more fun than boring old software :)
 

max1zzz

Free to be forked at any time: https://github.com/garrettsworkshop/Warp-SE

I’ll put our usual license in later (which is some poorly written document which I have no idea if it’s enforceable) but basically you can do any commercial stuff with it as long as you remove the Garrett’s Workshop logos from the board.
Forked it :)

Guess I better go grab the plus from the loft now!