WarpSE: 25 MHz 68HC000-based accelerator for Mac SE

rjkucia

Tinkerer
Dec 21, 2021
233
81
28
Madison, Wisconsin, USA
i'm curious about this answer as well.

my understanding is maybe it wouldn't, because an '030 is also faster per cycle. that is, it gets more done in a single cycle than a plain 68k?

that is, a 25 MHz 68000 is slower than a 25 MHz 68030.

that's my guess.
Right, but it sounds like there are also some neat tricks happening here with video processing, not to mention that 25 MHz vs 16 MHz is a huge difference and may offset the architectural improvements of the '030.
 

Kai Robinson

TinkerDifferent Board President 2023
Staff member
Founder
Sep 2, 2021
1,153
1
1,162
113
42
Worthing, UK
Any estimate for how this will end up performing? At 25MHz & with the other improvements you've talked about it sounds like it would be a good deal faster than an SE/30, just without the extra memory capacity. Is that accurate?

A 16MHz '000 is half the speed of the '030 at 16MHz - the '030 has some L1 cache at least, and it does more work, per cycle.
 

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
371
608
93
Columbus, Ohio, USA
Yes, it won’t be quite as fast as the SE/30. Maybe 3/4 as fast.

Theoretically (i.e. at peak), the 68030 is a lot faster than the 68000. Ignoring the difference in MHz, the '030 can execute an instruction in just two clock cycles as opposed to four for the 68000. The '030 can also work with 32 bits of data at once, as opposed to 16. So the peak performance of a 68030 is up to 4x that of a 68000 at the same clock speed, and of course the SE/30 has twice the clock speed of the SE and doesn't suffer from bandwidth degradation due to "vampire video" as the SE does. So in theory the SE/30 could be more than 8x faster than the SE.

In reality the 68030 is complex and doesn't achieve its peak performance as often as the 68000 does. The 68030 has more things that have to go right in order to reach peak speed (L1 cache, instruction overlapping, etc.), whereas the 68000 basically achieves peak performance whenever you hook it up to a sufficiently fast RAM system.
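The "8x" figure is just the individual peak advantages multiplied together. Here is a rough Python sketch of that arithmetic, using only the numbers quoted in this post. The variable names are made up for illustration, and this is a peak estimate, not a measurement. The "more than" part comes from vampire video on the SE, which is not modeled here.

```python
# Peak-speed ratio of the SE/30 over the SE, from the figures quoted in the post.
# All names here are illustrative; nothing is measured.
cycles_per_instr_68000 = 4    # peak clocks per instruction, 68000
cycles_per_instr_68030 = 2    # peak clocks per instruction, 68030
data_bits_68000 = 16          # data handled at once, 68000
data_bits_68030 = 32          # data handled at once, 68030
se_clock_mhz = 7.8336         # Mac SE
se30_clock_mhz = 15.6672      # Mac SE/30

# Advantage at equal clock speed: fewer clocks per instruction times wider data.
per_clock_gain = (cycles_per_instr_68000 / cycles_per_instr_68030) * (
    data_bits_68030 / data_bits_68000
)

# Advantage from the faster clock alone.
clock_gain = se30_clock_mhz / se_clock_mhz

peak_gain = per_clock_gain * clock_gain
print(f"same-clock peak gain: {per_clock_gain:.0f}x, with clock: {peak_gain:.0f}x")
```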

Here are some speed figures comparing the Mac SE, Mac SE with WarpSE accelerator, and the SE/30:
Spec                   Mac SE        WarpSE        Mac SE/30
Clock speed            7.8336 MHz    25 MHz        15.6672 MHz
Peak MIPS              1.96 MIPS     6.25 MIPS     7.83 MIPS
ROM bandwidth          3.92 MB/sec   12.5 MB/sec   15.6672 MB/sec
RAM bandwidth          3.27 MB/sec   12.5 MB/sec   15.6672 MB/sec
Video RAM bandwidth    3.27 MB/sec   3.27 MB/sec   3.92 MB/sec

So yeah, 3x faster than the SE in most measures but not quite as fast as the SE/30.
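Most of the peak figures in the table can be reproduced from the clock speed, the clocks per instruction, and the bus width. The sketch below assumes a 4-clock bus access on all three machines (including the SE/30), which is inferred from the table values rather than stated in the post, and the helper names are hypothetical. The 3.27 MB/sec entries reflect video contention and are not derived here.

```python
def peak_mips(clock_mhz, cycles_per_instr):
    """Peak millions of instructions per second at a given clock."""
    return clock_mhz / cycles_per_instr


def peak_bandwidth_mb_s(clock_mhz, bus_bytes, cycles_per_access):
    """Peak bus bandwidth in MB/sec (1 MB = 10**6 bytes)."""
    return clock_mhz * bus_bytes / cycles_per_access


# Assumption: every bus access takes 4 clocks (inferred from the table).
se_mips = peak_mips(7.8336, 4)                       # about 1.96
warpse_mips = peak_mips(25.0, 4)                     # 6.25
se30_mips = peak_mips(15.6672, 2)                    # about 7.83

se_rom = peak_bandwidth_mb_s(7.8336, 2, 4)           # 16-bit bus, about 3.92
warpse_rom = peak_bandwidth_mb_s(25.0, 2, 4)         # 16-bit bus, 12.5
se30_rom = peak_bandwidth_mb_s(15.6672, 4, 4)        # 32-bit bus, 15.6672
se30_vram = peak_bandwidth_mb_s(15.6672, 1, 4)       # 8-bit VRAM, about 3.92

print(round(se_mips, 2), warpse_mips, round(se30_mips, 2))
print(round(se_rom, 2), warpse_rom, round(se30_rom, 4), round(se30_vram, 2))
```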

Also, notice how the SE/30’s VRAM bandwidth is not so good compared to its RAM bandwidth. Since the SE/30’s clock is twice as fast, they cut the VRAM width down to 8 bits from 16, as opposed to doubling it to 32 like they did for the regular RAM. So the VRAM speed is basically unchanged from the SE.

One thing not captured in the table is that when the SE/30 writes a longword (32 bit quantity) to VRAM, the processor bus is occupied for 16 clock cycles (1021 nanoseconds @ 15.6672 MHz) and the 68030 can only continue executing if the requisite data is in its rather small 256+256 byte L1 cache. On the WarpSE, this same operation occupies the fast bus for only 320 nanoseconds, some 3x faster. After that, because of the longword posted write buffer, the MC68k on the WarpSE can continue executing from any of the onboard fast RAM or fast ROM at full speed unless another write to video memory occurs before the first one has finished trickling out of the posted write buffer to the RAM on the SE PDS. So although the SE/30's L1 cache is a bit faster than the WarpSE's fast RAM, the fast RAM is way bigger and thus "shields" the processor from slow bus activity more effectively in some cases.
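The two bus-occupancy times above are just cycle counts converted to nanoseconds. A quick sketch follows. The 8-clock figure for the WarpSE is an assumption (two 16-bit writes at 4 clocks each at 25 MHz), chosen because it matches the 320 ns quoted; the function name is made up.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a number of clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# SE/30: a longword write to 8-bit VRAM holds the bus for 16 clocks.
se30_vram_longword_ns = cycles_to_ns(16, 15.6672)

# WarpSE: assumed two 16-bit writes into the posted write buffer, 4 clocks each.
warpse_posted_longword_ns = cycles_to_ns(8, 25.0)

ratio = se30_vram_longword_ns / warpse_posted_longword_ns
print(f"SE/30: {se30_vram_longword_ns:.0f} ns, WarpSE: {warpse_posted_longword_ns:.0f} ns")
print(f"ratio: {ratio:.2f}x")
```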
 

Zane Kaminski

Okay, as I said, in most operations the SE/30 has the edge, but here's an interesting case where the WarpSE oughta be faster than the SE/30, about 1.5x going by the per-instruction figures below. I didn't think this could be true but now I have found a case where it is.

Referring to line 481 in BitBlt.a in the QuickDraw source code:
Code:
COPY32  MOVE.L  (A4)+,(A5)+                     ;TABLE TO COPY 0..32 WORDS
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 30
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 28
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 26
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 24
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 22
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 20
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 18
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 16
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 14
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 12
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 10
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 8
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 6
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 4
        MOVE.L  (A4)+,(A5)+                     ;wordCount = 2
...

This is the COPY32 unrolled loop in QuickDraw, which copies a longword-aligned horizontal run of pixels from one location to another. Each instruction reads a longword and then writes it to another location. This will be quite slow on the SE/30 if the source and destination are both in VRAM. Even in the likely event that the code above is in the '030's instruction cache, the data cache is so small that the data being copied is unlikely to be cached. Therefore the SE/30 has to spend 32 clock cycles (2042 nanoseconds) reading and then writing the data for each instruction, even if the instructions are in cache and execution is perfectly overlapped. On the WarpSE, the data can be read out of fast RAM while the previous write is posting to VRAM, so each instruction takes an average of 1356 nanoseconds (this varies depending on vampire video).
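Dividing the two per-instruction figures gives the WarpSE's advantage for this loop. A small sketch: the SE/30 time is derived from the 32-clock count, while the WarpSE's 1356 ns is taken as quoted rather than derived, since it depends on vampire video timing. The constant names are illustrative only.

```python
SE30_CLOCK_MHZ = 15.6672
SE30_COPY32_CYCLES = 32          # 16 clocks to read + 16 clocks to write VRAM
WARPSE_COPY32_NS = 1356.0        # average quoted in the post, not derived here

# Time for one MOVE.L (A4)+,(A5)+ on the SE/30, in nanoseconds.
se30_copy32_ns = SE30_COPY32_CYCLES * 1000.0 / SE30_CLOCK_MHZ

# How much faster the WarpSE runs each instruction of COPY32.
copy32_speedup = se30_copy32_ns / WARPSE_COPY32_NS

print(f"SE/30: {se30_copy32_ns:.0f} ns per instruction")
print(f"WarpSE advantage: {copy32_speedup:.2f}x")
```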

Downside is that this is not the only type of unrolled loop in QuickDraw and others don't have such a favorable speed on the WarpSE compared to the SE/30. Consider the FILL32 unrolled loop in the same file:
Code:
FILL32  MOVE.L  D6,(A5)+                        ;TABLE TO FILL 0..32 WORDS
        MOVE.L  D6,(A5)+                        ;wordCount = 30
        MOVE.L  D6,(A5)+                        ;wordCount = 28
        MOVE.L  D6,(A5)+                        ;wordCount = 26
        MOVE.L  D6,(A5)+                        ;wordCount = 24
        MOVE.L  D6,(A5)+                        ;wordCount = 22
        MOVE.L  D6,(A5)+                        ;wordCount = 20
        MOVE.L  D6,(A5)+                        ;wordCount = 18
        MOVE.L  D6,(A5)+                        ;wordCount = 16
        MOVE.L  D6,(A5)+                        ;wordCount = 14
        MOVE.L  D6,(A5)+                        ;wordCount = 12
        MOVE.L  D6,(A5)+                        ;wordCount = 10
        MOVE.L  D6,(A5)+                        ;wordCount = 8
        MOVE.L  D6,(A5)+                        ;wordCount = 6
        MOVE.L  D6,(A5)+                        ;wordCount = 4
        MOVE.L  D6,(A5)+                        ;wordCount = 2
...

Here the routine moves the contents of the register D6 into video RAM, thus filling a 32-pixel strip with a particular pattern. Unfortunately here our advantage is eliminated because there is no need to read the source data--it's in one of the CPU's registers. So the 68030 can execute this routine twice as fast, something like 1021 nanoseconds per instruction and WarpSE turns out to be a bit slower with the same 1356 nanosecond speed per instruction as COPY32.
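To put the FILL32 numbers in perspective, here is a back-of-the-envelope estimate of how long it would take to fill the whole 512x342 one-bit screen with nothing but these MOVE.L instructions. It ignores all loop and setup overhead, so treat it as a lower bound under the per-instruction figures quoted above, not a prediction of real QuickDraw performance. The names are made up.

```python
SCREEN_WIDTH = 512               # pixels, 1 bit per pixel
SCREEN_HEIGHT = 342
PIXELS_PER_LONGWORD = 32

SE30_FILL_NS = 16 * 1000.0 / 15.6672   # 16 clocks per VRAM longword write
WARPSE_FILL_NS = 1356.0                # figure quoted in the post

# One MOVE.L D6,(A5)+ per 32 pixels.
longwords = SCREEN_WIDTH * SCREEN_HEIGHT // PIXELS_PER_LONGWORD

se30_fill_ms = longwords * SE30_FILL_NS / 1e6
warpse_fill_ms = longwords * WARPSE_FILL_NS / 1e6

print(f"{longwords} longword writes")
print(f"SE/30: {se30_fill_ms:.2f} ms, WarpSE: {warpse_fill_ms:.2f} ms")
```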
 

rjkucia

Got it, that’s really interesting!

What’s the advantage of using the 68k on this card instead of a 020, 30, or 40? Is it mainly cost/complexity? I now have this vision in my head of a Quadra 840AV-B&W Compact hybrid that probably isn’t feasible, however I don’t know enough to know why.
 

Zane Kaminski

Yeah, putting an '030 on is sort of a complexity issue but not one of necessary complexity. I could basically swap out the 68HC000 for a 68030 in this design with very few changes. The capabilities of 68030 are much higher than 68000 but only if the processor is kept well-fed by the memory system. I would say that the memory system on the WarpSE is fast-ish but somewhat primitive. It's only 16 bits wide, there is no L2 cache, no burst access, no prefetching, etc. because the speed of MC68k doesn't demand it even at 25 MHz. 68030 has a very fast bus and to achieve maximum speed on a 68030 system, some of those techniques (L2, burst, prefetching, etc.) are required. 68030s are also sort of expensive compared to 'HC000s and also people want an FPU in their 68030 system, adding more cost. Were I to do a 68030-based accelerator, I'd want to include a lot of these features so as to maximize the benefit of the pricey '030. Maybe in the future! But all those features add a lot of complexity so first things first.
 

Patrick

Tinkerer
Oct 26, 2021
434
1
224
43
personally i kind of like sticking with a '000* for maximum compatibility.
I kinda want to use the programs i used with the computers i did back then. but it would be nice if i didn't have to wait as much as i used to.

*i remember when new, the community talked about '030 '040. but did anybody shorten 68000 to '000 ?
 

rjkucia

What kind of compatibility issues do the other models present?

Side note - has anyone ever hacked a '060 into a Mac? Sounds like they should be fully compatible, but probably hard to find.
 

Patrick

I think i saw somebody use an FPGA to emulate a 68k CPU and stick it in the CPU socket of a mac. i think they were able to get System 7 to boot, but maybe they didn't have sound or something.

but i can't find it now and my google-fu is failing me.

EDIT: and i think it was an '060 .....
 

Patrick

yup thats it. thats what i was thinking of



EDIT: and sorry for derailing conversation.
 

alxlab

Active Tinkerer
Sep 23, 2021
287
312
63
www.alxlab.com
Yeah, that was actually talked about at the beginning of this post, but it's a whole other topic. The last news I read about the PiStorm is that it was working except for SCSI. If I remember correctly, Zane's thought was that some of the timings in the CPU emulation are probably off right now, which is keeping it from working 100%. It's an interesting project, but with the general shortage of parts to make the PiStorm, and of the Pis themselves, it has become hard to get and expensive.

Regarding the 060, it's been talked about for ages. Here's a pretty good post that goes over that topic:

https://68kmla.org/bb/index.php?threads/and-daystar-68060-accelerators.19929/

Even if 060s did work in Macs, good luck finding one with an FPU at a decent price.

I think this current project with the 68000 hits the sweet spot of being very doable while keeping the price from going crazy. Mind you, the PiStorm is supposed to be a direct 68000 replacement, so once that matures further there's nothing stopping it from being slapped onto Zane's design afterwards.
 

Ubik

Tinkerer
Nov 2, 2021
41
55
18
Orange County, CA
Regarding upclocked 68000s, here's some Speedometer 3.2 data showing the performance increase from my FDHD SE's 16 MHz 68000 SuperMac Speedcard vs. stock (I included my SE/30 vs. the SE Speedcard for reference as well). So I would guess that a 25 MHz 68000 would indeed come in higher still, somewhere in the mid-3s, as was discussed.

While some may think this is a minimal increase, the difference even the 16 MHz 68000 makes to the user experience is night and day. What's more, compared to the many 030 and 020 accelerator SE options I've used since the early 90s, there are fewer issues. In fact the only issue I have is the distorted sound problem that exists with all accelerated SEs.
 

Attachments: SE16vSE-NoFPU.png, SE16vSE-FPU.png, SE16vSE30.png

Zane Kaminski

So high clock speeds, lots of heat, and sub-par IPC? Motorola was just ahead of the curve, they made a Pentium 4!
Interestingly, there are a lot of good tidbits in the NetBurst/Pentium 4 architecture. I guess they had to use a lot of tricks to make what turned out to be a bad architecture performant enough lol. I have been trying to get through "The Unabridged Pentium 4" but it's very boring lol despite the fact that I usually enjoy technical documentation.
Screen Shot 2022-03-16 at 7.01.18 AM.png
(very unabridged...)

Honestly i'd be really interested to see what would be possible with the Apollo '080 core that's used in the Vampire series of Amiga accelerators.
In late 2015, when I became interested in vintage Macs again, I asked the Apollo guy if he would license the core, even as a “black box” netlist without source, but he never answered... I’ve been wanting to do a 68k-compatible core but obviously I have too many projects aaaand I don’t really know how to make a CISC processor. RISC is comparatively easy and I’ve made several RISC CPU designs in college, but a high-performance CISC is very complicated. And I often wonder if it would just be easier to achieve good compatibility and performance with a software emulator instead. One day I’ll try to make a 6502 core for FPGA and if that goes well, I’ll take what I learned and take a stab at 68k.

Regarding upclocked 68000s, here's some Speedometer 3.2 data showing the performance increase from my FDHD SE's 16 MHz 68000 SuperMac Speedcard vs. stock (I included my SE/30 vs. the SE Speedcard for reference as well). So I would guess that a 25 MHz 68000 would indeed come in higher still, somewhere in the mid-3s, as was discussed.

While some may think this is a minimal increase, the difference even the 16 MHz 68000 makes to the user experience is night and day. What's more, compared to the many 030 and 020 accelerator SE options I've used since the early 90s, there are fewer issues. In fact the only issue I have is the distorted sound problem that exists with all accelerated SEs.
Nice!!! Hopefully the WarpSE will be even more than 50% faster. The Speedcard is pretty good but of course the WarpSE has the full 4.25 MB of RAM+ROM onboard, as opposed to the Speedcard's 16 kB cache, plus the posted write buffer. So hopefully we can pick up speed there although of course there are diminishing performance returns as cache size is increased.

We are also going to try and fix the sound issue in hardware... fingers crossed on this because I don't 100% understand it but I think I know enough to address the problem. What I know is that when using more intensive sound generation modes (i.e. four-voice mode), the Sound Manager of the Mac OS is programmed to update the sound buffer at a different point than in the simpler modes. Presumably in the four-voice mode, the point in the sound buffer where the rewriting begins is closer to the current sample being played, the idea being that the samples will take longer to generate and the current audio being played won't be overwritten. So I think what's happening is that with an accelerated CPU, the Sound Manager blasts through updating the sound buffer very quickly and starts overwriting the current samples being played, in essence suddenly "skipping" the playback a little bit. This happens 60 times per second since the Sound Manager updates the sound buffer every frame. So the sound you get is the audio you're supposed to hear but it keeps skipping around 60 times per second, and then because of the skipping around, the samples often line up so as to produce a 60 Hz sound. Of course the Mac's speaker has little bass so you only hear the harmonics of the 60 Hz fundamental frequency. That's what it sounds like to me--the right sound but all broken up and then some other tones superimposed on top. However, I have not really thoroughly analyzed it and I don't even have an accelerated Plus or SE myself. I'm just referring to YouTube videos on the sound issue that I've seen.

So if this is really true then the solution for the sound problem is to slow the CPU way down when writing to the sound buffer, enforcing a maximum MB/sec bandwidth on that area of memory such that the timing of existing sound drivers is the same as before. Unfortunately that basically means that whatever percentage of time is spent on updating the sound buffer will go unaccelerated, but that's okay. I will have to tune the bandwidth limit a bit but I think this is the right way to go. And of course the limit will just apply to the sound buffer RAM areas so as to avoid slowing down the other areas of RAM that are in use the rest of the time when the CPU is doing the program logic and GUI operations and whatnot.
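The bandwidth-limit idea can be modeled as enforcing a minimum spacing between writes to the sound buffer. The Python sketch below is only a behavioral model, not the CPLD logic: the function name, the 160 ns write spacing (4 clocks at 25 MHz), and the choice of cap (one 16-bit write per stock-SE RAM word time at 3.27 MB/sec) are all assumptions for illustration. The real limit would need tuning, as noted above.

```python
def throttle_writes(request_times_ns, min_gap_ns):
    """Delay each write so consecutive writes are at least min_gap_ns apart.

    request_times_ns: times at which the CPU wants to write (ascending).
    Returns the times at which each write is actually allowed through.
    """
    allowed = []
    earliest_next = 0.0
    for requested in request_times_ns:
        done = max(requested, earliest_next)   # stall the CPU if it is too early
        allowed.append(done)
        earliest_next = done + min_gap_ns
    return allowed


# Hypothetical cap: time for one 16-bit word at the stock SE's 3.27 MB/sec.
STOCK_SE_WORD_NS = 2 / 3.27e6 * 1e9            # roughly 612 ns

# An accelerated CPU issuing back-to-back word writes every 160 ns (assumed).
fast_requests = [i * 160.0 for i in range(4)]
print(throttle_writes(fast_requests, STOCK_SE_WORD_NS))
```

Writes that already arrive slower than the cap pass through unchanged, so only the bursts into the sound buffer get stretched back out to roughly stock timing.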
 

alxlab

In late 2015, when I became interested in vintage Macs again, I asked the Apollo guy if he would license the core, even as a “black box” netlist without source, but he never answered... I’ve been wanting to do a 68k-compatible core but obviously I have too many projects aaaand I don’t really know how to make a CISC processor. RISC is comparatively easy and I’ve made several RISC CPU designs in college, but a high-performance CISC is very complicated. And I often wonder if it would just be easier to achieve good compatibility and performance with a software emulator instead. One day I’ll try to make a 6502 core for FPGA and if that goes well, I’ll take what I learned and take a stab at 68k.

Maybe it would be possible to get a license. I got curious and found this page yesterday:

https://amitopia.com/free-68080-fpga-core-license-by-apollo-team-is-great/

Here's a forum post that goes into further detail about using their 68080:

http://www.apollo-core.com/knowledge.php?b=4&note=28854&x=0&z=3pUrVp



I think it's possible, but you're probably looking at hundreds of dollars for the resulting product.

Personally I still think the PiStorm could work and would provide a more cost-effective solution.

There are also FPGA implementations available, like:

https://github.com/ijor/fx68k

Then there's also the Buffee 68030 that looked interesting:

https://amitopia.com/buffee-68030-a...DSVcwqe0CvBWUVQNl_mjDBnc_5TQoNzRISnbcz_BGYaCw
 

Zane Kaminski

Okay! The prototype design is complete now, I think. I’ve checked all the pinouts and such. Looks good! The schematic is attached to this post for anyone interested to look at. I’ve also got an interactive HTML assembly guide too.

The final thing I added to the WarpSE is a few extra wires to the CPLD which give the ability for the accelerated CPU to not take over the main bus. This would allow the accelerator to work in parallel with the Mac’s CPU as an I/O processor. Interesting but that’s a crazy idea and will probably never be fully implemented in any meaningful way. The main reason for this is to allow the accelerator to disable itself while installed. The way I have it right now is that if you hold both programmer’s keys, flip the power on, then release both buttons (releasing NMI first), the accelerator will inhibit itself and the system will act exactly like a regular Mac SE. I also changed the wiring for the DIP switches slightly so that applying the various button combinations of reset and NMI could also be used in place of the DIP switches. So I might eliminate the DIP switch in the future and just rely on the NMI/IRQ buttons. Not sure yet.

Okay if you have any feature suggestions... speak now or hold your peace until revision B!

Feature list, just to reiterate:
  • MC68HC000 running at 20/25 MHz (selectable)
  • Onboard 4 MB fast RAM and 1 MB fast ROM accessible with zero wait states
  • When using fast ROM, ROM chips on the Mac’s motherboard are not required. Alternatively fast ROM can be disabled and the Mac will use motherboard ROM.
  • Fast ROM can be rewritten during system operation
  • Posted write buffer allows up to two consecutive writes to video memory with zero wait states, improving video performance
  • No sound distortion—sound fix in hardware
  • Accelerator can be disabled by holding programmer’s keys at power-up
  • Onboard USB update system: Takes 5 minutes to update but 20 minutes to verify update data. Update verification optional but recommended
  • 33 MHz clock speed possible with faster RAM+ROM+oscillator swap or extra wait states. Extra wait states not currently implemented
  • Clock speeds lower than 20 MHz may not work due to architectural constraints
Schematic: https://garrettsworkshop.github.io/Warp-SE/Documentation/Schematic.pdf
Interactive placement guide: https://garrettsworkshop.github.io/Warp-SE/Documentation/GW4410A-Assembly.html
Rough theory of operation: https://garrettsworkshop.github.io/Warp-SE/Documentation/index.html
CPLD timing/fit report: https://garrettsworkshop.github.io/Warp-SE/cpld/XC95144XL/WarpSE_html/fit/appletref.htm
 

alxlab

Okay if you have any feature suggestions... speak now or hold your peace until revision B!

I want grey scale graphics support and BlueSCSI built in and a 72-pin RAM slot so I can use my 4MB 72-pin RAM that's been worthless up to now! :mad:

JK JK

I dunno man. Looks like you were pretty thorough on the features already.
 