WarpSE: 25 MHz 68HC000-based accelerator for Mac SE

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
372
612
93
Columbus, Ohio, USA
@JDW I think one of the important differences in what you’re describing compared to autocorrelation is that autocorrelation is the correlation of a signal to itself, historically. That’s as opposed to the correlation between a signal and some other signal. Like I said, I don’t really understand the meaning of autocorrelation. Seems kinda mathematically deep and I would have to brush up on statistics in order to get a handle on the specific meaning. But just the fact that there’s a 60 Hz periodicity present in the distorted sound but not the correct sound indicates the theory is at least on the right track. I would say it's best not to overanalyze at this point. We have a good idea about the cause of the problem, now let's see if the proposed solution works to fix it.
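For anyone else puzzling over the term: autocorrelation compares a signal with a delayed copy of itself, so a disturbance that repeats every T seconds shows up as a peak at a lag of T. The sketch below is a minimal pure-Python illustration, not the analysis actually run on the recordings. The 6 kHz sample rate, the noise standing in for the music, and the 5-sample glitch every 1/60 s are all made-up assumptions.

```python
import random


def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n + k], for lags 0..max_lag, after removing the mean."""
    mean = sum(x) / len(x)
    c = [v - mean for v in x]
    n = len(c)
    return [sum(c[i] * c[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def dominant_period(x, sample_rate, min_period_s, max_period_s):
    """Return the lag (in seconds) with the strongest self-similarity inside the window."""
    lo = int(min_period_s * sample_rate)
    hi = int(max_period_s * sample_rate)
    r = autocorrelation(x, hi)
    best_lag = max(range(lo, hi + 1), key=lambda k: r[k])
    return best_lag / sample_rate


SAMPLE_RATE = 6000  # hypothetical low rate so the pure-Python loop stays fast
rng = random.Random(1)
# Half a second of noise standing in for the "correct" sound: no periodicity.
signal = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE // 2)]
# Add a short glitch every 1/60 s (100 samples at 6 kHz) to mimic the distortion.
for start in range(0, len(signal), SAMPLE_RATE // 60):
    for j in range(5):
        if start + j < len(signal):
            signal[start + j] += 3.0

period = dominant_period(signal, SAMPLE_RATE, 0.005, 0.030)
print(f"strongest repeat: {period * 1000:.1f} ms (about {1 / period:.0f} Hz)")
```

On this synthetic input it should pick a lag of 100 samples, i.e. about 16.7 ms or 60 Hz. The search window (5 to 30 ms here) matters: lag 0 always correlates perfectly, and multiples of the true period (33.3 ms, 50 ms, ...) score almost as high, so they are kept out of the window.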

Hmm but it would also be good to repeat this same kind of analysis on the SE/30 with and without the Carrera040. Whether the distortion on the SE/30 has the same properties as on the SE would give us a clue as to whether the issue is similar or has some different underlying cause.
 

JDW

Administrator
Staff member
Founder
Sep 2, 2021
1,656
1,416
113
53
Japan
youtube.com
it would also be good to repeat this same kind of analysis on the SE/30 with and without the Carrera040. Whether the distortion on the SE/30 has the same properties as on the SE would give us a clue as to whether the issue is similar or has some different underlying cause.
To accomplish that, @Kay K.M.Mods would need to connect an audio recorder to his Carrera040 SE/30 using a cable with a 1/8" plug at each end, and then record the same sound at 16-bit, 44.1 kHz or 48 kHz, with and without the Carrera enabled, just as I did. It might be nice if WinterGames could be used, as that would let me hear any similarities between the distortion on my SE with SpeedCard and on his SE/30 with Carrera040.
 

Melkhior

Tinkerer
Jan 9, 2022
98
50
18
(...) if we developed our own 68k core to be put on an FPGA
The main problem would be the MMU. There's a bunch of open-source 68k-compatible soft-cores with varying levels of compatibility out there, but AFAIK none include the MMU. The Suska '030 is supposed to have one, but it's not in the open-source version.

Another issue for the MMU is the '030 and '040 aren't fully compatible. It wasn't a major issue for HW design back in the day, as the OS would be adapted to the HW and the userland remained compatible (e.g. Sun did that between the sun3 which used a '020 + Sun-conceived MMU and the sun3x which used the '030 built-in MMU).

And that MMU is _complicated_ (the '030 more than the '040). There are already FOSH MMU implementations for e.g. SPARC (LEON cores are GPL; not sure if Temlib qualifies as 'open-source' given its license), RISC-V (many already, more coming) or PowerPC (microwatt), yet the highly-popular-in-its-day '030 is still MIA...

The FPU is optional on Mac and most 68k systems so is not a requirement at first I would think.

Area is probably not a big concern anymore. FPGAs are becoming bigger and cheaper (when there's no chip shortage, that is), and Artix-7 / Spartan-7 aren't that expensive. I have a Linux-running, 100 MHz, quad-core VexRiscv RV32GBK (so complete with shared FPU, crypto, bitmanip) SoC in an Artix-7 100T with room to spare. A full 68030 should fit in the 35T I use for NuBusFPGA (it's less than 50% full with the DDR3 controller and a small embedded version of VexRiscv to run the acceleration code).
 

JDW

Look at that! Again I don't really know what this means, but there's a clear repetition every 16-17 milliseconds. Hmm! So something is happening that often which is messing up the sound.
Someone replied in the Audacity forum to suggest I use track Spectrogram view to more clearly see the noise...



The upper track is the noisy track (SpeedCard enabled) in the Spectrogram views below.

[Image: spectrogram views of both tracks]


Changing it to grayscale makes the vertical bands in the noisy audio even more clear. Measuring from the middle to the middle of two adjacent vertical bands shows them to be 16ms apart, or possibly 16.5ms (1/16.5ms is about 60Hz), confirming what you found, Zane.
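The vertical bands are short, broadband bursts, which is why they stand out in a rolling view even when a single FFT over a whole note doesn't show much. A crude, time-only version of the same idea is to compute the energy in short frames and measure the spacing between the loud ones. This is a hypothetical sketch on synthetic data (48 kHz, a 20-sample burst every 1/60 s, 1 ms frames), not what Audacity does internally.

```python
import random


def frame_energies(x, frame_len):
    """Energy (sum of squares) of consecutive, non-overlapping frames."""
    return [sum(v * v for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]


def mean_band_spacing_ms(x, sample_rate, frame_ms=1.0, threshold_ratio=3.0):
    """Average distance between bursts of energy well above the typical frame energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    e = frame_energies(x, frame_len)
    typical = sorted(e)[len(e) // 2]  # median frame energy
    loud = [i for i, v in enumerate(e) if v > threshold_ratio * typical]
    # Collapse runs of adjacent loud frames into one band, keeping each band's first frame.
    starts = [f for j, f in enumerate(loud) if j == 0 or f - loud[j - 1] > 1]
    return (starts[-1] - starts[0]) / (len(starts) - 1) * frame_ms


SAMPLE_RATE = 48000
rng = random.Random(2)
audio = [rng.uniform(-0.5, 0.5) for _ in range(SAMPLE_RATE // 2)]  # 0.5 s of background
for start in range(0, len(audio), SAMPLE_RATE // 60):  # hypothetical 60 Hz glitch train
    for j in range(20):
        audio[start + j] += 2.0

spacing = mean_band_spacing_ms(audio, SAMPLE_RATE)
print(f"bands every {spacing:.2f} ms -> {1000 / spacing:.1f} Hz")
```

On this input it should report roughly 16.7 ms (about 60 Hz). Individual gaps alternate between 16 and 17 frames because 1/60 s isn't a whole number of 1 ms frames, so averaging over many bands gives a better estimate than measuring a single pair.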
 

Zane Kaminski

Ooo... nice. Is there a thread about this? Is the intent to allow bench-testing a SE | SE/30 'board with a VGA monitor & ATX PSU?
Yeah, we need something to facilitate probing the board for debugging since it’s pretty hard to access inside the chassis. And also for serial production testing purposes. There’s no dedicated thread for it; I just whipped it up in a day or two.



The main problem would be the MMU. There's a bunch of open-source 68k-compatible soft-cores of various level of compatibilities out there, but AFAIK none include the MMU. The Suska '030 is supposed to have one, but it's not in the open-source version.
Yeah, and the existing cores aren't really amenable to adding the MMU either since it's hard to add the TLB into their "pipelines." Hell, I don't think any of the open-source 68k cores are really pipelined at all beyond prefetch, although I have not looked at Suska '030. All the ones I've seen are 68000-type microarchitectures, not the somewhat-more-pipelined '020/'030-type cores, and certainly not fully-pipelined 68040-type microarchitectures. And all the cores I have seen are pretty poorly documented although I guess not much documentation is necessary for a non-pipelined CPU implementation. So none of them are, in my opinion, a great place to start if you wanna turn it into something with an MMU and which can do 1 instruction/clock.

And that MMU is _complicated_ (the '030 more than the '040). There's already FOSH version of MMU for e.g. SPARC (LEON cores are GPL, not sure if Temlib qualify as 'open-source' from its license), RISC-V (many already, more coming) or PowerPC (microwatt), yet the highly-popular-in-its-day '030 is still MIA...
'030 software compatibility is, in my opinion, the best target for a 68k core project. Quadras are pretty fast but the much loved SE/30 is kinda slow. As for nobody managing to implement the MMU... CISC is a feature--it makes it harder for anyone else to clone your CPU hahah. Can we sort of bunt on the whole translation table walk state machine and use a RISC-y software-managed TLB? We could add an "ultravisor" mode (I'm avoiding calling it a hypervisor) and drop into that to handle a TLB miss. Not as fast but it's serviceable if the core does single-cycle execution at 100+ MHz. Most people these days use the 68k Mac OS with virtual memory off anyway so this technique sounds acceptable. We can implement a somewhat incomplete MMU table walk firmware and just add features as operating systems require them.
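To make the software-managed TLB idea concrete, here is a hypothetical Python model: the "hardware" only looks up a small fully associative TLB, and a miss traps to a handler standing in for the proposed "ultravisor" firmware, which does the table walk and refills the entry. The flat dictionary page table, 4 KiB pages, 8-entry LRU replacement, and all names are assumptions for illustration; a real '030-compatible walk would have to follow the multi-level descriptor tables the OS sets up.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # hypothetical 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class TLBMiss(Exception):
    """Raised by the 'hardware' lookup; caught by the ultravisor trap path."""


class SoftTLB:
    """Small fully associative TLB that only software ever refills."""

    def __init__(self, entries=8):
        self.entries = entries
        self.slots = OrderedDict()  # virtual page number -> physical page number, LRU first

    def lookup(self, vpn):
        if vpn not in self.slots:
            raise TLBMiss(vpn)
        self.slots.move_to_end(vpn)  # mark as most recently used
        return self.slots[vpn]

    def refill(self, vpn, ppn):
        if len(self.slots) >= self.entries:
            self.slots.popitem(last=False)  # evict least recently used
        self.slots[vpn] = ppn


def ultravisor_miss_handler(tlb, page_table, vpn):
    """Stand-in for the firmware table walk; a KeyError would become a fault to the guest OS."""
    tlb.refill(vpn, page_table[vpn])


def translate(tlb, page_table, vaddr, stats):
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK
    try:
        ppn = tlb.lookup(vpn)
    except TLBMiss:
        stats["misses"] += 1
        ultravisor_miss_handler(tlb, page_table, vpn)  # slow path
        ppn = tlb.lookup(vpn)  # retry now hits
    return (ppn << PAGE_SHIFT) | offset


# Identity mapping, roughly what running with virtual memory off looks like.
page_table = {vpn: vpn for vpn in range(256)}
tlb = SoftTLB(entries=8)
stats = {"misses": 0}
for _ in range(3):
    for addr in range(0, 4 * 4096, 256):  # loop over a 4-page working set three times
        assert translate(tlb, page_table, addr, stats) == addr
print("misses:", stats["misses"])
```

With a working set that fits in the TLB, only the first touch of each page takes the slow path (4 misses out of 192 accesses here), which is the argument for why a trap-based refill could be acceptable when VM is mostly off.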

Area is probably not a big concern anymore. FPGA are becoming bigger and cheaper (when there's no chip shortage, that is), and Artix-7 / Spartan-7 aren't that expensive. I have a Linux-running, 100 MHz, quad-core VexRiscv RV32GBK (so complete with shared FPU, crypto, bitmanip) SoC in an Artix-7 100T with room to spare, a full 68030 should fit in the 35T I use for NuBusFPGA (it's less than 50% full with the DDR3 controller and a small embedded version of VexRisc to run the acceleration code).
You would think area isn't of prime importance anymore, but I think it creeps up in a few ways. I guess it's not area itself that's the issue... Obviously area can be well spent on useful microarchitectural features in the CPU, but I have seen a lot of needlessly verbose designs. That verbosity translates into deep logic structures which are slow and consume a lot of area. It'd be cool to crank up the 68k core to like 100 MHz on a lower-end FPGA than a Spartan-7. That's hard unless the pipe stages are designed with reference to the FPGA architecture so that each stage has a certain LUT depth and no more. IMO if you are targeting high speed on an FPGA, the little logic circuits comprising your design will reflect that. For example, on a LUT6 FPGA, a 2:1 mux costs the same in the timing budget as a 4:1 mux, but anything wider gets expensive quickly. On a LUT4 FPGA, a 2:1 mux is faster than a 4:1 mux, and since it only uses 3 of the LUT's 4 inputs, you can use the spare input to add another bit of functionality to the 2:1 mux for free.
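A back-of-the-envelope model of the mux point: a LUT-k can hold an m:1 mux when its m data inputs plus log2(m) select inputs fit in k inputs, and wider muxes become a tree of those. This hypothetical sketch counts LUT levels that way and deliberately ignores dedicated wide-mux hardware (e.g. the MUXF7/MUXF8 primitives in Xilinx 7-series slices), so treat the numbers as a rough guide to logic depth, not to timing.

```python
def widest_mux_per_lut(k):
    """Largest power-of-two m such that an m:1 mux (m data + log2(m) selects) fits in one LUT-k."""
    m = 1
    while 2 * m + (2 * m).bit_length() - 1 <= k:
        m *= 2
    return m


def spare_inputs(m, k):
    """LUT inputs left over after an m:1 mux (m a power of two) is placed in a LUT-k."""
    return k - (m + m.bit_length() - 1)


def mux_levels(n, k):
    """LUT levels for an n:1 mux built as a tree of the widest single-LUT muxes."""
    m = widest_mux_per_lut(k)
    levels, reach = 0, 1
    while reach < n:
        reach *= m
        levels += 1
    return levels


for k in (4, 6):
    m = widest_mux_per_lut(k)
    depths = {n: mux_levels(n, k) for n in (2, 4, 8, 16)}
    print(f"LUT{k}: widest mux per LUT = {m}:1, "
          f"spare inputs on a 2:1 = {spare_inputs(2, k)}, levels = {depths}")
```

For LUT6 the model gives one level for 2:1 and 4:1 but two for 8:1 and 16:1; for LUT4, a 4:1 already needs two levels and a 2:1 leaves one input free.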

One example of this kind of sloppy coding comes up frequently in video controller cores. You have some reg [n:0] holding the current horizontal and vertical count. Then it's all too easy to try and generate the sync signals like so: hsync <= (hcount > hsync_start) && (hcount < hsync_end); Noooo! Don't use less-than/greater-than. Do an equality comparison with hsync_start and _end instead, setting/resetting hsync at the appropriate hcount values. That's smaller and faster. There are many other examples which are more subtle than this too but none are coming to mind at the moment.
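The two styles can be checked for equivalence with a quick cycle-by-cycle model. This Python sketch models behavior only, not area or timing, and the 800-clock line with hsync_start = 655 and hsync_end = 752 is a hypothetical VGA-like example. One detail worth noticing: to reproduce the registered (hcount > hsync_start) && (hcount < hsync_end) waveform exactly, the set happens at hsync_start + 1 and the clear at hsync_end. The usual reasoning for why the equality version is cheaper is that comparing against a constant is just an AND of bit matches, whereas a magnitude comparison generally needs subtractor or carry-chain logic.

```python
def hsync_comparator(hcount, start, end):
    """Next value of: hsync <= (hcount > hsync_start) && (hcount < hsync_end);"""
    return 1 if start < hcount < end else 0


def simulate(h_total, start, end, lines=3):
    """Clock both styles over a few scan lines; return the register value after each edge."""
    assert start + 1 < end < h_total, "sync pulse must sit inside the line"
    cmp_wave, eq_wave = [], []
    hsync_eq = 0  # reset value of the equality-style register
    for _ in range(lines):
        for hcount in range(h_total):
            cmp_wave.append(hsync_comparator(hcount, start, end))
            # Equality style: set one count after start, clear at end, otherwise hold.
            if hcount == start + 1:
                hsync_eq = 1
            elif hcount == end:
                hsync_eq = 0
            eq_wave.append(hsync_eq)
    return cmp_wave, eq_wave


cmp_wave, eq_wave = simulate(h_total=800, start=655, end=752)
print("identical:", cmp_wave == eq_wave, "| pulse width:", sum(eq_wave) // 3, "clocks")
```

Both waveforms come out identical, with a 96-clock pulse per line (counts 656 through 751).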

On a sort of similar note, the Xilinx MIG DDR3 controller, while capable of high bandwidth, isn't great to pair with a "shallow" microarchitecture such as the one we would be developing where lower latency improves the execution speed a lot. And it's not just Xilinx's MIG, I don't think anybody's is that great. Lattice FPGAs don't have a hard RAM controller. I think they offer soft cores which are probably also slow lol. I usually roll my own SDR SDRAM controller for vintage/legacy projects and it works well. The controller isn't hard to make and making it myself allows me to optimize the command issuance and whatnot for the application at hand. DDR is harder and has more latency so I would say you should only pick it if you need the size, bandwidth, or you're trying to reduce pin count by going to a narrower interface.
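One reason a hand-rolled SDR controller can beat generic IP for a latency-bound CPU is command scheduling, e.g. whether to leave rows open after an access. This toy Python model counts read latency under an open-row versus a closed-row (auto-precharge) policy. The tRP/tRCD/CL values of 2 clocks each are hypothetical, and the model ignores bank overlap, burst length, refresh, and write timing, and assumes auto-precharge is fully hidden, so it only illustrates the trade-off, not real numbers.

```python
T_RP, T_RCD, CL = 2, 2, 2  # hypothetical precharge, activate-to-read, and CAS latency (clocks)


def read_latency(accesses, open_row_policy=True):
    """Total clocks until data for a sequence of (bank, row) reads, one at a time."""
    open_rows = {}  # bank -> row currently held open in that bank
    total = 0
    for bank, row in accesses:
        current = open_rows.get(bank)
        if current == row:
            total += CL  # row hit: READ only
        elif current is None:
            total += T_RCD + CL  # bank idle: ACTIVATE, then READ
        else:
            total += T_RP + T_RCD + CL  # row conflict: PRECHARGE, ACTIVATE, READ
        if open_row_policy:
            open_rows[bank] = row  # leave the row open for the next access
        # Closed-row policy: READ with auto-precharge, so the bank is idle next time.
    return total


same_row = [(0, 5)] * 8  # e.g. sequential fetches within one row
ping_pong = [(0, 1), (0, 2)] * 4  # two rows fighting over one bank
for name, pattern in (("same row", same_row), ("ping-pong", ping_pong)):
    print(f"{name}: open={read_latency(pattern)} closed={read_latency(pattern, False)}")
```

With these made-up numbers, keeping rows open wins for the same-row pattern (18 vs. 32 clocks) and loses for ping-pong (46 vs. 32), which is the kind of application-specific choice a custom controller lets you make.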

Overall Spartan-7 is good but it's only got 150 I/O pins in the BGA-224 package and the BGA-324 only goes up to 210 pins if you get the "50k" capacity chip. 150 pins is enough to implement just an '030/'040 bus but a great accelerator oughta have an '030 bus plus an independent SDRAM bus. Spartan-6 was nice because it had almost 200 I/O in the BGA-256 so you could fit both buses. New larger FPGAs have also recently been getting separate "high-speed" and "wide-range" I/O. 3.3V is the new 5V on these and only the wide-range I/O pins can interface to 3.3V signals. The high-speed pins max out at 1.8V or something. It's harder to level-shift 1.8V to 5V and there's significantly more delay in the buffer that way, although if you are using DDR2/3 memory instead of SDR (more bandwidth but more latency too) then the high-speed pins aren't a big constraint since you can spend em on the RAM.

I have also sort of decided to quit using Altera and Xilinx parts. Too bad, I even had a Spartan-6-based Mac LC/CC accelerator that was like half done. It doesn't fit in the low-end CPLD-type FPGAs from Lattice for lack of block RAM for the L2 cache, and the higher-end Lattice stuff is unobtainable now. I will say of Lattice, though, that they have not discontinued a single product line since COVID. Many parts have been unavailable, but Lattice assures future availability of the early MachXO small FPGA series, the newest generation of the LC4000 5V-tolerant CPLDs, and their new larger stuff. That certainly can't be said of Altera and Xilinx. XC9500XL is dead, Spartan-6 is dead, MAX II/V would be dead but for a big customer outcry, etc. Obviously FPGA is a fast-moving category where stuff is discontinued a lot, but pre-COVID everyone anticipated Spartan-6 being around until 2029+ since it was so cheap and good, with a lot of market penetration in cheap stuff. So the S-6 EOL came as a bit of a surprise.

It has long bothered me how Altera and Xilinx are not very "proud of their pricing." That is to say, the major American and European distributors (Digi-Key, Mouser, etc.) seem to be required by A. and X. to advertise rather high prices for the chips with no quantity discount. It's hard to get a feel for the real cost of a device in smallish (100-1000) quantities. I guess the solution is to contact my FAE/sales office, but obviously chip companies' sales organizations have little interest in small operations such as mine. That's doubly true for Xilinx and Altera: not only am I a little customer, but I want the little FPGAs! At least at Lattice all they make is the little stuff, so I only check one of the boxes. Lattice's emphasis on selling lots of small parts rather than fewer really big FPGAs is evident in its overall strategy: pre-COVID, it was easy to determine the price of Lattice FPGAs in quantity. You could even buy them in quantities up to the 1000s directly from Lattice's online store for a fair price.

Lattice has the ECP5 as their large-ish (i.e. a bit smaller than Spartan-sized) general-purpose FPGA. You can get it with 12k-85k LUT4s and 500-4000 kbits of block RAM. It looks good for an '030/'040-type core, and it even comes in QFP-144 if you're willing to forgo "backside SDRAM" and maybe externally demux address/data. As I recall, ECP5 was originally only available in BGA; Lattice announced the QFP-144 package post-COVID, so it might not actually exist yet, but I'm excited for it. That would make ECP5 the biggest FPGA in a QFP package. Of course, they have BGA-256 as well with something like 200 I/O.

So I would rather not throw the 50k LE Spartan-7 at the problem and would instead target something like the Lattice ECP5 LFE5U-45F with 45k LEs. And those logic elements are LUT4s, not LUT6s. ECP5 can sort of merge four LUT4s into one LUT6, so compared to the Xilinx it's got about 11.25k LUT6 equivalents versus 32,600 LUT6s in the Spartan-7 with "50k logic cells," but I think that's enough for a pipelined scalar CPU. And if it works well on the ECP5, then we can put it on the Spartan or Artix or whatever, run it faster, enlarge the caches, etc. More power! But I think starting small with a focus on efficient logic machinery would be good.

Check out this talk:
Very interesting: this guy has implemented a Pentium Pro-type superscalar out-of-order x86 on an Altera Arria or Stratix or something. He claims "x86 is easy," but of course he works for Intel... As I recall, his core used like 30k LUT6s. So I think a pipelined, in-order, non-superscalar '030/'040 68k could fit pretty easily in the "45k" (i.e. 11k LUT6s) ECP5.


Edit: Wow, just took a look at the Suska 68K30L source. Looks great! Much better organized than the other 68k cores I've seen. This guy writes good VHDL. Maybe we can enhance this into something with MMU and 1 MIPS/MHz. We'd have to tear up the pipeline but the code is quite clean as a reference and the instruction decoder is very well organized.


Someone replied in the Audacity forum to suggest I use track Spectrogram view to more clearly see the noise...
Great work! You are having much more success than me with the spectral analysis. I could not see anything, but maybe that's because I did the FFT on the whole first note, rather than the rolling spectrogram display you are using. That's why I used autocorrelation haha, it found the thing I was looking for whereas I could not locate a 60 Hz peak or harmonics in the spectrum of the whole note.
 

JDW

Regarding upclocked 68000s, here's some data of the Speedometer 3.2 performance increase from my FDHD SE's 16Mhz 68000 SuperMac Speedcard vs. stock...
Here are my SpeedCard benchmark results, also booted into System 6.0.8, but from an overclocked MacSD, also with a 68881 FPU, and with both settings in the SpeedCard Control Panel enabled. (My motherboard ROMs & IWM are an older set that does not support 1.44MB drives.)

[Image: Speedometer 3.23 benchmark results]

Basically the same scores as you, but my SE (SpeedCard Disabled) was set to the baseline, so it gets 1.00 instead of 0.99. FPU score was exactly the same as yours, which makes sense.

SpeedCard PROM version 4.01 (same as yours?):
[Image: SpeedCard PROM version screen]


The basic "feel" when the SpeedCard is enabled is noticeably faster than the stock 8MHz CPU, and I would define it as "fun usable," whereas the stock processor is just "usable."

All said, Zane's 20 or 25MHz baby should be a real screamer (for a 68000 accelerator).
 

Melkhior

It'd be cool to crank up the 68k core to like 100 MHz on a lower-end FPGA than a Spartan-7. That's hard unless the pipe stages are designed with reference to the FPGA architecture so that each stage has a certain LUT depth and no more than that.
100 MHz on my Artix-7 is fairly easy to reach; the bitmanip instructions in my VexRiscv are from my plugin generator, and are not optimized at all (just some two-cycles stuff for the really heavy instructions) - I'm no hardware designer (and my knowledge of SpinalHDL is very limited).

One example of this kind of sloppy coding comes up frequently in video controller cores. You have some reg [n:0] holding the current horizontal and vertical count. Then it's all too easy to try and generate the sync signals like so: hsync <= (hcount > hsync_start) && (hcount < hsync_end); Noooo! Don't use less-than/greater-than. Do an equality comparison with hsync_start and _end instead, setting/resetting hsync at the appropriate hcount values. That's smaller and faster.
Well, the NuBusFPGA doesn't do that, because that part of the code was written by someone more competent than me. On the other hand, I now need to fix the hardware cursor I added in the SBusFPGA because it does exactly that :)

On a sort of similar note, the Xilinx MIG DDR3 controller, while capable of high bandwidth, isn't great to pair with a "shallow" microarchitecture such as the one we would be developing where lower latency improves the execution speed a lot. And it's not just Xilinx's MIG, I don't think anybody's is that great.
LiteDram does the job for me on 7-series for DDR3. For the MAX10 there's a performance-oriented implementation from the EEVBlog forum. As for latency, the answer has been 'caches, caches, caches' for a while... and having the design a bit latency-tolerant is good for future-proofing it.

In my opinion, you want optimized stuff for 'commercial' or one-person projects. For collaborative open-source stuff, simplicity is a more useful goal, at least short-term: if there's a piece that already works, use that; it can be improved or replaced later. A working, pipelined 68030 would be nice, even if it is a bit large and interfaced with off-the-shelf IPs on a standard board. Once there's a working, easily reusable proof-of-concept, the community will pick it up and improve it.

Overall Spartan-7 is good but it's only got 150 I/O pins in the BGA-224 package and the BGA-324 only goes up to 210 pins if you get the "50k" capacity chip.
The 50k has a larger package with 250 I/Os as well, same as the Artix-7. Does the 68030 need so many? SBusFPGA is making do with a 324 package so at most 210 I/Os, and SBus has separate A and D lines so you have a lot of signals - plus the DDR3 & USB chip on the board (I don't use this USB except for accessing the flash), micro-sd, the USB host and a pmod connector.

New larger FPGAs have also recently been getting separate "high-speed" and "wide-range" I/O. 3.3V is the new 5V on these and only the wide-range I/O pins can interface to 3.3V signals.
Yes, that's an issue... The new Artix UltraScale+ AU10P looks good, but it only has 72 HD I/O, vs. double that for high-speed (plus the GTP stuff), and those HD I/O seem slower than the Artix-7 I/Os :-(

XC9500XL is dead
Darn, I have one of those in NuBusFPGA... (which I (c|sh)ould redesign without by dropping the VGA interface to reclaim a lot of pins).

Lattice has the ECP5 as their large-ish (i.e. a bit smaller than Spartan-sized) general-purpose FPGA.
The ECP5 indeed looks nice (I went with Xilinx because there are so many boards with their FPGAs that I could find one fitting my requirements), and the 85F definitely can fit a pretty decent CPU; that's what is used in this project. But it's still too complex for me to design a PCB with it :-(

Edit: Wow, just took a look at the Suska 68K30L source. Looks great! Much better organized than the other 68k cores I've seen. This guy writes good VHDL. Maybe we can enhance this into something with MMU and 1 MIPS/MHz. We'd have to tear up the pipeline but the code is quite clean as a reference and the instruction decoder is very well organized.
No matter how nice, VHDL is still VHDL :) I wrote the first SBus interface for the SBusFPGA in VHDL, learning on the way, and did not enjoy it. It and Verilog (System or not) feel like writing assembly to me: you don't do it until you have a specific need and run out of options. I rewrote everything in Migen (to leverage LiteX), and I haven't looked back. I just wrote a new NuBus interface in Migen for the NuBusFPGA to replace the Verilog one; it's so much nicer and more productive for a software guy like me...

Also before starting with the Suska - maybe ask the guy if the one he has is not open-source for 'business' reasons, or if he might open it some day. Having a full working 68030 would be a good starting point :)
 

Kai Robinson

TinkerDifferent Board President 2023
Staff member
Founder
Sep 2, 2021
1,179
1
1,184
113
42
Worthing, UK

Burpethead

New Tinkerer
Jun 5, 2022
3
4
3
Hi Zane and everyone! The time, dedication, engineering, and passion put into this project is amazing.

I am just wondering, if you don't mind me asking: since the WarpSE card design will be based on the Mac Performer card, which was compatible with both the Plus and SE, will compatibility with the Plus be considered for the WarpSE? I think that would be so amazing. Of course, the Plus required a Killy Klip clip interface to mate with the accelerator, and Killy Klips are now extremely rare.

I am no engineer, but at its simplest, would it just require some kind of male PDS slot to Killy Klip converter cable? And perhaps some sort of ROM sensor to differentiate the future sound fix between the Plus and SE?

I have e-mailed Kay Koba and he said he would be interested in producing a reproduction Killy Klip if there was enough demand.

Benefits of the Killy-Klip design would be:

1) Compatibility with the Mac Plus (And there are a ton of Pluses out there)
2) With VGA out on the card, it could provide a workaround for the Mac Plus analogue boards, which all have failing, non-replaceable flybacks.
3) Installing into the SE with the Killy Klip would leave open the PDS slot in the SE for other cards.

There are many Mac Pluses, so it would increase the market for the card. Developing a PDS slot to Killy Klip adapter could be very helpful with all the SE add-ons out there, and could be fun for Mac hobbyists.

The clip could be something like a $20-$40 option, or perhaps a future option if not right away.

This guy has taken a different approach to the clips... but I don't know much Japanese...


There is a gentleman in Austria who can source curved Acrylic, an LCD, and a 3D printed bezel for the original compact Macs - that provides an LCD with a proper CRT look. Could be a great match for the VGA out on the WarpSE card (potentially in Pluses). I have ordered his last screen and it has been stuck in customs for over two weeks - I am excited to receive it. However the gentleman has stated he will make more if asked. The link is here: https://www.etsy.com/listing/120581...how_sold_out_detail=1&ref=nla_listing_details

Anyway would LOVE Plus compatibility for the WarpSE.... I have personally always been a Plus fan - and the processor upgrades for them are even more rare than the SE accelerators.

Even if the Plus isn't supported, THANK YOU so much for everything and I may still end up getting one and finding an SE!

- Burp
 


Zane Kaminski

Hi Zane and everyone! The time, dedication, engineering, and passion put into this project is amazing.
Hello and thanks for your interest in the project!

I am just wondering, if you don't mind me asking, if since the WarpSE card design will be based off the Mac Performer card, which was compatible with both the Plus & SE - if compatibility with the Plus will be considered for the WarpSE?
Actually I think you might be a bit mistaken. The WarpSE is an all-new design with various architectural features not present in older accelerators like the Performer. I like reproduction board projects like the Mac SE Reloaded and making small convenience-type changes to the reproduction boards is good, but I think studying an old accelerator design with the intent of copying it into a new accelerator is not too useful. So much has changed that the constraints on the original design are basically irrelevant and you can do something much better.

With VGA out on the card, it could provide a workaround from the Mac Plus Analogue board that all have failing, non replaceable Flybacks.
Actually, there is no VGA output on the WarpSE. Maybe you are thinking of the little breakout board from a few posts ago. That's just for powering the SE motherboard and viewing the screen with it sitting on a board of wood. It's just for testing purposes and I don't plan to release it as a product or anything. Some other people have made similar gizmos and they're pretty simple.

Okay, now getting down to the meat of your question: Plus compatibility. There is one gotcha which prevents the current WarpSE chipset from being compatible with the Plus using a Killy Klip. The WarpSE uses the Mac's 15.6672 MHz clock signal, but only the 7.8336 MHz clock goes to the 68000 on the Mac's motherboard, and the 68000's pins are all a Killy Klip can reach. So to make it work you'd have to additionally tap the 15.6672 MHz clock. Maybe I can fix this in a future revision of the WarpSE chipset so it only depends on the 7.8336 MHz clock. With that, you could make a 68k-to-PDS adapter and it'd work, but at that point it'd be better to re-layout the board for the Plus to accept the Killy Klip. Lemme get finished with this first version and then maybe I can revise it to eliminate the dependence on that pesky clock signal.
 

Zane Kaminski

Not sure why the Xilinx is dead - there are plenty on Mouser, and they're not listed as NRND (ie, when the manufacturer is taking final orders):
I might be wrong, but I think the general understanding is that Xilinx is EOL'ing a bunch of their older stuff. Spartan-6 has not been officially discontinued, but the word people are getting from Xilinx is that its days are numbered. I think the same is true of XC9500XL. Hope not, but I had basically assumed so. I really oughta contact my local FAE/sales people and see what they say. Fortunately I already own 115 of the XC95144XLs for the WarpSE project, and I just got a good quote on another 150 or so from a tiny distributor in India who has like 6000 or something. Most of the big distributors have jacked up the price a lot or run out of XC9500XLs.


No matter how nice, VHDL is still VHDL :) I wrote the first SBus interface for the SBusFPGA in VHDL, learning on the way, and did not enjoy it. It and Verilog (system or not) feel like writing assembly to me. You don't do it until you have a specific need and run out of options. I rewrote everything in Migen (to leverage Litex), and I haven't looked back. I just wrote a new NuBus interface in Migen for the NuBusFPGA to replace the Verilog one, it' s so much nicer and productive for a software guy like me...

Also before starting with the Suska - maybe ask the guy if the one he has is not open-source for 'business' reasons, or if he might open it some day. Having a full working 68030 would be a good starting point :)
Maybe I should get on some of these new HDLs... VHDL is not my style, a bit too much typing to get something done, but I do like Verilog, despite some of its flaws. It's kinda like C. In a complex program you end up inventing parts of C++ (e.g. virtual method calls, etc.). So similarly in Verilog, the language makes it hard to reuse code so you end up creating intervening abstractions to do stuff. But it's usually workable.
 

Burpethead

Okay now getting down to the meat of your question, Plus compatibility. There is one gotcha which prevents the current WarpSE chipset from being compatible with the Plus using a Killy Klip. The WarpSE uses the Mac's 15.6672 MHz clock signal but only the 7.8336 MHz clock goes to the 68000 on the Mac's motherboard. So to make it work you'd have to additionally tap the 15.6672 MHz clock. Maybe I can fix this in a future revision of the WarpSE chipset and only depend on the 7.8336 MHz clock. With that, you could make a 68k-to-PDS adapter and it'd work, but it'd be better at that point to re-layout the board for the Plus to accept the Killy Klip. Lemme get finished with this first version and then maybe I can revise it to eliminate dependence on that pesky clock signal.
Thank you so much! All makes sense!
 

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
Columbus, Ohio, USA
@Melkhior I looked for FPGAs that could host the Suska 68K00/10/30L but there's really nothing big enough that's currently obtainable. I know where to get the larger MachXO2 and MachXO3 variants from Lattice but they are still quite small and also expensive per LUT compared to Spartan or whatever since they have the on-die configuration flash. MachXO3 has max 9400 LUT4s and 432kbits of block RAM. Maybe Suska 68K00/10 could fit but it will likely be tight given the larger FPGAs on the Suska-III-series boards.

More about the Suska cores: the 68K00 is an older design from 2006, whereas the 68K10 and 68K30L are newer and can evidently run faster. The 68K10 has an MC68000 compatibility pin which you can set appropriately, and it lets you execute the MOVE from SR instruction in user mode (it's privileged on the '10), so if you ground that or whatever, the core acts like a 68000. The older 68K00 also implements some '020+ addressing modes not available on the MC68000. The 68K10 does not implement these, making it better for compatibility.
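For anyone unfamiliar with the 68000/68010 difference that the compatibility pin papers over, here's a tiny Python behavioral model. It's only an illustration under stated assumptions: `move_from_sr` and the `mc68000_compat` flag are hypothetical names standing in for the core's compatibility pin, not anything taken from the Suska source, and the model ignores the other 68000/68010 differences.

```python
class PrivilegeViolation(Exception):
    """Stands in for the privilege-violation exception a real CPU would take."""


SR_SUPERVISOR = 0x2000  # S bit (bit 13) of the 68k status register


def move_from_sr(sr: int, mc68000_compat: bool) -> int:
    """Model what MOVE from SR does for a given SR value.

    mc68000_compat=True mimics a 68000: the instruction is legal in any mode.
    mc68000_compat=False mimics a 68010: it traps unless the S bit is set.
    (Hypothetical flag name; it stands in for the core's compatibility pin.)
    """
    if not (sr & SR_SUPERVISOR) and not mc68000_compat:
        raise PrivilegeViolation("MOVE from SR executed in user mode")
    return sr & 0xFFFF


# User-mode code that reads SR works on a 68000...
print(hex(move_from_sr(0x0004, mc68000_compat=True)))  # 0x4

# ...but traps on a 68010.
try:
    move_from_sr(0x0004, mc68000_compat=False)
except PrivilegeViolation as exc:
    print("trap:", exc)
```

Old Mac software that reads SR from user mode is exactly the kind of code that would trip over the second case, which is why a 68000-compatible mode matters for an accelerator.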

Too bad I can't find anything to put it in! Maybe I will try compiling the 68K10 in Lattice Diamond to see how many LUTs it would use in the MachXO2/3.

Edit:

Suska 68K10 mapping to MachXO2-7000 results:
[Screenshot: Lattice Diamond mapping report]

Fail!!! 17,182 LUTs required out of 6,864. Only 2,313 registers required, though.
Looks like we need something bigger than the Lattice nonvolatile stuff. Lemme see what timing 68K10 gets on ECP5.
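Putting the mapping result next to the device sizes quoted earlier makes the gap plain. This quick Python check assumes LUT counts carry over roughly one-to-one between MachXO2 and MachXO3 (both are LUT4-based), which is only an approximation; the 17,182 figure comes from the MachXO2 run.

```python
def utilization(required_luts: int, available_luts: int) -> float:
    """Fraction of a device's LUTs a design needs; above 1.0 means it won't fit."""
    return required_luts / available_luts


SUSKA_68K10_LUTS = 17_182  # from the Lattice Diamond mapping report above

# Device sizes quoted in this thread (MachXO3 figure is the family maximum).
DEVICES = {
    "MachXO2-7000": 6_864,
    "largest MachXO3 (9400 LUT4s)": 9_400,
}

for name, luts in DEVICES.items():
    u = utilization(SUSKA_68K10_LUTS, luts)
    print(f"{name}: {u:.0%} of available LUTs -> {'fits' if u <= 1 else 'does not fit'}")
```

That works out to about 250% of the MachXO2-7000 and roughly 183% of even the biggest MachXO3, so no amount of tuning would squeeze it into the nonvolatile parts.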


Edit2: Preliminary results for the Suska-series cores on the ECP5 FPGA are only okay. I'm using Lattice Diamond and Synplify to compile the design. The 68K10 fits well in the ECP5 25k, but it's hard to push it past 25 MHz. The best I could get (post-route) was 29.2 MHz, so hardly an improvement on 25 MHz lol. The 68K30L was slow too, only 16 MHz. I think this is a synthesis problem, since the ECP5 is pretty good. Lemme try changing the compiler settings and maybe I can increase the speed.
 

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
Columbus, Ohio, USA
Okay, last update from me about the Suska core, since we can't actually make anything with an FPGA big enough to put it on.

Well, I was tired last night when I was compiling the Suska core for the Lattice ECP5 FPGA, so I missed something obvious. In retrospect, one likely reason the timing analysis didn't come out faster is this: in the 68000, the ALU takes two clocks to come up with a result for a 16-bit operation. The control logic in the Suska core is responsible for not consuming a result coming out of the ALU on the cycle right after new input is put into the ALU. To indicate this, there should be a "multicycle" constraint in the timing constraints, since it would be hard for the timing analyzer to figure this out by itself. A multicycle constraint tells the timing analyzer that a long timing path's result is not supposed to be consumed on the clock right after its input is put in. Hm. Maybe I can add the requisite multicycle constraints, but doing so will basically require that I figure out how the whole core works, which I was kinda trying to avoid.

So hopefully the slowest path in the core, which is restricting us to 25 MHz, is actually only sampled two clocks after it starts to change, in which case the maximum frequency would be more like 50 MHz.
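Here's a rough Python model of how a multicycle constraint changes the achievable clock. The path names and delays are made-up illustrative numbers, not figures from the actual Suska timing report; the point is just that a path allowed N clocks only needs to settle within N clock periods.

```python
from dataclasses import dataclass


@dataclass
class TimingPath:
    name: str
    delay_ns: float  # register-to-register delay after place and route
    cycles: int = 1  # clocks the path is allowed (1 means no multicycle constraint)


def fmax_mhz(paths: list) -> float:
    """Highest clock frequency at which every path meets its setup requirement.

    A path allowed N cycles only needs period >= delay / N, so the achievable
    frequency is the smallest N / delay over all paths.
    """
    return min(p.cycles / p.delay_ns for p in paths) * 1000.0  # 1/ns -> MHz


# Hypothetical numbers: a slow two-clock ALU path and a faster control path.
unconstrained = [TimingPath("alu_result", 40.0), TimingPath("control", 20.0)]
constrained = [TimingPath("alu_result", 40.0, cycles=2), TimingPath("control", 20.0)]

print(fmax_mhz(unconstrained))  # 25.0
print(fmax_mhz(constrained))    # 50.0
```

In a real constraint file this would be written with the tool's multicycle syntax (e.g. SDC's `set_multicycle_path`), usually with a matching hold adjustment; and it only helps if every other path is already fast enough, as the control path is in this toy example.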
 

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
Columbus, Ohio, USA
Just got the WarpSE prototypes back from JLCPCB! I had most of the parts put on by them but I have to add the 68000, CPLD, RAM, PDS connector, update system crystal, 25 MHz oscillator, 3.3V regulator, and put the ROMs in the sockets. I also have to make a program to rearrange the ROM file before I can program the ROM chips since I don’t have the 68k address bus connected to the ROM address bus in sequential order.
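The ROM-rearranging program could be as simple as the Python sketch below. It's hypothetical: the real WarpSE pin mapping isn't given here, so the wiring lists in the example are made up. It assumes the image has already been split into one file per ROM chip (so every address is a chip-local address), that `wiring[rom_pin]` names the CPU address bit wired to that ROM pin, and it only handles address lines; a scrambled data bus would need a similar per-byte bit swap.

```python
def cpu_to_rom_address(cpu_addr: int, wiring: list) -> int:
    """Translate a CPU-side address into the address the ROM chip actually sees.

    wiring[rom_pin] is the CPU address bit wired to that ROM address pin
    (hypothetical mapping; substitute the board's real netlist).
    """
    rom_addr = 0
    for rom_pin, cpu_bit in enumerate(wiring):
        if (cpu_addr >> cpu_bit) & 1:
            rom_addr |= 1 << rom_pin
    return rom_addr


def rearrange_rom(image: bytes, wiring: list) -> bytes:
    """Reorder a ROM image so it reads back in order through the scrambled wiring."""
    n = len(wiring)
    if sorted(wiring) != list(range(n)):
        raise ValueError("wiring must be a permutation of address bits 0..n-1")
    if len(image) != 1 << n:
        raise ValueError(f"image must be exactly {1 << n} bytes for {n} address lines")
    out = bytearray(len(image))
    for cpu_addr, value in enumerate(image):
        # The byte the CPU expects at cpu_addr must sit where the ROM will see it.
        out[cpu_to_rom_address(cpu_addr, wiring)] = value
    return bytes(out)


# Toy example: 2 address lines with A0 and A1 swapped.
print(list(rearrange_rom(bytes([0, 1, 2, 3]), [1, 0])))  # [0, 2, 1, 3]
```

Since the mapping is a pure bit permutation, every byte lands in exactly one place, so the output is the same size as the input and can be burned directly.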

[Photo: WarpSE prototype boards from JLCPCB]
 

retr01

Senior Tinkerer
Jun 6, 2022
Utah, USA
retr01.com
This WarpSE accelerator PDS card sounds incredible for the SE! :)

A WarpSE/30 accelerator with two PDS connectors and a 24-bit video card with VGA and HDMI for the SE/30 PDS would be fantastic! Is that on your following project list @Zane Kaminski? It would have to be compatible with BMOW's Rominator II SIMM and with any network card in one of the PDS connectors on the WarpSE/30 card.

FPGAs are great, but I like it when it's actual SE or SE/30 internal parts with some modern help.
 

retr01

Senior Tinkerer
Jun 6, 2022
Utah, USA
retr01.com
Actually I think you might be a bit mistaken. The WarpSE is an all-new design with various architectural features not present in older accelerators like the Performer. I like reproduction board projects like the Mac SE Reloaded and making small convenience-type changes to the reproduction boards is good, but I think studying an old accelerator design with the intent of copying it into a new accelerator is not too useful. So much has changed that the constraints on the original design are basically irrelevant and you can do something much better.
I agree with you, @Zane Kaminski, regarding a new accelerator design versus copying an old accelerator design into a new one. I have noticed that problem with most new accelerators that copy old designs.

It is better to go with new designs using modern parts to achieve what was impossible back in the day, while maintaining the overall design of the compact Macs. Also, modern designs can control things such as the clock frequency when needed, because many games run way too fast, having been written back when most people did not have accelerators in their Macs. For example, the Apple Squeezer accelerator for the Apple IIGS has stepped frequency adjustments accessible from the classic desk accessory or the control panel in IIGS System 6.

Back in the day, the old accelerators did not have that option; they were simply on or off. Sometimes slowing things down rather than turning the accelerator off would strike the right balance, depending on the program running. Another caveat is that, besides running too fast, some programs freeze or malfunction with old accelerators. Perhaps new accelerators based on modern designs can address some of that by automatically matching the logic board's bus speed whenever the accelerated CPU passes data through the bus.
 

Zane Kaminski

Administrator
Staff member
Founder
Sep 5, 2021
Columbus, Ohio, USA
A WarpSE/30 accelerator with two PDS connectors and a 24-bit video card with VGA and HDMI for the SE/30 PDS would be fantastic! Is that on your following project list @Zane Kaminski?

It's a long answer if I'm to be thorough, but I guess I'll write a bunch about a possible “WarpSE/30” since I've been asked so many times in varying levels of detail. A really great accelerator for an '030 system is a difficult project to truly get right. I had the WarpLC partially done six months ago but abandoned the design:
[Image: the partially completed WarpLC design]

This was gonna go in the LC, LC II, CC, etc. and add a 33 MHz ‘030, FPU, 64 MB SDRAM with 32-bit width and 4-1-1-1 timing (75 MB/sec), 32 kB L2 cache (inside the FPGA), and ESP32 for WiFi networking. It has since been canceled! I will explain the various reasons why but let me first say a few things we believe in doing differently with our accelerators:
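The 75 MB/sec figure is consistent with simple burst arithmetic. As a rough check, this Python snippet assumes the 4-1-1-1 numbers are clocks of a 33 MHz bus and that bursts run back to back with no refresh or idle cycles, which real traffic won't quite achieve.

```python
def burst_bandwidth_mb_s(clock_hz: float, burst_timing: list, bus_bytes: int) -> float:
    """Peak bandwidth of back-to-back bursts in MB/s (10**6 bytes per second).

    burst_timing gives the clocks spent on each beat, e.g. [4, 1, 1, 1].
    """
    bytes_per_burst = bus_bytes * len(burst_timing)
    clocks_per_burst = sum(burst_timing)
    return bytes_per_burst * clock_hz / clocks_per_burst / 1e6


# 32-bit (4-byte) bus, 4-1-1-1 bursts, 33.33 MHz clock (assumed)
print(round(burst_bandwidth_mb_s(33.33e6, [4, 1, 1, 1], 4), 1))  # 76.2
```

Sixteen bytes every seven clocks at 33 MHz comes to about 76 MB/s peak, in line with the roughly 75 MB/sec quoted.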

Onboard RAM
Firstly, we believe in maxing out the system RAM with RAM onboard the accelerator. The 4 MB of RAM on the WarpSE is cheap, and including it obviates the need for an L2 cache, which would only be a few kilobytes anyway. The onboard RAM is fast enough to keep pace with the 25 MHz 68k, so no faster cache is needed. On the SE/30, the benefits of maxing out the RAM on the accelerator are more pronounced. The obvious one is price. 128 MB in 30-pin SIMM format is expensive. We used to have the best price at $80, but the SIMM market was too cutthroat so we had to quit. The same 128 MB in SDR SDRAM requires just four chips costing $1-2 each. And with onboard SDRAM, we can design a really fast RAM controller capable of more bandwidth than any 68k Mac, including the IIfx and Quadra 840AV. So we are committed to always maxing out the RAM no matter what.
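A quick sanity check on the "four chips" figure, as a Python sketch. It assumes a 32-bit-wide bus as on the WarpLC and 256 Mbit SDR SDRAM parts; the chip organizations in the example are assumptions, since no specific parts are named here.

```python
MIB = 1024 * 1024


def sdram_chips(total_bytes: int, bus_width_bits: int, chip_mbit: int, chip_width_bits: int) -> int:
    """Number of SDRAM chips needed for total_bytes on a bus_width_bits-wide bus.

    Chips sit side by side to fill the bus width (one rank); ranks are added
    until the capacity is reached. chip_mbit uses the usual 2**20-bit megabit.
    """
    if bus_width_bits % chip_width_bits:
        raise ValueError("bus width must be a multiple of the chip width")
    chips_per_rank = bus_width_bits // chip_width_bits
    rank_bytes = chips_per_rank * chip_mbit * MIB // 8
    ranks = -(-total_bytes // rank_bytes)  # ceiling division
    return chips_per_rank * ranks


# 128 MB on a 32-bit bus, with two example chip organizations (assumed):
print(sdram_chips(128 * MIB, 32, 256, 8))   # 4  (256 Mbit x8, one rank)
print(sdram_chips(128 * MIB, 32, 256, 16))  # 4  (256 Mbit x16, two ranks)
```

Either way it comes out to four chips, compared with sixteen 30-pin SIMM slots' worth of legacy modules for the same capacity on an SE/30.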

Minimizing legacy parts
Second, we are highly disinclined toward using legacy chips. The WarpSE misses the mark here. Most of the WarpSE’s chips are new, but the RAM and 68k CPU are legacy parts. These chips are no longer made, and over time their prices will rise and their quality will decline (by that I mean there will be counterfeits, used chips with bent pins, etc.). Old chips use a lot of power and board space, and their presence also prevents us from meeting the latest environmental standards. The current WarpSE design is already fast, so I would say the primary goal for the next generation of WarpSE is to eliminate the legacy RAM and CPU. Well, we could hook modern SDRAM up to the current WarpSE easily enough, but replacing the CPU will require an FPGA, and one with about 100x more logic resources than the little Xilinx CPLD serving as the WarpSE’s chipset. We could keep the old 68k in the next WarpSE revision, but I feel strongly that we must at least adopt SDRAM. SDRAM is 3V-only, and the 5V-amplitude signals put out by an old MC68HC000 will blow up an SDRAM chip if connected directly. So to add SDRAM, we would have to put more of those pesky little buffer chips in between the RAM and 68k:
[Image: buffer chips]

The function of the buffer chips is to safely take in the 5V-amplitude signals from the 68k and output 3V signals which are safe for the SDRAM. They're only a few tens of cents each, but they are a bit of a pain. If we instead put the 68k in an FPGA, we can actually eliminate some of the buffers instead of having to add more because of the SDRAM. Fortunately, Melkhior has pointed out the open-source Suska 68K10 core. It looks to be a well-tested, 25 MHz-capable 68000 implementation which we can use for the future WarpSE. Unfortunately, there are no suitable open-source 68030-compatible CPU cores, so any ‘030-based accelerators (such as for the LC or SE/30) will have to either make do with a “real” 68030 for now or, uhh, not exist.

No passthrough connectors
Third, we are highly disinclined toward PDS pass-through connectors. It sounds good but you get into a 3V/5V level-shifting problem sorta similar to what I just described. I’ll elaborate. For the WarpSE, it was acceptable to use a small, simple FPGA (referred to as a “CPLD”) to implement the accelerator “chipset.” The function of the chipset is to control the onboard RAM, transfer data between the accelerated CPU and main motherboard, etc. This older CPLD can interface directly with the Mac’s 5V signals, saving us from having to add more of those little buffers. But for an ’030 accelerator, even if we are using a separate CPU outside of the FPGA, there are too many things the chipset has to do: burst transfers, 16/32 bit byte steering, L2 cache, etc. An old, small, 5V-tolerant CPLD won’t cut it. So we need to use a new, 3V-only FPGA. That means more of those little buffers to translate between 5V and 3V. The 5V/3V bus distinction makes it difficult to route the pass-through on the board too. Look at the back of the DiiMO030 accelerator with pass-through:
[Image: back of the DiiMO030 accelerator with pass-through]

On the DiiMO030, it seems that the signals from the bottom PDS connector go straight up to the passthrough connector on top, hitting any chips they need to connect to on their way through the middle of the board. On a possible WarpSE/30, since the 5V PDS signals can’t directly connect to the FPGA and SDRAM, none of the PDS bus passthrough wires go to anything but the two connectors and the buffers at the bottom of the accelerator card. Routing those signals through the middle of the board would make a big mess of the layout since they don’t have anything in the middle to connect to. So to simplify routing and minimize the required layer count, the whole PDS bus has to sort of go around everything to reach the passthrough on top. See this layout concept which convinced us not to do the passthrough:
[Image: WarpSE/30 layout concept with pass-through]

Not too much room for the go-around! Notice how on the now-abandoned WarpLC design, we avoided this pesky detail. Basically the same architecture but no passthrough so the layout is nicer:
[Image: WarpLC layout without pass-through]

Much simpler layout without the passthrough! The final strike is the matter of the PDS connector itself. The 96-pin EuroDIN connectors like on the SE and LC PDS can be obtained cheaply from Chinese manufacturers. But few Chinese manufacturers make the SE/30 PDS’s 120-pin connector, and it’s quite a bit more expensive from the well-known American/European manufacturers. Last I checked it was like $7 for the 120-pin SE/30 PDS connector from Molex versus $1-2 for the 96-pin SE/LC/Portable/NuBus connector from a no-name Chinese factory. Chinese connectors are good! I like the cheap plastic used in cheap connectors; it doesn't melt as easily as the softer plastic from the fancy connector companies. And then there’s the matter of soldering the connector pins! We have mostly automated surface-mount production using our pick-and-place machine and conveyor reflow oven. We like doing the production ourselves. It saves money and we get to be in charge of the quality. But we have to hand-solder through-hole stuff like the PDS connectors. Obviously we have to put on at least one PDS connector in order to have an accelerator, but we would rather not solder the additional 120 pins for the passthrough. It would be easier to add more surface-mount chips to implement the video card or whatever. So overall the passthrough is just a big headache in every way, and we should be building new network and video implementations anyway rather than supporting the passthrough. Also, the PDS passthrough necessarily runs at 15.6672 MHz, whereas if we built the video card directly onto the accelerator's fast bus, updating the screen would be way faster.



Okay now back to the WarpLC and WarpSE/30. We wanted to make the WarpLC and then enhance that architecture into the WarpSE/30. The LC is less powerful than the SE/30 so 33 MHz would be acceptable for an LC accelerator but an SE/30 accelerator oughta be 40 or 50 MHz. Plus the SE/30 accelerator has to have the full 128 MB RAM instead of 64 MB on the WarpLC. LC owners don’t expect a passthrough since the computer has onboard (low-res) video output and nobody is expecting to add a network card and an accelerator. So the plan was to ship the WarpLC as an accelerator with WiFi hardware but no driver support, then work on the WiFi driver, and then once it was done, do the WarpSE/30 with integrated networking and possibly video. (And of course the WarpSE/30 would have no passthrough.)

So why did we cancel the WarpLC and therefore also the WarpSE/30?

Well, the most immediate reason was that we based this design fairly heavily around the Xilinx Spartan-6 FPGA, which since the beginning of 2022 has been unobtainable and is slated for discontinuation by Xilinx. We were planning on using some specific Spartan-6 features which are not present in other vendors’ FPGAs. We didn’t anticipate the Spartan-6 discontinuation in September 2021 when we started the WarpLC design. Second, these designs have like 21 of those little buffers. That’s a lot! They’re cheap enough ($0.15 nominally, $0.25-0.35 these days) but 21 is a ton of those little things.

Third, the WiFi interface on the WarpLC is supposed to be a clone of my NuBus WiFi card concept which has been languishing. The NuBus WiFi card hardware is in need of a redesign. I think it would be prudent to work on that first before dreaming up another, separate product that’s supposed to integrate the unfinished networking.

Fourth, and probably most importantly, maybe we should be focusing on the problem of implementing the 68030 rather than implementing the supporting logic of the rest of the accelerator. I looked just now on eBay at 68030 prices. For a 40 or 50 MHz chip, it’s at least $35. Many of the chips are fake too, as reported by many of our members and others. When I say “fake,” I don’t really mean that the chip doesn’t work or isn’t a 68030, but that it has been “refurbished” and re-marked with a faster speed grade and newer mask revision. That makes it difficult to hit 40/50 MHz if we have to bin the chips ourselves. And then what do we do with the slow chips? Return them? The sellers will take them back, but if you return too much stuff on eBay at once, you start getting emails about eBay’s “abusive buyer policy.” Hahah, what about an abusive seller policy lol… We are anticipating the counterfeit chip problem getting worse as opposed to better. Once we have our 68030 core in hand, this whole problem will be solved and everything will get faster and cheaper. With everything internal to the FPGA, we will be able to reduce RAM and L2 cache latencies, increasing performance. There will be fewer buffer chips, since we won’t have a 5V 68k CPU that must be connected to the 3V-only FPGA and SDRAM. And of course we won’t have to spend $50 on the 68030 and 68882.

Regarding what you said about shying away from doing a lot in an FPGA, I can see how an FPGA 68030 might not feel like a win from the perspective of a vintage chip purist. I do understand. Garrett and I are obsessed with a few Honda/Acura models. Our favorites are the 2004-2008 “CL9” Honda Accord / Acura TSX and the 1996-2000 “EJ” Honda Civic. We own both. Garrett owns the Civic and Garrett’s Workshop owns the Acura TSX (it used to be mine), both with manual transmission of course. We have other (newer) daily drivers but we never wanna give up the TSX and Civic. What we like about the cars is this sort of beautiful unity between the driving feel, the engineering prowess, and the overall cost of ownership. Very much like the Mac. Fun to drive but then you start fixing the car and you see the sublime engineering. So minimal but so robust and good and reliable and smart. I love Honda and I love Apple for basically the same reasons. But eventually our Civic and TSX will be unmaintainable. Come 2025, Honda will no longer be legally obligated to produce parts for the Civic. Aftermarket parts are almost all garbage, except the fancy racing ones which are usually a lot more expensive than the parts from the dealer. In the Civic, all the rubber suspension bushings were totally shot after 20 years and we replaced them with polyurethane ones from a well-respected racing parts brand. Polyurethane lasts much longer than rubber but it gives a distinctly harsher road feel. I think we would have preferred new original rubber bushings. The Acura is worse lol, it’s been crashed into on the side. Had the body pulled to spec but it has some lingering suspension problem we’ve been too busy to figure out. We want a new 2008 Acura TSX and a new 2000 Honda Civic! And we don’t care if the cars are in an FPGA! Hahahah…

My point with the Honda-Apple comparison is that old parts are going away and will be steadily rising in price. Where do you get a 2008 Acura TSX A-Spec 6-speed manual with tech package that’s not trashed? Very hard to find, and nobody is making new 2008 Acuras. Fortunately we can sort of do this with Apple stuff. But just like the cars, we have not forgotten the importance of the original feel! With everything we do, we aim for complete compatibility, reliability, and above all, to preserve the original Macintosh feel! Over in the Amiga community, Gunnar von Boehn has his (closed-source) “Apollo 68080” core which he claims is 20x faster than a 68040. That’s all well and good, but we would not want to put too fast of a CPU in our accelerator. One of the signature elements of the classic Mac is how it sort of takes a moment to redraw stuff when you move a window around. We don’t wanna make that imperceptibly fast, just less annoying than the old slow speed. The feel is important.

So, to answer your question in the most succinct way, yes, we are working toward the WarpSE/30 but there is a good bit of important work to be done first. Even if we settle for a physical 68030+68882 instead of the FPGA 68030, we should at least get the NuBus WiFi card done so as to be able to integrate it. The FPGA market is also pretty unstable now and it would be best to wait for it to stabilize before designing the WarpLC again around another FPGA just for it to be discontinued again because of the chip shortage.

Long message! Hope it clarifies something about our thinking.
 

retr01

Senior Tinkerer
Jun 6, 2022
Utah, USA
retr01.com
Thank you, @Zane Kaminski, for the excellent clarification. It makes sense, and I agree with you about the analogy with the Hondas. Yes, the Hondas (and their luxury Acura brand) are fine automobiles! I am a Ford F-150 truck guy who sees General Motors as the evil IBM empire of the old days, versus Ford making things new and keeping the feel of Ford truck toughness. The Macs are hardier and tougher than the "PCs." Haha.

As you pointed out about the new designs, they take the unique approach of making things simpler, for example with fewer, more integrated chips, while keeping the same feel, rather than following the old ways. Recall how Wozniak used the fewest possible chips and paths to build the first Apple, compared to other engineers of the day who used too many chips and paths?

I am wondering about the PDS pass-through and the number of cards. Since it is better to forgo the PDS pass-through on the new accelerator cards, what about a dedicated PDS expander card that passes signals through without slowing down or meddling with anything? Alternatively, would an all-in-one PDS card make sense to modernize a Mac SE or SE/30 down the road? It could give the SE or SE/30 acceleration, 24-bit color video out (VGA and HDMI), and networking (802.11n/ac Wi-Fi or RJ45 Ethernet), saving money versus multiple PDS cards and PDS pass-through cards.