WarpSE: 25 MHz 68HC000-based accelerator for Mac SE

Zane Kaminski · Nov 2, 2021

Hi everyone.

I wanted to post about my new design for a 25 MHz 68HC000-based Macintosh SE accelerator. As usual with my gizmos, this is a collaboration with Garrett Fellers of Garrett's Workshop. I'm the principal designer and Garrett is the codesigner and he manufactures our boards.

My aim with this card is to achieve a speedup of 3x or so while maintaining the greatest degree of compatibility with applications. Therefore I have employed an MC68HC000 running at up to 25 MHz instead of an '030. The card has 4 MB of legacy 60ns DRAM onboard which can be accessed with 0 wait states at 25 MHz. The Macintosh SE ROM is also reproduced on the board in two 512kx8 70ns flash ROMs. In order to achieve the requisite 3x speedup on graphics performance, I have implemented a longword posted write buffer. This lets the accelerated CPU write to video/sound memory up to twice in a row with 0 wait states. The accelerator trickles out these writes from the fast CPU to the slower PDS bus while the fast CPU gets to continue executing from RAM or ROM memory on the card. CPU speed is selectable, either 20 MHz or 25 MHz. The current board is using the Xilinx XC95144XL-10TQG100C CPLD to implement the control logic but that's just because we at GW have a few trays of them from a cancelled project. The control logic is amenable to implementation in the Lattice LC4128Z/E/C/B/V CPLD too.
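As a rough illustration of the posted write buffer idea, here is a toy Python model of a two-entry queue sitting between the fast CPU and the slow bus. It is only a sketch: the class name, the method names, the address, and the figure of 4 fast clocks per slow-bus write are all assumptions made for the example, not values taken from the Warp-SE logic.

```python
from collections import deque


class PostedWriteBuffer:
    """Toy model: absorbs writes at CPU speed and drains them to a slower bus."""

    def __init__(self, depth=2, drain_clocks=4):
        self.depth = depth                # writes that can be posted back to back
        self.drain_clocks = drain_clocks  # fast clocks per slow-bus write (assumed)
        self.fifo = deque()
        self.countdown = 0                # fast clocks left on the write in progress

    def cpu_write(self, addr, data):
        """Post a write. True means 0 wait states; False means the CPU must wait."""
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((addr, data))
        return True

    def tick(self):
        """Advance one fast clock; return the write that completed, if any."""
        if not self.fifo:
            return None
        if self.countdown == 0:
            self.countdown = self.drain_clocks  # start draining the oldest write
        self.countdown -= 1
        if self.countdown == 0:
            return self.fifo.popleft()
        return None


buf = PostedWriteBuffer()
print(buf.cpu_write(0x1000, 0x1234))  # True: first write posted
print(buf.cpu_write(0x1002, 0x5678))  # True: second write posted
print(buf.cpu_write(0x1004, 0x9ABC))  # False: buffer full, CPU would stall
```

The point of the model is that the CPU only sees wait states once the buffer is full; until then the slow-bus latency is hidden behind whatever the CPU does next.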

The CPLD on the bottom left controls everything. Here's a block diagram of the system including several functional units inside the CPLD as well as external components like the ROM, RAM, 68k CPU, and the latches for the posted write buffer.

Of course, my description above sort of just scratches the surface. There are four main blocks in the CPLD shown in the diagram above. These are the FSB controller, DRAM/ROM controller, I/O bus target (i.e. slave port), and I/O bus agent (i.e. master controller). I have tried my best to specify the behavior of these blocks in my timing diagrams available here: https://garrettsworkshop.github.io/Warp-SE/Documentation/index.html

These are complicated but I'm hoping some of our skilled members can take a look. The diagrams are a bit of a work in progress but I believe they all basically make sense.

Here's an annotated render of the board showing where the external components are placed:

One interesting feature that I have added partway through development is the USB update system. I am using the inexpensive CH340G USB-to-serial chip to bit-bang the CPLD's JTAG port in order to do a software update. The data rate is pretty slow, maximum 1 kbit/sec, so it will take several minutes to update the CPLD. Nevertheless it's totally worth it since the additional cost for the update system is quite low. Here's the schematic for the update part of the system:

Source: https://github.com/garrettsworkshop/Warp-SE

I am of course looking for feedback, but my main aim in posting this documentation is to get some information out there on how to design an accelerator. Boards and PAL/GAL images are good to reverse engineer for study and reproduction, but with that approach you don't get to hear from the designer about the tradeoffs and reasoning. I think my design methodology works well and I will teach it to anyone who's interested and has the required prerequisite knowledge. I like designing by myself and with Garrett but we are feeling lonely! It would be good to get some collaborative projects going where the understanding of how they work is distributed among the community rather than just with a few people. So this big dump of documentation is my attempt to start down that path.

FAQ - Here are some questions which others have asked me privately. I figured I'd post a condensed version of each question and my answer.

Q: What is the "sound QoS" or "sound rate limiter?"
A: It's a fix for a behavior in the sound driver in the Mac ROM that causes audio glitches when running with an accelerated processor. The sound driver regenerates the samples in the sound buffer following vblank. Generating the sound samples takes a while and the sound driver has to be careful that it doesn't write into the buffer such that samples yet to be played are overwritten. Therefore the sound driver starts updating the buffer at a different index depending on the sound generation mode. Modes that generate samples quickly have to start further away from the current sample being played so that they don't cross over the current playback index, causing glitches in the sound output. In four-voice mode, the samples take longer to generate, so the sound driver begins updating samples closer to the current index being played. With an accelerated CPU, the samples are generated faster and so the current sample being played is overwritten. The fix for this is to slow the CPU way down when accessing the sound buffer.

Q: What are "test vectors?"
A: Test vectors are waveform sequences used to verify that a digital system is working as expected. They're like test cases in software development. The methodology goes like this... We make test vectors which correspond to the behavior of the system surrounding a system under test. So in this case the system under test is the accelerator control logic and the test vectors should represent expected behaviors of the BBU, MC68HC000, etc. All of the timing diagrams in the documentation on GitHub are good candidates to turn into test vectors. These are combined with the compiled system control logic and simulated so we can see how the control logic will behave when installed in the Mac (as long as our test vectors represent the actual behavior of the system). Generally there are also output vectors that show the expected behavior of the system under test too, and any deviation from the expected output vectors is flagged as a failed test case.
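To make the idea concrete, here is a minimal Python sketch of a test-vector run. In practice this would be done in a Verilog simulator against the compiled control logic; the toy "device under test" below (an `ack` output that follows `req` one clock later) and all of the names are hypothetical and exist only for illustration.

```python
def run_vectors(dut_step, input_vectors, expected_outputs):
    """Apply one input vector per clock; return a list of (index, expected, actual) mismatches."""
    state = None
    failures = []
    for i, (inputs, expected) in enumerate(zip(input_vectors, expected_outputs)):
        state, outputs = dut_step(state, inputs)
        if outputs != expected:
            failures.append((i, expected, outputs))
    return failures


def toy_dut(state, inputs):
    """Hypothetical logic under test: 'ack' is 'req' delayed by one clock."""
    prev_req = state if state is not None else 0
    outputs = {"ack": prev_req}
    return inputs["req"], outputs


# Input vectors model the surrounding system; output vectors are what we expect back.
inputs = [{"req": r} for r in (0, 1, 1, 0, 0)]
expected = [{"ack": a} for a in (0, 0, 1, 1, 0)]

print(run_vectors(toy_dut, inputs, expected))  # [] means every vector matched
```

Any non-empty result is a failed test case, with the clock index telling you where the simulated logic diverged from the expected output vector.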

Project Status
Accelerator is working in the lab, with a reported 3.25x overall speedup in Speedometer 3.

To-Do Board

Frozen	*All but done*	Needs work	Not yet started
Parts selection: CPLD, RAM, ROM	CPLD verilog
Schematic	Documentation & timing diagrams
Board layout	Prototype board bringup

My aim with this card is to achieve a speedup of 3x or so while maintaining the greatest degree of compatibility with applications. Therefore I have employed an MC68HC000 running at up to 25 MHz instead of an '030. The card has 4 MB of legacy 60ns DRAM onboard which can be accessed with 0 wait states at 25 MHz. The Macintosh SE ROM is also reproduced on the board in two 512kx8 70ns flash ROMs. In order to achieve the requisite 3x speedup on graphics performance, I have implemented a longword posted write buffer (hence the multitude of '573 latches near the PDS connector). This lets the accelerated CPU write to video/sound memory up to twice in a row with 0 wait states. The accelerator trickles out these writes from the fast CPU to the slower PDS bus while the fast CPU gets to continue executing from RAM or ROM memory on the card. CPU speed is selectable, either 20 MHz or 25 MHz. The current board is using the Xilinx XC95144XL-10TQG100C CPLD to implement the control logic but that's just because we at GW have a few trays of them from a cancelled project. The control logic is amenable to implementation in the Lattice LC4128Z/E/C/B/V CPLD too.

Source is available here: https://github.com/garrettsworkshop/Warp-SE

Screen Shot 2021-11-02 at 2.35.17 PM.png

The CPLD on the bottom left controls everything. Here's a block diagram of the system including the several blocks inside the CPLD (zooming in required lol):

Of course, my description above sort of just scratches the surface. There are four main blocks in the CPLD shown in the diagram above. These are the FSB controller, DRAM/ROM controller, I/O bus target (i.e. slave port), and I/O bus agent (i.e. master controller). I have tried my best to specify the behavior of these blocks in my timing diagrams available here: https://garrettsworkshop.github.io/Warp-SE/Docs/index.html

These are complicated but I'm hoping some of our skilled members can take a look. The diagrams are a bit of a work in progress but I believe diagrams 0 through 16 are correct. Diagrams 17 through 26 are half done. Every timing diagram (but for a few at the end which I haven't finished) has a detailed explanation of what's going on and why.

Expect updates to the timing diagrams! I have yet to finish diagrams 17-20 and diagrams 21-26 are missing the explanation. I'll get to that in the next day or so. I welcome all questions no matter how simple but hopefully we can keep the conversation here regarding engineering rather than organizational questions like when the board will be available, etc.

Project Status/To-Do
Here's my attempt to keep track of the project to-do. Stuff in the "frozen" category is assumed not to be changing for the currently-planned release. "All but done" is stuff close to being finished or able to commit to but maybe we need to do some more simulation or something. "Needs work," is stuff that... needs work.

Frozen	*All but done*	Needs work
Parts selection: CPLD, RAM, ROM	Board layout	Test vectors (copy these from documentation)
Schematic	CPLD verilog
	Documentation & timing diagrams

alxlab · Nov 3, 2021

That's really cool. I've long been interested in PDS accelerators but have very limited knowledge of how accelerators are implemented, and of electronics in general. This is absolutely great info.

Any thoughts about using a PiStorm as 68000 accelerator in the Mac SE PDS slot (assuming the FC lines get implemented)?

Zane Kaminski · Nov 3, 2021

alxlab said:
Any thoughts about using a PiStorm as 68000 accelerator in the Mac SE PDS slot (assuming the FC lines get implemented)?

Shouldn’t need the FC lines. I don’t think anything in the 68000 Mac chipsets uses ‘em. Are you sure the issue isn’t that the Mac uses exact timing for many routines? Does the PiStorm emulator run in the ARM CPU’s user or supervisor mode? If it’s a user app then it can be preempted any time and so it won’t work with the floppy driver and possibly other areas of the system.

I had a failed project back in 2015 that’s basically like the PiStorm’s emulation approach. Problem was that my goal was Quadra performance and I was afraid the processors available wouldn’t cut it without implementing a JIT translation engine. My choices for processor fell into three categories: 32-bit ARM microcontrollers (ARMv7-M), 32-bit ARM applications processors (ARMv7-A), or 64-bit ARM applications processors. The 64-bit processors were attractive because they had enough registers to completely store the MC68k CPU state. 32-bit ARM, on the other hand, has fewer registers than MC68k so in the emulated implementation for each instruction you would have to load the registers from memory and then store them back.

So 32-bit apps processors were out because why suffer through making the emulator a Linux kernel module or setting up the MMU and whatnot to run barebones on the CPU when you can’t even fit all of the 68k state in the ARM32’s registers. That left 64-bit processors and 32-bit microcontrollers. At the time, the fastest ARM MCU was STM32H7 at 400 MHz. I wanted 20-25 MIPS so that wasn’t fast enough given that you have to load and store the emulated CPU registers before and after each instruction. 64-bit ARM was rejected for cost reasons. I bought a few devkits but the price for an ARM64 system on a module was like $100 and implementing the chip alone would have required a really advanced HDI PCB. Too costly, would be better to just buy a 50 MHz ‘030.

I could write a JIT that would translate MC68k to ARM, thus enabling the emulator on the ARM to not swap the MC68k registers back and forth between memory and ARM registers, but that’s even harder than implementing an interpretive emulator alone. I hear someone has gotten this approach working.

Now things have changed and there is a 1 GHz ARM MCU, the NXP i.MXRT1170 (ironically a distant successor of the 68k-based DragonBall MX). This could actually achieve my 25 MIPS goal.
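The reasoning above boils down to a cycle budget per emulated instruction. Here is that arithmetic as a small Python sketch; treating the budget as simply clock rate divided by target MIPS is a simplification that ignores memory stalls and multi-cycle instructions, and the function name is made up for the example.

```python
def cycles_per_emulated_instruction(clock_mhz, target_mips):
    """Host clock cycles available for each emulated MC68k instruction."""
    return clock_mhz / target_mips


print(cycles_per_emulated_instruction(400, 25))   # 16.0 (400 MHz STM32H7, 25 MIPS goal)
print(cycles_per_emulated_instruction(400, 20))   # 20.0 (same part, 20 MIPS goal)
print(cycles_per_emulated_instruction(1000, 25))  # 40.0 (1 GHz i.MXRT1170, 25 MIPS goal)
```

With only 16 to 20 cycles per instruction, loading and storing the emulated registers around each instruction eats most of the budget, which is why the 1 GHz part changes the picture.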

I should implement it as a devboard so that someone can get an emulator working.

alxlab · Nov 4, 2021

Yeah from what I understand Emu68 is running in Supervisor mode based on the post of the developer Michal Schulz:
https://www.patreon.com/posts/emu68-meets-54448728

User mode doesn't sound like it would make much sense.

On the PiStorm -> Macintosh discord AlexCL has been testing the PiStorm by directly replacing the 68000 CPU in a Mac SE. He's gotten as far as booting from a floppy but it's been inconsistent and the hard drive is not working. The inconsistency does very much sound like a timing issue. Looks like there's potential for the PiStorm to be used as an accelerator in a Mac SE but there's a lack of technical knowledge regarding how the Mac SE and Macs in general work. My first thought would be to use the PDS slot instead of removing the original CPU from the board but I understand that's another layer of potential problems.

Zane Kaminski · Nov 4, 2021

Hmm maybe "exact timing" was too strong a description. It's more like when interrupts are disabled (for floppy or serial access) the Mac has to have some minimum average performance over 5-10 microseconds or so. So you can't really delay execution for any significant amount of time, like a page fault or context switch would ruin a pending floppy access in progress. Not sure if the same is true of SCSI. Basically any time you see the Mac's mouse get all jerky (floppy access, serial data transfer) you know that if the emulator was stopped for even a few microseconds (and context switch is hundreds of microseconds optimistically) the pending access will probably be ruined. Hence why the Mac has to disable interrupts and the mouse stops moving smoothly--it needs that time to be free of interruptions so the floppy stuff can get done with the correct timing. I dunno how the Amiga does it but I always associated the "separate processor for floppy" approach with Commodore since it was used in the 1541 and whatnot.

So Michal seems to be running bare-metal and in supervisor mode, but maybe there is some additional gotcha. It should be solvable--there has been a lot of work on bare metal code on the Pi since I last thoroughly investigated the problem in 2015-16. Or the problem could be some subtlety of the M68k that isn't captured in the emulator.

This makes me really happy to see since a lot of people told me the emulation approach is impossible. Could never figure out their arguments lol since Apple did it quite well in the PPC transition. One thing that I'm not as much a fan of is the Raspberry Pi approach though. The Pi hardware is cheap but I would feel bad putting a big board stack on my Mac. Am I just hung up on aesthetics that don't matter? I always want it to be one single board, plugs in to PDS, no soldering, etc. The Pi approach is certainly cheaper and obviously you can make such a system on a PDS card. I was afraid the Pi would fall short in terms of I/O if we ever tried to take it to '020/'030 systems though. There are a lot more pins on the '030 bus and more bandwidth is required to talk to color framebuffers, etc. The Pi has relatively few GPIO pins on the hat connector. But it might be possible to get '030-class bandwidth by really quickly muxing all the bits onto a few lines, and the CPLD grabs 'em and initiates a bus cycle. Depends on how fast the Pi can toggle a GPIO.

I'm gonna put the i.MXRT1170 (1 GHz ARM Cortex-M7 microcontroller) on a board and I'll send one to anyone who seems interested and capable of porting one of the existing emulators.

One of the best things about the RT1170 is that it has a whole 2 MB of SRAM. Some large portion of that can be configured as tightly-coupled memory accessible by the CPU in one clock. This is quite good for performance... Gary Davidian's emulator for the PPC consisted mainly of a 512 kB jump table / switch statement kind of construct that contained the first two PPC instructions required to emulate any of the 65536 M68k instruction opcodes. To get good performance, a lot of locality of reference was required in the jump table. The thought was sort of like one of the pro-RISC arguments: In CISC processors, 80% of the time you're executing the same 20% of the instructions. When running Davidian's 68k emulator, you want the PowerPC to cache the table entries corresponding to the M68k instructions most frequently executed. So if you use all sorts of instruction types in your program (and this is quite unusual to do, as the RISC proponents noted), it will slow way down since CPUs (including in the Pi) only have like 32 kB of instruction cache. Thus the tightly-coupled memory of the RT1170 and other Cortex-M3/4/7 MCUs is useful. Even if apps don't execute a great variety of instructions, you get a greater degree of predictability of performance and the indirection of the table makes the branch predictors and whatnot work better.

Were I making an M68k emulator for an ARMv7-M MCU with a 512 kB tightly-coupled memory, I would spend the 512 kB on a 64k x 8B table like Davidian did, but task the code therein for each instruction with loading operands into the registers, then jump into a somewhat more generic routine possibly stored in non-tightly-coupled RAM that finishes executing the instruction. Davidian was writing for PPC so he had the luxury of storing all the M68k registers in the PPC registers. Thus some of the entries for the register-to-register M68k instructions were two-liners... one PPC instruction to implement the M68k opcode and then a return. But I think the big benefit of putting the table in TCM is that the indirection of going through the table will afford the branch and prefetch stuff in the processor and SDRAM controller something to sort of base their predictions on, so to speak. The idea is that the latency impact of the jump to non-tightly-coupled memory will be minimized by sort of pipelining the fetches ahead of execution of the four 2-byte instructions in each table entry.
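For readers who haven't seen a table-dispatched interpreter, here is a small Python sketch of the structure being described: one table slot per 16-bit opcode word, indexed directly by the fetched opcode. It is only an illustration of the dispatch idea, not Davidian's emulator or any real one. Only NOP (0x4E71) and MOVEQ are filled in, condition codes are ignored, and the `Cpu` class and handler names are invented for the example.

```python
TABLE_ENTRIES = 1 << 16                    # one entry per 16-bit opcode word
ENTRY_BYTES = 8                            # the "64k x 8B" layout from the post
TABLE_BYTES = TABLE_ENTRIES * ENTRY_BYTES  # 524288 bytes, i.e. 512 KiB


class Cpu:
    def __init__(self):
        self.d = [0] * 8  # data registers D0-D7
        self.pc = 0


def op_unhandled(cpu, opcode):
    raise NotImplementedError(f"opcode {opcode:#06x} not handled in this sketch")


def op_nop(cpu, opcode):
    pass


def op_moveq(cpu, opcode):
    """MOVEQ #imm8,Dn: sign-extend the low byte into Dn (flags ignored here)."""
    reg = (opcode >> 9) & 7
    imm = opcode & 0xFF
    if imm & 0x80:
        imm |= 0xFFFFFF00
    cpu.d[reg] = imm


def build_table():
    table = [op_unhandled] * TABLE_ENTRIES
    table[0x4E71] = op_nop
    for opcode in range(0x7000, 0x8000):   # 0111 rrr0 dddddddd
        if not opcode & 0x0100:
            table[opcode] = op_moveq
    return table


def step(cpu, table, fetch_word):
    """Fetch one opcode word and dispatch through the table."""
    opcode = fetch_word(cpu.pc)
    cpu.pc += 2
    table[opcode](cpu, opcode)


program = [0x7005, 0x4E71, 0x72FF]         # MOVEQ #5,D0 ; NOP ; MOVEQ #-1,D1
cpu, table = Cpu(), build_table()
for _ in program:
    step(cpu, table, lambda pc: program[pc // 2])
print(hex(cpu.d[0]), hex(cpu.d[1]))        # 0x5 0xffffffff
```

In a real implementation each slot would hold a few native instructions rather than a function reference, which is where the 8 bytes per entry and the 512 kB total come from.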

Okay but if we discuss the emulation approach too much here then nobody will want to buy/build/improve any real MC68k-based accelerators lol and I have a few of these in the pipeline. I think getting the emulator approach to the level of fit and finish that at least I personally would want would be difficult and the associated cost would be more than with the Pi-based approaches. But I will do an i.MXRT1170-based board for the SE and we'll see if anyone does any next steps on it. I'll post about this in another thread.

alxlab · Nov 4, 2021

Okay but if we discuss the emulation approach too much here then nobody will want to buy/build/improve any real MC68k-based accelerators lol and I have a few of these in the pipeline.

Yeah sorry about that! Don't want to distract and hijack this thread. I'll write my comments on your above post about emulation in another post. It's really cool to see your enthusiasm and thoughts on the subject.

Despite the possibilities of emulation being around, I don't think it diminishes the usefulness and coolness of hardware accelerators like the ones you've been working on. As you mentioned the old 68K CPUs are still widely available at reasonable costs (for now lol) so why not use them.

My only comment on the above design would be about the lack of a CPU socket. I kinda like being able to swap CPUs for testing purposes.

alxlab · Nov 5, 2021

I was wondering how the system knows to use the accelerator in the PDS slot. Looking at the MC68000 datasheet it seems to start off with /BR (active low in this case) being asserted by the coprocessor, which in this case would be the accelerator. Looking at the pinout for the SE PDS slot it should be pin 3A, and if I look at the PCB for Warp-SE it looks like it's just grounded.

So am I correct to assume that to simply disable the main processor and use the one on the PDS slot you just need to ground the /BR pin?

I'm curious because it would be cool to just have a PDS card with a socket to test 8 MHz 68000 chips on a Mac SE. I guess it would also be useful to test a PiStorm without having to desolder the CPU from the board.

Zane Kaminski · Nov 5, 2021

alxlab said:
to simply disable the main processor and use the one on the PDS slot you just need to ground the /BR pin?

Yeah basically. /BR low makes MC68k give up the bus after the current /AS cycle. So your hardware can pull /BR low, then wait two clocks (in case MC68k has already decided to start a new bus cycle concurrent with your /BR request). Then you watch for /AS high, indicating that MC68k is done with the current cycle and will release the bus. Or you can watch /BG which goes low one clock (IIRC) before the /AS cycle ends. Using /BG lets you skip waiting to see if MC68k has decided to start another cycle before processing the bus request, in the sense that /BG won’t go low until right before MC68k is actually gonna give up the bus rather than starting another cycle. Once /AS is high after the /BG assertion (and you’ve either waited two clocks or seen /BG low) the alternate master can drive the bus. So you don’t have to use /BG as long as you account for MC68k possibly starting a new bus cycle concurrent with your bus request.
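The simpler of the two approaches above (hold /BR low, wait two clocks, then wait for /AS high) can be written as a tiny state machine. Here it is as a Python sketch; signals are modeled as 0 = low and 1 = high, and the class and state names are invented for the example rather than taken from any real design.

```python
from enum import Enum, auto


class ArbState(Enum):
    REQUEST = auto()       # /BR just pulled low, waiting out the settle time
    WAIT_AS_HIGH = auto()  # waiting for MC68k to finish its current cycle
    OWN_BUS = auto()       # alternate master may drive the bus


class AlternateMaster:
    def __init__(self, settle_clocks=2):
        self.state = ArbState.REQUEST
        self.settle = settle_clocks
        self.br_n = 0  # /BR is held low for as long as we want the bus

    def clock(self, as_n):
        """Advance one clock given the sampled /AS; return True once the bus may be driven."""
        if self.state is ArbState.REQUEST:
            self.settle -= 1
            if self.settle == 0:
                self.state = ArbState.WAIT_AS_HIGH
        elif self.state is ArbState.WAIT_AS_HIGH:
            if as_n == 1:
                self.state = ArbState.OWN_BUS
        return self.state is ArbState.OWN_BUS


master = AlternateMaster()
# MC68k starts one more bus cycle (/AS low) right as we request, then finishes it.
print([master.clock(as_n) for as_n in (1, 0, 0, 1)])  # [False, False, False, True]
```

The two-clock wait is what covers the case where MC68k begins a new cycle concurrently with the request; watching /BG instead would let you drop that wait.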

Once your card has the bus, it’s not required to assert /BGACK. MC68k won’t take the bus back unless both /BR and /BGACK are inactive, so instead of /BGACK you can just keep /BR low while using the bus. The idea with /BGACK is to accommodate multiple alternate masters. The first alternate master to request the bus does so, then asserts /BGACK and deasserts /BR. MC68k sees /BGACK low so it doesn’t take the bus back even though /BR is no longer asserted. Then a second alternate bus master can pull /BR low to request the bus while /BGACK indicates that the first alternate master still has the bus. Once the first alternate master releases the bus, /BR is already low so MC68k doesn’t take the bus back and the second alternate master gets a chance to master the bus. So /BGACK is not so useful unless you have two alternate bus masters.

So yeah, just tie /BR low and MC68k will give up the bus. No need to look for /BG or assert /BGACK. There’s one gotcha though. MC68k drives the bus immediately following /RESET going low. Then after reset is released, it’s not clear from the manual whether MC68k immediately releases the bus or whether it takes time to do so. So you shouldn’t just have your address bus drivers on an alternate master card always active, otherwise there might be some bus contention at startup. So you oughta gate your address driver /OE pins with /RESET and then also synchronously count a few clocks after reset is released before the address drivers are enabled and a bus transaction is allowed to take place.
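The reset gating described above amounts to a small synchronous counter. A Python sketch of that logic follows; the hold-off of 4 clocks is an arbitrary stand-in for "a few clocks", and the names are hypothetical.

```python
class DriverEnable:
    """Keeps bus drivers off during /RESET and for a few clocks afterwards."""

    def __init__(self, hold_clocks=4):
        self.hold_clocks = hold_clocks  # clocks to wait after reset release (assumed value)
        self.count = 0

    def clock(self, reset_n):
        """reset_n is 0 while /RESET is asserted. Returns True when drivers may be enabled."""
        if reset_n == 0:
            self.count = 0              # reset asserted: drivers off, counter cleared
            return False
        if self.count < self.hold_clocks:
            self.count += 1             # still counting out the hold-off period
            return False
        return True


gate = DriverEnable()
print([gate.clock(r) for r in (0, 0, 1, 1, 1, 1, 1, 1)])
# [False, False, False, False, False, False, True, True]
```

In hardware the returned value would be ANDed into the driver /OE logic, so a glitch on /RESET restarts the hold-off period.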

alxlab · Nov 5, 2021

Yeah basically. /BR low makes MC68k give up the bus after the current /AS cycle. So your hardware can pull /BR low, then wait two clocks (in case MC68k has already decided to start a new bus cycle concurrent with your /BR request). Then you watch for /AS high, indicating that MC68k is done with the current cycle and will release the bus. Or you can watch /BG which goes low one clock (IIRC) before the /AS cycle ends. Using /BG lets you skip waiting to see if MC68k has decided to start another cycle before processing the bus request, in the sense that /BG won’t go low until right before MC68k is actually gonna give up the bus rather than starting another cycle. Once /AS is high after the /BG assertion (and you’ve either waited two clocks or seen /BG low) the alternate master can drive the bus. So you don’t have to use /BG as long as you account for MC68k possibly starting a new bus cycle concurrent with your bus request.

Yeah that makes a lot of sense after looking at the MC68000 timing diagram for 2-Wire bus arbitration.

Once your card has the bus, it’s not required to assert /BGACK. MC68k won’t take the bus back unless both /BR and /BGACK are inactive, so instead of /BGACK you can just keep /BR low while using the bus. The idea with /BGACK is to accommodate multiple alternate masters. The first alternate master to request the bus does so, then asserts /BGACK and deasserts /BR. MC68k sees /BGACK low so it doesn’t take the bus back even though /BR is no longer asserted. Then a second alternate bus master can pull /BR low to request the bus while /BGACK indicates that the first alternate master still has the bus. Once the first alternate master releases the bus, /BR is already low so MC68k doesn’t take the bus back and the second alternate master gets a chance to master the bus. So /BGACK is not so useful unless you have two alternate bus masters.

Ah that makes a lot of sense. So basically this would be required if there's more than 2 CPUs that would take over the bus.

So you shouldn’t just have your address bus drivers on an alternate master card always active, otherwise there might be some bus contention at startup. So you oughta gate your address driver /OE pins with /RESET and then also synchronously count a few clocks after reset is released before the address drivers are enabled and a bus transaction is allowed to take place.

So just to make sure I understand correctly: if I just slap a CPU on a PDS card running at the same clock speed as the original MC68k, then the "address bus drivers on an alternate master card" would be the CPU on the PDS card itself, and it should be kept off the bus until a few cycles after /RESET goes inactive to avoid contention?

This is like learning an entirely new language. I need to learn what things like "assert," "negate," and "driver" mean.

Hopefully I'll learn enough to understand and not fry anything.

I really appreciate the time you're taking to go through and explain this stuff.

If anyone else is interested in the source of the timing diagram I've attached the MC68000 Reference Manual I took it from.

Zane Kaminski · Nov 6, 2021

Hmm I think I may have been misremembering about MC68k driving the address bus during reset. Check this diagram from the user's manual:

Screen Shot 2021-11-06 at 2.19.56 PM.png

It's more that MC68k might drive the bus before 4 clocks after Vcc has reached minimum operating voltage. Seems explicit that it doesn't drive the bus while reset is asserted. But what's unclear is whether MC68k will recognize the bus request right after reset. I guess if the bus request is recognized during the "internal startup time" then MC68k won't start loading the initial stack pointer and whatnot. So upon revisiting it I think inhibiting your card's address~~/control~~ outputs is not necessary. This makes sense since otherwise you couldn't hook multiple MC68000s to the same bus without buffers in between.

Edit: No, wait a minute. The diagram says, "all control signals inactive." So that means high, not tristate and therefore you seem to have to disable the control signal outputs but not the address and data buses.

Zane Kaminski · Nov 22, 2021

Timing diagrams are finished, I think. Gotta translate them into verilog and then I can simulate to make sure my verilog is as expected. Maybe I should write a program that takes the WaveJSON format I did the timing diagrams in and outputs verilog for simulation. Hmm... Yeah that's probably my next step so I don't have to manually translate all 25 timing diagrams.

Zane Kaminski · Dec 1, 2021

Hmmm well the WaveJSON-to-verilog approach for generating the test vectors was not really successful, long story short. I'm just gonna do another set of test vectors for verification purposes. Also I need to get an SE board mounted on a board of wood with a PSU, disk, and some kind of video converter to ease testing.

alxlab · Dec 2, 2021

Guess you're not feeling too comfortable with testing the PDS card with the CRT high voltage exposed, eh?

Zane Kaminski · Dec 2, 2021

alxlab said:
Guess you're not feeling too comfortable with testing the PDS card with the CRT high voltage exposed, eh?

Just a pain in the ass for hooking up the logic analyzer and scope, plus I plan on making 100+ of these so I have to do production testing.

Zane Kaminski · Dec 4, 2021

I've had to add another little subsystem to my design:

Screen Shot 2021-12-03 at 9.33.19 PM.png

In the top right is the new USB update functionality. My implementation is fairly expensive ($4-5) but I think it's a necessity for users to be able to update the card without special hardware. The MCU in the QFN package is actually an ESP32-PICO-V3... massively overpowered and a bit pricey ($3-4) for the intended XSVF player capability but I think I need basically the same functionality on multiple future cards I'm planning, including a few '030 accelerators with WiFi capability so I would like to only maintain one piece of updater software if possible. I would probably pick STM32F030 instead of ESP32 but we are in the middle of the STM32 shortage and the price of those chips has been inflated so I'm going with the ESP32 for now. Unfortunately I haven't implemented any interface for the ESP32 to actually provide any extra functionality to the Mac. Maybe that can come in a future version. The board is kinda packed and there aren't many pins on the CPLD left to implement an SPI interface or anything along those lines. I would need to rethink the architecture to easily accommodate WiFi or something and it would probably involve a move to a higher-pin-count CPLD/FPGA, SDRAM, ROM-in-RAM, etc.

Here's the XSVF library I plan to use to play the XSVF to the CPLD's JTAG port in order to program it: https://github.com/ORSoC/libxsvf

Also I've renumbered the "GW" model number to make room for some other stuff in the "GW44xx" series of Mac-specific gizmos.

alxlab · Dec 5, 2021

Really coming along. The ESP32 just tucks in there neatly. The $3-4 shouldn't be that bad compared to the total cost of the whole thing, right?

Ron's Computer Videos · Dec 6, 2021

Very excited about this project!

Kai Robinson · Dec 6, 2021

Zane Kaminski said:
Hmmm well the WaveJSON-to-verilog approach for generating the test vectors was not really successful, long story short. I'm just gonna do another set of test vectors for verification purposes. Also I need to get an SE board mounted on a board of wood with a PSU, disk, and some kind of video converter to ease testing.

Get yourself a Mean Well RT-65B power supply, some wire, and a 14-pin Molex Mini-Fit Jr. plug and crimp tool - you can make a really simple bench power supply for these.

Zane Kaminski · Dec 11, 2021

I've been doing test vectors and everything is looking good.

Here's one picture which is admittedly incredibly hard to read--probably nobody should bother with it:

Anyway, the point is that the above picture shows a simulation of the DRAM controller's behavior under some reasonable requests from an MC68k processor at 25 MHz. In this simulation, the MC68k first accesses a non-RAM/ROM location, then reads ROM, writes ROM, reads RAM, writes RAM, and reads RAM again. Concurrent with this, urgent and non-urgent RAM refresh requests are coming in to the controller. The idea with the urgent/non-urgent refresh request mechanism is that the DRAM controller can delay a non-urgent refresh so that it starts concurrently with a ROM or I/O access rather than as soon as possible. This ensures that DRAM refresh is overlapped as much as possible with non-RAM activity. If a non-urgent request is pending for too long, it becomes urgent, and the controller then executes it as soon as there is an opportunity in the DRAM timing sequence. The simulation shows this behavior is working as intended!
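To make the urgent/non-urgent policy concrete, here is a rough behavioral sketch in Python. It is only a toy model, not the CPLD logic: the class name `RefreshArbiter`, the `urgent_after` threshold, and the one-access-per-slot granularity are all assumptions made for illustration. The model also assumes that a non-urgent refresh waits through idle slots and that an urgent refresh may stall a RAM access; the real controller's timing is described by the timing diagrams, not by this code.

```python
from dataclasses import dataclass


@dataclass
class RefreshArbiter:
    """Toy model of urgent/non-urgent refresh scheduling (not the CPLD logic)."""
    urgent_after: int = 4   # hypothetical: slots a request may wait before turning urgent
    pending: bool = False
    age: int = 0            # slots the current request has been waiting

    def request(self) -> None:
        """Refresh timer tick: raise a refresh request, initially non-urgent."""
        if not self.pending:
            self.pending = True
            self.age = 0

    @property
    def urgent(self) -> bool:
        return self.pending and self.age >= self.urgent_after

    def step(self, access: str) -> tuple:
        """Advance one bus slot; `access` is 'ram', 'rom', 'io', or 'idle'.

        Returns (refresh_started, cpu_stalled).
        """
        if not self.pending:
            return (False, False)
        overlaps = access in ("rom", "io")   # DRAM is free while the CPU is busy elsewhere
        if overlaps or self.urgent:
            stalled = access == "ram"        # only an urgent refresh can collide with RAM
            self.pending = False
            self.age = 0
            return (True, stalled)
        self.age += 1                        # keep waiting for a ROM or I/O access
        return (False, False)


# Example: a non-urgent request rides along with the next ROM access.
arb = RefreshArbiter(urgent_after=3)
arb.request()
trace = [arb.step(a) for a in ("ram", "ram", "rom")]
```

In this model the refresh is hidden behind the ROM access in the third slot, so the CPU never sees it. If only RAM accesses arrive, the request ages past `urgent_after` and the refresh is forced, at the cost of a stall.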

Now if we were making an ASIC or some kind of super-high-reliability system, we would want to generate a bazillion of these test cases so as to achieve something like 99.9% "test coverage" to ensure there are no mistakes in the controller logic. Of course that takes a long time so I am gonna settle for a good bit of simulation test coverage, but then use real-world functional testing to find any remaining issues. Any issues uncovered after the accelerator ships can be fixed via software update.

Zane Kaminski · Jan 9, 2022

Small update on the project. Long story short, we are prototyping the USB update system for this accelerator on an upcoming Commodore 64 RAM product. Cost is a bit tighter on that card than this accelerator so we have eliminated the update microcontroller on there and we will be directly bit-banging the CPLD’s JTAG port using the GPIO-type modem control (RTS, CTS, etc.) signals from the CH340G USB-to-serial chip during an update. The actual serial UART port, formerly used to talk to the update MCU, will be unused. We are not 100% sure about this approach but it saves $3 so we oughta investigate. Once we verify that our idea for eliminating the update MCU is free of any gotchas on the Commodore gizmo then we will be removing the update MCU from the SE accelerator design as well and then building some prototypes.
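For anyone unfamiliar with bit-banging JTAG, here is a minimal Python sketch of the idea. The post does not say which modem control line carries which JTAG signal, so the pin access is hidden behind two hypothetical callbacks, `set_pin` and `get_tdo`. On real hardware these would toggle and read the USB-to-serial chip's control lines. The `FakeTarget` class is a stand-in that simply echoes TDI back on TDO one clock later, so the example runs without any hardware. This is not the actual updater software.

```python
class JtagBitBanger:
    """Drives JTAG one TCK cycle at a time through two caller-supplied callbacks."""

    def __init__(self, set_pin, get_tdo):
        self.set_pin = set_pin   # hypothetical: set_pin("TCK" | "TMS" | "TDI", 0 or 1)
        self.get_tdo = get_tdo   # hypothetical: returns the current TDO level (0 or 1)

    def clock(self, tms, tdi):
        """Run one TCK cycle and return the TDO bit sampled during it."""
        self.set_pin("TCK", 0)   # falling edge: target updates TDO
        self.set_pin("TMS", tms)
        self.set_pin("TDI", tdi)
        tdo = self.get_tdo()     # sample TDO while TCK is low
        self.set_pin("TCK", 1)   # rising edge: target samples TMS and TDI
        return tdo

    def reset(self):
        """Five TCK cycles with TMS high reach Test-Logic-Reset from any TAP state."""
        for _ in range(5):
            self.clock(tms=1, tdi=0)

    def shift(self, bits):
        """Shift bits in and out; assumes the TAP is already in a Shift state.

        TMS is raised on the last bit to leave the Shift state.
        """
        out = []
        for i, bit in enumerate(bits):
            last = i == len(bits) - 1
            out.append(self.clock(tms=1 if last else 0, tdi=bit))
        return out


class FakeTarget:
    """Hardware stand-in: a 1-bit register between TDI and TDO."""

    def __init__(self):
        self.pins = {"TCK": 0, "TMS": 0, "TDI": 0}
        self.reg = 0
        self.tdo = 0
        self.tms_log = []   # TMS value seen at each rising TCK edge

    def set_pin(self, name, value):
        old = self.pins[name]
        self.pins[name] = value
        if name == "TCK" and old == 0 and value == 1:
            self.reg = self.pins["TDI"]          # capture on rising edge
            self.tms_log.append(self.pins["TMS"])
        elif name == "TCK" and old == 1 and value == 0:
            self.tdo = self.reg                  # drive TDO on falling edge

    def get_tdo(self):
        return self.tdo


target = FakeTarget()
jtag = JtagBitBanger(target.set_pin, target.get_tdo)
jtag.reset()
captured = jtag.shift([1, 0, 1, 1])
```

An XSVF player sits on top of exactly this kind of `clock` primitive. One thing to watch with the modem-control approach is speed, since each pin change is likely to be its own USB transaction.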
