Cache Discussion/Questions

trag · Oct 29, 2021

Kind of a more general continuation of the WTB 1MB cache for x100 machines thread. So paging @Ron's Computer Videos into the thread too.

(First, I should mention that the Power_Macintosh Hardware Developer Note (for X100 machines) says the cache logic is in the HMC chip. So no need to fiddle the bus signals from the cache.)

General Cache Theory

As I understand it (happy for corrections), an L2 cache works by storing a subset of the memory (DRAM) contents in a faster form of memory. The tricky part is keeping track of what's stored where and updating it as needed. Typically one set of SRAM stores the cached data and another set of SRAM stores the addresses for the stored data. These are called DATA SRAM and TAG SRAM respectively.

To avoid having to make the TAG SRAM (addresses) the full width of the address bus (e.g. 32 bits), the System Address is divided into Upper bits and Lower bits (more significant, less significant).

The lower address bits are used as the address of data in the DATA SRAM. At that same address (just the lower bits of the full address) the remaining Upper bits of the address are stored in the TAG SRAM.

When a Read is performed, the contents of the TAG SRAM at the Lower Bits address are read and compared to the Upper Bits of the address. If they match, then the contents of the DATA SRAM at that same address are valid and should be used.

For example, imagine one has an address of BEEF CAFE (32 bits, eight 4-bit hex digits) and at that address in main memory one has stored DEAD FADE. If this is also present in the cache, then at the address CAFE in the DATA SRAM one would find DEAD FADE. At the address CAFE in the TAG SRAM one would find BEEF.

Other addresses, such as DEAD CAFE, when tested against the TAG SRAM would find that DEAD /= BEEF and so would be a cache miss.
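A minimal Python sketch of the example above, assuming the same simplified 16/16 split and using dictionaries to stand in for the two SRAMs. The names (tag_sram, data_sram, fill, read) are made up for illustration and don't come from any Apple documentation.

```python
# Toy direct-mapped cache using the simplified 16/16 address split.
tag_sram = {}    # lower 16 address bits -> stored upper 16 address bits
data_sram = {}   # lower 16 address bits -> cached 32-bit value

def split(addr):
    """Split a 32-bit address into (upper bits, lower bits)."""
    return addr >> 16, addr & 0xFFFF

def fill(addr, value):
    """Store a value in the cache and record which address it belongs to."""
    upper, lower = split(addr)
    tag_sram[lower] = upper
    data_sram[lower] = value

def read(addr):
    """Return the cached value on a hit, or None on a miss."""
    upper, lower = split(addr)
    if tag_sram.get(lower) == upper:
        return data_sram[lower]
    return None

fill(0xBEEFCAFE, 0xDEADFADE)
hit = read(0xBEEFCAFE)    # upper bits match the stored tag
miss = read(0xDEADCAFE)   # same lower bits, different upper bits
```

Here hit comes back as 0xDEADFADE and miss as None, since DEAD doesn't match the BEEF stored at CAFE.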

In reality the divide between the upper and lower bits depends on the size of the cache, the word size of the system, and the total memory space that is covered by the cache.
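To make that dependency concrete, here's a rough sketch of how the split could be computed for a direct-mapped cache. The function name and the power-of-two assumption are illustrative only; the 256 KB size, 32-byte line and 32-bit address in the usage line are example inputs, not confirmed figures for any of these machines.

```python
def field_widths(cache_bytes, line_bytes, addr_bits=32):
    """Return (offset_bits, index_bits, tag_bits) for a direct-mapped cache.

    Assumes cache_bytes and line_bytes are both powers of two.
    """
    lines = cache_bytes // line_bytes
    offset_bits = line_bytes.bit_length() - 1        # byte within a line
    index_bits = lines.bit_length() - 1              # selects a line (TAG SRAM address)
    tag_bits = addr_bits - index_bits - offset_bits  # what the TAG SRAM must store
    return offset_bits, index_bits, tag_bits

widths = field_widths(256 * 1024, 32)
print(widths)
```

Making the cache bigger moves bits from the tag into the index, and making the line longer moves bits from the index into the offset.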

END Theory

I'm looking at three cache modules. Two for the X100, and one (I think) for the X500/X600. These are all 256K, so not the cool 1 MB, yet.

We'll call them #1, #2 and #3 from top to bottom.

#1 and #2 have a pin pattern of 30, 25, 25 from right to left (that's how the silk screen labels the pins, R => L).

#3 has a pin pattern of 20, 35, 25, also R => L.

The only branding on #1 seems to be from IDT, the chip maker. Perhaps they ran these off for Apple?

#2 says Apple Computer 256K cache SIMM (which is wrong, it's a DIMM; the HDN has the same error). The white sticker on one of the chips says "For use in 6100 only", which doesn't make a lot of sense. The 6100 and 7100 had exactly the same chipset and bus speeds.

#3 just says Apple Computer.

Chips

#1 has eight (8) IDT 71256SA15 SRAMs on board. It has two more chips, which must be the tag RAM, but they are completely blank/unlabeled. It's not even faint, there's just nothing there. On the back there's a pair of GAL16V8C5, suggesting the need for a little more logic than what's in the HMC. There's also an eight-pin SOIC labeled AV70-1. From experience, I know this is almost certainly an ICS AV9170, which is a fun little clock synchronizing chip. I'm surprised to see it there, and I suspect it's overengineering, but maybe they had trouble with their clock phases not lining up.

#2 is actually a little more interesting, since all the chips are marked. There are two Cypress CY7B181-12 4K X 18 Tag SRAMs and eight (8) CY7C193-22 32K X 8 SRAMs. Plus a pair of GALs and a 74FCT244 buffer chip.

#3 holds two IDT71B74S8 8K X 8 Tag SRAMs and two Micron MT58LC32K36B2LG-9 32K X 36 SRAMs.

Questions:

Q1: The DATA SRAM of #2 and #3 is 32K X 64 when added together (ignoring parity bit in the MT58...) But the TAG SRAM only has 8K X 18 and 8K X 16 respectively.

I would expect the TAG SRAM to need to be 32K X ...., so that the addresses match up. Is there some form of set associativeness going on? Or am I utterly failing to understand how caching works?

There seems to be only 1/4 as much TAG SRAM as is needed to provide comparison for the upper address bits.

Or does the PPC always grab four words (32 bytes) at a time?
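If the answer to that last question is yes, the numbers do line up. A quick check, assuming 32-byte cache lines filled as four 64-bit transfers and one tag entry per line (both are assumptions here, not confirmed from the HDN):

```python
DATA_DEPTH = 32 * 1024   # 32K entries in the DATA SRAM
DATA_WIDTH_BYTES = 8     # each entry is 64 bits wide
LINE_BYTES = 32          # assumed cache line size

cache_bytes = DATA_DEPTH * DATA_WIDTH_BYTES         # total cache size
entries_per_line = LINE_BYTES // DATA_WIDTH_BYTES   # DATA SRAM entries per line
tag_entries = DATA_DEPTH // entries_per_line        # one tag per line

print(cache_bytes // 1024, "KB,", tag_entries // 1024, "K tag entries")
```

Under those assumptions a 256 KB cache needs only 8K tag entries, which would explain the 1/4 ratio without any set associativity.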

Q2: It would take eight 8K X 8 IDT71B74 chips to make the TAG SRAM of a 1 MB cache? X 2 to get 16 bit width. X 4 to get 32K depth. So 2 X 4 = 8.
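The chip count is just the number of chips needed for width times the number needed for depth. A small sketch, taking the 32K X 16 target from the question as given (the helper name is hypothetical):

```python
import math

def chips_needed(target_depth, target_width, chip_depth, chip_width):
    """Chips required to tile a target_depth x target_width memory array."""
    wide = math.ceil(target_width / chip_width)   # chips side by side for width
    deep = math.ceil(target_depth / chip_depth)   # banks stacked for depth
    return wide * deep

count = chips_needed(32 * 1024, 16, 8 * 1024, 8)
print(count)
```

If the tag turns out to need fewer than 16 bits, the width term still rounds up to two chips with 8-bit parts, so the count stays the same.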

I found a supply of the 71B74-12 for $1.60 each...

Q3: Just how fast does the tag lookup, from address available to comparison complete (through the TAG SRAM), need to be in the X100? Assume 40 MHz bus speed. It might be easier to just use 5V SRAM and add separate comparators. But the fastest comparators I could find that aren't PECL (voltage levels for logic are wonky) have a maximum delay of 4.5ns, although typical is 1.5ns. With 5V SRAM faster than 10ns being extremely rare, that's going to be 15 or 16ns to results, worst case, although 12ns would be typical.
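For reference, the arithmetic behind those figures, using only the numbers quoted above. How much of the bus cycle the HMC actually leaves for the tag lookup is unknown here, so this sketch only reports the totals and doesn't claim a pass or fail.

```python
BUS_MHZ = 40
SRAM_NS = 10.0             # fastest commonly available 5V SRAM
COMPARATOR_MAX_NS = 4.5    # worst-case comparator delay
COMPARATOR_TYP_NS = 1.5    # typical comparator delay

cycle_ns = 1000.0 / BUS_MHZ                  # one bus clock period
worst_ns = SRAM_NS + COMPARATOR_MAX_NS       # SRAM access plus slow comparator
typical_ns = SRAM_NS + COMPARATOR_TYP_NS     # SRAM access plus typical comparator

print(f"cycle {cycle_ns} ns, worst {worst_ns} ns, typical {typical_ns} ns")
```

The raw sums land slightly under the rounded figures in the question; any trace delay, buffer delay or setup time at the HMC would come on top of them.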

Might be easier to level translate from/to 3.3 V and use 2.5ns or 5ns SRAM, but then you lose much of what you gain in the level translators. Sigh.

1 MB Cache for X100

Still need to see if I can find my 1MB cache in the attic, but in theory, based on these examples, a 1 MB cache would have DATA SRAM configured as 128K X 64. The TAG SRAM would need to be 32K X 14 (maybe less than 14). And I'm seeing 15ns and 22ns for the DATA SRAM and 12ns for the TAG SRAM. Can't tell the speed of the TAG SRAM on #1 because the TAG SRAM is unmarked.

Cache for X500/X600

I think this is probably an uninteresting topic, because G3/G4 cards with integrated L2 cache are so ubiquitous. The need for/interest in L2 cache modules ends with the X100 Power Macintosh -- except for the possibility of "consumer" machines such as Alchemy and Gazelle based systems. G3 upgrades for those systems are fairly rare, so a cache module might be interesting.

Kai Robinson · Oct 29, 2021

Paging @Zane Kaminski

Zane Kaminski · Oct 29, 2021

My current angle on this problem (EDIT: specifically for the x100 machines) is to use 3.3V synchronous SRAM plus LVC/LVT buffers (or voltage limiting FET switches if necessary) for the data memory and then implement the entirety of the tag stuff (memory and address compare) in an FPGA e.g. Xilinx Spartan-6. The tags for a 1 MB cache fit perfectly in the (~$10) Spartan-6 XC6SLX9. You can get it in BGA-256 (1.0mm pitch) or TQFP-144 (0.5mm pitch). It’s got 32 1kx18 RAM blocks so you can make a 32kx18 tag store. With 32-byte lines that makes 1 MB of cache and you only need to use 13 of the 18 bits (12 tag, 1 valid) per tag entry. Is the PMx100 cache two-way set associative? You can implement that too if necessary.
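The sizing above can be checked in a few lines. This sketch assumes a direct-mapped cache, tags covering a full 32-bit address, and one valid bit per entry; the block RAM count and shape are the ones quoted above.

```python
BLOCKS = 32           # block RAMs in the XC6SLX9, as quoted above
BLOCK_DEPTH = 1024    # each used as 1K x 18
BLOCK_WIDTH = 18
LINE_BYTES = 32
ADDR_BITS = 32        # assumption: tags cover the full 32-bit address

entries = BLOCKS * BLOCK_DEPTH                    # tag entries available
cache_bytes = entries * LINE_BYTES                # cache size those tags cover
index_bits = entries.bit_length() - 1             # bits selecting a tag entry
offset_bits = LINE_BYTES.bit_length() - 1         # bits selecting a byte in a line
tag_bits = ADDR_BITS - index_bits - offset_bits   # bits stored per entry
bits_used = tag_bits + 1                          # plus one valid bit

print(entries, cache_bytes, tag_bits, bits_used)
```

That leaves 5 of the 18 bits per entry spare, which is where extra state could go if the cache turned out to need it.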

The 3.3V SSRAMs for the data are plenty fast and the critical thing is just to get SSRAM chips with the right type of pipelining. I think the PMx100 uses the kind of RAM without output registers, so data comes out immediately after the rising edge on which a read command is submitted (“CL1” in SDRAM parlance).

Once you have the FPGA tag RAM implementation you just wire it all up identically to the original card (behind the buffers of course).

Is this reasonable? The Verilog for a tag RAM is basically trivial and we should be able to achieve timing closure with 40 MHz bus speed fairly easily using a modern FPGA and fast enough buffers. The pin-to-pin delay can be slower than expected for larger FPGAs but it’s mitigated by the fact that the 601 bus is a mostly synchronous interface with a lot of data valid setup time for the address and whatnot before the clock (unlike the 68030, for example, which has a tCO of like half a clock cycle). It’s more a question of whether my proposed construction is any good. I always like pursuing the new parts approach on this kind of thing but it often makes the gizmo more expensive or harder to make for the benefit of more longevity.
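As a rough idea of how little logic is involved, here's a behavioral model of the tag store, written in Python rather than Verilog. It assumes a direct-mapped cache with 32-byte lines, 32K entries, and each entry packed as one valid bit above a 12-bit tag; the class and method names are invented for this sketch, and it ignores bus timing and snooping entirely.

```python
class TagStore:
    INDEX_BITS = 15           # 32K entries
    OFFSET_BITS = 5           # 32-byte lines
    TAG_BITS = 12             # remaining bits of a 32-bit address
    VALID = 1 << TAG_BITS     # valid flag sits just above the tag

    def __init__(self):
        # Every entry starts with its valid bit clear.
        self.ram = [0] * (1 << self.INDEX_BITS)

    def _split(self, addr):
        index = (addr >> self.OFFSET_BITS) & ((1 << self.INDEX_BITS) - 1)
        tag = addr >> (self.OFFSET_BITS + self.INDEX_BITS)
        return index, tag

    def lookup(self, addr):
        """True on a hit: the entry is valid and its stored tag matches."""
        index, tag = self._split(addr)
        return self.ram[index] == (self.VALID | tag)

    def allocate(self, addr):
        """Record that the line containing addr is now cached."""
        index, tag = self._split(addr)
        self.ram[index] = self.VALID | tag

    def invalidate(self, addr):
        """Clear the entry for the line containing addr."""
        index, _ = self._split(addr)
        self.ram[index] = 0

tags = TagStore()
tags.allocate(0x00123440)
same_line = tags.lookup(0x0012345F)    # same 32-byte line
other_tag = tags.lookup(0x10123440)    # same index, different tag
```

In hardware the lookup is one block RAM read plus a 13-bit equality compare, which is the part that has to fit in the bus timing.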

trag · Oct 30, 2021

Zane Kaminski said:
implement the entirety of the tag stuff (memory and address compare) in an FPGA e.g. Xilinx Spartan-6. The tags for a 1 MB cache fit perfectly in the (~$10) Spartan-6 XC6SLX9. You can get it in BGA-256 (1.0mm pitch) or TQFP-144 (0.5mm pitch). It’s got 32 1kx18 RAM blocks so you can make a 32kx18 tag store.

Clever. I like the concept. Making the Block RAMs of an FPGA into the TAG RAM is a great solution. Gets around the problem that there just doesn't seem to be properly sized tag RAMs available.
