ThinkC Mixing the text feedback immediacy of ANSI projects and toolbox graphics from MacTraps

Mu0n · Jul 22, 2023

No, I'm just using Photoshop CS 2 to convert to grayscale, then to 1-bit (diffusion dither), while playing with contrast and brightness to influence the result before switching to 1-bit.

Then it's export to PICT and transfer to Basilisk. Edit in ResEdit to give it file type PICT, open in SuperPaint v2, and copy & paste into a resource file in ResEdit.

Mu0n · Jul 22, 2023

Turns out my SE/30 had its ROM SIMM only partially inserted (it got pushed out),
so I didn't have to recap its analog board ASAP.

Mu0n · Jul 31, 2023

@Crutch , another one made specifically for your talents.

Striving to do a video about it will convince me to get to the finish line. I've been mulling over these methods for multiple years, as evidenced here:

What's your recollection of Mac Plus offscreen CopyBits refresh rate?

Context: I'm trying to refresh a Mac Plus' whole screen 512x342 (for starters..., I will reduce that size going forward) by installing a VBL Task whose purpose is to CopyBits the contents of an offscreen bitmap to the screen bits of what is shown. Of course, the best way to do this is to limit...

68kmla.org

Target machine: Mac Plus or other similar 68000 compacts
Goal: offer less-than-fullscreen graphics drawing that beats CopyBits and does something BlockMove can't: copy non-contiguous, regularly spaced chunks of graphical data without requiring the data to be full-width, as BlockMove would need it to be.

Feel free to demolish/support my assumptions:

-68000 being 16-bit addresses and data registers, do a tight asm loop with words (=2 bytes = 16 bits), right? I'm not targeting a 32-bit clean ROM and later 680xx CPU.

Here's my code:

C:

asm {
        move.l nbOfLines,d1 //number of lines to do

@startofline:
        move.l nbOfWords,d0 //number of words (or 2-bytes, 16bits) to draw
        move.l source,a0
        move.l dest,a1
@stilldoingline:
        move.w (a0)+,(a1)+  //copying 1 word or 2 bytes or 16 pixels
        dbeq d0,@stilldoingline
     
        addi #0x0020,source //bump source and dest by $20, which 512 in decimal, next line
        addi #0x0020,dest
        dbeq d1,@startofline //done with a line
 
    }

source and dest are addresses to the bases of the bitmaps I'm using. These addresses are well defined since they are exactly what I use in other functions I'm testing, based on CopyBits and BlockMove. bm is an offscreen bitmap structure.

Code:

    source = (long)bm.baseAddr;
    dest = (long)backgroundWin->portBits.baseAddr;

My test currently attempts to draw a 128x72 bitmap picture, so 8 words wide (nbOfWords = 8) and 72 lines (nbOfLines = 72) long. Checking the Monitor (with cmd-M) yields good contents of the address and data registers at the beginning of the loop, using A0, A1, D0, D1 in the prompt (I don't have Macsbug or TMON installed, just the default Apple programmer thingy).

The debugger easily shows me that my addi command doesn't increase the address by what I want, which would be 0x20 (which is 64 words = 512 bits), it increased the most significant word instead of the least significant one.

after the first addi line:

Mu0n · Jul 31, 2023

ok, I'm getting some progress; instead of doing this:

Code:

addi #0x0020,source //bump source and dest by $20, which 512 in decimal, next line
addi #0x0020,dest

I do this instead:

Code:

add.l #0x40,source //bump source and dest by 0x40, which 512 in decimal, next line
add.l #0x40,dest

gives this:

instead of this:

you can kind of see the structure of the black area at first, but it gets skipped over quicker and then gets into garbage pretty fast. I've tried every combination of skipping faster/slower. I'm checking that the crucial move.w (a0)+,(a1)+ moves up values the right way under the monitor and yes, it increments the address values of both by exactly 2 bytes every time that line executes.

Crutch · Jul 31, 2023

Sure, a couple things.

If you want to add something to a long, you want addi.l. Your initial code had just “addi”, which defaults to addi.w. Looks like you fixed that.

It seems you are trying to advance at the end of each line by 512 pixels = 64 bytes = 0x40. (Addresses are counted in bytes not words, of course.). So you were also on the right track by changing 0x20 to 0x40 in your second example.

However … you don’t actually want to advance by 512 pixels because you are already at the RIGHT EDGE of your image and you want to advance to the LEFT EDGE of the next row. That’s (512 pixels - image width), which I think is (512-128) = 384 in your example.

384 bits = 48 bytes = 0x30. You should be advancing by 0x30 at the end of each line.

Mu0n · Jul 31, 2023

Crutch said:
However … you don’t actually want to advance by 512 pixels because you are already at the RIGHT EDGE of your image and you want to advance to the LEFT EDGE of the next row. That’s (512 pixels - image width), which I think is (512-128) = 384 in your example.

384 bits = 48 bytes = 0x30. You should be advancing by 0x30 at the end of each line.

for a few minutes before I wrote my last post this morning, I thought this was it, but that can't be for 2 reasons:

1) I wouldn't have gotten a nice rectangular area of graphics if I was starting from the right edge on the next line
2) I'm adding 0x40 to a left aligned backup of the base addresses + j*0x40 for both my bitmap and my windowptr portBits (well, not really its base address, but at a certain constant position from the base but you get the point). The a0 and a1 registers start from there and are incremented in the loop during a given line.

I'll humor your solution though, because it costs nothing and I've tried it earlier myself (now both images are supposed to be on the same screen for comparison purposes. The one that appears correctly on the right come from a classic CopyBits using the exact same BitMap structure, while the staggered bad one is the asm block copy with the += 0x30 to the base address each line).

Mu0n · Jul 31, 2023

Sanity checks:

Code:

Start of 72 lines loop
   Start of 8 words loop
      source = (long)bm.baseAddr = 0x398F6F0
      dest = (long)backgroundWin->portBits.baseAddr + 16 (just to put it 16*8 pixels to the right, arbitrarily somewhere else) = 0x05EC0010
      a0 = 0398F6F0
      a1 = 05Ec0010
      d0 = 00000008  //0x08 words = 128 pixels
      d1 = 00000048  //0x48 lines = 72 lines

After 4 copies of words,
d0 = 00000004 //half-way through the line
d1 = 00000048 //no line has been finished yet

Status of the top of the image (the SOURCE), you can even "see" the first lone white pixel near the top left in this chunk inside the monitor (I did DM 398F6F0, TD as in Dump Memory from a specific memory address):

Status of the top of the destination area in the window (I did DM 05EC0010):

This is strange as I would expect for FFFF's to start appearing here!
Also unsettling, I can't get the leftmost 2 hex characters of the addresses I need to peek in this view of memory. Am I in the right place?

After looping back to line #2, we have these addresses:

we see that source = former bm.baseAddr + 0x40 and dest = former backgroundWin->portBits.baseAddr + 0x40 (+ 0x10 arbitrary offset I put in), so it should still be aligned to the left of the image, 64 bytes later
at the start of line 2,

a0 = 0x03 98F72C (same as the updated source)
a1 = 0x05 EC0050 (same as the updated dest)
d0 = 8 //ready to do a full line
d1 = 47 //the total of lines has been reduced by 1

Checking the source, the lone E has been displaced?

Still no change from a dump of the destination (but the question remains, am I viewing the right place?)

Crutch · Aug 1, 2023

Oh, I thought you were blitting from screen to screen. I didn’t realize your source bitmap (as opposed to the desired image within the bitmap) was only 128 pixels wide.

In that case,

1) Your source bitmap is 128 pixels wide. That’s an even multiple of 16. Your rowBytes should be 16. You don’t need to add anything to the source pointer at the end of a line. The bits in memory just continue immediately with the next row. Delete the line “addi.l #0x20, source”.
2) Your destination bitmap is the portBits of a window, which is just screenBits. That’s 512 pixels wide. As I said before, if you have just blitted 128 bits, to get to the next row you must add (512 - 128) = 384 bits = 48 bytes = 0x30. I promise this is true!

Mu0n · Aug 1, 2023

Crutch said:
Oh, I thought you were blitting from screen to screen. I didn’t realize your source bitmap (as opposed to the desired image within the bitmap) was only 128 pixels wide.

In that case,

1) Your source bitmap is 128 pixels wide. That’s an even multiple of 16. Your rowBytes should be 16. You don’t need to add anything to the source pointer at the end of a line. The bits in memory just continue immediately with the next row. Delete the line “addi.l #0x20, source”.

2) Your destination bitmap is the portBits of a window, which is just screenBits. That’s 512 pixels wide. As I said before, if you have just blitted 128 bits, to get to the next row you must add (512 - 128) = 384 bits = 48 bytes = 0x30. I promise this is true!

Point 1 is correct. I got blindsided by the (a0)+,(a1)+, thinking everything was symmetrical, which of course it isn't! Thanks for the sanity check.
On point 2, I believe you're missing the fact that I'm using 2 variables for addressing instead of a single running one. The first variable is a register (A1) and lets me jump from word to word on a single line. The second variable is a kind of beginning-of-line bookmark that's used to seed where a1 points at first, then patiently waits for the line to be drawn, then gets updated with a +0x40 to go to the next line, before passing that address to A1 again so *it* can run off to do its words.

Change 1 - commenting out addi.l #0x40,source
yields this:

perfectly normal, since I have to stop doing move.l source,a0 in @startofline to stop myself from feeding a0 a now stagnant address that never gets pushed on, only the post-increment (a0)+ will push a0 along now, so

Change 2: I put that line only once before the loops:

uhhhhhhhhh...............................what.

Crutch · Aug 1, 2023

Oh, you want DBRA not DBEQ there. DBEQ says “first check if condition code Z is true and if so, stop looping, else continue unless the counter has run out”. You just want to loop until the counter runs out, which is DBRA (equivalently DBF ... “never stop looping unless the counter has run out”). I can’t instantly see why this is giving you the result it is, but I’m pretty sure that first DBEQ will start blitting the next line of the source bitmap anytime you move over 16 white pixels, which is of course not what you want.

If that doesn’t fix it, there is something weird happening and I will try your code myself.

And sorry you’re right about my misinterpretation of #2. My brain refused to see that you were doing it differently than I would.

Not to be annoying but your way is slightly slower because you are doing the extra “MOVE.L dest,a1” once per line. If you just use a single dest register and increment it by 0x30 once per line, you can delete that MOVE.L (and free up a register). Also, by the way, you should ensure THINK is putting all those variables in registers, and consider blitting 32 pixels at a time instead of 16 by changing the inner-loop MOVE.W to a MOVE.L and adjusting your counter to use longs instead of words.

Mu0n · Aug 1, 2023

YES, thank you! A little bit more was needed though:

asm block is now:

Code:

    asm {
        move.l nbOfLines,d1 //number of lines to do
        move.l source,a0      
@startofline:
        move.l nbOfWords,d0 //number of words (or 2-bytes, 16bits) to draw
        move.l dest,a1
@stilldoingline:      
        move.w (a0)+,(a1)+  //copying 1 word or 2 bytes or 16 pixels

        dbra d0,@stilldoingline  
        addi.l #0x00000040,dest //move up 1 line = 512 pixels
        dbra d1,@startofline //done with a line
    }

but in order to make this work, both loop max values had to be decreased by 1.
nbOfWords had to become 7 instead of 8
nbOfLines had to become 341 instead of 342
since I guess you have to take into account the 0th value passthrough of those loops.

As for using 32-bit long words instead of just 16-bit words, that was one of my earlier interrogations a few posts ago. Doesn't it make sense to use 16-bit for the 68000 processor as it only has 16 pins for address lines and just 16 pins for data lines? My thought is that using 32-bit chunks of data and 32-bit address has to be broken down into separate steps for the high 16 and low 16 bits, that's gotta add some overhead, no? Maybe it makes sense with a 68030 or 68040 processor, but not for the lowly 68000? I'm absolutely unsure of myself in these assertions, but you know what? I can just go ahead and test it out and measure how slower this is.

As for doing a +0x30 manually, knowing that my bitmap is 128 pixels wide and that my display is fixed at 512 pixels wide screen, I know that this adds just an operation that could be removed. I can now measure it to see if it's of any consequence across a massive amount of blits. But my rationale for doing this with an extra pointer makes it more versatile for other sizes of bitmaps, while doing it your way is hardcoding a 128 pixel wide bitmap. I could do multiple functions, each hardcoded in their respective manners for a handful of wanted sizes. Again, I can just time things out to see if it's all worth it.

Mu0n · Aug 1, 2023

EDIT- these results are no good, see below for correct ones.

Early results (mini-vMac, Finder 4.1, under 1x speed):

ASM block copy with words (16bit) of a 128x72 bitmap, average of 100 times: 166 microseconds per blit (copybits was 14166 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 200 times: 83 microseconds per blit (copybits was 14250 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 500 times: 33 microseconds per blit (copybits was 14300 microseconds!)

I thought it'd be faster, but this is next level faster.

ASM block copy with long words (32bit) of a 128x72 bitmap, average of 100 times: 166 microseconds per blit (copybits was 14166 microseconds!)
ASM block copy with long words (32bit) of a 128x72 bitmap, average of 200 times: 83 microseconds per blit (copybits was 14250 microseconds!)
ASM block copy with long words (32bit) of a 128x72 bitmap, average of 500 times: 33 microseconds per blit (copybits was 14300 microseconds!)

Absolutely no change so far. The real test is gonna be later on real metal.

edit - I also tried to get rid of move.l dest,a1 and just increment +0x30 to a1 between lines, absolutely no change in benchmarking, at least as far as mini-vMac 1x speed is concerned.

Crutch · Aug 1, 2023

Even on a 68000, a MOVE.L (a0)+, (a1)+ takes (I believe) only 67% longer than a MOVE.W. Source: https://oldwww.nvg.ntnu.no/amiga/MC680x0_Sections/timmove.HTML

Looks like 20 vs 12 clocks.

So if you do (MOVE.L DBRA) instead of (MOVE.W DBRA MOVE.W DBRA) it should definitely be faster.

But yeah, try it on real hardware and see!

Crutch · Aug 1, 2023

Oh, I see why the optimizations aren’t improving your timing.

Highly suspicious fact: Your benchmarks are always giving the same result. 100*166 = 200*83 = 500*33 ≈ 16600 microseconds = 1/60 second = exactly 1 tick.

Explanation: You are getting all these blits done inside the timing resolution of TickCount.

Try increasing the number of blits until the total duration increases roughly linearly with # blits. Then measure the impact of optimizations from there.

Mu0n · Aug 2, 2023

I'm an idiot. The move from C to asm removed the outer loop, so I was always doing it only once.

I've set yet another register for the repeat loop control.

Also I read that I shouldn't use any old register freely to make sure I don't lose control as per:

so I replaced:
A0 -> A2
A1 -> A3
A3 -> A4

D0 -> D3
D1 -> D4
D2 -> D6

fun fun fun times refactoring the asm code that's now slightly repeated in 4 places

mini-vMac benchmarks at 1x speed:

16-bit - move.w
using a flexible reference pointer allowing for other sizes (aka +0x40 to go to the next line with that reference)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 100 times: 3166 microseconds per blit (copybits was 14166 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 200 times: 3083 microseconds per blit (copybits was 14250 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 500 times: 3100 microseconds per blit (copybits was 14300 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 2000 times: 3091 microseconds per blit (copybits was 14300 microseconds!)

using an inflexible reference and jumping to next line with +0x30, a hardcoded accounting for image width used
ASM block copy with words (16bit) of a 128x72 bitmap, average of 100 times: 2666 microseconds per blit (copybits was 14166 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 200 times: 2750 microseconds per blit (copybits was 14250 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 500 times: 2700 microseconds per blit (copybits was 14300 microseconds!)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 2000 times: 2691 microseconds per blit (copybits was 14300 microseconds!)

32-bit - move.l
using a flexible reference pointer allowing for other sizes (aka +0x40 to go to the next line with that reference)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 100 times: 2500 microseconds per blit (copybits was 14166 microseconds!)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 200 times: 2416 microseconds per blit (copybits was 14250 microseconds!)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 500 times: 2400 microseconds per blit (copybits was 14300 microseconds!)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 2000 times: 2400 microseconds per blit (copybits was 14300 microseconds!)

using an inflexible reference and jumping to next line with +0x30, a hardcoded accounting for image width used
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 100 times: 2000 microseconds per blit (copybits was 14166 microseconds!)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 200 times: 2000 microseconds per blit (copybits was 14250 microseconds!)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 500 times: 2000 microseconds per blit (copybits was 14300 microseconds!)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 2000 times: 1991 microseconds per blit (copybits was 14300 microseconds!)

You are vindicated at every turn @Crutch :

1) going for 32-bit copying instead of 16-bit copying gives a roughly 35% speed increase
2) hardcoding the jump to the next line destination gives a roughly 15% speed increase
3) doing both gives a roughly 55% speed increase!

sideways takeaway: I can finally stop doing so many repetitions, 100x is plenty enough.

Crutch · Aug 2, 2023

Nice and great to see the intuitive results match the benchmarking! This is a really nice project by the way.

Side note: in your asm code as you shared above, you were actually OK using a0, a1, d0, etc. Those registers are considered “trashable“ registers by the ROM and so can be nuked any time you call a Toolbox routine, but are guaranteed to be preserved within an asm block between calls to the Toolbox. (Because if you don’t call the Toolbox, and you never leave the asm block, no other code is executing between your asm statements … except maybe interrupts, which must preserve all registers.)

Mu0n · Aug 2, 2023

mini-vMac benchmarks at 1x speed VS real Mac Plus

16-bit - move.w
using a flexible reference pointer allowing for other sizes (aka +0x40 to go to the next line with that reference)
ASM block copy with words (16bit) of a 128x72 bitmap, average of 100 times: 3166 microseconds VS 3666
ASM block copy with words (16bit) of a 128x72 bitmap, average of 200 times: 3083 microseconds VS 3666
ASM block copy with words (16bit) of a 128x72 bitmap, average of 500 times: 3100 microseconds VS 3633
ASM block copy with words (16bit) of a 128x72 bitmap, average of 2000 times: 3091 microseconds VS 3616

using an inflexible reference and jumping to next line with +0x30, a hardcoded accounting for image width used
ASM block copy with words (16bit) of a 128x72 bitmap, average of 100 times: 2666 microseconds VS 3166
ASM block copy with words (16bit) of a 128x72 bitmap, average of 200 times: 2750 microseconds VS 3166
ASM block copy with words (16bit) of a 128x72 bitmap, average of 500 times: 2700 microseconds VS 3200
ASM block copy with words (16bit) of a 128x72 bitmap, average of 2000 times: 2691 microseconds VS 3175

32-bit - move.l
using a flexible reference pointer allowing for other sizes (aka +0x40 to go to the next line with that reference)
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 100 times: 2500 microseconds VS 2833
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 200 times: 2416 microseconds VS 2833
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 500 times: 2400 microseconds VS 2833
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 2000 times: 2400 microseconds VS 2833

using an inflexible reference and jumping to next line with +0x30, a hardcoded accounting for image width used
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 100 times: 2000 microseconds VS 2500
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 200 times: 2000 microseconds VS 2416
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 500 times: 2000 microseconds VS 2366
ASM block copy with longs (32bit) of a 128x72 bitmap, average of 2000 times: 1991 microseconds VS 2366

Mu0n · Aug 2, 2023

Also, rereading that thread again in this post:

What's your recollection of Mac Plus offscreen CopyBits refresh rate?


particularly this passage from National Treasure @Crutch,

Also, you should not be calling CopyBits from a VBL task. CopyBits is on the list of “Routines that may Move or Purge Memory” in Inside Macintosh and therefore shouldn’t be called at interrupt time. (If you must use CopyBits you can set a flag in your VBL task and have your application’s main event loop check the flag and call CopyBits itself when appropriate.) If you roll your own bit blit routine by writing directly to the screen, you won’t have this problem, which is another reason to do it in your case.

Testing that idea out: I get flickering again, because I can't ensure that the graphic will be done in a timely fashion before the electron beam cuts through my CopyBits operation.
But I do confirm that doing CopyBits inside the VBL task will produce garbled graphics from uncertain RAM locations.
All I'm doing is CopyBits from global BitMaps that contain only total black or total white, and I'm using a Rect to do the drawing (for black) and wiping (for white) operations from these BitMaps to these locations.

I know this is moot since I'll eventually use my better ASM block copy function, but it'd be nice to SHOW how CopyBits can be used and shown to be slower.
Am I relegated to CopyBits into an alt buffer during a frame, and switch to it every other frame? I made it work without that before, I don't remember how. I could dig up very old code.
