SCAMP update
Thu 24 June 2021
Tagged: cpu, electronics, software
I've made a bit more progress on my SCAMP CPU. I/O performance is improved significantly since last time, the CompactFlash card now lives on a PCB instead of a breadboard, I'm using a real (ish) serial console instead of an FTDI cable, and I have a more permanent power supply instead of the bench power supply.
Here's a video I recorded showing the computer booting up, and writing, compiling, and running a short program:
I would like to share more stuff via video, but I'm not very good at it, and still feel silly while talking to the camera on my own.
It previously took about 19 seconds to get from reset to the shell, which is now down to 6 seconds, mainly because of the CompactFlash I/O improvements in the kernel. I wasn't specifically looking to optimise the boot time. 19 seconds is fine. A faster boot is just a happy side effect.
I have finally got around to rewriting the CompactFlash I/O inner loops in assembly language.
It's probably beyond obvious to real assembly language programmers, but I found that unrolling the loop was a big improvement. My first attempt was more like:
    loop: in x, CFDATAREG   # 5 cycles (read data from device)
          ld (r3++), x      # 8 cycles (write to buffer)
          dec r0            # 5 cycles (loop overhead)
          jnz loop          # 4 cycles (loop overhead)
You can see that it only spends 13 out of 22 cycles (59%) on instructions that are actually copying data. The rest is loop overhead. With the loop unrolled, that's now 104 out of 113 cycles (92%). We're copying 256 words each time we read a block from the card, so the total cycle count has reduced from about 5600 to about 3600 (-35%), just by unrolling the loop into groups of 8.
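The arithmetic is easy to check with a few lines of Python. This is only a sketch: the cycle counts are the ones quoted in the listing above, and the function name is made up for illustration.

```python
# Cycle counts as quoted in the listing above.
IN_CYCLES = 5    # in x, CFDATAREG
LD_CYCLES = 8    # ld (r3++), x
DEC_CYCLES = 5   # dec r0
JNZ_CYCLES = 4   # jnz loop

def block_copy_cycles(words, unroll):
    """Cycles to copy `words` words with the copy pair repeated `unroll` times per pass.

    Assumes `words` is a multiple of `unroll`.
    """
    per_pass = unroll * (IN_CYCLES + LD_CYCLES) + DEC_CYCLES + JNZ_CYCLES
    return (words // unroll) * per_pass

rolled = block_copy_cycles(256, 1)     # 256 passes of 22 cycles
unrolled = block_copy_cycles(256, 8)   # 32 passes of 113 cycles
saving = 1 - unrolled / rolled         # fraction of cycles saved
print(rolled, unrolled, saving)
```

Trying other values of `unroll` shows the diminishing returns: the 9 cycles of loop overhead are already a small fraction of each pass at 8.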
I found that the memcpy() on every disk I/O operation was also quite a significant bottleneck. That also benefited from loop unrolling, using the Duff's device trick of jumping part way into a multiple-of-N loop to copy non-multiple-of-N blocks of data. It's not particularly easy to read, but you can read my memcpy() implementation here. I found calculating the jump target too inconvenient, so I did it by indexing into an array of loop offsets instead. It ain't stupid if it works.
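Python can't jump into the middle of a loop, so the following is only a model of the control flow, not a translation of the real memcpy(). It assumes an unroll factor of 8, and `ENTRY_SLOT` is a hypothetical stand-in for the array of loop offsets: it maps the remainder `n % 8` to the position in the unrolled body where the first pass starts.

```python
UNROLL = 8

# Hypothetical lookup table, indexed by n % UNROLL. Entering the unrolled
# body at slot s runs UNROLL - s copies before the first loop test.
ENTRY_SLOT = [0, 7, 6, 5, 4, 3, 2, 1]

def unrolled_copy(dst, src, n):
    """Copy n items from src to dst; return how many loop tests were needed."""
    if n == 0:
        return 0
    passes = (n + UNROLL - 1) // UNROLL   # one loop test per pass
    slot = ENTRY_SLOT[n % UNROLL]         # where to "jump in" on the first pass
    i = 0
    for _ in range(passes):
        # Each step here stands in for one copy instruction of the unrolled body.
        while slot < UNROLL:
            dst[i] = src[i]
            i += 1
            slot += 1
        slot = 0                          # later passes run the whole body
    return passes

src = list(range(10))
dst = [None] * 10
print(unrolled_copy(dst, src, 10), dst == src)
```

Copying 10 items takes two passes: the first enters at slot 6 and copies 2 items, the second copies the remaining 8.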
The text editor was originally grabbing characters from the console via the kernel, using the read() call on the input file descriptor. This worked OK, except arrow keys didn't work. Arrow keys work by sending an escape character, followed by a "[", followed by a letter to indicate which arrow it is. Left arrow is the 3-character sequence "^[ [ D".
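As a rough illustration of what the editor has to do with those sequences, here is a minimal decoder sketch. The function name and the returned key names are made up for this example; only the ESC, "[", letter byte sequences themselves are standard.

```python
ESC = 0x1B

# Final byte of the 3-byte sequence ESC [ <letter> for each arrow key.
ARROWS = {ord("A"): "up", ord("B"): "down", ord("C"): "right", ord("D"): "left"}

def decode_keys(data):
    """Turn raw console bytes into a list of key names and plain characters."""
    keys = []
    i = 0
    while i < len(data):
        is_arrow = (
            data[i:i + 2] == bytes([ESC]) + b"["
            and i + 2 < len(data)
            and data[i + 2] in ARROWS
        )
        if is_arrow:
            keys.append(ARROWS[data[i + 2]])
            i += 3
        else:
            keys.append(chr(data[i]))
            i += 1
    return keys

print(decode_keys(b"a\x1b[Db"))
```

Note that the decoder only works if all three bytes actually arrive. If the middle byte is dropped, the sequence degrades into an escape followed by a stray letter.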
The reason arrow keys didn't work is that with the serial line running at 115200 baud, and the CPU at 1 MHz, we only get about 86 clock cycles to process each incoming character before another one is here. The overhead of grabbing each character via the kernel was just way too high. I tried slowing the serial line down to 9600 baud, but it still wasn't enough to handle arrow keys.
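The cycle budget per character can be worked out like this. The sketch assumes 8N1 framing (one start bit, eight data bits, one stop bit, so 10 bits on the wire per character), which is an assumption on my part but matches the figure of about 86 cycles.

```python
def cycles_per_char(clock_hz, baud, bits_per_char=10):
    """CPU cycles available between back-to-back incoming characters.

    bits_per_char=10 assumes 8N1 framing: start bit + 8 data bits + stop bit.
    """
    return clock_hz * bits_per_char / baud

print(cycles_per_char(1_000_000, 115200))  # budget at 115200 baud
print(cycles_per_char(1_000_000, 9600))    # budget at 9600 baud
```

Dropping to 9600 baud raises the budget from about 86 cycles to a little over 1000, which is still tight if every character has to go through a system call.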
Even aside from that, I found the text editor almost unusably slow. It redraws parts of the display on every keypress and this was taking too long. I profiled this in the emulator and found that it was spending about 480,000 cycles between accepting a character of input and being ready to accept the next character. At 1 MHz, that translates to 480 ms, which means if you type any faster than 25 wpm, it's going to start dropping characters. I like to type at more like 120 wpm, so that's no bueno.
Most of the 480,000 cycles were spent on providing output via the kernel, rather than on calculating anything interesting.
So to solve these problems, I rewrote the text editor's text input and output code in assembly language, inside the text editor. It now interacts with the UART directly instead of using the kernel. This isn't ideal, because for example if I ever gain support for multiple console devices, the text editor will need to handle that separately instead of getting it for free via the kernel. But that seems unlikely anyway.
The editor now takes only 78,000 cycles to process an input character, which means it's good for about 150 wpm, which I was initially delighted with. However, I eventually realised that this gets worse at the end of the line. It redraws the entire line every time the line changes, so when there are more characters it takes longer. At the end of the line this rises to about 213,000 cycles, or 56 wpm, which is still a bit on the slow side. So there's probably more work to do here. I'll probably just rewrite the line-drawing code in assembly language.
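The wpm figures come from a simple conversion. This sketch assumes the usual convention of 5 characters per word and a 1 MHz clock; the function name is just for illustration.

```python
CHARS_PER_WORD = 5  # conventional typing-speed definition (assumed here)

def max_wpm(cycles_per_keypress, clock_hz=1_000_000):
    """Fastest typing speed the editor can keep up with, in words per minute."""
    return clock_hz * 60 / (cycles_per_keypress * CHARS_PER_WORD)

for cycles in (480_000, 78_000, 213_000):
    print(cycles, round(max_wpm(cycles)))
```

This gives 25 wpm for the original 480,000 cycles, about 154 wpm for 78,000 cycles, and about 56 wpm for 213,000 cycles.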
From watching the video, you might be surprised that it takes about a minute to compile a single-line program. This is probably my next avenue for optimisation, because compiling programs is one of the things I definitely want to be able to do with this computer. (My ultimate goal is to use this computer to solve as much as possible of this year's Advent of Code).
I'm not completely sure what it is wasting so much time on. Originally, compiling programs was slow because it re-compiled the entire standard library every time. But the library is now cached as an (approximation of an) object file that can just be dumped at the start of the generated binary. The pattern on the lights suggests that compilation is very I/O bound. I need to profile it and find out.
I have now got a PCB to hold the CompactFlash card:
The long wire that's not connected to anything is meant to indicate the internal card's "busy" status with an LED on the front, but I decided to rip off the pad that it was soldered to, so it doesn't work at the moment.
And it contains provision for a removable "external" card which can be inserted through the front panel. There is no software support for this yet (and in fact I may have done something wrong on the PCB, because with a CompactFlash card plugged in to the 2nd slot the computer won't even boot correctly; need to investigate).
I'm currently using the RC2014 VGA Serial Terminal to connect SCAMP to a keyboard and monitor, but this isn't completely ideal because a.) I'd like to put this back in my RC2014, and b.) the pins aren't labelled, so it is quite tricky to make sure I am plugging everything into the correct ones.
The creator of this board, Marco Maccaferri, also designed a similar board that is a standalone VGA serial terminal, rather than plugging into the RC2014 backplane. Unfortunately he no longer offers kits for sale, but they are available from a website called connect.gi. I found the seller to be friendly and helpful.
I have received my standalone VGA serial terminal, but I decided to solder the 3.3v regulator (not pictured) on backwards so now it doesn't work. I then broke one of the legs while trying to desolder it, but hopefully once my replacement regulator arrives I will be able to switch to the standalone board.
I bought this power supply ages ago but hadn't got around to testing it until this week:
It is a 5v 10A DC power supply. It turns out that 10 amps is about 9.5 amps more than I need, but at least it's futureproof.
One strange quirk of SCAMP is that it doesn't work properly at 5.0v, it needs more like 4.8v. I previously believed that this was just down to miscalibration of the voltage display on my bench power supply, but now I believe it is because I have not made any effort to meet the spec required for interfacing with the CompactFlash card. It seems to get stuck waiting for the card while trying to read the kernel off the disk if running at 5.0v. But for the time being, I don't care what the problem is. I just run it at 4.8v and it's fine.
The power supply has a trimpot on it to let you adjust the output voltage, but I found that even at the lowest setting it was barely below 5.0v, and the computer wouldn't boot. The trimpot was variable from 0 Ohms to 1 kOhm, and I found that the lowest output voltage was achieved with the trimpot set to the highest resistance. So the solution is clear: I just need to replace the trimpot with one that goes higher.
I replaced the 1 kOhm trimpot with a 2 kOhm trimpot that I already had in stock, and was able to set it to 4.8v and the computer boots! Great success.
Don't tell anyone I'm not qualified service personnel.
One of the bugs last time was that the text output from the bootloader and kernel startup was missing a load of characters, because they tried to output bytes faster than the serial line could take them. I've now made all of the serial output code check the UART status before writing anything, so that problem is gone.
Another bug was that typing a space would sometimes insert a nul character instead of a space. I do not know what was causing this problem. It has gone away now that I'm not stuffing bytes into the UART faster than it can transmit them, so maybe I was running into a strange edge case of the UART? Not sure.
Another bug-in-waiting was that my "32 MB" CompactFlash card only has 62,270 blocks instead of 65,536. The SCAMP filesystem assumes that it always has 65,536 blocks to play with, so this is no good, and would eventually lead to block contents disappearing into a black hole. My current "solution" to this problem is to mark the missing blocks as "used" so that they can never be allocated for use. This isn't an ideal mechanism because it makes resizing the filesystem non-trivial (there is nothing to indicate which blocks are actually used and which are just missing). But it'll do for now.
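Here is a minimal sketch of the workaround. It assumes a free-space bitmap with one bit per block, where a set bit means "used". The real SCAMP filesystem layout may well differ, and all of the names are hypothetical.

```python
TOTAL_BLOCKS = 65536   # what the filesystem assumes it has
REAL_BLOCKS = 62270    # what the card actually provides

def make_bitmap():
    """Hypothetical free-space bitmap: one bit per block, 1 = used."""
    return bytearray(TOTAL_BLOCKS // 8)

def mark_used(bitmap, block):
    bitmap[block // 8] |= 1 << (block % 8)

def is_used(bitmap, block):
    return bool(bitmap[block // 8] & (1 << (block % 8)))

def reserve_missing_blocks(bitmap):
    """Mark every block past the end of the card as used; return how many."""
    for block in range(REAL_BLOCKS, TOTAL_BLOCKS):
        mark_used(bitmap, block)
    return TOTAL_BLOCKS - REAL_BLOCKS

def allocate(bitmap):
    """Return the first free block and mark it used, or None if the disk is full."""
    for block in range(TOTAL_BLOCKS):
        if not is_used(bitmap, block):
            mark_used(bitmap, block)
            return block
    return None

bitmap = make_bitmap()
print(reserve_missing_blocks(bitmap))  # number of blocks reserved
```

The drawback is visible in `is_used()`: it returns the same answer for a block that holds data and a block that doesn't physically exist, so a resize tool has no way to tell them apart from the bitmap alone.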
I have designed an initial CAD model for the case I'd like to build:
And ordered some fancy "Russian birch" plywood to make it out of. Unfortunately I failed to notice when ordering that the plywood has a lead time of about 25 days, so I won't be receiving that any time soon.
I still need to build a clock circuit. I have a bit of a mental ugh field around this. I'm not sure why. I think there is a bit of a dependency loop, because I don't know how to best design the clock circuit before I know what frequency it needs to run at, and I won't know what frequency the CPU can cope with until I have a better clock.
I would like to do some profiling of the compiler, assembler, and compiler driver, to work out why it is taking nearly 60 seconds to compile a 1-line program. I'm sure a lot of it is overhead that will scale sub-linearly with program size, but it definitely feels like this should be faster.
I have found that the VGA serial terminal does not seem to handle the page up/down keys, at all. It's not just that my text editor doesn't understand the escape sequences: it doesn't receive any escape sequences! If I can't work out how to get the VGA serial terminal to send them, then I'll probably just provide Ctrl-U and Ctrl-D as synonyms for page up/down.
I have also found that the VGA serial terminal sends code 127 for the "DEL" key, whereas gnome-terminal (or Linux, or something else in the pipeline) sends 127 for the backspace key. That's annoying. Before discovering that, I had just supported every observed variation of codes for each special key, so that it supports both types of terminal transparently. Unfortunately I am now in some danger of reinventing termcap, which I would like to avoid.