The ultimate SCAMP development environmentWed 15 June 2022
Tagged: cpu, software
I've been thinking about what I want the SCAMP development workflow to look like for this year's Advent of Code. At first I was planning to do it in FORTH, but I've tried to get into FORTH several times and haven't got on with it. I like the simple REPL of FORTH but I very much do not like the language. So the plan is to come up with a way to make a REPL for SLANG.
The reason to prefer a REPL over the edit-compile-test workflow is that it saves wasting time on recompiling and rerunning the entire program every time a small portion changes. In principle you can update a function definition, and then re-compile and re-run that function, without having to recompile the rest of the program, or repeat (say) parsing of the input file.
I think it would not be too much trouble to make an environment that lets me type a line of SLANG code at the command line, and then compiles the code, loads the compiled code into memory, and executes it, "inside" the process of the development environment. This wouldn't be compiling a program as such, it would just be a tiny amount of code each time. As long as we tell the compiler the addresses of all of the known globals, and what the start address of the code will be, this seems perfectly viable.
Once we have an interactive environment where we can type in single statements and compile and execute them, there are a number of interesting things we could do:
- Notice declarations of globals and allocate them manually, so that they can be reused in later statements.
- Remember the source for each function definition so that it can be edited later without having to type it out from scratch.
- Save every input line in a history file so that the history can be manually edited down and turned into a standalone program.
- Save the process binary out to a file that can be executed to get back to the current state.
I have already got a rudimentary implementation just to find out how well it works. It's called "RUDE", for something like "RUDE Ultimate Development Environment". It's only about 350 lines of code. It can notice global declarations, compile and execute code, and save out the project in both source and binary form.
You can try it out in the online SCAMP emulator if you want, just run rude at the shell to get into RUDE.
One advantage over FORTH is that function calls in SLANG always have an extra layer of indirection. This only exists to simplify the compiler (there's no need for a distinction between ordinary function calls and function pointer calls if every function is a function pointer... Galaxy brain), but it means when we update the implementation of a function, all of the callers of that function automatically use the new implementation. In FORTH, callers of a function still use the old implementation until they are in turn recompiled, which is bad because you can fix a bug and find that your fix is not actually in use until you track down everything that calls the function that contained the bug.
Here's an example RUDE session with what I have so far (in the emulator, running about 20x faster than real hardware):
(All of the "... 1st pass ..." stuff is spam from the compiler and assembler.) The important things to note in this session are:
- Printing "Hello, world!" using printf. The call to printf is compiled, then executed. The text "Hello, world!" appears on the console, and we see the return value of 14 from printf (the number of characters written).
- Declaring a variable "n" and initialising it to 0. The assignment is compiled and then executed. There is no real return value, so the displayed return value of 20960 is some spurious memory address.
- Declaring a variable called "f" and initialising it to point at a function that increments "n" and then prints out the value of "n". The function definition and assignment is compiled and then executed, which executes the assignment to "f" (but does not execute the function body).
- Calling f(). We can see that "n" is incremented with each call (and we get the return value of 20 from printf, by accident).
- Exiting RUDE (typing ^D to give EOF on stdin). We now have files project.sl and project. project.sl contains a source listing of the session, and project contains a copy of the process binary from memory, taken at runtime after the latest command line input.
- Executing ./project takes us back into RUDE (but with no "RUDE AWAKENING" message, because we're dropped straight back to the REPL, we're not restarting RUDE), with the in-memory state exactly as it was when we exited! Good magic.
- Calling f() proves that both the function implementation and the value of "n" are retained.
My current implementation, although it proves the concept (maybe), is not workable for several reasons:
Compiling takes too long
It turns out that it takes over 30 seconds (at 1 MHz) to compile the empty string. This means almost all of the compilation time, for typical statements, is wasted on overhead which is exactly the same every time.
What is this overhead?
- preparing source inside RUDE: ~8 seconds
- context switching in the kernel: ~7 seconds
- compiler: ~7 seconds
- assembler: ~10 seconds
(Assuming I didn't blunder the profiling).
When we're compiling "the empty string", we're not just compiling the empty string. RUDE also has to provide names and addresses of globals, of which there are a bit over 200 in a blank environment (mostly the functions in the standard library). A lot of time is spent formatting and writing these 200 names and addresses (in RUDE) and reading and parsing them (in the compiler and assembler). I think that accounts for most of the overhead.
The rest of the overhead is context switching in the kernel, which means saving the parent process (RUDE) out to disk, and reading child processes into memory (compiler and assembler). I've spent quite a lot of time optimising context switching in the past so I expect there's not low-hanging fruit here, short of reducing the number of context switches. Currently we have RUDE swapped out, the compiler swapped in, RUDE swapped back in to do a tiny amount of work, RUDE swapped back out again, the assembler swapped in, and RUDE finally swapped back in again, which is about 6 lots of reading/writing a process to disk.
Unknown binary size
When we compile a statement, we need to malloc() some space to put it in, and we need to tell the assembler what the starting address will be (there's not yet any sensible way to create position-independent code).
Perhaps you can see the problem here: we don't know what address the code will be placed at until after we've allocated space, and we don't know how much space to allocate until after we've generated the code, which we can't do until after we know where it's going to be placed.
Currently RUDE solves this by always allocating 1000 words for every statement. This is bad, both because it is way too much in the normal case and pointlessly wastes memory, but also plausibly because it could be too small in some cases.
Memory leak for almost all statements
But it's worse than just temporarily allocating too much space for each statement. We're permanently allocating too much space for each statement!
RUDE has no knowledge at all of what the statements are doing. The only special case it currently supports is that statements beginning var are declarations of globals, which require special handling so that the addresses of the globals can be tracked.
Lots of statements are completely ephemeral. For example, printf("Hello, world!\n", 0) doesn't contain any data that needs to persist after it has finished executing, so it would be perfectly fine to free() the memory that we put the code in, after it's been executed. But without peering inside the code (and, in turn, peering inside printf() to work out whether it is expecting to keep hold of the string), there's no way to know that the statement doesn't contain static allocations that need to stick around.
Statements that include function definitions, strings, or array literals contain static allocations, and if executing the statement results in a reference to one of the aforementioned allocations being kept around somewhere, then it is not safe to free (all of) the code space because that would result in freeing the memory backing the corresponding function, string, or array literal.
Currently RUDE just doesn't free anything. The compiled code for every input statement stays in memory forever, even when no references to it remain. Incidentally, this is the same approach that simpler FORTH implementations take, as far as I can tell. They just don't even implement free(), all allocations take the "HERE" pointer, and advance it ready for the next allocation. That said, FORTH doesn't have to allocate code space for most input statements, because they are executed without being compiled.
We need to solve all of the above 3 "killer problems" in order to turn RUDE into a usable system. I haven't got a complete picture in mind yet, but I have the following ideas which might form part of it.
"Interpreting" simple statements
Simple statements could be interpreted directly inside RUDE. We already have the table of all known globals. If we did a tiny amount of parsing, to recognise simple literals and function calls, and throw everything else to the real compiler, then we could "interpret" simple statements by pushing function arguments onto the stack and then calling the function pointer that we look up in the globals table. We might save both execution time, because simple statements aren't compiled, and memory usage, because we don't need any memory to store code we're not producing.
I definitely want to do this, in the absence of some other great breakthrough that makes compilation effectively instantaneous (see below).
Put the compiler and assembler inside RUDE
The limit case of interpreting simple statements is to interpret all statements. The easiest way to do this would be to basically copy and paste the source of the compiler (and assembler) and put them both inside the RUDE process, so that we never have to context switch, and we already have the table of global names and addresses ready to hand.
This would be an ideal solution, apart from one thing: memory. Currently the RUDE binary is about 15 Kwords, the compiler is 18K, and the assembler is 20K. That comes to 53K, which is already too much to fit in memory (taking account of the kernel), let alone any code or data for the user program.
In practice it wouldn't be as bad as that because the library functions would not need to be duplicated. The library blob comes to 9K, so if we lose 2 copies of that, then we could expect putting the compiler and assembler inside the RUDE process to bring it up to 35K, which is still on the extremely heavy side.
This idea might be worth playing with anyway, because I can't think of any better way to reduce compilation overhead. But the memory cost is very high, it certainly wouldn't be workable for the larger Advent of Code problems. But maybe it could work if I develop in RUDE, get it to work on the sample input, and then turn it into a standalone program to run on the full input.
If everything were inside the one process, it might also make it easier to defer memory allocation until such a time as the length of the binary is known.
Give names and address to the compiler and assembler in a better format
Currently the names of globals are passed to the compiler as ordinary SLANG source, like:
extern printf; extern getchar; extern malloc; ...
These are just directives that tell the compiler that a name exists as a global but is declared somewhere else. It's just a way to say that it's fine to refer to these globals, even though the compiler didn't see them, because something else is going to tell the assembler where they live.
Similarly, they are defined for the assembler as ordinary assembler directives, like:
.d _printf 0x2f10 .d _getchar 0x2f0a .d _malloc 0x2e8d ...
So that the assembler knows that any reference to _printf is actually a reference to 0x2f10.
The problem with this scheme is that it is designed to be easy for humans to write, and parsing it with the ordinary parser is very slow. Since this table is both generated and consumed by the machine, maybe it could be in a binary format which is faster to work with, and can be parsed with no validation or bounds-checking.
(I care about performance, not security. Fight me. Nobody's complaining that RUDE is a non-viable project because it's too easy for hackers to break into your SCAMP!).
Stop the compiler from validating names of globals
Rather than create a whole new format for passing globals into the compiler, we could just have a switch to make the compiler assume that all unrecognised names are valid globals. The downside is that typo'd names won't get rejected until assembly time, and the message will be slightly more confusing, but it would save us time in the compiler.
It's probably not worth doing because even if we saved 100% of the overhead in the compiler, and 50% of the overhead in RUDE, it would only knock about 7 seconds (25%) off the total time to compile the empty string.
Have the compiler directly output machine code instead of assembly language
Instead of having the compiler write assembly code, and then passing the assembly code to the assembler to turn it into machine code, we could have the compiler output machine code directly.
Reasons not to do this include:
- Opcodes can change whenever I change the instruction table, so any code that cares about opcodes needs a system to keep it updated, which is a hassle. (Although I don't do that much any more, so maybe it's fine to say opcodes are set in stone now).
- We'd still need some equivalent to the "2nd pass" of the assembler which touches up addresses that weren't already known during the first pass. Probably this would be mostly copy-and-pasted out of the assembler and into the compiler.
- It's helpful to be able to have the compiler driver (either slc or RUDE) pass in its own assembly language header and footer that can use labels instead of hard-coded addresses. Without a separate assembly step these addresses would probably have to be patched up in the binary, which is obviously more trouble and less robust.
But those are small problems.
The assembler accounts for the majority of the runtime in compiling normal programs as well as short statements in RUDE, so removing it could more than double the compiler throughput. So I probably want to do this.
Then it might be worth revisiting "Put the compiler inside RUDE" because we would save almost the entire memory cost of the assembler.
Split up lexing and parsing
The parsing framework I use for both the compiler and the assembler does not distinguish between lexing and parsing. Every input token to the parser is a single character. This introduces quite a lot of overhead because sometimes the same characters are examined many times until enough backtracking has happened to work out how to parse them. Historically this was tackled by first tokenising the input (so each character is only examined once) and then parsing the (much shorter) stream of tokens instead of characters.
I'm not sure how much help this would be. I have structured the parser in both the compiler and the assembler to try to minimise the amount of backtracking on common inputs. It's possible that separate tokenising would actually make performance worse because now instead of 1 stream there are 2 streams that need to be consumed!
Compile everything twice
To work out how much memory we need for the code, we could compile it once with a placeholder address, measure how large it is, make the allocation, and then compile it "for real". This would double the time spent on compiling, which makes it sound absurd, but in combination with some other solution that makes compiling very fast, it might be the easy way out.
In principle the size of the generated code can depend on the address it is going to be placed at (because some instructions have shorter forms for addresses in the top page and the bottom page), but in practice I think it would always be the same size, because we're not putting code in the top or bottom page.
Make realloc() shrink in-place
Currently realloc() always makes a new allocation, copies the data, and then frees the old allocation. But when you're reducing the size of an allocation, it's always safe to just shrink the existing allocation and return the unused space to the free space pool.
So we could just allocate a block that's way too large, put our code in it, and then shrink it to fit.
The only reason I haven't implemented shrink-in-place already is that almost every use case I had for realloc() involves growing an allocation to handle more data.
I think making realloc() shrink in-place is the best solution to not knowing the binary size ahead of time.
Generate position-independent code
I remembered that I have already written up my thoughts on how position-independent code could work for when I wanted to implement strace().
Basically it involves making a no-op function call, so that after it returns you have your Program Counter in r254, and then you know the global offset to where the code is running and can patch up all the addresses before running the "real" code. It's a bit of a hassle but would work.
So maybe I could do that and then defer allocation until after the code is already compiled? Probably a bad idea, it seems both more difficult to implement and more wasteful of memory than the "Make realloc() shrink in-place" idea.
Manual statement annotation
To solve the problem of wasting memory on side-effect-free statements, we could require the user to manually annotate "important" statements with an exclamation mark, so that they're not immediately free()'d after execution. This seems error-prone though, I'm not sure it's a good idea.
Ben pointed out that the interactive REPL system I'm describing is a bit like IPython. IPython annotates each statement with a number, and each statement sticks around until manually deleted. Maybe I could do something like that, where instead of deciding ahead-of-time whether a statement should be free()'d or not, the user just has the option to manually free statements that they no longer care about. This would also be a good way to manage the SLANG source history without having to manually delete dead-ends.
I don't really like it, but maybe it's not as bad as it sounds? And I don't have any other ideas for how to stop wasting memory, so maybe it's the best you can do?
So I have a lot of ideas to be pondering, but still no complete picture for how RUDE is going to work.
I think the way to go is to start with the promising ideas that would be useful even if RUDE never works, namely:
- Have the compiler directly output machine code instead of assembly language
- Make realloc() shrink in-place
And then at least I've got some benefit out of it, and I can reassess performance and memory usage to work out whether I want to press on with RUDE or ditch it.
If you like my blog, please consider subscribing to the RSS feed or the mailing list: