One thing I want to do is put all of these different symbols/instructions into the same 8-bit range. The CPU instructions take up about 170 symbols, and I have 32 contiguous numbers available for all of these “source code” usages. If I could put everything in the same space, only one transformation would need to happen… and perhaps only one representation in the program? Could I then merge all of these Instruction/Source Token/BytecoIL enums?
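A rough sketch of what a shared 8-bit space could look like. The boundaries are assumptions, not decisions: it pretends the CPU instructions sit contiguously at 0..169 and that the free block of 32 values is 0xE0..0xFF. The `SourceToken` members and `classify` are hypothetical names for illustration.

```python
from enum import IntEnum

# Assumed layout: the notes only say "about 170" CPU symbols and a free block
# of 32 contiguous values; the exact positions below are made up.
CPU_COUNT = 170
SOURCE_BASE = 0xE0   # assumed start of the 32-value block (0xE0..0xFF)
SOURCE_COUNT = 32


class SourceToken(IntEnum):
    # Hypothetical source-only symbols, numbered inside the 32-value block.
    COMMENT = SOURCE_BASE + 0
    NEWLINE = SOURCE_BASE + 1
    INDENT = SOURCE_BASE + 2
    ANCHOR_DEF = SOURCE_BASE + 3
    ANCHOR_REF = SOURCE_BASE + 4
    ROUTINE_CALL = SOURCE_BASE + 5
    MACRO_PARAM = SOURCE_BASE + 6
    IMPORT_NAME = SOURCE_BASE + 7


def classify(byte: int) -> str:
    """Report which part of the shared 8-bit space a symbol falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("symbol must fit in 8 bits")
    if byte < CPU_COUNT:
        return "cpu"
    if SOURCE_BASE <= byte < SOURCE_BASE + SOURCE_COUNT:
        return "source"
    return "unassigned"
```

With a layout like this, telling a CPU instruction from a source-only symbol is a range check on one byte, which is the "only one transformation" appeal of sharing the space.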
Source code needs to add the ability to create and reference anchors, to call routines by hash, and to include, pass, and define parameters on macros. It also needs the ability to declare whitespace and comments, and to somehow define imported names.
Name resolution is a big open question here. We always want to resolve to and include the hash to avoid ambiguity, but we also want to hold onto the “intended name” that the author wrote when referring to the routine. The name would probably only be used for computer-aided heuristics, such as checking package updates or finding alternative names for existing routines, but it probably shouldn’t be erased. It should be stored once per module of source code, though, not on each symbol. Does this mean that a module essentially includes a root name table (and a map of tables)? Is a module actually just what a Context is right now?
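A minimal sketch of the "name table lives on the module" idea, under assumptions: `Module`, `add_call`, and `routine_hash` are hypothetical names, and a truncated SHA-256 stands in for whatever hash the real system uses. The nested "map of tables" is left out.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def routine_hash(body: bytes) -> str:
    # Assumption: truncated SHA-256 as a placeholder for the real routine hash.
    return hashlib.sha256(body).hexdigest()[:16]


@dataclass
class Module:
    # hash -> the name the author typed; stored once per module, not per symbol
    names: dict[str, str] = field(default_factory=dict)
    # symbols refer to routines by hash only
    calls: list[str] = field(default_factory=list)

    def add_call(self, intended_name: str, resolved_hash: str) -> None:
        # Keep the first intended name seen for a hash; later uses reuse it.
        self.names.setdefault(resolved_hash, intended_name)
        self.calls.append(resolved_hash)

    def intended_name(self, resolved_hash: str) -> str | None:
        return self.names.get(resolved_hash)
```

The symbols stay unambiguous because they only carry hashes, while the name survives in one place for tooling to use.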
We start with a human-written version of the code, with names. Then we validate that code by making sure it is semantically valid and that the names resolve to hashes. Now we have validated code, which is different from the unvalidated source tokens. This is perhaps what I’ve been calling “bytecoil”.
Write: Comment · Source: Comment · Byteco: Ignored
Write: New Line · Source: New Line · Byteco: Ignored
Write: Tab · Source: Indent · Byteco: Ignored
Write: Instruction · Source: Instruction · Byteco: Instruction
Write: Number · Source: Number Literal · Byteco: Instruction(LIT)
Write: Routine Call
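The mapping listed above can be read as a filter over a token stream. Here is a small sketch of that reading; `Token`, the kind strings, and `lower` are hypothetical names, and the routine-call row is left unimplemented because its lowering isn't written down above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str             # "comment", "newline", "indent", "instruction", "number"
    value: object = None


# Source tokens that have no bytecode counterpart.
IGNORED = {"comment", "newline", "indent"}


def lower(tokens):
    """Lower source tokens to (opcode, operand) pairs."""
    out = []
    for tok in tokens:
        if tok.kind in IGNORED:
            continue                         # Comment / New Line / Indent -> Ignored
        if tok.kind == "instruction":
            out.append((tok.value, None))    # Instruction -> Instruction
        elif tok.kind == "number":
            out.append(("LIT", tok.value))   # Number Literal -> Instruction(LIT)
        else:
            # Routine calls and anything else: lowering not defined yet.
            raise ValueError(f"no lowering defined for {tok.kind!r}")
    return out
```

For example, `lower([Token("comment", "hi"), Token("instruction", "ADD"), Token("number", 7)])` gives `[("ADD", None), ("LIT", 7)]`.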
I keep trying to write things out like this, but it doesn’t feel useful. I’m doing this thing where I tokenize the source code, but then I turn it into structures, like Macro, Import, and Routine, which loses the token aspect. If these structures retained the format of being a list of tokens, it might be easier to reason about this whole system as a transformation of token streams. Another way to look at it is that we transform the text tokens into symbols, which can also have source tokens inside of them. This helps us reason about the structure, but doesn’t help with the idea of a simple export format for the code…
Source code can be in an unvalidated format while under development, and in a validated and normalized form when exported to the library as a symbol. I think this split is what is causing some confusion. A source module doesn’t necessarily have to contain validated symbols. What if the symbol set stayed the same through validation, and validation just culled symbols or ensured they were correct?
Am I worrying too much about this intermediate state? I really want a simple symbolic representation of the source code, but perhaps I’m spending too much time aligning the source code with the “symbol” output. The other weird thing is that a macro output is basically source-level code, while the routines are closer to final assembly. A macro should have things like parameter usage and anchor definitions, but the routine should have only instructions and hash-routine calls.
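One way to pin down the macro/routine difference is as two allowed sets of token kinds over the same symbol set. This is only a sketch; the kind names and `disallowed` are made up for illustration.

```python
# Hypothetical token kinds. A routine is close to final assembly; a macro
# may additionally contain source-level constructs.
ROUTINE_KINDS = {"instruction", "routine_call"}
MACRO_KINDS = ROUTINE_KINDS | {"param_use", "anchor_def"}


def disallowed(kinds, allowed):
    """Return the token kinds in a body that its container does not permit."""
    return [k for k in kinds if k not in allowed]


body = ["instruction", "param_use", "routine_call"]
print(disallowed(body, ROUTINE_KINDS))  # the parameter use is not routine-level
print(disallowed(body, MACRO_KINDS))    # everything is fine inside a macro
```

Seen this way, a routine is a macro body with the source-level kinds already resolved away.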
---
Once again, trying to think through the line between “co language tokens” and “cpu (loader) instructions”. When we assemble a source context, we have all kinds of extra stuff, like padding, comments, anchors, and names, that has no business bothering the CPU. Our goal is to make sense of these additional semantics in order to output the intended machine code. When assembling, we don’t even output routine calls with hashes; we strictly output an executable ROM that is pure COINS. When we encounter a routine call, we need to actually import that routine into the ROM and reference its compiled routine address. When we encounter a macro use, we create a macro rendering context with the passed-in parameters and then expand all labels, inserting the entire resulting byte sequence.
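A toy version of that assembly step, to make the two cases concrete. Everything here is an assumption for illustration: the tuple token shapes, the `CALL` opcode value, one-byte addresses, and placing imported routines before the main body. Label/anchor expansion and recursive routines are left out.

```python
CALL = 0xF0  # hypothetical opcode; the real COINS call encoding is not shown here


def assemble(main, library, macros):
    """Flatten a token body into ROM bytes.

    Token shapes (all assumed):
      ("byte", n)             raw instruction byte
      ("param", name)         macro parameter use
      ("call", routine_hash)  routine call by hash
      ("macro", name, args)   macro use with argument values
    """
    rom = bytearray()
    addresses = {}  # routine hash -> offset of its code in the ROM

    def import_routine(h):
        # Compile each library routine into the ROM once, then reuse its address.
        if h not in addresses:
            code = emit(library[h], {})  # may import further routines first
            addresses[h] = len(rom)
            rom.extend(code)
        return addresses[h]

    def emit(body, params):
        out = bytearray()
        for tok in body:
            kind = tok[0]
            if kind == "byte":
                out.append(tok[1])
            elif kind == "param":
                out.append(params[tok[1]])
            elif kind == "call":
                # The hash never reaches the ROM, only the compiled address.
                out += bytes([CALL, import_routine(tok[1])])
            elif kind == "macro":
                names, macro_body = macros[tok[1]]
                # Rendering context: bind the passed-in values to parameter names.
                out += emit(macro_body, dict(zip(names, tok[2])))
            else:
                raise ValueError(f"unknown token kind {kind!r}")
        return out

    code = emit(main, {})
    entry = len(rom)  # main body lands after all imported routines
    rom.extend(code)
    return bytes(rom), entry, addresses
```

The point of the sketch is that hashes, names, and macro parameters exist only on the input side; the returned bytes contain nothing but instruction bytes and addresses.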