I spent some more time thinking about redesigning the stack manipulation for the CPU. If you set aside all of the Forth words like Drop, Over, Pick, Roll, and Tuck, I think the actual core stack operations are Copy and Delete. Copy targets some elements of the stack and pushes copies of them onto the top, while Delete does the same targeting but just... deletes them. A "Move" operation is then just a copy followed by a delete of the originals. Drop is a delete, Over is a copy, Pick is a copy, Roll is a copy and delete, and Tuck is a different copy and delete.
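To check that the decomposition holds up, here is a minimal Python sketch that models the stack as a list (top at the end) and builds the familiar words out of nothing but copy and delete. The function names, the "depth 0 = top" convention, and the choice to keep copied elements in their original relative order are all assumptions made for illustration, not anything the CPU design has settled on.

```python
def copy(stack, depths):
    """Push copies of the elements at `depths` (0 = top), keeping their relative order."""
    picked = [stack[-1 - d] for d in sorted(set(depths), reverse=True)]
    return stack + picked

def delete(stack, depths):
    """Remove the elements at `depths` (0 = top)."""
    doomed = {len(stack) - 1 - d for d in depths}
    return [v for i, v in enumerate(stack) if i not in doomed]

def move(stack, depths):
    """Copy, then delete the originals, which now sit deeper by the number of copies."""
    n = len(set(depths))
    return delete(copy(stack, depths), [d + n for d in set(depths)])

# The classic words, expressed only through copy and delete.
def drop(s):    return delete(s, [0])
def dup(s):     return copy(s, [0])
def over(s):    return copy(s, [1])
def pick(s, n): return copy(s, [n])
def roll(s, n): return move(s, [n])
def swap(s):    return move(s, [1])
def tuck(s):    return delete(copy(s, [0, 1]), [3])  # a b -> a b a b -> b a b
```

Under these assumptions Tuck really is "a different copy and delete": it copies the top two elements and then deletes the original second element, which by that point sits at depth 3.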
I'm thinking that I can implement Copy and Delete with two bits for the bit-depth, one bit for the stack target, and at least one more bit to toggle between two forms: a parameter-less version with an implicit function, and a version with a 1-byte stack parameter that specifies the target.
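One way those fields could be packed is sketched below. The bit positions, the field names, and the helper functions are hypothetical; the only things taken from the plan above are the field sizes (two bits, one bit, one bit) and the fact that the parameter form is followed by one extra byte.

```python
# Hypothetical layout of the low four opcode bits (positions are an assumption):
#   bits 0-1: bit-depth selector, bit 2: stack target, bit 3: parameter toggle
WIDTH_SHIFT, WIDTH_MASK = 0, 0b11
STACK_SHIFT = 2
PARAM_SHIFT = 3

def encode_fields(width, stack, has_param):
    """Pack the three fields into a 4-bit value."""
    if not 0 <= width <= WIDTH_MASK:
        raise ValueError("width selector must fit in two bits")
    if stack not in (0, 1):
        raise ValueError("stack target must fit in one bit")
    return (width << WIDTH_SHIFT) | (stack << STACK_SHIFT) | (int(has_param) << PARAM_SHIFT)

def decode_fields(bits):
    """Unpack (width, stack, has_param) from a 4-bit value."""
    width = (bits >> WIDTH_SHIFT) & WIDTH_MASK
    stack = (bits >> STACK_SHIFT) & 1
    has_param = bool((bits >> PARAM_SHIFT) & 1)
    return width, stack, has_param

def instruction_length(bits):
    """One opcode byte, plus one parameter byte when the toggle bit is set."""
    return 2 if decode_fields(bits)[2] else 1
```

For example, `encode_fields(2, 1, True)` yields `0b1110`, and `instruction_length(0b1110)` is 2 because the toggle bit asks for the 1-byte stack parameter.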
---
After working through it a bit more, I ended up with the following: each stack will have "Copy", "Move", "Delete", and "Stash" opcodes that take a bitmask, plus shorthand opcodes like "Dup", "Drop", and "Swap". There's room for 8 opcodes on each stack, so I have one more to spend, though I might change Stash to operate on the top value only instead of a bitmask (or add that as an explicit operation with my remaining budget).
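Here is a rough Python sketch of how the bitmask forms and the shorthands might relate. It assumes an 8-bit mask where bit 0 selects the top of the stack and bit 7 the eighth element down; that mapping, the `targets` and `run` helpers, and the `SHORTHAND` table are illustrative guesses. Stash is left out because its exact behavior isn't pinned down here.

```python
def targets(mask):
    """Depths (0 = top) selected by an 8-bit mask; bit 0 = top is an assumption."""
    return [d for d in range(8) if (mask >> d) & 1]

def copy(stack, mask):
    """Push copies of the masked elements, deepest first, so order is preserved."""
    return stack + [stack[-1 - d] for d in reversed(targets(mask))]

def delete(stack, mask):
    """Remove the masked elements."""
    doomed = {len(stack) - 1 - d for d in targets(mask)}
    return [v for i, v in enumerate(stack) if i not in doomed]

def move(stack, mask):
    """Pull the masked elements out and put them back on top."""
    picked = [stack[-1 - d] for d in reversed(targets(mask))]
    return delete(stack, mask) + picked

# Shorthand opcodes as a general operation with a fixed, implicit mask.
SHORTHAND = {
    "dup":  (copy,   0b01),  # copy the top element
    "drop": (delete, 0b01),  # delete the top element
    "swap": (move,   0b10),  # move the second element to the top
}

def run(name, stack):
    op, mask = SHORTHAND[name]
    return op(stack, mask)
```

Seen this way, the shorthands cost an opcode slot each but save the parameter byte for the most common masks.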
Either way, I'm feeling a lot better about this stack manipulation setup than I did about my previous one, with its extra stacks and the hold register.