- 04. June 2013 - Dolphin-Emu
So, some people may have been aware that i am reworking the DSP LLE JIT of Dolphin-Emu. My main goal is to enable better optimizations during recompiling. To accomplish that, the original JIT has been split into a parser, optimizer and emitter, using a graph structure describing the instruction flow in between.
But first, let me step back a bit and explain the mess i just wrote(If you are already familiar with the acronyms used above, you can skip this paragraph). Dolphin-Emu is an emulator that emulates the Nintendo GameCube and Wii hardware. On these consoles, the Digital Signal Processor(DSP) is usually used to produce and reprocess audio signals. The DSP can be instructed to load a program(also called microcodes or ucodes) from main system memory. There are only few different programs in use. In emulation, there are two classes of emulation strategies: High Level Emulation(HLE) or Low Level Emulation(LLE). HLE recognizes programs or parts of programs and replaces them with native functions, while LLE uses the program itself to replicate the behavior of the real processor. With LLE, the programs can be either interpreted sequentially, one DSP instruction at a time, or recompiled from DSP instruction set to host machine instruction set in part or whole. If the program parts only are recompiled when they are needed, the program parts are recompiled Just In Time(JIT).
The DSP instruction set has many specialized instructions, like special instructions to form loops and instructions that (1)store a value from register to memory, (2) process contents of registers into another registers and (3)load a value from memory into a register, all in the same instruction. Loops are used quite a lot, and leaving the recompiled code and finding the place to reenter seems to be costly.
The original (and still only shipping) DSP LLE JIT recompiler goes through the code one instruction at a time, until it either hits a branch instruction or the maximum block size. There are some optimizations happening already, like keeping guest registers in host registers for as long as possible, but these need to work with the linear processing of the DSP instructions.
This way of doing business prevents some useful optimizations, like globally determining the value of certain registers. It is better to have a whole program block in some kind of intermediate representation, so the properties of instructions can be easily determined and the instruction flow easily manipulated. For this, the split into a parser, optimizer and emitter seems logical.
The parsers job is similar to the job of the original recompiler. But instead of creating an instruction stream for use by the host processor, it builds a graph of the instruction flow, containing all the information necessary for emitting it. It also inserts some special instructions used to check for interrupts and loop ends.
The Emitter just takes the "graph"(at this point, it has to be a linear chain of instructions, except for branches) and emits the host instruction stream for that. It handles branching, recombining and looping instruction flows mostly by itself(the actual branch still happens in the instruction emitter).
The optimization step is where all the interesting stuff happens. There are three things it is required to do before the emitter can do its job:
- The instructions must be broken down into their constituents, that is, each instruction accessing a guest register gets split into guest loading instructions, executing instructions and guest storing instructions. The multi-action instructions mentioned above already get split up into parallel graph parts during parsing.
- The instructions must be linearized, leaving only boring linear graph segments(with the exception of branching).
- Host registers must be assigned to the instructions in the graph.
Currently, these steps are implemented, along with some simple constant propagation mechanisms. Performance is the same as or better than the original JIT, although host register lifetime is still limited to single instructions, i.E. the used guest registers get loaded and stored for every instruction.