Keep the logic in each pipeline stage minimal to reduce design complexity and maximize clock speed (roughly the same approach as MIPS, but more extreme).
- No speculative branches.
- Use branch delay slots.
- No operand forwarding.
- All instructions have the same latency.
  - I.e. every instruction has trailing "delay slots".
- No data hazard resolution.
  - Exception: Cache misses (if applicable) cause the entire pipeline to stall.
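The effect of these rules on software can be sketched with a small model: one instruction issues per cycle, operands are read at issue time, and every result lands in the register file a fixed number of cycles later. The latency of 3 cycles, the tuple format and the `add`/`nop` operations are assumptions made up for this illustration, not part of the design.

```python
from collections import deque


def run(program, regs, latency=3):
    """Issue one instruction per cycle; results land `latency` cycles later.

    Operands are read from the register file at issue time. There is no
    forwarding and no interlock, so reading a register before its pending
    write-back has landed simply returns the old value.
    """
    regs = dict(regs)
    in_flight = deque()  # (write_back_cycle, dest, value), in issue order
    for cycle, (op, dest, src_a, src_b) in enumerate(program):
        # Commit every result whose write-back cycle has been reached.
        while in_flight and in_flight[0][0] <= cycle:
            _, d, value = in_flight.popleft()
            regs[d] = value
        if op == "add":
            in_flight.append((cycle + latency, dest, regs[src_a] + regs[src_b]))
        elif op != "nop":
            raise ValueError(f"unknown op: {op}")
    # Drain results that are still in the pipeline.
    for _, d, value in in_flight:
        regs[d] = value
    return regs


NOP = ("nop", None, None, None)
start = {"r1": 1, "r2": 2, "r3": 0, "r4": 0}

# r4 = r3 + r3 issued right after r3 = r1 + r2: it reads the stale r3 (0).
too_early = run([("add", "r3", "r1", "r2"), ("add", "r4", "r3", "r3")], start)

# Two delay slots (latency - 1) filled with NOPs: r3 holds 3 when it is read.
padded = run(
    [("add", "r3", "r1", "r2"), NOP, NOP, ("add", "r4", "r3", "r3")], start
)
```

In this model the compiler or programmer is responsible for filling the `latency - 1` slots after each instruction with independent work or NOPs.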
Possible optimization to reduce the number of delay slots: partition the execution part of the pipeline into several pipelines, each with its own register file (e.g. integer + float + fixed point). This would require a straightforward way to transfer data between register files.
        Branch               Write back
    _______________    _______________________
   /               \  /                       \
  v                 \v                         \
  PC -> IF -> ID -> RF -> EX1 -> ... -> EXn -> WB
        ^                                ^
        |                                v
      ICache                           DCache
Branch - 2 delay slots:
- BN: branch if register is negative, target PC+immediate
- BP: branch if register is positive, target PC+immediate
- BA: branch always, target PC+immediate
- J: jump always, target address taken from a register
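A minimal sketch of how two branch delay slots look to software. Several details are assumptions for illustration only: the immediate is counted in instructions relative to the branch's own PC, "positive" is taken to mean strictly greater than zero, registers hold signed Python integers, and `LI` (load immediate) is a hypothetical helper instruction. Branches placed inside delay slots are not modelled in any meaningful way.

```python
DELAY_SLOTS = 2


def run(program, regs, max_steps=100):
    """Return (trace of executed PCs, final registers)."""
    regs = dict(regs)
    pc, pending, trace = 0, None, []
    while 0 <= pc < len(program) and len(trace) < max_steps:
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        # An earlier taken branch redirects fetch once its slots are used up.
        if pending is not None:
            slots_left, target = pending
            slots_left -= 1
            if slots_left == 0:
                next_pc, pending = target, None
            else:
                pending = (slots_left, target)
        taken, target = False, None
        if op == "BA":
            taken, target = True, pc + args[0]
        elif op == "BN":
            taken, target = regs[args[0]] < 0, pc + args[1]
        elif op == "BP":  # assumption: zero does not count as positive
            taken, target = regs[args[0]] > 0, pc + args[1]
        elif op == "J":
            taken, target = True, regs[args[0]]
        elif op == "LI":  # hypothetical helper: load immediate
            regs[args[0]] = args[1]
        elif op != "NOP":
            raise ValueError(f"unknown op: {op}")
        if taken:
            pending = (DELAY_SLOTS, target)
        pc = next_pc
    return trace, regs


program = [
    ("BN", "r0", 4),  # 0: branch to 0 + 4 if r0 is negative
    ("LI", "r1", 1),  # 1: delay slot, always executes
    ("LI", "r2", 2),  # 2: delay slot, always executes
    ("LI", "r3", 3),  # 3: skipped when the branch is taken
    ("LI", "r4", 4),  # 4: branch target
]
zero = {"r0": 0, "r1": 0, "r2": 0, "r3": 0, "r4": 0}
taken_trace, taken_regs = run(program, {**zero, "r0": -1})
fallthrough_trace, _ = run(program, {**zero, "r0": 1})
```

The two instructions after the branch execute whether or not the branch is taken; only the instruction at index 3 depends on the outcome.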
Compare - set all bits of register to 1/0 if true/false:
- SEQ, SNE, SLT, SLTU, etc.
- Conditional write-back is simple to implement.
  - E.g. "discard result of next instruction if not true".
- 16 GP registers (per pipeline, e.g. integer + float?).
- Possibly only SIMD registers?
- Size? 32/64/more bits?
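The all-ones/all-zeros compare results and the conditional write-back idea above can be sketched as follows. The sketch assumes 32-bit registers (the size is still an open question here), and `select` and `write_back` are hypothetical helper names, not proposed instructions.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def seq(a, b):
    return MASK32 if (a & MASK32) == (b & MASK32) else 0


def sne(a, b):
    return MASK32 if (a & MASK32) != (b & MASK32) else 0


def slt(a, b):
    return MASK32 if to_signed(a) < to_signed(b) else 0


def sltu(a, b):
    return MASK32 if (a & MASK32) < (b & MASK32) else 0


def select(mask, if_true, if_false):
    """Branch-free select built only from AND, OR and NOT."""
    return ((if_true & mask) | (if_false & ~mask)) & MASK32


def write_back(regs, dest, value, cond_mask):
    """Conditional write-back: the result is discarded if the mask is zero."""
    if cond_mask & MASK32:
        regs[dest] = value & MASK32


smaller = select(slt(3, 7), 3, 7)  # minimum of 3 and 7 without a branch

regs = {"r1": 10}
write_back(regs, "r1", 99, seq(1, 2))  # condition false: r1 keeps its value
write_back(regs, "r1", 42, seq(2, 2))  # condition true: r1 is overwritten
```

One consequence of the all-ones encoding is that a true result reads as -1 in two's complement, so it is negative and could be tested directly by BN.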
Use fixed-size 32-bit instruction words.
- Makes it easier to keep a constant stream of instructions (one per clock).
- Loading 32-bit or 64-bit immediate values is cumbersome without proper operand forwarding.
 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|Op         |1 1|  Rd   |  Ra   |  Rb   | ? (shift/mask/func?)  | <- ALU
|Op         |1 0|  Rd   |  Ra   |             Imm16             | <- Load+ALU
|Op         |0 1| Imm4  |  Ra   |  Rb   |         Imm12         | <- Store
|Op         |0 0| Imm4  |  Ra   |             Imm16             | <- Branch
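The four formats can be exercised with a small encoder/decoder. The bit positions used here are inferred rather than stated: 4-bit register fields (16 registers) plus a 16-bit immediate leave 6 bits for Op in bits 31..26 and 2 format bits in 25..24. Treating Imm4 as the high bits of the split store and branch immediates is a further assumption, and immediates are handled as unsigned values with no sign extension.

```python
TYPE_ALU, TYPE_LOAD_ALU, TYPE_STORE, TYPE_BRANCH = 0b11, 0b10, 0b01, 0b00


def _field(value, bits, shift):
    """Range-check a field value and move it to its bit position."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value << shift


def _head(op, kind):
    return _field(op, 6, 26) | _field(kind, 2, 24)


def encode_alu(op, rd, ra, rb, func=0):
    return (_head(op, TYPE_ALU) | _field(rd, 4, 20) | _field(ra, 4, 16)
            | _field(rb, 4, 12) | _field(func, 12, 0))


def encode_load_alu(op, rd, ra, imm16):
    return (_head(op, TYPE_LOAD_ALU) | _field(rd, 4, 20) | _field(ra, 4, 16)
            | _field(imm16, 16, 0))


def encode_store(op, ra, rb, imm16):
    # Assumption: Imm4 carries bits 15..12 of the immediate, Imm12 bits 11..0.
    return (_head(op, TYPE_STORE) | _field(imm16 >> 12, 4, 20)
            | _field(ra, 4, 16) | _field(rb, 4, 12)
            | _field(imm16 & 0xFFF, 12, 0))


def encode_branch(op, ra, imm20):
    # Assumption: Imm4 carries bits 19..16 of a 20-bit immediate.
    return (_head(op, TYPE_BRANCH) | _field(imm20 >> 16, 4, 20)
            | _field(ra, 4, 16) | _field(imm20 & 0xFFFF, 16, 0))


def decode(word):
    """Split a 32-bit instruction word into named fields."""
    fields = {
        "op": (word >> 26) & 0x3F,
        "type": (word >> 24) & 0x3,
        "ra": (word >> 16) & 0xF,
    }
    top4 = (word >> 20) & 0xF  # Rd or Imm4, depending on the format
    kind = fields["type"]
    if kind == TYPE_ALU:
        fields.update(rd=top4, rb=(word >> 12) & 0xF, func=word & 0xFFF)
    elif kind == TYPE_LOAD_ALU:
        fields.update(rd=top4, imm=word & 0xFFFF)
    elif kind == TYPE_STORE:
        fields.update(rb=(word >> 12) & 0xF, imm=(top4 << 12) | (word & 0xFFF))
    else:
        fields.update(imm=(top4 << 16) | (word & 0xFFFF))
    return fields


alu_word = encode_alu(op=1, rd=2, ra=3, rb=4, func=5)
store_word = encode_store(op=2, ra=1, rb=2, imm16=0xABCD)
branch_word = encode_branch(op=0, ra=5, imm20=0x12345)
```

Because Ra sits in bits 19..16 in all four formats, the decoder can extract it before it knows the instruction type.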
Also consider VLIW.