This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
This document is mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2.
The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions.
Consider the following C code, for example:
int readidx(int *p, size_t idx)
{ return p[idx]; }
This is a simple case of array indexing, a very common operation. Consider the compilation of this for x86_64:
mov eax, [rdi+rsi*4]
ret
or ARM:
ldr r0, [r0, r1, lsl #2]
bx lr // return
Meanwhile, the required code for RISC-V:
# apologies for any syntax nits - there aren't any online risc-v
# compilers
slli a1, a1, 2     # a1 = idx * 4
add a0, a0, a1     # a0 = p + idx * 4
lw a0, 0(a0)       # a0 = p[idx]
jalr x0, 0(x1)     # return
RISC-V's simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its numerous prefixes).
The simplification of an instruction set should not be pursued to its limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations.
We should distinguish the "Complex" instructions of CISC CPUs - complicated, rarely used, and universally low performance, from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and high performance.
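To make the "break it down" direction concrete, here is a minimal sketch in C of a toy front end cracking an indexed load into shift, add and load micro-ops. Everything in it is an assumption made for illustration: the `uop` structure, the `TMP` scratch register, `crack_indexed_load`, `run` and `load_via_uops` are hypothetical names and do not describe any real core.

```c
#include <stdint.h>

/* Hypothetical micro-op kinds for a toy core; not taken from any real design. */
typedef enum { UOP_SHL, UOP_ADD, UOP_LOAD } uop_kind;

typedef struct {
    uop_kind kind;
    int dst, src1, src2;  /* register numbers; src2 is unused by SHL and LOAD */
    int imm;              /* shift amount, used by UOP_SHL only */
} uop;

enum { TMP = 31 };  /* assumed scratch register that software cannot name */

/* Crack "dst = mem[base + (index << scale)]" into three simple micro-ops.
 * Returns the number of micro-ops written to out[]. */
static int crack_indexed_load(int dst, int base, int index, int scale, uop out[3])
{
    out[0] = (uop){ UOP_SHL,  TMP, index, 0,   scale };
    out[1] = (uop){ UOP_ADD,  TMP, base,  TMP, 0 };
    out[2] = (uop){ UOP_LOAD, dst, TMP,   0,   0 };
    return 3;
}

/* Execute micro-ops against a register file and a toy memory of 32-bit
 * words (byte addresses are divided by 4 to index it). */
static void run(const uop *u, int n, uint32_t reg[32], const int32_t *mem)
{
    for (int i = 0; i < n; i++) {
        switch (u[i].kind) {
        case UOP_SHL:  reg[u[i].dst] = reg[u[i].src1] << u[i].imm; break;
        case UOP_ADD:  reg[u[i].dst] = reg[u[i].src1] + reg[u[i].src2]; break;
        case UOP_LOAD: reg[u[i].dst] = (uint32_t)mem[reg[u[i].src1] / 4]; break;
        }
    }
}

/* Compute p[idx] through the cracked sequence, with p at byte address 0. */
static int32_t load_via_uops(const int32_t *mem, uint32_t idx)
{
    uint32_t reg[32] = {0};
    uop seq[3];
    reg[10] = 0;    /* base address of the array */
    reg[11] = idx;  /* element index */
    int n = crack_indexed_load(10, 10, 11, 2, seq);
    run(seq, n, reg, mem);
    return (int32_t)reg[10];
}
```

The cracking step is a fixed, local expansion of one instruction. Going the other way (fusion) means spotting the same three operations spread across separate instructions and proving the intermediate register is dead, which is the harder problem described above.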
- Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.
- Same instruction (`JALR`) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)
  - Call: `Rd` = `R1`
  - Return: `Rd` = `R0`, `Rs` = `R1`
  - Indirect branch: `Rd` = `R0`, `Rs` ≠ `R1`
  - (Weirdo branch: `Rd` ≠ `R0`, `Rd` ≠ `R1`)
- Variable length encoding not self synchronizing (This is common - e.g. x86 and Thumb-2 both have this issue - but it causes various problems both with implementation and security e.g. return-oriented-programming attacks)
- RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accommodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)
- Multiply is optional - while fast multipliers occupy non-negligible area on tiny implementations, small multipliers can be created which consume little area, and it is possible to make extensive re-use of the existing ALU for multiple-cycle multiplication.
- `LR`/`SC` has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache)
  - This appears to be a substitute for a CAS instruction, see comments on that
- FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode
- FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)
  - This could be easily rectified - size encoding `2'b10` is free
  - Update: V2.2 has a decimal FP extension placeholder, but no half-precision placeholder. The mind kinda boggles.
- How FP values are represented in the FP register file is unspecified but observable (by load/store)
  - Emulator authors will hate you
  - VM migration may become impossible
  - Update: V2.2 requires NaN boxing wider values
- No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its implications:
  - Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
  - No conditional selects (useful for highly unpredictable branches)
  - No add with carry/subtract with carry or borrow
  - (Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)
- Highly precise counters seem to be required by the user level ISA. In practice, exposing these to applications is a great vector for side-channel attacks
- Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not
- No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics inexpensive (only 1 bit of CPU state required for minimal single CPU implementations).
- `LR`/`SC` are in the same extension as more complicated atomic instructions, which limits implementation flexibility for small implementations
- General (non-`LR`/`SC`) atomics do not include a `CAS` primitive
  - The motivation is to avoid the need for an instruction which reads 5 registers (`Addr`, `CmpHi:CmpLo`, `SwapHi:SwapLo`), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress `LR`/`SC` which is provided to replace it
- Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit
- For RV32I, no way to transfer a DP FP value between the integer and FP register files except through memory
- e.g. RV32I 32-bit `ADD` and RV64I 64-bit `ADD` share encodings, and RV64I adds a different `ADD.W` encoding. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead
- No `MOV` instruction. The `MV` assembler alias is implemented as `MV rD, rS` -> `ADDI rD, rS, 0`. `MOV` optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical `MV` requires ORing a 12-bit immediate
  - Absent a `MOV` instruction, `ADD rD, rS, r0` would actually be a preferable canonical `MOV` as it is easier to decode and CPUs normally have special case logic for recognizing the zero register
- `JAL` wastes 5 bits encoding the link register, which will always be `R1` (or `R0` for branches)
  - This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)
  - This is a regression from the v1.0 ISA!
- Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped)
  - It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation.
- No loads with register offsets (`Rbase` + `Roffset`) or indexes (`Rbase` + `Rindex` << `Scale`).
- `FENCE.I` implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer
- In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation
  - Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue
- No architecturally defined "hint" encoding space. Hint encodings are those which execute as `NOP`s on current processors but which have some behavior on later variants
  - Common examples of pure "`NOP` hints" are things like spinlock yields.
  - More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)
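Two of the decode-related points in the list above, the overloaded `JALR` and the `ADDI`-based `MV`, can be sketched in C to show what a front end has to inspect. This is an illustrative sketch only: the helper names (`field`, `classify_jalr`, `is_canonical_mv`) and the `jalr_role` enum are made up here, and the classification follows the table in the list above rather than any particular implementation. The list's `R0`/`R1` are `x0`/`x1` in the spec's register naming.

```c
#include <stdint.h>
#include <stdbool.h>

/* Roles a branch predictor would want to tell apart (names are hypothetical). */
typedef enum { NOT_JALR, JALR_CALL, JALR_RETURN, JALR_INDIRECT, JALR_OTHER } jalr_role;

/* Extract `width` bits starting at bit `lo` from a 32-bit instruction word. */
static uint32_t field(uint32_t insn, int lo, int width)
{
    return (insn >> lo) & ((1u << width) - 1u);
}

/* Classify a 32-bit instruction using the rd/rs1 rules from the list above. */
static jalr_role classify_jalr(uint32_t insn)
{
    if (field(insn, 0, 7) != 0x67 || field(insn, 12, 3) != 0)
        return NOT_JALR;                 /* opcode JALR, funct3 = 000 */
    uint32_t rd  = field(insn, 7, 5);
    uint32_t rs1 = field(insn, 15, 5);
    if (rd == 1)
        return JALR_CALL;                /* links into x1 */
    if (rd == 0)
        return rs1 == 1 ? JALR_RETURN : JALR_INDIRECT;
    return JALR_OTHER;                   /* the "weirdo branch" case */
}

/* Recognise the canonical MV, i.e. ADDI rd, rs1, 0. All 12 immediate bits
 * must be checked. Note that this also matches NOP (ADDI x0, x0, 0). */
static bool is_canonical_mv(uint32_t insn)
{
    return field(insn, 0, 7) == 0x13     /* OP-IMM major opcode */
        && field(insn, 12, 3) == 0       /* funct3 = ADDI */
        && field(insn, 20, 12) == 0;     /* immediate == 0 */
}
```

For example, `classify_jalr(0x00008067)` (the encoding of `jalr x0, 0(x1)`) yields `JALR_RETURN`, and `is_canonical_mv(0x00058513)` (`addi a0, a1, 0`) is true. The point of the sketch is that both checks need register or immediate fields in addition to the opcode, which is the extra decode the list complains about.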
That's exactly how they justify it - they're looking for niches from the very small (IoT) to the very large (supercomputers)
No rationale? This is what GPUs have evolved into - it's why all the heavy compute is done on those now - it would be more elegant to have that functionality embedded in the CPU. Our current situation of offloading the main compute to a peripheral is messy.
ARM have done similar with SVE2. Intel would have preferred to do it (AVX-512/Larrabee), it's just that they couldn't compete with GPUs (now, you could say "RISC-V would not compete with GPUs either", but imagine building a RISC-V based PCIe vector card to do AI and crypto)