@erincandescent
Created July 25, 2019 23:32

Foreword

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.

It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.

This document is mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2.

Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions.

Consider the following C code, for example:

#include <stddef.h> /* size_t */

int readidx(int *p, size_t idx)
{ return p[idx]; }

This is a simple case of array indexing, a very common operation. Consider the compilation of this for x86_64:

mov eax, [rdi+rsi*4]
ret

or ARM:

ldr r0, [r0, r1, lsl #2]
bx lr // return

Meanwhile, the required code for RISC-V:

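# apologies for any syntax nits - there aren't any online risc-v
# compilers
slli a1, a1, 2      # a1 = idx * 4 (scale the index by sizeof(int))
add a0, a0, a1      # a0 = p + idx * 4
lw a0, 0(a0)        # a0 = p[idx]
jalr x0, 0(ra)      # return (assembler alias: ret)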

RISC-V's simplifications make the decoder (i.e. the CPU frontend) simpler, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while decoding slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its numerous prefixes).

The simplification of an instruction set should not be pursued to its limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations.
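To make the "break it down" direction concrete, here is a minimal C sketch of a decoder cracking a register + shifted register load into three simple micro-ops and running them on a toy machine. Everything named here (the `uop` structure, `crack_indexed_load`, `exec_uop`, the hidden scratch register `TMP_REG`, the 16-register file and the little-endian byte memory) is an assumption made for illustration, not a description of any real core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ARCH_REGS 16                 /* toy architectural register count (assumption) */
#define TMP_REG       NUM_ARCH_REGS      /* hidden scratch register used only by cracking */
#define NUM_REGS      (NUM_ARCH_REGS + 1)

/* Hypothetical micro-op kinds. */
enum uop_kind { UOP_SHL, UOP_ADD, UOP_LOAD };

struct uop {
    enum uop_kind kind;
    int dst, src1, src2;  /* register numbers; src2 is only used by UOP_ADD */
    unsigned imm;         /* shift amount, only used by UOP_SHL */
};

/* Crack "load rd, [rbase + (rindex << scale)]" into shift, add, load.
 * Returns the number of micro-ops written to out. */
size_t crack_indexed_load(int rd, int rbase, int rindex, unsigned scale,
                          struct uop out[3])
{
    assert(rd >= 0 && rd < NUM_ARCH_REGS);
    assert(rbase >= 0 && rbase < NUM_ARCH_REGS);
    assert(rindex >= 0 && rindex < NUM_ARCH_REGS);
    assert(scale < 32);

    /* The scratch register keeps rbase and rindex intact even if rd aliases them. */
    out[0] = (struct uop){ UOP_SHL,  TMP_REG, rindex,  0,       scale };
    out[1] = (struct uop){ UOP_ADD,  TMP_REG, rbase,   TMP_REG, 0 };
    out[2] = (struct uop){ UOP_LOAD, rd,      TMP_REG, 0,       0 };
    return 3;
}

/* Execute one micro-op. The caller guarantees that load addresses are in range. */
void exec_uop(const struct uop *u, uint32_t regs[NUM_REGS], const uint8_t *mem)
{
    switch (u->kind) {
    case UOP_SHL:
        regs[u->dst] = regs[u->src1] << u->imm;
        break;
    case UOP_ADD:
        regs[u->dst] = regs[u->src1] + regs[u->src2];
        break;
    case UOP_LOAD: {
        uint32_t a = regs[u->src1];
        /* Assemble a 32-bit little-endian word from four bytes. */
        regs[u->dst] = (uint32_t)mem[a]
                     | (uint32_t)mem[a + 1] << 8
                     | (uint32_t)mem[a + 2] << 16
                     | (uint32_t)mem[a + 3] << 24;
        break;
    }
    }
}
```

Cracking only needs the one instruction in hand. Going the other way, fusing a shift/add/load triple, means the decoder has to look across instruction boundaries and prove that the intermediate register is dead.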

We should distinguish the "Complex" instructions of CISC CPUs (complicated, rarely used, and universally low performance) from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and are high performance.

The Middling

  • Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care.
  • The same instruction (JALR) is used for calls, returns and register-indirect branches (requires extra decode for branch prediction)
    • Call: Rd = R1
    • Return: Rd = R0, Rs = R1
    • Indirect branch: Rd = R0, Rs ≠ R1
    • (Weirdo branch: Rd ≠ R0, Rd ≠ R1)
  • Variable length encoding is not self-synchronizing (this is common - e.g. x86 and Thumb-2 both have this issue - but it causes various problems with both implementation and security, e.g. return-oriented-programming attacks)
  • RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accommodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)
  • Multiply is optional - while fast multipliers occupy non-negligible area on tiny implementations, small multipliers can be created which consume little area, and it is possible to make extensive re-use of the existing ALU for multiple-cycle multiplication.
  • LR/SC has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache)
    • This appears to be a substitute for a CAS instruction, see comments on that
  • FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode
  • FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)
    • This could be easily rectified - size encoding 2'b10 is free
    • Update: V2.2 has a decimal FP extension placeholder, but no half-precision placeholder. The mind kinda boggles.
  • How FP values are represented in the FP register file is unspecified but observable (by load/store)
    • Emulator authors will hate you
    • VM migration may become impossible
    • Update: V2.2 requires NaN-boxing of narrower values held in wider registers
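The extra decode work described in the JALR point above can be sketched in C as follows. The field positions and the JALR opcode (`0x67`, with funct3 `0`) follow the standard 32-bit I-type encoding; the `jalr_kind` names and the `encode_jalr` helper are made up for this example, and a real branch predictor would do this in decode logic rather than in software.

```c
#include <assert.h>
#include <stdint.h>

enum jalr_kind {
    NOT_JALR,
    JALR_CALL,             /* Rd = R1 */
    JALR_RETURN,           /* Rd = R0, Rs = R1 */
    JALR_INDIRECT_BRANCH,  /* Rd = R0, Rs != R1 */
    JALR_OTHER             /* Rd is neither R0 nor R1 */
};

/* Build a JALR instruction word (illustrative helper, I-type layout). */
uint32_t encode_jalr(uint32_t rd, uint32_t rs1, int32_t imm12)
{
    assert(rd < 32 && rs1 < 32);
    assert(imm12 >= -2048 && imm12 <= 2047);
    return (((uint32_t)imm12 & 0xfffu) << 20) | (rs1 << 15) | (rd << 7) | 0x67u;
}

/* Classify an instruction word the way a predictor has to: by inspecting
 * the register specifiers, since the opcode alone does not say whether
 * this is a call, a return or a plain indirect branch. */
enum jalr_kind classify_jalr(uint32_t insn)
{
    uint32_t opcode = insn & 0x7fu;
    uint32_t rd     = (insn >> 7) & 0x1fu;
    uint32_t funct3 = (insn >> 12) & 0x7u;
    uint32_t rs1    = (insn >> 15) & 0x1fu;

    if (opcode != 0x67u || funct3 != 0)
        return NOT_JALR;
    if (rd == 1)
        return JALR_CALL;
    if (rd == 0)
        return rs1 == 1 ? JALR_RETURN : JALR_INDIRECT_BRANCH;
    return JALR_OTHER;
}
```

For instance, `classify_jalr(encode_jalr(0, 1, 0))` reports a return, while the same opcode with `rd = 1` reports a call. An ISA with distinct call and return encodings would let the predictor make this decision from the opcode alone.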

The Bad

  • No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its implications:
    • Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
    • No conditional selects (useful for highly unpredictable branches)
    • No add with carry/subtract with carry or borrow
    • (Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)
  • Highly precise counters seem to be required by the user level ISA. In practice, exposing these to applications is a great vector for side-channel attacks
  • Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not
  • No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics are inexpensive (only 1 bit of CPU state required for minimal single CPU implementations).
  • LR/SC are in the same extension as more complicated atomic instructions, which limits implementation flexibility for small implementations
  • General (non LR/SC) atomics do not include a CAS primitive
    • The motivation is to avoid the need for an instruction which reads 5 registers (Addr, CmpHi:CmpLo, SwapHi:SwapLo), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress LR/SC which is provided to replace it
  • Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit
  • For RV32I, no way to transfer a DP FP value between the integer and FP register files except through memory
  • RV32I and RV64I give the same encoding different meanings: e.g. RV32I's 32-bit ADD and RV64I's 64-bit ADD share an encoding, and RV64I adds a separate ADDW encoding for 32-bit addition. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead
  • No MOV instruction. The MV assembler alias is implemented as MV rD, rS -> ADDI rD, rS, 0. MOV optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical MV requires ORing together a 12-bit immediate to check that it is zero
    • Absent a MOV instruction, ADD rD, rS, r0 would actually be a preferable canonical MOV as it is easier to decode and CPUs normally have special case logic for recognizing the zero register
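As a rough illustration of the missing add-with-carry noted above, here is how a multi-word addition looks when the carry has to be recovered with unsigned comparisons (which is what `sltu` computes on RISC-V). The function name `add_multiword` and the least-significant-limb-first layout are assumptions for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add two n-limb unsigned integers (least significant limb first) without
 * a carry flag. Returns the carry out of the most significant limb.
 * dst may alias a or b. */
uint32_t add_multiword(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                       size_t n)
{
    assert(dst && a && b && n > 0);

    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t sum  = a[i] + b[i];        /* add                         */
        uint32_t c1   = sum < a[i];         /* sltu: did a + b wrap?       */
        uint32_t sum2 = sum + carry;        /* add the incoming carry      */
        uint32_t c2   = sum2 < sum;         /* sltu: did adding carry wrap? */
        carry  = c1 | c2;                   /* at most one of them is set  */
        dst[i] = sum2;
    }
    return carry;
}
```

Each limb after the first costs roughly two adds, two comparisons and an OR here, where an ISA with a carry flag would use a single add-with-carry instruction.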

The Ugly

  • JAL wastes 5 bits encoding the link register, which will always be R1 (or R0 for branches)
    • This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)
    • This is a regression from the v1.0 ISA!
  • Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped)
    • It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation.
  • No loads with register offsets (Rbase+Roffset) or indexes (Rbase+Rindex << Scale).
  • FENCE.I implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer
  • In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation
    • Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue
  • No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later variants
    • Common examples of pure "NOP hints" are things like spinlock yields.
    • More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)
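The hi/lo/hi read sequence needed for the 64-bit counters on RV32I can be sketched as below. The accessor type `read_half_fn` and the `toy_counter` used to provoke a carry are hypothetical stand-ins; on real hardware the two halves would be read with instructions such as `rdcycle` and `rdcycleh`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessor: returns one 32-bit half of a 64-bit counter. */
typedef uint32_t (*read_half_fn)(void *ctx);

/* Read a 64-bit counter through two 32-bit reads, retrying if the upper
 * half changed (i.e. a carry occurred) while the lower half was read. */
uint64_t read_counter64(read_half_fn read_lo, read_half_fn read_hi, void *ctx)
{
    assert(read_lo && read_hi);

    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi(ctx);
        lo  = read_lo(ctx);
        hi2 = read_hi(ctx);
    } while (hi != hi2);               /* compare and branch on mismatch */

    return ((uint64_t)hi << 32) | lo;
}

/* Toy counter that advances by `step` after every half-read, so that a
 * carry can be forced in the middle of a read sequence. */
struct toy_counter { uint64_t value; uint64_t step; };

uint32_t toy_read_lo(void *ctx)
{
    struct toy_counter *c = (struct toy_counter *)ctx;
    uint32_t v = (uint32_t)c->value;
    c->value += c->step;
    return v;
}

uint32_t toy_read_hi(void *ctx)
{
    struct toy_counter *c = (struct toy_counter *)ctx;
    uint32_t v = (uint32_t)(c->value >> 32);
    c->value += c->step;
    return v;
}
```

Starting the toy counter at `0xFFFFFFFF` with a step of 1 makes the first attempt straddle the carry: a naive hi-then-lo read would combine the old upper half with the new lower half, whereas the loop above notices the mismatch and retries. A "read pair of special registers" instruction would make the loop unnecessary.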
@experiment9123

experiment9123 commented Apr 16, 2021

". There aren't many use cases for more RAM at this point, except for HPC clusters and so forth. "

That's exactly how they justify it - they're looking for niches from the very small (IoT) to the very large (supercomputers)

And their approach to vectors seems based on "It was this way on some Crays we worked on in the past." That's it. There's no real rationale beyond that.

No rationale? This is what GPUs have evolved into - it's why all the heavy compute is done on those now - it would be more elegant to have that functionality embedded in the CPU. Our current situation of offloading the main compute to a peripheral is messy.
ARM have done similar with SVE2. Intel would have preferred to do it (AVX-512/Larrabee); it's just that they couldn't compete with GPUs (now, you could say "RISC-V would not compete with GPUs either", but imagine building a RISC-V based PCIe vector card to do AI and crypto)

@ciano1000

GPUs don't do masking of vector instructions the way the RISC-V vector extension envisions. GPUs usually use 4 instruction bits to refer to a mask register; RISC-V ran out of bits in its encoding, so it only uses 1 bit (literally mask on/off) and then the mask itself is set in the lower bits of the v0 register. This is super messy for hardware implementations, as these bits need to be scattered to each respective lane in the vector unit, which leads to a ton more wiring and general complexity. Discussed in full detail in this presentation: https://youtu.be/WzID6kk8RNs?t=552

@mbitsnbites

and then the mask itself is set in the lower bits of the v0 register. This is super messy for hardware implementation ... Discussed in full detail in this presentation: https://youtu.be/WzID6kk8RNs?t=552

Thanks for that reference, @ciano1000 ! I was not aware that RVV had the mask bits laid out like that - it's kind of odd.

I'm currently hacking on a microcontroller ISA (for lack of a better term - think Cortex-M0, AVR, really tiny RV32, etc), V16, with the bold goal of including vector instructions in the base ISA (based on the observation that vector operations add very little hardware cost, but bring great performance benefits for very low end implementations). I currently have a very simple masking mechanism: basically CMP insn -> single true/false flags reg + MASK prefix insn + any vector insn, so effectively there's not even a proper mask register (the "vector CC" register is used as a mask register). OTOH I don't aim for full GPU-like performance or support for very complex vector code, but rather for stuff like libc routines and the DOOM core rasterization loops.

@ciano1000

ciano1000 commented Mar 8, 2024

Yeah, it's a bit strange alright, but for your needs it's inconsequential. It mainly becomes an issue when you want to build a high performance out-of-order core: it doesn't make it impossible, but it adds unnecessary difficulty. It's yet to be seen how this would affect actual performance, since there hasn't been a RISC-V chip that has targeted high end desktop computing yet (although maybe that's telling by itself...)
