This is an idea for adding vector processing support to tiny RISC style microcontroller CPU:s (e.g. similar to FEMTORV32).
It could be especially beneficial for CPU:s without a pipeline, where each instruction normally takes 2+ clock cycles to complete.
The purpose of vector processing is to reduce the number of clock cycles required for each operation:
- Loop logic overhead is reduced (increment, compare, branch) as several data elements are processed in each loop iteration.
- Non-pipelined implementation: Instruction fetch & decode cycles are eliminated while iterating over a vector.
- Pipelined in-order implementation: RAW hazard stalls are eliminated, keeping the execute pipeline busy during vector operations.
In essence, this proposal can provide a performance boost of 2x or better for certain workloads, with nearly zero added hardware complexity.
Note: This proposal should be considerably easier to implement in hardware than the RISC-V V extension, for instnace.
New instructions are added that behave exactly the same way as regular scalar instructions (e.g. ADD, LW), except that they act on vector registers instead of scalar registers.
Ideally the new instructions are encoded in the same way as the scalar counterparts, except for a few bits that are flipped in the instruction word. This makes decoding simpler than if completely new instruction encodings are used.
For RISC-V implementations that only support 32-bit instructions (e.g. no compressed instructions), a simple solution is to encode the vector mode in the two least significant bits (according to The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, 20191213, §26.3). The bits could be used for encoding operand types as follows:
Bits 1..0 | Destination | Source 1 | Source 2 |
---|---|---|---|
00 | Vector | Vector | Vector |
01 | Vector | Vector | Scalar |
10 | Vector | Scalar | Vector |
11 | Scalar | Scalar | Scalar |
A vector length ("VL") register needs to be added. Its size will typically be at most 5 bits (supporting vector lengths up to 32 elements). Writing to the register can be done in one of the following ways:
- Add a
setvl
instruction that copies an integer value to the VL register. For instance for RISC-V, there is already aVSETVL
instruction in the V extension that could be used for this purpose. - Alternatively, one of the existing scalar registers could be designated as a mirror of the VL register. For instance
for RISC-V, when any instruction writes to the
X31
register theVL
register could be written to simultaneously. Note that in this case amin
instruction can come in handy for low-overhead vector length calculation in vector loops. RISC-V has amin
instruction in the B extesion.
A vector element counter needs to be added to the implementation. Its size is the same as the vector length register. Whenever a vector (or scalar) operation is started the counter is set to zero, and on each subsequent clock cycle the counter is incremented until it reaches the value of the VL register.
A new "VECTOR OPERATION ACTIVE" state needs to be added. For a pipelined implementation this state will stall all stages before the execute stage. For a non-pipelined implementation it will be a specific state after DECODE and before the next FETCH.
The vector element counter is used for indexing vector elements within a single vector register.
A natural choice is to add a new register file dedicated to vector registers. It could contain as many vector registers as the scalar register file (e.g. 32 registers), since the vector instructions typically has the same register address size (e.g. 5 bits) as the scalar instructions. However, since vector registers may require lots of hardware resources, and most programs do not need as many vector registers as scalar registers, it could be desirable to only have 1/2 or 1/4 of the number of scalar registrs. E.g. 8 vector registers is sufficient for most applications.
The length of each vector register is an important but flexible design decision. Anything between 4 and 32 vector elements per vector register should be fine.
To reduce hardware resource usage (e.g. FPGA LUT:s), vector registers can instead be mapped on top of the existing scalar
register file. Thus a virtual vector register could be [R8,R9,R10,R11]
, for instance.
A couple of options exist for how to map virtual vector registers to scalar registers:
- Overlapping:
V[n][k]
->R[n+k]
- Segmented:
V[n][k]
->R[n<<N | k]
It is also possible to extend the scalar register file, e.g. from 32 registers to 64 registers. The upper registers would only be accesible by vector instructions (this is how it is done in Simple-V in libreSOC, for instance).
Benefits:
- Easy to mix scalar and vector data and operations.
- Easy to perform vector element manipulation and permutations (by modifying the underlying scalar registers).
- Accelerate certain scalar operations (e.g. push many scalar registers to the stack).
Cons:
- Conflicts with scalar register allocation.
- Not as many vector registers / elements as with a dedicated VRF.
You usually want to treat loads/stores as a special case for vector operations. For instance, scalar base index + stride
can be more useful than vector base index + offset
at times. Here, stride can be [0, 4, 8, 12, ...]
for instance.
Generating a series of linearly incrementing addresses on-the-fly requires slightly more hardware:
- An address offset calculator is required (a registered adder). It is cleared to zero on the first vector cycle, and incremented by a fixed value on each consecutive cycle. The increment value is the stride operand of the instruction (typically the second source operand), which should be a scalar value (usually an immediate value).
- An extra MUX is most likely required for making the selection of scalar offset vs stride offset.
Looking back at the suggested operand type encoding above, stride based operations could be used for imemdiate operand instructions (e.g. LW and ADDI for RISC-V) when the Vector <- Scalar, Vector
operand configuration is used (i.e. the scalar immediate operand is turned into a vector operand by the stride address generator).