This document lists ISA or processor implementation wishlists/ideas.
They are mostly random ideas; they are not reviewed, guaranteed to work, or even just make sense, but at some point I had good use cases for each one of them.
ISA wishlists found elsewhere.
- y-cruncher Technology Wishlist - ISA
- Lockless Inc. X86 Instruction Wishlist
- David Chisnall's CPU Feature Wishlist
- Fast memcpy, A System Design (ACM SIGARCH)
This section lists a collection of potential ideas for ISA extensions.
Native byte code support
Support direct execution/interpretation of custom bytecode (in a sense, optimized support for threaded code).
Native global read/write barriers
Support hooking up load/store instructions with custom behaviors.
This could be used to speed up GC, to enable sanitizers in production, or more in general to efficiently support safe language semantics.
Instruction virtualization
Support hooking up/override instructions with custom behaviors. This is effectively a superset/generalization of the previous point.
In addition to the previous usecases, this could be used for program instrumentation, virtualization, sandboxing, and so on.
Green threads support (quick stack switching)
Alternate code sequences
Provide multiple equivalent targets for e.g. a CALL/JUMP, and let the processor dynamically pick the most efficient one based on sampling of past executions An example of when this may be useful is in the case ofif (a || b) {
// ...
}
if b
is side-effect free, this can be evaluated either as (respecting the short-circuit nature of ||
)
if (a) {
if (b) {
// ...
}
}
or as
if (b) { // swapped order of conditions
if (a) {
// ...
}
}
or as
c = bool(a) | bool(b)
if (c) { // single conditional jump
// ...
}
which one is going to be faster depends on the how predictable a
and b
are, how correlated they are (e.g. consider the case in which each is individually unpredictable, but bool(a)|bool(b)
is highly predictable because a = !bool(b)
), and how expensive it is to evaluate each of them.
With an alternate code sequence instructions (alt_inst_start
/alt_inst_end
) code could instead be emitted as
switch (alt_inst_start(2)) {
case 0:
if (a) {
if (b) {
end1:
alt_inst_end();
// ..
}
}
end2:
alt_inst_end();
break;
case 1:
if (b) {
if (a) {
goto end1
}
}
goto end2;
case 2:
if (bool(a) | bool(b)) {
goto end1;
}
goto end2;
}
and this would enable the processor to dynamically sample how long each sequence takes on average, and then start to pick more frequently the one that, given the dynamic conditions found during execution, is faster on average.
Probabilistic jump
Conditional branch that jumps to its destination with a certain probability (e.g. 1/n) or with a certain rate (e.g. once every n microseconds)if one_in(n) { // probability = 1/n
// e.g. sampling/cleanup code
}
if once_every(t) { // taken ~ every t ns (if executed)
// e.g. sampling/cleanup code
}
Supporting this natively in the ISA would allow CPUs to integrate it with the built-in RNG/timer and branch predictor to ensure that on average the jump is taken the appropriate number of times, while causing zero branch mispredictions and without the need for expensive interrupts.
Atomic operations on whole cachelines
LL/SC instructions to support atomic load/store/cas of the contents of a whole cacheline (or a subset of it)
Memory operations
Native implementation basic memory operations that can be potentially be pushed down to the memory controller or to the memory itself.Note that some of these functions may already be partially or fully implemented by the V extension.
memcpy
/memmove
, extended to support element size and stride parameters for both source and destination (memcpy(dst, src, elem_sz, elem_cnt, dst_stride, src_stride)
with the currentmemcpy
behaviour being equivalent tomemcpy(dst, src, sz, 1, sz, sz)
)memset
, extended to support multi-byte patterns (memset(dst, dst_len, src, src_len)
, with the currentmemset
behavior being equivalent tomemset(dst, dst_len, &value, 1)
memcmp
, extended to support a mask parametermemmem
, extended to support a stride parameter (memmem(haystack, haystack_len, needle, needle_len, stride)
, with the currentmemmem
behavior being equivalent tomemmem(haystack, haystack_len, needle, needle_len, 1)
)memswp(aptr, bptr, len)
swap the contents of two memory areasmemchr
, extended to support a needle parameter (memchr(haystack, haystack_len, stride, needle)
) that allows to specify multiple bytes to search for, with the currentmemchr
behavior being equivalent tobytemask needle = {0}; needle[c/8] = 1<<(c%8); memchr(haystack, haystack_len, 1, needle)
- memsieve (TBD: filter out bytes)
The stride
parameters indicates how far apart candidate matches should be. E.g., if stride
is 4 then candidate matches are at offset 0, 4, 8, 12, ... from the starting address. A common case is having the stride
equal to length of the needle (for memmem
) and of the source (for memset
), but the stride can be both shorter or longer than those (TODO: extend all functions to support backwards searches via negative strides).
Ideally we should support also the terminator variants to enable efficient operations on 0-terminated strings. Ideally we should also support arbitrary terminator lengths:
memccpy
, extended to support terminator multi-byte patterns (memccpy(dst, src, len, stop, stop_len)
, with the currentmemccpy
behavior being equivalent tomemccpy(dst, src, len, &value, 1)
)memcset
,memccmp
,memcmem
, ...
A related idea, enabled by having some of these operations supported natively by the ISA, is described below in Bulk memory operations lowering.
Some of these instructions would benefit from strided memory access acceleration (see implementation ideas section).
Strided memory access mode
Allow any memory access instruction - as well as the D-caches - to use strided access mode, to be able to transparently and efficiently perform SoA/AoS transforms.This would benefit from strided memory access acceleration (see implementation ideas section).
Associative table lookup (Content-Addressable Memory)
Implement a fully-associative table lookup instruction that can be used to quickly map sparse keys to values.Such an instruction could expose/reuse the TLB machinery to allow code to efficiently implement hashmaps, sparse LUTs, jump tables, global remapping or indirection, etc.
Note that the proposed memmem
function, with the extension described in the previous point, could functionally perform the same task.
Range search
Related to associative lookups, this instruction would - given a scalar and a SIMD vector - search for the range in the vector where the value is found.Assuming a N elements vector (v[0], v[1], ..., v[N-1])
where v[i] <= v[i+1]
for each i, the instruction would perform the following:
int search<typename T>(T s, T *v, int N) { // N > 0
if (s < v[0])
return 0;
for (int i=1; i<N; i++)
if (v[i-1] <= s && s < v[i])
return i;
return N;
}
This would be useful to implement range lookups, table lookups, jump tables, etc.
If the relationship v[i] <= v[i+1]
is not valid in v
, then the result is either a fault or undefined.
Deterministic timing mode
Allow to temporarily disable all performance mechanisms that can be used as side-channels, e.g. all forms of speculation, superscalar execution, resource sharing, caching, variable instruction timing, etc.Stack-based capability controls
Jump target instruction
A jump target instruction is an instruction that marks valid jump targets. If a jump/call or other PC modification is executed, and the instruction at the new PC is not a jump target instruction, an exception is raised and execution is aborted.This requires the definition of a variant of each jump/call instruction ("jump/call-and-check-target"), that instructs the processor to ensure that the next instruction executed (at the jump/call target) is a jump target instruction.
The jump target instruction should be a no-op if the last instruction executed was not a jump/call-and-check-target instruction (if this is not desirable, the jump target instruction could be immediately prefixed by an instruction that unconditionally raises an exception).
Text processing instructions
Note: some of these functionalities may be already partially or fully implemented by the RISC-V V extension.
- Validating and converting between ASCII, UTF-8, UTF-16, and UTF-32, Unicode codepoints
- Case-insensitive memory comparison (see
memcmp
in memory instructions) - Case conversion
Fast non-cryptographic hashing functions
To speed up workloads that require fast hashing of short fixed/variable-sized blobs (e.g. string hashing, pointer hashing, etc.), like hashmaps.Generalized mixing function
Generalized mixing function like#include <type_traits>
enum op { opXor, opOr, opAnd };
template <typename T1, typename T2, op op,
class = typename std::enable_if_t<std::is_unsigned_v<T1> &&
std::is_unsigned_v<T2>>>
T1 mix(const T2 n, const T1 m[sizeof(T2) * 8],
const T1 m2[sizeof(T2) * 8] = {}) {
T1 out = n & 1 ? m[0] : m2[0];
for (int i = 1; i < sizeof(T2) * 8; i++) {
T1 f = n & (1 << i) ? m[i] : m2[i];
switch (op) {
case opXor: out ^= f; break;
case opOr: out |= f; break;
case opAnd: out &= f; break;
}
}
return out;
}
that can be used both for bit shuffling (shifts, rotations, pdep/pext, reverse, permutations, ...), other operations (clmul, ...) but especially as fast integer mixing step with ideal avalanche properties (see e.g. https://gist.github.com/badboy/6267743, http://burtleburtle.net/bob/hash/integer.html, https://marc-b-reynolds.github.io/math/2017/10/13/IntegerBijections.html, and https://github.com/skeeto/hash-prospector) for integer hashing.
Register dependency bulk clear
Instructions to clear multiple registers to prevent register dependencies, e.g.regclear [imm32]
where each bit of the immediate maps to each architectural register. If bit N in the immediate is 1, register N is cleared.; otherwise it's left untouched. Potentially could also have a regunset [imm32]
variant that, instead of clearing, simply marks the registers as having "undefined value".
Number conversion instructions
Instructions to convert from/to ASCII representation of integers and floats (itoa
/atoi
/ftoa
/atof
).
Masked move
Perform a masked move to/from memory or register, optionally atomically. In pseudo-code:dst = (src & mask) | (dst & ~mask)
Cross-process call
Same-ring or cross-ring call that performs an efficient call to a procedure in a different address space.Similar in functionality/concept to the Mill's portal calls, XPC or other efficient cross-address-space calls.
Branchless comparison instructions
Branchless versions of common integer comparison sequences:instruction | pseudocode |
---|---|
min(s|u)(8|16|32|64) D, S1, S2 |
D = S1 < S2 ? S1 : S2 |
max(s|u)(8|16|32|64) D, S1, S2 |
D = S1 > S2 ? S1 : S2 |
minmax(s|u)(8|16|32|64) SD1, SD2, S3 |
(SD1, SD2) = (SD1 < S3 ? SD1 : S3, SD2 > S3 ? SD2 : S3) |
sort(s|u)(8|16|32|64) SD1, SD2 |
(SD1, SD2) = SD1 < SD2 ? (SD1, SD2) : (SD2, SD1) |
clip(s|u)(8|16|32|64) SD1, S2, S3 |
SD1 = SD1 < S2 ? S2 : (SD1 > S3 ? S3 : SD1) |
boundcheck(s|u)(8|16|32|64) SD1, S2, S3 |
SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0) |
joob(s|u)(8|16|32|64) SD1, S2, S3, TPC |
SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0); jnz SD1, TPC |
For RISC-V, these could be pseudo-instructions with a canonical explicit form that can be reliably handled by macro-op fusion.
Also worth noting that Zicond
now allows to code a min
/max
operation as a sequence of sltu
/slt
+ czero.nez
+ czero.eqz
+ or
, so most of the above can be coded with at most a couple of these sequences. It is unlikely (but not impossible) that CPUs will be able to perform macro-op merging across 4 instructions to recognize a min
/max
operation, and it almost guaranteed to be impossible to recognize the 8 instructions required for things like clip
/minmax
/sort
/boundcheck
.
The file cmp_unit.v contains a simple implementation of this (except joob
). It consumes ~600 basic logic gates for 32 bits words, ~1200 for 64 bits.
Everything described above could be also extended to floating-point values (8, 16, 32, 64, 80, 128).
List of processor implementation ideas
Dual branch speculation
When a 2-way branch is confidently predicted to be highly unpredictable (i.e. the prediction is that the branch is taken randomly 50% of times with no predictable pattern), speculatively execute both branches and then discard the one that was not taken (instead of picking one and then having to flush the pipelines and start over in ~half of the cases).
Some discussion here, even though there is no focus on limiting dual branch speculation only to branches that are extremely unpredicable. Tyson, Gary S. et al. “Limited Dual Path Execution” (2000) seems to be very similar to this idea.
Division/modulo strength reduction
When a division/modulo instruction is frequently executed with a very small amount of possible divisors, the processor could compute their multiplicative inverse and store the inverses in a dedicated small cache. When a division is executed, the cache would be visited and, if it yields a hit, the division would be transparently strength-reduced to multiplications and shifts (or just shifts, in case of power-of-2 divisors).This is normally done by compilers at compile-time, but only for constant divisors. It can be done at runtime using something like libdivide.
Data speculation
Arithmetic and logical operations have specific input values for which the result is constant or for which the operation is an identity.Examples include:
- Constant result:
x & 0 = 0
,x | 0xFF..FF = 0xFF..FF
,x * 0 = 0
,x / x = 1
,x ^ x = 0
, ... - Identity:
x & x = x
,x & 0xFF..FF = x
,x | 0 = x
,x ^ 0 = x
,x + 0 = x
,x - 0 = x
,x / 1 = x
,x >> 0 = x
, ...
Speculation could be extended to consider also the outputs of general operations (including logic/arithemtic operations and even loads from memory/cache). If an operation is predicted to yield a constant or identity, or if the result of the operation is somehow highly predictable, the operation result could be speculated so that subsequent operations could start executing before the results of the previous operation become available.
Shortcircuit conditional moves
Consider psuedocode like the following:int x = 0;
if (some_condition)
x = a + b; // assume that x has a long data dependency
it could be possible to compile it down to:
move x, 0 // x = 0
add t, a, b // t = a + b
conditional_move x, t, some_condition // if (some_condition) x = t
but this, at least in CPUs I have tested, causes the conditional_move
to have a data dependencies on all of some_condition
, x
, and t
, regardless of the value of some_condition
.
Instead, as soon as some_condition
is resolved, the data dependency on the value not used (i.e. either x
or t
, depending on the value of some_condition
) should be dropped to let the conditional move execute immediately (as soon as the remaining value is resolved), and potentially any non-retired instructions the dropped value depends on as the sole user should be cancelled/flushed (to free up execution units).
In the example above, this could mean that e.g. as soon as some_condition is reolved to 0/false, the CPU should immediately drop the data dependency of the conditional move on t
and, assuming x
has retired already, retire immediately. At the same time, if the add
is executing or queued for execution, it should be aborted and/or flushed from the execution pipeline (assuming there are no other uses of t
).
Bulk memory operations lowering
Some fundamental memory operations, e.g. memset
, memcpy
, and memmove
, could be lowered to equivalent bulk instructions to the memory subsystem to minimize data movement between memory and CPU, potentially decreasing power consumption and increasing performance.
By avoiding the data movement between memory and CPU, we reduce usage of shared resources both in the memory subsystem and in the CPU that can be used by other cores/threads instead. This would also help reduce energy consumption, as moving data from memory and back consumes significant energy. Furthermore the memory subsystem could potentially execute some of these operations with a much greater degree of parallelism, thereby reducing wall times required for the operations to complete.
Overlapping load/store merging
Recognize idioms/macros such as:
// short non-overlapping memcpy of length off+W2
load<W1> SRC, R1
load<W2> SRC+off, R2 // off <= W1
store<W1> R1, DST // SRC+off+W2 <= DST && DST+off+W2 <= SRC
store<W2> R2, DST+off
// optionally require explicitly killing R1 and R2
and internally execute a single operation that moves off+W2 bytes from SRC to DST (possibly subject to some additional constraints for the optimization to apply, e.g. the source and destination ranges should not individually cross cache lines)
The alternative would be new instructions like the one proposed in https://www.sigarch.org/fast-memcpy-a-system-design/.
Transparent memory compression
Allow to memory subsystem to transparently compress memory when possible (similar to in-memory texture compression).
Strided access memory acceleration
Related to the strided memory access instructions: see https://dl.acm.org/doi/fullHtml/10.1145/3466752.3480091.Reassociate operations with data dependencies
Reassociable operations that are part of a data dependency should be reassociated transparently to minimize latency due to the data dependency.For example:
OR r0, r1
OR r0, r2
OR r0, r3
should be transparently reassociated, to minimize stalls waiting for the data dependency on r0, to the equivalent microcode of:
OR r0, r1
MOV rX, r2 # rX is a unused non-architectural register
OR rX, r3
OR r0, rX
This way the dependency is only across 2 instructions, not 3 as in the first case.
This would be applicable to other operations beyond OR, e.g. AND/XOR, and also to some arithmetic operations like ADD/SUB, always assuming the intermediate r0 results are not otherwise used.
For an example of this, see the OR reduction in the hot loop of https://godbolt.org/z/o7frnM14d.
TODO: efficient SoA/AoS via strided view of memory exposed by the memory subsystem
S = 2^s
E = 2^e
A = addr in view
Ar = a/E*S + a%E