Skip to content

Instantly share code, notes, and snippets.

@zingaburga
Last active December 28, 2024 10:52
Show Gist options
  • Save zingaburga/805669eb891c820bd220418ee3f0d6bd to your computer and use it in GitHub Desktop.
Save zingaburga/805669eb891c820bd220418ee3f0d6bd to your computer and use it in GitHub Desktop.
ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads

ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads

Scalable Vector Extensions (SVE) is ARM’s latest SIMD extension to their instruction set, which was announced back in 2016. A follow-up SVE2 extension was announced in 2019, designed to incorporate all functionality from ARM’s current primary SIMD extension, NEON (aka ASIMD).

Despite being announced 5 years ago, there is currently no generally available CPU which supports any form of SVE (which excludes the Fugaku supercomputer as well as in-preview platforms like Graviton3). This is set to change with ARM having announced ARMv9 which has SVE2 as the base SIMD instruction set, and ARM announcing support for it on their next generation cores across their entire server (Neoverse) and client (Cortex A/X) line up, expected to become generally available early 2022.

Cashing in on the ‘scalable’ moniker that drives cloud computing sales, SVE, particularly SVE2, is expected to be the future of ARM SIMD, and it’s set to gain more interest as it becomes more available. In particular, SVE addresses an issue commonly faced by SIMD developers - maintaining multiple fixed-width vector code paths - by introducing the notion of a variable (or ‘scalable’) length vector. Being SVE’s standout feature, this enables the same SVE code to run on processors supporting different hardware SIMD capabilities without the need to write, compile and maintain multiple fixed-width code paths.

ARM provides plenty of info summarizing SVE/2 and it’s various benefits, so for brevity’s sake, I won’t repeat the same content, and instead focus more on the quirks and aspects that aren’t being touted that I’ve come across, mixed in with some of my thoughts and opinions. Note that I only ever touch integer SIMD and I mostly use C intrinsics, referred to as ACLE (ARM C Language Extensions). To avoid making this article too long, I’ll focus primarily on what I’m more familiar with, which means I won’t be touching much on floating point, assembly or compiler auto-vectorization (despite how important these aspects are to SIMD). This also means I’ll be deliberately skipping over a favourite amongst SIMD examples.

SIMD = SAXPY?

Detailed information on SVE outside of core documentation is relatively scarce at the moment, as it’s largely unavailable in hardware and not yet widely coded for. As such, along with not having access to any SVE hardware, I could be wrong on a number of factors, so corrections welcome.

Note: this document was written in late 2021. ARM has since added changes to SVE which are not fully addressed in this article

So another instruction set I have to care about?

Since SVE2 was announced before any SVE supporting processors have been announced by ARM, it would seem like there’s not much point in restricting yourself to just what SVE supports when writing code. And indeed, all but one of the announced cores which support SVE also support SVE2, which means you could use SVE2 as a baseline, greatly simplifying development (not to mention that ARMv9 associates with SVE2 over SVE).

The unfortunate exception here is the Neoverse V1. It’s also the only announced core (ignoring A64FX) which supports a width greater than 128 bits, so likely the only core that initially will benefit greatly from SVE (compared to NEON code). It’s unknown whether V1 or N2 will gain wider adoption in their target server market, so it’s possible that you may want to ignore specifically targeting the V1 to avoid writing separate SVE and SVE2 code paths (taking note that AWS’ Graviton3 processor is expected to be V1). As SVE2 is meant to be the more complete SIMD instruction set, and potentially a NEON replacement, I’ll largely assume an SVE2 base for the rest of this document.

On the fortunate side, there’s a good chance that your code either won’t work without SVE2, or it works with only SVE and SVE2 doesn’t provide much of a benefit, so you may not even have to worry about the distinction.

The point about V1 being the only 256-bit SVE implementation leads to the next point…

SVE2 - scaling from 128-bit to 128-bit

All currently announced SVE2 cores only support 128-bit vectors, which is also the minimum width an SVE vector can be. This means that you probably won’t be seeing any massive gains initially from adopting SVE2 over NEON (unless the new instructions help a lot with your problem, or enables vectorization not possible under NEON). Regardless, the point of SVE is making the adoption of wider vectors easier, so gains may eventuate in a later core, though lack of any initial gains could fail to encourage developers to port over their NEON code at first.

However, I think that we may be stuck with 128-bit for quite some time in the consumer space, whilst the next ‘large’ server core (Neoverse V2?) will probably get 256-bit SVE2 (if not wider). This is due to the popularity of heterogenous core setups (aka big.LITTLE) among consumer SKUs and SVE requiring that all cores support the same vector width. This means that CPUs will be limited to the narrowest SVE implementation amongst all cores, with the little cores likely being the limiting factor. The widest upcoming consumer core, the Cortex X2, is limited to 128-bit SVE2, possibly due to the presence of the Cortex A510 and/or A710.

Update: the Neoverse V2 reverts to a 128-bit vector width. It's unknown what width future ARM server cores adopt.

Unfortunately, ARM historically hasn’t updated their little cores as frequently as their larger cores, and with the Cortex A510 supporting a shared SIMD unit between two cores (something its predecessor doesn’t), widening SIMD doesn’t appear to be a priority there. (one possibility that could eventuate is if the little cores declare support for a wider vector width but retain narrow SIMD units - this would enable the use of wider SVE on the big cores)

Servers however currently don’t have a heterogenous core setup, so being limited by a particular core type isn’t an issue yet. With that in mind, presumably, SVE isn’t about enabling 256-bit SIMD on mobile, but more about enabling the same ISA to be used in a number of different environments.

However, whilst ARM may not enable >128-bit vectors on consumer cores in the near future, it’s unknown what custom ARM cores from Apple and Qualcomm/Nuvia may bring. On the Apple side though, considering their recently announced A15 chip doesn’t support ARMv9/SVE2, which the M2 is expected to be a derivative of, and the introduction of their own AMX instruction set (in lieu of SME), it raises doubts over their interest in supporting SVE in the near future. As for Qualcomm, the first Nuvia showing is expected in 2023.

Of course, it’ll take time for developers to adopt SVE2, so it’s preferable to have a narrow implementation available, and roll out wider variants in the future.

Extensions: When plain SVE2 isn’t enough

ISA extensions

To add a tinge of complexity, SVE and SVE2 already come with optional extensions, most of which only add a few instructions. However, SVE2’s base instruction set seems to be fairly comprehensive, so there’s a good chance you won’t have any use for these relatively application-specific extensions for many general problems.

One curiosity I have is the SHA3 extension, which only adds a single instruction, RAX1 (rotate by 1 and Xor). This is likely a holdover from the NEON SHA3 extension, which added four instructions, but transitioning to SVE, three of them have been moved to the base SVE2 set. One of the three is XAR (Xor and rotate), which now works on all element sizes. With that being present, one wonders why RAX1 couldn’t have just become as generic as XAR, and why it has to be shoved behind an extension (one can only assume it’s much simpler to Xor before rotating than go the other way around).

In terms of actual hardware support, it’s unknown what extensions CPUs will ultimately support, but from the optimization guides for Neoverse N2/V1 and Cortex A510/A710/X2, they all appear to support BFloat16 and INT8 matrix multiply, but lack the FP32/FP64 matrix multiply extension. Excluding the V1, they all also appear to support all listed SVE2 extensions (BitPerm, AES-128, SHA-3 and SM4), though BitPerm appears to be micro-coded (i.e. very slow) on the A510. Although the N2/A710/X2 listing does look like a copy/paste job from ASIMD, I presume the correct rows were copied (whilst the A510 listing seems to be somewhat confused).

SVE2 extension ops from N2 optimization manual

SME: Moving into the next dimension

Before we get any SVE2 hardware to play with, ARM keeps things exciting, having already defined their next major extension: Scalable Matrix Extensions. SME is largely considered to be a superset of SVE2, but not entirely, and is denoted as a distinct feature from SVE/2.

Time to add a dimension

SME adds a ‘streaming SVE mode’ which permits the processor to have an alternative vector size. My guess is this effectively lets the processor have a latency optimised (regular) mode and a throughput optimised (streaming) mode.

From what I can tell, the following SVE2 instructions aren’t available in SME:

  • ADR (compute vector address)
  • COMPACT, HISTCNT/HISTSEG, MATCH/NMATCH
  • FADDA, FEXPA and floating point trigonometric instructions
  • most (all?) load, store and prefetch instructions
  • FFR instructions
  • anything added by an SVE extension (except some from FP64MatMul - see below)

Despite it’s aim of accelerating matrix multiplication, SME adds a few possibly useful SVE2 instructions, which are:

  • DUP (predicate)
  • REVD (rotate 128-bit elements by 64)
  • SCLAMP/UCLAMP (clamp values, i.e. combined min+max)
  • Quadword (128-bit) TRN/UZP/ZIP (also available in FP64MatMul SVE extension)

AArch64 only

SVE is defined for AArch64 only, so you don’t have to worry about supporting AArch32 + SVE (not to mention that AArch32 is being dropped on most upcoming ARM processors).

Wearing a Mask is Mandatory

Other than offering variable length vectors, the biggest feature SVE introduces is embedded masking, aka lane-wise predication. And to ensure that knowledge of this feature spreads around, ARM makes sure that you can’t code SVE without a mask.

Consider how to add two vectors in C intrinsics for AVX512, NEON and SVE:

_mm512_add_epi32(a, b)       // AVX512
vaddq_u32(a, b)              // NEON
svadd_x(svptrue_b32(), a, b) // SVE
// ( svptrue_bX() sets all predicate lanes to be active )

The thing to take note is the required svptrue_b32() part in the SVE code - this is a mask (or predicate) where all values are set to true (if any element in the mask were to be false, it would ‘deactivate’ the corresponding lane during the operation - the all-true mask is required when no masking is desired).

If you write a lot of code which doesn’t use predication (as I do), ARM makes you atone for such sins by requiring a svptrue_bX() for every violation (or if you’re a rebel, macro the thing away).

As for assembly, ADD actually has an unpredicated form due to the predicated version being destructive, but this isn’t available to all instructions, so you may want to keep an ‘all-true’ predicate register handy.

For reference, AVX512 makes predication optional for both intrinsics and assembly. I feel it would’ve made sense to do the same for SVE, just like they do for instructions which don’t support predication...

Masks not suitable for snouts

Despite being a major feature, there’s quite a number of instructions which don’t support predication. Anything which takes 3 vector inputs, or involves swizzling data, seem to commonly lack support for predication, in addition to instructions such as PMULL or EORBT. Also, there are some instructions that support a choice of predication or being destructive (possibly due to a limitation of coding space), although this is abstracted away in ACLE.

Many instructions which do support masking only support one type (merging or zeroing). ACLE allows zeroing on many merge-only instructions by merging into a zeroed vector, but because of this, zeroing predication may, in some cases, be less efficient than merging predication (despite what one might assume for a modern OoO processor).

This ACLE emulation of predication type doesn’t exist for a small number of cases. Most (but not all) instructions returning a predicate only support zeroing (e.g. NMATCH) whilst others only support merging (e.g. pairwise reductions).

Should I care if you “don’t care”?

In addition to zeroing and merge predication, ACLE has a “don’t care” _x predication variant on most instructions. As there is no equivalent in assembly, this is purely an ACLE feature.

It’s curious as to what purpose this serves, as it could be achieved by simply ignoring predication altogether. As such, why not drop the predicate argument, since it ultimately doesn’t matter? The few instructions where you actually may want predication, but don’t care about masked out lanes (e.g. loads/stores, COMPACT etc), don’t support “don’t care” predication anyway. On the other hand, instructions like ADDP only support merge masking, have a “don’t care” variant despite compilers having no choice in masking mode.

Perhaps it’s for instructions which require predication, but for which you may have an appropriate predicate register handy. If there were only unpredicated intrinsics, the compiler may be forced to have an all-true predicate handy when it’s strictly unnecessary. It may also be to make it easier for programmers to not have to remember what’s supported, when they ultimately don’t care.

Keep that mask straight: no shifting around

Predicate operations seem limited to in-place bitwise logic (AND, ORR etc), NEON inherited swizzles (ZIP, TRN etc), testing and some count/propagation operations (CNTP, BRKB).

Unfortunately, operations like bitwise shift are missing, which may force you to reinterpret the predicate as a vector, operate on that, then move back to a predicate.

Shifting out all but one bit may be simulated via BRKN, for example, mask >> (svcntb()-1) (logical right shift) can be implemented with svbrkn_z(svptrue_b8(), mask, svptrue_pat_b8(SV_VL1)).

Morphing masks into integers/vectors

SVE provides a number of instructions to manipulate predicates, but if you’re doing something special, you may want to apply integer/vector operations to them. Unfortunately, there’s no instruction to move a predicate to a different register set - the only solution is to go through memory.

SVE provides LDR/STR instructions for spilling predicates onto the stack, however you don’t have access to these from ACLE - the manual of which mentions:

No direct support, although the compiler can use these instructions to [refill/spill] registers [from/to] the stack.

Thus you’ll have to coerce the compiler into doing what you want (unions don’t work, neither does typecasting, however you can reinterpret a pointer).

int f(svbool_t mask) {
	uint8_t maskBytes[256/8]; // max predicate width is 2048/8 = 256 bits
	//union { svbool_t p; uint8_t b[256/8]; } maskBytes; -- illegal
	*(svbool_t*)maskBytes = mask; // pointer casting seems to be the only option
	//memcpy(maskBytes, &mask, svcntd()); // very poorly optimised on current compilers
	
	// do something unique with the predicate
	int value = 0;
	for(int i=0; i<svcntd(); i++) {
		value += lookup_table[maskBytes[i]];
	}
	return value;
}

Finally a PMOVMSKB equivalent?

NEON famously lacks an equivalent to x86’s handy PMOVMSKB instruction, often requiring several instructions to emulate. Now that SVE has predication, this becomes much more straightforward. Unfortunately, as there’s no direct predicate to integer instruction, this will need to go through memory (as described above), and you’ll need to at least consider cases where the predicate could be > 64 bits.

Update: SVE2.1 adds a PMOV instruction which can be used to move the predicate to a vector register, which then can be moved to an integer register.

ACLE Tweaks Over NEON

Defined tuple handling functions

I don’t recall the NEON intrinsics documenting how to construct vector tuples (e.g. uint8x16x2_t type) or how to access elements within (all compilers seem to have adopted the tuple.val[] access method). SVE has documented svcreateX, svgetX and svsetX to handle this.

Uninitialized vectors

Another thing lacking in NEON intrinsics is the ability to specify uninitialized vectors, which is now available in SVE. There doesn’t seem to be a way to explicitly specify uninitialized predicates, though I feel that they’re less useful than uninitialized vectors.

Optional type specifier

NEON required vectors to be typed, as well as the intrinsic functions. SVE makes the latter optional via function overloading.

svuint32_t a, b;
svadd_u32_x(svptrue_b32(), a, b); // the `_u32` part can be omitted because we know a and b are u32
svadd_x(svptrue_b32(), a, b);     // ...like this

_n_ forms for constants

ACLE introduces _n_ forms to a number of intrinsics, which could cut down on a bit of boilerplate. Some instructions do support embedding immediate values, but many with _n_ forms don’t (and this doesn’t limit you to immediates), so essentially this saves you specifying svdup_TYPE(...) whenever you use a scalar value that you want broadcasted into a vector.

The functions are overloaded, so you don’t even need to specify the _n_ part, however this does require also dropping the type specifier:

svadd_u32_x(svptrue_b32(), a, svdup_u32(5)); // instead of writing this...
svadd_n_u32_x(svptrue_b32(), a, 5);          // ...you can now use this...
svadd_x(svptrue_b32(), a, 5);                // ...or even this
//svadd_u32_x(svptrue_b32(), a, 5);          // but not this

ACLE Limitations

SVE vectors cannot be embedded in containers (or composite types)

Current GCC (v11) / Clang (v13) doesn’t like dealing with arrays/unions/structs/classes containing SVE types, and refuses to do any pointer arithmetic (including indexing) on them.

This includes cases where no arithmetic would be necessary in optimised code, due to all vectors fitting within addressable registers, and can be an annoyance if you need to pass around multiple vectors (e.g. in some sort of ‘state’ variable) to functions. It can also be problematic if you’re trying to do some generic templating of a common vector type for different ISAs. You could pass around a pointer instead, but this will likely require specifying explicit loads/store intrinsics, and compilers may be less willing to optimise these out.

There are vector-tuple types (note: unrelated to std::tuple), but are limited to 2, 3 or 4 vectors of the same element type. Otherwise, there doesn’t seem to be any way to create any ‘packaged type’ containing vectors unfortunately.

// invalid - cannot use a sizeless type in an array
svuint32_t tables[8];
init_tables(tables);

// correct way to do it
svuint32_t table0, table1, table2, table3, table4, table5, table6, table7;
init_tables(&table0, &table1, &table2, &table3, &table4, &table5, &table6, &table7);

// alternatively, save some typing by using tuples
svuint32x4_t table0123, table4567;
init_tables(&table0123, &table4567);

Here’s an example of a templated design where SVE may seem somewhat awkward:

// example processing functions for NEON/SVE
static inline void process_vector(uint8x16_t& data) {
	data = vaddq_u8(data, vdupq_n_u8(1));
}
static inline void process_vector(svuint8_t& data) {
	data = svadd_x(svptrue_b8(), data, 1);
}

// this concept doesn't work particularly well for SVE
template<typename V>
struct state {
	unsigned processed;
	V data;
};
template<typename V>
void process_state(struct state<V>* s) {
	s->processed += sizeof(V);
	process_vector(s->data);
}

// explicitly instantiate NEON version
template void process_state<uint8x16_t>(struct state<uint8x16_t>* s);
// or SSE, if compiling for x86
//template void process_state<__m128i>(struct state<__m128i>* s);

// perhaps you could do this for SVE...
template<>
struct state<svuint8_t> {
	unsigned processed;
	uint8_t data[256]; // size of largest possible vector
};
template<>
void process_state<svuint8_t>(struct state<svuint8_t>* s) {
	s->processed += svcntb();
	// the following requires an explicit load/store, even if this function is inlined and the state could be held in registers; the compiler will need to recognise the optimisation
	svuint8_t tmp = svld1_u8(svptrue_b8(), s->data);
	process_vector(tmp);
	svst1_u8(svptrue_b8(), s->data, tmp);
}

I personally don’t use SIMD abstraction libraries or layers, but I imagine not being able to embed a vector within a class/union may cause issues with some existing designs.

I don’t know enough about compilers to really understand the fundamental limitations here, but it’s possible that this is just something current compilers have yet to tackle. If so, it could mean a future compiler won’t require programmers deal with this constraint.

No direct indexing of vector elements

I suppose the feature is less useful with variable length vectors, but SVE does guarantee that at least 128 bits exist in a vector, so it’d be nice if at least those could be directly accessed.

// NEON: direct indexing works
int get_third_element(int32x4_t v) {
	return v[2];
}
// SVE: requires explicit intrinsics
int get_third_element(svint32_t v) {
	return svlasta(svptrue_pat_b32(SV_VL2), v);
}

Loading Vector/Predicate Constants

Due to its unknown vector length, ARM doesn’t recommend loading vector/predicate constants, instead providing instructions like INDEX encouraging you to construct the appropriate vector at runtime.

However, since there’s a defined maximum length, it should theoretically be possible to define a 2048-bit vector, or 256-bit predicate and load as much as the hardware supports.

ACLE doesn’t provide direct support for this idea unfortunately (as in, similar to svdup functions).

No direct NEON ↔ SVE casting

Although NEON and SVE registers map onto the same space, ACLE provides no way to typecast a NEON vector to an SVE vector, or vice versa. I’m not sure how useful the feature may be, but it perhaps it could be useful if you have some code which only needs 128-bit vectors, or may help with gradually transitioning a NEON codebase. Under ACLE, NEON ↔ SVE value transfer must go through memory.

Interestingly, full vector reduction instructions (such as UADDV) write to a vector register, but the ACLE intrinsics give back a scalar (i.e. they include an implicit vector → scalar move).

32 bits ought to be enough for any instruction

SVE maintains AArch64’s fixed-length 32-bit instructions. Whilst a fixed-length ISA greatly helps with designing the front-end of a CPU, trying to support up to 3x 32 registers along with 8 or 16 predicate registers per instruction, and maintaining compatibility with AArch64, can lead to some unexpected contortions in the ISA which may initially confuse developers.

one size fits all?

Destructive instructions aren’t scary

Despite the assembly for various instructions, that produce an output from three inputs (such as BSL), looking like they accept four registers, the first two must be identical. This basically means that the instruction only really accepts three registers, where the first input must be overwritten.

BSL z0, z1, z2, z3  # illegal - first two operands must be identical
BSL z1, z1, z2, z3  # okay

The manual mentions this naming convention was chosen to make it easier for readers, though this may initially surprise those writing assembly.

In addition, a number of SVE instructions provide the option of either allowing predication or being non-destructive, but not both.

SVE does provide a ‘workaround’ for destructive instructions - the MOVPRFX instruction, which must be placed directly before a destructive instruction (presumably to enable the decoder to implement macro-op fusion, which should be cheaper than relying on move-elimination). In my opinion, it’s a nice idea, allowing the microarchitecture to determine whether SVE should be treated as a fixed or variable length instruction set, whilst keeping defined instructions to 4 bytes. Requiring an extra 4 bytes just to work around the destructive nature of instructions does feel rather wasteful, but given the options SVE provides, I suspect this not to be too common of a problem to really matter.

In terms of actual implementation of MOVPRFX, the Neoverse V1/N2 and Cortex A510/A710/X2 optimization guides don’t list support for fusing MOVPRFX, list non-zero latency figures and don’t list it as a zero-latency instruction, so the optimisation may be unavailable on these initial SVE implementations.

It’s also interesting to note that it appears MOVPRFX can’t be paired with instructions which are ‘defined as destructive’ like SRI or TBX - it only works for instructions that are destructive as a result of limited encoding space, meaning that ‘defined as destructive’ instructions are likely to incur a move penalty if you need to retain the inputs (unless a future processor introduces zero-latency MOVs on vector registers).

One-handed predicate wielding

Despite there being 16 predicate registers, most instructions only designate 3 bits for the predicate, only allowing P0-7 to be used for predication. P8-15 is generally only accessible to instructions which directly work with predicates (such as predicate AND), or those that yield a predicate (such as compares).

Fixed Width Vector Workloads on Variable Width Vectors

SIMD programmers maintaining fixed width or width specific problems may be concerned with the applicability of an unknown vector length system to their problem. Vector based designs, such as SVE, work well for problems like SAXPY, but if your problem involves fixed width units, or even requires different algorithms depending on the width, how feasible is an SVE implementation when you don’t know the vector width?

square peg in round hole

Some problems may be solvable with some rethinking, others may be frustrating to deal with, and others just may not work well at all (for example, the despacer problem may not scale well under SVE, at least without non-trivial redesign). In general, you may need to accept some level of inefficiency (in terms of performance and implementation complexity), but in the worst case, you may be stuck with only using the bottom 128 bits of the vector, or need to code specifically for each width you wish to support.

Processing in terms of 128-bit units

Given that SVE requires vector widths to be a multiple of 128 bits, it may be worth thinking about the problem in terms of 128-bit units, if possible. For example, when processing an image in terms of 8x8 pixel blocks, instead of trying to fit a block in a vector, consider processing multiple blocks, 128 bits at a time, across each ‘128-bit segment’.

This also fits in nicely with a number of SVE2 instructions, such as MATCH/NMATCH, using a ‘128-bit segment’ concept for their operation. Sadly, many instructions (such as gather, broadcast etc) don’t have the notion of a 128-bit element size, which would’ve helped, efficiency-wise, with adopting such a strategy.

Whilst there’s instructions such as ZIP1.Q and co, these require SME or the FP64 matrix multiplication extension (which don’t appear to be available on upcoming CPUs, based on Neoverse/Cortex optimisation guides). Similarly, SME adds the potentially useful REVD instruction, but support for that will be non-existent for a while.

It’s interesting to note that it’s possible to broadcast an element within each 128-bit segment, but this comes in the form of a multiply instruction, and doesn’t work with 8-bit elements.

Update: ARM has added Hybrid Vector Length Agnostic instructions as a part the SVE2p1 extension, which includes a bunch of handy 128-bit element instructions

Data transposition

A problem that crops up in a number of places (such as AoS ↔ SoA): you have N vectors, each containing N elements, and you with to transpose this data (i.e. take the first element of these N vectors and put them in a single vector, the next element of these N vectors into another, and so on). Consider an example where we have N streams of 32-bit words - if we have a 128-bit vector length, we can fit up to 4 words from each stream in a vector (i.e. N=4). To process these 4 streams concurrently, we read in one vector from each of the 4 streams, then transpose these 4 vectors so that we have the first 32 bits of each data stream in one vector, the next 32 bits in another, and so on.

SIMD transposition

(note: the streams may be longer than shown and exist in non-consecutive memory locations, so LD4 cannot be used here)

If the vector length was 256-bit instead, we’d ideally be processing 8 streams at a time.

This raises an issue: reading N vectors requires knowing N in advance, as you need to allocate the appropriate number of registers. Traditional transposition algorithms (transpose halves, then transpose quarters etc) are width specific, and are designed for power-of-2 vector widths, which doesn’t translate well to variable length vectors.

The other option is to ditch the transposition idea and instead rely on gather, which would allow one implementation to work on all vector sizes. Whilst likely simpler to implement, performance may be a concern, particularly as the vectors get longer.

A compromise could be made:

  • implement 5 power-of-2 variants, using traditional transposition, and a gather variant for other sizes. This still requires you to write 6 implementations however
    • Considering limitations on number of available registers, you may not need an implementation for every power-of-2. Alternatively, you can choose to ignore the larger vector sizes (focusing primarily on 128, 256 and 512-bit) and only optimise for 1024/2048 bit when such a machine eventuates
  • use 64-bit element gathers (unfortunately an ideal 128-bit element gather doesn’t exist) and transpose down to the element size you desire. This is obviously more expensive than traditional transposition, but cheaper than a full gather, whilst enabling you to stick to one implementation
    • If you have separate 64-bit pointers, a 64-bit gather would be your only choice as a 32-bit gather may be unable to reference all sources
    • Since gather can only be done with 32 or 64-bit elements, if your element size is smaller (i.e. 8/16-bit), you’ll also need to take this approach over a pure gather

using gather-transpose

Data Permutation? Just Leave a Mess on the TBL

Trying to shuffle data around in a vector becomes a lot more difficult when you don’t know where the boundaries (length) lie. This will likely cause complications with algorithms which heavily depend on data swizzling/permutation.

length of a string

Despite offering just enough tools to get the job done (i.e. the TBL instruction), SVE2 does little to help on this front, and swizzling seems to be an aspect not well considered. Particularly egregious is that a number of SVE extensions (e.g. FP64MatMul, SME) add new swizzling instructions to cater for the narrow use case the extension is targeting, as clearly the ones included in the base SVE2 instruction set aren’t considered sufficient.

Copying leads to mediocrity at best

NEON had a number of useful swizzling instructions, and I think makes it overall fairly good ISA at swizzling. SVE2 inherits many of these, often with little change, and because of that, it’s much more problematic. To provide an example, let’s take a look at the EXT instruction, which fits this description:

NEON forgoes byte level shifting instructions (such as those in x86’s SSE2) in favour of the EXT instruction, which can be used for both left and right shifts, given appropriate inputs. Note that EXT naturally shifts a vector to the right, but left shift can be done by subtracting the desired shift amount from 16 (the byte width of NEON vectors).

SVE2 carries over this instruction with the same functionality, which makes right shifting just as trivial. However, left shifting now becomes more difficult as you don’t know the vector length in advance (so a 16-n approach doesn’t work), and the shift amount must be an immediate value (so a svcntb()-n solution isn’t possible).

To do a left byte shift, your options may be limited to:

  • write a version of this instruction for every possible vector width, and switch() on the vector width - this may be sensible if this is a critical instruction in a hot loop, where you may want to implement 16 copies of the loop and select the version based on vector width
  • write vectors out to memory, then load it back with an appropriate offset
  • reverse (REV) the vector, do a right byte shift (EXT), then reverse again
  • create a mask then use SPLICE
  • for shifting by 1, 2, 4 or 8 bytes, INSR is also an option

It probably would’ve made sense to include a left-shift version of EXT, or perhaps a byte shift which accepts a variable (non-immediate) shift amount.

TBL and TBX with bytes: different paths to the same destination

Due to the limited number of useful swizzling instructions available, developers may be encouraged/forced into using the all-purpose TBL/TBX to get data into the correct place. The SVE manual explicitly mentions these instructions are not vector length agnostic, which can make it a little awkward to use, as well a potential source of subtle bugs.

SVE’s TBL supports non 8-bit element sizes (in contrast with NEON), and for these, the instruction is fairly straightforward.

However, for 8-bit elements, the distinction between TBL and TBX blurs a lot. In theory, the two instructions do different things, but in canonical SVE code, you can’t rely on that being the case. If the vector width is 2048-bit, both instructions do exactly the same thing, so unless you have separate code paths for different vector lengths (or some other length checking/controlling mechanism), there is no point in ever using TBX with 8-bit elements. Similarly, you can’t rely on TBL‘s zeroing feature (e.g. by setting the bytes you want zeroed to 0xff in the index vector) as it may never be triggered with 2048-bit vectors.

Of course, I fully expect there to be software which doesn’t heed this, particularly since 2048-bit vector machines seem unlikely in the near future, leading to possible unexpected bugs on future machines.

Zeroing (and merging) via predication may have been a better approach, but why come up with some new SVE-esque way of doing things when you can just copy TBL and TBX from NEON and call it a day?

TBL2: two TBLs to bang your head against

(also known as the 2 source vector TBL from SVE2, but ACLE refers to it as ‘TBL2’, so I will as well)

As with (single vector) TBL and TBX, this instruction pretty much works the same way it does in NEON, but as with EXT, it’s also much more awkward to use than the NEON version. To reference elements from the second vector, you need to know the vector length, and then add that to the indicies that refer to elements of the second vector. This means that you’ll almost always need a bunch of instructions to prepare the index vector for TBL2.

Following on from the point in the previous section, using TBL2 with 8-bit elements is completely unusable unless you have vector length checks in place. For 2048-bit vectors, TBL2 can reference none of the elements of the second vector.

It probably would’ve been better if the top bit of each element was used to indicate whether it pulls from the first or second source vector. Alternatively, predication could be used to select the source vector (similar to the SEL instruction). Both of these are likely easier than having to do a count, broadcast and or-merge to reliably select the desired elements.

Are arbitrary sized permutes always wanted?

ARM notes that the permutation instructions TBL, TBL2 and TBX aren’t vector length agnostic. This is understandable, particularly if you want a full vector shuffle/permute, but one does wonder whether limiting the scope may have helped with problems that scale in terms of 128-bit units (such as VPSHUFB in x86 which restricts shuffling to the 16 bytes per lane). On the other hand, full vector permutes are certainly a boon to those used to dealing with the hassles of 128-bit lanes on x86’s AVX.

However, there’s another concern with allowing full width permutes: ARM notes in a presentation that many of the widen/narrow instructions had their behaviour changed to avoid transferring data across long distances. However the TBL family by definition must support transferring data across the entire vector, even if you’re not actually using it for that purpose. This raises a concern with potential performance penalties for using the instruction.

SVE2 pairwise operations

x86’s AVX implementations usually have cross-lane penalties with instructions that cross 128-bit lane boundaries, suggesting that this isn’t just a problem with ARM designs.

However, looking at the Neoverse V1 optimization guide (and assuming accurate figures), there doesn’t appear to be any penalty with a 256-bit (SVE) TBL over a 128-bit (NEON) one. Similarly, other instructions which may need to move data across distances, such as REV, show no penalty.

Aside: it’s interesting to note the way the ALUs are arranged - the Neoverse V1 has 4x 128-bit NEON ALUs which can be arranged into 2x 256-bit for SVE. In such a case, it appears that port V2 is folded into V0 and V3 into V1 - in other words, 128-bit instructions which can be issued to ‘V02’ (i.e. ports V0 and V2) will turn into a 256-bit instruction issued only to V0 (since V2 and V3 are ‘unavailable’ under SVE). TBL in ASIMD is noted as V01, and is also V01 for SVE, suggesting that the Neoverse V1 actually has 2x 256-bit shuffle ports, which simply get restricted to 128-bit when executing NEON, as opposed to the more traditional 4x 128-bit → 2x 256-bit setup. Looking at the Cortex X2 optimisation guide (4x 128-bit SVE2), the ASIMD TBL operates the same, but SVE’s TBL now operates on all four 128-bit ports. This matches the stated Neoverse V1 behaviour, but is weird for SVE’s TBL to have double the throughput of ASIMD’s TBL despite doing the same operation. Another interpretation is that at least some of the published figures for SVE’s TBL are incorrect.

The A64FX guide also doesn’t show a penalty for a 512-bit TBL over ASIMD, but it is noted that TBL always has a higher latency relative to other SIMD instructions on that platform.

It’s entirely possible that moving data across the entire vector just isn’t a problem at current SIMD widths, but could be problematic on wider vectors, or maybe just isn’t that much of a problem period. It’s also unknown whether TBL2 would have any penalty over TBL for width >=256-bit, but speculating from existing ARM architectures, perhaps not.

Update: ARM has now added instructions like TBLQ as part of their Hybrid Vector Length Agnostic instructions in the SVE2p1 extension

Let memory solve it?

SVE adds support for gather/scatter operations, which helps vectorize a number of problems that may otherwise require a lot of boilerplate shuffling. It also can help alleviate issues with lack of good in-register swizzling, since data can be shuffled to/from memory, however it’s not without its downsides:

  • limited to 32/64-bit element sizes (note that the 8/16-bit variants only support sign/zero extension to 32/64-bit)
  • poor performance due to being bottlenecked by the processor’s LSU (and likely being executed as multiple µops), as it’s effectively a scalar operation in terms of performance. In other words, performance decreases linearly as the vector width increases
  • still requires index generation like TBL

Throwing in the towel

Perhaps trying to swizzle data, when you don’t know the length of the vector, is all just too hard. Perhaps we can just limit the vector width instead and pretend to be using a fixed length vector ISA?

SVE does actually provide some support for operating with constrained vector sizes (such as a multiple of 3, or restricting it to a power of 2). Unfortunately, this is done primarily via predication (and support from some other instructions like INCB) and does not help with swizzling instructions (most of which don’t support predication to begin with).

There does, however, seem to be a mechanism for actually setting a vector length, but I don’t think it’s available at EL0 (user-space), so an interface must be provided by the OS (prctl(PR_SVE_SET_VL, ...) under Linux). Note that the contents of vector registers become unspecified after changing the width, and compilers are likely not expecting such changes to occur during execution, so it’s not something that can be changed on a whim. Also, you can only set a vector size smaller than, or same as, what is supported on the hardware; you can’t just set an arbitrary sized vector, such as 512-bits, and expect the hardware to magically deal with it, if the hardware only supports 256-bit vectors. Due to the downsides and the design goal of this mechanism, it seems like this isn’t aimed at easing development, but rather, perhaps serve as a compatibility measure (e.g. allow the OS to run an application with a limited vector width if the application wasn’t designed with wider vectors in mind), or maybe for allowing live server migrations.

However, since the vector length is expected to remain constant throughout execution, you can do checks on the vector length and either have multiple code paths, and/or outright reject supporting some configurations.

Alternatively, since the register files are shared, you can use NEON instructions if you don’t need more than 128 bits, although SVE support doesn’t seem to guarantee NEON support. ACLE also doesn’t help as it doesn’t provide a mechanism to typecast between the two vector types.

The Next Fad: 1664-bit Registers

Update: ARM has now disallowed non power-of-two widths in an update (see C215), which addresses a number of concerns mentioned below

SVE allows any vector width that’s a multiple of 128-bit up to 2048 bits, with no further restriction. This means that a system with 1664-bit vectors would adhere to SVE specifications. I don’t know whether anyone will ever make a machine with such an unusual width, but it undoubtedly enables such flexibility on the hardware side. Regardless, the programmer still has to ensure their code works for such cases to be fully ‘SVE compliant’ (also keep in mind that width can be adjusted down by privileged software, so a system with 512-bit vectors could be configured to operate at 384-bit width). My guess is that ARM is banking heavily on the ‘scalable’ part working well to minimise the impact this has on the programmer, whilst simultaneously maximising hardware configurability.

Unfortunately, not only is SVE not fully width agnostic, this does come with a number of caveats, particularly with dealing with non-power-of-2 widths:

  • memory alignment: if you already do this, it’s probably not worth trying too hard with SVE. I guess you could choose to only bother if the vector width is a power of 2, or maybe try some alignment such that at least some fraction of loads/stores will be aligned (assuming desired alignment boundaries are always a power of 2). It’s possible that trying to cater for alignment isn’t so much of a concern with newer processors (as misalign penalty seems to be decreasing), though interestingly, the addition of memcpy/memset instructions in ARMv9.3 catering for alignment seems to suggest otherwise
  • power-of-2 bit hacks: multiplication/division by vector width is no longer a bit shift away, and if you have code which computes multiples/rounding or similar, such code may need to be revised if you previously assumed vectors would always be a power-of-2 in length (there’s an instruction to multiply by vector width though (RDVL); it may have been handy to have an instruction which does a modulo/divide by vector width without needing to invoke a slow division instruction or resort to runtime division hacks)
  • divide-and-conquer strategies: algorithms such as transposition or custom reductions may require more care as you can’t always repeatedly halve the vector width
  • common multiple: one cannot assume that all possible vector sizes will fit neatly into a statically-chosen smallish array (the lowest common multiple of all SVE vector sizes is 3,843,840 bytes). This can complicate some matters, such as static AoSoA setups
  • bottlenecks on a 128-bit vector machine are likely very different to that of a 2048-bit vector machine. As vectors increase in size, cache/memory bandwidth can become more of a bottleneck, but this may be difficult to predict without benchmarking on a wide vector machine

Currently announced/available SVE/2 widths are 128, 256 (Neoverse V1) and 512 bits (A64FX); it’s unknown if a 384 bit SVE implementation will ever eventuate (seems unlikely to me, but ARM felt it necessary to allow it).

Small/fast vs big/slow

big vs small

On the x86 side, the first AVX512 implementation had an interesting quirk: the frequency of the processor may throttle when executing ‘heavy 512-bit code’. As the vector width is programmer-selectable there, this allowed code to be either latency focused (keeping to shorter vectors) or throughput focused (using longer vectors).

SVE doesn’t readily allow the programmer to select the vector width, which means this choice isn’t available. This could limit hardware implementations to select widths that have a limited impact on frequency/latency. The Neoverse V1 optimisation guide doesn’t provide info on performance when the SVE width is restricted to 128-bit, so it’s yet to be seen if a processor would perform differently depending on the vector width set - particularly with instructions that may not scale so well with vector length (e.g. gather/scatter).

Interestingly, ARM introduces a ‘streaming mode’ as part of SME, which enables it to have a longer vector width relative to SVE, and may be a solution?

I Must Upgrade NEON to SVE2

Apart from being variable length and supporting predication, many instructions in SVE2 mirror that of NEON, which makes understanding easy for anyone familiar with NEON, and helps with translating code.

ARM does highlight a key change made from NEON to SVE2 in how widening/narrowing instructions work:

SVE2 widen/narrow

Not highlighted, but another difference would be that all comparisons yield a predicate result instead of a vector; if the old behaviour is desired, it has to be constructed from the predicate.

Some NEON instructions also seem to be missing SVE2 equivalents:

  • ORN (bitwise or-not): replaceable with BSL2N at the cost of being destructive
  • CMTST (bit-test compare): unsure why this was removed; you’ll probably need to take the manual approach (AND+CMPEQ) with this one
  • vector ↔ scalar moves, e.g. SMOV/UMOV and INS: for extracting elements from a vector, LASTA/LASTB could work.
  • XTN: UZP with a zero vector seems to do the same thing
  • 3-4 vector TBL has been removed, as well as multi-vector TBX; though inconvenient to have removed, those instructions typically performed poorly, and the implementation of 2-vector TBL in SVE2 feels somewhat iffy (at least with 8-bit elements).
  • multi-vector LD1 and ST1: probably not enough benefit over issuing the load/store instructions individually

ARM’s own migration guide can be found here.

Translation Tools

Some developers may be familiar with libraries like sse2neon, which can help transition code written in one ISA to another. At the moment, I don’t know of any tools which can translate intrinsics into SVE/2.

Being variable width, translating a fixed width vector ISA into SVE/2 will likely be more challenging than what a simple header library can achieve.

However, translating the other way around may eventually be possible with something like SIMDe.

Aside: Comparison with AVX512

I may as well make a comment about x86’s closest, very well received competing SIMD extension. The overall comparison between NEON ↔ SSE largely remains, with the ARM side feeling more orthogonal in general (with fewer odd gaps in the instruction set), but has less “Do Tons of Stuff Really Fast” style instructions that x86 has.

SVE2 ↔ AVX512 has been closing the gap however. AVX512 plugs many of the holes that SSE had, whilst SVE2 adds more complex operations (such as histogramming and bit permutation), and even introduces new ‘gaps’ (such as 32/64-bit element only COMPACT, no general vector byte left-shift, non-universal predication etc).

SVE‘s handling of predication feels more useful than AVX512’s implementation - whilst support for it isn’t as extensive, the idea of the WHILELT instruction and FFR are a great use of the feature, allowing it to be easily used for trailing element handling and even speculative loading (useful for null terminated strings, for example).

AVX512 may seem more broken up, due to all its extensions, but x86 has the benefit of there being few implementations and low variability, making this perceived fragmentation mostly illusory in practice (currently, Skylake-X and Ice Lake are the major compatibility targets; though the programmer must be aware of this). It’s unknown how SVE/2 will fare in that regard, but having a solid base instruction set makes it less likely an issue.

As for the elephant in the room, the key difference between the two is that AVX512 allows the programmer to specify the vector width (between 128/256/512 bit) but requires hardware to support all such widths, whereas SVE2 allows the hardware to dictate the width (128-2048 bit) at the expense of the programmer having less control over such.

In terms of hardware support, it’s expected all future ARM designed CPUs will support SVE2 across all A-profile products, albeit many possibly limited to 128-bit vectors, whilst AVX512 is somewhat of an unknown with Intel being uncertain about support for it in their latest client processors. AVX512 does currently have the benefit of actually being available to consumers for a few years though.

Development Resources

Documentation/Information

Being new, there isn’t a lot of info out there about SVE at the moment. ARM provides some documentation, including an instruction manual, as well as an ACLE PDF; the descriptions in the latter are rather spartan, so you’ll often need to refer back to the assembly manual for details.

SVE/2 has recently been added to ARM’s Intrinsics page.

The SVE documentation in Linux is worth checking out for ABI details, if Linux is your target platform.

Compilers

The latest GCC and Clang/LLVM versions have support for SVE2 via intrinsics. I haven’t tested armclang (which I presume wouldn’t have any issues), but have tested GCC 10 and Clang 12 (I’ve found issues in Clang 11 that seem to be fixed in 12, so I suggest the latter). I have not tested auto-vectorization, neither have I looked at assembler support, but recent-ish GNU Binutils’ objdump handles it fine.

I don’t know about support in non C/C++ compilers, but I’d imagine it to be fairly minimal at best, at present.

Testing Code

Without SVE hardware available, your only option is to use an emulator. Even if you have SVE hardware, you’ll probably still want to use an emulator - just because your code works on your SVE CPU doesn’t mean it’ll work on a CPU with a different vector width. I can see this latter point being a not-unlikely source of bugs; the additional complexity of getting something to work with arbitrary sized vectors can mean unnoticed bugs if code isn’t verified against a range of vector sizes.

ARM provides their ARM Instruction Emulator, which unfortunately, isn’t open source and only released for select AArch64 Linux hosts (which shouldn’t be too surprising). This means that you’d likely have to run it on an ARM SBC (of which many aren’t particularly performant), a smartphone which can run a Linux container, or rent an ARM instance from a cloud provider (which is becoming more available these days).

One thing to note is that if you’re doing dynamic dispatch (e.g. testing SVE availability via getauxval in Linux), you’ll need to bypass the check when testing because ARM’s emulator doesn’t intercept that call. If you need to set a vector width dynamically (i.e. via prctl)… I guess you’ll have to hope it works in practice unless you have an actual test machine.

Another option may be to use qemu, or other suggestions from ARM, but I haven’t tried these myself. It seems like ARM’s own IE doesn’t support big-endian, so you may need to use qemu to test compatibility if big-endian support is a concern.

OS support

From what I can tell:

Amongst popular OSes, it seems like only Linux/Android really supports SVE so far.

Conclusion

SIMD programmers have often been annoyed with having to deal with multiple SIMD instruction sets, not to mention multiple ISAs on the same platform (see SSE/AVX/AVX512). SVE aims to address the latter problem (at the expense of introducing yet another ISA) with a different approach, whilst providing greater flexibility to CPU implementers and new capabilities to programmers. Towards that goal, it does feel like SVE largely achieves it with a reasonable balance of concerns, despite all the compromises this entails.

To enable this, SVE has to operate at a higher level of abstraction relative to fixed length SIMD ISAs. In addition to possibly adding some overhead, it inevitably makes this ISA more opinionated, meaning that it can simplify use cases which suit its model whilst making others more challenging to deal with. And SVE does feel like it’s more focused on specific scenarios than NEON was.

Depending on your application, writing code for SVE2 can bring about new challenges. In particular, tailoring fixed-width problems and swizzling data around vectors may become much more difficult when the length is unknown. Whilst SVE2 should provide enough tools to get the job done, it’s underwhelming support for data swizzle operations feels poorly thought out and leaves much to be desired. On the other hand, SVE2 shares a lot of functionality with NEON, which greatly simplifies adopting existing NEON code to SVE2.

Unfortunately, current implementations of SVE intrinsics can make software wrappers more restricted, particularly for those hoping to use one to unify the multiple SIMD ISAs out there. I can also see subtle bugs potentially cropping up for code not tested against a variety of vector lengths, particularly since SVE allows for great flexibility in that regard, which means your development pipeline may need to incorporate an instruction emulator for testing, regardless of hardware availability/support.

Finally, SVE2’s main selling point over NEON - the ability to have >128-bit vectors on ARM - will likely not eventuate on many processor configurations for quite some time. As such, and in addition to slow software/OS adoption, it could be a while before we get to see the true extent of what SVE is capable of, and how well the ‘scalable’ aspect plays out.

@clausecker
Copy link

I thought qemu was cross platform? Regardless, that's an issue with tooling, not the ISA.

The system-level emulator is, the userspace emulator is not (it needs to know the layout of the operating system's datastructures and what its system calls are). If I have to start mutliple full VMs with QEMU just to run the test suite I guess I won't be doing much SVE development in the future.

Why would you think an OS built-in emulator would be any different?

That emulator would only trap/emulate the relevant SVE instructions, not the full instruction set. Thus, problems stemming from difficult to emulate behaviour of the processor (such as aspects of concurrency) can be sidestepped. Presence of such an emulator can be mandated by the ABI specification, making it portable to rely on it (as is the case on SPARC).

I haven't tried qemu (which might indeed require it due to being more of a system emulator), but I haven't had any such problems with Intel's SDE or ARM's Instruction Emulator - they just use existing libraries on your system and have relatively minimal setup requirements.

Neither of these two are portable and both seem to be closed source programs (one even only available for payment apparently?) In particular, neither are available for operating systems I commonly develop on such as FreeBSD. Software I write runs on dozens of operating systems. Having an emulator that is closed source, commercial, and only available on two of them be a mandatory prerequisite for running the full test suite does not inspire confidence.

And in any way: I do not think a userspace emulator can even be written to be fully portable as it has to be adapted manually to the system calls of each operating system it can run on.

If you're only using the bottom 512 bits of the vector, regardless of actual size, it shouldn't really be a problem, since you'll just be ignoring the rest of the mask anyway (assuming you're referring to 8 bit elements).

The use case I was thinking about here was one where I could reasonably write vector-length agnostic code, so a restriction to the bottom 512 bits of the vector would not be sensible.

I don't see how it's that different on the AVX side. If an "AVX1024" comes out, you'll need to write a separate code path to take advantage of it, which you won't be able to test without an emulator or access to whatever CPU that ends up supporting it.

If I write and test AVX code today, it will work and do the exact same thing on all processors that support AVX, even future processors. Such processors won't suddenly start to do different things when executing AVX instructions, possibly breaking my code. I can be certain that having tested the code thoroughly on one machine, it will always be correct.

The same cannot be said about SVE code. Unless I test with all possible vector lengths, there's just no way to be sure that my code will work correctly on future processors. Maybe a future processor with 1664 bit SVE suddenly triggers an unexpected corner case in my code that cannot be triggered with any other vector length and everything crashes and burns.

It is boggling my mind how nobody sees a problem with that. Industry best practice is to test your code both with unit and integration tests and with as much coverage as possible. Programmers, even smart ones, make mistakes and all that. But suddenly we have new instructions that are specified to behave differently on different CPUs (something that was never the case before!) and everybody says “ah no, we don't need to test for that, it'll be just fine. In fact, we won't even add facilities to make it possible test for that. Because you know, it will be fine.” Just imagine some of the usual things that could happen:

  • someone writes SVE code that accumulates data into a register, flushing it out after processing a vector-length dependent amount of data. For short vector lengths this works fine. For longer vector lengths (which have not been/could not be tested for) the register overflows and it fails.
  • someone incorrectly (our of laziness, pragmatism, ...) assumes that the vector length won't exceed 1024 bit anyway (as that's the largest SVE implementation on the market) and writes code that makes such an assumption. In 10 years, CPUs routinely have 2048 bit SVE and the code fails horribly. A ticking time bomb right there.
  • some SVE code stores arrays of temporary vectors on the stack, using a fixed-size array. For small vector lengths as shipped on current CPUs, this works fine. For large vector lengths on future CPUs, we have a surprise zero-day exploit on out hands.
  • as above, but using a VLA for the vectors. Now we no longer have stack corruption, but the function just straight up crashes the program with a stack overflow for larger vector lengths.

I mean SVP64 at least has a user-settable register to clamp the vector length if you need to, but with SVE, there is no ZCR_EL0. It's just completely incomprehensible to me. As if all these vector people drank their own koolaid so hard, they forgot that people could seriously fuck up.

@zingaburga
Copy link
Author

The system-level emulator is, the userspace emulator is not

Ah I see.

That emulator would only trap/emulate the relevant SVE instructions, not the full instruction set
I do not think a userspace emulator can even be written to be fully portable as it has to be adapted manually to the system calls of each operating system it can run on

What's your definition of 'portable'? Yes, there will be OS specific stuff, but that's no different from any other software out there.
Otherwise, I don't see why an application emulator couldn't do a similar thing - that is, trap on SIGILL and emulate the relevant instructions (I wouldn't be surprised if ARM's IE is doing exactly that).

Neither of these two are portable and both seem to be closed source programs (one even only available for payment apparently?)

They are closed source, but are freely available for use. I link to ARM's IE above, and Intel's SDE is available here.

But again, this sounds like an issue with tooling, not an ISA issue nor OS deficiency. If an open source, 'portable' emulator existed, I'm guessing you wouldn't have any qualms?

In particular, neither are available for operating systems I commonly develop on such as FreeBSD

Does FreeBSD even support SVE at the moment? Don't you think mandating the kernel include an emulator would slow down adoption by OSes?

The use case I was thinking about here was one where I could reasonably write vector-length agnostic code, so a restriction to the bottom 512 bits of the vector would not be sensible.

Wait, weren't you saying that your problem is heavily width dependent and doesn't work well with VLA?
If you can't make your problem VLA, ignoring higher bits of the vector/mask seems like a sensible approach.

It is boggling my mind how nobody sees a problem with that

Well, I did point it out in the article, so I don't think you're the only one...

@clausecker
Copy link

What's your definition of 'portable'? Yes, there will be OS specific stuff, but that's no different from any other software out there.
Otherwise, I don't see why an application emulator couldn't do a similar thing - that is, trap on SIGILL and emulate the relevant instructions (I wouldn't be surprised if ARM's IE is doing exactly that).

My usual portability standard is: must work on any POSIX-like system with a C compiler. For this use case, I can reasonably also assume that the system runs ARM64 and that the C compiler supports all the relevant SVE intrinsics. If the library I am developing is written in assembly, I may make further assumptions about ABI (such as, that it follows the ARM procedure call standard and so on).

Trapping SIGILL (resp SIGEMT) to emulate instructions cannot be done in a portable manner as none of the usual specifications (ISO 9899, POSIX, ...) specify the layout of the data structures needed to do so.

They are closed source, but are freely available for use. I link to ARM's IE above, and Intel's SDE is available here.

I do not plan to use closed-source software for my development.

But again, this sounds like an issue with tooling, not an ISA issue nor OS deficiency. If an open source, 'portable' emulator existed, I'm guessing you wouldn't have any qualms?

Once again, I do not believe such a thing can be made. And again, integrating that into a test suite is already pretty annoying. Not to mention that the test suite will have to be run 32 times, once for each possible vector size.

Does FreeBSD even support SVE at the moment? Don't you think mandating the kernel include an emulator would slow down adoption by OSes?

I am not sure if FreeBSD supports SVE right now as I do not have any hardware to test on. ARM could provide the required code as a public domain package for integration into operating systems to enable easy integration.

Wait, weren't you saying that your problem is heavily width dependent and doesn't work well with VLA?
If you can't make your problem VLA, ignoring higher bits of the vector/mask seems like a sensible approach.

You are confusing different applications I mentioned. One is the puzzle-solving application which tries to use vectors for fixed-length structures. Another can use vectors of arbitrary length, but could benefit from processing masks in general purpose registers.

Well, I did point it out in the article, so I don't think you're the only one...

I guess we can agree on that. My prediction: so much code will have trouble with longer vector lengths that they'll retrofit a “max supported vector length” note into their ELF spec as a compatibility measure. Long term I believe it'll be likely that application-class ARM chips won't ever extend past 128 bit SVE due to such issues with existing code.

@zingaburga
Copy link
Author

cannot be done in a portable manner as none of the usual specifications (ISO 9899, POSIX, ...) specify the layout of the data structures needed to do so.

You could just support a wide range of OSes.
I guess you could argue that this still isn't 'portable' in the POSIX sense, but I'd counter that it'd be good enough for all practical purposes. Not to mention that SVE itself isn't supported on all POSIX OSes anyway.

Not to mention that the test suite will have to be run 32 times, once for each possible vector size.

There's 16 possible vector sizes under SVE, but you don't have to test every possible width. Just like you don't have to do 4.2B tests for every function that takes a 32-bit integer, you can be selective with the widths to test.

But yes, needing to incorporate an emulator into the test routine does add complexity.

You are confusing different applications I mentioned. One is the puzzle-solving application which tries to use vectors for fixed-length structures. Another can use vectors of arbitrary length, but could benefit from processing masks in general purpose registers.

Okay that makes more sense. I talk about mask processing in GPRs here.

@jan-wassenberg
Copy link

@zingaburga nice, some quick updates:

The _x predication variant is actually important for avoiding extra MOVPRFX instructions being generated.

It is no longer necessary to worry about non-power of two vector lengths [https://developer.arm.com/documentation/102105/ia-00/]

@zingaburga
Copy link
Author

Interesting, I can't think of how the _x predication could do that, would you be able to explain?
Note that I'm comparing _x predication against no predication.

Also interesting is ARM's reversal on non-power-of-2 widths; looks like this earlier post got actioned. The change does indeed remove a quirk of the ISA.

I see the following in your linked document:

2.45 C215: SVE
Arm is making a retrospective change to the SVE architecture to remove the capability of selecting a non-power-of-two vector length in non-Streaming SVE as well as in Streaming SVE mode.
Specific updates as a result of this change will be communicated in due course.

Does 'selecting' refer to software selection (i.e. setting ZCR_ELx.LEN), or the ability for hardware to select such widths, or both?
It looks like this document hasn't been updated yet, so we'll have to wait and see.

Thanks for the update!

@zingaburga
Copy link
Author

Added some minor updates to address changes ARM has made since late 2021. I haven't gone through the full article (or reviewed all changes made to SVE) and, at the moment, don't feel like rewriting it or writing a new article to address all the changes.

Happy to add 'update' lines if anyone picks up something I've missed.

@jan-wassenberg
Copy link

Interesting, I can't think of how the _x predication could do that, would you be able to explain?

If we insist upon zeromasks then the instruction encoding may not have enough bits for one of the regs, hence MOVPRFX is required. We use _x when we don't care and the predicate is all-true, and _z only where we really want zeros.

Does 'selecting' refer to software selection (i.e. setting ZCR_ELx.LEN), or the ability for hardware to select such widths, or both?

I think one could reasonably conclude based on this information that software will not have to deal with non-pow2.

@zingaburga
Copy link
Author

We use _x when we don't care and the predicate is all-true

I'd assume that in the vast majority of cases, _x would only be used with all-true predicates. Which means that the predicate is basically pointless and they should've just dropped it.

To put it another way, svadd_x(svptrue_b32(), a, b) could've been just svadd(a, b) instead, meaning that the _x variant is rather pointless.
In some weird edge cases, _x might save a PTRUE (not MOVPRFX), but it seems such an odd case for a lot of ACLE to optimise for.

I think one could reasonably conclude based on this information that software will not have to deal with non-pow2.

Yeah, that seems logical.

@jan-wassenberg
Copy link

I'd assume that in the vast majority of cases, _x would only be used with all-true predicates. Which means that the predicate is basically pointless and they should've just dropped it.

Agreed, but they appear to have made the design decision to have predicates in as many places as possible, with a few exceptions (svzip etc.). In fairness, this would allow for clock-gating the inactive lanes, but that seems like a very small win especially when software such as ours just sets ptrue anyway.

@zingaburga
Copy link
Author

zingaburga commented Jan 20, 2023

In fairness, this would allow for clock-gating the inactive lanes

As far as I can tell, not really, as the assembly doesn't support the notion of "don't care predication", so the compiler would have to turn it into merge-predication, or unpredicated. At which point, they may as well have made this behaviour explicit.

The design decision just seems to be really weird to me - unnecessarily complex for questionable gain.

they appear to have made the design decision to have predicates in as many places as possible

I'm not sure if it came through clearly, but I was thinking more in-line with how AVX-512 intrinsics are done, e.g. a _mm_add_X for unpredicated, and _mm_mask_add_X for predicated. Instead of doing _x/_z/_m, the _x variant would just become unpredicated, so you can still have masks in all the places where it makes sense.

@jan-wassenberg
Copy link

Yes, I would also have preferred the AVX-512 style of encoding here. The only downside is that one of the mask regs means "no mask", but we anyway rarely use more than a few masks.

As far as I can tell, not really, as the assembly doesn't support the notion of "don't care predication", so the compiler would have to turn it into merge-predication, or unpredicated. At which point, they may as well have made this behaviour explicit.

To clarify: I was thinking of the HW, not any intrinsics or compiler transform. Predicating each instruction makes it slightly more likely that users will pass through a not-all-true mask, vs. a hypothetical ISA that gives a "no mask" option. This would allow some clock-gating, but I agree that's a really small savings, mainly at the end of loops.

@zingaburga
Copy link
Author

The only downside is that one of the mask regs means "no mask"

Many instructions are unpredicated or have unpredicated forms. For those that require a predicate, PTRUE basically sets a mask appropriately, so you can get the same effect by dedicating it as a "no masking" predicate, but with more flexibility.

To clarify: I was thinking of the HW, not any intrinsics or compiler transform

I don't really have any objections to the choices made in assembly/machine-code regarding predication, my concern is with the weird decision to have _x in ACLE.

@xlazom00
Copy link

@zingaburga btw are there any phone SOC with SVE/SVE2 ?
I am testing Snapdragon Gen2 and NO LUCK :(

@zingaburga
Copy link
Author

I can't provide advice on that front. From memory, Qualcomm disables SVE on their CPUs.

@EwoutH
Copy link

EwoutH commented May 23, 2024

The Apple M4 is talked a lot about in the last week, apparently it's a full-fledged ARMv9.4 CPU with SME2 support, which would imply SME and SVE2 support. And apparently the SME2 support will replace Apple's own AMX.

See

image

@zingaburga
Copy link
Author

From what I've read, M4 does not support SVE(2), it only supports streaming SVE and SME.
https://scalable.uni-jena.de/opt/sme/micro.html

That SME is included but SVE2 isn't, seems to be a stronger indicator of Apple being uninterested in SVE. And Qualcomm's next chip won't support SVE either, which really doesn't bode too well for adoption, at least in the client/mobile space.

@EwoutH
Copy link

EwoutH commented May 23, 2024

@zingaburga so that would mean SME isn't a strict superset of SVE2, like ARM documents? Or does the silicon likely support it but did Apple disable certain operations?

@zingaburga
Copy link
Author

From memory, no, but I didn't really look into SME. I think the diagram is roughly correct, but not strictly so, as SME can be implemented without a bunch of SVE2 functionality.

@JayDDee
Copy link

JayDDee commented May 28, 2024

NIce essay.

Lack of CPUs with 256 bit vector registers is likely slowing adoption of SVE so there's no need to waste the silicon on it. Neon is faster anyway.

I've been looking at SVE2 but I can't say I'm impressed. It seems to be designed more for convenience than performance, primarilly to offload some of the work from the compiler to the CPU and make the code more portable. I'm not convinced about the portability goal considering the many different CPU manufacturers and feature combinations. Vector size isn't at the top of the list of portability issues IMO.

I've briefly looked at RISC-V and it seems to take a similar approach as SVE.

SVE also includes run time overhead to determine HW vector length. I'm focussed on performance and I code for specific vector lengths. I've looked at using SVE for 256 bit VL , if/when HW is available, but the code seems overly complicated. I'd prefer a NEON256 with some of the missing instructions and predication ported back from SVE. The documentation is pretty clear that SVE is not meant to replace NEON and until there are CPUs with larger vector registers there's no use for SVE.

There is a much simpler way to address different HW vector sizes but it might not be suitable for RISC CPUs because it would require microcode to implement the instructions based on HW vector size limitations. Code written for larger vectors would be split over 2 or 4 registers while still being executed by a single instruction. While it won't achieve the performance of larger registers it still achieves VLA portability and reduces instruction fetching and decoding. It would also allow things like 512 bit vectors on Intel E-cores and allowing full 512 bit support on P-cores of hybrid CPUs.

@zingaburga
Copy link
Author

Thanks.

SVE also includes run time overhead to determine HW vector length. I'm focussed on performance

I can't imagine this overhead to be of much significance. Arguably x86 also has run time overhead given that you need code to detect the available feature set.

If you always code to a specific vector length though (without some means to auto-generate), SVE's model may be challenging to adapt to.

it might not be suitable for RISC CPUs because it would require microcode to implement the instructions based on HW vector size limitations

It's generally wrong to think of ISAs in terms of RISC/CISC these days (maybe with the exception of the RISC-V base ISA, which generally follows 1980s' style RISC).
SVE most certainly requires some instructions (like gather/scatter) to be micro-coded (or perhaps micro-sequenced).

Code written for larger vectors would be split over 2 or 4 registers while still being executed by a single instruction

RISC-V Vector extension allows up to 8(!) registers to be operated on via a single instruction.

It would also allow things like 512 bit vectors on Intel E-cores and allowing full 512 bit support on P-cores of hybrid CPUs.

It's worth pointing out that Intel's E-cores already break 256-bit instructions into halves when executing on its 128-bit FPUs. I have a suspicion that breaking an instruction into 4 uOps might be problematic, hence its avoidance (AVX-512 also includes a lot more permutation instructions which may be challenging to divvy up).

@JayDDee
Copy link

JayDDee commented May 29, 2024

I'm still new to arm but I suspected it was not true RISC despite the name, thanks for clarifying.

I can't imagine this overhead to be of much significance. Arguably x86 also has run time overhead given that you need code to detect the available feature set.

If you always code to a specific vector length though (without some means to auto-generate), SVE's model may be challenging to adapt to.

A "portable" build would require run-time detection, while seperate builds are tuned at compile time specifically to avoid any unnecessary run time decision making. I don't think that can be avoided with SVE. My best guess so far would be to set the size once at program startup but I believe there is also overhead to test it every iteration.

https://developer.arm.com/documentation/102131/0100/Part-Three---When-it-is-sometimes-useful-to-keep-optimized-Neon-code

It's worth pointing out that Intel's E-cores already break 256-bit instructions into halves when executing on its 128-bit FPUs. I have a suspicion
that breaking an instruction into 4 uOps might be problematic, hence its avoidance (AVX-512 also includes a lot more permutation instructions
which may be challenging to divvy up).

x86 microcode would come into full play when breaking up those AVX512 permutes and turning them into register moves.
Maybe that partially explains the existing penalty crossing 128 bit lanes with AVX*.

@zingaburga
Copy link
Author

I don't think that can be avoided with SVE. My best guess so far would be to set the size once at program startup but I believe there is also overhead to test it every iteration.

I don't see why you can't just hard code the size if you're not doing a portable build?

@JayDDee
Copy link

JayDDee commented May 30, 2024

I don't think that can be avoided with SVE. My best guess so far would be to set the size once at program startup but I believe there is also overhead to test it every iteration.

I don't see why you can't just hard code the size if you're not doing a portable build?

That is probably what I will have to do but I'd rather not. Migrating from SSE to NEON was trivial compared to migrating from NEON to SVE seems to be.
SVE will require more source code changes, NEON only required a header file to convert SSE intrinsics to NEON intrinsics.
SVE has to deal with VLA and forced predication regardless of hard coding the vector length.
Much of my code shifts from 128 to 256 bit vectors depending on the element size of individual functions being either 32 or 64 bits. This is easy to do between SSE and AVX but not so simple between NEON and SVE because they use a different register set.
I would probably have to rewrite all the NEON code for SVE before I could support 256 bit SVE. That new code would then be incompatible with SSE/AVX code, and so on.
And I haven't yet considered all the issues you raised in your essay, like permutations.
Permutations became easier on x86_64 as vector registers grew from 128 to 256 to 512 bits but the the opposite seems to be case on ARM.
It's a huge rats nest and I'm no where near the bottom yet.
Fortunately it's all academic until consumer ARM CPUs start showing up with 256 bit vector registers.
Thanks for all your feedback but I think I'll drop the idea for now.

@jan-wassenberg
Copy link

@JayDDee you might be interested in our Highway library. After once porting your x86 code to portable intrinsics, the same (vector-length-agnostic) code also works on SVE.

@JayDDee
Copy link

JayDDee commented May 30, 2024

@JayDDee you might be interested in our Highway library. After once porting your x86 code to portable intrinsics, the same (vector-length-agnostic) code also works on SVE.

Thanks for the tip, I'll take a look.

@embg
Copy link

embg commented Dec 27, 2024

Unfortunately, as there’s no direct predicate to integer instruction, this will need to go through memory

Instead of moving through memory, isn't it faster to go predicate reg -> vector reg -> GPR? PMOV moves from predicate -> vector, and FMOV moves from vector -> GPR.

@clausecker
Copy link

@EmBh It's not clear if it'll be faster to do two moves; store-forwarding is pretty fast and you would have to benchmark.

@zingaburga
Copy link
Author

zingaburga commented Dec 28, 2024

Instead of moving through memory, isn't it faster to go predicate reg -> vector reg -> GPR?

Thanks for pointing that out.
Note that this article was written before SVE2.1 was published, so PMOV didn't exist back then.
I've added your point in, though I haven't really examined the SVE2.1/2.2 changes much, so the article may be out of date (though should mostly be applicable for SVE2.0).

I don't believe there's any SVE2.1 cores available yet, though I think ARM's next gen Cortex is rumoured to be ARMv9.4, which I think includes SVE2.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment