@domenic
Last active December 24, 2024 11:46
Generic zero-copy ArrayBuffer

Generic zero-copy ArrayBuffer usage

Most APIs which accept binary data need to ensure that the data is not modified while they read from it. (Without loss of generality, let's only analyze ArrayBuffer instances for now.) Modifications can come about due to the API processing the data asynchronously, or due to the API processing the data on some other thread which runs in parallel to the main thread. (E.g., an OS API which reads from the provided ArrayBuffer and writes it to a file.)
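As a minimal sketch of the hazard, consider a made-up API (`checksumLater` is hypothetical, not a real platform function) that only reads its input after an asynchronous step. The caller's write lands before the read, so the API sees bytes the caller never meant to pass:

```javascript
// Hypothetical API that reads its input only after an asynchronous step.
async function checksumLater(arrayBuffer) {
  await Promise.resolve(); // stands in for real async work (I/O, scheduling, ...)
  const bytes = new Uint8Array(arrayBuffer);
  let sum = 0;
  for (const b of bytes) sum += b;
  return sum;
}

const buffer = new Uint8Array([1, 2, 3]).buffer;
const pending = checksumLater(buffer); // caller expects the checksum of [1, 2, 3]
new Uint8Array(buffer)[0] = 100;       // ...but mutates before the API reads

pending.then((sum) => console.log(sum)); // logs 105, not 6
```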

On the web platform, APIs generally solve this by immediately making a copy of the incoming data. The code is essentially:

function someAPI(arrayBuffer) {
  arrayBuffer = arrayBuffer.slice(); // make a copy

  // Now we can use arrayBuffer, async or in another thread,
  // with a guarantee nobody will modify its contents.
}

But this is slower than it could be, and uses twice as much memory. Can we do a zero-copy version?

One solution is for such APIs to transfer the input:

function someAPI(arrayBuffer) {
  arrayBuffer = arrayBuffer.transfer(); // take ownership of the backing memory

  // Now we can use arrayBuffer, async or in another thread,
  // with a guarantee nobody will modify its contents.
}

But this can be frustrating for callers, who don't know which APIs will do this, and thus don't know whether passing in an ArrayBuffer to an API will give up their own ownership of it.
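To illustrate the surprise from the caller's side, here is a small sketch. `transferringAPI` is a hypothetical name, and it uses `structuredClone` with a transfer list rather than `transfer()` only because that is more widely available; both detach the original buffer.

```javascript
// Hypothetical API that silently takes ownership of its input.
function transferringAPI(arrayBuffer) {
  return structuredClone(arrayBuffer, { transfer: [arrayBuffer] });
}

const buffer = new ArrayBuffer(1024);
const owned = transferringAPI(buffer);

console.log(owned.byteLength);  // 1024
console.log(buffer.byteLength); // 0: the caller's buffer is now detached

let threw = false;
try {
  new Uint8Array(buffer); // constructing a view on a detached buffer throws
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
```

Nothing at the call site hints that `buffer` becomes unusable, which is the ergonomic problem described above.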

This gist explores a solution which has the following properties:

  • It requires the caller to do a one-time transfer of the ArrayBuffer to the callee, via explicit call-site opt-in.
  • Callees do need to do a small amount of work to take advantage of this, but the code to do that work is generic and could be generated automatically. (E.g. by Web IDL bindings, on the web.)

In this world, the default path, where you just call someAPI(arrayBuffer), still does a copy. This means the caller doesn't have to worry about whether they're allowed to continue using arrayBuffer or not. I think this is the right default given how the ecosystem has grown so far.

What it looks like in practice

function someAPI(arrayBuffer) {
  // This line could be code-generated generically for all ArrayBuffer-taking APIs.
  arrayBuffer = ArrayBufferTaker.takeOrCopy(arrayBuffer);

  // Nobody else can modify arrayBuffer. Do stuff with it, possibly asynchronously
  // or in native code that reads from it in other threads.
}

const arrayBuffer = new ArrayBuffer(1024);
someAPI(arrayBuffer); // copies

const arrayBuffer2 = new ArrayBuffer(1024);
someAPI(new ArrayBufferTaker(arrayBuffer2)); // transfers

The implementation of ArrayBufferTaker can be done today, and is in the attached file.

Open questions

  • How to make this work ergonomically for cases where someAPI takes a typed array or DataView?
  • Probably arrayBuffer2.take() or some better-named method would be more ergonomic than new ArrayBufferTaker(arrayBuffer2)
  • Probably in general we should come up with better names. This is an important paradigm and using the right names and analogies is key.
  • Can we let someAPI release the memory back to the caller? That would require language support.
  • How does this interact with SharedArrayBuffers, resizable ArrayBuffers, and growable SharedArrayBuffers?
    • Probably this is just not applicable to SharedArrayBuffer cases. Those are explicitly racy.
    • Maybe it just works for resizable ArrayBuffers?

Acknowledgments

Thanks to @jasnell for inspiring this line of thought via whatwg/fetch#1560. Thanks to the members of the "TC39 General" Matrix channel for a conversation that spawned this idea, especially @mhofman who provided the key insight: a two-step create-taker then take procedure, instead of attempting to do this in one step.

class ArrayBufferTaker {
  #ab;

  constructor(ab) {
    // Using https://github.com/tc39/proposal-arraybuffer-transfer
    this.#ab = ab.transfer();

    // Or if you want something that works today:
    // this.#ab = structuredClone(ab, { transfer: [ab] });
  }

  take() {
    const ab = this.#ab;
    if (!ab) {
      throw new TypeError("Cannot take twice");
    }
    this.#ab = null;
    return ab;
  }

  static takeOrCopy(abOrTaker) {
    if (#ab in abOrTaker) {
      return abOrTaker.take();
    }
    return abOrTaker.slice();
  }
}

jakearchibald commented Dec 7, 2022

How to make this work ergonomically for cases where someAPI takes a typed array or DataView?

Could ArrayBufferTaker just take Uint8Array etc? They can already be transferred.

Can we let someAPI release the memory back to the caller? That would require language support.

Could something like this work?

function someAPI(arrayBuffer) {
  // This wrapper could be code-generated generically for all ArrayBuffer-taking APIs.
  const [borrowedBuffer, giveBack] = ArrayBufferTaker.borrowOrCopy(arrayBuffer);
  arrayBuffer = borrowedBuffer;
  
  const returnValue = (() => {
    // Nobody else can modify arrayBuffer. Do stuff with it, possibly asynchronously
    // or in native code that reads from it in other threads.
  })();
  
  giveBack();
  return returnValue;
}

let arrayBuffer2 = new ArrayBuffer(1024);
const taken = new ArrayBufferTaker(arrayBuffer2); // transfers
someAPI(taken); // no copy
arrayBuffer2 = taken.retrieve();

I figured that the retrieval doesn't need to be promise-based, as the completion of someAPI should be enough.

If someAPI throws, then the array shouldn't be retrievable, as it may be in an unreliable state.

class ArrayBufferTaker {
  #ab;
  #retrievable = false;
  #used = false;

  constructor(ab) {
    // Using https://github.com/tc39/proposal-arraybuffer-transfer
    this.#ab = ab.transfer();

    // Or if you want something that works today:
    // this.#ab = structuredClone(ab, { transfer: [ab] });
  }

  take() {
    const [ab] = this.borrow();
    this.#ab = null;
    return ab;
  }

  borrow() {
    if (this.#used) {
      throw new TypeError("Cannot take twice");
    }
    this.#used = true;
    const ab = this.#ab;
    return [ab, () => (this.#retrievable = true)];
  }

  retrieve() {
    if (!this.#retrievable) {
      throw new TypeError("Not retrievable");
    }
    return this.#ab;
  }

  static takeOrCopy(abOrTaker) {
    if (#ab in abOrTaker) {
      return abOrTaker.take();
    }
    return abOrTaker.slice();
  }

  static borrowOrCopy(abOrTaker) {
    if (#ab in abOrTaker) {
      return abOrTaker.borrow();
    }
    return [abOrTaker.slice(), () => {}];
  }
}

mhofman commented Dec 7, 2022

Could ArrayBufferTaker just take Uint8Array etc? They can already be transferred.

Afaik, the structured clone steps capture the byteOffset, byteLength and TypedArray kind, and detach/transfer the underlying ArrayBuffer. I suppose for most consuming APIs what matters is mainly the byteOffset, byteLength and underlying bytes of the buffer, but it should be easy enough to similarly reconstruct a new view from the transferred buffer and captured properties of the view.
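A rough sketch of that reconstruction, for typed arrays only (a DataView would need byteLength rather than length). `transferView` is a hypothetical helper name, and `structuredClone` stands in for `transfer()` so that it runs today:

```javascript
// Hypothetical helper: move a typed array's backing memory into a new view
// of the same kind, offset, and length.
function transferView(view) {
  // Capture the view's properties first; they read as 0 once the buffer is detached.
  const { byteOffset, length } = view;
  const ViewType = view.constructor; // e.g. Uint8Array, Float32Array
  const buffer = structuredClone(view.buffer, { transfer: [view.buffer] });
  return new ViewType(buffer, byteOffset, length);
}

const original = new Uint16Array(new ArrayBuffer(16), 4, 3);
original.set([10, 20, 30]);

const moved = transferView(original);
console.log(Array.from(moved)); // [10, 20, 30]
console.log(moved.byteOffset);  // 4
console.log(original.length);   // 0: the original view's buffer is detached
```

Note that this moves the whole underlying buffer, including any bytes outside the view's window, so other views on the same buffer are detached too.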

Could something like this work?

I think that general approach would work. I'm wondering if the intent of retrieval should be expressed ahead of time by the sender? In that case the receiver would be obligated to use borrow semantics. Also while I trust host APIs to no longer use the buffer after "giveBack()", the retrieve step should enforce this by transferring back into a new buffer.
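One way the "transfer back on retrieve" idea might look, as a sketch only. `BorrowableTaker` and the `move` helper are hypothetical names, and the state handling is simplified compared to the class above:

```javascript
// Works today; ab.transfer() from the transfer proposal could be used instead.
const move = (ab) => structuredClone(ab, { transfer: [ab] });

class BorrowableTaker {
  #ab;
  #state = "fresh"; // "fresh" -> "borrowed" -> "returned" -> "retrieved"

  constructor(ab) {
    this.#ab = move(ab);
  }

  borrow() {
    if (this.#state !== "fresh") throw new TypeError("Cannot borrow twice");
    this.#state = "borrowed";
    const giveBack = () => {
      if (this.#state === "borrowed") this.#state = "returned";
    };
    return [this.#ab, giveBack];
  }

  retrieve() {
    if (this.#state !== "returned") throw new TypeError("Not retrievable");
    this.#state = "retrieved";
    // Moving the memory again detaches the buffer the borrower still holds.
    const ab = move(this.#ab);
    this.#ab = null;
    return ab;
  }
}

const taker = new BorrowableTaker(new ArrayBuffer(8));
const [borrowed, giveBack] = taker.borrow();
new Uint8Array(borrowed)[0] = 42;
giveBack();

const returned = taker.retrieve();
console.log(new Uint8Array(returned)[0]); // 42
console.log(borrowed.byteLength);         // 0: the borrower's reference is detached
```

With this shape, a borrower that misbehaves after `giveBack()` can no longer touch the memory the caller gets back, at the cost of the caller receiving a new ArrayBuffer instance.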

I would still prefer built-in language support, as that might allow temporarily "detaching" the array buffer, instead of requiring a new instance.

@littledan

[Nit: s/Callers/Callees/ in the second bullet point]

This is a really nice idea. If we wanted to go all-out (like in a v2), we could have a way to borrow/take a read-only view of an ArrayBuffer as well--this would allow multiple readers. Such an idea would need additional primitives for read-only ArrayBuffers, but these are under consideration at TC39--actually, the take/borrow paradigm could clean up some of the complexity of that draft proposal.

mhofman commented Dec 7, 2022

Yes, mixing in read-only, and having parameters for the taker creation, would allow the sender to be explicit about what kind of access it allows the receiver to have.

kentonv commented Dec 9, 2022

I guess this approach requires the caller to opt in, which means that only apps which are hyper-optimized will actually use the API. The vast majority probably won't bother. It's better than doing nothing, but is there a solution that can "just work" without any code changes on the caller side?

I think a copy-on-write mechanism could achieve this.

let view = ab.getImmutableView();
await file.write(view);
view.release();

Between getImmutableView(), and view.release(), if anything else tries to modify the source buffer, it would trigger a copy.

Implementing this would require support from the JS engine, although a polyfill could fall back to always making a copy (which is the status quo today anyway).
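A sketch of what such a fallback polyfill might look like. `getImmutableView` and `release` are the hypothetical names from the snippet above, written here as a free function rather than a method, and the returned object shape (a `buffer` getter) is an assumption made for illustration:

```javascript
// Hypothetical polyfill: without engine-level copy-on-write, copy eagerly.
function getImmutableView(ab) {
  let copy = ab.slice(0); // eager copy stands in for copy-on-write
  return {
    get buffer() {
      if (copy === null) throw new TypeError("View has been released");
      return copy;
    },
    release() {
      copy = null; // drop the copy so it can be garbage-collected
    },
  };
}

const source = new Uint8Array([1, 2, 3]).buffer;
const view = getImmutableView(source);
new Uint8Array(source)[0] = 99; // writer mutates the source after the view was taken

console.log(new Uint8Array(view.buffer)[0]); // 1: the view still sees the old bytes
view.release();
```

The copy here is isolated from the source but not actually read-only; enforcing immutability would need the engine support discussed in this thread.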

I don't know much about JS engine internals. But, given that ArrayBuffer access already has to check for detachment today, I would guess that there wouldn't be any general performance penalty to check for COW at the same time -- I imagine it would be treated like a special case of detachment, where the buffer can be reconstructed to fulfill the request.

mhofman commented Dec 9, 2022

@kentonv I agree. I suggested a copy-on-write optimization during plenary (notes not yet published) and argued for it again in the TC39 Matrix channel. However, implementers seem to have security concerns about copy-on-write, currently documented on the explainer of the transfer proposal.

We would definitely need an ab.detach() so as not to rely on the GC behavior. The "view" would probably need to be a regular ArrayBuffer since consumers usually need their own TypedArray view on top. Immutable / read-only clones would be a great addition, but not strictly necessary for this. With the existing .slice(), the engine could create a new ArrayBuffer instance that is backed by the same memory as the original, with a copy-on-write guard added to both instances.

kentonv commented Dec 10, 2022

Hmm, I don't really understand the security risk argument. By the same argument, is detaching an ArrayBuffer not also a security risk, since the memory that the ArrayBuffer previously pointed to now no longer belongs to it?

mhofman commented Feb 1, 2023

@kentonv, I raised that argument in plenary. I think the current security analysis relies on fixed pointers after allocation, and somehow detached-ness doesn't impact that, or has a known impact proven to at worst result in incorrect JS execution (not in a compromise of the sandbox). I still believe that CoW would present the same risks as detached buffers in case of a bug in the implementation. It's possible V8/Chrome may be willing to re-evaluate the risk posed by transparent CoW if provided with a pull request and design doc, which I clearly don't have time or expertise to work on (FYI if anyone wants to take that on).

That said, the taker mechanism here is still valuable for explicit borrow semantics. And if the caller does not provide a taker but only an ArrayBuffer, the host API implementation would fall back to a copy, which may or may not be CoW in the JS engine implementation.
