This is the tale of a long weekend spent tracking down a mysterious iOS 18 Neural Engine bug: an exercise in problem-solving where full visibility is elusive, as it usually is in the locked-down world of Apple’s platforms. The process I followed, though, is a general approach you can use for any opaque system.
It all began last week when I stumbled upon strange behavior in my iOS app. The output generated from a CoreML model was completely broken, in a way I had never seen before. After some digging, I realized this only happened when the model was running on the Neural Engine on iOS 18.
The first step was triage. I implemented a quick workaround in the app: if the device is running iOS 18, switch from the Neural Engine to the GPU. This temporarily solved the issue, but I had no idea why it worked or whether other CoreML models in the app’s pipeline might also be affected. Without a deeper understanding of the root cause, I knew I couldn’t rest. Thus, the long weekend investigation began.
The first goal was to pin down the exact conditions under which the bug occurred. In the app, the issue was intermittent: sometimes it happened, sometimes it didn’t. But on my iPhone 15 Pro running iOS 18, it occurred often enough to provide some clues. I already knew switching to the GPU solved the problem, so the Neural Engine on iOS 18 was clearly involved.
I tested different variations of the CoreML model. The bug didn’t happen when using CoreML models in formats 7 or 8 with fp16 outputs, which those formats support in addition to fp32. But with the CoreML 6 format, which always outputs fp32, the bug reappeared every time. Interestingly, even a CoreML 8 model showed the bug if it used fp32 output. This allowed me to zero in on the problem: iOS 18 + Neural Engine + fp32 output was the deadly combination.
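The narrowing-down above can be sketched as a small table of observations plus a helper that keeps only the factors shared by every failing run. This is just an illustration of the reasoning, not code from the app: the dictionary keys, the value names, and both helper functions are hypothetical, and the rows encode only the runs described above (all on iOS 18).

```python
# Each observation is (configuration, did the output break?). All runs are on iOS 18.
# Keys and values are illustrative labels, not Core ML identifiers.
observations = [
    ({"compute": "neural_engine", "format": 7, "output": "fp16"}, False),
    ({"compute": "neural_engine", "format": 8, "output": "fp16"}, False),
    ({"compute": "neural_engine", "format": 6, "output": "fp32"}, True),
    ({"compute": "neural_engine", "format": 8, "output": "fp32"}, True),
    ({"compute": "gpu", "output": "fp32"}, False),  # the GPU workaround
]


def common_failure_factors(observations):
    """Return the key/value pairs shared by every failing configuration."""
    failing = [config for config, broken in observations if broken]
    if not failing:
        return {}
    common = dict(failing[0])
    for config in failing[1:]:
        common = {k: v for k, v in common.items() if config.get(k) == v}
    return common


def explains_all_runs(common, observations):
    """True if no passing configuration matches every common failure factor."""
    passing = [config for config, broken in observations if not broken]
    return not any(
        all(config.get(k) == v for k, v in common.items()) for config in passing
    )


common = common_failure_factors(observations)
print(common)
print(explains_all_runs(common, observations))
```

The model format drops out because it differs between the failing runs, which leaves the Neural Engine and fp32 output as the shared factors. A check like this only shows which conditions are necessary among the runs you have recorded; it says nothing about whether they are sufficient.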
Next, I needed to determine which types of models could trigger the bug. Not all models using fp32 output on the Neural Engine on iOS 18 caused the issue. To dig deeper, I created a small, stripped-down application that mimicked the original app’s behavior, allowing me to isolate the problem more effectively.
But here’s where I made a mistake: instead of analyzing the raw output from the model, I was only looking at the post-processed values in my smaller app. This oversight cost me hours. If I had stripped the post-processing away earlier, I would have uncovered a key fact much sooner. But hindsight is 20/20, right?
When I finally looked directly at the model’s output, I saw something bizarre. The output tensor, which should have had 200 float values, only had 12 non-zero values. The rest were all zeroes. And these 12 values? They corresponded to every 16th value in the expected output (12 × 16 = 192, which fits within 200). This number, 16, was the key to unlocking the mystery.
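Looking at the raw output is easy to automate. The sketch below lists the positions of the non-zero values and reports the common stride between them. The data is synthetic: it assumes the surviving values sit at their original positions (indices 0, 16, 32, and so on), which is one possible reading of what I saw, and the helper names are made up for this example.

```python
from functools import reduce
from math import gcd


def nonzero_indices(values, eps=0.0):
    """Positions whose absolute value is greater than eps."""
    return [i for i, v in enumerate(values) if abs(v) > eps]


def index_stride(indices):
    """Greatest common divisor of the gaps between consecutive indices."""
    gaps = [b - a for a, b in zip(indices, indices[1:])]
    return reduce(gcd, gaps, 0)


# Synthetic stand-in for the model output (assumption, not real device data):
# 200 expected values, of which only 12 survive, one every 16 positions.
expected = [float(i + 1) for i in range(200)]
broken = [0.0] * 200
for i in range(12):
    broken[16 * i] = expected[16 * i]

indices = nonzero_indices(broken)
print(len(indices), index_stride(indices))  # 12 16
```

A regular stride like this is a strong hint that the problem is in memory layout or alignment rather than in the numerical values themselves.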
I took a closer look at the model’s input tensor, which had a shape of (1, 3, 320, 320). After several convolutions, it was reduced to (1, 80, 10, 10), and finally to (200, 1). Notice that the dimension 10 is 1/32 of 320, and 32 is a multiple of 16. This pattern was no coincidence.
Before making this breakthrough, I had been trying (unsuccessfully) to recreate the bug by mimicking the original model graph. By Sunday afternoon, frustration was setting in. But then I added a pooling layer to the recreated model, and bingo! I had my reproducible test case. The minimal set of operations for triggering the bug was now clear:
- Convolution
- Pooling
- Transpose
- Reshape
- Cast to fp32
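The 320-to-10 reduction mentioned earlier is simple to check with the standard convolution output-size formula. The layer settings below (five 3×3 convolutions with stride 2 and padding 1) are an assumption made purely for illustration; the actual model may reach the same shape differently.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical downsampling path: five stride-2 convolutions.
size = 320
sizes = [size]
for _ in range(5):
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

factor = sizes[0] // sizes[-1]
print(sizes)                     # [320, 160, 80, 40, 20, 10]
print(factor, factor % 16 == 0)  # 32 True
```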
Here’s a small test app that reproduces the issue: https://github.com/niw/iOS18NeuralEngineBugTestApp
While I suspect this issue might already be fixed in iOS 18.1, I haven’t been able to test it on the right SoC and iOS combination yet.
Update on 10/2/2024: based on some tests and reports, the problem occurs on both iOS 18 and the iOS 18.1 beta on iPhone 15 Pro. So, as of this update, it is not fixed in iOS 18.1.
Ultimately, this investigation was a valuable learning experience in approaching problems with limited visibility. Here are my takeaways:
- Identify the Environment: The first step is to identify the exact combination of conditions that cause the issue. Often, it’s triggered by a specific confluence of factors—like an OS or dependency update.
- Understand the Problem Clearly: This is where I faltered initially. A clear understanding of the problem provides insights into the root cause. For example, if I had realized earlier that only every 16th value survived in the output, I would have narrowed down the trigger much faster. Such insights are golden because they provide shortcuts to the solution.
- Recreate the Environment: Setting up a quick and repeatable test environment is essential. Trial and error become your best tools, and you need to iterate fast.
- Create a Minimal Reproducible Case: Once you have the right environment, create a minimal reproducible example that demonstrates the issue. From here, the goal is either to implement a workaround or to collaborate with those who have full system visibility to get it fixed.
My long weekend is over now. It’s late Sunday night, and although I’m exhausted, I’m thrilled to have reached my goal. I hope this story not only helps you understand the bug but also makes your weekend a bit more exciting!