
@roadkell
Last active September 12, 2023 09:35
Fixing acpi_call kernel oops on Thinkpads


Intro

TLP, a power management utility for ThinkPads and other laptops, uses the tpacpi-bat script to calibrate the battery and set charge thresholds (on ThinkPad xx20 models and later). tpacpi-bat in turn relies on acpi_call, a Linux kernel module that lets user space call ACPI methods through /proc/acpi/call. acpi_call can also be used for hybrid graphics switching and other power management tasks.
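The /proc/acpi/call interface is a simple write-then-read protocol: you write an ACPI method path (optionally followed by arguments in acpi_call's own syntax) into the file, then read the result back from the same file. Below is a minimal sketch of a wrapper, assuming a root shell. The function name acpi_call_run and the ACPI_CALL_DEV override are hypothetical, and stripping NUL bytes from the result is a defensive assumption about the output format. Don't call arbitrary ACPI methods on real hardware, and keep in mind that on an affected kernel with an old acpi_call, calls through this file can trigger the oops described below.

```bash
#!/usr/bin/env bash
# Hypothetical helper: call an ACPI method through acpi_call's proc interface.
# Must run as root. ACPI_CALL_DEV is a made-up override so the function can be
# exercised against a regular file instead of /proc/acpi/call.
acpi_call_run() {
  local dev="${ACPI_CALL_DEV:-/proc/acpi/call}"
  # Write "<method> [args...]" as one string; %s keeps the backslash
  # at the start of ACPI paths literal.
  printf '%s\n' "$*" > "$dev" || return 1
  # Read the result back from the same file; drop any NUL bytes
  # (an assumption about the output format) and end with a newline.
  tr -d '\0' < "$dev"
  echo
}
```

For example, acpi_call_run '\_SB.XXXX.MTHD' in a root shell (where \_SB.XXXX.MTHD is a placeholder, not a real method) would print whatever acpi_call returns for that call; real method paths depend on your model's ACPI tables.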

What happened

As explained here and here, an upstream kernel commit made seek support (the proc_lseek operation) mandatory for procfs entries. On kernels 5.13.0 and later, including the one shipped with Ubuntu 21.10, an entry that does not provide it causes a kernel NULL pointer dereference. Consequently, acpi_call versions before 1.2.2 are incompatible with these kernels, and calling into them leads to a kernel oops.

How it shows up

If you run Ubuntu 21.10 and have acpi-call-dkms installed, you can check if this bug affects you with this command:

sudo dmesg | grep "BUG: kernel NULL pointer dereference" -A 10

If it does, the output will include something like this:

[45420.141212] BUG: kernel NULL pointer dereference, address: 0000000000000000
[45420.141217] #PF: supervisor instruction fetch in kernel mode
[45420.141220] #PF: error_code(0x0010) - not-present page
[45420.141221] PGD 0 P4D 0 
[45420.141224] Oops: 0010 [#4] SMP NOPTI
[45420.141226] CPU: 3 PID: 85578 Comm: tpacpi-bat Tainted: G      D    O      5.13.0-19-generic #1
...
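To see which process triggered each NULL pointer oops, you can pull the Comm: field out of the log. A small sketch (the function name oops_commands is made up) that reads any kernel log text on stdin, so it works with dmesg as well as with journalctl. Note that it only tells you which process made the call; the call trace in the full oops is what shows the crashing module.

```bash
#!/usr/bin/env bash
# Read kernel log text on stdin and print the process name ("Comm:" field)
# of each task that hit a "BUG: kernel NULL pointer dereference" oops.
oops_commands() {
  awk '
    /BUG: kernel NULL pointer dereference/ { in_oops = 1; next }
    in_oops && / Comm: / {
      # The process name is the field right after "Comm:".
      for (i = 1; i < NF; i++)
        if ($i == "Comm:") { print $(i + 1); break }
      in_oops = 0
    }
  '
}
```

For example, sudo dmesg | oops_commands | sort | uniq -c counts oopses per process for the current boot, and journalctl -k -b -1 | oops_commands does the same for the previous boot. Repeated tpacpi-bat entries, as in the log above, point at acpi_call.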

How to fix

Thankfully, acpi_call was fixed in v1.2.2. Unfortunately, many repositories still ship an outdated version and have not backported the fix. So if you have this combination: a ThinkPad xx20 or later, a Linux kernel >= 5.13, acpi_call < 1.2.2, and TLP or other software that uses it, you'll have to download, compile and install a newer version of acpi_call manually.
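The version part of that check can be scripted. A minimal sketch, assuming GNU coreutils (sort -V); the function names are hypothetical, and it only compares version strings, so it says nothing about whether your model or TLP setup actually uses acpi_call.

```bash
#!/usr/bin/env bash
# Succeed if version $1 >= version $2, using GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Succeed if kernel release $1 is >= 5.13 and acpi_call version $2 is < 1.2.2.
# Returns 2 if either version string is empty (e.g. package not installed).
acpi_call_affected() {
  local kernel="$1" acpi_call="$2"
  [ -n "$kernel" ] && [ -n "$acpi_call" ] || return 2
  version_ge "$kernel" 5.13 && ! version_ge "$acpi_call" 1.2.2
}
```

On Debian/Ubuntu, acpi_call_affected "$(uname -r)" "$(dpkg-query -W -f='${Version}' acpi-call-dkms)" && echo affected compares the running kernel with the packaged acpi_call version. Debian revisions such as 1.1.0-6 sort correctly against 1.2.2.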

But first, if you got a "The Non-Volatile Variable Storage is About Full" error at boot, you'll need to clean up NVRAM, as described here. In our case, this error is caused by kernel crash dumps, stored as EFI variables, filling up the storage. Before deleting anything, make sure you have the same error: compare your dmesg output with the one above, and check the TLP battery FAQ and the UEFI troubleshooting page on ArchWiki.

Commands below are for Debian/dpkg-based distributions, including Ubuntu and its derivatives. If needed, replace with appropriate commands for your distro.

⚠️ Warning! Deleting wrong EFI variables may brick your system. Read ArchWiki first. Proceed with caution.

# Check if there are any dumps
sudo ls /sys/firmware/efi/efivars/dump-*

# If found, delete them
sudo rm /sys/firmware/efi/efivars/dump-*
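A slightly more conservative variant of the listing step, as a sketch: it only matches dump-* variables whose names end in the vendor GUID cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0 (the Linux crash-dump GUID, the same one that appears in the dump file names posted in the comments below), so a stray glob can't pick up anything else. The function name is hypothetical, and the naming pattern is what these 5.13-era kernels produce; review its output before deleting anything.

```bash
#!/usr/bin/env bash
# List EFI variables that look like kernel crash dumps: names starting with
# "dump-" and ending in the Linux crash-dump vendor GUID. Prints one path
# per line. The directory argument defaults to the efivarfs mount point.
list_efi_dumps() {
  local dir="${1:-/sys/firmware/efi/efivars}"
  local guid="cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0"
  local f
  for f in "$dir"/dump-*-"$guid"; do
    # If nothing matches, the unexpanded pattern is skipped here.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

Listing names in efivars should usually work as a normal user; once you've checked the list, list_efi_dumps | xargs -r sudo rm -- deletes exactly those entries.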

Now on to installing the kernel module.

# Remove previously installed acpi-call-dkms package (if any)
sudo apt purge acpi-call-dkms

# Install git and dkms (if you don’t have them installed yet)
sudo apt install git dkms

# Clone tag v1.2.2 of the nix-community/acpi_call repository
git clone --branch v1.2.2 https://github.com/nix-community/acpi_call.git

# Navigate to the cloned repository
cd acpi_call

# Prepare dkms.conf file
make dkms.conf

# Copy the module source to the shared sources directory
sudo cp -R . /usr/src/acpi-call-1.2.2

# Add the module to the dkms tree for build
sudo dkms add -m acpi-call -v 1.2.2

# Build the module
sudo dkms build -m acpi-call -v 1.2.2

# Install the module
sudo dkms install -m acpi-call -v 1.2.2

# Reboot
sudo reboot
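After the reboot, you can confirm that DKMS built and installed 1.2.2 for the running kernel. A sketch that parses dkms status output; the function name is hypothetical, and since dkms has printed both "acpi-call, 1.2.2, <kernel>, <arch>: installed" and "acpi-call/1.2.2, <kernel>, <arch>: installed" across versions, it accepts either.

```bash
#!/usr/bin/env bash
# Read `dkms status` output on stdin and print every acpi-call version that
# is reported as installed for kernel release $1.
acpi_call_installed_for() {
  awk -v kernel="$1" '
    $0 ~ "^acpi-call[,/]" && $0 ~ ": installed" {
      line = $0
      sub("^acpi-call[,/] *", "", line)   # drop the module name
      n = split(line, f, ", *")           # f[1] = version, f[2] = kernel
      if (n >= 2 && f[2] == kernel) print f[1]
    }
  '
}
```

For example, dkms status | acpi_call_installed_for "$(uname -r)" should print 1.2.2. If the module is loaded, lsmod | grep acpi_call lists it and /proc/acpi/call exists.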

Finally, take a moment to notify the maintainers of the package for your distro about the bug and the updated version. For example, here is the bug report for the Debian acpi-call-dkms package, and here is the one for Ubuntu.

EDIT: the proper way to install the module was taken from here; kudos to @monosoul.

@slazien

slazien commented Feb 3, 2022

Thanks for providing an overview of what happened and sharing the fix! I was also getting the same error as @szero, and after some digging into the logs via journalctl, I found the kernel oops coming from acpi-call-dkms.

https://unix.stackexchange.com/questions/689158/disabling-writing-dump-files-to-efivars

@sttzr

sttzr commented Feb 5, 2022

My brothers and I each own a ThinkPad X131e, and all of them were running Ubuntu 20.04 very smoothly until, with the recent update to kernel 5.13, each of us encountered the error "The Non-Volatile Variable Storage is About Full". Unlike newer ThinkPads, this model does not provide a firmware option to clear the storage. At first the laptops still booted, but after a few restarts they wouldn't anymore, and we couldn't even enter the BIOS setup. Two of us could not resolve the error before this happened, so our ThinkPads are now bricked. At least one of us could rescue his laptop with your help! Thanks a lot!
I hope I will find someone who can somehow physically reset the chip on the motherboards that are now unbootable. Removing the CMOS battery did not clear the NVRAM. Any thoughts on this?

@gma

gma commented Feb 18, 2022

@sttzr I've just had this problem on a T430s, and was also unable to boot after several attempts. Then I found that pressing F12 got me to the boot menu, and choosing one of the boot entries booted okay. I then deleted the dump files, only to find that (on reboot) it was still complaining about being almost out of space. And this time, I couldn't get the boot menu to appear by pressing F12 at startup; instead I was always shown the error message and offered the chance of entering the BIOS with F1.

I'm not sure whether I found a workaround that will allow regular access to the boot menu or whether it was just a one-off for me, but it may be a repeatable solution, so I'll explain what I did next.

I went into the BIOS settings, then into the Startup screen. In there I've got a "Boot mode" option that was set to "Quick". I set it to "Diagnostics" and restarted. I was then prompted to press Enter to interrupt the normal boot process (which I did), and from the next screen I was able to choose to enter the boot menu. It worked, and I booted back into Linux. It was at this point that I found the dump files were back (or maybe they were never properly deleted, though there was nothing wrong with my rm command). So I deleted them again and installed v1.2.2 of the module (as documented in this gist) and now everything appears to be fine. No dump files after a reboot and (having put the "Boot mode" back to "Quick") it's booting up as normal.

Hope I haven't wasted your time, and that it gets you back in to your machine!

@ig0rnig0r

Wow, what a wonderful How-To, thanks so much! This way I'll hang on to my X230 for many more years to come, cheers!

@DiagonalArg

DiagonalArg commented Feb 27, 2022

If I am to remove the /sys/firmware/efi/efivars/dump-* files, am I also to remove the associated /sys/firmware/efi/vars/dump-* directories? Each of those directories looks like this (choosing just one):

$ ls -l efivars/dump*
-rw-r--r-- 1 root root 644 Feb 27 00:12 efivars/dump-type0-10-1-1645912016-C-cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0

$ ls -l vars/dump*
vars/dump-type0-10-1-1645912016-C-cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0:
total 0
-r-------- 1 root root 4096 Feb 27 00:48 attributes
-r-------- 1 root root 4096 Feb 27 00:48 data
-r-------- 1 root root 4096 Feb 27 00:48 guid
-rw------- 1 root root 4096 Feb 27 00:48 raw_var
-r-------- 1 root root 4096 Feb 27 00:48 size

Edit:

I am seeing in the Arch UEFI documentation:

UEFI Runtime Variables Support (efivarfs filesystem - /sys/firmware/efi/efivars). This option is important as this is required to manipulate UEFI runtime variables using tools like /usr/bin/efibootmgr. The configuration option below has been added in kernel 3.10 and later.

`CONFIG_EFIVAR_FS=y`

UEFI Runtime Variables Support (old efivars sysfs interface - /sys/firmware/efi/vars). This option should be disabled to prevent any potential issues with both efivarfs and sysfs-efivars enabled.

`CONFIG_EFI_VARS=n`

Unfortunately, in Ubuntu 20.04, both of these are set to y.
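To check these two options on your own kernel, here is a small sketch that reads a kernel config file. The function name is hypothetical, and it assumes your distro installs the config as /boot/config-<release>, as Debian and Ubuntu do.

```bash
#!/usr/bin/env bash
# Print the value of kernel config option $1 from config file $2:
# "y", "m" or a literal value; "n" for "# CONFIG_X is not set";
# nothing if the option does not appear at all.
kconfig_value() {
  awk -v opt="$1" '
    index($0, opt "=") == 1 { print substr($0, length(opt) + 2); exit }
    $0 == "# " opt " is not set" { print "n"; exit }
  ' "$2"
}
```

For example, kconfig_value CONFIG_EFIVAR_FS "/boot/config-$(uname -r)" and kconfig_value CONFIG_EFI_VARS "/boot/config-$(uname -r)" show how the running kernel was built. Matching on "OPTION=" rather than a bare prefix keeps CONFIG_EFI_VARS from also matching longer names such as CONFIG_EFI_VARS_PSTORE.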

So what are the implications? Should we be deleting dump-* in both? (I have also posted this as a Superuser question.)

@LinuxOnTheDesktop

LinuxOnTheDesktop commented Mar 8, 2022

Thank you, Roadkell. I note the following.

  1. On my affected ThinkPad X230, after applying the fix, I still had the problem. But after deleting the dump files - again - and rebooting, all seems good.

  2. The post by @DiagonalArg asks a question and says the question was posted on Superuser. On Superuser, the response to the question was, in summary: you don't need to worry about those directories you were worrying about.

@frzb

frzb commented Mar 23, 2022

Thank you so much.

No to war!

@DiagonalArg

> Thank you, Roadkell. I note the following.
>
> 1. On my affected ThinkPad X230, after applying the fix, I still had the problem. But after deleting the dump files - again - and rebooting, all seems good.

I didn't apply the fix. I just removed TLP, but I nevertheless had the same problem. I had to delete the files, reboot, delete them again, and reboot a second time before the files (and directories) were finally gone. I think this has killed a couple of laptops: TLP was kept (unpatched), the files were thought to be deleted (but were not), and a reboot was done, filling the NVRAM.

@Silcet

Silcet commented Apr 20, 2022

This worked like a charm on my ThinkPad P52 with Ubuntu 20.04 and kernel 5.13. Thanks to you, I can recalibrate my poor battery, which was at 60% capacity.

Thank you so much! ❤️
