Drop the .service file into /etc/systemd/system/, and then activate the script via systemctl:
# systemctl daemon-reload
# systemctl enable fix-intel_wifi_aer-avell_g1513_fire_v3.service
# systemctl start fix-intel_wifi_aer-avell_g1513_fire_v3.service
This will effectively disable the "corrected" severity logging for the device, and save you loads of (logging) disk space. :)
Sorry for the poor explanation, future self. I'm kinda tired right now. I don't even know if all of this is correct. :(
When AER becomes too active in logging errors, it's generally something to do with buggy hardware or drivers. What most people recommend is to disable AER via a kernel parameter such as pci=noaer. If you know that the affected device is fine, and that the device's driver indeed has a bug that's still not fixed but won't affect proper usage, you can just disable AER for specific severity levels by setting the flags directly into the device via setpci, instead of disabling AER globally.
For more info on setpci, please see its docs.
AER (Advanced Error Reporting) is a PCIe capability. Linux adds support for it through a kernel module that is started sometime during systemd-modules-load.service's execution. The AER driver initializes reporting for PCIe devices at startup, so it's important that we only reset the flags AFTER systemd's module loading service.
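The unit file itself is not reproduced in these notes, so here is a minimal sketch of what such a unit could look like, written out by a small shell helper. Only the unit name, the After= ordering requirement and the setpci command line come from these notes; the Description text, Type=oneshot, the WantedBy= target and the /usr/bin/setpci path are assumptions, and `write_unit` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: everything in the unit except its name, the After= line and
# the setpci arguments is an assumption, not the original file.
# Usage: write_unit DIR   (writes the unit into DIR and prints its path)
write_unit() {
    unit_dir=$1
    unit_path="$unit_dir/fix-intel_wifi_aer-avell_g1513_fire_v3.service"
    cat > "$unit_path" <<'EOF'
[Unit]
Description=Disable Correctable error reporting on 8086:a114 (assumed text)
# Run only after the module-loading service, as explained above.
After=systemd-modules-load.service

[Service]
Type=oneshot
# Assumed binary path; the arguments are the setpci call used in these notes.
ExecStart=/usr/bin/setpci -v -d 8086:a114 CAP_EXP+0x8.w=0x000e

[Install]
WantedBy=multi-user.target
EOF
    printf '%s\n' "$unit_path"
}
```

Running `write_unit /tmp` lets you inspect the result before copying it to /etc/systemd/system/ as root.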
According to the AER module's source code, the four error reporting flags (Correctable, Non-Fatal, Fatal and Unsupported Request) are always enabled when AER is enabled for a device:
// From `/usr/include/uapi/linux/pci_regs.h`
#define PCI_EXP_DEVCTL 8 /* Device Control */
#define PCI_EXP_DEVCTL_CERE 0x0001 /* Correctable Error Reporting En. */
#define PCI_EXP_DEVCTL_NFERE 0x0002 /* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE 0x0004 /* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE 0x0008 /* Unsupported Request Reporting En. */
// From `source/drivers/pci/pcie/aer/aerdrv_core.c`
#define PCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
                           PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
    if (pcie_aer_get_firmware_first(dev))
        return -EIO;
    if (!dev->aer_cap)
        return -EIO;
    return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
}
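As a quick sanity check, the four bit masks can be OR-ed together in a shell to see which value PCI_EXP_AER_FLAGS expands to. This is only an illustration of the arithmetic; the variable names are shortened copies of the macro names, and `aer_flags` is a made-up helper.

```shell
#!/bin/sh
# Bit masks copied from pci_regs.h (shell variables, not kernel macros).
CERE=0x0001   # Correctable Error Reporting Enable
NFERE=0x0002  # Non-Fatal Error Reporting Enable
FERE=0x0004   # Fatal Error Reporting Enable
URRE=0x0008   # Unsupported Request Reporting Enable

# Same OR as PCI_EXP_AER_FLAGS, printed as a 4-digit hex word.
aer_flags() {
    printf '%04x\n' $(( $CERE | $NFERE | $FERE | $URRE ))
}

aer_flags
```

The result, 000f, is the value a device shows in the low bits of its Device Control word while all four flags are enabled.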
Inspecting the kernel's source code some more, one finds that PCI_EXP_DEVCTL is an offset relative to dev->pcie_cap, the position of the device's PCI Express capability structure, and that dev->pcie_cap is itself an offset into the device's configuration space.
If you follow the implementation of pcie_capability_set_word and the functions it calls, you end up in the pcie_capability_write_* helpers. The dword variant is shown here; the word variant has the same shape:
// From `source/drivers/pci/access.c`
int pcie_capability_write_dword(struct pci_dev *dev, int pos, u32 val)
{
    if (pos & 3)
        return -EINVAL;
    if (!pcie_capability_reg_implemented(dev, pos))
        return 0;
    return pci_write_config_dword(dev, pci_pcie_cap(dev) + pos, val);
}
// From `/usr/include/linux/pci.h`
static inline int pcie_capability_set_word(struct pci_dev *dev, int pos,
                                           u16 set)
{
    return pcie_capability_clear_and_set_word(dev, pos, 0, set);
}
static inline int pci_pcie_cap(struct pci_dev *dev)
{
    return dev->pcie_cap;
}
Depending on the machine's setup, setpci may list the register name CAP_EXP as available through setpci --dumpregs. This register refers to the dev->pcie_cap offset. To identify how AER is configured, one needs the vendor/device ID or the domain/bus/slot/function address of the affected device. AER's logged messages already carry this information. Below is an example, from which we can take two different identifiers for the device: 8086:a114 (vendor/device ID) and 0000:00:1c.4 (domain/bus/slot/function).
# dmesg | tail -n 4
[ 4455.385233] pcieport 0000:00:1c.4: AER: Corrected error received: id=00e4
[ 4455.385242] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e4(Receiver ID)
[ 4455.385250] pcieport 0000:00:1c.4: device [8086:a114] error status/mask=00000001/00002000
[ 4455.385254] pcieport 0000:00:1c.4: [ 0] Receiver Error (First)
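To avoid copying the identifiers by hand, they can be pulled out of the log with sed. The sketch below assumes the log lines look like the "device [vendor:device]" line above; other kernel versions may word them differently, and `aer_ids` is a hypothetical helper name.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print "address vendor:device" for every
# line of the form "<driver> dddd:bb:ss.f: device [vvvv:dddd] ...".
# Lines that do not match this shape are ignored.
aer_ids() {
    sed -n 's/.* \([0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-7]\): *device \[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1 \2/p'
}
```

For example, `dmesg | aer_ids | sort -u` would list each reporting port once, with the address usable for lspci -s and the ID pair usable for setpci -d.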
To check which device is affected, see lshw or lspci:
[flisboac@sonic ~]$ sudo lspci -v -s 00:1c.4
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 124
Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
I/O behind bridge: None
Memory behind bridge: df200000-df2fffff [size=1M]
Prefetchable memory behind bridge: None
Capabilities: [40] Express Root Port (Slot+), MSI 00
Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [90] Subsystem: Device 1d05:1021
Capabilities: [a0] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Access Control Services
Capabilities: [220] #19
Kernel driver in use: pcieport
Kernel modules: shpchp
In this case, the error may refer to a device attached to a PCIe port. One can check which device is attached to said port with lshw:
# lshw -numeric
sonic
description: Notebook
product: 1513 (To be filled by O.E.M.)
vendor: Avell High Performance
version: To be filled by O.E.M.
serial: To be filled by O.E.M.
width: 4294967295 bits
capabilities: smbios-3.0 dmi-3.0 smp vsyscall32
configuration: boot=normal chassis=notebook family=To be filled by O.E.M. sku=To be filled by O.E.M. uuid=00020003-0004-0005-0006-000700080009
*-core
description: Motherboard
physical id: 0
version: 0.1
serial: To be filled by O.E.M.
slot: To be filled by O.E.M.
(... lshw is so verbose ...)
*-pci
description: Host bridge
product: Skylake Host Bridge/DRAM Registers [8086:1910]
vendor: Intel Corporation [8086]
physical id: 100
bus info: pci@0000:00:00.0
version: 07
width: 32 bits
clock: 33MHz
configuration: driver=skl_uncore
resources: irq:0
(... lshw is so verbose ...)
*-pci:2
description: PCI bridge
product: Sunrise Point-H PCI Express Root Port #5 [8086:A114]
vendor: Intel Corporation [8086]
physical id: 1c.4
bus info: pci@0000:00:1c.4
version: f1
width: 32 bits
clock: 33MHz
capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:124 memory:df200000-df2fffff
*-network
description: Wireless interface
product: Wireless 7265 [8086:95A]
vendor: Intel Corporation [8086]
physical id: 0
bus info: pci@0000:03:00.0
logical name: wlp3s0
version: 48
serial: 64:80:99:f3:9d:d7
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list ethernet physical wireless
configuration: broadcast=yes driver=iwlwifi driverversion=4.10.13-1-ARCH firmware=17.459231.0 ip=192.168.1.26 latency=0 link=yes multicast=yes wireless=IEEE 802.11
resources: irq:137 memory:df200000-df201fff
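A shorter route than reading the full lshw tree is to take the bridge's secondary bus number from the lspci -v output (secondary=03 for this root port) and list that bus; the wireless card indeed sits at 03:00.0. The helper below is a sketch that assumes the "Bus: primary=..., secondary=..." line format shown earlier; `secondary_bus` is a hypothetical name.

```shell
#!/bin/sh
# Read `lspci -v` output for a bridge on stdin and print its secondary bus
# number (two hex digits), taken from the "secondary=NN" field.
secondary_bus() {
    sed -n 's/.*secondary=\([0-9a-f]\{2\}\).*/\1/p'
}
```

On the machine above, `lspci -s "$(lspci -v -s 00:1c.4 | secondary_bus):"` should then list whatever is plugged in behind the port.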
Summarizing, CAP_EXP is the base register, and we do some pointer arithmetic with it: we offset CAP_EXP by PCI_EXP_DEVCTL and write the proper flags there as a single word. Just remember that some PCI_EXP_* offsets are defined in decimal, while setpci only accepts hexadecimal (with or without the 0x prefix), so some base conversion may be needed, although that's not the case for PCI_EXP_DEVCTL, since 8 is written the same way in both bases.
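The offset arithmetic and the base conversion can both be left to the shell. In this sketch `reg_addr` is a made-up helper; the base 0x40 is where the Express capability sits on the example root port (the "[40] Express Root Port" line in the lspci output), and the result is the address setpci reports as @48.

```shell
#!/bin/sh
# Print, in hex, the config-space address of a register given the capability
# base and the register's offset inside that capability.
# Both arguments may be decimal (8) or 0x-prefixed hex (0x40).
reg_addr() {
    printf '%x\n' $(( $1 + $2 ))
}

# CAP_EXP base 0x40 plus PCI_EXP_DEVCTL (8).
reg_addr 0x40 8
```

Passing 0 as the base turns the helper into a plain decimal-to-hex converter for any other offset that happens to be defined in decimal.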
So, to read the current configuration:
[flisboac@sonic ~]$ sudo setpci -v -d 8086:a114 CAP_EXP+0x8.w
0000:00:1c.4 (cap 10 @40) @48 = 000f
000f tells us that all four AER reporting flags are set. Correctable error reporting is bit 0 of that word, so we just need to set the new value to 000e to disable only the Corrected severity reporting:
[flisboac@sonic ~]$ sudo setpci -v -d 8086:a114 CAP_EXP+0x8.w=0x0e
0000:00:1c.4 (cap 10 @40) @48 000e
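Writing a literal 000e is fine here because the word read back was exactly 000f. The Device Control word also holds unrelated fields in its upper bits, so on a device where those are non-zero it is safer to compute the new value from whatever was read. A small sketch, with `clear_cere` as a hypothetical helper name:

```shell
#!/bin/sh
# Clear bit 0 (Correctable Error Reporting Enable) of a Device Control word
# and leave every other bit untouched.
# Takes the word as bare hex, the way setpci prints it (e.g. 000f).
clear_cere() {
    printf '%04x\n' $(( 0x$1 & ~0x0001 ))
}

clear_cere 000f
```

Assuming a single matching device, this could be combined with setpci as `new=$(clear_cere "$(setpci -d 8086:a114 CAP_EXP+0x8.w)")` followed by `setpci -d 8086:a114 CAP_EXP+0x8.w="$new"`. The setpci man page also describes a value:mask form for changing only selected bits; check it before relying on either approach.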
And that's it!
Could you help me out with AER filling up syslog on a server I maintain? Here's a snippet:
And this is the device throwing out the errors, or else the device in this PCIe slot is causing them (an LSI HBA storage card):
So then I did:
Assuming CAP_EXP is the correct register (probably not), I tried:
Would that work?