Bug Hunt Could Cross Three or More Levels of Indirection

When running Proxmox VE, my Dell Inspiron 7577’s onboard Realtek Ethernet would quit at unexpected times. Network transmission halts, and a network watchdog timer fires which triggers a debug error message. One proposed workaround is to change to a different Realtek driver. But after learning about the tradeoffs involved, I decided against pursuing that path.

This watchdog timer error message has been reported by many users on Proxmox forums, and some kind of a fix is en route. I’m not confident it’ll help me, because it deactivates ASPM on Realtek devices, and turning off ASPM across the board on my computer didn’t keep the machine online. I’m curious how that particular fix was developed, or the data that informed the fix. Thinking generally, pinning such a failure down requires jumping through three levels of indirection. My poorly-informed speculation is as follows:

The first and easiest step is the watchdog timer itself. A call stack is part of the error message, which might be enough to determine the code path that started the timer. But since it is a production binary, the call stack has incomplete symbols. Getting more information would require building a debug kernel in order to get full symbols.

With that information, it should be relatively straightforward to get to the second step: determining what network operation timed out. But then what? Given the random and intermittent nature, the failing network operation was probably just an ordinary transaction that had succeeded many times before and should have succeeded again. But for whatever reason, it failed this time because the Realtek driver and/or hardware got into a bad state.

And that’s the difficult third step: how to look at an otherwise ordinary network transaction and deduce a cause for the bad Realtek state. It probably wasn’t the network transaction itself! Which meant at least one more indirect jump. The fix en route dealt with PCIe ASPM (PCI Express Active State Power Management) which probably wasn’t directly on the code path for a normal network data transmission. I’m really curious how that deduction was made and, if the incoming fix doesn’t address my issue, how I can use similar techniques to determine what put my hardware in a bad state.

From the outside, that process feels like a lot of black magic voodoo I don’t understand. For now I will sit tight with my reboot cron job workaround and wait for the updated kernel to arrive.

[UPDATE: A Proxmox VE update has arrived bringing kernel 6.2.16-18-pve to replace 6.2.16-15-pve I had been running. Despite my skepticism about ASPM, either that change or another in this update seems to be successful keeping the machine online!]


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Realtek r8168 Driver Is Not r8169 Driver Predecessor

I have a Dell Inspiron 7577 whose onboard Realtek Ethernet hardware would randomly quit under Proxmox VE. [UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.] After trying some kernel flags that didn’t help, I put in place an ugly hack to reboot the computer every time the network watchdog went off. This would at least keep the machine accessible from the network most of the time while I learn more about this problem.
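A reboot-on-watchdog hack needs some way to notice the watchdog message in the kernel log. The following is only a minimal sketch of that detection step, not the actual hack used here. The function names are hypothetical, and it assumes the log text has already been captured (for example from dmesg) and that the message keeps the "NETDEV WATCHDOG" wording seen on this machine.

```python
import re

# Matches the kernel's transmit-timeout warning, capturing interface and driver.
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue \d+ timed out"
)

def watchdog_hits(log_text):
    """Return (interface, driver) pairs for every watchdog warning in the log."""
    return WATCHDOG.findall(log_text)

def should_reboot(log_text, driver="r8169"):
    """True if the named driver shows up in any watchdog warning."""
    return any(found == driver for _, found in watchdog_hits(log_text))
```

A script run periodically could feed the current kernel log to should_reboot() and trigger a reboot when it returns True. That part is left out because it depends on how the machine is set up.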

In my initial research, I found some people who claimed switching to the r8168 driver kept their machines online. Judging by their names, I thought the r8168 driver was the immediate predecessor to the r8169 driver currently part of the system causing me headaches. But after reading a bit more, I’ve learned this was not the case. While both r8168 and r8169 refer to Linux drivers for Realtek Ethernet hardware, they exist in parallel reflecting two different development teams.

r8169 is an in-tree kernel driver that supports a few Ethernet adapters including R8168.

r8168 module built from source provided by Realtek.

— Excerpt from “r8168/r8169 – which one should I use?” on AskUbuntu.com:

This is a lot more complicated than “previous version”. As an in-tree kernel driver, r8169 will be updated in lock step with Linux updates largely independent of Realtek product cycle. As a vendor-provided module, r8168 will be updated to support Realtek hardware, but won’t necessarily stay in sync with Linux updates.

This explains why when someone has a new computer that doesn’t have networking under Linux, the suggestion is to try the r8168 driver: Realtek would add support for new hardware before Linux developers would get around to it. It also explains why people running r8168 driver run into problems later: they updated their Linux kernel and could no longer run their r8168 driver targeted to an earlier kernel.

Given this knowledge, I’m very skeptical running r8168 would help me. Some Proxmox users report that it’s the opposite of helpful, killing their network connection entirely. D’oh! Another interesting data point from that forum thread was the anecdotal observation that Proxmox clusters accelerate faults with the Realtek driver. This matches with my observation. Before I set up a Proxmox cluster, the network fault would occur roughly once or twice a day. After my cluster was up and running, it would occur many times a day with uptime as short as an hour and a half.

Even if switching to r8168 would help, it would only be a temporary solution. The next Linux update in this area would break the driver until Realtek catches up with an update. The best I can hope from r8168 is a data point informing an investigation of what triggers this fault condition, which seems like a lot of work for little gain. I decided against trying the r8168 driver. There are many other pieces in this puzzle.



Reported PCI Express Error was Unrelated

I have a Dell Inspiron 7577 laptop whose Ethernet hardware is unhappy with Proxmox VE 8, dropping off the network at unpredictable times. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The PCI Express AER messages described in this post also stopped.] Trying to dig deeper, I found there was an error message dump indicating a watchdog timer went off while waiting to transmit data over the network. Searching online, I found bug reports that match the symptoms, but matching symptoms don’t necessarily mean a matching cause. A watchdog timer can be triggered by anything that gums up the works, so what resolves the network issue on one machine wouldn’t necessarily work on mine. I went back to dmesg to look for other clues.

Before the watchdog timer triggered, I found several lines of this message at irregular intervals:

[36805.253317] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:3b:00.0

These messages are sometimes only seconds apart, other times hours apart, and sometimes they never appear at all before the watchdog timer barks. This is some sort of error on the PCIe bus from device 3b:00.0, which is the Realtek Ethernet controller as per this lspci excerpt:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
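Matching the AER message to a device by eye works for one line, but with many of them a small parser helps. This is a minimal sketch, assuming the message format stays exactly as in the line quoted above. The function names are made up for illustration.

```python
import re
from collections import Counter

# Captures the dmesg timestamp, the reporting root port, and the source device.
AER = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] pcieport (?P<port>\S+): "
    r"AER: Corrected error received: (?P<dev>\S+)"
)

def corrected_errors(log_text):
    """Return a list of (timestamp_seconds, reporting_port, source_device)."""
    return [(float(m["ts"]), m["port"], m["dev"]) for m in AER.finditer(log_text)]

def count_by_device(log_text):
    """Count corrected-error messages per source device address."""
    return Counter(dev for _, _, dev in corrected_errors(log_text))
```

The device address it reports (0000:3b:00.0 here) is the same address lspci prints, with the PCI domain prefix included.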

Even though the debug message said the error was corrected, maybe it triggered side effects causing my problem? Searching on this error message, I found several possibly relevant kernel flags. This Reddit thread has a good summary of them all.

  • pci=noaer disables PCI Express Advanced Error Reporting which sent this message. This is literally shooting the messenger. It’ll silence those messages but won’t do anything to address underlying problems.
  • pci=nomsi disables Message Signaled Interrupts (MSI), a PCI Express signaling mechanism that might cause these correctable errors, forcing all devices to fall back to a different mechanism. Some people reported losing peripherals (like USB) when they use this flag. I guess that hardware couldn’t fall back to something else? I tried it, and while it didn’t cause any obvious problems (I still had USB), it didn’t help keep my Ethernet alive either.
  • pci=nommconf disables PCI Express memory-mapped configuration. (I don’t know what those words mean, I just copied them out of kernel documentation.) The good news is adding this flag did eliminate those “Corrected error received” messages. The bad news is it didn’t help keep my Ethernet alive, either.

Up until I tried pci=nommconf I had wondered if I’ve been doing kernel flags wrong. I was editing /etc/default/grub then running update-grub. After boot, I checked they showed up on cat /proc/cmdline but I didn’t really know if the kernel actually changed behavior. After pci=nommconf, my confidence was boosted by the lack of “Corrected error received” messages, though that might still be a false sense of confidence because “Corrected error received” messages don’t always happen. It’s an imperfect world, I work with what I have.
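Checking /proc/cmdline by eye can be turned into a small repeatable check. The sketch below assumes a simple command line with no quoted values containing spaces. The sample string in the usage note is illustrative, not copied from this machine.

```python
def parse_cmdline(cmdline):
    """Map each parameter name to the list of values given for it.

    Bare flags such as "quiet" map to an empty list. Both "pci=a,b" and a
    repeated "pci=a pci=b" accumulate under the same name.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params.setdefault(name, []).extend(value.split(",") if sep else [])
    return params

def has_option(cmdline, name, value):
    """True if `name=value` (or `name=...,value,...`) is on the command line."""
    return value in parse_cmdline(cmdline).get(name, [])
```

For example, has_option("ro quiet pci=nommconf", "pci", "nommconf") returns True. As noted above, this only confirms the flag reached the kernel, not that behavior changed.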

And sadly, there is something I need but don’t have today: ability to dig deeper into Linux kernel to find out what has frozen up, leading to the watchdog timer expiring. But I’m out of ideas for now and I still have a computer that drops off the network at irregular times. I don’t want to keep pulling the laptop off the shelf to log in locally and type “reboot” several times a day. I concede I must settle for a hideously ugly hack to do that for me.



Ethernet Failure Triggers Network Stack Timeout

I was curious about Proxmox VE capability to migrate virtual machines from one cluster node to another. I set up a small cluster to try it and found it to be as easy as advertised. After migrating my VM experiments to a desktop computer with Intel networking hardware, they have been running flawlessly. This allowed me to resume tinkering with a laptop computer that would drop off the network at unpredictable times. This unfortunate tendency makes it a very poor Proxmox VE server. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve.]

Repeating Errors From r8169

After it dropped off the network, I have to log on to the computer locally. The screen is usually filled with error messages. I ran dmesg and saw the same messages there as well. Based on associated timestamp, this block of messages repeats every four minutes:

[68723.346727] r8169 0000:3b:00.0 enp59s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[68723.348833] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.350921] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.352954] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.355097] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.357156] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.359289] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.389357] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.415890] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.442132] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
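The four-minute spacing between blocks can be checked from the timestamps instead of by eye. This is a rough sketch, assuming lines that arrive within one second of each other belong to the same block. That threshold is a guess, not something taken from the driver.

```python
import re

# Timestamp of every r8169 line in a dmesg-style log.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\] r8169 ", re.MULTILINE)

def burst_starts(log_text, gap=1.0):
    """Return the timestamp of the first line of each burst of r8169 messages.

    Lines less than `gap` seconds after the previous line count as part of
    the same burst.
    """
    starts, previous = [], None
    for match in STAMP.finditer(log_text):
        ts = float(match.group(1))
        if previous is None or ts - previous > gap:
            starts.append(ts)
        previous = ts
    return starts

def intervals(starts):
    """Seconds between consecutive burst starts, rounded to milliseconds."""
    return [round(b - a, 3) for a, b in zip(starts, starts[1:])]
```

If the blocks really repeat every four minutes, intervals() should return values near 240.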

Searching on that led me to Proxmox forums, and one of the workarounds was to set the kernel flag pcie_aspm=off. I tried that, but the computer still kept dropping off the network. Either I’m not doing this correctly (editing /etc/default/grub then running update-grub) or the change doesn’t help my situation. Perhaps it addressed a different problem with similar symptoms, leaving open the mystery of what’s going on with my machine.

NETDEV WATCHDOG

Looking for more clues, I scrolled backwards in dmesg log and found this block of information just before the repeating series of r8169 errors:

[67717.227089] ------------[ cut here ]------------
[67717.227096] NETDEV WATCHDOG: enp59s0 (r8169): transmit queue 0 timed out
[67717.227126] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
[67717.227133] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilt>
[67717.227254]  iwlwifi ttm snd_timer pcspkr drm_display_helper intel_wmi_thunderbolt btintel dell_wmi_descriptor joydev processor_thermal_mbox>
[67717.227374]  i2c_i801 xhci_pci i2c_hid_acpi crc32_pclmul i2c_smbus nvme_common i2c_hid realtek xhci_pci_renesas ahci libahci psmouse xhci_hc>
[67717.227401] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O       6.2.16-15-pve #1
[67717.227404] Hardware name: Dell Inc. Inspiron 7577/0P9G3M, BIOS 1.17.0 03/18/2022
[67717.227406] RIP: 0010:dev_watchdog+0x23a/0x250
[67717.227411] Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 80 bc 48 89 c2 e8 56 91 30>
[67717.227414] RSP: 0018:ffffae88c014ce38 EFLAGS: 00010246
[67717.227417] RAX: 0000000000000000 RBX: ffff99129280c000 RCX: 0000000000000000
[67717.227419] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227421] RBP: ffffae88c014ce68 R08: 0000000000000000 R09: 0000000000000000
[67717.227423] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99129280c4c8
[67717.227425] R13: ffff99129280c41c R14: 0000000000000000 R15: 0000000000000000
[67717.227427] FS:  0000000000000000(0000) GS:ffff991600480000(0000) knlGS:0000000000000000
[67717.227429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67717.227432] CR2: 000000c0006e1010 CR3: 0000000165810003 CR4: 00000000003726e0
[67717.227434] Call Trace:
[67717.227436]  <IRQ>
[67717.227439]  ? show_regs+0x6d/0x80
[67717.227444]  ? __warn+0x89/0x160
[67717.227447]  ? dev_watchdog+0x23a/0x250
[67717.227451]  ? report_bug+0x17e/0x1b0
[67717.227455]  ? irq_work_queue+0x2f/0x70
[67717.227459]  ? handle_bug+0x46/0x90
[67717.227462]  ? exc_invalid_op+0x18/0x80
[67717.227465]  ? asm_exc_invalid_op+0x1b/0x20
[67717.227470]  ? dev_watchdog+0x23a/0x250
[67717.227474]  ? dev_watchdog+0x23a/0x250
[67717.227477]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227481]  call_timer_fn+0x29/0x160
[67717.227485]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227488]  __run_timers+0x259/0x310
[67717.227493]  run_timer_softirq+0x1d/0x40
[67717.227496]  __do_softirq+0xd6/0x346
[67717.227499]  ? hrtimer_interrupt+0x11f/0x250
[67717.227504]  __irq_exit_rcu+0xa2/0xd0
[67717.227507]  irq_exit_rcu+0xe/0x20
[67717.227510]  sysvec_apic_timer_interrupt+0x92/0xd0
[67717.227513]  </IRQ>
[67717.227515]  <TASK>
[67717.227517]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[67717.227520] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
[67717.227524] Code: 12 57 44 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00>
[67717.227526] RSP: 0018:ffffae88c00ffe38 EFLAGS: 00000246
[67717.227529] RAX: 0000000000000000 RBX: ffffce88bfc80000 RCX: 0000000000000000
[67717.227531] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227533] RBP: ffffae88c00ffe88 R08: 0000000000000000 R09: 0000000000000000
[67717.227534] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffbd2c3a40
[67717.227536] R13: 0000000000000008 R14: 0000000000000008 R15: 00003d96a543ec60
[67717.227540]  ? cpuidle_enter_state+0xce/0x6f0
[67717.227544]  cpuidle_enter+0x2e/0x50
[67717.227547]  do_idle+0x216/0x2a0
[67717.227551]  cpu_startup_entry+0x1d/0x20
[67717.227554]  start_secondary+0x122/0x160
[67717.227557]  secondary_startup_64_no_verify+0xe5/0xeb
[67717.227563]  </TASK>
[67717.227565] ---[ end trace 0000000000000000 ]---

A watchdog timer went off somewhere in the networking stack while waiting to transmit data. The data output starts with [ cut here ] but I have no idea where this information is supposed to be pasted. I recognize the format of a call trace alongside a dump of CPU register data, but the call trace is hard to read. There are a lot of “?” in here. I assumed that was because I am not running the debug kernel and symbols are missing, though kernel documentation on x86 stack traces describes “?” as marking an address found on the stack that may not be part of the actual call chain.
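To make the call trace easier to read, the entries prefixed with “?” can be separated from the ones without. This is a minimal sketch, assuming every frame line has the shape "[timestamp]  name+0xoffset/0xsize" as in the dump above. The function name is hypothetical.

```python
import re

# One call-trace frame: optional "? " marker, symbol name, then +offset/size.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\] +(\? +)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+[ \t]*$",
    re.MULTILINE,
)

def split_frames(trace_text):
    """Return (plain, questioned) lists of function names, in log order."""
    plain, questioned = [], []
    for match in FRAME.finditer(trace_text):
        (questioned if match.group(1) else plain).append(match.group(2))
    return plain, questioned
```

Run against the dump above, the plain list would read from call_timer_fn down through the timer interrupt path, which fits the idea that this trace shows the watchdog timer firing and not whatever stalled the transmit queue.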

Looking in the FAQ for Kernel.org, I followed a link to kernelnewbies.org and from there to their page “So, you think you’ve found a Linux kernel bug?” I see the section on “Oops messages” and they look very similar to what I see here, except without the actual line with “Oops” in it. From there I was linked to the kernel bug tracking database. A search on watchdog timer expiration in r8169 got several dozen hits across many years, including 217814 which I found earlier via Proxmox forum search, thus coming full circle.

I see some differences between my call trace and that in 217814, but those are possibly expected differences between my kernel (6.2.16-15-pve) and what generated 217814 (6.2.0-26-generic). In any case, the call stack appears to be for the watchdog timer itself and not whatever triggered it. Supposedly disabling ASPM would resolve 217814. Since it didn’t do anything for me, I conclude there’s something else clogging up the network stack. Teasing out that “something else” requires learning more about Linux kernel inner workings. I’m not enthusiastic about that prospect so I looked for other things to try.



Proxmox Cluster VM Migration

I had hoped to use an older Dell Inspiron 7577 as a light-duty virtualization server running Proxmox VE, but there’s a Realtek Ethernet problem causing it to lose connectivity after an unpredictable amount of time. A workaround mirroring the in-progress bug fix didn’t seem to do anything, so now I’m skeptical that the upcoming “fixed” kernel will address my issue. [UPDATE: I was wrong! After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, the network problem no longer occurs.] I found two other workarounds online: revert to an earlier kernel, or revert to an earlier driver. Neither feels like a great option, so I’m going to leverage my “hardware-rich environment” a.k.a. I hoard computer hardware and might as well put it to work.

I brought another computer system online. The hardware was formerly the core of Luggable PC Mark II and has mostly been gathering dust ever since Mark II was disassembled. I bring it out for an experiment here and there, and now it will be my alternate Proxmox VE host. The first thing I checked was its networking hardware by typing “lspci” to see all PCI devices including the following two lines:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
06:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

This motherboard has two onboard Ethernet ports, and apparently both have Intel hardware behind them. So if I run into problems, hopefully it’s at least not the same Realtek problem.

At idle, this system draws roughly 16 watts, which is not bad for a desktop system but vastly more than the 2 watts drawn by the laptop. Running my virtual machines on this desktop will hopefully be more reliable while I try to get to the bottom of my laptop’s network issue. I really like the idea of a server that draws only around 2 watts when idle so I want to make it work. This means I foresee two VM migrations: an immediate move from the laptop to the desktop, and a future migration back to the laptop after its Ethernet is reliable.
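To put the 16 watt versus 2 watt difference in perspective, here is a back-of-the-envelope sketch. It assumes each machine sits at its idle draw around the clock for a full year, which overstates how steady real usage would be.

```python
HOURS_PER_YEAR = 24 * 365

def annual_kwh(watts):
    """Energy used in a year at a constant draw, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000

desktop_kwh = annual_kwh(16)  # 140.16 kWh
laptop_kwh = annual_kwh(2)    # 17.52 kWh
difference_kwh = desktop_kwh - laptop_kwh
```

Under that assumption the desktop uses roughly 123 kWh more per year than the laptop.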

I am confident I can perform this migration manually, since I just did it a few days ago to move these virtual machines from Ubuntu Desktop KVM to Proxmox VE. But why do it manually when there’s a software feature to do it automatically? I set these two machines up as nodes in a Proxmox cluster. Grouping them together in such a way gains several features, and the one I want right now is virtual machine migration. Instead of messing around with manually setting up software and copying backup files, now I click a single “Migrate” button.

It took roughly 7 minutes to migrate the 32GB virtual disk from one Proxmox VE cluster node to another, and once back up and running, each virtual machine resumed as if nothing had happened. This is way easier and faster than my earlier manual migration procedure and I’m happy it worked seamlessly. With my virtual machines now running on a different piece of hardware, I can dig deeper into the signs of a problematic network driver.

A Quick Look at ASPM and Power Consumption

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

I’ve configured an old 15″ laptop into a light-duty virtualization server running Proxmox VE, and I’m running into a reliability problem with the Ethernet controller on this Dell Inspiron 7577. My symptoms line up with a bug that others have filed, and a change to address the issue is working its way through the pipeline. I wouldn’t call it a fix, exactly, as the problem seems to be flawed power management in Realtek hardware and/or driver in combination with the latest Linux kernel. The upcoming change doesn’t fix Realtek power management, it merely disables their participation in PCIe ASPM (Active State Power Management).

Until that change arrives, one of the mitigation workarounds is to deactivate ASPM on the entire PCIe bus. There are a lot of components on that bus! Here’s the output from running “lspci” at the command line:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
3c:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
3d:00.0 Non-Volatile memory controller: Intel Corporation Device f1aa (rev 03)

Deactivating ASPM across the board will impact far more than the Realtek chip. I was curious what impact this would have on power consumption and decided to dig up my Kill-a-Watt meter for some before/after measurements.

Dell Latitude E6230 + Ubuntu Desktop

As a point of comparison, I had measured a few values for the Dell Latitude E6230 I had just retired. These are the lowest values I could see within a ~15 second window. The reading would jump up by a watt or two for a few seconds before dropping.

  • 5W: idle.
  • 8W: hosting Home Assistant OS under KVM but not doing anything intensive.
  • 35W: 100% CPU utilization as HAOS compiled ESPHome firmware updates.

As a light-duty server, the most important value here is the 8W value, because that’s what it will be drawing most of the time.

Dell Inspiron 7577 + Proxmox VM

Since the Inspiron 7577 came with a beefy 180W AC power adapter (versus the 60W unit of the E6230) I was not optimistic about its power consumption. As a newer larger more power-hungry machine, I had expected idle power draw at least double that of the E6230. I was very pleasantly surprised. Running Proxmox VE but with all VMs shut down, the Kill-a-Watt indicated a rock solid two watts. Two!

As I started up my three virtual machines (Home Assistant OS, Plex, and InfluxDB), it jumped up to fifteen watts then gradually ramped back down to two watts as those VMs reached steady state. After that, it would occasionally jump up to four or five watts for a few seconds to service those mostly-idle VMs, then drop back down to two watts.

On the upside, it appears four generations of Intel CPU and laptop evolution have provided significant improvements in power efficiency. However, they were running different software so some of that difference might be credited to Ubuntu Desktop versus Proxmox.

On the downside, the Kill-a-Watt only measures down to whole watts with no fractional numbers. So a baseline of two watts isn’t very useful because it would take a 50% change in power consumption to show up in Kill-a-Watt numbers. I know running three VMs would take some power, but idling with and without VM both bottomed out at two watts. This puts me into measurement error territory. I need finer grained instrumentation to make meaningful measurements, but I’m not willing to pay money for just a curiosity experiment. I shrugged and kept going.
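The 50% figure falls out of simple arithmetic: one display step divided by the baseline reading. A tiny sketch, assuming the meter shows whole watts as described:

```python
def smallest_visible_change(baseline_watts, resolution_watts=1):
    """Fraction of the baseline that one display step represents."""
    return resolution_watts / baseline_watts

inspiron_step = smallest_visible_change(2)  # 0.5, i.e. 50% of a 2 W baseline
latitude_step = smallest_visible_change(8)  # 0.125, i.e. 12.5% of an 8 W baseline
```

The same one-watt step is a much smaller fraction of the E6230’s 8 watt figure, which is why the coarse readings were less of a problem there.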

Dell Inspiron 7577 + Proxmox VM + pcie_aspm=off

Reading Ubuntu bug #2031537 I saw one of their investigative steps was to add pcie_aspm=off to the kernel command line. To follow in those footsteps, I first needed to learn what that meant. I could confirm it is documented as a valid kernel command line parameter. Then I had to find instructions on how to add such a thing, which involved editing /etc/default/grub then running update-grub. And finally, after the system rebooted, I could confirm the command line was processed by typing “cat /proc/cmdline”. I don’t know how to verify it actually took effect, though, except by observing system behavior changes.
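One possible way to look beyond /proc/cmdline is the per-device link control state that "lspci -vv" prints (typically run as root), where each PCIe device has a "LnkCtl:" line starting with its ASPM state. The sketch below parses that text. It assumes the usual lspci layout of an unindented device line followed by indented detail lines, and the sample in the usage note is illustrative, not output from this laptop. The kernel parameter documentation describes pcie_aspm=off as leaving ASPM configuration untouched, as set by firmware, so a device might still report ASPM as enabled with the flag in place.

```python
import re

# Unindented device header, with an optional PCI domain prefix.
DEVICE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
# Indented link-control line; the ASPM state is the text before the semicolon.
LNKCTL = re.compile(r"^\s+LnkCtl:\s+ASPM ([^;]+);")

def aspm_states(lspci_vv_text):
    """Map device address to its LnkCtl ASPM text, e.g. 'Disabled' or 'L1 Enabled'."""
    states, current = {}, None
    for line in lspci_vv_text.splitlines():
        device = DEVICE.match(line)
        if device:
            current = device.group(1)
            continue
        link = LNKCTL.match(line)
        if link and current:
            states[current] = link.group(1).strip()
    return states
```

Given a device block whose detail contains "LnkCtl: ASPM L1 Enabled; ...", the result maps that device’s address to "L1 Enabled". Comparing the output before and after a change would show whether anything actually moved.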

The first data point is power consumption: now when hosting my three virtual machines, the Kill-a-Watt showed three watts most of the time. It still occasionally dips down to two watts for a second or two, but most of the time it hovers at three watts plus the occasional spike up to four or five watts. Given the coarse granularity, it’s inconclusive whether this reflects an actual change or just random variation.

The second and more important data point is: did it improve Ethernet reliability? Sadly it did not. Before I made this change, I noted three failures from Realtek Ethernet, each session lasting 36 hours or less. The first reboot after this change lost network after 50 hours. This might be within range of random variation (meaning maybe pcie_aspm=off didn’t actually change anything) and is definitely not long enough. After that reboot, the system fell off the network again after less than 3 hours. (2 hours 55 minutes!) That is a complete fail.

I’m sad pcie_aspm=off turned out to be a bust. So what’s next? First I need to move these virtual machines to another physical machine, which was a handy excuse to play with Proxmox clusters.

Realtek Network r8169 Woes with Linux Kernel 6

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

After setting up a Home Assistant OS virtual machine in Proxmox VE alongside a few other virtual machines, I wondered how long it would be before I encounter my first problem with this setup. I got my answer roughly 36 hours after I installed Proxmox VE. I woke up in the morning with my ESP microcontrollers blinking their blue LEDs, signaling a problem. The Dell Inspiron 7577 laptop I’m using as a light-duty server has fallen off the network. What happened?

I pulled the machine off the shelf and opened the lid. The screen was dark because of my earlier screen blanking configuration, but tapping a key woke it up and I saw it filled with messages. Two messages were dominant. There would be several lines of this:

r8169 0000:03:00.0 enp3s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).

Followed by several lines of a similar but slightly different message:

r8169 0000:03:00.0 enp3s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).

Since the machine is no longer on the network, I couldn’t access Proxmox VE’s web interface. About the only thing I could do is to log in at the keyboard and type “reboot”. A few minutes later, the system is back online.

While it was rebooting, I performed a search for rtl_ephyar_cond and found a hit on the Proxmox subreddit: System hanging intermittently after upgraded to 8. It pointed the finger at Realtek’s 8169 network driver, and to a Proxmox forum thread: System hanging after upgrade…NIC driver? It sounds like Realtek’s 8169 drivers have a bug exposed by Linux kernel 6. Proxmox bug #4807 was opened to track this issue, which led me down a chain of links to Ubuntu bug #2031537.

The code change intended to resolve this issue doesn’t fix anything on the Realtek side, but purportedly avoids the problem by disabling PCIe ASPM (Active State Power Management) for Realtek chip versions 42 and 43. I couldn’t confirm this is directly relevant to me. I typed lspci at the command line and here’s the line about my network controller:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

This matches some of the reports on Proxmox bug 4807, but I don’t know how “rev 15” relates to “42 and 43” and I don’t know how to get further details to confirm or deny. I guess I have to wait for the bug fix to propagate through the pipeline to my machine. I’ll find out if it works then, and whether there’s another problem hiding behind this one.

So if the problem is exposed by the combination of new Linux kernel and new Realtek driver and only comes up at unpredictable times after the machine has been running a while, what workarounds can I do in the meantime? I’ve seen the following options discussed:

  1. Use Realtek driver r8168.
  2. Revert to previous Linux kernel 5.12.
  3. Disable PCIe ASPM on everything with pcie_aspm=off kernel parameter.
  4. Reboot the machine regularly.

I thought I’d try the easy thing first with regular reboots. I ran “crontab -e” and added the line “0 4 * * * reboot” to the end. This should reboot the system every day at four in the morning. It ran for 36 hours the first time around, so I thought a reboot every 24 hours would suffice. This turned out to be overly optimistic. I woke up the next morning and this computer was off the network again. After another reboot, I could log in to Home Assistant and saw it stopped receiving data from my ESPHome nodes just after 3AM. If the 4AM reboot happened, it didn’t restore the network. And it doesn’t matter anyway because the Realtek crapped out before then.
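For anyone unfamiliar with the crontab line, its first five fields are the schedule and everything after them is the command. The sketch below only splits an entry into named fields to show how “0 4 * * *” reads as minute 0 of hour 4 on every day. It does not evaluate schedules, and the function name is made up.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def parse_crontab_line(line):
    """Split one crontab entry into its five schedule fields and the command."""
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields and a command")
    return dict(zip(FIELDS, parts[:5])), parts[5]

schedule, command = parse_crontab_line("0 4 * * * reboot")
# schedule["minute"] is "0", schedule["hour"] is "4", the rest are "*"
```

A fixed daily schedule like this only helps when the failure reliably takes longer than a day to appear, which turned out not to be the case here.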

Oh well! It was worth a try. I will now try disabling ASPM, which is also an opportunity to learn its impact on electric power consumption.
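After rebooting with the new parameter, it is worth confirming what the kernel actually picked up. This is a minimal sketch, assuming the standard Linux locations /proc/cmdline and /sys/module/pcie_aspm/parameters/policy. The helper names has_aspm_off and active_aspm_policy are made up for illustration:

```shell
#!/bin/sh
# Sketch: helpers for checking ASPM state. Both names are hypothetical.

# Succeed if a kernel command line string contains pcie_aspm=off.
has_aspm_off() {
    case " $1 " in
        *" pcie_aspm=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the active policy from a policy string.
# The kernel marks the active choice with [brackets].
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# On a real machine:
#   has_aspm_off "$(cat /proc/cmdline)" && echo "pcie_aspm=off is on the command line"
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
```

Running "lspci -vv" as root also reports ASPM status per device on its LnkCtl lines, which is a useful cross-check for the Realtek controller specifically.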


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Configuring Laptop for Proxmox VE

I’m migrating my light-duty server duties from my Dell Latitude E6230 to my Dell Inspiron 7577. When I started playing with KVM hypervisor on the E6230, I installed Ubuntu Desktop instead of server for two reasons: I didn’t know how to deal with the laptop screen, and I didn’t know how to work with KVM via the command line. But the experience allowed me to learn things I will incorporate into my 7577 configuration.

Dealing with the Screen

By default, Proxmox VE would leave a simple text prompt on screen, which is fine because most server hardware doesn’t even have a screen attached. On a laptop, keeping the screen on wastes power and probably causes long-term damage as well. I found an answer on Proxmox forums:

  • Edit /etc/default/grub to add “consoleblank=30” (30 is the timeout in seconds) to GRUB_CMDLINE_LINUX if that entry already exists. If not, add a single line GRUB_CMDLINE_LINUX="consoleblank=30"
  • Run update-grub to apply this configuration.
  • Reboot
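The first step can be scripted. This is a minimal sketch, assuming the GRUB_CMDLINE_LINUX value is wrapped in double quotes as in a stock Debian file. The function name add_consoleblank is made up for illustration, and the function takes the file path as an argument so it can be tried on a copy first:

```shell
#!/bin/sh
# Sketch: add consoleblank=<seconds> to GRUB_CMDLINE_LINUX in a grub defaults file.
# add_consoleblank is a hypothetical helper, not a GRUB or Proxmox tool.
add_consoleblank() {
    file="$1"      # normally /etc/default/grub
    seconds="$2"   # blanking timeout in seconds

    # Already configured somewhere in the file: leave it alone.
    if grep -q 'consoleblank=' "$file"; then
        return 0
    fi

    if grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        # Entry exists: insert the parameter right after the opening quote.
        sed "s/^GRUB_CMDLINE_LINUX=\"/GRUB_CMDLINE_LINUX=\"consoleblank=${seconds} /" \
            "$file" > "$file.tmp"
        # Write back in place so the original file's permissions are kept.
        cat "$file.tmp" > "$file"
        rm "$file.tmp"
    else
        # No entry: append a new line.
        printf 'GRUB_CMDLINE_LINUX="consoleblank=%s"\n' "$seconds" >> "$file"
    fi
}

# Example, run as root on the real file:
#   add_consoleblank /etc/default/grub 30
```

The remaining steps stay the same: run update-grub, then reboot.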

Another default behavior: when closing the laptop lid, the laptop goes to sleep. I don’t want this behavior when I’m using it as a mini-server. I was surprised to learn the technique I found for Ubuntu Desktop also works for the server edition: edit /etc/systemd/logind.conf and change HandleLidSwitch to ignore.
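The logind.conf edit can be sketched the same way. This assumes the stock file layout, where the setting appears as a commented-out line such as “#HandleLidSwitch=suspend” under the [Login] section. The function name set_lid_switch_ignore is made up for illustration:

```shell
#!/bin/sh
# Sketch: set HandleLidSwitch=ignore in a logind.conf-style file.
# set_lid_switch_ignore is a hypothetical helper, not a systemd tool.
set_lid_switch_ignore() {
    file="$1"   # normally /etc/systemd/logind.conf

    if grep -Eq '^#?HandleLidSwitch=' "$file"; then
        # Replace the existing (possibly commented-out) line.
        # The "=" in the pattern keeps HandleLidSwitchExternalPower untouched.
        sed -E 's/^#?HandleLidSwitch=.*/HandleLidSwitch=ignore/' "$file" > "$file.tmp"
        cat "$file.tmp" > "$file"
        rm "$file.tmp"
    else
        # Assumes [Login] is the only section, so appending lands inside it.
        printf 'HandleLidSwitch=ignore\n' >> "$file"
    fi
}

# Example, run as root on the real file:
#   set_lid_switch_ignore /etc/systemd/logind.conf
```

The new setting is read when systemd-logind starts, so it takes effect after that service restarts, for example on the next reboot.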

Making the two changes above turns off my laptop screen after the set number of seconds of inactivity and leaves the computer running when the lid is closed.

Dealing with KVM

KVM is a big piece of software with lots of knobs. I was intimidated by the thought of learning all the command line options and switches on my own. So, for my earlier experiment, I ran Virtual Machine Manager on Ubuntu Desktop edition to keep my settings straight. I’ve learned bits and pieces of interacting with KVM via its virsh command line tool, but I have yet to get comfortable enough with it to use the command line as my default interface.

Fortunately, many others felt similarly and there are other ways to work with a KVM hypervisor. My personal data storage solution TrueNAS has moved from a FreeBSD-based system (now named TrueNAS CORE) to a Linux-based system (a parallel sibling product called TrueNAS SCALE). TrueNAS SCALE included virtual machine capability with KVM hypervisor which looked pretty good. After a quick evaluation session, I decided I preferred working with KVM using Proxmox VE, a whole operating system built on top of Debian/Ubuntu and dedicated to the job: hosting virtual machines with the KVM hypervisor, plus tools to monitor and manage those virtual machines. Instead of Virtual Machine Manager’s UI running on Ubuntu Desktop, both TrueNAS SCALE and Proxmox VE expose their UI as a browser-based interface accessible over the network.

I liked the idea of doing everything on a single server running TrueNAS SCALE, and may eventually move in that direction. But there is something to be said for keeping two isolated machines. I need my TrueNAS SCALE machine to be absolutely reliable, an appliance I can leave running its job of data storage. It can be argued it’s a good idea to use a different machine for more experimental things like ESPHome and Home Assistant Operating System. Besides, unlike normal people, I have plenty of PC hardware sitting around. Put some of it to work!

Dell Inspiron 7577 Laptop as Light Duty Server

I’m setting aside my old Dell Latitude E6230 laptop due to its multiple hardware failures. At the moment I am using it to play with virtualization server software. Virtualization hosts usually run on rack-mounted server hardware in a datacenter somewhere. But old laptops work well for light-duty exploration at home by curious hobbyists: they sip power for small electric bill impact, they’re compact so we can stash them in a corner somewhere, and they come with a battery for surviving power failures.

I bought my Dell Inspiron 7577 15″ laptop five years ago, because at the time that was the only reasonable way to get my hands on an NVIDIA GPU. The market situation has improved since then, so I now have a better GPU on my gaming desktop. I’ve also learned I haven’t needed mobile gaming power enough to justify carrying a heavy laptop around, so I got a lighter laptop.

RAM turned out to be a big constraint on what I could explore on the E6230, which had a meager 4GB of RAM, and I couldn’t justify spending money to buy old, outdated DDR3 memory. Now I look forward to having 16GB of elbow room on the 7577.

While none of my virtualization experiments demanded much processing power, more is always better. This move will upgrade from a 3rd-gen Core i5 3320M processor to a 7th-gen Core i5 7300HQ. Getting four hardware cores instead of two hyperthreaded cores should be a good boost, in addition to all the other improvements made over four generations of Intel engineering.

For data storage, I’ve upgraded the 7577’s factory M.2 NVMe SSD from a 256GB unit to a 1TB unit, and the 7577 chassis has an open 2.5″ SATA slot for even more storage if I need it. The E6230 had only a single 2.5″ SATA slot. Neither of these machines has an optical drive, but if they did, it could be converted to another 2.5″ SATA slot with adapters made for the purpose.

Both of these laptops have a wired gigabit Ethernet port, sadly a fast-disappearing luxury in laptops. It eliminates all the unreliable hassle of wireless networking, but an Ethernet jack is a huge and bulky component in an industry aiming for ever thinner and lighter designs. [UPDATE: The 7577’s Ethernet port would prove to be a source of headaches.]

And finally, the Inspiron 7577 has a hardware-level feature to improve battery longevity: I could configure its BIOS to stop battery charging at 80% full. This should be less stressful on the battery than being kept at 100% full all the time, which is what the E6230 would do and I could not configure it otherwise. I believe this deviation from the typical laptop usage pattern contributed to the battery’s demise and the E6230’s retirement, so I hope the 80% state of charge limit will keep the 7577 battery alive for longer.

When I started playing with KVM hypervisor on the E6230, I installed Ubuntu Desktop instead of server for two reasons: I didn’t know how to deal with the laptop screen, and I didn’t know how to work with KVM via the command line. Now this 7577 configuration will incorporate what I’ve learned since then.