Proxmox Cluster Node Removal

I’ve transferred the core of a computer into a 3D-printed case, reducing the volume it took up on my shelf. It’s been part of my Proxmox experimentation, getting a feel for the software by playing with different capabilities. One notable experiment was putting two machines together into a cluster, and seeing how easy and seamless it was to migrate virtual machines between them. It was really neat!

Thankfully, the Realtek network problems which forced my hand with VM migration have been resolved, and my Dell 7577 has run reliably for several months. Since it draws less power than a Mini-ITX desktop, I decided to migrate all my virtual machines back to the 7577. This frees my Mini-ITX system to be powered down for now and available for other experiments in the future. I found instructions for removing a Proxmox cluster node, but the command failed with the error message: “cluster not ready - no quorum? (500)”

Major cluster operations require quorum, defined as a majority of nodes ((number of nodes/2)+1) being online and actively participating in cluster operations. For my two-node cluster that works out to (2/2)+1 = 2, meaning both nodes must be online. Adding and removing cluster nodes both qualify as major operations, but apparently there are built-in exceptions for adding the first few nodes, because by definition we have to start with a single node and build our way up. There is no such exception for removal, so I was prevented from dropping the node count back down to one.

Searching Proxmox forums, I found a workaround in the thread Another “cluster not ready – no quorum? (500)” case. We can suppress the quorum requirement with the command “pvecm expected 1”, then proceed with operations that typically require quorum, like removing a cluster node. Since the quorum requirement exists to make sure we don’t fatally damage a Proxmox cluster, this is a very powerful hammer that needs to be wielded carefully. We have to know what we are doing, which may include requirements outside of the actual act of removing a node.
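Putting the pieces together, the removal sequence on the surviving node looked roughly like this (a sketch; “mini-itx” is a hypothetical stand-in for the departing node’s name):

pvecm expected 1        # tell the cluster to expect a single vote for quorum
pvecm delnode mini-itx  # remove the departed node from the cluster
pvecm status            # confirm the remaining node list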

In my case, I am responsible for making sure that the removed node never gets on the network again in its current state. I unplugged the network cable from the back of the motherboard and used a Windows 10 installation USB drive to overwrite Proxmox with Windows 10. That should do it.

Bug Hunt Could Cross Three or More Levels of Indirection

When running Proxmox VE, my Dell Inspiron 7577’s onboard Realtek Ethernet would quit at unexpected times. Network transmission halts, and a network watchdog timer fires which triggers a debug error message. One proposed workaround is to change to a different Realtek driver. But after learning about the tradeoffs involved, I decided against pursuing that path.

This watchdog timer error message has been reported by many users on Proxmox forums, and some kind of a fix is en route. I’m not confident it’ll help me, because the fix deactivates ASPM on Realtek devices, and turning off ASPM across the board on my computer didn’t keep the machine online. I’m curious how that particular fix was developed, and what data informed it. Thinking generally, pinning such a failure down requires jumping through three levels of indirection. My poorly-informed speculation is as follows:

The first and easiest step is the watchdog timer itself. A call stack is part of the error message, which might be enough to determine the code path that started the timer. But since it is a production binary, the call stack has incomplete symbols. Getting more information would require building a debug kernel in order to get full symbols.

With that information, it should be relatively straightforward to get to the second step: determining what network operation timed out. But then what? Given the random and intermittent nature, the failing network operation was probably just an ordinary transaction that had succeeded many times before and should have succeeded again. But for whatever reason, it failed this time because the Realtek driver and/or hardware got into a bad state.

And that’s the difficult third step: how to look at an otherwise ordinary network transaction and deduce a cause for the bad Realtek state. It probably wasn’t the network transaction itself! Which meant at least one more indirect jump. The fix en route dealt with PCIe ASPM (PCI Express Active State Power Management) which probably wasn’t directly on the code path for a normal network data transmission. I’m really curious how that deduction was made and, if the incoming fix doesn’t address my issue, how I can use similar techniques to determine what put my hardware in a bad state.

From the outside, that process feels like a lot of black magic voodoo I don’t understand. For now I will sit tight with my reboot cron job workaround and wait for the updated kernel to arrive.

[UPDATE: A Proxmox VE update has arrived, bringing kernel 6.2.16-18-pve to replace the 6.2.16-15-pve I had been running. Despite my skepticism about ASPM, either that change or another in this update seems to have succeeded in keeping the machine online!]


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Realtek r8168 Driver Is Not r8169 Driver Predecessor

I have a Dell Inspiron 7577 whose onboard Realtek Ethernet hardware would randomly quit under Proxmox VE. [UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.] After trying some kernel flags that didn’t help, I put in place an ugly hack to reboot the computer every time the network watchdog went off. This would at least keep the machine accessible from the network most of the time while I learn more about this problem.

In my initial research, I found some people who claimed switching to the r8168 driver kept their machines online. Judging by their names, I thought the r8168 driver was the immediate predecessor to the r8169 driver currently part of the system causing me headaches. But after reading a bit more, I learned this was not the case. While both r8168 and r8169 refer to Linux drivers for Realtek Ethernet hardware, they exist in parallel, reflecting two different development teams.

r8169 is an in-tree kernel driver that supports a few Ethernet adapters including R8168.

r8168 module built from source provided by Realtek.

— Excerpt from “r8168/r8169 – which one should I use?” on AskUbuntu.com

This is a lot more complicated than “previous version”. As an in-tree kernel driver, r8169 is updated in lockstep with Linux updates, largely independent of Realtek product cycles. As a vendor-provided module, r8168 is updated to support Realtek hardware, but won’t necessarily stay in sync with Linux updates.

This explains why, when someone has a new computer that doesn’t have networking under Linux, the suggestion is to try the r8168 driver: Realtek would add support for new hardware before Linux developers get around to it. It also explains why people running the r8168 driver run into problems later: they updated their Linux kernel and could no longer run an r8168 driver built against an earlier kernel.
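As an aside, it is easy to check which of the two drivers is actually bound to the hardware. Either of these commands (using my machine’s PCI address and interface name) reports the kernel driver in use:

lspci -k -s 3b:00.0    # the “Kernel driver in use:” line names the bound driver
ethtool -i enp59s0     # the “driver:” line reports the same from the network side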

Given this knowledge, I’m very skeptical that running r8168 would help me. Some Proxmox users report it’s the opposite of helpful, killing their network connection entirely. D’oh! Another interesting data point from that forum thread was the anecdotal observation that Proxmox clusters accelerate faults with the Realtek driver. This matches my observation. Before I set up a Proxmox cluster, the network fault would occur roughly once or twice a day. After my cluster was up and running, it would occur many times a day, with uptime as short as an hour and a half.

Even if switching to r8168 would help, it would only be a temporary solution. The next Linux update in this area would break the driver until Realtek catches up with an update. The best I can hope for from r8168 is a data point informing an investigation of what triggers this fault condition, which seems like a lot of work for little gain. I decided against trying the r8168 driver. There are many other pieces in this puzzle.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Reboot After Network Watchdog Timer Fires

My Dell Inspiron 7577 is not happy running Proxmox VE. For reasons I don’t yet understand, its onboard Ethernet would quit at unpredictable times. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The hack described in this post is no longer necessary.] Running dmesg to see error messages logged on the system, I searched online and found a few Linux kernel flags to try as potential workarounds. None of them helped keep the system online. So now I’m falling back to an ugly hack: rebooting the system after it falls offline.

My first session stayed online for 36 hours, so my first attempt at this workaround was to reboot the system once a day in the middle of the night. That wasn’t good enough, because it frequently failed much sooner than 24 hours. The worst case I’ve observed so far was about 90 minutes. Unless I wanted to reboot every half hour or something equally ridiculous, I needed to react to system state and not a timer.

In the Proxmox forum thread I read, one of the members said they wrote a script to ping Google at regular intervals and reboot the system if that should fail. I started thinking about doing the same for myself but wanted to narrow down the variables. I don’t want my machine to reboot if there’s been a network hiccup at a Google datacenter, or at my ISP, or even when I’m rebooting my own router. This is a local issue and I want to keep my scope local.

So instead of running ping, I decided to base my decision on what I’ve found so far. I don’t know why the Ethernet networking stack fails, but when it does, I know a network watchdog timer fires and logs a message into the system. Reading about this system, I learned it is called the journal and can be accessed and queried using the command line tool journalctl. Reading about its options, I wrote a small shell script I named /root/watch_watchdog.sh:

#!/usr/bin/bash
# Reboot if the network watchdog has fired since the most recent boot.
# journalctl exits 0 when entries match the query, nonzero when none do.
if /usr/bin/journalctl --boot --grep="NETDEV WATCHDOG"
then
  /usr/sbin/reboot
fi

Every executable (bash, journalctl, and reboot) is specified with its full path because cron jobs run with a minimal environment, and I had problems with scripts relying on that environment when executed as cron jobs. My workaround, which I decided was also good security practice, is to fully qualify each binary.

The --boot parameter restricts the query to the current running system boot, ignoring messages from before the most recent reboot.

The --grep="NETDEV WATCHDOG" parameter looks for the network watchdog error message. I thought to restrict it to exactly the message I saw: "kernel: NETDEV WATCHDOG: enp59s0 (r8169): transmit queue 0 timed out" but using that whole string returned no entries. Since --grep treats its argument as a regular expression, the unescaped parentheses around (r8169) would be interpreted as a regex group rather than literal characters, and the "kernel:" prefix is part of the log display rather than the message field being searched; either is likely why the full string failed to match, and escaping the symbols (something like --grep="enp59s0 \(r8169\)") might have worked. Backing off, I found just "NETDEV" is too broad because there are other networking messages in the log. Just "WATCHDOG" is also too broad, given unrelated watchdogs on the system. Using "NETDEV WATCHDOG" is fine so far, but I may need to make it more specific if that turns out to still be too broad.

The most important part of this is the exit code from journalctl. It is zero if the query finds matching messages, and nonzero if no entries are found. This exit code is used by the "if" statement to decide whether to reboot the system.
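That behavior is easy to verify from an interactive shell before trusting it in a script; discard the output and look only at the exit code:

/usr/bin/journalctl --boot --grep="NETDEV WATCHDOG" > /dev/null; echo $?
# prints 0 if the watchdog message was logged this boot, 1 if not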

Once the shell script file was in place and made executable with chmod +x /root/watch_watchdog.sh, I could add it to the cron jobs table by running crontab -e. I started by running this script once an hour, on the top of the hour.

0 * * * * /root/watch_watchdog.sh

But then I thought: what’s the downside to running it more frequently? I couldn’t think of anything, so I expanded to running once every five minutes. (I learned the pattern syntax from Crontab guru.) If I learn a reason not to run this so often, I will reduce the frequency.

*/5 * * * * /root/watch_watchdog.sh

This ensures network outages due to the Realtek Ethernet issue last no more than about five minutes plus a reboot. That is a vast improvement over what I had until now: waiting until I noticed the 7577 had dropped off the network (which could take hours), pulling it off the shelf, logging in locally, and typing “reboot”. Now this script will do it within five minutes of the watchdog timer message. It’s a really ugly hack, but it’s something I can do today. Fixing this issue properly requires a lot more knowledge about Realtek network drivers, and that knowledge seems to be spread across multiple drivers.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Reported PCI Express Error was Unrelated

I have a Dell Inspiron 7577 laptop whose Ethernet hardware is unhappy with Proxmox VE 8, dropping off the network at unpredictable times. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The PCI Express AER messages described in this post also stopped.] Trying to dig deeper, I found an error message dump indicating a watchdog timer went off while waiting to transmit data over the network. Searching online, I found bug reports that match the symptoms, but matching symptoms don’t necessarily mean the same cause. A watchdog timer can be triggered by anything that gums up the works, so what resolves the network issue on one machine wouldn’t necessarily work on mine. I went back to dmesg to look for other clues.

Before the watchdog timer triggered, I found several lines of this message at irregular intervals:

[36805.253317] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:3b:00.0

Sometimes only seconds apart, other times hours apart, and sometimes it never happened at all before the watchdog timer barked. This is some sort of error on the PCIe bus from device 3b:00.0, which is the Realtek Ethernet controller as per this lspci excerpt:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

Even though the debug message said the error was corrected, maybe it triggered side effects causing my problem? Searching on this error message, I found several possibly relevant kernel flags. This Reddit thread has a good summary of them all.

  • pci=noaer disables PCI Express Advanced Error Reporting which sent this message. This is literally shooting the messenger. It’ll silence those messages but won’t do anything to address underlying problems.
  • pci=nomsi disables a PCI Express signaling mechanism that might cause these correctable errors, forcing all devices to fall back to a different mechanism. Some people reported losing peripherals (like USB) when they used this flag; I guess that hardware couldn’t fall back to anything else? I tried it, and while it didn’t cause any obvious problems (I still had USB) it didn’t help keep my Ethernet alive either.
  • pci=nommconf disables PCI Express memory-mapped configuration. (I don’t know what those words mean, I just copied them out of kernel documentation.) The good news is that adding this flag did eliminate those “Corrected error received” messages. The bad news is that it didn’t help keep my Ethernet alive, either.

Up until I tried pci=nommconf, I had wondered if I had been doing kernel flags wrong. I was editing /etc/default/grub then running update-grub. After boot, I checked that the flags showed up in cat /proc/cmdline, but I didn’t really know if the kernel actually changed behavior. After pci=nommconf, my confidence was boosted by the absence of “Corrected error received” messages, though that might still be a false sense of confidence because those messages don’t always happen. It’s an imperfect world; I work with what I have.

And sadly, there is something I need but don’t have today: the ability to dig deeper into the Linux kernel to find out what has frozen up, leading to the watchdog timer expiring. But I’m out of ideas for now, and I still have a computer that drops off the network at irregular times. I don’t want to keep pulling the laptop off the shelf to log in locally and type “reboot” several times a day. I concede I must settle for a hideously ugly hack to do that for me.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Ethernet Failure Triggers Network Stack Timeout

I was curious about Proxmox VE capability to migrate virtual machines from one cluster node to another. I set up a small cluster to try it and found it to be as easy as advertised. After migrating my VM experiments to a desktop computer with Intel networking hardware, they have been running flawlessly. This allowed me to resume tinkering with a laptop computer that would drop off the network at unpredictable times. This unfortunate tendency makes it a very poor Proxmox VE server. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve.]

Repeating Errors From r8169

After it dropped off the network, I had to log on to the computer locally. The screen was usually filled with error messages. I ran dmesg and saw the same messages there as well. Based on the associated timestamps, this block of messages repeats every four minutes:

[68723.346727] r8169 0000:3b:00.0 enp59s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[68723.348833] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.350921] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.352954] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.355097] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.357156] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.359289] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.389357] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.415890] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.442132] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).

Searching on that led me to Proxmox forums, and one of the workarounds was to set the kernel flag pcie_aspm=off. I tried that, but the computer still kept dropping off the network. Either I’m not doing this correctly (editing /etc/default/grub then running update-grub) or the change doesn’t help my situation. Perhaps it addressed a different problem with similar symptoms, leaving open the mystery of what’s going on with my machine.

NETDEV WATCHDOG

Looking for more clues, I scrolled backwards in dmesg log and found this block of information just before the repeating series of r8169 errors:

[67717.227089] ------------[ cut here ]------------
[67717.227096] NETDEV WATCHDOG: enp59s0 (r8169): transmit queue 0 timed out
[67717.227126] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
[67717.227133] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilt>
[67717.227254]  iwlwifi ttm snd_timer pcspkr drm_display_helper intel_wmi_thunderbolt btintel dell_wmi_descriptor joydev processor_thermal_mbox>
[67717.227374]  i2c_i801 xhci_pci i2c_hid_acpi crc32_pclmul i2c_smbus nvme_common i2c_hid realtek xhci_pci_renesas ahci libahci psmouse xhci_hc>
[67717.227401] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O       6.2.16-15-pve #1
[67717.227404] Hardware name: Dell Inc. Inspiron 7577/0P9G3M, BIOS 1.17.0 03/18/2022
[67717.227406] RIP: 0010:dev_watchdog+0x23a/0x250
[67717.227411] Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 80 bc 48 89 c2 e8 56 91 30>
[67717.227414] RSP: 0018:ffffae88c014ce38 EFLAGS: 00010246
[67717.227417] RAX: 0000000000000000 RBX: ffff99129280c000 RCX: 0000000000000000
[67717.227419] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227421] RBP: ffffae88c014ce68 R08: 0000000000000000 R09: 0000000000000000
[67717.227423] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99129280c4c8
[67717.227425] R13: ffff99129280c41c R14: 0000000000000000 R15: 0000000000000000
[67717.227427] FS:  0000000000000000(0000) GS:ffff991600480000(0000) knlGS:0000000000000000
[67717.227429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67717.227432] CR2: 000000c0006e1010 CR3: 0000000165810003 CR4: 00000000003726e0
[67717.227434] Call Trace:
[67717.227436]  <IRQ>
[67717.227439]  ? show_regs+0x6d/0x80
[67717.227444]  ? __warn+0x89/0x160
[67717.227447]  ? dev_watchdog+0x23a/0x250
[67717.227451]  ? report_bug+0x17e/0x1b0
[67717.227455]  ? irq_work_queue+0x2f/0x70
[67717.227459]  ? handle_bug+0x46/0x90
[67717.227462]  ? exc_invalid_op+0x18/0x80
[67717.227465]  ? asm_exc_invalid_op+0x1b/0x20
[67717.227470]  ? dev_watchdog+0x23a/0x250
[67717.227474]  ? dev_watchdog+0x23a/0x250
[67717.227477]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227481]  call_timer_fn+0x29/0x160
[67717.227485]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227488]  __run_timers+0x259/0x310
[67717.227493]  run_timer_softirq+0x1d/0x40
[67717.227496]  __do_softirq+0xd6/0x346
[67717.227499]  ? hrtimer_interrupt+0x11f/0x250
[67717.227504]  __irq_exit_rcu+0xa2/0xd0
[67717.227507]  irq_exit_rcu+0xe/0x20
[67717.227510]  sysvec_apic_timer_interrupt+0x92/0xd0
[67717.227513]  </IRQ>
[67717.227515]  <TASK>
[67717.227517]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[67717.227520] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
[67717.227524] Code: 12 57 44 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00>
[67717.227526] RSP: 0018:ffffae88c00ffe38 EFLAGS: 00000246
[67717.227529] RAX: 0000000000000000 RBX: ffffce88bfc80000 RCX: 0000000000000000
[67717.227531] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227533] RBP: ffffae88c00ffe88 R08: 0000000000000000 R09: 0000000000000000
[67717.227534] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffbd2c3a40
[67717.227536] R13: 0000000000000008 R14: 0000000000000008 R15: 00003d96a543ec60
[67717.227540]  ? cpuidle_enter_state+0xce/0x6f0
[67717.227544]  cpuidle_enter+0x2e/0x50
[67717.227547]  do_idle+0x216/0x2a0
[67717.227551]  cpu_startup_entry+0x1d/0x20
[67717.227554]  start_secondary+0x122/0x160
[67717.227557]  secondary_startup_64_no_verify+0xe5/0xeb
[67717.227563]  </TASK>
[67717.227565] ---[ end trace 0000000000000000 ]---

A watchdog timer went off somewhere in the networking stack while waiting to transmit data. The data output starts with [ cut here ] but I have no idea where this information is supposed to be pasted. I recognize the format of a call trace alongside a dump of CPU register data, but the actual call trace is incomplete. There are a lot of “?” entries in here because I am not running a debug kernel and symbols are missing.

Looking in the FAQ on Kernel.org, I followed a link to kernelnewbies.org, and from there to their page “So, you think you’ve found a Linux kernel bug?” Its section on “Oops messages” describes something very similar to what I see here, except mine lacks the actual line with “Oops” in it. From there I was linked to the kernel bug tracking database. A search on watchdog timer expiration in r8169 got several dozen hits across many years, including 217814, which I had found earlier via Proxmox forum search, thus coming full circle.

I see some differences between my call trace and the one in 217814, but those are possibly expected differences between my kernel (6.2.16-15-pve) and the one that generated 217814 (6.2.0-26-generic). In any case, the call stack appears to be for the watchdog timer itself and not whatever triggered it. Supposedly disabling ASPM would resolve 217814. Since it didn’t do anything for me, I conclude there’s something else clogging up the network stack. Teasing out that “something else” requires learning more about Linux kernel inner workings. I’m not enthusiastic about that prospect, so I looked for other things to try.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Proxmox Cluster VM Migration

I had hoped to use an older Dell Inspiron 7577 as a light-duty virtualization server running Proxmox VE, but there’s a Realtek Ethernet problem causing it to lose connectivity after an unpredictable amount of time. A workaround mirroring the in-progress bug fix didn’t seem to do anything, so now I’m skeptical the upcoming “fixed” kernel will address my issue. [UPDATE: I was wrong! After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, the network problem no longer occurs.] I found two other workarounds online: revert to an earlier kernel, or revert to an earlier driver. Neither feels like a great option, so I’m going to leverage my “hardware-rich environment”, a.k.a. I hoard computer hardware and might as well put it to work.

I brought another computer system online. This hardware was formerly the core of Luggable PC Mark II and has been mostly gathering dust ever since Mark II was disassembled; I bring it out for an experiment here and there, and now it will be my alternate Proxmox VE host. The first thing I checked was its networking hardware, by typing “lspci” to see all PCI devices, including the following two lines:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
06:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

This motherboard has two onboard Ethernet ports, and apparently both have Intel hardware behind them. So if I run into problems, hopefully it’s at least not the same Realtek problem.

At idle, this system draws roughly 16 watts, which is not bad for a desktop system but vastly more than the 2 watts drawn by a laptop. Running my virtual machines on this desktop will hopefully be more reliable while I try to get to the bottom of my laptop’s network issue. I really like the idea of a server that draws only around 2 watts when idle, so I want to make that work. This means I foresee two VM migrations: an immediate move from the laptop to the desktop, and a future migration back to the laptop after its Ethernet is reliable.

I am confident I could perform this migration manually, since I did it just a few days ago to move these virtual machines from Ubuntu Desktop KVM to Proxmox VE. But why do it manually when there’s a software feature to do it automatically? I set these two machines up as nodes in a Proxmox cluster. Grouping them together this way gains several features; the one I want right now is virtual machine migration. Instead of messing around with manually setting up software and copying backup files, now I click a single “Migrate” button.
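For reference, the command-line equivalent of what I clicked through looks roughly like this (a sketch: the cluster name, IP address, VM ID, and node name are hypothetical stand-ins for my setup):

pvecm create homelab         # on the first node: create the cluster
pvecm add 192.168.1.10       # on the second node: join, pointing at the first node’s IP
qm migrate 100 desktop-node  # what the “Migrate” button does for VM 100 (add --online for live migration)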

It took roughly 7 minutes to migrate the 32GB virtual disk from one Proxmox VE cluster node to another, and once back up and running, each virtual machine resumed as if nothing had happened. This is way easier and faster than my earlier manual migration procedure, and I’m happy it worked seamlessly. With my virtual machines now running on a different piece of hardware, I can dig deeper into the signs of a problematic network driver.

A Quick Look at ASPM and Power Consumption

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

I’ve configured an old 15″ laptop into a light-duty virtualization server running Proxmox VE, and I’m running into a reliability problem with the Ethernet controller on this Dell Inspiron 7577. My symptoms line up with a bug that others have filed, and a change to address the issue is working its way through the pipeline. I wouldn’t call it a fix, exactly, as the problem seems to be flawed power management in Realtek hardware and/or driver in combination with the latest Linux kernel. The upcoming change doesn’t fix Realtek power management, it merely disables their participation in PCIe ASPM (Active State Power Management).

Until that change arrives, one of the mitigation workarounds is to deactivate ASPM on the entire PCIe bus. There are a lot of components on that bus! Here’s the output from running “lspci” at the command line:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
3c:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
3d:00.0 Non-Volatile memory controller: Intel Corporation Device f1aa (rev 03)

Deactivating ASPM across the board will impact far more than the Realtek chip. I was curious what impact this would have on power consumption, and decided to dig up my Kill-a-Watt meter for some before/after measurements.

Dell Latitude E6230 + Ubuntu Desktop

As a point of comparison, I had measured a few values on the Dell Latitude E6230 I had just retired. These are the lowest values I could see within a ~15 second window; consumption would jump up by a watt or two for a few seconds before dropping back down.

  • 5W: idle.
  • 8W: hosting Home Assistant OS under KVM but not doing anything intensive.
  • 35W: 100% CPU utilization as HAOS compiled ESPHome firmware updates.

As a light-duty server, the most important value here is the 8W value, because that’s what it will be drawing most of the time.

Dell Inspiron 7577 + Proxmox VE

Since the Inspiron 7577 came with a beefy 180W AC power adapter (versus the 60W unit of the E6230), I was not optimistic about its power consumption. As a newer, larger, more power-hungry machine, I had expected idle power draw at least double that of the E6230. I was very pleasantly surprised. Running Proxmox VE but with all VMs shut down, the Kill-a-Watt indicated a rock solid two watts. Two!

As I started up my three virtual machines (Home Assistant OS, Plex, and InfluxDB), it jumped up to fifteen watts then gradually ramped back down to two watts as those VMs reached steady state. After that, it would occasionally jump up to four or five watts for a few seconds to service those mostly-idle VMs, then drop back down to two watts.

On the upside, it appears four generations of Intel CPU and laptop evolution have provided significant improvements in power efficiency. However, the two machines were running different software, so some of that difference might be credited to Ubuntu Desktop versus Proxmox.

On the downside, the Kill-a-Watt only measures down to whole watts, with no fractional digits. A baseline of two watts isn’t very useful when it takes a 50% change in power consumption to show up in the Kill-a-Watt numbers. I know running three VMs must take some power, but idling with and without VMs both bottomed out at two watts, which puts me into measurement error territory. I need finer-grained instrumentation to make meaningful measurements, but I’m not willing to pay money for mere curiosity. I shrugged and kept going.

Dell Inspiron 7577 + Proxmox VE + pcie_aspm=off

Reading Ubuntu bug #2031537, I saw one of their investigative steps was to add pcie_aspm=off to the kernel command line. To follow in those footsteps, I first needed to learn what that meant. I could confirm it is documented as a valid kernel command line parameter. Then I had to find instructions on how to add such a thing, which involved editing /etc/default/grub then running update-grub. And finally, after the system rebooted, I could confirm the command line was processed by typing “cat /proc/cmdline”. I don’t know how to verify it actually took effect, though, except by observing changes in system behavior.
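Condensed into commands, that procedure looks like this (a sketch based on the stock Debian/Proxmox GRUB defaults; “quiet” represents whatever was already on the line). One possible way to check whether the flag took effect is lspci -vv, whose LnkCap/LnkCtl lines report per-device ASPM support and status, though I haven’t leaned on that here:

GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"   # edited in /etc/default/grub

update-grub               # regenerate the GRUB configuration
reboot
cat /proc/cmdline         # confirm the flag was passed to the kernel
lspci -vv | grep -i aspm  # LnkCap/LnkCtl lines show ASPM status per device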

The first data point is power consumption: now when hosting my three virtual machines, the Kill-a-Watt showed three watts most of the time. It still occasionally dips down to two watts for a second or two, but most of the time it hovers at three watts plus the occasional spike up to four or five watts. Given the coarse granularity, it’s inconclusive whether this reflects actual change or just random.

The second and more important data point: did it improve Ethernet reliability? Sadly, it did not. Before I made this change, I noted three failures from the Realtek Ethernet, each session lasting 36 hours or less. The first session after this change lost network after 50 hours. That might be within the range of random variation (meaning maybe pcie_aspm=off didn’t actually change anything) and definitely isn’t long enough. After that reboot, the system fell off the network again after less than 3 hours. (2 hours 55 minutes!) That is a complete fail.

I’m sad pcie_aspm=off turned out to be a bust. So what’s next? First I need to move these virtual machines to another physical machine, which was a handy excuse to play with Proxmox clusters.

Realtek Network r8169 Woes with Linux Kernel 6

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

After setting up a Home Assistant OS virtual machine in Proxmox VE alongside a few other virtual machines, I wondered how long it would be before I encounter my first problem with this setup. I got my answer roughly 36 hours after I installed Proxmox VE. I woke up in the morning with my ESP microcontrollers blinking their blue LEDs, signaling a problem. The Dell Inspiron 7577 laptop I’m using as a light-duty server has fallen off the network. What happened?

I pulled the machine off the shelf and opened the lid. The screen was dark because of the screen blanking I had configured earlier, but tapping a key woke it up and I saw the display filled with messages. Two messages were dominant. There would be several lines of this:

r8169 0000:03:00.0 enp3s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).

Followed by several lines of a similar but slightly different message:

r8169 0000:03:00.0 enp3s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).

Since the machine was no longer on the network, I couldn’t access Proxmox VE’s web interface. About the only thing I could do was log in at the keyboard and type “reboot”. A few minutes later, the system was back online.

While it was rebooting, I performed a search for rtl_ephyar_cond and found a hit on the Proxmox subreddit: System hanging intermittently after upgraded to 8. It pointed the finger at Realtek’s 8169 network driver, and to a Proxmox forum thread: System hanging after upgrade…NIC driver? It sounds like Realtek’s 8169 drivers have a bug exposed by Linux kernel 6. Proxmox bug #4807 was opened to track this issue, which led me down a chain of links to Ubuntu bug #2031537.

The code change intended to resolve this issue doesn’t fix anything on the Realtek side, but purportedly avoids the problem by disabling PCIe ASPM (Active State Power Management) for Realtek chip versions 42 and 43. I couldn’t confirm this is directly relevant to me. I typed lspci at the command line and here’s the line about my network controller:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

This matches some of the reports on Proxmox bug 4807, but I don’t know how “rev 15” relates to “42 and 43”. As far as I can tell, the PCI revision ID reported by lspci is a different numbering scheme from the driver’s internal chip version (the r8169 source tracks those as RTL_GIGA_MAC_VER values), and I don’t know how to get further details to confirm or deny. I guess I have to wait for the bug fix to propagate through the pipeline to my machine. I’ll find out then whether it works, and whether there’s another problem hiding behind this one.

So if the problem is exposed by the combination of new Linux kernel and new Realtek driver and only comes up at unpredictable times after the machine has been running a while, what workarounds can I do in the meantime? I’ve seen the following options discussed:

  1. Use Realtek driver r8168.
  2. Revert to previous Linux kernel 5.12.
  3. Disable PCIe ASPM on everything with pcie_aspm=off kernel parameter.
  4. Reboot the machine regularly.

I thought I’d try the easy thing first: regular reboots. I ran “crontab -e” and added a line to the end: “0 4 * * * reboot”. This should reboot the system every day at four in the morning. The machine ran for 36 hours the first time around, so I thought a reboot every 24 hours would suffice. This turned out to be overly optimistic. I woke up the next morning and this computer was off the network again. After another reboot I could log in to Home Assistant, which showed it stopped receiving data from my ESPHome nodes just after 3AM. If the 4AM reboot happened, it didn’t restore the network. And it doesn’t matter anyway, because the Realtek crapped out before the reboot was even scheduled.

Oh well! It was worth a try. I will now try disabling ASPM, which is also an opportunity to learn its impact on electric power consumption.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Running Home Assistant OS Under Proxmox VE

I’ve dusted off my Dell Inspiron 7577 laptop and set it up as a light-duty virtualization server running Proxmox Virtual Environment. My InfluxDB project and my Plex server both run on top of Ubuntu Server, and Proxmox has a very streamlined process to set up virtual machines from installation media ISO file. I got those two up and running easily.

Setting up Home Assistant OS under Proxmox took more work. Unlike Virtual Machine Manager, Proxmox doesn’t have a great way to import an existing KVM virtual machine image, which is how Home Assistant OS was distributed. I tried three sets of instructions without success:

  • Proxmox documentation describes how to import an OVF file. HAOS is available as an OVA file, which is a tar archive of an OVF plus its associated files. I unpacked that file to confirm it did include an OVF file and tried using that, but the disk image reference was considered invalid by the import tool and ignored.
  • GetLabsDone: I got far enough to get a virtual machine, but it never booted. I got some sort of infinite loop, consuming 100% of one CPU while showing a blank screen.
  • OSTechNix: Slightly different procedure but the same results: blank screen and 100% of one CPU.

Then I found a thread on the Home Assistant forums, where I learned GitHub user @tteck has put together a script to automate the entire process. I downloaded the script to see what it was doing. I understood it enough to see it closely resembled the instructions on GetLabsDone and OSTechNix, but not enough to understand all the differences. I felt I at least understood it enough to be satisfied it wasn’t doing anything malicious, so I ran the script on my Proxmox VE instance and it worked well to get Home Assistant OS up and running. Looking at the resulting machine properties in Proxmox UI, I see a few differences from my failed attempts: the system BIOS is “OVMF” instead of the default “SeaBIOS”, and there’s an additional 4MB “EFI disk”. I could try to recreate a Home Assistant VM using these parameters, but since HAOS is already up and running, I’m not particularly motivated to perform that experiment.
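If I ever rerun that experiment, I expect the manual equivalent to look roughly like the sketch below, pieced together from those guides and the script. The VM ID, storage name, and disk image filename are hypothetical, and exact flags may vary between Proxmox versions:

# create an empty VM with UEFI firmware (OVMF) plus the small EFI vars disk
qm create 100 --name haos --memory 4096 --cores 2 --bios ovmf --efidisk0 local-lvm:1 --net0 virtio,bridge=vmbr0
# import the HAOS disk image (unpacked from the downloaded archive) into storage
qm importdisk 100 haos_ova.qcow2 local-lvm
# attach the imported disk, make it the boot device, and start the VM
qm set 100 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-100-disk-1
qm set 100 --boot order=scsi0
qm start 100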

A side note on auditing @tteck‘s script haos-vm.sh: commands are on a single line no matter their length, so I wanted a way to line-wrap text files at the command line and learned about the fold command. Instead of dumping out the script with “more haos-vm.sh”, I can line wrap it at spaces with “fold -s haos-vm.sh | more“.

After Home Assistant OS fired up and I could access its interface in a web browser, the very first screen has an option for me to upload a backup file from my previous HAOS installation. I uploaded the file and a few minutes later the new HAOS virtual machine running under Proxmox VE took over all functions with only a few notes:

  • The “upload…” screen spinner kept spinning even after the system was up and running. I saw CPU and memory usage drop in the Proxmox UI and thought things were done. I opened up a new browser tab to http://homeassistant.local:8123/ and saw Home Assistant was indeed up and running, but the “Uploading…” spinner never stopped. I shrugged, closed that first spinner tab, and moved on.
  • The nightly backup automation carried over, but I had to manually re-add the network drive used for backups and point the automation back at the just-re-added storage location target.
  • All my ESPHome YAML files carried over intact, but I had to manually re-add the ESPHome integration. Then all the YAML files were visible and associated with their respective still-running devices around the house, which seamlessly resumed reporting data to the new HAOS virtual machine.

I have done several Home Assistant migrations by now, and it’s been nearly seamless every time with only minor adjustments needed. I really appreciate how well Home Assistant handles this infrequently-used but important capability to backup and restore.

After I got Home Assistant up and running under Proxmox VE on the new machine, I wondered how long it’ll be before I run into my first technical problem with this setup. The answer: about 36 hours.

Configuring Laptop for Proxmox VE

I’m migrating my light-duty server duties from my Dell Latitude E6230 to my Dell Inspiron 7577. When I started playing with KVM hypervisor on the E6230, I installed Ubuntu Desktop instead of server for two reasons: I didn’t know how to deal with the laptop screen, and I didn’t know how to work with KVM via the command line. But the experience allowed me to learn things I will incorporate into my 7577 configuration.

Dealing with the Screen

By default, Proxmox VE leaves a simple text prompt on screen, which is fine because most server hardware doesn’t even have a screen attached. On a laptop, keeping the screen on wastes power and probably causes long-term damage as well. I found an answer on Proxmox forums:

  • Edit /etc/default/grub to add “consoleblank=30” (30 is timeout in seconds) to GRUB_CMDLINE_LINUX if an entry already existed. If not, add a single line GRUB_CMDLINE_LINUX="consoleblank=30"
  • Run update-grub to apply this configuration.
  • Reboot

Another default behavior: when the lid is closed, the laptop goes to sleep. I don’t want this behavior when using it as a mini-server. I was surprised to learn the technique I found for Ubuntu Desktop works for the server edition too: edit /etc/systemd/logind.conf and change HandleLidSwitch to ignore.
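Condensed, the two tweaks look like this (excerpts from my notes; restarting systemd-logind is one way to apply the lid-switch change without a full reboot, though a reboot covers both):

# /etc/default/grub — blank the console after 30 idle seconds
GRUB_CMDLINE_LINUX="consoleblank=30"

# /etc/systemd/logind.conf — keep running when the lid is closed
HandleLidSwitch=ignore

Then run update-grub and reboot for the console change, or systemctl restart systemd-logind for the lid setting.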

Making the two above changes turns off my laptop screen after the set number of seconds of inactivity, and leaves the computer running when the lid is closed.

Dealing with KVM

KVM is a big piece of software with lots of knobs. I was intimidated by the thought of learning all command line options and switches on my own. So, for my earlier experiment, I ran Virtual Machine Manager on Ubuntu Desktop edition to keep my settings straight. I’ve learned bits and pieces of interacting with KVM via its virsh command line tool, but I have yet to get comfortable enough with it to use command line as the default interface.

Fortunately, many others felt similarly, and there are other ways to work with a KVM hypervisor. My personal data storage solution TrueNAS has moved from a FreeBSD-based system (now named TrueNAS CORE) to a Linux-based system (a parallel sibling product called TrueNAS SCALE). TrueNAS SCALE includes virtual machine capability with the KVM hypervisor, which looked pretty good. After a quick evaluation session, I decided I preferred working with KVM using Proxmox VE, a whole operating system built on top of Debian and dedicated to the job: hosting virtual machines with the KVM hypervisor, plus tools to monitor and manage those virtual machines. Instead of Virtual Machine Manager’s UI running on Ubuntu Desktop, both TrueNAS SCALE and Proxmox VE expose their UI as a browser-based interface accessible over the network.

I liked the idea of doing everything on a single server running TrueNAS SCALE, and may eventually move in that direction. But there is something to be said of keeping two isolated machines. I need my TrueNAS SCALE machine to be absolutely reliable, an appliance I can leave running its job of data storage. It can be argued it’s a good idea to use a different machine for more experimental things like ESPHome and Home Assistant Operating System. Besides, unlike normal people, I have plenty of PC hardware sitting around. Put some of them to work!

My First Proxmox VM

After using an old laptop to dabble with running virtual machines under the KVM hypervisor, I’ve decided to dedicate a computer to virtual machine hosting. The heart of the machine is its CPU, memory, and M.2 SSD, all mounted to a small Mini-ITX mainboard. On these pages, they were formerly the core of Luggable PC Mark II, which was decommissioned and disassembled two years ago. Now it will run Proxmox VE (Virtual Environment), which offers both virtual machine and container hosting managed through a browser-based interface. Built on top of a Debian Linux distribution, Proxmox uses KVM as its virtual machine hypervisor, which I’ve used successfully before.

Enterprise Subscription

I downloaded Proxmox VE 7.4, the latest as of this writing, and its installation was uneventful. Its setup procedure was no more complex than an Ubuntu installation. Once up and running, the first dialog box to greet me was a “You haven’t paid for an Enterprise subscription!” reminder. This warning dialog box repeats every time I log on to the administration dashboard. Strictly speaking, a subscription is optional for running the core features of Proxmox: they remain available by switching to the no-subscription repository, which trades the Enterprise repository’s testing and service level agreement for free access. If I end up loving Proxmox and using it on an ongoing basis, I may choose to subscribe in the future. In the meantime, I have to dismiss this dialog every time I log on, even when I’m on the no-subscription repository. I understand the subscription is what pays the bills for this project, so I don’t begrudge them for promoting it. A single dialog box per logon isn’t overly pushy by my standards.

ISO Images for Operating System Installation

My first impression of the administration interface is that it is very tightly packed with information. A big blue “Create VM” button in the upper right corner made it obvious how to create a new virtual machine, but it took some time before I figured out how to install an operating system on it. During VM creation there’s a dialog box for installation media, but I couldn’t upload an Ubuntu Server 22.04 ISO from that screen. It took some poking around before I found I needed to click on the Proxmox node representing my computer, click on its local storage, and at that point I could upload an ISO. Or, conveniently, I could download from a URL if I didn’t have an ISO to upload. I could even enter the SHA256 checksum to verify integrity of the download! That’s pretty slick. (Once I’d found it.)

Helpful Help

After an installation ISO was on my Proxmox host, everything else went smoothly. This was helped tremendously by the fact every Proxmox user interface has a link to its corresponding section in online HTML documentation. I’ve learned to like this approach, because it lets me see that information in context of other related information in the same section. In contrast, clicking help in TrueNAS would give me just a short description. If that’s not enough, I’ve got to hit the web and search on my own.

USB Passthrough: Success

Once my virtual machine was up and running, I tested my must-have feature: USB passthrough. While the virtual machine stays running, I can go into the Proxmox interface and add a USB passthrough device. It immediately showed up in the virtual machine as if I had just hot-plugged the USB hardware into a port. Excellent! This brings it to parity with my existing Home Assistant VM setup using Ubuntu + Virtual Machine Manager, and ahead of TrueNAS SCALE 22.02 (“Angelfish”), which lacked USB passthrough.
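For the curious, the same attachment can be made from the Proxmox shell with the qm tool. A sketch with a hypothetical VM ID and USB vendor:product ID (find the real one with lsusb); whether it hot-plugs into a running VM may depend on Proxmox version:

lsusb                            # note the device’s vendor:product ID
qm set 100 -usb0 host=10c4:ea60  # pass that USB device through to VM 100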

When I looked at TrueNAS SCALE earlier with an eye to running my Home Assistant VM, I found the TrueNAS bug database entry tracking the USB passthrough feature request. Revisiting that item, I saw USB passthrough has since been added to TrueNAS SCALE 22.12 (“Bluefin”). Well, now. That means it’s time for another look at TrueNAS SCALE.

Hello Proxmox Virtual Environment

Last time I played with virtualization, my motivation was to run Home Assistant Operating System (HAOS) within a hypervisor that could reliably reboot my virtual machines. I was successful running HAOS under KVM (kernel-based virtual machine) on an old laptop. A bonus feature of KVM was USB passthrough, allowing a virtual machine to access USB hardware. This allowed ESPHome to perform the initial firmware flash. (After that initial flash, ESPHome can update wirelessly, but the first flash must use a USB cable.) Once I had a taste of USB passthrough, it was promoted from a “bonus” to a “must-have” feature.

I wasn’t up for learning the full suite of command-line tools for managing KVM, so I installed Virtual Machine Manager for a friendlier graphical user interface. Once everything was set up for HAOS, it was easy for me to add virtual machines for experiments. Some were quick and fleeting, others lasted weeks or months. And when I was done with an experiment, I could delete those virtual machines just as easily. I could install software within a VM without risk of interference from earlier experiments, because they were isolated in entirely different VMs. I now understand the appeal of having a fleet of disposable virtual machines!

With growing VM use, it was inevitable I’d start running into limitations of an old laptop. I had expected the processor to be the first barrier, as it was a meager Core i5-3320M with two hyperthreaded cores. But I hadn’t been running processor-intensive experiments so that CPU was actually fine. A standard 2.5″ laptop hard drive slot made for easy upgrades in SSD capacity. The biggest barrier turned out to be RAM: there was only 4GB of it, and it doesn’t make much economic sense to buy DDR3 SODIMM to upgrade this old laptop. Not when I already have more capable machines on hand I could allocate to the task.

This laptop screen has only 1366×768 resolution, which was a minor impediment. In its use as a KVM host, I only ever had to look at that screen when I brought up Virtual Machine Manager to perform VM housekeeping. (Tasks I have yet to learn to do remotely with virsh commands over ssh.) For such usage, the screen is serviceable but also cramped. I frequently wished I could manage KVM remotely from my desktop with its large monitor.

Now that I’m contemplating setting up a dedicated computer, I decided to try something more task-focused than the Ubuntu Desktop + Virtual Machine Manager combination I had been using. My desire to dedicate a computer to hosting a small number of virtual machines under the KVM hypervisor, managed over the local network, led me to Proxmox Virtual Environment. I learned about Proxmox VE when an acquaintance posted about setting it up on their machine a few weeks ago. As I read through the Proxmox website I thought “That would be interesting to investigate later.”

It is time.