Bug Hunt Could Cross Three or More Levels of Indirection

When running Proxmox VE, my Dell Inspiron 7577’s onboard Realtek Ethernet would quit at unexpected times. Network transmission would halt, and a network watchdog timer would fire, triggering a debug error message. One proposed workaround is to switch to a different Realtek driver, but after learning about the tradeoffs involved, I decided against pursuing that path.

This watchdog timer error message has been reported by many users on Proxmox forums, and some kind of fix is en route. I’m not confident it’ll help me, because it deactivates ASPM on Realtek devices, and turning off ASPM across the board on my computer didn’t keep the machine online. I’m curious how that particular fix was developed and what data informed it. Thinking about it generally, pinning such a failure down requires jumping through at least three levels of indirection. My poorly informed speculation is as follows:

The first and easiest step is the watchdog timer itself. A call stack is part of the error message, which might be enough to determine the code path that started the timer. But since this came from a production kernel binary, the call stack has incomplete symbols. Getting more information would require building a debug kernel in order to get full symbols.

With that information, it should be relatively straightforward to get to the second step: determining which network operation timed out. But then what? Given the random and intermittent nature of the failure, the failing network operation was probably just an ordinary transaction that had succeeded many times before and should have succeeded again, but for whatever reason failed this time because the Realtek driver and/or hardware got into a bad state.

And that’s the difficult third step: how to look at an otherwise ordinary network transaction and deduce a cause for the bad Realtek state. It probably wasn’t the network transaction itself, which means at least one more indirect jump. The fix en route deals with PCIe ASPM (PCI Express Active State Power Management), which probably isn’t directly on the code path for a normal network data transmission. I’m really curious how that deduction was made and, if the incoming fix doesn’t address my issue, how I can use similar techniques to determine what put my hardware in a bad state.

From the outside, that process feels like a lot of black magic voodoo I don’t understand. For now I will sit tight with my reboot cron job workaround and wait for the updated kernel to arrive.

[UPDATE: A Proxmox VE update has arrived, bringing kernel 6.2.16-18-pve to replace the 6.2.16-15-pve I had been running. Despite my skepticism about ASPM, either that change or another in this update seems to have succeeded in keeping the machine online!]


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Realtek r8168 Driver Is Not r8169 Driver Predecessor

I have a Dell Inspiron 7577 whose onboard Realtek Ethernet hardware would randomly quit under Proxmox VE. [UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.] After trying some kernel flags that didn’t help, I put in place an ugly hack to reboot the computer every time the network watchdog went off. This would at least keep the machine accessible from the network most of the time while I learn more about this problem.
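
In case anyone is curious what such a hack might look like, here is a rough sketch (not my exact script; the gateway address and file path are placeholders): a tiny shell script run from root’s crontab that pings the LAN gateway and reboots if it gets no answer.

#!/bin/sh
# /usr/local/bin/net-watchdog.sh (hypothetical path), run from root's crontab:
#   */5 * * * * /usr/local/bin/net-watchdog.sh
# Ping the LAN gateway (placeholder address); if every attempt fails, log it and reboot.
if ! ping -c 3 -W 2 192.168.1.1 > /dev/null 2>&1; then
    logger "net-watchdog: gateway unreachable, rebooting"
    /usr/sbin/reboot    # adjust the path if reboot lives elsewhere on your system
fi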

In my initial research, I found some people who claimed switching to the r8168 driver kept their machines online. Judging by the names, I assumed the r8168 driver was the immediate predecessor of the r8169 driver currently causing me headaches. After reading a bit more, I learned that is not the case. While both r8168 and r8169 are Linux drivers for Realtek Ethernet hardware, they exist in parallel, maintained by two different development teams.

r8169 is an in-tree kernel driver that supports a few Ethernet adapters including R8168.

r8168 module built from source provided by Realtek.

— Excerpt from “r8168/r8169 – which one should I use?” on AskUbuntu.com

This is a lot more complicated than “previous version”. As an in-tree kernel driver, r8169 is updated in lockstep with Linux releases, largely independent of Realtek’s product cycle. As a vendor-provided module, r8168 is updated to support new Realtek hardware, but won’t necessarily stay in sync with Linux updates.

This explains why, when someone has a new computer that doesn’t have networking under Linux, the suggestion is to try the r8168 driver: Realtek would add support for new hardware before Linux developers get around to it. It also explains why people running the r8168 driver run into problems later: they update their Linux kernel and can no longer run an r8168 driver built against an earlier kernel.

Given this knowledge, I’m very skeptical running r8168 would help me. Some Proxmox users report that it’s the opposite of helpful, killing their network connection entirely. D’oh! Another interesting data point from that forum thread was the anecdotal observation that Proxmox clusters accelerate faults with the Realtek driver. This matches my observation: before I set up a Proxmox cluster, the network fault would occur roughly once or twice a day. After my cluster was up and running, it would occur many times a day, with uptime as short as an hour and a half.

Even if switching to r8168 would help, it would only be a temporary solution. The next Linux update in this area would break the driver until Realtek catches up with an update. The best I could hope for from r8168 is a data point informing an investigation of what triggers this fault condition, which seems like a lot of work for little gain. I decided against trying the r8168 driver. There are many other pieces in this puzzle.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Reported PCI Express Error Was Unrelated

I have a Dell Inspiron 7577 laptop whose Ethernet hardware is unhappy with Proxmox VE 8, dropping off the network at unpredictable times. [UPDATE: Network connectivity stabilized after installing the Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The PCI Express AER messages described in this post also stopped.] Trying to dig deeper, I found an error message dump indicating a watchdog timer went off while waiting to transmit data over the network. Searching online, I found bug reports that match the symptoms, but matching symptoms don’t necessarily mean a matching cause. A watchdog timer can be triggered by anything that gums up the works, so what resolves the network issue on one machine won’t necessarily work on mine. I went back to dmesg to look for other clues.

Before the watchdog timer triggered, I found several lines of this message at irregular intervals:

[36805.253317] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:3b:00.0

Sometimes these lines were only seconds apart, other times hours apart, and sometimes the message never appeared at all before the watchdog timer barked. This is some sort of error on the PCIe bus from device 3b:00.0, which is the Realtek Ethernet controller as per this lspci excerpt:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
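
For anyone who wants to double-check that mapping themselves, lspci can be pointed at a single address or asked for a tree view showing which endpoints sit behind which root port. Something along these lines should work:

$ lspci -s 3b:00.0 -nn    # vendor and device IDs for just this address
$ lspci -tv               # tree view of the bus topology: root ports and their children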

Even though the debug message said the error was corrected, maybe it triggered side effects causing my problem? Searching on this error message, I found several possibly relevant kernel flags. This Reddit thread has a good summary of them all.

  • pci=noaer disables PCI Express Advanced Error Reporting, which sent this message. This is literally shooting the messenger: it’ll silence those messages but won’t do anything to address the underlying problem.
  • pci=nomsi disables MSI (Message Signaled Interrupts), a PCI Express signaling mechanism that might be behind these correctable errors, forcing all devices to fall back to older interrupt mechanisms. Some people reported losing peripherals (like USB) when they used this flag; I guess that hardware couldn’t fall back to anything else? I tried it, and while it didn’t cause any obvious problems (I still had USB), it also didn’t help keep my Ethernet alive.
  • pci=nommconf disables PCI Express memory-mapped configuration. (I don’t know what those words mean; I just copied them out of the kernel documentation.) The good news is adding this flag did eliminate those “Corrected error received” messages. The bad news is it didn’t help keep my Ethernet alive, either.

Up until I tried pci=nommconf, I had wondered if I’d been applying kernel flags wrong. I was editing /etc/default/grub and then running update-grub. After booting, I checked that the flags showed up in cat /proc/cmdline, but I didn’t really know whether the kernel actually changed its behavior. After pci=nommconf, my confidence was boosted by the lack of “Corrected error received” messages, though that might still be a false sense of confidence because those messages don’t always appear before a failure. It’s an imperfect world; I work with what I have.
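
For reference, the kernel flag workflow looks roughly like this sketch (the exact variable may be GRUB_CMDLINE_LINUX or GRUB_CMDLINE_LINUX_DEFAULT depending on the setup):

# In /etc/default/grub, append the flag to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=nommconf"
update-grub          # regenerate the GRUB configuration
reboot
cat /proc/cmdline    # after reboot, confirm the flag reached the kernel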

And sadly, there is something I need but don’t have today: the ability to dig deeper into the Linux kernel to find out what froze up and left the watchdog timer to expire. I’m out of ideas for now, and I still have a computer that drops off the network at irregular times. I don’t want to keep pulling the laptop off the shelf to log in locally and type “reboot” several times a day, so I concede I must settle for a hideously ugly hack to do that for me.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Ethernet Failure Triggers Network Stack Timeout

I was curious about Proxmox VE’s capability to migrate virtual machines from one cluster node to another. I set up a small cluster to try it and found it to be as easy as advertised. After migrating my VM experiments to a desktop computer with Intel networking hardware, they have been running flawlessly. This allowed me to resume tinkering with a laptop computer that would drop off the network at unpredictable times, an unfortunate tendency that makes it a very poor Proxmox VE server. [UPDATE: Network connectivity stabilized after installing the Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve.]

Repeating Errors From r8169

After it dropped off the network, I had to log on to the computer locally. The screen was filled with error messages, and running dmesg showed the same messages as well. Based on the associated timestamps, this block of messages repeated every four minutes:

[68723.346727] r8169 0000:3b:00.0 enp59s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[68723.348833] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.350921] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.352954] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.355097] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.357156] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.359289] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.389357] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.415890] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.442132] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
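
These messages scroll off the console quickly, but they can also be pulled out of the kernel log after the fact. A couple of commands that might help when reviewing a failure (the grep pattern is just my guess at what is worth matching):

dmesg -T | grep -E 'r8169|NETDEV WATCHDOG'    # current boot, human-readable timestamps
journalctl -k -b -1 | grep r8169              # previous boot, if the journal is persistent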

Searching on that led me to the Proxmox forums, and one of the workarounds was to set the kernel flag pcie_aspm=off. I tried that, but the computer still kept dropping off the network. Either I’m not doing this correctly (editing /etc/default/grub and then running update-grub) or the change doesn’t help my situation. Perhaps it addressed a different problem with similar symptoms, leaving open the mystery of what’s going on with my machine.

NETDEV WATCHDOG

Looking for more clues, I scrolled backwards in the dmesg log and found this block of information just before the repeating series of r8169 errors:

[67717.227089] ------------[ cut here ]------------
[67717.227096] NETDEV WATCHDOG: enp59s0 (r8169): transmit queue 0 timed out
[67717.227126] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
[67717.227133] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilt>[67717.227254]  iwlwifi ttm snd_timer pcspkr drm_display_helper intel_wmi_thunderbolt btintel dell_wmi_descriptor joydev processor_thermal_mbox>[67717.227374]  i2c_i801 xhci_pci i2c_hid_acpi crc32_pclmul i2c_smbus nvme_common i2c_hid realtek xhci_pci_renesas ahci libahci psmouse xhci_hc>[67717.227401] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O       6.2.16-15-pve #1
[67717.227404] Hardware name: Dell Inc. Inspiron 7577/0P9G3M, BIOS 1.17.0 03/18/2022
[67717.227406] RIP: 0010:dev_watchdog+0x23a/0x250
[67717.227411] Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 80 bc 48 89 c2 e8 56 91 30>[67717.227414] RSP: 0018:ffffae88c014ce38 EFLAGS: 00010246
[67717.227417] RAX: 0000000000000000 RBX: ffff99129280c000 RCX: 0000000000000000
[67717.227419] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227421] RBP: ffffae88c014ce68 R08: 0000000000000000 R09: 0000000000000000
[67717.227423] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99129280c4c8
[67717.227425] R13: ffff99129280c41c R14: 0000000000000000 R15: 0000000000000000
[67717.227427] FS:  0000000000000000(0000) GS:ffff991600480000(0000) knlGS:0000000000000000
[67717.227429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67717.227432] CR2: 000000c0006e1010 CR3: 0000000165810003 CR4: 00000000003726e0
[67717.227434] Call Trace:
[67717.227436]  <IRQ>
[67717.227439]  ? show_regs+0x6d/0x80
[67717.227444]  ? __warn+0x89/0x160
[67717.227447]  ? dev_watchdog+0x23a/0x250
[67717.227451]  ? report_bug+0x17e/0x1b0
[67717.227455]  ? irq_work_queue+0x2f/0x70
[67717.227459]  ? handle_bug+0x46/0x90
[67717.227462]  ? exc_invalid_op+0x18/0x80
[67717.227465]  ? asm_exc_invalid_op+0x1b/0x20
[67717.227470]  ? dev_watchdog+0x23a/0x250
[67717.227474]  ? dev_watchdog+0x23a/0x250
[67717.227477]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227481]  call_timer_fn+0x29/0x160
[67717.227485]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227488]  __run_timers+0x259/0x310
[67717.227493]  run_timer_softirq+0x1d/0x40
[67717.227496]  __do_softirq+0xd6/0x346
[67717.227499]  ? hrtimer_interrupt+0x11f/0x250
[67717.227504]  __irq_exit_rcu+0xa2/0xd0
[67717.227507]  irq_exit_rcu+0xe/0x20
[67717.227510]  sysvec_apic_timer_interrupt+0x92/0xd0
[67717.227513]  </IRQ>
[67717.227515]  <TASK>
[67717.227517]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[67717.227520] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
[67717.227524] Code: 12 57 44 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00>[67717.227526] RSP: 0018:ffffae88c00ffe38 EFLAGS: 00000246
[67717.227529] RAX: 0000000000000000 RBX: ffffce88bfc80000 RCX: 0000000000000000
[67717.227531] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227533] RBP: ffffae88c00ffe88 R08: 0000000000000000 R09: 0000000000000000
[67717.227534] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffbd2c3a40
[67717.227536] R13: 0000000000000008 R14: 0000000000000008 R15: 00003d96a543ec60
[67717.227540]  ? cpuidle_enter_state+0xce/0x6f0
[67717.227544]  cpuidle_enter+0x2e/0x50
[67717.227547]  do_idle+0x216/0x2a0
[67717.227551]  cpu_startup_entry+0x1d/0x20
[67717.227554]  start_secondary+0x122/0x160
[67717.227557]  secondary_startup_64_no_verify+0xe5/0xeb
[67717.227563]  </TASK>
[67717.227565] ---[ end trace 0000000000000000 ]---

A watchdog timer went off somewhere in the networking stack while waiting to transmit data. The output starts with [ cut here ], but I have no idea where this information is supposed to be pasted. I recognize the format of a call trace alongside a dump of CPU register data, but the actual call trace is incomplete: the many “?” entries mark addresses the kernel could not verify as part of the real call chain, and since I am not running a debug kernel, I don’t have the symbols to dig any further.
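
For what it’s worth, the kernel source tree ships a helper, scripts/decode_stacktrace.sh, which can translate a trace like this into source file and line references, but it needs a vmlinux with debug info matching the exact running kernel, which is precisely what I don’t have. A hypothetical invocation (the saved trace file name is a placeholder):

./scripts/decode_stacktrace.sh vmlinux < watchdog-trace.txt    # vmlinux must match the running kernel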

Looking in the Kernel.org FAQ, I followed a link to kernelnewbies.org and from there to their page “So, you think you’ve found a Linux kernel bug?” Its section on “Oops messages” describes output very similar to what I see here, except mine lacks the actual line with “Oops” in it. From there I was linked to the kernel bug tracking database. A search on watchdog timer expiration in r8169 got several dozen hits across many years, including bug 217814, which I had found earlier via a Proxmox forum search, thus coming full circle.

I see some differences between my call trace and the one in 217814, but those may just be expected differences between my kernel (6.2.16-15-pve) and the one that generated 217814 (6.2.0-26-generic). In any case, the call stack appears to be for the watchdog timer itself and not whatever triggered it. Supposedly disabling ASPM would resolve 217814. Since it didn’t do anything for me, I conclude there’s something else clogging up the network stack. Teasing out that “something else” requires learning more about the Linux kernel’s inner workings. I’m not enthusiastic about that prospect, so I looked for other things to try.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Proxmox Cluster VM Migration

I had hoped to use an older Dell Inspiron 7577 as a light-duty virtualization server running Proxmox VE, but there’s a Realtek Ethernet problem causing it to lose connectivity after an unpredictable amount of time. A workaround mirroring the in-progress bug fix didn’t seem to do anything, so now I’m skeptical that the upcoming “fixed” kernel will address my issue. [UPDATE: I was wrong! After installing the Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, the network problem no longer occurs.] I found two other workarounds online: revert to an earlier kernel, or revert to an earlier driver. Neither feels like a great option, so I’m going to leverage my “hardware-rich environment”, a.k.a. I hoard computer hardware and might as well put some of it to work.

I brought another computer system online: hardware that was formerly the core of Luggable PC Mark II and has mostly been gathering dust since Mark II was disassembled. I bring it out for an experiment here and there, and now it will be my alternate Proxmox VE host. The first thing I checked was its networking hardware, typing “lspci” to list all PCI devices, which included the following two lines:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
06:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

This motherboard has two onboard Ethernet ports, and apparently both have Intel hardware behind them. So if I run into problems, hopefully it’s at least not the same Realtek problem.

At idle, this system draws roughly 16 watts, which is not bad for a desktop but vastly more than the 2 watts drawn by the laptop. Running my virtual machines on this desktop will hopefully be more reliable while I try to get to the bottom of my laptop’s network issue. I really like the idea of a server that draws only around 2 watts at idle, so I want to make the laptop work. This means I foresee two VM migrations: an immediate move from the laptop to the desktop, and a future migration back to the laptop once its Ethernet is reliable.

I am confident I could perform this migration manually, since I just did so a few days ago to move these virtual machines from Ubuntu Desktop KVM to Proxmox VE. But why do it manually when there’s a software feature to do it automatically? I set these two machines up as nodes in a Proxmox cluster. Grouping them together this way gains several features; the one I want right now is virtual machine migration. Instead of messing around with manually setting up software and copying backup files, I now click a single “Migrate” button.
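
Cluster setup itself can be done from the web UI or the command line; the CLI version is roughly the sketch below (the cluster name and IP address are placeholders, and joining a node has side effects, so check the Proxmox documentation first):

# On the first node: create the cluster
pvecm create mycluster
# On the node being added: join, pointing at the first node's IP
pvecm add 192.168.1.10
# Verify from either node
pvecm status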

It took roughly 7 minutes to migrate the 32GB virtual disk from one Proxmox VE cluster node to another, and once back up and running, each virtual machine resumed as if nothing had happened. This is way easier and faster than my earlier manual migration procedure, and I’m happy it worked seamlessly. With my virtual machines now running on a different piece of hardware, I can dig deeper into the signs of a problematic network driver.
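
The same migration can also be kicked off from a node’s shell, something like this sketch (the VM ID and target node name are placeholders):

# Move VM 100 to the node named "desktop"; --online migrates a running VM,
# and --with-local-disks is needed when its disk lives on local rather than shared storage
qm migrate 100 desktop --online --with-local-disks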

A Quick Look at ASPM and Power Consumption

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

I’ve configured an old 15″ laptop into a light-duty virtualization server running Proxmox VE, and I’m running into a reliability problem with the Ethernet controller on this Dell Inspiron 7577. My symptoms line up with a bug that others have filed, and a change to address the issue is working its way through the pipeline. I wouldn’t call it a fix, exactly, as the problem seems to be flawed power management in the Realtek hardware and/or driver in combination with the latest Linux kernel. The upcoming change doesn’t fix Realtek power management; it merely disables those chips’ participation in PCIe ASPM (Active State Power Management).

Until that change arrives, one of the mitigation workarounds is to deactivate ASPM on the entire PCIe bus. There are a lot of components on that bus! Here’s the output from running “lspci” at the command line:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
3c:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
3d:00.0 Non-Volatile memory controller: Intel Corporation Device f1aa (rev 03)

Deactivating ASPM across the board will impact far more than the Realtek chip. I was curious what impact this would have on power consumption and decided to dig up my Kill-a-Watt meter for some before/after measurements.
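
As an aside, the kernel’s current ASPM policy can be read from sysfs before changing anything; the bracketed entry is the active policy, and the exact list varies by kernel:

$ cat /sys/module/pcie_aspm/parameters/policy
[default] performance powersave powersupersave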

Dell Latitude E6230 + Ubuntu Desktop

As a point of comparison, I had measured a few values on the Dell Latitude E6230 I had just retired. These are the lowest values I could see within a ~15-second window; readings would jump up by a watt or two for a few seconds before dropping back.

  • 5W: idle.
  • 8W: hosting Home Assistant OS under KVM but not doing anything intensive.
  • 35W: 100% CPU utilization as HAOS compiled ESPHome firmware updates.

For light-duty server use, the most important figure here is the 8W value, because that’s what the machine would be drawing most of the time.

Dell Inspiron 7577 + Proxmox VM

Since the Inspiron 7577 came with a beefy 180W AC power adapter (versus the 60W unit of the E6230), I was not optimistic about its power consumption. As a newer, larger, more power-hungry machine, I expected its idle power draw to be at least double that of the E6230. I was very pleasantly surprised: running Proxmox VE but with all VMs shut down, the Kill-a-Watt indicated a rock-solid two watts. Two!

As I started up my three virtual machines (Home Assistant OS, Plex, and InfluxDB), it jumped up to fifteen watts then gradually ramped back down to two watts as those VMs reached steady state. After that, it would occasionally jump up to four or five watts for a few seconds to service those mostly-idle VMs, then drop back down to two watts.

On the upside, it appears four generations of Intel CPU and laptop evolution have provided significant improvements in power efficiency. However, the two laptops were running different software, so some of that difference might be credited to Ubuntu Desktop versus Proxmox VE.

On the downside, the Kill-a-Watt only measures down to whole watts with no fractional digits, so a baseline of two watts isn’t very useful: it would take a 50% change in power consumption to show up in the Kill-a-Watt numbers. I know running three VMs takes some power, but idling with and without VMs both bottomed out at two watts, which puts me in measurement-error territory. I need finer-grained instrumentation to make meaningful measurements, but I’m not willing to pay money for what is just a curiosity experiment. I shrugged and kept going.

Dell Inspiron 7577 + Proxmox VM + pcie_aspm=off

Reading Ubuntu bug #2031537, I saw one of their investigative steps was to add pcie_aspm=off to the kernel command line. To follow in those footsteps, I first needed to learn what that meant. I could confirm it is documented as a valid kernel command-line parameter. Then I had to find instructions on how to add such a thing, which involved editing /etc/default/grub and then running update-grub. And finally, after the system rebooted, I could confirm the command line was processed by typing “cat /proc/cmdline”. I don’t know how to verify it actually took effect, though, except by observing changes in system behavior.
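
One check that might work is asking lspci for the link control status of the device in question; run as root, it reports the negotiated ASPM state on the LnkCtl line:

lspci -vv -s 3b:00.0 | grep -i aspm
# LnkCtl should read "ASPM Disabled" if the kernel parameter took effect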

The first data point is power consumption: now, when hosting my three virtual machines, the Kill-a-Watt showed three watts most of the time. It still occasionally dips down to two watts for a second or two, but most of the time it hovers at three watts, plus the occasional spike up to four or five watts. Given the coarse granularity, it’s inconclusive whether this reflects an actual change or just random variation.

The second and more important data point: did it improve Ethernet reliability? Sadly, it did not. Before I made this change, I noted three failures from the Realtek Ethernet, each session lasting 36 hours or less. The first session after this change lost the network after 50 hours. That might be within the range of random variation (meaning maybe pcie_aspm=off didn’t actually change anything) and is definitely not long enough. After the next reboot, the system fell off the network again in under 3 hours (2 hours 55 minutes!). That is a complete fail.
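
Keeping track of those session lengths is easier with the journal’s list of boots, or the reboot history from wtmp, along these lines:

journalctl --list-boots    # one line per boot, with first and last log timestamps
last reboot | head         # reboot history from wtmp, another view of the same data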

I’m sad pcie_aspm=off turned out to be a bust. So what’s next? First I need to move these virtual machines to another physical machine, which was a handy excuse to play with Proxmox clusters.

Realtek Network r8169 Woes with Linux Kernel 6

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

After setting up a Home Assistant OS virtual machine in Proxmox VE alongside a few other virtual machines, I wondered how long it would be before I encountered my first problem with this setup. I got my answer roughly 36 hours after I installed Proxmox VE. I woke up in the morning to my ESP microcontrollers blinking their blue LEDs, signaling a problem: the Dell Inspiron 7577 laptop I’m using as a light-duty server had fallen off the network. What happened?

I pulled the machine off the shelf and opened the lid. The screen was dark because of my earlier screen-blanking configuration, but tapping a key woke it up and I saw the display filled with messages. Two messages were dominant. There would be several lines of this:

r8169 0000:03:00.0 enp3s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).

Followed by several lines of a similar but slightly different message:

r8169 0000:03:00.0 enp3s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).

Since the machine was no longer on the network, I couldn’t access Proxmox VE’s web interface. About the only thing I could do was log in at the keyboard and type “reboot”. A few minutes later, the system was back online.

While it was rebooting, I performed a search for rtl_ephyar_cond and found a hit on the Proxmox subreddit: System hanging intermittently after upgraded to 8. It pointed the finger at the r8169 network driver for Realtek chips, and to a Proxmox forum thread: System hanging after upgrade…NIC driver? It sounds like the r8169 driver has a bug exposed by Linux kernel 6. Proxmox bug #4807 was opened to track this issue, which led me down a chain of links to Ubuntu bug #2031537.

The code change intended to resolve this issue doesn’t fix anything on the Realtek side, but purportedly avoids the problem by disabling PCIe ASPM (Active State Power Management) for Realtek chip versions 42 and 43. I couldn’t confirm this is directly relevant to me. I typed lspci at the command line and here’s the line about my network controller:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

This matches some of the reports on Proxmox bug 4807, but I don’t know how “rev 15” relates to “42 and 43” and I don’t know how to get further details to confirm or deny. I guess I have to wait for the bug fix to propagate through the pipeline to my machine. I’ll find out if it works then, and whether there’s another problem hiding behind this one.
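
One clue that might help bridge that gap: at probe time the r8169 driver logs the chip name and an XID code, which (as I understand it) is what the kernel uses internally to pick those version numbers. Something like this should dig it out of the log:

dmesg | grep -i 'r8169.*xid'    # shows the RTL chip variant and XID the driver detected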

So if the problem is exposed by the combination of new Linux kernel and new Realtek driver and only comes up at unpredictable times after the machine has been running a while, what workarounds can I do in the meantime? I’ve seen the following options discussed:

  1. Use Realtek driver r8168.
  2. Revert to previous Linux kernel 5.12.
  3. Disable PCIe ASPM on everything with pcie_aspm=off kernel parameter.
  4. Reboot the machine regularly.

I thought I’d try the easy thing first with regular reboots. I ran “crontab -e” and added a line to the end: “0 4 * * * reboot”. This should reboot the system every day at four in the morning. It ran for 36 hours the first time around, so I thought a reboot every 24 hours would suffice. This turned out to be overly optimistic. I woke up the next morning and this computer was off the network again. Another reboot later, I could log in to Home Assistant and saw it had stopped receiving data from my ESPHome nodes just after 3AM. If the 4AM reboot happened, it didn’t restore the network, and it doesn’t matter anyway because the Realtek chip crapped out before then.
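
In hindsight, one thing worth double-checking in a cron entry like that is the PATH: cron runs jobs with a minimal environment, so spelling out the full path to reboot and leaving a breadcrumb in the log is safer. A revised sketch:

# root's crontab: reboot at 4 AM daily, with a note in syslog so I can tell whether it ran
0 4 * * * /usr/bin/logger "nightly reboot" && /usr/sbin/reboot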

Oh well! It was worth a try. I will now try disabling ASPM, which is also an opportunity to learn its impact on electric power consumption.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Configuring Laptop for Proxmox VE

I’m migrating light-duty server duties from my Dell Latitude E6230 to my Dell Inspiron 7577. When I started playing with the KVM hypervisor on the E6230, I installed Ubuntu Desktop instead of Ubuntu Server for two reasons: I didn’t know how to deal with the laptop screen, and I didn’t know how to work with KVM via the command line. The experience taught me things I will incorporate into my 7577 configuration.

Dealing with the Screen

By default, Proxmox VE leaves a simple text prompt on screen, which is usually fine because most server hardware doesn’t even have a screen attached. On a laptop, keeping the screen on wastes power and probably causes long-term damage as well. I found an answer on the Proxmox forums:

  • Edit /etc/default/grub to add “consoleblank=30” (30 is the timeout in seconds) to GRUB_CMDLINE_LINUX if that entry already exists. If not, add a single line GRUB_CMDLINE_LINUX="consoleblank=30"
  • Run update-grub to apply this configuration.
  • Reboot

Another default behavior: closing the laptop lid puts the laptop to sleep. I don’t want that when I’m using it as a mini-server. I was surprised to learn the technique I found for Ubuntu Desktop works here as well: edit /etc/systemd/logind.conf and change HandleLidSwitch to ignore.
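
For reference, the relevant lines of /etc/systemd/logind.conf end up looking something like the snippet below; a reboot (or restarting systemd-logind) applies the change. The ExternalPower variant is an additional knob that I believe covers the lid-closed-on-AC case.

# /etc/systemd/logind.conf
HandleLidSwitch=ignore
HandleLidSwitchExternalPower=ignore
# apply with: systemctl restart systemd-logind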

Making the two changes above turns off my laptop screen after the set number of seconds of inactivity and leaves the computer running when the lid is closed.

Dealing with KVM

KVM is a big piece of software with lots of knobs. I was intimidated by the thought of learning all the command-line options and switches on my own, so for my earlier experiment I ran Virtual Machine Manager on Ubuntu Desktop to keep my settings straight. I’ve learned bits and pieces of interacting with KVM via its virsh command-line tool, but I have yet to get comfortable enough with it to use the command line as my default interface.

Fortunately, many others felt similarly, and there are other ways to work with a KVM hypervisor. My personal data storage solution, TrueNAS, has moved from a FreeBSD-based system (now named TrueNAS CORE) to a Linux-based system (a parallel sibling product called TrueNAS SCALE). TrueNAS SCALE includes virtual machine capability using the KVM hypervisor, which looked pretty good. After a quick evaluation session, I decided I preferred working with KVM using Proxmox VE, a whole operating system built on top of Debian and dedicated to the job: hosting virtual machines with the KVM hypervisor, plus tools to monitor and manage those virtual machines. Instead of Virtual Machine Manager’s UI running on Ubuntu Desktop, both TrueNAS SCALE and Proxmox VE expose their UI as a browser-based interface accessible over the network.

I liked the idea of doing everything on a single server running TrueNAS SCALE, and may eventually move in that direction. But there is something to be said for keeping two isolated machines. I need my TrueNAS SCALE machine to be absolutely reliable, an appliance I can leave running its job of data storage. It can be argued it’s a good idea to use a different machine for more experimental things like ESPHome and Home Assistant Operating System. Besides, unlike normal people, I have plenty of PC hardware sitting around. Put some of it to work!

Dell Inspiron 7577 Laptop as Light Duty Server

I’m setting aside my old Dell Latitude E6230 laptop due to its multiple hardware failures. At the moment I am using it to play with virtualization server software. Virtualization hosts usually run on rack-mounted server hardware in a datacenter somewhere, but an old laptop works well for light-duty exploration at home by curious hobbyists: laptops sip power for minimal electric bill impact, they’re compact enough to stash in a corner somewhere, and they come with a built-in battery for surviving power failures.

I bought my Dell Inspiron 7577 15″ laptop five years ago because, at the time, that was the only reasonable way to get my hands on an NVIDIA GPU. The market situation has improved since then, so I now have a better GPU in my gaming desktop. I’ve also learned I don’t need mobile gaming power enough to justify carrying a heavy laptop around, so I got a lighter laptop.

RAM turned out to be a big constraint on what I could explore on the E6230, which had a meager 4GB, and I couldn’t justify spending money on old, outdated DDR3 memory for it. Now I look forward to having 16GB of elbow room on the 7577.

While none of my virtualization experiments demanded much processing power, more is always better. This move will upgrade from a 3rd-gen Core i5 3320M processor to a 7th-gen Core i5 7300HQ. Getting four hardware cores instead of two hyperthreaded cores should be a good boost, in addition to all the other improvements made over four generations of Intel engineering.

For data storage, I’ve upgraded the 7577’s factory M.2 NVMe SSD from a 256GB unit to a 1TB unit, and the 7577 chassis has an open 2.5″ SATA bay for even more storage if I need it. The E6230 had only a single 2.5″ SATA bay. Neither of these machines has an optical drive, but if they did, it could be converted into another 2.5″ SATA bay with adapters made for the purpose.

Both of these laptops have a wired gigabit Ethernet port, sadly a fast-disappearing luxury in laptops. It eliminates all the unreliable hassle of wireless networking, but an Ethernet jack is a huge and bulky component in an industry aiming for ever thinner and lighter designs. [UPDATE: The 7577’s Ethernet port would prove to be a source of headaches.]

And finally, the Inspiron 7577 has a hardware-level feature to improve battery longevity: I can configure its BIOS to stop battery charging at 80% full. This should be less stressful on the battery than being kept at 100% all the time, which is what the E6230 did and could not be configured to avoid. I believe that deviation from the typical laptop usage pattern contributed to the battery’s demise and the E6230’s retirement, so I hope the 80% state-of-charge limit will keep the 7577’s battery alive longer.

When I started playing with the KVM hypervisor on the E6230, I installed Ubuntu Desktop instead of Ubuntu Server for two reasons: I didn’t know how to deal with the laptop screen, and I didn’t know how to work with KVM via the command line. This 7577 configuration will incorporate what I’ve learned since then.

Dell Latitude E6230 Getting Benched

I’ve got one set of dead batteries upgraded and tested, and now my attention turns to a different set of expired batteries. I bought this refurbished Dell Latitude E6230 several years ago intending to take it apart and use it as a robot brain. I changed my mind when it turned out to be a pretty nifty little laptop to take on the go, much smaller and lighter than my Dell Inspiron 7577. With lower specs than the 7577, it also had a longer battery runtime, and its performance didn’t throttle as much while on battery. It has helped me field-program many microcontrollers and has performed other mobile computing duties admirably.

I retired it from laptop duty when I got an Apple Silicon MacBook Air, but I brought it back out to serve as my introduction to running virtual machines under the KVM hypervisor. Retired laptops work well as low-power machines for exploratory server duty. Running things like Home Assistant doesn’t require much in the way of raw processing power; it is more important for the machine to run reliably around the clock while stashed unobtrusively in a corner somewhere. Laptops are built to be compact and energy-efficient, and they already have a built-in battery backup, though the battery usage pattern differs from normal laptop use, which caused problems in the long term.

Before that happened, though, this Latitude E6230 developed a problem starting up when warm. If I select “restart” it’ll reboot just fine, but if I select “shut down” and press the power button immediately to turn it back on, it gives me an error light pattern instead of starting up: the power LED is off, the hard drive LED is on, and the battery LED blinks. Given the blinking battery LED, I thought it indicated a problem with the battery, but if I pull out the battery to run strictly on AC, I still see the same lights. The workaround is to leave the machine alone for 20-30 minutes to cool down, after which it is happy to start up either with or without the battery.

But if the blinking battery LED doesn’t mean a problem with the battery, what does it mean? I looked for the Dell troubleshooting procedure that would explain this particular pattern, but I didn’t get very far and, once I found the workaround, I didn’t invest any more time looking. Acting as a mini-server meant the machine was running most of the time and rarely powered off, and if it does power off for any reason, this mini-server isn’t running anything critical, so waiting 20 minutes isn’t a huge deal. I lived with this annoyance for a long time, until a second problem cropped up recently:

Now when the machine is running, the battery LED blinks yellow. This time it does indicate a problem with the battery. The BIOS screen says “Battery needs to be replaced”. The Ubuntu desktop gives me a red battery icon with an exclamation mark. And if I unplug the machine, there’s zero battery runtime: the machine powers off immediately. (Which has to be followed by that 20 minute wait for it to cool down before I can start it up again.)

I knew keeping lithium-ion batteries at 100% full charge is bad for their longevity, so this was somewhat expected. I would have preferred the ability to limit the state of charge to 80% or so. Newer Dell laptops like my 7577 have such an option in the BIOS, but this older E6230 does not. Given its weird warm-startup issue and dead battery, low-power mini-server duty will now migrate to my Inspiron 7577.

SATA Optical to 2.5″ Drive Adapter

I dusted off an old Dell Optiplex 960 for use as my TrueNAS replication backup target. The compact chassis had a place for my 8TB 3.5″ backup HDD extracted from a failed USB enclosure, which is good. But I also need a separate drive for the Ubuntu operating system, and that’s where I ran into problems. There was an empty 3.5″ bay and a SATA data socket available on the motherboard, but the metal mounting bracket was missing, and the power supply had no more SATA power plugs.

As an alternative plan, I thought I would repurpose the optical drive’s location. Not just its SATA data and power plugs: I could also repurpose the physical mounting bracket with an optical-drive-shaped caddy for a 2.5″ SATA drive. (*) It wasn’t a perfect fit, but that was my own fault for ordering the wrong size.

Examining the caddy after I opened its package, I saw this oddly bent piece of sheet metal. Comparing against the DVD drive, I don’t think it’s supposed to bend like that. I can’t tell whether it was damaged at the factory or during shipping; either way, the metal was thin and easy to bend back into place.

Also comparing against the DVD drive, I realized I bought the wrong size. It hadn’t occurred to me to check whether laptop DVD drives come in multiple thicknesses. I bought a 9.5mm-thick caddy (*) when I should have bought something thicker, possibly this 12.7mm-thick unit. (*) Oh well, I have this one in hand now and I’m going to try to make it work.

To install this caddy in an Optiplex 960 chassis, I need to reuse the sheet metal tray currently attached to the DVD drive.

One side fit without problems, but the other side didn’t fit due to mismatched height. This is my own fault.

There’s a mismatch in width as well, and I’m not sure this one was my fault. I understand the different thicknesses to share the same width, so this part should have lined up. Oh well; at least an adapter that’s ~1mm too narrow is easier to deal with, because one ~1mm too wide wouldn’t fit at all.

There were slots to take the DVD drive’s faceplate. This is purely for aesthetics, so we don’t leave a gaping hole; the eject button wouldn’t work since it is no longer a DVD drive. Unfortunately, the faceplate mounting slots didn’t match up, either. This might also be a consequence of the wrong height, but I’m skeptical. I ended up using the generic faceplate that came with the caddy.

Forcing everything to fit results in a caddy mounted crookedly.

Which resulted in a crooked facade.

Aesthetically speaking this is unfortunate (I should have bought a taller caddy (*)), but functionally this unit works fine. The SSD is securely mounted in the caddy, which is in turn securely mounted to the chassis. Even more importantly, SATA power and data communication worked just fine, allowing me to install Ubuntu Server 22.04 LTS on an old, small SSD inside the caddy. And about that old SSD… freeing it up for use turned out to be its own adventure.


(*) Disclosure: As an Amazon Associate I earn from qualifying purchases.

Dusting Off Dell Optiplex 960 SFF PC

After two years of use, my USB3 external 8TB backup drive stopped responding as an external disk. I took apart its enclosure and extracted a standard 3.5″ hard disk drive, which seems OK in perfunctory testing. In order to continue using it for TrueNAS replication backup, I’ll need another enclosure. I briefly contemplated getting a USB3 SATA enclosure that takes 3.5″ drives (*), but I decided to use an entire computer as its enclosure: I have an old Dell Optiplex 960 SFF (small form factor) PC collecting dust, and it would be more useful as my TrueNAS replication backup machine.

Dell’s Optiplex line is aimed at corporate customers, which means it incorporates many design priorities that weren’t worth my money to buy new. But those designs also tend to live well past their first lives, and I have bought refurbished corporate Dells before. I’ve found them to be sturdy, well-engineered machines that, on the secondhand market, are worth a small premium over generic refurbished PCs.

There’s nothing garish about the exterior appearance of an Optiplex; it’s just the computer equivalent of professional office attire. This particular machine is designed to be a little space-efficient box. Office space costs money, and some companies decide compactness is worth paying for. Building such a compact box required using parts with nonstandard form factors. For a hobbyist like me, not being able to replace components with generic standard parts is a downside. For a corporate IT department with a Dell service contract, the ease of diagnosis and servicing is well worth the tradeoff.

This box is just as happy sitting horizontally as vertically, with rubber feet to handle either orientation.

Before it collected dust on my shelf, this computer collected dust on another maker’s shelf. I asked for it around the time I started playing with LinuxCNC: I saw this computer had a built-in parallel port, so I would not need an expansion card. (Or I could add a card for even more control pins.) The previous owner said, “Sure, I’m not doing anything with it, take it if you will do cool things with it.” Unfortunately, my LinuxCNC investigation came to a halt due to the pandemic lockdown, and I lost access to that space. Serving as a TrueNAS replication target may not be as cool as my original intention for this box, but at least it’s better than collecting dust.

Even though the chassis is small, it has a lot of nice design features. The row of “1 2 3 4” across the front is a set of diagnostic LEDs. They light up in various combinations during initial boot-up so that, if the computer fails to boot, corporate IT tech support can start diagnosing the failure before even opening up the box.

Which is great, because opening up the box might be hindered by a big beefy lock keeping the side release lever from sliding.

And if we get past the lock and open the lid, we trip the chassis intrusion detection switch. I’ve seen provision for chassis intrusion detection in my hobbyist-grade motherboards, but I never bothered to add an actual intrusion switch to any of my machines. Or a lock, for that matter.

Once opened, I found everything is designed to be worked on without special tools. This chassis accommodates two half-height expansion cards: one PCI and one PCI Express. On my PCs, expansion card endplates are held by small Phillips-head screws. On this PC, endplates are retained by this mechanism.

A push on the blue button releases a clamp for access to these endplates.

Adjacent to those expansion slots is a black plastic cage for a 3.5″ hard drive.

Two blue metal clips release the cage to flip open, allowing access to the hard drive. This drive was intended to be the only storage device, hosting the operating system plus all data. I plan to install my extracted 8TB backup storage drive in this space, which needs to be a separate drive from the operating system drive, so I need to find another spot for a system drive.

Most of the motherboard is visible after I flipped the HDD cage out of the way. I see three SATA sockets: one for the storage HDD, one for the DVD drive, and an empty one I can use for my system drive. Next to those sockets is a stick of DDR2 RAM. (I’m quite certain Corsair-branded RAM is not original Dell equipment.) Before I do anything else with this computer, I will need to replace the CR2032 coin cell timekeeping battery.

A push on the blue-stickered sheet metal button released the DVD drive. Judging by the scratches, this DVD drive has been removed and reinstalled many times.

Putting the DVD drive aside, I can see a spare 3.5″ drive bay underneath. This was expected because we could see a 3.5″ blank plate on the front of this machine, possibly originally intended for a floppy disk drive. The good news is that this bay is empty and available; the bad news is that a critical piece of hardware is missing: this chassis is designed to have a sheet metal tray for mounting a 3.5″ drive in that bay, and the tray is not here.

I can probably hack around the missing bracket with something 3D-printed or even just double-sided tape. But even if I could mount a small SSD in here, there is no spare SATA power connector available for it. This is a problem. I contemplated repurposing the DVD drive’s power and data cables for an SSD and found adapter cables for this purpose. (*) But under related items, I found a product I didn’t even know existed: an optical-to-hard-drive adapter (*) that doesn’t just handle the power and data connectors; it also fits mechanically into the optical drive’s space!


(*) Disclosure: As an Amazon Associate I earn from qualifying purchases.

Disable Sleep on a Laptop Acting as Server

I’ve played with different ways to install and run Home Assistant. At the moment, my home instance is running as a virtual machine under the KVM hypervisor. The physical machine is a refurbished Dell Latitude E6230 running Ubuntu Desktop 22.04. Even though it will be running as a server, I installed the desktop edition for access to tools like Virtual Machine Manager. But there’s a downside to installing the desktop edition for server use: battery-saving features like suspend and sleep, which I did not want.

When I chose to use an old laptop as a server, I thought its built-in battery would be useful in case of a power failure, but I hadn’t tested that hypothesis until now. Roughly twenty minutes after I unplugged the laptop, it went to sleep. D’oh! The machine still reported 95% of battery capacity, but I couldn’t use that capacity as backup power.

The Ubuntu “Settings” user interface was disappointingly useless for this purpose, with no obvious ability to disable sleep when on battery power. Generally speaking, the revamped “Settings” of Ubuntu 22 has been cleaned up and now has fewer settings cluttering up all those menus. I could see this as a well-meaning effort to make Ubuntu less intimidating to beginners, but right now it’s annoying because I can’t do what I want. To the web search engines!

Looking for command-line tools to change Ubuntu power-saving settings brought me to many pages with outdated information that no longer applies to Ubuntu 22. My path to success started with this forum thread on Linux.org, which pointed to this page on linux-tips.us. It has a lot of ads, but it also has applicable information about systemd targets. The page listed four potentially applicable targets:

  • suspend.target
  • sleep.target
  • hibernate.target
  • hybrid-sleep.target

Using “systemctl status” I could check which of those were triggered when my laptop went to sleep.

$ systemctl status suspend.target
○ suspend.target - Suspend
     Loaded: loaded (/lib/systemd/system/suspend.target; static)
     Active: inactive (dead)
       Docs: man:systemd.special(7)

Jul 21 22:58:32 dellhost systemd[1]: Reached target Suspend.
Jul 21 22:58:32 dellhost systemd[1]: Stopped target Suspend.
$ systemctl status sleep.target
○ sleep.target
     Loaded: masked (Reason: Unit sleep.target is masked.)
     Active: inactive (dead) since Thu 2022-07-21 22:58:32 PDT; 11h ago

Jul 21 22:54:41 dellhost systemd[1]: Reached target Sleep.
Jul 21 22:58:32 dellhost systemd[1]: Stopped target Sleep.
$ systemctl status hibernate.target
○ hibernate.target - System Hibernation
     Loaded: loaded (/lib/systemd/system/hibernate.target; static)
     Active: inactive (dead)
       Docs: man:systemd.special(7)
$ systemctl status hybrid-sleep.target
○ hybrid-sleep.target - Hybrid Suspend+Hibernate
     Loaded: loaded (/lib/systemd/system/hybrid-sleep.target; static)
     Active: inactive (dead)
       Docs: man:systemd.special(7)

Looks like my laptop reached the “Sleep” then “Suspend” targets, so I’ll disable those two.

$ sudo systemctl mask sleep.target
Created symlink /etc/systemd/system/sleep.target → /dev/null.
$ sudo systemctl mask suspend.target
Created symlink /etc/systemd/system/suspend.target → /dev/null.

After they were masked, the laptop was willing to use most of its battery capacity instead of just a tiny sliver. This should be good for several hours, but what happens after that? When the battery is almost empty, I want the computer to go into hibernation instead of dying unpredictably and possibly in a bad state. This is why I left hibernate.target alone, but I wanted to do more for battery health. I didn’t want to drain the battery all the way to near-empty, and this thread on AskUbuntu led me to /etc/UPower/UPower.conf, which dictates what battery levels will trigger hibernation. I raised the levels so the battery shouldn’t be drained much past 15%.

# Defaults:
# PercentageLow=20
# PercentageCritical=5
# PercentageAction=2
PercentageLow=25
PercentageCritical=20
PercentageAction=15

The UPower service needs to be restarted to pick up those changes.

$ sudo systemctl restart upower.service

Alas, that did not have the effect I hoped it would. With the cord left unplugged, the battery dropped straight past 15% and the machine did not go into hibernation. The percentage dropped faster and faster as it went lower, too, an indication that the battery is not in great shape, or at least is mismatched with what its management system thinks it should be doing.

$ upower -i /org/freedesktop/UPower/devices/battery_BAT0
  native-path:          BAT0
  vendor:               DP-SDI56
  model:                DELL YJNKK18
  serial:               1
  power supply:         yes
  updated:              Fri 22 Jul 2022 03:31:00 PM PDT (9 seconds ago)
  has history:          yes
  has statistics:       yes
  battery
    present:             yes
    rechargeable:        yes
    state:               discharging
    warning-level:       action
    energy:              3.2079 Wh
    energy-empty:        0 Wh
    energy-full:         59.607 Wh
    energy-full-design:  57.72 Wh
    energy-rate:         10.1565 W
    voltage:             9.826 V
    charge-cycles:       N/A
    time to empty:       19.0 minutes
    percentage:          5%
    capacity:            100%
    technology:          lithium-ion
    icon-name:          'battery-caution-symbolic'

I kept it unplugged until it dropped to 2%, at which point the default PercentageAction behavior of PowerOff should have occurred. It did not, so I gave up on this round of testing and plugged the laptop back into its power cord. I’ll have to come back later to figure out why this didn’t work but, hey, at least this old thing was able to run 5 hours and 15 minutes on battery.

And finally: this laptop will be left plugged in most of the time, so it would be nice to limit charging to no more than 80% of capacity to reduce battery wear. I’m OK with a 20% reduction in battery runtime, since I’m mostly concerned about brief power blinks of a few minutes; riding out a power failure for 4 hours instead of 5 makes little difference. I have seen “battery charge limit” as an option in the BIOS settings of my newer Dell laptops, but not on this old laptop. And unfortunately, it does not appear possible to accomplish this strictly in Ubuntu software without hardware support. The thread I found did describe an intriguing option, however: dig into the cable, pull out the Dell power supply communication wire, and hook it up to a switch. When that wire is connected, everything works as it does today. When it is disconnected, some Dell laptops will run on AC power but not charge their battery. With that, I could rig up some sort of external hardware to keep the battery level around 75-80%. That would also be a project for another day.

Home Assistant OS in KVM Hypervisor

I encountered some problems running Home Assistant Operating System (HAOS) as a virtual machine on a TrueNAS CORE server, which is based on FreeBSD and its bhyve hypervisor. I wanted to solve these problems and, given my good experience with Home Assistant, I was willing to give it dedicated hardware. A lot of people use a Raspberry Pi, but in these times of hardware scarcity a Raspberry Pi is rarer and more valuable than an old laptop. I pulled out a refurbished Dell Latitude E6230 I had originally intended to use as a robot brain. Now it shall be my Home Assistant server, which is a robot brain of sorts. This laptop’s Core i5-3320M CPU launched ten years ago, but as an x86_64-capable CPU designed for power-saving laptop use, it should suit Home Assistant well.

Using Ubuntu KVM Because Direct Installation Failed to Boot

I was willing to run HAOS directly on the machine, but the UEFI boot process failed for reasons I can’t decipher. I couldn’t even copy down an error message due to scrambled text on screen. HAOS 8.0 moved to a new boot procedure as per its release announcement, and the comments thread on that page had lots of users reporting boot problems. [UPDATE: A few days later, HAOS 8.1 was released with several boot fixes.] Undeterred, I tried a different tack: install Ubuntu Desktop 22.04 LTS and run HAOS as a virtual machine under KVM Hypervisor. This is the hypervisor used by the Linux-based TrueNAS SCALE, to which I might migrate in the future. Whether it works with HAOS would be an important data point in that decision.

Even though I expect this computer to run as an unattended server most of the time, I installed Ubuntu Desktop instead of Ubuntu Server for two reasons:

  1. Ubuntu Server has no knowledge of laptop components, so I’d be stuck with default hardware behaviors that are problematic. First, the screen always stays on, which wastes power. Second, closing the lid puts the machine to sleep, which defeats the point of a server. With Ubuntu Desktop I’ve found how to solve both problems: edit /etc/systemd/logind.conf and change the lid switch behavior to lock, which turns off the screen but leaves the computer running (see the excerpt after this list). I don’t know how to do this with Ubuntu Server or a direct Home Assistant OS installation.
  2. KVM Hypervisor is a huge piece of software with many settings. Given enough time I’m sure I could learn all of the command line tools I need to get things up and running, but I have a faster option with Ubuntu Desktop: use Virtual Machine Manager to help me make sense of KVM.
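
Here is the lid switch change mentioned in the first item above, as a minimal excerpt rather than a complete file. The second key is my addition to cover the plugged-in case, and systemd-logind needs to be restarted (or the machine rebooted) for the change to take effect.

[Login]
# "lock" locks the session and blanks the screen on lid close, but keeps
# the machine running instead of the default suspend behavior.
HandleLidSwitch=lock
HandleLidSwitchExternalPower=lock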

KVM Network Bridge

Home Assistant instructions for installing HAOS as a KVM virtual machine were fairly straightforward, except for a lack of detail on how to set up a network bridge. This is required so HAOS is a peer on my home network, capable of communicating with ESPHome devices. (Equivalent to the network_mode: host option when running the Home Assistant Docker container.) The HAOS instruction page merely says “Select your bridge,” so I had to search elsewhere for details.

A promising search hit was How to use bridged networking with libvirt and KVM on linuxconfig.org. It gave a lot of good background information, but I didn’t care for the actual procedure due to this excerpt: “Notice that you can’t use your main ethernet interface […] we will use an additional interface […] provided by an ethernet to usb adapter attached to my machine.” I don’t want to add another Ethernet adapter to my machine. I know network bridging is possible on the existing adapter, because Docker does it with network_mode:host.

My next stop was the Configuring Guest Networking page of the KVM documentation. It offered several options corresponding to different scenarios, helping me confirm that I wanted “Public Bridge”. That page had a few Linux distribution-specific scripts, including one for Debian. Unfortunately, it wanted me to edit /etc/network/interfaces, a file which doesn’t exist on Ubuntu 22.04. Fortunately, it gave me enough relevant keywords to find the Network Configuration page of Ubuntu documentation, whose “Bridging” section pointed me to /etc/netplan. I had to change their example to match the Ethernet hardware name on my computer, but once that was done I had a public network bridge on top of my existing network adapter.
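
For reference, the bridge configuration ends up with roughly this shape, adapted from the Ubuntu documentation example. Treat it as a sketch: the file name and the interface name enp0s25 are placeholders, so substitute whatever name ip link reports on your machine.

# /etc/netplan/01-br0.yaml -- sketch only; "enp0s25" is a placeholder for
# the Ethernet interface name reported by `ip link`.
network:
  version: 2
  renderer: networkd
  ethernets:
    enp0s25:
      dhcp4: false
  bridges:
    br0:
      interfaces: [enp0s25]
      dhcp4: true

After sudo netplan apply, the new br0 bridge is what gets picked at the HAOS “Select your bridge” step (or assigned to the virtual machine as its network interface in Virtual Machine Manager).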

USB Device Redirection

Even though I’m running HAOS under a virtual machine hypervisor, ESPHome can still access USB hardware thanks to KVM USB device redirection.

First I plug in my ESP32 development board. Then I open the Home Assistant virtual machine instance and select “Redirect USB device” under the “Virtual Machine” menu.

That brings up a list of plugged-in USB devices, where I can select the USB-to-UART bridge on my ESP32 development board. Once selected, the ESPHome add-on running within this instance of HAOS can see the ESP32 board and flash its firmware over USB. This process is not as direct as it would have been with HAOS running directly on the computer, but it’s far better than what I had to do before.

At the moment, surfacing KVM capability for USB device redirection is not available on TrueNAS SCALE but it is a requested feature. Now that I see the feature working, it has become a must-have for me. Until this is done, I probably won’t bother migrating my TrueNAS server from CORE (FreeBSD/bhyve) to SCALE (Linux/KVM) because I want this feature when I consolidate HAOS back onto my TrueNAS hardware. (And probably send this Dell Latitude E6230 back into the storage closet.)

Start on Boot

And finally, I had to tell KVM to launch Home Assistant automatically upon boot, by checking “Start virtual machine on host boot up” under the “Boot Options” setting.

In time I expect I’ll learn the KVM command lines to accomplish what I’m doing today with Virtual Machine Manager, but for now I’m glad VMM helped me get everything up and running quickly.

[UPDATE: virsh autostart is the command line tool to launch a virtual machine upon system startup. Haven’t yet figured out command line procedure for USB redirection.]
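
For my own notes, the autostart half is a one-liner (“haos” here is just whatever the virtual machine was named):

$ virsh autostart haos

As for USB, the “Redirect USB device” menu item uses SPICE-based redirection as far as I can tell. A different command-line route I’d try first is libvirt host USB passthrough, which I understand can attach a device by its vendor/product ID. A hedged sketch, with placeholder IDs standing in for whatever lsusb reports for the real USB-to-UART bridge:

$ lsusb
$ cat usb-esp32.xml
<hostdev mode='subsystem' type='usb' managed='yes'>
  <source>
    <vendor id='0x10c4'/>
    <product id='0xea60'/>
  </source>
</hostdev>
$ virsh attach-device haos usb-esp32.xml --live

I haven’t verified this on my setup yet, so the GUI remains my working method.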

Dell XPS M1330 LED Backlight

My detour into laundry machine repair pushed back my LED backlight adventures for a bit, but I’m back on the topic now, armed with my new dedicated backlight tester. The next backlight I shall attempt to salvage came from a Dell XPS M1330. This particular Dell product line offered an optional NVIDIA GPU packed into its lightweight chassis. Some engineering tradeoffs had to be made, and history has deemed those tradeoffs poor, as these laptops had a short life expectancy. In the absence of an official story from Dell, the internet consensus is that heat management was insufficient and these laptops cooked themselves after a few years. I was given one such failed unit, which I tore down some years ago. I kept its screen and the laptop’s metal lid in case I wanted a rigid metal framework to go with the screen.

The display module itself was a Toshiba LTD133EWDD which had a native resolution of 1280×800 pixels. Not terribly interesting in today’s 1080p world. Certainly not enough motivation for me to buy an adapter to turn it into an external monitor, and hence a good candidate for backlight extraction.

Unlike the previous LCD modules I’ve taken apart, this one doesn’t cover its integrated control board in opaque black tape. Clear plastic is used instead, and I could immediately pick out the characteristic connection to the rest of the display. At the bottom are two of those high density data connections for the LCD pixel array, and towards the right is an 8-conductor connector for the LED backlight. The IC in closest proximity is my candidate for LED backlight controller.

Even though the plastic was clear, it was still a little difficult to read the fine print on that chip. But after the plastic was removed I could clearly read “TOKO 61224 A33X”, which failed to return any relevant results in a web search. [UPDATE: Randy has better search Kung Fu than I do, and found a datasheet.] Absent documentation, I’m not optimistic I could drive the chip as I could a Texas Instruments TPS61187. So I’ll probably end up trying to power the LEDs in the backlight directly.

ESA ISS Tracker on Dell Latitude X1

My failed effort at an ISS Tracker web kiosk reminded me of my previous failure trying to get an Ubuntu Core web kiosk up and running on old hardware. That computer, a Dell Latitude X1, was also very sluggish running modern Ubuntu Mate interactively when I had tried it. I was curious how it would compare with the HP Mini.

The HP Mini has the advantage of age: it is roughly ten years old, whereas the X1 is around fifteen years old. When it comes to computers, an age difference of five years is a huge gulf spanning multiple hardware generations. However, the X1 launched as a top-of-the-line premium product for people who were willing to pay for thin and light. Hence it was designed under very different criteria than the HP Mini, despite their similarity in form factor.

As one example: the HP Mini housed a commodity 2.5″ laptop hard drive, but the Dell Latitude X1 used a much smaller form factor hard drive that I have not seen before or since. Given its smaller market and lower volume, I think it is fair to assume the smaller hard drive came at a significant price premium in exchange for saving a few cubic centimeters of volume and a few grams of weight.

Installing Ubuntu Mate 18.04 on the X1, I confirmed it is still quite sluggish by modern standards. However, this is a comparison test, and the Dell X1 surprised me by feeling more responsive than the five-years-younger HP Mini. Given that both use spinning-platter hard drives and have 1GB of RAM, I thought the difference was probably down to their CPUs. The Latitude X1 had an ULV (ultra low voltage) Pentium M 733 processor, a premium product showcasing the most processing power Intel could deliver while sipping gently on battery power. In comparison, the HP Mini had an Atom processor, an entry-level product optimized for cost. A spec sheet comparison shows how closely an entry-level CPU matches up to a premium CPU from five years earlier, but the Atom has only one quarter of the CPU cache, and I think that was the decisive difference.

Despite its constrained cache, the Atom had two cores and a thermal design power (TDP) of just 2.5W. In contrast, the Pentium M 733 ULV had only a single core and a TDP of 5W. Twice the cores at half the electrical power: the younger CPU is far more power efficient. And it’s not just the CPU, it’s the whole machine. Whereas the HP Mini 110 needed only 7.5W to display the ESA ISS Tracker, the Latitude X1 reports drawing more than double that: a little over 17W, according to upower. An aged battery, which has degraded to 43% of its original capacity, could only support that for about 40 minutes.

Device: /org/freedesktop/UPower/devices/battery_BAT0
  native-path:          BAT0
  vendor:               Sanyo
  model:                DELL T61376
  serial:               161
  power supply:         yes
  updated:              Thu 23 Apr 2020 06:19:06 PM PDT (69 seconds ago)
  has history:          yes
  has statistics:       yes
  battery
    present:             yes
    rechargeable:        yes
    state:               discharging
    warning-level:       none
    energy:              11.4663 Wh
    energy-empty:        0 Wh
    energy-full:         11.4663 Wh
    energy-full-design:  26.64 Wh
    energy-rate:         17.2605 W
    voltage:             12.474 V
    time to empty:       39.9 minutes
    percentage:          100%
    capacity:            43.0417%
    technology:          lithium-ion
    icon-name:          'battery-full-symbolic'
  History (rate):
    1587691145 17.261 discharging

Putting a computer to work showing the ESA tracker uses only its display; it doesn’t involve the keyboard at all. Such information-consumption tasks are served just as well by touchscreen devices, and I have a few to try, starting with an Amazon Kindle Fire HD 7.

Dell Latitude E6230: Working Too Well To Be Dismembered, NUCC to the Rescue

The previous few blog posts about my refurbished Dell Latitude E6230 were written several months ago and had sat waiting for a long-term verdict. After several months of use I’m now comfortable proclaiming it to be a very nice little laptop. Small, lightweight, good battery life, and decently high performance when I need it. (At the cost of battery life when doing so, naturally.)

The heart of this machine is a third generation Intel Core i5, which covers the majority of computing needs I’ve had while away from my desk, from basics like 64-bit software capability to its ability to speed itself up and tackle bigger workloads. When working away from a wall plug and running on battery, the E6230 slows only minimally, unlike my much newer Inspiron 7577, which slows so drastically on battery that it occasionally feels slower than the E6230. I can run my 7577 for perhaps two to four hours on battery, never far from a reminder of its limited on-battery performance, whereas I can run the E6230 for around four to six hours on battery without feeling constrained by reduced performance.

The E6230 has several other features I felt would be good for a robot brain. Top of the list is an Ethernet port for reliable communication in crowded RF environments. Several “SuperSpeed” USB 3 ports are useful for interfacing with hardware. And when I want more screen real estate than the built-in screen can offer, I have my choice of VGA or HDMI video output.

That built-in screen, with its minimal 1366×768 resolution, is about the only thing standing between this machine and greatness. Originally I did not care, because I had planned to tear the case apart and embed just the motherboard in a robot. But this laptop is working too darned well to be subjected to that fate! For the near future I plan to continue using the E6230 as a small laptop for computing on the go, and to keep my eyes open for other old laptops as robot brain candidates.

An opportunity arose at Sparklecon 2020, when I mentioned this project idea to NUCC. They had a cabinet of laptops retired for one reason or another. I was asked “What do you need?” and I said the ideal candidate would be a laptop with a broken screen and/or damaged keyboard, and at least a Core i3 processor.

We didn’t find my ideal candidate, but I did get to bring home three machines for investigation, each representing a single criterion: one with a busted screen, one with a busted keyboard, and one with a Core i3 processor.

Close enough! And now it’s time for me to get to work on a research project: determine what condition these machines are in, and how they can be best put to use.

Dell Latitude E6230: Blank ExpressCard Placeholder Is Also A Ruler

I found a fun little design while looking over the refurbished laptop I had bought. It was a Dell Latitude E6230, which had an ExpressCard slot. I’ve never used a laptop in a way that required add-on hardware. No PCMCIA, no ExpressCard, etc. Few of my laptops even had provisions for an expansion slot. But I remembered one of them — an old Dell XPS M1330 — included a little bit of creativity. Rather than the typical blank piece of plastic placeholder, the expansion slot held an infrared remote control with simple media buttons like “Play”, “Pause”, etc. This lets people use the little laptop as a media player where they can sit back away from the keyboard and still be able to control playback.

This laptop is from Dell’s business-oriented Latitude line, so such entertainment-oriented accessories would not be in keeping with its product positioning. But I was curious whether it had more than just a blank piece of plastic placeholder, so even though I had no ExpressCard to install, I popped out the blank to take a look. I was happy to see that someone put some thought into the design: the blank plate is a small ruler with both inch and millimeter markings.

This feature cost them very little to implement, and it would never be the make-or-break deciding factor when choosing the laptop, but it was a fun touch.

Dell Latitude E6230: Soft Touch Plastic Did Not Age Well

When I looked over the exterior of my refurbished Dell Latitude E6230 laptop, I noticed that some common touch points, namely the wrist rest and touch pad, had been covered with stickers. They were very well done on my example; it took me a while to realize they were even there. In use, they were not bothersome.

Initially I thought they were there to cover up signs of wear and tear on this refurbished machine, but I’ve realized there’s an additional and possibly more important reason for the stickers: the plastic material of the wrist rest has degraded.

Usually when plastic degrades it hardens or discolors, but for certain types of plastic, the breakdown results in a sticky surface that is unpleasant to touch. I usually see this in the flexible plastic shroud for old cables and not in rigid installations like a keyboard wrist rest. I assume these machines were originally built with some type of soft touch plastic which degraded in this very unpleasant manner.

I wonder what the production story behind this laptop is. I can think of a few possibilities right away and I’m sure there are more:

  1. Dell did not perform long term testing on this material and didn’t know it would degrade this way.
  2. Dell performed testing, but the methodology for accelerated aging didn’t trigger this behavior, so it didn’t show up in the tests.
  3. Dell was aware of this behavior, believed it would not occur until well after the warranty period, and thus decided it was not their problem.

The expensive way to solve this problem would be to re-cast the plastic wrist rest in a different material and replace the part. Covering just the important surfaces with stickers is an ingeniously inexpensive workaround. Once the stickers were installed, I wouldn’t have to touch the unpleasant surfaces in normal use. However, there are still some sections exposed around the keyboard, and the sticky material is now a dust magnet.

It is a flaw in this capable little machine, but one I can tolerate thanks to the stickers. It made the laptop cheap to buy refurbished, and it makes me less reluctant to take the computer apart and embed it in a robot, which is one of the long term plans for this machine.

Dell Latitude E6230: Hardware Internals

I picked up a Core i5-powered Dell Latitude E6230. It was a refurbished item at Fry’s Electronics, on sale for $149, and that was too tempting of a bargain to pass up. There were two major downsides to the machine: a low resolution 1366×768 display that I couldn’t do anything about, and a spinning magnetic platter hard drive that I intend to upgrade.

As is typical of Dell, a service manual is available online, and I consulted it before purchasing to verify this chassis uses a standard laptop form factor SATA drive for storage. (Unlike the last compact Dell I bought.) Once I got it home, it was easy to work on this machine, designed to be easily serviceable as are most Latitudes. A single screw releases the back cover, and the HDD was held down by two more screws. With only three screws and two plastic modules to deal with, this SSD upgrade took less than five minutes to complete.

But since I had it open anyway, I spent some more time looking around inside to see signs of this laptop’s prior life.

Dell Latitude E6230 interior debris

There were a few curious pieces of debris inside. A piece of tape that presumably held down a segment of wire had come loose, and its adhesive was no longer sticky, consistent with aged tape. There was also a loose piece of clear plastic next to the tape. I removed both.

The CPU fan had only a fine, insignificant layer of dust clinging to its surface. I would have expected an old laptop to have picked up more dirt than this. Either the buildup had been cleaned up (and the cleaner ignored the tape and clear plastic) or, more likely, this laptop spent most of its time in an office HVAC environment with well-maintained dust filtration.

The HDD that I removed was advertised to have a copy of Windows 10. But where is the license? Computers of this vintage may have their Windows license embedded in hardware, though this is less likely for business-line machines, as some businesses have their own site license for Windows. I installed Windows 10 on the SSD and checked its licensing state: not activated. So the Windows 10 license is on that HDD and not in hardware. That’s fine; I intended to run Ubuntu on this one anyway, so I installed Ubuntu 18.04 over the non-activated Windows 10.
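
In hindsight, there was a quicker way to check for a hardware-embedded license without installing Windows at all. When an OEM key is baked into the firmware, it lives in an ACPI table named MSDM, which Linux exposes under /sys/firmware/acpi/tables. Something along these lines should reveal it (strings comes from the binutils package):

$ sudo strings /sys/firmware/acpi/tables/MSDM

If that file doesn’t exist, there is no firmware-embedded key, which would be consistent with what I observed here.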

Once Ubuntu 18.04 was up and running, this machine proved quite capable. All features appear to be usable under Ubuntu, and it is easily faster than my Inspiron 11 3180 across the board. It is a bit heavier, but much of that is the extended battery, which might be worth the tradeoff.

Overall, a very good deal for $149 and my new ROS robot brain candidate.