Reboot After Network Watchdog Timer Fires

My Dell Inspiron 7577 is not happy running Proxmox VE. For reasons I don’t yet understand, its onboard Ethernet would quit at unpredictable times. [UPDATE: Network connectivity stabilized after a Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The hack described in this post is no longer necessary.] After running dmesg to see the error messages logged on the system, I searched online and found a few Linux kernel flags to try as potential workarounds. None of them have helped keep the system online. So now I’m falling back to an ugly hack: rebooting the system after it falls offline.

My first session stayed online for 36 hours, so my first attempt at this workaround was to reboot the system once a day in the middle of the night. That wasn’t good enough because the network frequently failed much sooner than 24 hours. The worst case I’ve observed so far was about 90 minutes. Unless I wanted to reboot every half hour or something similarly ridiculous, I needed to react to system state and not a timer.

In the Proxmox forum thread I read, one of the members said they wrote a script to ping Google at regular intervals and reboot the system if that should fail. I started thinking about doing the same for myself but wanted to narrow down the variables. I don’t want my machine to reboot if there’s been a network hiccup at a Google datacenter, or at my ISP, or even when I’m rebooting my own router. This is a local issue and I want to keep my scope local.

So instead of running ping, I decided to base my decision on what I’ve found so far. I don’t know why the Ethernet networking stack fails, but when it does, I know a network watchdog timer fires and logs a message into the system log. Reading about this logging system, I learned it is called the journal and can be accessed and queried using the command-line tool journalctl. Reading about its options, I wrote a small shell script I named /root/watch_watchdog.sh:

#!/usr/bin/bash
if /usr/bin/journalctl --boot --grep="NETDEV WATCHDOG"
then
  /usr/sbin/reboot
fi

Every executable (bash, journalctl, and reboot) is specified with its full path because I’ve had problems with the execution context (such as PATH) of bash scripts run as cron jobs. My workaround, which I decided was also good security practice, is to fully qualify each binary.
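
If I’m unsure where a given binary lives on a particular system, a quick check is to ask which (or the shell builtin command -v) for each one, run as root so that /usr/sbin is on the search path:

which bash
which journalctl
which reboot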

The --boot parameter restricts the query to the current running system boot, ignoring messages from before the most recent reboot.

The --grep="NETDEV WATCHDOG" parameter looks for the network watchdog error message. I thought to restrict it to exactly the message I saw: "kernel: NETDEV WATCHDOG: enp59s0 (r8169): transmit queue 0 timed out" but using that whole string returned no entries. Maybe the symbols (the colon? the parentheses?) caused a problem; in hindsight, --grep treats its pattern as a regular expression, so the parentheses would be interpreted as grouping metacharacters rather than literal characters, which could explain the empty result. Backing off, I found just "NETDEV" is too broad because there are other networking messages in the log. Just "WATCHDOG" is also too broad given unrelated watchdogs on the system. Using "NETDEV WATCHDOG" is fine so far, but I may need to make it more specific later if it proves too broad.

The most important part of this is the exit code of journalctl. It is zero (success) when the query finds matching messages, and nonzero when no entries are found. This exit code is what the "if" statement uses to decide whether to reboot the system.
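
A quick way to see this behavior is to run the query by hand and check the exit code, discarding the output since only the exit code matters here:

/usr/bin/journalctl --boot --grep="NETDEV WATCHDOG" > /dev/null
echo $?    # 0 if matching entries were found, 1 if none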

Once the shell script file was in place and made executable with chmod +x /root/watch_watchdog.sh, I could add it to the cron job table by running crontab -e. I started by running this script once an hour, at the top of the hour.

0 * * * * /root/watch_watchdog.sh

But then I thought: what’s the downside to running it more frequently? I couldn’t think of anything, so I switched to running it once every five minutes. (I learned the pattern syntax from Crontab guru.) If I learn a reason not to run this so often, I will reduce the frequency.

*/5 * * * * /root/watch_watchdog.sh

This ensures network outages due to the Realtek Ethernet issue last no longer than five minutes. This is a vast improvement over what I had until now, which was waiting until I noticed the 7577 had dropped off the network (which might take hours), pulling it off the shelf, logging in locally, and typing “reboot”. Now this script will do it within five minutes of the watchdog timer message. It’s a really ugly hack, but it’s something I can do today. Fixing this issue properly requires a lot more knowledge about Realtek network drivers, and that knowledge seems to be spread across multiple drivers.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable.”

Configuring Laptop for Proxmox VE

I’m migrating my light-duty server duties from my Dell Latitude E6230 to my Dell Inspiron 7577. When I started playing with the KVM hypervisor on the E6230, I installed Ubuntu Desktop instead of Server for two reasons: I didn’t know how to deal with the laptop screen, and I didn’t know how to work with KVM via the command line. But the experience allowed me to learn things I will incorporate into my 7577 configuration.

Dealing with the Screen

By default, Proxmox VE leaves a simple text prompt on screen, which is fine because most server hardware doesn’t even have a screen attached. On a laptop, keeping the screen on wastes power and probably causes long-term damage as well. I found an answer on the Proxmox forums:

  • Edit /etc/default/grub to add “consoleblank=30” (30 is the timeout in seconds) to GRUB_CMDLINE_LINUX if an entry already exists (see the example after this list). If not, add a single line: GRUB_CMDLINE_LINUX="consoleblank=30"
  • Run update-grub to apply this configuration.
  • Reboot
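
As a concrete example, the edited entry in /etc/default/grub might look like the line below before running update-grub. (The existing “quiet” value is just a stand-in for whatever is already on that line.)

# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX="quiet consoleblank=30"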

Another default behavior: when the laptop lid is closed, the laptop goes to sleep. I don’t want this behavior when I’m using it as a mini-server. I was surprised to learn the technique I found for Ubuntu Desktop works for the server edition as well: edit /etc/systemd/logind.conf and change HandleLidSwitch to ignore.
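
For reference, the relevant line in /etc/systemd/logind.conf ends up looking like this:

# /etc/systemd/logind.conf (excerpt)
HandleLidSwitch=ignore

The change takes effect after a reboot or, as I understand it, after restarting the login daemon with systemctl restart systemd-logind.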

Making the two changes above turns off my laptop screen after the set number of seconds of inactivity and leaves the computer running when the lid is closed.

Dealing with KVM

KVM is a big piece of software with lots of knobs. I was intimidated by the thought of learning all the command-line options and switches on my own. So, for my earlier experiment, I ran Virtual Machine Manager on the Ubuntu Desktop edition to keep my settings straight. I’ve learned bits and pieces of interacting with KVM via its virsh command-line tool, but I have yet to get comfortable enough with it to use the command line as my default interface.

Fortunately, many others felt similarly, and there are other ways to work with a KVM hypervisor. My personal data storage solution TrueNAS has moved from a FreeBSD-based system (now named TrueNAS CORE) to a Linux-based system (a parallel sibling product called TrueNAS SCALE). TrueNAS SCALE includes virtual machine capability with the KVM hypervisor, which looked pretty good. After a quick evaluation session, I decided I preferred working with KVM using Proxmox VE, a whole operating system built on top of Debian and dedicated to the job: hosting virtual machines with the KVM hypervisor, along with tools to monitor and manage those virtual machines. Instead of Virtual Machine Manager’s UI running on Ubuntu Desktop, both TrueNAS SCALE and Proxmox VE expose their UI as a browser-based interface accessible over the network.

I liked the idea of doing everything on a single server running TrueNAS SCALE, and may eventually move in that direction. But there is something to be said for keeping two isolated machines. I need my TrueNAS SCALE machine to be absolutely reliable, an appliance I can leave running its job of data storage. It can be argued it’s a good idea to use a different machine for more experimental things like ESPHome and Home Assistant Operating System. Besides, unlike normal people, I have plenty of PC hardware sitting around. Put some of it to work!

Notes on Automating Ubuntu Updates

I grew up when computers were major purchases with four digits on the price tag. As technology advanced, perfectly capable laptops could be found for three digits. That was a major psychological adjustment, and now I have another to make: today we can get a full-fledged PC (new or used) for well under a hundred bucks. Affordable enough that we can set up these general-purpose machines for a single specialized role and leave them alone.

I’ve had a few Raspberry Pi boards around the house running specialized tasks like OctoPi and a TrueNAS replication target, and I’ve always known that I’ve been slacking off on keeping those systems updated. Security researchers and malicious actors are in a never-ending game to one-up each other, and it’s important to keep up with security updates. The good news is that Ubuntu distributions come with an automated update mechanism called unattended-upgrades, so many security patches are automatically applied. However, its default settings cover only critical security updates, and some of those need a system reboot before taking effect. This is because Ubuntu chose defaults that are the least disruptive to actively used computers.
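
To confirm the mechanism is actually present and scheduled on a given machine, something like the following works (package and unit names as I understand them on Ubuntu):

apt policy unattended-upgrades                  # confirm the package is installed
systemctl list-timers apt-daily-upgrade.timer   # confirm the timer that triggers it is scheduled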

But what about task-specific machines that see infrequent user logins? We can configure unattended-upgrades to be more aggressive. I went searching for more information and found a lot of coverage on this topic. I chose to start with the very old and frequently viewed AskUbuntu thread “How do I enable automatic updates?” The top two answer links lead to the “AutomaticSecurityUpdates” page on help.ubuntu.com, and to “Automatic updates” in the Ubuntu Server package management documentation. Browsing beyond official Ubuntu resources, I found “How to Install & Configure Unattended-Upgrades on Ubuntu 20.04” on LinuxCapable.com to be a pretty good overview.

For my specific situation, the highlights are:

  • Configuration file is at /etc/apt/apt.conf.d/50unattended-upgrades
  • Look at the Allowed-Origins entry up top. The line that ends with “-security” is active (as expected) and the line that ends with “-updates” is not. Uncomment that line to automatically pick up all updates, not just critical security fixes (see the excerpt after this list).
  • In order to pick up fixes that require a reboot, let unattended-upgrades reboot the machine as needed by setting “Unattended-Upgrade::Automatic-Reboot” to “true”.
  • (Optional) For computers that sleep most of the day, we may need to add an entry to the root cron job table (sudo crontab -e) to run /usr/bin/unattended-upgrade at a specified time within the machine’s waking window (see the example after this list).
  • (Optional) There are several lines about automatically cleaning up unused packages and dependencies. Setting them to “true” will reduce the chances of filling up our disk.
  • Log files are written to directory /var/log/unattended-upgrades
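
Putting those highlights together, the relevant portion of my 50unattended-upgrades ends up looking roughly like this (an illustrative excerpt, not the complete file):

// /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
Unattended-Upgrade::Allowed-Origins {
        "${distro_id}:${distro_codename}-security";
        "${distro_id}:${distro_codename}-updates";    // uncommented to pick up all updates
};
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
Unattended-Upgrade::Automatic-Reboot "true";

And for a machine that sleeps most of the day, the optional cron entry mentioned above might look like this in root’s crontab, with 9:30 AM as a placeholder for whenever the machine is awake:

30 9 * * * /usr/bin/unattended-upgrade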

Linux Shell Control of Sleep and Wake

I’ve extracted the 3.5″ SATA HDD from a Seagate Backup+ Hub external USB hard drive and installed it internally into a desktop PC tower case. I configured the PC as a TrueNAS replication target so it will keep a backup copy of data stored on my TrueNAS array. I couldn’t figure out how to make it “take over” or “continue” the existing replication set created on this disk by the Raspberry Pi, so I created an entirely new ZFS dataset instead. It’s a backup anyway and I have plenty of space.

But replication only happens once a day for a few minutes, and I didn’t want to keep the PC running around the clock. I had automated my Raspberry Pi’s power supply via Home Assistant, but that complexity is unnecessary for a modern PC, which has the low-power sleep mode capability missing from a (stock) Raspberry Pi. I just needed to figure out how to access that capability from the command line, and I found an answer with rtcwake and crontab.

rtcwake

There are many power-saving sleep modes available in the PC ecosystem, not all of which run seamlessly under Linux, as each requires some level of hardware and/or software driver support. Running rtcwake --list-modes is supposed to show what’s applicable to a piece of hardware. However, I found that even though “disk” (hibernate to disk) is listed, my attempt to use it merely caused the system to become unresponsive without going to sleep. (I had to reset the system.) I then tried “mem” (suspend the system, keep power only to memory) and that seemed to work as expected. Your mileage will vary depending on hardware. I can tell my computer to sleep until 11:55PM with:

sudo rtcwake --mode mem --date 23:55
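
rtcwake also has a dry-run option that goes through the motions, including printing the computed wake time, without actually suspending. That seems handy for sanity-checking a schedule before committing to it:

sudo rtcwake --dry-run --mode mem --date 23:55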

hwclock

The rtcwake command above allowed me to put the computer to sleep and schedule a wake for five minutes before midnight. On my machine, it displayed the target time and went to sleep. But the listed target time was not 23:55! I thought I did something wrong, but after a bit of poking around I realized I didn’t. I wanted 23:55 in my local time, and Ubuntu had set my PC’s internal hardware clock to the UTC time zone. The listed target time was relative to the hardware clock’s UTC time. To see the current local time zone, we can run timedatectl. To see the current hardware clock, we can run this command:

sudo hwclock --show --verbose

I wasn’t surprised that putting the computer to sleep required “sudo” privileges, but I was surprised to see that hwclock needed that privilege as well. Why is reading the hardware clock important to protect? I don’t know. Sure, I can understand setting the clock may require privileges, but reading? timedatectl didn’t require sudo privileges to read. So hwclock’s requirement was a surprise.

ssh

Another consequence of running rtcwake from an ssh session is that a sleeping computer would leave my ssh prompt hanging. It will eventually time out with “broken pipe”, but if I want to hurry that along, there’s a special key sequence to terminate an ssh session immediately: press the <Enter> key, then type ~. (tilde followed by period).

crontab

But I didn’t really want to run the command manually anyway; I wanted to automate that part as well. To schedule a job to execute that command at a specific time and interval, I added it to the cron job table. Since I need root privileges to run rtcwake, I had to add this line to the root user’s table with “sudo crontab -e”:

10 0 * * * rtcwake --mode mem --date 23:55

The first number is minutes, the next number hours. “10 0” means to run this command ten minutes after midnight, which should be long enough for TrueNAS replication to complete. The three asterisks mean every day of the month, every month, every day of the week. So “10 0 * * *” translates to “ten minutes after midnight every day”, putting this PC to sleep until five minutes before midnight. I chose five minutes because it should be more than long enough for the machine to become visible on the network for TrueNAS replication. When this all works as intended (there have been hiccups I haven’t diagnosed yet), this PC, which usually sits unused, would be awake for only fifteen minutes a day instead of wasting power around the clock.

Notes from ZFS Adventures for TrueNAS Replication

My collection of old small SSDs played a game of musical chairs to free up a drive for my TrueNAS replication machine, and the process was an opportunity for hands-on time with some Linux disk administration tools. Now that I have my system drive up and running on Ubuntu Server 22.04 LTS, it’s time to wade into the land of ZFS again. It’s been long enough that I had to refer to documentation to rediscover what I needed to do, so I’m taking down these notes for when I need to do it again.

Installation

ZFS tools are not installed by default on Ubuntu 22.04. There seem to be two separate packages for ZFS. I don’t understand the tradeoffs between the two options, so I chose to sudo apt install zfsutils-linux because that’s what Ubuntu’s ZFS tutorial used.

Creation

Since my drive was already set up to be a replication storage drive, I didn’t have to create a new ZFS pool from scratch. If I did, though, here are the steps (excerpts from the Ubuntu tutorial linked above):

  • Either “fdisk -l” or “lsblk” to list all the storage devices attached to the machine.
  • Find the target device name (example: /dev/sdb) and choose a pool name (example: myzfs)
  • “zpool create myzfs /dev/sdb” would create a new storage pool with a single device. Many ZFS advantages require multiple disks, but for TrueNAS replication I’m just writing to a single drive.

Once a pool exists, we need to create our first dataset on that pool.

  • “zfs create myzfs/myset” to create a dataset “myset” on pool “myzfs”
  • Optional: “zfs set compression=lz4 myzfs/myset” to enable LZ4 compression on the specified dataset.
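
To double-check the results, zfs can list what exists and read the property back:

zfs list -r myzfs                  # confirm the dataset exists on the pool
zfs get compression myzfs/myset    # confirm the compression setting took effect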

Maintenance

  • “zpool scrub myzfs” to check the integrity of data on disk. With a single drive it wouldn’t be possible to automatically repair any errors, but at least we would know that problems exist (see the status check after this list).
  • “zpool export myzfs” is the closest thing I found to “ejecting” a ZFS pool. Ideally, we do this before we move a pool to another machine.
  • “zpool import myzfs” brings an existing ZFS pool onto the system. Ideally this pool had been “export”-ed from the previous machine, but as I found out when my USB enclosure died, this was not strictly required. I was able to import it into my new replication machine. (I don’t know what risks I took when I failed to export.)
  • “zfs list -t snapshot” to show all ZFS snapshots on record.
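
A scrub runs in the background, so checking on its progress and results is a matter of asking for the pool’s status:

zpool status -v myzfs    # shows scrub progress/results and any errors found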

TrueNAS Replication

The big unknown for me is figuring out permissions for a non-root replication user. So far, I’ve only had luck doing this with the root account of the replication target, which is bad for many reasons. But every time I tried to use a non-root account, replication failed with the error umount: only root can use "--types" option

  • On TrueNAS: System/SSH Keypairs. “Add” to generate a new private/public key pair. Copy the public key.
  • On replication target: add that public key to /root/.ssh/authorized_keys (see the sketch after this list).
  • On TrueNAS: System/SSH Connections. “Add” to create a new connection. Enter a name and IP address, and select the keypair generated earlier. Click “Discover Remote Host Key”, which is our first test to see if SSH is set up correctly.
  • On TrueNAS: Tasks/Replication Tasks. “Add” to create a replication job using the newly created SSH connection to push replication data to the ZFS dataset we just created.
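
For the authorized_keys step, here is a minimal sketch of that setup on the replication target; the key string is a placeholder for the public key copied out of TrueNAS:

# on the replication target, as root
mkdir -p /root/.ssh && chmod 700 /root/.ssh
echo "ssh-rsa AAAA...placeholder... truenas-replication" >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys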

Monitor Disk Activity

The problem with an automated task going directly to root is that I couldn’t tell what (if anything) was happening. There are several Linux tools to monitor disk activity. I first tried “iotop” but was unhappy with the fact that it requires admin privileges and that this is not considered a bug. (“Please stop opening bugs on this.”) Looking for an alternative, I found this list and decided dstat was the best fit for my needs. It is not installed on Ubuntu Server by default, but I could run sudo apt install pcp to install it, followed by dstat -cd --disk-util --disk-tps to see the activity level of all disks.

Notes on Linux Disk Tools

I am setting up an old PC as a TrueNAS replication target to back up data on my drive array. Fitting a modern SSD into the box was only part of the challenge: I needed an SSD to put in it. This is a problem easily solved with money, because I don’t need a big system drive for this task and we live in an era of 256GB SSDs on sale for under $20. But where’s the fun in that? I already have some old and small SSDs; I just need to play a bit of musical chairs to free one up.

These small drives are running various machines in my hoard of old PC hardware: 64-bit-capable machines run Ubuntu LTS, and 32-bit-only hardware runs Raspberry Pi Desktop. Historically they were quite… disposable, in the sense that I usually wipe the system and start fresh whenever I want to repurpose them. This time is different: one of them is currently a print server, turning my old Canon imageCLASS D550 laser printer into a network-connected printer. Getting Canon’s Linux driver up and running on this old printer was a miserable experience. Canon has since updated the imageCLASS D550 Linux driver, so things might be better now, but I didn’t want to risk repeating that experience. Instead of wiping a disk and starting fresh, I took this as an opportunity to learn and practice Linux disk administration tools.

Clonezilla

My first attempt used Clonezilla Live to move my print server from one drive to another. This failed with errors that scrolled by too fast for me to read. I rediscovered the “Scroll Lock” key on my keyboard to pause the scrolling text so I could read the errors: partition table information was expected by one stage of the tool but was missing from a file created by an earlier stage. I have no idea how to resolve that. Time to try something else.

dd

I decided it was long overdue for me to learn and practice using the Linux disk tool dd. My primary reference is the Arch Linux Wiki page for dd. It’s a powerful tool with many options, but I didn’t need anything fancy for my introduction. I just wanted to directly copy from one drive to another (larger) drive. To list all of my installed storage drives, I knew about fdisk -l, but this time I also learned of lsblk, which doesn’t require entering the root password before listing all block storage device names and their capacities. Once I figured out the name of the source (/dev/sdc) and the destination (/dev/sde), I could perform a direct copy:

sudo dd if=/dev/sdc of=/dev/sde bs=512K status=progress

The “bs” parameter is “block size” and apparently the ideal value varies depending on hardware capabilities. But it defaults to 512 bytes for historical reasons and that’s apparently far too small for modern hardware. I bumped it up several orders of magnitude to 512 kilobytes without really understanding the tradeoffs involved. “status=progress” prints the occasional status report so I know the process is ongoing, as it can take some time to complete.
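
There isn’t a single universally correct value, but bs is usually chosen as a multiple of the drives’ sector sizes. One way to peek at those (column names as I understand lsblk’s options) is:

lsblk --output NAME,PHY-SEC,LOG-SEC,SIZE /dev/sdc /dev/sde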

gparted

After the successful copy, I wanted to extend the partition so my print server could take advantage of the new space. Resizing the partition with Ubuntu’s “disks” app failed with an error message “Unable to satisfy all constraints on the partition.” Fortunately, gparted had no such complaints, and my print server was back up and running with more elbow room.

Back to dd

Before I erase the smaller drive, though, I thought I would try making a disk image backup of it. If the Canon driver installation had been painless, I would not have bothered: in case of SSD failure, I would replace the drive, reinstall Ubuntu, and set up a new print server. But the Canon driver installation was painful, and I wanted an image to restore if needed. I went looking for how to create a disk image, and in the Linux world of “everything is a file” I was not too surprised to find it’s a matter of using a file name (~/canonserver.img) instead of a device name (/dev/sde) for dd output.

sudo dd if=/dev/sdc of=~/canonserver.img bs=512K status=progress

gzip and xz

But that raw disk image file is rather large: exactly the size of the source drive (80GB in my case). To compress this data, the Arch Linux Wiki page on dd had examples of how to pipe dd output into gzip for compression. Following those directions worked fine, but I noticed Ubuntu’s “disks” app natively recognized .img.xz as a compressed disk image file format but not .img.gz. Looking into that xz suffix, I learned xz is a different compression tool analogous to gzip, and I could generate my own .img.xz image by piping dd output into xz, which in turn writes its output to a file, with the following command:

sudo dd if=/dev/sdc bs=512K status=progress | xz --compress -9 --block-size=100MiB -T4 > ~/canonserver.img.xz

I used the xz parameter “-9” for maximum compression. “-T4” means spinning up four threads to work in parallel, as I was running this on a quad-core processor. “--block-size=100MiB” is how big a chunk of data each thread receives to work on.

I used a spinning-platter HDD as a test target and verified that restoring this compressed image worked. Now I need to move this file to my TrueNAS array for backup, kind of bringing the project full circle. At 20GB, it is far smaller than the raw 80GB file but still nontrivial to move.
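
For reference, restoring such an image is the same pipe run in reverse, with xz decompressing to standard output and dd writing to the target drive. The /dev/sdX below is a placeholder; triple-check the device name, since dd will happily overwrite whatever it is pointed at.

xz --decompress --stdout ~/canonserver.img.xz | sudo dd of=/dev/sdX bs=512K status=progress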

gio

I tried to mount my TrueNAS SMB shares as CIFS but kept running into errors. It would mount, and I could read files; I just couldn’t write any. After several failures I started looking for an alternative and found gio.

gio mount --anonymous "smb://servername/sharename"
gio copy --progress ~/canonserver.img.xz "smb://servername/sharename/canonserver.img.xz"

OK, that worked, but what did I just use? The name “gio” is far too generic. My first search hit was a “Cross-Platform GUI for Go”, which is definitely wrong. My second hit, “GNOME Input/Output”, might be correct or at least related. As a beginner, this is all very fuzzy; perhaps it’ll get better with practice. For today I have an operating system disk up and running, so I can work on my ZFS data storage drive.

Acer Aspire Switch is Linux Unfriendly

Now that the hardware of an Acer SW5-012 (Aspire Switch 10) is back up and running, the focus turns to software. Windows 8 is installed but locked with passwords I don’t have. I didn’t care much for Windows 8 anyway, and whatever data exists is not mine to recover. So – a clean wipe is in order.

As with the Latitude X1, my first thought was to turn this little old machine into an almost-Chromebook with Neverware CloudReady. And just like with the Latitude X1, the attempt was foiled. The Latitude X1 was too old and did not support some processor features required by CloudReady. The Acer problem is just the opposite – the hardware is too new and deliberately blocks the installation.

The blocking mechanism is Secure Boot, which according to its own web site is a “security standard developed by members of the PC industry to help make sure that a device boots using only software that is trusted by the Original Equipment Manufacturer.” I would describe it with different terms. Either way, trying to install CloudReady – or a Linux distribution – results in the error screen “Secure Boot Error”.

Intentional or not, this puts the Acer in a bad state. It gets stuck neither fully on nor off, with the screen dark but still burning battery power and making itself warm. I had to disassemble the computer again and disconnect the battery from the main circuit board in order to reboot the machine.

In theory Secure Boot can be disabled, but various efforts by other people on the internet indicated this isn’t straightforward. I certainly had no better luck when I tried it: I could see the menu option, and I could change it from black on white (disabled) to white on gray (enabled) by creating an admin password, but I couldn’t figure out how to actually change the Secure Boot mode out of “Standard”.

Acer Secure Boot Menu

And it might not even be worth the effort, as forum traffic indicates there is very poor Linux driver support for this class of hardware. Probably related to the secure boot barrier but either way I’m giving up. I’ll stay with Windows on this machine.

Windows Subsystem Returns for Linux

One of the newest features in Windows 10 is the “Windows Subsystem for Linux” (WSL), allowing a limited set of Linux binaries to run on the latest 64-bit edition of Windows 10. It may be a sign of open-source friendliness under the new Microsoft CEO, but for trivia’s sake: it is not a new concept.

The lineage for Windows 10 traces all the way back to Windows NT, built in the early 1990s as a heavier-duty operating system (or according to some, “a real operating system”) to move upscale relative to the existing DOS-based Windows (“not a real operating system”). As consumer-level hardware grew more capable, the old DOS core was phased out and the NT kernel took over. Windows 2000 was the modest start, followed by the successful Windows XP.

But back when Windows NT launched, it was intended to capture the business, enterprise, and government markets with higher margins than the consumer market. At the time, one requirement to compete for government contracts was support for POSIX, an IEEE-defined standard derived from Unix. The software architects for Windows NT built a modular design that supported multiple subsystems. In addition to the home-grown Microsoft Win32 subsystem and the POSIX subsystem to meet the government requirement, there was also a subsystem for IBM OS/2 to compete in enterprises that had invested in OS/2.

History showed those subsystems were barely, if anything, more than lip service. They were not used much and gradually faded away in later evolutions of the NT lineage.

But now, the concept returns.

Microsoft has a healthy and profitable market in desktop software development with Windows, but is only a marginal player in the web + cloud world. The people writing code there are more likely to be using a Linux workstation or a Macintosh with its BSD-derived macOS. In an attempt to make Windows more relevant to this world, Microsoft needs to provide access to the already entrenched tools.

So just like before, Microsoft is building a Linux subsystem for business-competitive reasons. But unlike the POSIX subsystem, they can’t get away with mere lip service to satisfy a checklist. It will actually need to be useful to gain meaningful traction.

The method of installation is a little odd – the supported Linux distributions are listed on the Microsoft Windows app store. But once we get past this square peg jammed in a round hole, it works well enough.

WSL is not a virtual machine or even a container. The Linux executables were not recompiled for Windows; they’re the exact same binaries. And they’re not isolated – they run side by side with the rest of Windows and have access to the same file system.

Personally, I’m using WSL so I can use the same git source control commands I learned while working in Ubuntu. I know GitHub has a Windows GUI and an associated command-line toolkit, but I expect running the Ubuntu git via WSL will work better with git hosts outside of GitHub (Bitbucket, Heroku, etc.).
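
As a small illustration of that shared file system: inside the WSL Ubuntu shell, the Windows C: drive shows up under /mnt/c, so the Ubuntu git can operate directly on a repository that lives on the Windows side. The path and remote name below are placeholders, not anything specific to my setup.

# inside the WSL Ubuntu shell; Windows drives are mounted under /mnt
cd /mnt/c/Users/example/projects/myrepo    # placeholder path to a repository on the C: drive
git status                                 # the stock Ubuntu git, operating on Windows-side files
git push bitbucket master                  # "bitbucket" is a placeholder remote name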

This is a good start. I hope WSL has a real life ahead to help Windows play well with others, and not fade away like its predecessors.