Initial Budibase Documentation Lessons

Going from Budibase “Quickstart” to a basic understanding of application structure was a bigger step than I had expected, but I think I’m there now. Here are some notable lessons from my first (and second) passes through Budibase documentation, with the caveat that I won’t know for sure I got this right until I get some hands-on app-building experience.

Blocks Are Not Components

When reading Budibase documentation on user interface elements, I came across terms like “Form”, “Form component”, and “Form block”. I thought they all referred to the same thing, that the words “block” and “component” were synonyms. I was wrong, and that caused a lot of confusion. A component in Budibase parlance is a single interface unit (for example, a text box) and a block is a predefined composition built from multiple components to address common scenarios (for example, a form block contains many text boxes).

The ability to “Eject block” was mentioned in “Quickstart” but I didn’t understand what it meant at the time. I had understood it as a way to break a block up into smaller pieces so I could access its internals for customization, which should be great for learning how it was built under the hood. But most of the time when I wanted to look inside something, I couldn’t find that “Eject” button. Eventually I figured out I had been trying to eject components, and that’s why I got nowhere.

Data Provider, Repeater, and Binding

I’ve learned that things are a little more complex than ‘database stores data, components display data, bindings connect them together.’ A few additional pieces are required. First is the fact that we rarely want everything in the database all at once. For tasks like filtering or pagination, there are “data provider” components that fetch a desired subset from the database. The subset is still a collection of data, so a “repeater” component is deployed as an enumerator. Child components of a repeater receive that data one row at a time, and that’s when UI components can use a binding to pick up the exact information they want. A hypothetical example:

  • Database: “Employees” table
  • Data provider: “Name” of first 50 employees sorted by name.
  • Repeater: for each of those employees…
  • Paragraph: display text of “name” binding.

When we use a Table block (like the auto-generated screens in the tutorial) these steps are handled automatically. But if we want to build a tailored UI, implementing these details becomes our job.

App State

Sometimes an app will need to convey data between parts outside of the standard data flow hierarchy (provider, binding, etc.). I was curious how that was done and hit “eject” on an existing block to see its implementation details. The answer is app state, an application-wide key/value store. Or in other words: global variables!


With these lessons in mind, I am ready to start building my Budibase app to work with my existing data on personal transactions. And wow, it’s a mess.

Budibase Documentation: Mind the Gap

I want to learn Budibase and I hope it will be a tool to solve my “Excel is not a database” situation with my personal finance tracking. Keeping that project in mind to anchor my study, I returned to Budibase documentation ready to learn more. They had an excellent “Overview” page, followed by an informative “Quickstart” project. Given those two precedents, I had expected “Quickstart” to be followed by a smooth ramp-up for beginners. Turns out I was overly optimistic. I found a significant gap between “Quickstart” and the knowledge needed to build my own Budibase app.

Going down the documentation index sidebar, “Quickstart” is followed by “Guides and resources”. This is a collection of solutions to common problems, sometimes called a recipe cookbook. I’m sure it’s a great resource, but for a beginner like myself coming right out of “Quickstart”, a cookbook is not what we need. Beginners lack the experience to understand the solutions being offered. Even worse, we don’t necessarily even understand the problems being solved. I suffered a similar problem when I encountered the Node-RED cookbook.

After the “Getting Started” section, Budibase documentation had sections titled “Data”, “Design”, and “Bindings” but I saw nothing telling a beginner how they tied together. I was undaunted (I’ve been in similar situations before) so I dove in to read those sections and learn what I could. There was a lot of cross-referencing in those sections (each had references to the other two) meaning it wouldn’t help to read the three sections in any particular order. I had to accept I wouldn’t understand everything on the first pass, read all three sections once, then return to read them all a second time. Only then did I feel I had a basic grasp of how a Budibase app works.

I think there’s room for improvement here. If my experience is representative, I could see potential Budibase users getting discouraged trying to jump over this post-“Quickstart” gap. In hindsight, I think a few sentences at the end of the “Quickstart” guide would have gone a long way to help. Here’s my first draft:


Budibase Data Flow

This tutorial covered the core data flow of every Budibase app, but a real app will do these things differently.

  • Obviously real Budibase apps will not use our sample data. Visit Introduction to data to see how Budibase can work with your data.
  • This tutorial used autogenerated screens to get started quickly with a generic table interface. A good application will tailor its interface to best display and interact with its underlying data. Data in design goes into more detail on how to do so.
  • Our autogenerated tables also handled linking data between database and user interface elements. These links are called bindings and Introduction to bindings covers how to define them.

I might submit this, or an evolution of it, as a proposed contribution after I get some Budibase experience to be confident this is actually correct. There’s a lot to learn.

Quick Survey of Budibase Foundation

I’m playing with Budibase on a self-hosted instance courtesy of Docker Compose. I didn’t edit the docker-compose.yaml file provided by Budibase, but I did open it up to look around and see what Budibase was built atop. Later I found answers within Budibase documentation, which included a page on its system architecture and other scattered information on how it was built. Most of these building blocks are things I might use in my own future projects, with one major exception.

CouchDB

I knew CouchDB was one of the “NoSQL” databases out there, but not much more than that. My own NoSQL database adventure was with MongoDB and I don’t yet know enough to tell when I might use one versus the other. Budibase uses CouchDB for its own internal operational data like user accounts. It is also available as the default store for client app data if I don’t want to connect to an external database. My to-do list includes better understanding the tradeoffs between those two paths.

Redis

Not everything involved in running an application needs to be written to disk. Sometimes an in-memory data storage system like Redis is the right tool for the job, because it is far faster than any disk. I’ve known Redis existed but never had a project that needed its capabilities. Budibase says it uses Redis as a high performance cache store.

MinIO

One thing that caught my eye about Budibase is its file-handling ability. Data entries aren’t limited to classic database types like dates, numbers, or text; I can also upload file attachments. I had guessed this was implemented using something like PostgreSQL’s binary data format, but I was wrong. Attachments are actually stored in an instance of MinIO, which I knew nothing about until now. A quick skim of its documentation suggests MinIO is a way to get much of the functionality of AWS S3 but on my own data storage instead of AWS datacenter hardware. Like CouchDB, MinIO is also used to store files for Budibase internal operation in addition to storing client app data.

Svelte

I was curious how Budibase built the HTML UI I had been looking at, and part of the answer is Svelte. This was on my “maybe look into later” list due to its connection with Vite, which I briefly looked at earlier. It’s all layers upon layers, and there appears to be yet another layer on top: SvelteKit. I didn’t see any mention of SvelteKit, though, so I think Budibase is using things at the Svelte level.

Adobe Spectrum

Another component of Budibase HTML UI is Adobe’s Spectrum design system. I don’t recall ever coming across Spectrum before, but in hindsight it makes sense Adobe would create its own interface design system. Several years ago I became infatuated with Google’s Material Design system, and it was a big motivation for me to learn Google’s Angular platform. Since then, Google’s corporate behavior has turned me from a fan into a skeptic, and I’m not inclined to further pursue either Angular or Material Design. Adobe’s Spectrum would be an alternative… if it weren’t for the fact I’m even less fond of Adobe’s business practices. The fact Budibase uses Adobe Spectrum is not a deal-breaker against Budibase, but I’m never going to use Spectrum for my own web development projects.


Speaking of my own projects, I need to pick something interesting I can use to learn Budibase as I try to turn an idea into reality. Fortunately I have no shortage of project ideas, and a long-standing “Excel is not a database” problem I’ve wanted to solve: personal finance tracking.

Self-Hosting Budibase (Docker Compose)

I went through Budibase’s quick start guide and it made a great first impression. I want to learn it and add it to my toolbox for times when I need to build a data web app in a hurry. The quick start used Budibase cloud hosting as an easy, low-friction way to kick the tires. That was good enough to get started, but before going further it was important that I switch to an instance of Budibase I run for my own use. Wary of the example set by AWS Honeycode, which Amazon shut down and took all user data offline with it, I consider it critically important that I can run my own local instance on data I control. This is a pretty common requirement, so Budibase offers many alternatives to their cloud-hosted solution. I clicked on the link for deploying via Docker Compose and it turned out to be really easy.

I didn’t expect it to be that easy. I am no stranger to install procedures that went awry and caused collateral damage elsewhere. I spun up a Proxmox virtual machine running Ubuntu Server 24.04 LTS for the exclusive purpose of experimenting with Budibase. If anything goes wrong in a VM, damage should be contained and I can easily erase everything and start over. I followed Budibase instructions to install Docker Compose, then downloaded their docker-compose.yaml file for deploying Budibase and a .env file for configuration variables. I didn’t edit their docker-compose.yaml file, just the .env file to set my own passwords. I then typed “docker compose up”, waited for all the startup and deployment procedures to run, then pointed my browser at the address for my Budibase virtual machine. I was greeted with the initial setup screen to create an account on my newly-created Budibase instance. No mysterious failures, no inscrutable error messages, excellent.

I repeated the quick start tutorial on my own instance, and it worked exactly as it did on the cloud-hosted instance. For the most part this is good, but not always. My first rude surprise was finding that Budibase app backup was locked off. Backup was clearly indicated as a Premium subscription tier feature, but I had thought that applied only to the cloud-hosted service. I was wrong: subscription features are locked away whether I’m running on their cloud hosting platform or on my own. If Budibase is truly and fully open source, I suppose I could clone their GitHub repository and build my own version without these locks. Even if I don’t want to do that, the fact I’m running with Docker Compose means I can back up my data via Docker volumes. So, yes, there are solutions, but this is proof that levers still exist for Budibase to restrict self-hosted instances.

This is a demerit against Budibase, but not a deal-breaker. After all, I’m putting up with much the same risk in other self-hosted services like Plex Media Server, though in that case I have an alternative, Jellyfin, lined up. For now it’ll just be a note here in my project notebook and I will continue exploring Budibase.

Budibase Quickstart Is A Great Tour

I want to investigate tools for quickly building data applications and thought Budibase was the most promising candidate. Since the main pitch is ease of getting something useful up and running, they have put together a quick start guide to serve as a lightning tour of product features. Their effort to make a great first impression certainly worked on me.

To minimize barriers to entry, the guide points us to their cloud-hosted service where we register to create our quick start app without worrying about setup or deployment. Then a sample database is only one click away. I examined the sample database and found a decently structured example with multiple tables and relationships between them. I had expected to see an easy single-table database and was impressed they gave us something more representative of real-world applications.

Once the sample database tables were in place, the tour takes us to the interface design screen. One of the default templates is an editable table that would be comfortably familiar to any spreadsheet user. Simple data navigation like sorting by a column or searching in a column works much the same way as well, except this editable table also has a basic ability to interact with related tables in a relational database.

If the goal is to take a small first step away from abusing Excel as a database, Budibase lives up to its promise. It really did take only a few minutes to set up a web app that presents an Excel-like interface backed by relational database capabilities, with web app access control and data validation on top. It’s a credible “Minimum Viable Product” and if that’s all somebody needs, they can stop right there and have a perfectly usable tool.

But of course that is only the beginning. The real power comes from building interfaces better tailored to the task at hand instead of a spreadsheet-like table. For data entry, Budibase offers the usual set of form elements (text entry, number entry) plus a few intriguing surprises (file attachment upload?) Playing around with these tools I got the distinct feeling this section of Budibase is aiming to surpass Google Forms for ease of use. [UPDATE: Yep, there’s a page in Budibase documentation dedicated to pitching itself against Google Forms.] I’m not enough of a Google Forms user to judge if Budibase has succeeded, but I see no reason why I need to go back to Google Forms.

This quick start guide ends with links to other sections in Budibase documentation for further exploration. I like what I see so far, and I want to continue exploring, but on my own self-hosted instance instead of their cloud-hosted one.

Window Shopping Budibase

Occasionally I’ve had project ideas that require a database to keep track of everything, and then I would shelve the idea because I didn’t want to spend the time setting one up. Or worse, I would try to do it with a spreadsheet and suffer all the ills of forcing a square peg into a round hole. I want an easier way to use databases. The good news is that I’m not the only one with this problem, and many solutions exist under the marketing umbrella of “low-code” or “no-code” web app platforms. Skimming available information, I think Budibase is worth a closer look.

Budibase is one of the “low-code” platforms and I understand this to mean I don’t have to write any code if my project stays within the lane of a supported usage pattern. But if I must go outside those boundaries, I have the option of inserting my own code to customize Budibase instead of being left on my own to re-implement everything from scratch. Sounds good.

On the business side, Budibase is a free open source project with source code hosted on GitHub. Compiled binaries for self-managed deployments are available in a few different ways, including Docker containers, which is how I’d test things out. Users are allowed to run their own apps on cloud infrastructure of their choice. However, the licensing agreement forbids reselling Budibase itself as a cloud-hosted service. That privilege is reserved exclusively for Budibase themselves and serves as an income stream alongside consultancy contracts to build apps for businesses. I can support this arrangement as a tradeoff between proprietary commercial software and totally free open-source projects without a reliable income stream. This way Budibase is less likely to shut down and, even if it does, I should be able to limp along on a self-hosted setup until I can find a replacement.

Which brings me to the final point: the database. For the sake of simplicity, some of these low-code/no-code app platforms are built around their own database and assume full control over that database from beginning to end. In some cases this puts my data into a black box I might not be able to port to another platform. In contrast, I can build a Budibase interface on top of an existing database. And that database will still be available independently if Budibase goes away, or if I just want to explore something else. I like having such options and the data security it implies.

I like what I see so far, more than enough for me to dive into Budibase documentation and learn how I can translate its theoretical benefits into reality, starting with an excellent quick start guide.

Many Options For Create/Read/Update/Delete (CRUD) Web Applications

In my quest to stop abusing Excel for database tasks, I returned to the world of web development. I remember studying web development for a while before I came across the term “CRUD app”. Though I might have encountered it even earlier and thought it was a derogatory insult instead of a descriptive acronym: Create, Read, Update, and Delete. These four operations neatly encapsulate all the fundamental functionality of a productive application (and even some entertainment ones, too). This set of data manipulation activities drove HTML design from the very start: HTML <form> exists to enter and update data, and HTML <table> exists to present data.

Corresponding to those fundamental HTML 1.0 concepts were server-side mechanisms that started with the Common Gateway Interface. Server-side infrastructure has evolved since those beginnings, just as browsers have. By the time I started my web development education, I had the luxury of platforms like Ruby on Rails, which offered a “scaffolding” mechanism to automatically generate CRUD infrastructure so I didn’t have to write my own from scratch. Other development platforms offer similar counterparts, but that’s not the only way to go. I have since come across many more options for building a CRUD web app.

The simplest and easiest way into this world is Google Forms (and its competitors). I frequently encounter it for surveys, registrations, and the like, which means I usually see the “Create” side of the app, though some forms allow me to return and “Update” my data. Whoever created the Google Form can then “Read” submitted responses and “Delete” them if needed. Google Forms makes creating a CRUD web app as easy as creating a Google Doc, and with better control over data than sharing a spreadsheet and telling people to add their data to a row.

I want to learn something quick and easy to use so I can use a database when it’s the right tool for the job, minimizing the hurdle of getting over the “ugh, I don’t want to spin up a database” hump. Google Forms is a very simple way to go, but it swings too far toward simplicity: the data is output to a spreadsheet (Google Sheets) instead of a database, perpetuating the “Excel is not a database” legacy. So I went looking for something between super-simple Google Forms and full web development platforms like Ruby on Rails.

What I found are products that advertise themselves as “no-code” or “low-code” web app tools. The first one I came across was Amazon’s Honeycode under their AWS umbrella, but that has since been shut down. (I was going to link to the shutdown announcement, but it was posted to the Honeycode site, which is now dead.) Keenly reminded of the perils of putting my data into an online service, I focused on solutions I could run at home and found Budibase as a promising candidate.

Browser Based Database Front End

I have added WebUSB to my ever-growing to-learn list, hoping it would enable some really cool project ideas. As indicated by its name, I would need to know both web development and USB development before I can make them work together towards my goals. While I work my way through my self-directed USB study syllabus, I will also spend some time reviewing web development.

I’m not a complete beginner with web development, but that world is broad and it evolves quickly. Every time I apply browser-based technology to a project idea, I find some of my knowledge is out of date. Plus I have to learn something new in an area I haven’t dealt with before. This is great! Approaching from a different direction every time helps me get a more well-rounded picture of the whole thing. This time I want to revisit building web apps that present a user interface for a database.

My motivation is twofold. One is directly related to my desire to use WebUSB. Since any WebUSB project will be dealing with data flowing in and out of an external source (the USB peripheral), I expect many JavaScript constructs to resemble those used to communicate with an external database. The second is that I’ve wanted to get better at applying database technology, so I can use it when it is the right tool for the job instead of, say, abusing Microsoft Excel. “Excel is not a Database” is an ongoing joke in the computer world. People do it because starting an Excel spreadsheet is far easier than setting up a database. Excel is “good enough” at small scales, but spirals into chaos as the data set grows. There’s no shortage of horror stories, ranging from Formula 1 race car construction to losing important healthcare data.

I want to avoid such disasters myself, because I’m definitely guilty of abusing Excel for poorly-suited problems. Some novel hacks turn out to be a delightful success; other times I find I really needed an actual database. I want it in my toolbox. This is something of a “return to where it all began” because my first web technology lesson many years ago came from The Ruby on Rails Tutorial. The main tutorial project was a web application interfacing with a database server. It’s a pattern common enough that I’ve since learned of an acronym for it: CRUD apps. And there are many ways for me to build one of my own.

Potential WebUSB Study Syllabus

I followed the steps of an Adafruit WebUSB example and established a connection between the Chrome browser on my Android phone and a microcontroller plugged into the phone’s USB port. I think WebUSB would enable many of my project ideas. But before I can turn any of my vague ideas into reality, I have a lot of homework to do.

The Adafruit example was thin on background information. I think it was written for people who already know how to work with the TinyUSB library and just wanted to see Adafruit’s adaptation into an Arduino library. This impression is backed up by the fact its GitHub repository README (https://github.com/adafruit/Adafruit_TinyUSB_Arduino) is written using vocabulary I don’t understand. Following the link to TinyUSB’s site gives me similar language, so I have to start climbing my learning curve from somewhere else.

Searching from the browser side, I found a Chrome developer documentation page on WebUSB. I was able to comprehend more of this page, but not all of it. Here I learned one constraint on WebUSB: web apps are only allowed to connect to USB devices the operating system doesn’t already have a driver for. This is a mitigation against malicious apps bypassing operating system protection for USB devices like security keys. It also avoids ambiguity/duplication with existing functionality. For example, there’s no real need for a web app to interface with a USB keyboard via WebUSB when the operating system can already deliver key press events. Though there’s an interesting wrinkle here around USB serial, a common way to connect to microcontroller projects. By this rule, a web app can’t connect to a USB serial device on my desktop via WebUSB because my operating system already knows how to work with a serial device. (So it’s over to Web Serial land.) But apparently Android lacks a built-in handler for serial, so maybe it’s available via WebUSB? At this point I don’t yet know if that’s an opportunity or just a source of confusion.

Fortunately for beginners like myself, the author of this Chrome developer documentation page included a link to USB in a NutShell for those unfamiliar with fundamental USB concepts. Hey, that’s me! I will try to start with USB in a NutShell and work my way back to TinyUSB and its various incarnations like Adafruit’s Arduino library. But that’s just the “USB” part; I have a lot to learn for the “Web” part as well.

Adafruit WebUSB Arduino Example

I knew WebUSB existed but not much more than the name implying some sort of USB capability for web applications. I was motivated to look into it a bit more after learning about an index of browser diagnostics tools. Initial inconclusive signs were not promising but I kept looking. After a bit I thought: I want to explore this capability for my electronics projects. Maybe somebody at Adafruit has already looked into this? Searching for WebUSB on Adafruit Learn gave me a hit: Using WebUSB with Arduino and TinyUSB. So yes, they have!

I was happy to see the hardware in this example was Adafruit’s Circuit Playground Express (3333), because I already have one on hand. When following an example for the first time, it’s always nice to have the exact same hardware that I know will work, rather than similar hardware that should work. Despite that advantage it was not smooth sailing. I got stuck when it came to changing my Arduino IDE’s Tools/USB Stack to “TinyUSB”: there wasn’t a “USB Stack” option under the “Tools” menu! I ran around in circles for a while before I eventually figured out I was using the wrong Arduino board support package. This example required “Adafruit SAMD Boards” and not “Arduino SAMD Boards”. I was thrown off because “Arduino SAMD Boards” includes support for a bunch of Arduino boards plus Adafruit’s Circuit Playground Express, so I was able to select the proper board without realizing I was in the wrong board support library. I don’t know why Arduino claims support for boards that aren’t theirs when the manufacturer has provided its own board support. It’s confusing, this is the second time it has bitten me, and I’m not happy.

Anyway, once I installed Adafruit’s board support library and selected Circuit Playground Express under Adafruit’s umbrella, I had a “USB Stack” option under “Tools” and could proceed to follow along with the example with no further issues. My first run used Chrome on my desktop computer, and after that success I tried it with Chrome on my Android phone. It works there, too!

And I can verify chrome://device-log is no longer empty on the phone, it now shows the newly-connected USB hardware.

This is huge! WebUSB might enable many project ideas that involve using one of my retired Android phones as the display (or more) of an electronics project. Which ones? I won’t know for sure until I learn more about the constraints of Android Chrome WebUSB support. I would have to pick a relatively simple one as a starting point before jumping into the more complex ideas. There’s a lot of study ahead. This Adafruit example was unfortunately lacking in background and theory of WebUSB, so I’m on my own. I think it was written for people who already have the appropriate background, and that’s not me. Well, not yet. I need a refresher course on web development, and I will need to learn technical details of USB as well.

Android Chrome Device Log Strangely Empty

Learning about Chrome’s index of special URLs was very interesting. Aside from satisfying curiosity, it also gave me the tools to investigate an idea: can I write a web app to use an Android phone as the interface to an electronics project that communicates over USB?

I want to repurpose my old retired Android phones as project UI, and have been making small incremental steps. My AS7341 spectral color sensor project presented its data as a web page served by the ESP32 on board, and my Android compass app for magnetometer exploration was also a web app to visualize data from my phone’s onboard sensors. But I haven’t been able to combine a phone’s onboard capability with external offboard capability. The barrier is a security measure: only web apps served from a public TLS-secured https address are allowed to access extended capabilities like the magnetometer. Web apps served locally over unencrypted http, like those served by my ESP32, are not allowed to access such things.

At one point in the past, web apps served via secured https were allowed to retrieve data from non-secure http sources, but I found that has been locked down in modern browsers. Now they require https all the way. I found this restriction during research for an earlier iteration of my AMG8833 thermal camera idea: I thought I could pair AMG8833 data with a phone’s onboard camera, but the https/http barrier sank that plan. I had to wait for my Adafruit Memento to revisit that idea.

WebUSB is another one of these https-only features. If I can communicate with external peripherals over WebUSB, I can serve a web app from an https source (like GitHub Pages) and talk to my hardware over USB instead of over forbidden insecure http. To test this hypothesis, I took a USB keyboard and plugged it into my desktop PC running Google Chrome. I brought up chrome://device-log to verify that a USB keyboard shows up as a newly attached HID peripheral.

I then plugged the same keyboard into my Google Pixel 7. The keyboard is recognized and functional: I brought up Google Chrome and could type chrome://device-log. But unlike Chrome on my desktop, Chrome on my phone does not show a newly attached USB keyboard as an HID peripheral. It just shows a completely empty device log. I know that even if a device shows up here, it is not a guarantee that it supports WebUSB. But it’s not very promising when the log shows nothing at all. Does this necessarily mean Android Chrome doesn’t even see the hardware? That would be discouraging.

I know USB doesn’t work the same way on an Android phone as it does on a PC. For one thing, Android’s settings include a “USB Preferences” screen to control how my Android phone uses its USB port. This screen represents a mechanism unique to Android USB behavior. There may be others, and I’ll have to learn to work with them. I checked https://caniuse.com and it says WebUSB is supported on Chrome for Android. That encouraged me enough to keep searching for more information on how this might work, and I found a WebUSB example from Adafruit which managed to make my Android device log less empty.

Adafruit Memento a.k.a. PyCamera Photography Parameters

I would love to build upon Adafruit’s work and make something cool with their Memento camera module at its core, but before I brainstorm ideas I need to know what’s already on hand. After reviewing the hardware side of this system, I moved on to the software side. Looking at sample code I immediately saw mention of a “PyCamera”. As far as I can tell, it’s the same thing as the Memento. Adafruit’s Arduino sample code documentation uses the two names interchangeably. Perhaps PyCamera was a development code name for the product that eventually launched as the Memento? Perhaps Adafruit was afraid Arduino fans would pass over a product named PyCamera, thinking it implied CircuitPython exclusivity?

One angle Adafruit used to promote Memento is the programmatic control we have over our photography. Given this sales pitch, I wanted to check out this camera’s capability in photography terms I’m familiar with. Between reading Adafruit source code and the “OV5640 register datasheet” available on their downloads page, here is my understanding:

Aperture

I found nothing that I recognize as a counterpart to controlling camera aperture. Maybe I’ll find something later, but for now I believe aperture is fixed and we can’t play with our depth of field or other aperture controlled photography techniques.

Shutter Speed

There’s no physical shutter in an OV5640, but “exposure” affects how much time the camera takes to read sensor values. The default setting is to use its built-in automatic exposure control (AEC), which varies image integration time based on an internal algorithm, but it is also possible to switch the camera over to manual exposure mode for deliberately over- or under-exposed pictures. To a limited degree, at least: even manual control is limited to the range of “normal” photography, so no multi-hour exposures here. The register datasheet outlines the range of values, but I don’t understand what they mean yet.

Sensitivity (ISO)

The conceptual counterpart for OV5640 is “gain”, and there is again the default of automatic gain control (AGC) with the option to turn off AGC and write values to specific registers to control gain. The register datasheet discusses the range of values, but I don’t understand what they mean yet.

White Balance

We can turn automatic white balance (AWB) on or off, but that’s all I know from this document. What happens when AWB is turned off is out of scope. The Adafruit library exposes set_camera_wb(), but then we’re out of luck for the actual values passed into that API: “For advanced AWB settings, contact your local OmniVision FAE.”

Focus

This was the most exciting part for me, because the vast majority of camera modules available to electronics hobbyists have a fixed focus. The OV5640 on board the Memento has a voice coil motor (VCM) to move its optical path and adjust focus. One of the Adafruit demos performed focus-stacking so I know we have programmatic access, and the camera test app exposes the ability to perform auto-focus. I was looking forward to seeing an auto-focus algorithm in detail!

Unfortunately my hopes were dashed. Indeed we have programmatic access to move the lens within its range of positions, and indeed we have access to an auto-focus algorithm, but the two are separate things. The auto-focus algorithm is an opaque binary blob uploaded to the camera and run on its built-in microcontroller. We do not get to see how it works.

On the upside, there are a few auto-focus modes we should be able to select, and they allow us to specify a region for focus. These controls were designed to support the “tap to focus” usage pattern common to touchscreen cell phone camera apps. So while we don’t get to see the magic inside the box, we have some amount of control over what happens in there. On the downside, this capability is not exposed via the Adafruit PyCamera CircuitPython library API, so some modifications will be required before experimentation can commence. Since I might be doing that, I should dig in to see what’s under the hood.

Denso Ignition Coil-On-Plug Module On Workbench

My experiment to control a Denso ignition coil-on-plug module was far more successful than I had expected. It was a lot of fun! Because high voltages are involved (the very core purpose of an ignition coil…) the first round was done on my garage floor, away from most of my electronics components and equipment. Now that I have gained some confidence it won’t send sparks shooting everywhere, I moved the test rig to my workbench to get some measurements.

These numbers won’t be very good, though. It would be better if I could get an idea of what parameters are important to an ignition coil and what values to expect from a datasheet, but I had no luck finding such official engineering information. Searching for “Denso 90080-19016” found many auto parts suppliers offering to sell me more of those units and/or non-Denso substitutes, but no details beyond which cars the part would fit. Furthermore, this ignition coil was retired from a Toyota Sienna due to error codes relating to the ignition system, so it is probably not functioning to spec anyway.

Its power supply requirement is my biggest unknown. I had tried connecting it to a lithium-ion battery power pack delivering approximately 12 volts, but its power protection circuit believed there was a short circuit and cut power. My bench power supply has a red LED indicating “amperage limit reached”. When using it to power my experiment circuit, that red LED blinks every time the coil fires a spark. So clearly this coil has a brief but very high current draw. As a digital logic person, my understanding of solving such problems only went as far as “add capacitors”. I had some salvaged electrolytic capacitors available and connected them. I installed so many that the inrush current upon plug-in would trigger power protection even before the coil started firing. If I disconnect power while running the coil, I can hear those capacitors supply enough to spark three or four times (each less energetic than the last) before fading out. And even with these capacitors, the brief current draw is still high enough to trigger errors. I’m either using the wrong type of capacitors, or the wrong values, to smooth this out. Such is my ignorance of proper electric power system design.

I had thought that if the power requirements were simple enough, I could power the whole thing with a USB power bank, using a boost converter to kick USB 5V up to supply the coil at 12V. But given these high-draw symptoms, I am skeptical an Amazon lowest-bidder DC boost converter will survive this coil’s demands. I will continue using a lead-acid battery that has (so far) tolerated such abuse.

The next set of experiments concerns IGT signal duration. My experiments started with 2ms, a value I eyeballed from a blurry oscilloscope screen capture. If I want to drive this coil faster, I need to know how short of an IGT pulse I can get away with. I modified my Arduino sketch to use the knob to adjust signal duration, and output the current duration to one of my less-precious computers (a sketch of that modified loop follows the results below). The results were:

  • 2ms: initial guess that worked.
  • ~1ms: start to hear a difference in the sound of the spark.
  • ~0.8ms: sound of spark is weaker, and I no longer see the IGF LED blink, so beyond this point the coil thinks the spark is no good for running an engine. Thankfully I’m not running one.
  • ~0.4ms: even weaker spark sound, and spark generation becomes intermittent.
  • ~0.2ms: no spark at all, just worrisome whining noises from the coil.
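
For reference, here is a minimal sketch of the kind of modified loop described above. The pin assignments and serial output are my own choices for illustration, not necessarily what ended up in the project repository.

// Knob selects the IGT pulse width; the gap between pulses stays at a
// constant 17ms ("redline" pace); the current duration is printed over serial.
const int IGT_PIN = 2;
const int KNOB_PIN = A0;

void setup() {
  pinMode(IGT_PIN, OUTPUT);
  Serial.begin(115200);
}

void loop() {
  // Map knob position to a pulse width between 0.2ms and 2ms.
  long pulseMicros = map(analogRead(KNOB_PIN), 0, 1023, 200, 2000);
  Serial.println(pulseMicros);

  digitalWrite(IGT_PIN, HIGH);    // raise IGT to call for a spark
  delayMicroseconds(pulseMicros);
  digitalWrite(IGT_PIN, LOW);
  delay(17);                      // constant 17ms between pulses
}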

Those are the durations of IGT signal pulses, with the caveat that I measured them with a constant 17ms (“redline”) gap between pulses. As the time between pulses shrinks, it also affects the coil’s behavior in response to each IGT pulse. The two variables are related, but I don’t understand exactly how. And without a good way to quantify results, it’s not very feasible for me to map out a 2D graph charting how those two variables interact.

Lacking such metrics for better understanding, I settled on a maximum pace of IGT pulses 0.5ms in duration with 0.5ms between pulses. In other words, a 50% duty cycle square wave at a frequency of 1kHz. Referencing Wikipedia’s chart of piano note frequencies, an upper limit of 1kHz should still be enough for a bit of silly fun.
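
As a sketch of where that silly fun might go, here is one way to turn a target note frequency into IGT pulses within those limits. The pin assignment, the specific notes, and the helper function name are all my own illustration, not code from the project repository.

// Fire the coil at a musical note's frequency, capped at 1kHz:
// a fixed 0.5ms IGT pulse, then pad out the rest of the period.
const int IGT_PIN = 2;

void playNote(float frequencyHz, unsigned long durationMs) {
  if (frequencyHz > 1000.0) frequencyHz = 1000.0;          // stay at or below the 1kHz ceiling
  unsigned long periodMicros = 1000000.0 / frequencyHz;    // one full spark cycle
  unsigned long endTime = millis() + durationMs;
  while (millis() < endTime) {
    digitalWrite(IGT_PIN, HIGH);
    delayMicroseconds(500);                  // 0.5ms IGT pulse
    digitalWrite(IGT_PIN, LOW);
    delayMicroseconds(periodMicros - 500);   // remainder of the period
  }
}

void setup() {
  pinMode(IGT_PIN, OUTPUT);
}

void loop() {
  playNote(440.0, 500);    // A4 for half a second
  playNote(523.25, 500);   // C5 for half a second
}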


Public GitHub repository for this project: https://github.com/Roger-random/ignition_coil

Toyota Sienna Denso Coil-On-Plug Module

The problem with deciding I need to sit down and do some reading is that I’m vulnerable to getting distracted by something shiny. Or in this case, something sparkly. Some time back I learned how voltage boost converters work for laptop screen LED backlights. A little after that, it occurred to me that ignition coils in modern electronic ignition engines must be boost converters as well. Except instead of driving a bunch of LEDs in series, an ignition coil raises the car’s 12V DC high enough to jump across a spark plug gap. I thought it might be fun to try driving a coil in a non-automotive context and discussed the idea with other local makers. I was too cheap to buy a new coil just for this experiment, because I knew eventually someone would replace their car’s ignition coil and I could ask for the old one. That day has come: my friend Emily Velasco let me know she was going to stop by with a Denso ignition coil-on-plug module and associated spark plug, recently retired from her parents’ Toyota Sienna.

Research

The first step is information gathering. Thanks to the ubiquity of Toyota vehicles, Emily found a pinout diagram for the ignition coil. This module has four pins: two for power (+12V DC and ground) and two for signal (IGT and IGF). Toyota’s official workshop service information would have more details on its operation and troubleshooting, but I don’t have access to that. Fortunately there are other web resources like TOYOTAtech’s How to Find Toyota Ignition System Faults Fast! Dotting the IGFs and Crossing the IGTs.

According to this page, “IGT” (I guess “ignition trigger”) is a signal from the ECU telling the coil to do its thing, and “IGF” (guessing “ignition feedback”) is a signal from the coil back to the ECU to report successful operation. Both operate at +5V DC logic levels. IGT is usually at ground level and briefly raised to +5V by the ECU to call for spark ignition. Looking at the blurry oscilloscope screen capture on TOYOTAtech’s page, my best guess is that it is raised for approximately 2 milliseconds. In contrast, IGF is usually up at +5V DC and the coil pulls it low for roughly 1 millisecond to signal successful spark generation. This open-drain arrangement allows multiple coils to share a common IGF line back to the ECU.

Circuit Board

Armed with this knowledge, I built a quick experiment circuit out of components immediately available at my workbench. Emily helped me by making a connector as CAD practice, for which I was thankful. My board’s output side needs to interface with that connector. On the input side, I needed two voltage levels: +12V DC for the coil and +5V DC for the signal. I have a 3S Lithium-Ion battery pack for 12-ish volts, but its battery management system (BMS) freaked out at the workload. As a backup, I switched to an old-fashioned lead-acid battery. For 5V I used the most expedient thing: an Arduino Nano with its USB socket and +5V DC output pin. To run the coil it will be connected to a cheap disposable USB power bank instead of a computer.

The experiment circuit had two input paths to IGT, switchable by a jumper. The first path is a “manual override” test mechanism with a small push-button switch. Once everything is hooked up, a push on the switch will raise IGT to +5V DC. If there is no spark, we have to backtrack and see what went wrong. If there is a spark, I can move the jumper to the other input path: pulse-generating Arduino Nano.

I’ve already established it needs to raise the line to +5V for approximately 2 milliseconds, but how long should it wait between pulses? A Toyota Sienna should idle somewhere just under 1000 RPM, and redline somewhere around 7000 RPM. This coil is responsible for a single spark plug. As a four-stroke piston engine, it would need to spark once every two revolutions of the crankshaft. The math then works out to: 1000 revolutions/minute * 1 minute/60 seconds * 1 spark/2 revolution = ~8.3 sparks/second. Invert that value to arrive at 120 milliseconds between sparks. Doing the same math for 7000 RPM arrives at 17 milliseconds between sparks. So I would expect this coil to reliably spark once every 120ms to 17ms. For the default program, I programmed the Arduino to raise IGT to +5VDC for 2ms then wait 120ms before repeating.
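
Here is a minimal sketch of that default program, assuming IGT is driven from pin 2 of the Arduino Nano (the pin number is my own choice for illustration, not necessarily the project’s actual wiring):

// Default program: raise IGT for 2ms to call for a spark, then wait 120ms
// (roughly 1000 RPM idle pace for one cylinder) before repeating.
const int IGT_PIN = 2;

void setup() {
  pinMode(IGT_PIN, OUTPUT);
  digitalWrite(IGT_PIN, LOW);
}

void loop() {
  digitalWrite(IGT_PIN, HIGH);   // call for spark
  delay(2);                      // ~2ms IGT pulse
  digitalWrite(IGT_PIN, LOW);
  delay(120);                    // ~120ms between sparks, about 1000 RPM idle
}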

Coil Arrives

I started building the experiment board immediately after Emily told me she would stop by, hoping to have something ready by the time she arrived. So it was slapped together with speed as the utmost priority and everything else (visual neatness, design elegance, and… ahem… electrical safety) relegated to be dealt with later. We connected the coil’s four wires to my test circuit, and it was time for the moment of truth.

I tapped the switch, and we saw a spark. Woohoo! We powered down the system and moved the jumper. Once powered back up, the Arduino sent its pulses and we had a steady stream of sparks. Success!

Enhancements

I honestly expected a lot more debugging before getting to this point, so I didn’t have anything else prepared. Emily suggested that we connect a potentiometer to the system for interactivity, so out came the soldering iron and associated tools. Emily has built a lot of projects with potentiometer knob adjustments, so she handled that addition. As the code monkey, I updated my Arduino code to read knob position so we could adjust from “1000 RPM idle” to “7000 RPM redline”.
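
A hypothetical sketch of that knob addition: read the potentiometer and map it to a delay between sparks, from roughly 120ms (1000 RPM idle) down to 17ms (7000 RPM redline). Pin assignments are again my own guesses for illustration.

// Knob position selects engine pace from idle to redline.
const int IGT_PIN = 2;
const int KNOB_PIN = A0;

void setup() {
  pinMode(IGT_PIN, OUTPUT);
}

void loop() {
  // map() handles the reversed range: knob at zero = 120ms, knob at maximum = 17ms.
  long msBetweenSparks = map(analogRead(KNOB_PIN), 0, 1023, 120, 17);

  digitalWrite(IGT_PIN, HIGH);   // call for spark
  delay(2);                      // 2ms IGT pulse
  digitalWrite(IGT_PIN, LOW);
  delay(msBetweenSparks);
}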

Emily also fixed a problem with my board: I had connected an LED to IGF but it stayed dark. Reviewing the circuit, I realized out of habit I had set it up to shine whenever IGF was high, but the coil never raises IGF high! It pulls IGF low to report a successful spark. Emily added a 1k pull-up resistor and rewired the LED so it shines when IGF is low.

We connected everything back up and the adjustment knob worked wonders. We could “rev up” our system and it was fun, with Emily capturing a few video clips. Unfortunately the IGF LED stayed dark, but it didn’t dampen our enthusiasm. Now that the coil was up and running under conditions approximating its designed purpose, we entered the “screwing around” phase, pushing it beyond its original operating range.

Adventuring Beyond Spec

Emily asked for a fluorescent tube, but my house has almost entirely converted to LED lighting. I had but a single remaining tube, and it was only still there because I hadn’t figured out how to open up its enclosure. Emily figured it out in five seconds and pulled out the tube, connecting its ends in place of the spark plug. The ignition coil was able to act as a (poor) fluorescent tube ballast, dimly flashing the tube. (The room had to go dark for Emily to shoot that video.)

After this fluorescent tube experiment, we reinstalled the spark plug and made a fascinating discovery. With no other (intentional) changes, the IGF LED now blinks in sync with sparks as originally expected. We have no idea why. Perhaps something about the workload of driving a fluorescent tube? This coil and plug were replaced because the Toyota Sienna’s engine control unit (ECU) reported codes for ignition issues. It’s possible cylinder combustion was working properly but poor IGF reporting triggered the malfunction indicator light (MIL). We agreed that if this were the case, it obviously means Toyota/Denso must add a fluorescent tube to their official list of repair tools.

Emily tried to see if this spark could light a sheet of paper on fire. There was a lot of glowing, charring, and pitting, but it took persistence before she got a real flame. There are far more effective ways to start a fire! But it did make us wonder if it’d be practical to build a crude electrical-discharge machining (EDM) system out of an ignition coil. That idea has been added to the ever-growing project to-do list.

The most promising experiment was revving this thing far beyond “redline” by shortening the time between pulses below 17ms. To go even faster, we reduced the duration of the IGT pulse itself. This quickly extinguished the IGF LED, which was the coil’s way of complaining that things were going too fast for a good combustion-initiating spark. But that’s OK, because we weren’t there to ignite air-fuel mixtures, we were trying to turn it into a silly and pointless musical instrument. Still, there was a limit. We started losing the spark (and our musical note) when we went too fast. Exactly how fast is too fast? To make further progress on this front, I’ll have to better characterize the parameters of this ignition coil-on-plug module.


Public GitHub repository for this project: https://github.com/Roger-random/ignition_coil

Options for Improving Timestamp Precision

After a quick test determined that my Arduino sketch will be dealing with data changing at a rate faster than 1kHz, I switched the timestamp query from calling millis() to micros(). As per Arduino documentation, this change improved time resolution by a factor of 250, from 1 millisecond precision to 4 microsecond precision. Since I had time on my mind anyway, I took a research detour to learn how this might be improved further. After learning how much work it’d take, I weighed it against my project and decided… nah, never mind.

Hardware: ATmega328P

A web search for ATmega328P processor programming found good information on the page Developing in C for the ATmega328: Marking Time and Measuring Time. The highest possible timing resolution is a counter that increments upon every clock cycle of the processor. For an ATmega328P running at 16MHz, that’s a resolution of 62.5 nanoseconds from ticks(). This 16-bit counter overflows very quickly (once every 4.096 milliseconds) so there’s another 16-bit counter ticks_ro() that increments whenever ticks() overflows. Together they become a 32-bit counter that would overflow every 4.47 minutes; after that we’re on our own to track overflows.
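
A rough sketch of how I understand those two counters combine, following the article’s naming but with the composition being my own illustration rather than the article’s actual code:

// Combine the 16-bit hardware tick counter with the 16-bit overflow counter
// into one 32-bit value. ticks() and ticks_ro() follow the linked article's names.
uint32_t ticks32() {
  uint16_t high = ticks_ro();    // overflow count
  uint16_t low  = ticks();       // 62.5ns per tick at 16MHz
  if (ticks_ro() != high) {      // an overflow happened between the two reads; read again
    high = ticks_ro();
    low  = ticks();
  }
  return ((uint32_t)high << 16) | low;
}
// 2^16 ticks * 62.5ns = 4.096ms before ticks() overflows
// 2^32 ticks * 62.5ns ≈ 268 seconds ≈ 4.47 minutes before the combined value overflows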

However, ticks() and ticks_ro() are very specific to AVR microcontrollers and not (easily) accessible from Arduino code, because using them would kill a sketch’s portability. Other microcontrollers have similar concepts but they would not be called the same thing. (Example: ESP32 has cpu_hal_get_cycle_count().)

Software: Encoder Library

Another factor in timing precision is the fact that I’m not getting the micros() value when the encoder position is updated. The encoder position counter is updated within the quadrature decoding library, and I call micros() sometime afterwards.

timestamp,position,count
16,0,448737
6489548,1,1
6490076,2,1
6490688,5,1
6491300,8,1
6491912,12,1
6492540,17,1
6493220,21,1
6493876,25,1

Looking at the final two lines of this excerpt, I see my code recorded an encoder update from position 21 to 25 over a period of 6493876-6493220 = 656 microseconds. But 6493876 is only when my code ran; that’s not when the encoder clicked over from 24 to 25! There’s a delay on the order of three-digit microseconds, an approximation derived from the average time per count: 656/(25-21) = 164.

One potential way to improve upon this is to add a variable to the Encoder library, tracking the micros() timestamp of the most recent position update. I can then query that timestamp from my code later, instead of calling micros() myself, which adds an unknown delay. I found the encoder library source code at https://github.com/PaulStoffregen/Encoder. I found an update() function and saw a switch() statement that looked at pin states and updated the counter as needed. I could add my micros() capture in the cases that update position. Easy, or so I thought.
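
To illustrate the plan, here is a conceptual sketch of the change I had in mind, modeled loosely on the library’s simplified “documentation version” of update(). The variable lastUpdateMicros and this stripped-down movement handler are my own illustration, not actual Encoder library code.

volatile int32_t encoderPosition = 0;
volatile uint32_t lastUpdateMicros = 0;

// Called from the quadrature decoding logic whenever the pin states indicate movement.
void recordEncoderMovement(int8_t movement) {
  if (movement != 0) {
    encoderPosition += movement;    // the bookkeeping the library already does
    lastUpdateMicros = micros();    // new: capture the timestamp at the moment of update
  }
}

// Application code reads the stored timestamp instead of calling micros() itself,
// eliminating the unknown delay between the update and the query.
void logSample() {
  noInterrupts();                   // guard against an update arriving mid-read
  int32_t position = encoderPosition;
  uint32_t timestamp = lastUpdateMicros;
  interrupts();
  Serial.print(timestamp);
  Serial.print(',');
  Serial.println(position);
}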

Looking at the code more closely, I realized the function I found is actually in a comment. It was labeled the “Simple, easy-to-read “documentation” version 🙂” implying the actual code was not as simple or easy to read. I was properly warned as I scrolled down further and found… AVR assembly code. Dang! That’s hard core.

On the upside, AVR assembly code means it can access the hardware registers behind ticks() and ticks_ro() for the ultimate in timer resolution. On the downside, I don’t know AVR assembly and, after some thought, I decided I’m not motivated enough to learn it for this particular project.

This was a fun side detour and I learned things I hadn’t known before, but I don’t think the cost/benefit ratio makes sense for my Canon MX340 teardown project. I want to try some other easy things before I contemplate the harder stuff.


This teardown ran far longer than I originally thought it would. Click here to rewind back to where this adventure started.

Captured CSV and Excel worksheets are included in the companion GitHub repository.

Bug Hunt Could Cross Three or More Levels of Indirection

When running Proxmox VE, my Dell Inspiron 7577’s onboard Realtek Ethernet would quit at unexpected times. Network transmission halts, and a network watchdog timer fires, which triggers a debug error message. One proposed workaround is to change to a different Realtek driver, but after learning about the tradeoffs involved, I decided against pursuing that path.

This watchdog timer error message has been reported by many users on Proxmox forums, and some kind of a fix is en route. I’m not confident it’ll help me, because that fix deactivates ASPM on Realtek devices, and turning off ASPM across the board on my computer didn’t keep the machine online. I’m curious how that particular fix was developed, or what data informed it. Thinking generally, pinning such a failure down requires jumping through at least three levels of indirection. My poorly-informed speculation is as follows:

The first and easiest step is the watchdog timer itself. A call stack is part of the error message, which might be enough to determine the code path that started the timer. But since it is a production binary, the call stack has incomplete symbols. Getting more information would require building a debug kernel in order to get full symbols.

With that information, it should be relatively straightforward to get to the second step: determining what network operation timed out. But then what? Given the random and intermittent nature of the failure, the failing network operation was probably just an ordinary transaction that had succeeded many times before and should have succeeded again. But for whatever reason, it failed this time because the Realtek driver and/or hardware got into a bad state.

And that’s the difficult third step: how to look at an otherwise ordinary network transaction and deduce a cause for the bad Realtek state. It probably wasn’t the network transaction itself, which means at least one more indirect jump. The fix en route dealt with PCIe ASPM (PCI Express Active State Power Management), which probably wasn’t directly on the code path for a normal network data transmission. I’m really curious how that deduction was made and, if the incoming fix doesn’t address my issue, how I can use similar techniques to determine what put my hardware in a bad state.

From the outside, that process feels like a lot of black magic voodoo I don’t understand. For now I will sit tight with my reboot cron job workaround and wait for the updated kernel to arrive.

[UPDATE: A Proxmox VE update has arrived, bringing kernel 6.2.16-18-pve to replace the 6.2.16-15-pve I had been running. Despite my skepticism about ASPM, either that change or another in this update seems to have been successful at keeping the machine online!]


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Reported PCI Express Error was Unrelated

I have a Dell Inspiron 7577 laptop whose Ethernet hardware is unhappy with Proxmox VE 8, dropping off the network at unpredictable times. [UPDATE: Network connectivity stabilized after installing the Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The PCI Express AER messages described in this post also stopped.] Trying to dig deeper, I found there was an error message dump indicating a watchdog timer went off while waiting to transmit data over the network. Searching online, I found bug reports that match the symptoms, but that’s not necessarily the same cause. A watchdog timer can be triggered by anything that gums up the works, so what resolves the network issue on one machine wouldn’t necessarily work on mine. I went back to dmesg to look for other clues.

Before the watchdog timer triggered, I found several lines of this message at irregular intervals:

[36805.253317] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:3b:00.0

Sometimes only seconds apart, other times hours apart, and sometimes it never happens at all before the watchdog timer barks. This is some sort of error on the PCIe bus from device 0x3b:00.0, which is the Realtek Ethernet controller as per this lspci excerpt:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

Even though the debug message said the error was corrected, maybe it triggered side effects causing my problem? Searching on this error message, I found several possibly relevant kernel flags. This Reddit thread has a good summary of them all.

  • pci=noaer disables PCI Express Advanced Error Reporting, which is what sent this message. This is literally shooting the messenger: it’ll silence those messages but won’t do anything to address the underlying problems.
  • pci=nomsi disables MSI (Message Signaled Interrupts), a PCI Express signaling mechanism that might cause these correctable errors, forcing all devices to fall back to a different mechanism. Some people reported losing peripherals (like USB) when they used this flag; I guess that hardware couldn’t fall back to something else? I tried it, and while it didn’t cause any obvious problems (I still had USB) it also didn’t help keep my Ethernet alive.
  • pci=nommconf disables PCI Express memory-mapped configuration. (I don’t know what those words mean, I just copied them out of kernel documentation.) The good news is adding this flag did eliminate those “Corrected error received” messages. The bad news is it didn’t help keep my Ethernet alive, either.

Up until I tried pci=nommconf, I had wondered if I’d been doing kernel flags wrong. I was editing /etc/default/grub then running update-grub. After boot, I checked that the flags showed up in cat /proc/cmdline, but I didn’t really know if the kernel actually changed behavior. After pci=nommconf, my confidence was boosted by the lack of “Corrected error received” messages, though that might still be a false sense of confidence because those messages don’t always happen. It’s an imperfect world; I work with what I have.
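
For reference, the whole procedure looks something like this, using pci=nommconf as the example flag. (The exact contents of GRUB_CMDLINE_LINUX_DEFAULT will vary from system to system; the flag just gets appended to whatever is already there.)

# 1. Edit the default kernel command line in /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=nommconf"

# 2. Regenerate the GRUB configuration, then reboot
update-grub

# 3. After the reboot, confirm the flag made it onto the kernel command line
cat /proc/cmdline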

And sadly, there is something I need but don’t have today: the ability to dig deeper into the Linux kernel to find out what has frozen up, leading to the watchdog timer expiring. But I’m out of ideas for now, and I still have a computer that drops off the network at irregular times. I don’t want to keep pulling the laptop off the shelf to log in locally and type “reboot” several times a day. I concede I must settle for a hideously ugly hack to do that for me.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

Ethernet Failure Triggers Network Stack Timeout

I was curious about Proxmox VE’s capability to migrate virtual machines from one cluster node to another. I set up a small cluster to try it and found it to be as easy as advertised. After migrating my VM experiments to a desktop computer with Intel networking hardware, they have been running flawlessly. This allowed me to resume tinkering with a laptop computer that would drop off the network at unpredictable times, an unfortunate tendency that makes it a very poor Proxmox VE server. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve.]

Repeating Errors From r8169

After it dropped off the network, I had to log on to the computer locally. The screen was usually filled with error messages. I ran dmesg and saw the same messages there as well. Based on the associated timestamps, this block of messages repeated every four minutes:

[68723.346727] r8169 0000:3b:00.0 enp59s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[68723.348833] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.350921] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.352954] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.355097] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.357156] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.359289] r8169 0000:3b:00.0 enp59s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
[68723.389357] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.415890] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
[68723.442132] r8169 0000:3b:00.0 enp59s0: rtl_eriar_cond == 1 (loop: 100, delay: 100).
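
(Side note: the numbers in brackets are seconds since boot, so that four-minute spacing shows up as gaps of roughly 240 between repeats. Something like the following prints wall-clock timestamps instead, which can be easier to read:)

# show kernel messages with human-readable timestamps, filtered to the Realtek driver
dmesg -T | grep r8169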

Searching on that led me to Proxmox forums, and one of the suggested workarounds was to set the kernel flag pcie_aspm=off. I tried that, but the computer still kept dropping off the network. Either I’m not doing this correctly (editing /etc/default/grub then running update-grub) or the change doesn’t help my situation. Perhaps it addressed a different problem with similar symptoms, leaving open the mystery of what’s going on with my machine.

NETDEV WATCHDOG

Looking for more clues, I scrolled backwards in the dmesg log and found this block of information just before the repeating series of r8169 errors:

[67717.227089] ------------[ cut here ]------------
[67717.227096] NETDEV WATCHDOG: enp59s0 (r8169): transmit queue 0 timed out
[67717.227126] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x23a/0x250
[67717.227133] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilt>
[67717.227254]  iwlwifi ttm snd_timer pcspkr drm_display_helper intel_wmi_thunderbolt btintel dell_wmi_descriptor joydev processor_thermal_mbox>
[67717.227374]  i2c_i801 xhci_pci i2c_hid_acpi crc32_pclmul i2c_smbus nvme_common i2c_hid realtek xhci_pci_renesas ahci libahci psmouse xhci_hc>
[67717.227401] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O       6.2.16-15-pve #1
[67717.227404] Hardware name: Dell Inc. Inspiron 7577/0P9G3M, BIOS 1.17.0 03/18/2022
[67717.227406] RIP: 0010:dev_watchdog+0x23a/0x250
[67717.227411] Code: 00 e9 2b ff ff ff 48 89 df c6 05 ac 5d 7d 01 01 e8 bb 08 f8 ff 44 89 f1 48 89 de 48 c7 c7 90 87 80 bc 48 89 c2 e8 56 91 30>
[67717.227414] RSP: 0018:ffffae88c014ce38 EFLAGS: 00010246
[67717.227417] RAX: 0000000000000000 RBX: ffff99129280c000 RCX: 0000000000000000
[67717.227419] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227421] RBP: ffffae88c014ce68 R08: 0000000000000000 R09: 0000000000000000
[67717.227423] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99129280c4c8
[67717.227425] R13: ffff99129280c41c R14: 0000000000000000 R15: 0000000000000000
[67717.227427] FS:  0000000000000000(0000) GS:ffff991600480000(0000) knlGS:0000000000000000
[67717.227429] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67717.227432] CR2: 000000c0006e1010 CR3: 0000000165810003 CR4: 00000000003726e0
[67717.227434] Call Trace:
[67717.227436]  <IRQ>
[67717.227439]  ? show_regs+0x6d/0x80
[67717.227444]  ? __warn+0x89/0x160
[67717.227447]  ? dev_watchdog+0x23a/0x250
[67717.227451]  ? report_bug+0x17e/0x1b0
[67717.227455]  ? irq_work_queue+0x2f/0x70
[67717.227459]  ? handle_bug+0x46/0x90
[67717.227462]  ? exc_invalid_op+0x18/0x80
[67717.227465]  ? asm_exc_invalid_op+0x1b/0x20
[67717.227470]  ? dev_watchdog+0x23a/0x250
[67717.227474]  ? dev_watchdog+0x23a/0x250
[67717.227477]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227481]  call_timer_fn+0x29/0x160
[67717.227485]  ? __pfx_dev_watchdog+0x10/0x10
[67717.227488]  __run_timers+0x259/0x310
[67717.227493]  run_timer_softirq+0x1d/0x40
[67717.227496]  __do_softirq+0xd6/0x346
[67717.227499]  ? hrtimer_interrupt+0x11f/0x250
[67717.227504]  __irq_exit_rcu+0xa2/0xd0
[67717.227507]  irq_exit_rcu+0xe/0x20
[67717.227510]  sysvec_apic_timer_interrupt+0x92/0xd0
[67717.227513]  </IRQ>
[67717.227515]  <TASK>
[67717.227517]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
[67717.227520] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
[67717.227524] Code: 12 57 44 e8 f4 64 4a ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 22 6d 49 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00>
[67717.227526] RSP: 0018:ffffae88c00ffe38 EFLAGS: 00000246
[67717.227529] RAX: 0000000000000000 RBX: ffffce88bfc80000 RCX: 0000000000000000
[67717.227531] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
[67717.227533] RBP: ffffae88c00ffe88 R08: 0000000000000000 R09: 0000000000000000
[67717.227534] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffbd2c3a40
[67717.227536] R13: 0000000000000008 R14: 0000000000000008 R15: 00003d96a543ec60
[67717.227540]  ? cpuidle_enter_state+0xce/0x6f0
[67717.227544]  cpuidle_enter+0x2e/0x50
[67717.227547]  do_idle+0x216/0x2a0
[67717.227551]  cpu_startup_entry+0x1d/0x20
[67717.227554]  start_secondary+0x122/0x160
[67717.227557]  secondary_startup_64_no_verify+0xe5/0xeb
[67717.227563]  </TASK>
[67717.227565] ---[ end trace 0000000000000000 ]---

A watchdog timer went off somewhere in the networking stack while waiting to transmit data. The output starts with [ cut here ] but I have no idea where I’m supposed to paste this information. I recognize the format of a call trace alongside a dump of CPU register data, but the actual call trace is incomplete. There are a lot of “?” entries in here because I am not running the debug kernel and symbols are missing.
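
(From what I’ve read, the kernel source tree includes a helper script that can turn a raw trace like this into source file and line references, though I haven’t tried it myself since it needs a vmlinux built with debug symbols:)

# run from a kernel source tree with a matching debug-info vmlinux; trace.txt holds the dmesg excerpt
./scripts/decode_stacktrace.sh vmlinux < trace.txt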

Looking in the FAQ on Kernel.org, I followed a link to kernelnewbies.org and from there to their page “So, you think you’ve found a Linux kernel bug?” Its section on “Oops messages” describes output very similar to what I see here, except mine lacks the actual line with “Oops” in it. From there I was linked to the kernel bug tracking database. A search on watchdog timer expiration in r8169 got several dozen hits across many years, including 217814, which I had found earlier via Proxmox forum search, thus coming full circle.

I see some differences between my call trace and the one in 217814, but those may just be expected differences between my kernel (6.2.16-15-pve) and the one that generated 217814 (6.2.0-26-generic). In any case, the call stack appears to be for the watchdog timer itself and not whatever triggered it. Supposedly disabling ASPM would resolve 217814. Since it didn’t do anything for me, I conclude there’s something else clogging up the network stack. Teasing out that “something else” requires learning more about Linux kernel inner workings. I’m not enthusiastic about that prospect, so I looked for other things to try.


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt “Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable”

A Quick Look at ASPM and Power Consumption

[UPDATE: After installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs, allowing the machine to stay connected to the network.]

I’ve configured an old 15″ laptop into a light-duty virtualization server running Proxmox VE, and I’m running into a reliability problem with the Ethernet controller on this Dell Inspiron 7577. My symptoms line up with a bug that others have filed, and a change to address the issue is working its way through the pipeline. I wouldn’t call it a fix, exactly, as the problem seems to be flawed power management in Realtek hardware and/or driver in combination with the latest Linux kernel. The upcoming change doesn’t fix Realtek power management, it merely disables their participation in PCIe ASPM (Active State Power Management).

Until that change arrives, one of the mitigation workarounds is to deactivate ASPM on the entire PCIe bus. There are a lot of components on that bus! Here’s the output from running “lspci” at the command line:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
3c:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
3d:00.0 Non-Volatile memory controller: Intel Corporation Device f1aa (rev 03)

Deactivating ASPM across the board will impact far more than the Realtek chip. I was curious what impact this would have on power consumption and decided to dig up my Kill-a-Watt meter for some before/after measurements.

Dell Latitude E6230 + Ubuntu Desktop

As a point of comparison, I had measured a few values on the Dell Latitude E6230 I had just retired. These are the lowest values I could see within a ~15 second window; it would jump up by a watt or two for a few seconds before dropping.

  • 5W: idle.
  • 8W: hosting Home Assistant OS under KVM but not doing anything intensive.
  • 35W: 100% CPU utilization as HAOS compiled ESPHome firmware updates.

For light-duty server use, the most important number here is the 8W value, because that’s what the machine will be drawing most of the time.

Dell Inspiron 7577 + Proxmox VM

Since the Inspiron 7577 came with a beefy 180W AC power adapter (versus the 60W unit of the E6230) I was not optimistic about its power consumption. As a newer, larger, more power-hungry machine, I had expected its idle power draw to be at least double that of the E6230. I was very pleasantly surprised. With Proxmox VE running but all VMs shut down, the Kill-a-Watt indicated a rock solid two watts. Two!

As I started up my three virtual machines (Home Assistant OS, Plex, and InfluxDB), it jumped up to fifteen watts then gradually ramped back down to two watts as those VMs reached steady state. After that, it would occasionally jump up to four or five watts for a few seconds to service those mostly-idle VMs, then drop back down to two watts.

On the upside, it appears four generations of Intel CPU and laptop evolution have provided significant improvements in power efficiency. However, the two machines were running different software, so some of that difference might be credited to Ubuntu Desktop versus Proxmox VE.

On the downside, the Kill-a-Watt only measures down to whole watts with no fractional digits, so a baseline of two watts isn’t very useful: it would take a 50% change in power consumption to show up in the Kill-a-Watt numbers. I know running three VMs must take some power, but idling with and without VMs both bottomed out at two watts. This puts me into measurement error territory. I need finer-grained instrumentation to make meaningful measurements, but I’m not willing to pay money for what is just a curiosity experiment. I shrugged and kept going.

Dell Inspiron 7577 + Proxmox VM + pcie_aspm=off

Reading Ubuntu bug #2031537, I saw one of their investigative steps was to add pcie_aspm=off to the kernel command line. To follow in those footsteps, I first needed to learn what that meant. I could confirm it is documented as a valid kernel command line parameter. Then I had to find instructions on how to add such a thing, which involved editing /etc/default/grub then running update-grub. And finally, after the system rebooted, I could confirm the command line was processed by typing “cat /proc/cmdline”. I don’t know how to verify it actually took effect, though, except by observing system behavior changes.
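
One check that might help, though this is an assumption on my part rather than something I verified: lspci can report the PCIe link capability and control status for the Realtek device, which should show whether ASPM is still enabled on that link after the reboot.

# inspect the Realtek NIC's PCIe link; "ASPM Disabled" under LnkCtl would suggest the flag took effect
lspci -vv -s 3b:00.0 | grep -i aspm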

The first data point is power consumption: now, when hosting my three virtual machines, the Kill-a-Watt showed three watts most of the time. It still occasionally dips down to two watts for a second or two, but most of the time it hovers at three watts, plus the occasional spike up to four or five watts. Given the coarse granularity, it’s inconclusive whether this reflects an actual change or just random variation.

The second, more important data point: did it improve Ethernet reliability? Sadly it did not. Before I made this change, I noted three failures of the Realtek Ethernet, each session lasting 36 hours or less. The first session after this change lost the network after 50 hours. That might be within the range of random variation (meaning maybe pcie_aspm=off didn’t actually change anything) and is definitely not long enough. After that reboot, the system fell off the network again after less than 3 hours. (2 hours 55 minutes!) That is a complete fail.
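
(One way to reconstruct session lengths like these after the fact is from the system’s boot history:)

# list recent boot times recorded in the login records
last reboot | head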

I’m sad pcie_aspm=off turned out to be a bust. So what’s next? First I need to move these virtual machines to another physical machine, which is a handy excuse to play with Proxmox clusters.

Lithium Iron Phosphate Battery in Commodity Sealed Lead Acid Battery Form Factor

It was instructive to take apart a broken light switch to see why it failed. An unexpected side bonus of replacing this switch was that I also learned the sealed lead-acid (SLA) batteries in my uninterruptible power supply (UPS) units are no good. I had shut down electricity to the entire house to swap the switch, a project I expected to take 15-20 minutes. This was well under the estimated run time of my UPS units. However, less than ten minutes into my project, I started hearing low battery alarms, followed by the UPS units going dark.

I suspected these batteries might be weak, as the usual recommendation is to replace them every two or three years and some of these are coming up on four years old. The switch swap project confirmed those batteries are long gone. They are still good enough to handle the brief flickering blinks of a power outage (common when a neighbor’s air conditioning kicks in) but not an extended outage.

The last time I needed new UPS batteries, I bought APC-branded replacement battery cartridges and then took apart the old cartridges. Finding each was built around two SLA batteries in a commodity form factor, I thought that next time I should try replacing the batteries with generics instead of buying a whole APC-branded cartridge. Now I will put that idea into practice.

I went online to shop for generic “7AH” SLA batteries. They’re not necessarily all seven amp-hours in capacity, but that is typical, and the label became a way to refer to the form factor as well, dictating a compatible enclosure size along with the location and shape of the positive and negative terminals. Among listings for “7AH” SLA batteries, I saw some lithium iron phosphate (or LFP or LiFePO4) batteries packaged into the same form factor and advertised to be drop-in replacements. Hmm, interesting.

On paper, lithium iron phosphate batteries will have a longer useful life. They have lower energy density than the NMC types of lithium-ion batteries popular in our portable electronics, so in an electronics context LFP batteries usually mean bigger and heavier battery packs. But in the lead-acid replacement scenario, LFP batteries are smaller and lighter than the equivalent lead-acid. 7AH (or 9AH, or 10AH, or 12AH) worth of LFP cells fit comfortably within a commodity 7AH SLA shape, with plenty of room left inside to integrate a battery management system to guard against battery abuse.

Four LFP cells in series have almost the same 12-ish to 14-ish volt operating range as six lead-acid cells in series, close enough that the integrated BMS should prevent any major issues. The biggest disclaimer I saw repeated from several vendors was about battery capacity. While these batteries are compatible with systems designed for lead-acid batteries, an LFP-aware charger is required to access their full capacity. Lead-acid systems typically maintain a standby voltage of 13.8V, and that would only keep these LFP batteries at about 75%-80% full.
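
As a back-of-envelope check on why those numbers line up (my own arithmetic with nominal figures, not vendor data):

  • Lead-acid: 6 cells × ~2.1V ≈ 12.6V nominal
  • LFP: 4 cells × 3.2V ≈ 12.8V nominal
  • A 13.8V lead-acid standby voltage works out to 13.8V ÷ 4 ≈ 3.45V per LFP cell, short of the ~3.65V per cell an LFP-aware charger would apply to reach 100%.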

I saw that “only 75%-80% full” warning and thought: that’s not a bug, that’s a feature! Limiting lithium chemistry battery state of charge to about 80% significantly prolongs useful life relative to keeping them at 100%, and longevity is exactly what I want from UPS batteries. I can accept only getting ~5AH out of 7AH capacity as a tradeoff if it means I still have 5AH after four or more years. I would have to pay a premium for LFP batteries over SLA batteries, but the price difference is much smaller than it used to be. And that’s before considering that if I don’t have to replace them as frequently, I might even come out ahead! This all sounds interesting enough that I will give them a try.