Balena Device Debugging Masterclass
Prerequisite Classes
This masterclass builds upon knowledge that has been taught in previous classes. To gain the most from this masterclass, we recommend that you first undertake the following masterclasses:
Introduction
At balena, we believe the best people to support a customer are the engineers who build the product. They have the depth and breadth of knowledge that can quickly identify and track down issues that traditional support agents usually do not. Not only does this help a customer quickly and efficiently solve most issues, but it also immerses balena engineers in sections of the product they might not otherwise encounter in their usual working life, which further improves the support each engineer can offer. This masterclass has been written as an initial guide for new engineers about to start support duties.
Whilst the majority of devices never see an issue, occasionally a customer will contact balena support with a query where one of their devices is exhibiting anomalous behavior.
Obviously, no guide can cover the range of queries that may occur, but it can give an insight into how to tackle problems and the most common problems that a balena support agent sees, as well as some potential solutions to these problems. In compiling this document, a group of highly seasoned balena engineers discussed their techniques for discovering and resolving on-device issues, as well as techniques for determining how best to mitigate an issue being exhibited.
In this masterclass, you will learn how to:
Gain access to a customer device, when permission has been granted
Retrieve initial diagnostics for the device
Identify and solve common networking problems
Work with the Supervisor
Work with balenaEngine
Examine the Kernel logs
Understand media-based issues (such as SD card corruption)
Understand how the heartbeat and VPN-only statuses affect your devices
Whilst this masterclass is intended for new engineers about to start support duties at balena, it is also intended to act as an item of interest to customers who wish to know more about how we initially go about debugging a device (and includes information that customers themselves could use to give a support agent more information). We recommend, however, ensuring balena support is always contacted should you have an issue with a device that is not working correctly.
Note: The balena VPN service was renamed to cloudlink in 2022 in customer-facing documentation.
Hardware and Software Requirements
It is assumed that the reader has access to the following:
A local copy of the example in our Balena Device Debugging Masterclass repository. This copy can be obtained by either method:
git clone https://github.com/balena-io-projects/debugging-masterclass.git
Download the ZIP file (from 'Clone or download'->'Download ZIP') and then unzip it to a suitable directory
A balena supported device, such as a balenaFin 1.1, Raspberry Pi 3 or Intel NUC. If you don't have a device, you can emulate an Intel NUC by installing VirtualBox and following this guide
A suitable shell environment for command execution (such as bash)
A balenaCloud account
A familiarity with Dockerfiles
An installed instance of the balena CLI
Exercises
The following exercises assume access to a device that has been provisioned. As per the other masterclasses in this series, we're going to assume that's a Raspberry Pi 4; however, you can simply alter the device type as appropriate in the following instructions. The balena CLI is going to be used instead of the WebTerminal in the balenaCloud Dashboard for accessing the device, but all of the exercises could be completed using the WebTerminal if preferred.
First, log in to your balena account via balena login, and then create a new fleet:
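Assuming the balena CLI is installed, this looks like the following sketch (the fleet name and device type are the ones used in this masterclass; on older CLI versions the equivalent command may be balena app create):

```shell
# Authenticate the CLI against balenaCloud
balena login

# Create a fleet named DebugFleet for a Raspberry Pi 4
balena fleet create DebugFleet --type raspberrypi4-64
```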
Now provision a device by downloading and flashing a development image from the Dashboard (via Etcher), or by flashing via the command line.
Note: Above, we used a balenaOS Extended Support Release (ESR). These ESRs are currently available for many device types, but only on paid plans and balena team member accounts. If you are going through this masterclass on a free plan, just pick the latest release available and the remainder of the guide is still applicable.
Carry out any configuration generation required should you be using a WiFi AP, and inject the configuration into the image (see the balena CLI Advanced Masterclass for more details), or use a configuration for an Ethernet connection:
You should now have a device that will appear as part of the DebugFleet fleet:
For convenience, export a variable to point to the root of this masterclass repository, as we'll use this for the rest of the exercises, eg:
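For example, from the directory into which you cloned the repository (the variable name here is an arbitrary choice):

```shell
# Run this from the root of the cloned debugging-masterclass repository
export BALENA_DEBUG_MASTERCLASS="$(pwd)"
echo "$BALENA_DEBUG_MASTERCLASS"
```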
Finally, push the code in the multicontainer-app directory to the fleet:
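With the CLI, that push is a single command, run from the repository root (DebugFleet being the fleet created earlier):

```shell
cd multicontainer-app
balena push DebugFleet
```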
1. Accessing a User Device
Any device owned by you will automatically allow access via either the WebTerminal (in the device's Dashboard page), or the balena CLI via balena ssh <uuid>. However, for a support agent to gain access to a device that isn't owned by them, the device's owner needs to grant support access explicitly.
See: Granting Support Access to a Support Agent
2. Initial Diagnosis
The balenaCloud Dashboard includes the ability to run diagnostics on a device to determine its current condition. This should be the first step in attempting to diagnose an issue without having to actually access the device via SSH. Ensuring diagnostics and health checks are examined first ensures that you have a good idea of the state a device is in before SSHing into it, as well as ensuring that the information can be accessed later if required (should a device be in a catastrophic state). This helps significantly in a support post-mortem should one be required.
2.1 Device Health Checks
To run health checks through balenaCloud dashboard, head to the Diagnostics tab in the sidebar and click the Run checks button to start the tests.

This will trigger a set of health checks to run on the device. You should see all the checks as Succeeded in the Success column if the device is healthy and there are no obvious faults.
That's no fun, let's create a fault.
SSH into your device, via balena ssh <UUID>, using the appropriate UUID. We want to SSH into the host OS, as that's where we'll wreak havoc:
We're going to do a couple of things that will show up as problems. Something you'll often check, and that we'll discuss later, is the state of the balena Supervisor and balenaEngine.
First of all, we're going to kill the balenaEngine maliciously without letting it shut down properly:
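A minimal sketch of such a command (the exact invocation on-device may differ; the [b] in the pattern stops awk from matching its own pipeline entry):

```shell
# Find the balenad PID and send SIGKILL so the engine cannot shut down cleanly
kill -9 "$(ps | awk '/[b]alenad/ {print $1; exit}')"
```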
What this does is list the processes running, look for the balenad executable (the balenaEngine itself) and then stop the engine with a SIGKILL signal, which will make it immediately terminate instead of shutting down correctly. In fact, we'll do it twice. Once you've waited about 30 seconds, run the command again.
Now run the health checks again. After a couple of minutes, you'll see the check_container_engine section has changed:
check_container_engine
Failed
Some container_engine issues detected:
test_container_engine_running_now Container engine balena is NOT running
test_container_engine_restarts Container engine balena has 2 restarts and may be crashlooping (most recent start time: Thu 2022-08-18 11:14:32 UTC)
test_container_engine_responding Error querying container engine:
Unclean restarts usually mean that the engine crashed abnormally with an issue. This usually happens when something catastrophic occurs between the Supervisor and balenaEngine or corruption occurs in the image/container/volume store. Let's take a look at the journal for balenaEngine (balena.service) on the device:
You'll see a lot of output, as the logs don't just show the balenaEngine output but the output from the Supervisor as well. However, if you search through the output, you'll see a line like the following:
As you can see, the balena.service was killed with a SIGKILL instruction.
You can also see the two times that our services were attempted to start after the engine was killed and restarted automatically by running:
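A hedged way to surface those start attempts (the exact Supervisor log text varies between versions, so treat the grep pattern as a guess):

```shell
journalctl -a -u balena.service | grep -i "starting service"
```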
As you can see, these have now been specifically output for the two running service containers.
If you only want to see balenaEngine output and not from any of the service containers it is running, use journalctl -u balena.service -t balenad. The -t is the shortened form of --identifier=<id>, which in this case ensures that only messages from the balenad syslog identifier are shown.
We'll discuss issues with balenaEngine and the Supervisor later in this masterclass.
There are many other health checks that can immediately expose a problem. For example, warnings on low free memory or disk space can expose problems which will exhibit themselves as release updates failing to download, or service containers restarting abnormally (especially if a service runs unchecked and consumes memory until none is left). We'll also go through some of these scenarios later in this masterclass.
Check out the Diagnostics page for more information on tests you can run on the device.
3. Device Access Responsibilities
When accessing a customer's device you have a number of responsibilities, both technically and ethically. A customer assumes that the support agent has a level of understanding and competence, and as such support agents should ensure that these levels are met successfully.
There are some key points which should be followed to ensure that we are never destructive when supporting a customer:
Always ask permission before carrying out non-read actions. This includes situations such as stopping/restarting/starting services which are otherwise functional (such as the Supervisor). This is especially important in cases where this would stop otherwise functioning services (such as stopping balenaEngine).
Ensure that the customer is appraised of any non-trivial non-read actions that you are going to take before you carry those actions out on-device. If they have given you permission to do 'whatever it takes' to get the device running again, you should still pre-empt your actions by communicating this clearly.
During the course of carrying out non-read actions on a device, should the customer be required to answer a query before being able to proceed, make it clear to them what you have already carried out, and that you need a response before continuing. Additionally ensure that any incoming agents that may need to access the device have all of your notes and actions up to this point, so they can take over in your stead.
Never reboot a device without permission, especially in cases where it appears that there is a chance that the device will not recover (which may be the case in situations where the networking is a non-optimal state). It is imperative in these cases that the customer is made aware that this could be an outcome in advance, and that they must explicitly give permission for a reboot to be carried out.
Occasionally it becomes very useful to copy files off from a device, so that they can be shared with the team. This might be logfiles, or the Supervisor database, etc.
A quick way of copying data from a device with a known UUID onto a local machine is to use SSH with your balena support username:
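A sketch using the balenaCloud SSH gateway (substitute your support username, the device UUID, and the file of interest; the database path shown is the one mentioned later in this masterclass):

```shell
# Stream a file from the device's host OS to the local machine
ssh -p 22 support_username@ssh.balena-devices.com host <uuid> \
    "cat /mnt/data/resin-data/balena-supervisor/database.sqlite" > database.sqlite
```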
You can copy data from your local machine onto a device by piping the file in instead:
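The reverse direction pipes a local file into cat on the device (again a sketch; substitute names and paths as appropriate):

```shell
ssh -p 22 support_username@ssh.balena-devices.com host <uuid> \
    "cat > /tmp/logfile.txt" < logfile.txt
```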
4. Accessing a Device using a Gateway Device
It may not always be possible to access the device directly, especially if the VPN component isn't working.
Usually, we're able to stay connected to the device when the OpenVPN service isn't running because we're using a development image, and development images allow passwordless root connections via SSH. Had we been running a production image, then the device would have been in the 'Offline' state, but it would still potentially have had network access and be otherwise running correctly. This brings up an issue, though: how can we connect to a faulty production device in the field?
The answer comes from the mechanism behind how SSH is tunneled through the VPN, and we can actually use another device (in the 'Online' state) on the same local network as an 'Offline' device to do this.
You will need UUIDs of both the gateway ('Online') and target ('Offline') devices, as well as your username and, if possible, the IP address of the target device (by default, the last seen 'Online' state IP address will be used if the IP is not passed). Once you have these details, you can carry this out by executing the following on your host machine:
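A hedged sketch of that command (the exact gateway invocation may differ between tooling versions; verify against current balena documentation before using it on a customer device):

```shell
# Hop through the gateway device to the target's host OS SSH on port 22222
ssh -t -p 22 support_username@ssh.balena-devices.com host <gateway-uuid> \
    "ssh -p 22222 root@<target-ip>"
```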
Should this not work, it's possible that the IP address has changed (and if it has, you will need to specify the correct address). One way to find the correct IP address is to SSH into the gateway device and run the following script (which should work both for legacy DropBear SSH daemons and for those running on more recent balenaOS installations):
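A minimal sketch of such a script (it assumes nc is available on the gateway device; hosts with the balenaOS SSH port 22222 open are flagged as potential balena devices):

```shell
#!/bin/sh
# Scan $prefix.2 to $prefix.254, flagging potential balena devices
prefix="192.168.1"
i=2
while [ "$i" -le 254 ]; do
    ip="$prefix.$i"
    if nc -z -w 1 "$ip" 22222 2>/dev/null; then
        echo "$ip BALENA DEVICE"
    else
        echo "$ip"
    fi
    i=$((i + 1))
done
```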
Ensure you change the prefix variable to the correct prefix for the local network before starting. This script will then go through the range $prefix.2 to $prefix.254, and flag those devices it believes are potential balena devices. This should help you narrow down the address range to try connections to balena devices, substituting the IP address appropriately in the SSH connection command.
All IP addresses will be printed by the script, but those that are potentially balena devices will show up with BALENA DEVICE next to them. If you have multiple potential UUIDs, you'll need to mix and match UUIDs and IP addresses until you find a matching combination.
You may also try using mDNS from the gateway device to locate the IP of the target based on its hostname. Simply ping the .local address and grab the IP that way:
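For example, assuming a device with the (hypothetical) hostname 1234567.local:

```shell
ping -c 1 1234567.local
```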
5. Component Checklist
The key to any support is context. As a support agent, you should have enough context from a customer to start an investigation. If you do not, then you should ask for as much context and detail about the device as you can before starting an investigation on the device.
When accessing a device, there are usually some things you can check to see why a device may be in a broken state. Obviously, this depends on the symptoms a customer has reported, as well as those a support agent may have found when running the device diagnostics. However, there are some common issues that can occur to put a device into a broken state that can be quickly fixed.
The following sections discuss some of the first components to check when carrying out on-device support. The components that should be checked and in what order comes down to the context of support, and the symptoms seen.
5.1 Service Status and Journal Logs
balenaOS uses systemd as its init system, and as such almost all the fundamental components in balenaOS run as systemd services. systemd builds a dependency graph of all of its unit files (in which services are defined) to determine the order that these should be started/shutdown in. This is generated when systemd is run, although there are ways to rebuild this after startup and during normal system execution.
Possibly the most important command is journalctl, which allows you to read the service's journal entries. This takes a variety of switches, the most useful being:
--follow / -f - Continues displaying journal entries until the command is halted (e.g. with Ctrl-C)
--unit=<unitFile> / -u <unitFile> - Specifies the unit file to read journal entries for. Without this, all units' entries are read.
--pager-end / -e - Jump straight to the final entries for a unit.
--all / -a - Show all entries, even if long or with unprintable characters. This is especially useful for displaying the service container logs from user containers when applied to balena.service.
A typical example of using journalctl might be following a service to see what's occurring. Here it is for the Supervisor, following journal entries in real time:
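For example (on balenaOS versions before v2.78.0, substitute resin-supervisor.service):

```shell
# Follow the Supervisor's journal in real time; stop with Ctrl-C
journalctl --follow -u balena-supervisor.service
```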
Any systemd service can be referenced in the same way, and there are some common commands that can be used with services:
systemctl status <serviceName> - Will show the status of a service. This includes whether it is currently loaded and/or enabled, if it is currently active (running) and when it was started, its PID, how much memory it is notionally using (and beware here, this isn't always the amount of physical memory), the command used to run it and finally the last set of entries in its journal log. Here's example output from the OpenVPN service:
systemctl start <serviceName> - Will start a non-running service. Note that this will not restart a service that is already running.
systemctl stop <serviceName> - Will stop a running service. If the service is not running, this command will not do anything.
systemctl restart <serviceName> - Will restart a running service. If the service is not running, this will start it.
systemctl daemon-reload - Will essentially run through the startup process systemd carries out at initialisation, but without restarting services or units. This allows the rebuild of the system dependency graph, and should be carried out if any of the unit files are changed whilst the system is running.
This last command may sound a bit confusing but consider the following. Imagine that you need to dynamically change the balena-supervisor.service unit file to increase the healthcheck timeout on a very slow system. Once that change has been made, you'll want to restart the service. However, first, you need to reload the unit file as this has changed from the loaded version. To do this, you'll run systemctl daemon-reload and then systemctl restart balena-supervisor.service to restart the Supervisor.
In general, there are some core services that need to execute for a device to come online, connect to the balenaCloud VPN, download releases and then run them:
chronyd.service - Responsible for NTP duties and syncing 'real' network time to the device. Note that balenaOS versions less than v2.13.0 used systemd-timesyncd.service as their NTP service, although inspecting it is very similar to that of chronyd.service.
dnsmasq.service - The local DNS service which is used for all host OS lookups (and is the repeater for user service containers by default).
NetworkManager.service - The underlying Network Manager service, ensuring that configured connections are used for networking.
os-config.service - Retrieves settings and configs from the API endpoint, including certificates, authorized keys, the VPN config, etc.
openvpn.service - The VPN service itself, which connects to the balenaCloud VPN, allowing a device to come online (and to be SSHd to and have actions performed on it). Note that in balenaOS versions less than v2.10.0 this was called openvpn-resin.service, but the method for inspecting and dealing with the service is the same.
balena.service - The balenaEngine service, the modified Docker daemon fork that allows the management and running of service images, containers, volumes and networking.
balena-supervisor.service - The balena Supervisor service, responsible for the management of releases, including downloading updates of the app and self-healing (via monitoring), variables (fleet/device), and exposure of these services to containers via an API endpoint.
dbus.service - The DBus daemon socket can be used by services if the io.balena.features.dbus label is applied. This exposes the DBus daemon socket in the container, which allows the service to control several host OS features, including the Network Manager.
Additionally, there are some utility services that, whilst not required for a barebones operation, are also useful:
ModemManager.service - Deals with non-Ethernet or WiFi devices, such as LTE/GSM modems.
avahi-daemon.service - Used to broadcast the device's local hostname (useful in development mode; responds to balena scan).
We'll go into several of these services in the following sections, but generally these are the first points to examine if a system is not behaving as it should, as most issues will be associated with these services.
Additionally there are a large number of utility services that facilitate the services above, such as those to mount the correct partitions for data storage, configuring the Supervisor and running it should it crash, etc.
5.2 Persistent Logs
See the Device Logs docs to learn about persistent logging.
6. Determining Networking Issues
For determining networking issues, check out the Network Debugging masterclass.
7. Working with the config.json File
See the docs on Configuring balenaOS to learn about working with the config.json file.
8. Working with the Supervisor
Service: balena-supervisor.service, or resin-supervisor.service if OS < v2.78.0
The balena Supervisor is the service that carries out the management of the software release on a device, including determining when to download updates, the changing of variables, ensuring services are restarted correctly, etc. It is the on-device agent for balenaCloud.
As such, it's imperative that the Supervisor is operational and healthy at all times, even when a device is not connected to the Internet, as the Supervisor still ensures the running of a device that is offline.
The Supervisor itself is a Docker service that runs alongside any installed user services and the healthcheck container. One major advantage of running it as a Docker service is that it can be updated just like any other service, although carrying that out is slightly different than updating user containers. (See Updating the Supervisor).
Before attempting to debug the Supervisor, it's recommended to upgrade the Supervisor to the latest version, as we frequently release bugfixes and features that may resolve device issues.
Otherwise, assuming you're still logged into your development device, run the following:
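A command matching the description that follows is:

```shell
systemctl status balena-supervisor.service
```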
You can see the Supervisor is just another systemd service (balena-supervisor.service) and that it is started and run by balenaEngine.
Supervisor issues, due to their nature, vary significantly. Issues may commonly be misattributed to the Supervisor. As the Supervisor is verbose about its state and actions, such as the download of images, it tends to be suspected of problems when in fact there are usually other underlying issues. A few examples are:
Networking problems - The Supervisor reports failed downloads or attempts to retrieve the same images repeatedly, where in fact unstable networking is usually the cause.
Service container restarts - The default policy for service containers is to restart if they exit, and this sometimes is misunderstood. If a container is restarting, it's worth ensuring it's not because the container itself is exiting either due to a bug in the service container code or because it has correctly come to the end of its running process.
Release not being downloaded - For instance, a fleet/device has been pinned to a particular version, and a new push is not being downloaded.
It's always worth considering how the system is configured, how releases were produced, how the fleet or device is configured and what the current networking state is when investigating Supervisor issues, to ensure that there isn't something else amiss that the Supervisor is merely exposing via logging.
Another point to note is that the Supervisor is started using healthdog which continually ensures that the Supervisor is present by using balenaEngine to find the Supervisor image. If the image isn't present, or balenaEngine doesn't respond, then the Supervisor is restarted. The default period for this check is 180 seconds. Inspecting /lib/systemd/system/balena-supervisor.service on-device will show whether the timeout period is different for a particular device. For example:
8.1 Restarting the Supervisor
It's rare to actually need a Supervisor restart. The Supervisor will attempt to recover automatically from issues that occur, without the requirement for a restart. When in doubt about whether a restart is required, look at the Supervisor logs and double-check with other on-duty support agents if needed. If fairly certain, it's generally safe to restart the Supervisor, as long as the user is aware that some extra bandwidth and device resources will be used on startup.
There are instances where the Supervisor is incorrectly restarted when in fact the issue could be the corruption of service images, containers, volumes or networking. In these cases, you're better off dealing with the underlying balenaEngine to ensure that anything corrupt is recreated correctly. See the balenaEngine section for more details.
If a restart is required, ensure that you have gathered as much information as possible before a restart, including pertinent logs and symptoms so that investigations can occur asynchronously to determine what occurred and how it may be mitigated in the future. Enabling persistent logging may also be beneficial in cases where symptoms are repeatedly occurring.
To restart the Supervisor, simply restart the systemd service:
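That is:

```shell
systemctl restart balena-supervisor.service
```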
8.2 Updating the Supervisor
Occasionally, there are situations where the Supervisor requires an update. This may be because a device needs to use a new feature or because the version of the Supervisor on a device is outdated and is causing an issue. Usually the best way to achieve this is via a balenaOS update, either from the dashboard or via the command line on the device.
If updating balenaOS is not desirable or a user prefers updating the Supervisor independently, this can easily be accomplished using self-service Supervisor upgrades. Alternatively, this can be programmatically done by using the Node.js SDK method device.setSupervisorRelease.
You can additionally write a script to manage this for a fleet of devices in combination with other SDK functions such as device.getAll.
Note: In order to update the Supervisor release for a device, you must have edit permissions on the device (i.e., more than just support access).
8.3 The Supervisor Database
The Supervisor uses a SQLite database to store persistent state, so in the case of going offline, or a reboot, it knows exactly what state an app should be in, and which images, containers, volumes and networks to apply to it.
This database is located at /mnt/data/resin-data/balena-supervisor/database.sqlite and can be accessed inside the Supervisor container at /data/database.sqlite by running Node. Assuming you're logged into your device, run the following:
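The Supervisor container name can vary between balenaOS versions (balena_supervisor on current versions, resin_supervisor on older ones), so treat this as a sketch:

```shell
balena exec -it balena_supervisor node
```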
This will get you into a Node interpreter in the Supervisor service container. From here, we can use the sqlite3 NPM module used by the Supervisor to make requests to the database:
You can get a list of all the tables used by the Supervisor by issuing:
With these, you can then examine and modify data, if required. Note that there's usually little reason to do so, but this is included for completeness. For example, to examine the configuration used by the Supervisor:
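A sketch of those queries from the Node REPL (the calls are the sqlite3 module's standard API; config is one of the Supervisor's tables):

```javascript
// Load the sqlite3 module bundled with the Supervisor and open its database
const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('/data/database.sqlite');

// List all tables used by the Supervisor
db.all("SELECT name FROM sqlite_master WHERE type='table'", console.log);

// Examine the Supervisor's configuration key/value pairs
db.all('SELECT * FROM config', console.log);
```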
Occasionally, should the Supervisor get into a state where it is unable to determine which release images it should be downloading or running, it is necessary to clear the database. This usually goes hand-in-hand with removing the current containers and putting the Supervisor into a 'first boot' state, whilst keeping the Supervisor and release images. This can be achieved by carrying out the following:
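A sketch of those steps (service and timer names as in current balenaOS; treat this as destructive and confirm with the customer first):

```shell
# Stop the Supervisor and the timer that would otherwise restart it
systemctl stop update-balena-supervisor.timer balena-supervisor.service

# Remove all service containers, including the Supervisor's
balena rm -f $(balena ps -aq)

# Remove the Supervisor database
rm /mnt/data/resin-data/balena-supervisor/database.sqlite
```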
This:
Stops the Supervisor (and the timer that will attempt to restart it).
Removes all current service containers, including the Supervisor.
Removes the Supervisor database. (If for some reason the images also need to be removed, run balena rmi -f $(balena images -q), which will remove all images including the Supervisor image.)
You can now restart the Supervisor:
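Restarting is just starting the stopped units again:

```shell
systemctl start update-balena-supervisor.timer balena-supervisor.service
```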
If you deleted all the images, this will first download the Supervisor image again before restarting it. At this point, the Supervisor will start up as if the device has just been provisioned and already registered, and the device's target release will be freshly downloaded if images were removed before starting the service containers.
9. Working with balenaEngine
Service: balena.service
balenaEngine is balena's fork of Docker, offering a range of added features including real delta downloads, minimal disk writes (for reduced media wear) and other benefits for edge devices, in a small resource footprint.
balenaEngine is responsible for the fetching and storage of service images, as well as their execution as service containers, the creation of defined networks and creation and storage of persistent data to service volumes. As such, it is an extremely important part of balenaOS.
Additionally, as the Supervisor is itself executed as a container, balenaEngine is required for the Supervisor's operation. This means that should balenaEngine fail for some reason, it is likely that the Supervisor will also fail.
Issues with balenaEngine itself are rare, although it can be initially tempting to attribute an issue to balenaEngine instead of the actual underlying cause. A couple of examples of issues which are misattributed to balenaEngine:
Failure to download release service updates - usually because there is an underlying network problem, or possibly issues with free space
Failure to start service containers - most commonly customer services exit abnormally (or don't have appropriate error checking) although a full data partition can also occur, as can corrupt images
Reading the journal for balenaEngine is similar to all other systemd services. Log into your device and then execute the following:
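For example (using -a so that service container output is shown in full, as described earlier):

```shell
journalctl -a -n 100 -u balena.service
```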
What you'll first notice is that there's Supervisor output here. This is because balenaEngine runs the Supervisor and pipes all Supervisor logs to its own service output. This is particularly useful when examining the journal, because it shows both balenaEngine and Supervisor output in the same logs chronologically.
Assuming your device is still running the pushed multicontainer app, we can also see additional logging for all the service containers. To do so, we'll restart balenaEngine so that the services are started again. This output shows the last 50 lines of the balenaEngine journal after a restart.
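A sketch of the restart and the subsequent journal read:

```shell
systemctl restart balena.service
journalctl -a -n 50 -u balena.service
```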
9.1 Service Image, Container and Volume Locations
balenaEngine stores all its writeable data in the /var/lib/docker directory, which is part of the data partition. We can see this by using the mount command:
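For example:

```shell
mount | grep "/var/lib/docker"
```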
All balenaEngine state is stored in here, including images, containers and volume data. Let's take a brief look through the most important directories and explain the layout, which should help with investigations should they be required.
Run balena images on your device:
Each image has an image ID. These identify each image uniquely for operations upon it, such as executing it as a container, removal, etc. We can inspect an image to look at how it's made up:
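For example, substituting an image ID from the balena images output:

```shell
balena inspect <image-id>
```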
Of particular interest here is the "RootFS" section, which shows all of the layers that make up an image. Without going into detail (there are plenty of easily Google-able pages describing this), each balena image is made up of a series of layers, usually associated with a COPY, ADD, RUN, etc. Dockerfile command at build time. Each layer makes up part of the image's file system, and when a service container is created from an image, it uses these layers 'merged' together as the underlying file system.
We can look at these individual layers by making a note of each SHA256 hash ID and then finding this hash in the /var/lib/docker/image/overlay2/layerdb/sha256 directory. This directory holds a set of directories, each named after a unique layer using the SHA256 hash associated with it. Let's look at the layer DB directory:
Note that there are layers named here that are not named in the "RootFS" object in the image description (although base image layers usually are). This is because of the way balenaEngine describes layers and naming internally (which we will not go into here). However, each layer is described by a cache-id which is a randomly generated UUID. You can find the cache-id for a layer by searching for it in the layer DB directory:
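A sketch of that search, substituting a layer hash from the image's "RootFS" section:

```shell
# One of the matches will live in a directory that also contains a diff file
grep -r <layer-sha256> /var/lib/docker/image/overlay2/layerdb/sha256/
```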
In this case, we find two entries, but we're only interested in the directory with the diff file result, as this describes the diff for the layer:
We now have the corresponding cache-id for the layer's directory layout, and we can now examine the file system for this layer (all the diffs are stored in the /var/lib/docker/overlay2/<ID>/diff directory):
You can find the diffs for subsequent layers in the same way.
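The grep-through-the-layer-DB step above can be sketched against a mocked-up tree. All directory names, hashes, and IDs below are invented for illustration; on a device you would run the `grep` against /var/lib/docker/image/overlay2/layerdb/sha256 instead:

```shell
# Build a tiny mock of the layer DB layout
mock=$(mktemp -d)
mkdir -p "$mock/sha256/chain-id-entry"
printf 'sha256:deadbeef' > "$mock/sha256/chain-id-entry/diff"       # the layer's diff ID
printf 'uuid-42-example' > "$mock/sha256/chain-id-entry/cache-id"   # the on-disk layer dir name

# Find the entry whose diff file mentions the layer hash...
entry=$(grep -rl 'deadbeef' "$mock/sha256" | head -n 1)

# ...then read the cache-id stored alongside it
cat "$(dirname "$entry")/cache-id"; echo
```

The cache-id printed at the end is the directory name to look for under /var/lib/docker/overlay2/ when examining the layer's diff.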
However, whilst this allows you to examine all the layers for an image, the situation changes slightly when an image is used to create a container. At this point, a container can also bind to volumes (persistent data directories across container restarts) and writeable layers that are used only for that container (which are not persistent across container restarts). Volumes are described in a later section dealing with media storage. However, we will show an example here of creating a writeable layer in a container and finding it in the appropriate /var/lib/docker directory.
Assuming you're running the source code that goes along with this masterclass, SSH into your device:
You should see something similar. Let's pick the backend service, which in this instance is container 4ce5bebc27c6. We'll exec into it via a bash shell, and create a new file. This will create a new writeable layer for the container:
Now we'll determine where this new file has been stored by balenaEngine. Similarly to the images, any writeable layer ends up in the /var/lib/docker/overlay2/<ID>/diff directory, but to determine the correct layer ID we need to examine the layer DB for it. We do this by looking in the /var/lib/docker/image/overlay2/layerdb/mounts directory, which lists all the currently created containers:
As you can see, there's a list of the container IDs of all containers that have been created. If we look at the mount-id file in the directory for the backend container, it will contain the layer ID of the writeable layer that has been created. From there, we simply look in the appropriate diff layer directory to find our newly created file (the awk command below adds a newline to the end of the discovered value, for clarity):
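The awk trick is worth knowing on its own: mount-id files are written without a trailing newline, so printing one directly leaves the value glued to your next shell prompt. A minimal demo (the file contents here are a made-up placeholder):

```shell
# Write a file with no trailing newline, as balenaEngine does for mount-id
printf 'example-mount-id' > /tmp/mount-id    # hypothetical contents

# cat alone would leave the prompt on the same line; awk prints each
# record followed by a newline, so the ID comes out cleanly:
awk '{ print $0 }' /tmp/mount-id
```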
9.2 Restarting balenaEngine
As with the Supervisor, it's very rare to actually need to carry this out. However, for completeness, should you need to, this again is as simple as carrying out a systemd restart with systemctl restart balena.service:
However, doing so has another side effect. Because the Supervisor itself runs as a container, restarting balenaEngine also stops and restarts the Supervisor. This is another good reason why balenaEngine should only be stopped or restarted if absolutely necessary.
So, when is it absolutely necessary? There are some issues that occasionally occur which might require this.
One example could be due to corruption in the /var/lib/docker directory, usually related to:
Memory exhaustion and investigation
Container start/stop conflicts
Many examples of these are documented in the support knowledge base, so we will not delve into them here.
9.3 Quick Troubleshooting Tips
Here are some quick tips we have found useful when troubleshooting balenaEngine issues.
Running the health check manually. If you suspect balenaEngine's health check may be failing, a good thing to try is running the check manually:
On success, this returns 0 and prints no output. On error, however, it may give you more information than you would get anywhere else.
Is a pull running? It's not always obvious whether an image pull is actually in progress. A quick way to check is to look for any docker-untar processes:
This docker-untar process is spawned by balenaEngine to decompress the image data as it arrives, so it's a subtle but sure clue that a pull is running.
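A hedged sketch of that check (using `pgrep -x` to match the process name exactly, per the description above):

```shell
# If balenaEngine has spawned docker-untar, an image pull is in flight.
if pgrep -x docker-untar > /dev/null 2>&1; then
    echo "image pull in progress"
else
    echo "no pull running"
fi
```

On a machine that is not mid-pull, this prints "no pull running".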
10. Using the Kernel Logs
There are occasionally instances where a problem arises which is not immediately obvious. In these cases, you might see services fail 'randomly', perhaps attached devices don't behave as they should, or maybe spurious reboots occur.
If an issue isn't apparent fairly soon after looking at a device, the examination of the kernel logs can be a useful check to see if anything is causing an issue.
To examine the kernel log on-device, simply run dmesg from the host OS:
The rest of the output is truncated here. Note that the timestamps are in seconds since boot. If you want to display a human-readable time, use the -T switch. This will, however, strip the nanosecond accuracy and reduce the timestamp granularity to whole seconds.
Note that the 'Device Diagnostics' tab from the 'Diagnostics' section of a device also runs dmesg -T and will display these in the output window. However, due to the sheer amount of information presented here, it's sometimes easier to run it on-device.
Some common issues to watch for include:
Under-voltage warnings, signifying that a device is not receiving what it requires from the power supply to operate correctly (these warnings are only present on the Raspberry Pi series).
Block device warnings, which could signify issues with the media that balenaOS is running from (for example, SD card corruption).
Device detection problems, where devices that are expected to appear in the device node list are either not detected or misdetected.
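The checks above can be rolled into one quick grep over the kernel log. The pattern list below is illustrative rather than exhaustive, and reading the ring buffer may require root (so the demo falls back gracefully):

```shell
# Scan the kernel log for under-voltage, block-device, and device
# detection errors. Extend the pattern list for your own hardware.
dmesg -T 2>/dev/null \
    | grep -iE 'under-?voltage|mmc|i/o error|usb.*(fail|error)' \
    || echo "no matches (or dmesg restricted)"
```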
11. Media Issues
See the Storage Media Debugging docs for help with debugging media issues.
12. Device connectivity status
See the Device Connectivity states section of our Device Statuses docs to learn about device connectivity statuses.
Conclusion
In this masterclass, you've learned how to deal with balena devices as a support agent. You should now be confident enough to:
Request access from a customer and access their device, including 'offline' devices on the same network as one that is 'online'.
Run diagnostics checks and understand their results.
Understand the core balenaOS services that make up the system, including the ability to read journals from those services, as well as stopping, starting and restarting them.
Enable persistent logs, and then examine them when required.
Diagnose and handle a variety of network issues.
Understand and work with the config.json configuration file.
Understand the Supervisor's role, including key concepts.
Understand the balenaEngine's role, including key concepts.
Be able to look at kernel logs, and determine some common faults.
Work with media issues, including dealing with full media, working with customer data, and diagnosing corruption issues.
Understand why your device's status is Online (Heartbeat Only) or Online (VPN Only) and how it can impact your app.