The balenaCloud Dashboard includes the ability to run a set of diagnostics on a device to determine its current condition. In most cases, this should be the first step when diagnosing an issue, before accessing the device via SSH. Examining diagnostics and health checks first gives you a clear picture of the state a device is in before SSHing into it, and preserves that information for later reference (should a device end up in a catastrophic state). This helps greatly in a support post-mortem, should one be required.
Currently, the diagnostics feature is only available via the Dashboard.
To run device diagnostics through the balenaCloud dashboard, head to the Diagnostics tab in the sidebar and click the Run checks button to start the tests.
Read more about other diagnostics you can run after checking your device:
As part of the diagnostics suite, you will find a group of checks that can be collectively run on-device. Below is a description of each check, what it means, and how to triage a failure.
A check in this context is defined as a function that returns a result (good/bad status plus some descriptive text), whereas a command is simply a data collection tool without any filtering or logic built in. Checks are intended to be used by everyone, while command output is typically used by support, subject matter experts, and power users.
Diagnostics are split into three separate sections: Device health checks, Device diagnostics and Supervisor state.
This check confirms that the version of balenaOS is >2.x. There is further confirmation that the OS release has not since been removed from production.
As of May 1, 2019, balenaOS 1.x has been deprecated and is now unsupported. For more information, read our blog post: https://www.balena.io/blog/all-good-things-come-to-an-end-including-balenaos-1-x/.
Upgrade your device to the latest balenaOS 2.x (contact support if running 1.x).
Parts of this check depend on a fully functional networking stack (see check_networking).
Often seen on Raspberry Pi devices, these kernel messages indicate that the power supply is insufficient for the device and any attached peripherals. These errors often precede seemingly erratic behavior.
Replace the power supply with a known-good supply (supplying at least 5V / >2.5A).
This check confirms that a given device's memory utilization is below a given threshold (currently 90%). Oversubscribed memory can lead to out-of-memory (OOM) events (learn more about the out-of-memory killer here).
Using a tool like top, scan the process table for the process(es) consuming the most memory (%VSZ) and check for memory leaks in those services.
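The threshold logic above can be sketched in the host OS shell. This is a simplified approximation under stated assumptions (a Linux /proc/meminfo and the 90% threshold described above), not the actual balenaOS implementation:

```shell
#!/bin/sh
# Simplified sketch of the memory health check: utilization must stay
# below a threshold (90%, matching the check described above).
THRESHOLD=90

# Values from /proc/meminfo are in kB.
mem_total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)

used_pct=$(( (mem_total - mem_avail) * 100 / mem_total ))

if [ "$used_pct" -ge "$THRESHOLD" ]; then
    echo "FAIL: memory utilization at ${used_pct}% (threshold ${THRESHOLD}%)"
else
    echo "OK: memory utilization at ${used_pct}%"
fi
```

Using MemAvailable (rather than MemFree) counts reclaimable caches as available, which is closer to what actually triggers OOM pressure.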
This check confirms the container engine is up and healthy. Additionally, this check confirms that there have been no unclean restarts of the engine. These restarts could be caused by crashlooping. The container engine is an integral part of the balenaCloud pipeline.
It is best to let balena's support team take a look before restarting the container engine. At the very least, take a diagnostics snapshot before restarting anything.
This check confirms the Supervisor is up and healthy. The Supervisor is an integral part of the balenaCloud pipeline. The Supervisor depends on the container engine being healthy (see check_container_engine). There is also a check to confirm the running Supervisor is a released version, and that the Supervisor is running the intended release from the API.
It is best to let balena's support team take a look before restarting the supervisor. At the very least, take a diagnostics snapshot before restarting anything.
This check combines a few metrics about the local storage media and reports back any potential issues.
This test confirms that a given device's disk utilization is beneath a given threshold (currently 90%). If a local disk fills up, there are often knock-on issues in the supervisor and release containers.
Run du -a /mnt/data/docker | sort -nr | head -10 in the host OS shell to list the ten largest files and directories. If the results indicate large files in /mnt/data/docker/containers, this often points to a leak in a container that can be cleaned up (runaway logs, too much local data, etc.). Further info can be found in the Device Debugging Masterclass.
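A simplified sketch of the utilization logic follows, using / as the mount point so it runs anywhere (on a device, the data partition would be the mount point of interest); this is an approximation, not the actual diagnostic implementation:

```shell
#!/bin/sh
# Simplified sketch of the disk-utilization check: usage of a mount point
# must stay below a threshold (90%, matching the check described above).
THRESHOLD=90
MOUNTPOINT="/"

# -P forces POSIX output: one line per filesystem, Use% in column 5.
use_pct=$(df -P "$MOUNTPOINT" | awk 'NR==2 {gsub(/%/, "", $5); print $5}')

if [ "$use_pct" -ge "$THRESHOLD" ]; then
    echo "FAIL: $MOUNTPOINT is ${use_pct}% full (threshold ${THRESHOLD}%)"
else
    echo "OK: $MOUNTPOINT is ${use_pct}% full"
fi
```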
This test compares each partition's average write latency to a predefined target (1s). There are some caveats to this test that are worth considering. Since it attempts to categorize a distribution with a point sample, the reported sample size should always be considered. Smaller sample sizes are prone to fluctuations that do not necessarily indicate failure. Additionally, the metric sampled is merely the number of writes disregarding the size of each write, which again may be noisy with few samples. Writes come primarily from application workloads and less often from operating system operations. For more information, see the relevant kernel documentation.
Slow disk writes could indicate faulty hardware or heavy disk I/O. It is best to investigate the hardware further for signs of degradation.
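The metric described above can be approximated from /proc/diskstats (cumulative milliseconds spent writing divided by completed writes). This is a simplified sketch, not the actual diagnostic implementation, and the caveats about sample size apply here too:

```shell
#!/bin/sh
# Simplified sketch of the average-write-latency metric, per block device.
# /proc/diskstats field layout: $3 = device name, $8 = writes completed,
# $11 = milliseconds spent writing. The 1s (1000 ms) target mirrors the
# threshold described above. loop/ram devices are skipped.
report=$(awk '$8 > 0 && $3 !~ /^(loop|ram)/ {
    avg_ms = $11 / $8
    status = (avg_ms > 1000) ? "FAIL" : "OK"
    printf "%s: %s avg write latency %.2f ms over %d writes\n", status, $3, avg_ms, $8
}' /proc/diskstats)
echo "$report"
```

Note the sample-size caveat from the check itself: a device with only a handful of writes can report a misleading average.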
This test confirms that the host OS properly and fully expanded the partition at boot (>80% of the total disk space has been allocated).
Failure to expand the root filesystem can indicate an unhealthy storage medium or potentially a failure during the provisioning process. It is best to contact support, replace the storage media and re-provision the device.
This test confirms that the data partition for the device has been mounted properly.
Failure to mount the data partition can indicate an unhealthy storage medium or other problems on the device. It is best to contact support to investigate further.
This check confirms that the system clock is actively disciplined.
Confirm that NTP is not blocked at the network level, and that any specified upstream NTP servers are accessible. If absolutely necessary, it is possible to temporarily sync the clock using HTTP headers (though this change will not persist across reboots). Further info can be found in the Device Debugging Masterclass.
This check depends on a fully functional networking stack (see check_networking).
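The temporary HTTP-header workaround mentioned above can be sketched as follows. The endpoint choice is an arbitrary assumption, and the date-setting command is printed rather than executed, since changing the clock requires root; this does not persist across reboots, so restoring NTP access is the real fix:

```shell
#!/bin/sh
# Sketch of the HTTP-header clock workaround: fetch the Date header from
# any reachable HTTPS endpoint and use it to set the system clock.
# The endpoint below is an illustrative choice, not a balena requirement.
header_date=$(curl -sI --max-time 10 https://api.balena-cloud.com 2>/dev/null \
    | awk -F': ' 'tolower($1) == "date" {print $2}' | tr -d '\r')

if [ -n "$header_date" ]; then
    # Printed, not run: setting the clock requires root privileges.
    msg="Would run: date -s \"$header_date\""
else
    msg="Could not fetch a Date header (no network access?)"
fi
echo "$msg"
```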
This check looks for evidence of high temperature and CPU throttling.
If sensors are present, this check confirms that the temperature is below 80°C (at which point throttling begins).
This looks for evidence of CPU throttling in kernel log messages.
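The temperature threshold logic can be sketched as follows, assuming the standard Linux thermal sysfs interface; this is an approximation, not the actual diagnostic implementation, and not every platform exposes thermal zones:

```shell
#!/bin/sh
# Simplified sketch of the temperature check. Thermal sysfs reports values
# in millidegrees Celsius; the 80C limit mirrors the threshold above.
LIMIT=80000
found=0

for zone in /sys/class/thermal/thermal_zone*/temp; do
    [ -r "$zone" ] || continue   # glob may not match on some platforms
    found=1
    temp=$(cat "$zone")
    if [ "$temp" -ge "$LIMIT" ]; then
        echo "FAIL: $zone reports $((temp / 1000))C (throttling likely)"
    else
        echo "OK: $zone reports $((temp / 1000))C"
    fi
done

[ "$found" -eq 1 ] || echo "No thermal zones exposed on this platform"
```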
This check confirms that the host OS has not noted any failed boots and rollbacks.
More information is available here; contact support to investigate fully.
This check tests for various common failures at the network endpoints required for a healthy container lifecycle. More information on networking requirements can be found here.
This test confirms that certain FQDNs are resolvable by each of the configured upstream DNS addresses. Only the failed upstream DNS addresses will be shown in the test results.
This test confirms that if a device is using wifi, the signal level is above a threshold.
This test confirms that packets are not dropped during an ICMP ping.
This test confirms that the device can reach a public IPv4 endpoint when an IPv4 route is detected.
This test confirms that the device can reach a public IPv6 endpoint when an IPv6 route is detected. If necessary, you can disable IPv6 entirely on a device that is experiencing issues.
This test confirms that the device can communicate with the balenaCloud API. Commonly, firewalls or man-in-the-middle (MITM) appliances cause SSL failures here.
This test confirms that the device can communicate with the Docker Hub.
This test depends on the container engine being healthy (see check_container_engine).
This test is an end-to-end check that tries to authenticate with the balenaCloud registry, confirming that all other points in the networking stack are behaving properly.
This test depends on the container engine being healthy (see check_container_engine).
Depending on which part of this check failed, there are various fixes and workarounds. Most, however, involve a restrictive local network or an unreliable connection.
Any checks with names beginning with check_service_ come from user-defined services.
These checks interrogate the engine to see if any services are restarting uncleanly/unexpectedly or failing health checks.
We allow users to provide their own health checks using the HEALTHCHECK directive defined in the Dockerfile or docker-compose file.
Any health check output will be collected as-is, truncated to 100 characters, and shown as output along with the exit code.
Investigate the logs of whichever service(s) are restarting uncleanly or failing healthchecks. This issue could be a bug in the error handling or start-up of the aforementioned unhealthy services. These checks are wholly limited in scope to user services and should be triaged by the developer.
This check depends on the container engine being healthy (see check_container_engine).
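A minimal sketch of the HEALTHCHECK directive mentioned above, as it might appear in a service's Dockerfile; the probed URL, interval, timeout, and retry values here are illustrative assumptions, not balena defaults:

```dockerfile
# Illustrative health check: probe a hypothetical local HTTP endpoint.
# A non-zero exit marks the container unhealthy; the command's output
# (truncated to 100 characters) appears in the diagnostics results.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -fsS http://localhost:80/ || exit 1
```

Keep the probe cheap and fast: it runs repeatedly for the life of the container, and its exit code is what these checks report on.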