Application Crash
When the platform starts your container(s) (be it after a successful build, after a scale operation or after a restart operation), your application enters the run phase of its lifespan. Unfortunately, bad things may still happen during this phase, which can lead your application to crash.
Various reasons can explain why your application’s lifespan ends up shortened. At Scalingo, we distinguish two main kinds of crash: Boot Errors and Runtime Errors.
Understanding Boot Errors
Boot Errors can only occur when your container is still in its starting state, before entering its running state.
These errors are thrown by the platform when it detects that your application doesn’t behave as expected.
There are 3 kinds of Boot Errors: Start Errors, Timeout Errors and Hook Errors.
Understanding Start Errors
The Start Error is the default kind of Boot Error. It is thrown as soon as an unmanaged error is detected and caught by the platform. It can occur at any moment, as long as your app isn’t running yet.
A Start Error makes the deployment fail instantly. Note that the former version of your application (if any) keeps running.
In most cases, a Start Error is caused by a misconfiguration of your application, or by some unmanaged error/exception in your application’s code.
Fixing Start Errors
It’s very likely that an action on your side is required to fix the issue. The deployment logs should help you identify the issue.
Understanding Timeout Errors
When your application has a web or a tcp process type, the process started in the corresponding container(s) MUST bind to the provided network port (PORT environment variable) within 60 seconds. After this deadline, the platform considers the application unreachable and throws a Timeout Error, causing the deployment to fail.
- If this situation arises after a restart or a scale operation, the platform automatically retries to start your container(s). After 20 unsuccessful attempts (with an exponential backoff strategy), the platform gives up. Note that your existing containers keep running during this time and after.
- If you are trying to deploy a new version of an already running application, this new deployment is considered a failure but the previous version of your application keeps running.
- Conversely, if this is a very first deployment, the platform does not make any attempt to recover from the error. The deployment fails with a timeout-error status, letting you know that your application didn’t bind to PORT soon enough.
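To get a rough idea of whether your application binds quickly enough, you can time its startup locally. The following is a minimal sketch, not a platform tool: the wait_for_port helper and its parameters are hypothetical names, and it only checks that something accepts TCP connections on the given port within the deadline. It does not check which address the application listens on.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", deadline_seconds=60.0, interval=0.5):
    """Poll a TCP port until it accepts a connection.

    Returns the elapsed time in seconds, or None if the deadline passes first.
    The 60-second default mirrors the deadline described above.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_seconds:
        try:
            # A successful connection means a process is bound and listening.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Nothing is listening yet: wait a bit and try again.
            time.sleep(interval)
    return None
```

Start your application locally with PORT set to some value, then call wait_for_port with the same value: a result of None suggests the application would also miss the deadline on the platform.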
Fixing Timeout Errors
To fix a Timeout Error, make sure:
- to bind to the provided network port, by using the PORT environment variable.
- to listen on 0.0.0.0 and not on 127.0.0.1.
- that your application is starting quickly enough.
You may need to edit your Procfile to fulfill these requirements.
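As an illustration of the first two requirements, here is a minimal sketch of a web process written with Python’s standard library. The handler, the function names and the local fallback port 5000 are assumptions made for the example; only the PORT environment variable and the 0.0.0.0 address come from the requirements above.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET request with 'ok'."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def listen_address(environ=os.environ):
    # PORT is provided by the platform; 5000 is only a local fallback (assumption).
    # 0.0.0.0 accepts connections on all interfaces, unlike 127.0.0.1.
    return ("0.0.0.0", int(environ.get("PORT", "5000")))


def make_server(environ=os.environ):
    # HTTPServer binds and starts listening as soon as it is constructed.
    return HTTPServer(listen_address(environ), HelloHandler)


def main():
    make_server().serve_forever()
```

To use it as a web process, end the file with a call to main() and reference the file from the web entry of your Procfile. Most web frameworks expose equivalent host and port settings.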
If your application doesn’t need a web or tcp process type, make sure to scale the unnecessary process type to zero.
Understanding Hook Errors
If your application has a postdeploy process type, the platform can throw a Hook Error if the post-deployment process fails.
- In such a case, the deployment fails with a hook-error status.
- If your application is already running, its code isn’t updated and the former version keeps running.
Fixing Hook Errors
Hook Errors are generally caused by an error in your codebase or by some misconfiguration. To recover from one, we advise first investigating the logs of your application to understand the root cause. After fixing it, trigger a new deployment by pushing your updated code to Scalingo.
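When the post-deployment process chains several commands, it helps to stop at the first failure and log which step failed, so the root cause is easy to find in the logs. The sketch below assumes, as is conventional for processes, that a non-zero exit status is what marks the process as failed. The run_steps and main names, and the placeholder step, are hypothetical.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Name the failing step on stderr to make log investigation easier.
            print(f"postdeploy step failed: {command}", file=sys.stderr)
            return result.returncode
    return 0


def main():
    # Placeholder step (assumption): replace with your own post-deployment commands.
    steps = [[sys.executable, "-c", "print('running post-deployment step')"]]
    return run_steps(steps)
```

A script used as the postdeploy process would end with sys.exit(main()), so that the exit status of the failing step is propagated.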
Understanding Runtime Errors
Runtime Errors are errors that happen during the execution of the process. By definition, they can only occur when your container has reached its running state (which means your deployment is considered successful by the platform).
Unfortunately, even after reaching this state, unexpected errors can still lead your application to crash.
The most common causes are:
- Incompatible dependencies
- Segfault of a library/runtime
- Configuration issues
- Bugs in your application code
- Uncaught exception in your code (especially with non-compiled languages)
- Insufficient resources
- Temporary error/unavailability of an external resource
A Runtime Error can have several consequences, depending on the severity of the error and the impact it has on your application:
- Reduced Performance: slow response times typically result in a poor user experience.
- Downtime: the unavailability of your application, even if temporary, can have a significant impact on business operations.
- Data loss, or Corrupt Data: losing strategic data can have serious consequences for your organization or for your users/customers, especially in an HDS context.
- Security Risks: runtime errors can sometimes introduce security risks, especially if your application is too verbose. This can lead to sensitive data leaks and further exploitation.
Mitigating and Preventing Runtime Errors
The very first step to take to mitigate the consequences of a Runtime Error is to ensure that you have regular, tested backups (for databases, this feature is included in all our business plans) and to set up some redundancy.
Additionally, we generally advise having a disaster recovery plan in place. This plan should ideally outline the actions to be taken in the event of such a failure. It may include procedures and step-by-step guides to identify and resolve the encountered error.
We also recommend regularly checking the application’s health. This can be done by checking the metrics, analyzing the logs, or setting up automated anomaly detection. The goal is to identify and fix potential issues before they cause a crash.
Finally, we strongly encourage developers to test their code, to follow best practices and to conduct QA testing in a dedicated environment before migrating to production.
Recovering from a Runtime Error
When a Runtime Error occurs and leads to an application crash, the platform automatically takes some remedial measures to try to recover from it. The very first action consists of logging the crash event in your dashboard. Once done, the platform applies its restart policy, described below:
The platform maintains a crash counter for each running container. This crash counter is set to zero by default.
When a container crashes for the first time (when its crash counter equals zero), the platform sets the crash counter to one and then tries to restart the container. This is done as soon as the failure is detected.
After a successful restart, the container enters a warmup period which always lasts 10 minutes. From here, we distinguish two main cases:
- The container successfully goes past the warmup period. In this case, the platform considers that the restart operation fixed the issue and thus resets the container’s crash counter to zero.
- The container crashes again during the warmup period. In this case, the platform immediately increments the crash counter by one.
  - If the crash counter value is less than 13, then the platform waits some time before restarting the container again. The duration of the waiting period is given by the following formula:
    duration (minutes) = (crash counter value - 2) x 5
  - If the crash counter value equals 2, 5 or 12, the platform sends a warning e-mail to the application’s collaborators.
  - If the crash counter value reaches 13, which means the container keeps crashing after 12 delayed restarts, the platform gives up. The container is crashed and won’t be restarted anymore, which means your application may now be totally unavailable.
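The restart policy described above can be sketched as a small function, which makes it easy to see how the waiting time grows. This is only an illustration of the documented rules, not the platform’s implementation: the function and constant names are hypothetical, and the sketch assumes the formula applies once the crash counter is at least 2, since the first crash is restarted as soon as it is detected.

```python
# Values taken from the policy described above.
GIVE_UP_AT = 13
WARNING_COUNTER_VALUES = {2, 5, 12}


def restart_delay_minutes(crash_counter):
    """Waiting time before the next restart, or None once the platform gives up."""
    if crash_counter >= GIVE_UP_AT:
        return None
    if crash_counter < 2:
        # First crash: restarted as soon as the failure is detected.
        return 0
    # duration (minutes) = (crash counter value - 2) x 5
    return (crash_counter - 2) * 5


def sends_warning_email(crash_counter):
    """True when this crash counter value triggers a warning e-mail."""
    return crash_counter in WARNING_COUNTER_VALUES
```

For crash counter values 2 through 12, this gives waiting times of 0, 5, 10 and so on up to 50 minutes, which adds up to 275 minutes of waiting before the platform gives up.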
Fixing Runtime Errors
Besides the measures taken by the platform, we strongly advise you to carefully investigate the logs of your crashing application as soon as possible. Once your fix is ready, push your updated code to Scalingo to trigger a new deployment, hopefully resolving the issue.
Getting Notified When a Crash Occurs
By default, the platform only notifies the application’s collaborator(s) when a recurrent crash occurs (i.e. when an early Runtime Error keeps occurring, or when a Timeout Error occurs).
You can modify this behavior by tweaking your Notifier’s configuration. The app_crashed, app_crashed_repeated and app_deploy events can be particularly worth considering.