Fail better, heal faster

I want to talk about something I don’t usually see much content out there, but is something I personally think is key for a healthy application: Dealing with failed jobs.

I'm going to assume at this point you already understand how background queues work. In case you don’t, this might be a good start.

Dealing with failed jobs is that kind of thing that is very easy to look away when you have another 8373653 things to do and tickets to get done. You may even think is not even your job to care about this (and it may very well not be indeed). But — IMO — I think is something all developers in a team should be aware of and thinking at all times.

So, in this article, I'll go over some tips & tricks on how to deal with failed jobs, but not only after they fail, but also while they are failing. This post is going to show some examples using the Laravel queue system, but the main concepts should be easily applied to any language/framework.

Whatever can go wrong will go wrong

You know about Murphy's law, right? So you already know that things are going to fail, so try to embrace it as much as you can. Your code can work perfectly for months, and then one beautiful day that shitty API you had to integrate with changed without any previous notice, or AWS is having an issue or [insert your random failure reason here].

The fact is, something unexpected is going to happen. So Deal with things you know how to deal, and make sure you "fail better" when you don't know how to deal. One of the key things for failing better is having enough information to debug what happened after the fact. Usually, logging is the best tool for the job for this kind of thing.

Let's look at a simple code example to illustrate this better:

<?php 

class UploadImageJob implements ShouldQueue
{
    public function __construct($path)
    {
        $this->path = $path;
    }

    public function handle(UploaderClient $client)
    {
	      try {
            $client->upload($this->path);
	      } catch (RateLimitException $exception) {
            $this->release($exception->getRetryAfter());
	      } catch (Exception $exception) {
            Log::critical('[UploadImage] Unkown error when trying to upload image', [
                'error' => $exception->getMessage(),
                'path' => $this->path,
            ]);

            $this->fail($exception);
	      }
    }
}

In the above example, we specifically know how to deal with Rate Limit exceptions in the API, by releasing the job back into the queue after our rate limit expires. We are also catching general errors that can happen and applying some good logging information so we can debug what happened later.

Better odds with retries

In the above example, one of the many reasons the job could have failed is because the image upload service we are using was having an outage. So if your job is not time-sensitive, it may be a good idea to implement some kind of exponential backoff strategy.

In Laravel, this can be done using the number of attempts and the retryAfter method.

<?php 

public function retryAfter()
{
    return now()->addMinutes($this->attempts() * 5);
}

Here, it’s important to be aware of the max number of tries from your queue workers and/or your job. Another way to do this is by calling the release method directly.

<?php 

class UploadImageJob implements ShouldQueue
{
    public function __construct($path)
    {
        $this->path = $path;
    }

    public function handle(UploaderClient $client)
    {
	      try {
            $client->upload($this->path);
	      } catch (Exception $exception) {
            $this->release($this->attempts() * 5);
	      }
    }
}

It’s important to note here that this approach may not be the right approach for every job. So be aware of this.

Keep your failed_jobs table clean

Cloudfare had an outage and affected basically the whole internet, and most of your jobs failed to process. It’s true that during the outage there isn’t much you can do, but you can (and should) always come back to re-run the queued jobs that failed. They are all (or at least they should) be stored in the failed jobs table.

If you don’t clean up one time, the next time it happens you may not do because there are some old records in that table that you are not completely sure the impact they would cause. And so on, until you have 50k failed jobs in the table and you have no choice unless deleting everything or pretending you didn’t remember about that table.

So, <span class="text-highlight">keep your failed jobs clean</span>. And for this, I have a few tips to avoid problems while keeping your table clean.

There are some things you don’t need to or can’t re-run

Re-running some failed jobs could do more harm than good in some situations. So, be aware of this when retrying failed jobs.

You don’t want to re-run a job that processes a transaction that was already processed in the next run of a command or something, for instance. Or send a “Your order was delivered” notification 3 months later doesn’t really make much sense.

Laravel doesn’t really have a great way of retrying or flushing specific jobs in batches, so I wrote this package that adds this functionality in a basic level, so it may be helpful to you.

So, my recommendation here is to remove what you don’t want and re-run what you want. Which brings me to my next point.

Make your jobs as idempotent as you can

Idempotent, fancy word. But basically, you want to make sure that if the same job run multiple times, it won’t charge your customer multiple times for the same thing, for example.

Another situation where this thinking is important is when updating a database record via a job payload. I work on this system recently that receives thousands of payloads to create/update/delete documents. Sometimes, for a number of reasons, these jobs fail. But if I re-run the jobs one day later, a new update for the document could have been executed already, with newer data. So in that case, I implemented a logic like this:

<?php 

class UpdateDocumentJob implements ShouldQueue
{
    public function __construct($document, $payload)
    {
        $this->document = $document;
        $this-> payload = $payload;
    }

    public function handle()
    {
	      if ($this->document->updated_at > $this->payload->updated_at) {
            Log::info("[UpdateDocumentJob] Not updating document because document last updated is greater than the payload last updated", [
                'document' => $this->document->id,
                'document_updated_at' => $this->document-> updated_at,
                'payload_updated_at' => $this->payload->updated_at,
            ]);

            return;
        }
    }
}

Ignoring missing models

<blockquote>Illuminate\Database\Eloquent\ModelNotFoundException: No query results for model</blockquote>

I bet you had a few of these in your Laravel life, right? In queued jobs, this can happen because Laravel serializes and unserializes your models when sending/retrieving the jobs to/from the queue. That's what that SerializesModels trait does.

<?php 

class SendArticleUpdatedWebhookJob implements ShouldQueue
{
    use SerializesModels;

    public $deleteWhenMissingModels = true;

    public function __construct(Article $article)
    {
        $this->article = $article;
    }
}

If Laravel tries to unserialize your model (behind the scenes, it basically tries to fetch it from the database again with a findOrFail) and the model was deleted, the ModelNotFoundException will be thrown. In case you don't care about this in your job, you can simply set the deleteWhenMissingModels property in your job and Laravel is going to ignore missing models and it won't send the job to the failed_jobs table.

This may be an edge case, but depending on how many jobs you are processing it really helps to keep things clean.

Logging

Queued jobs run in async ways, and one of the best tools you have to understand what is going on (even on success), is logging. This sometimes can sound a little dumb when you are writing the log message in someplace you know exactly what’s going on, but I can’t tell you how many times I was bitten by this.

<span class="text-highlight">Your local data and your test data are not your production data.</span>. Remember the “Whatever can go wrong, will go wrong” part? Sometimes a simple thing you log can literally save you hours of debugging.

One thing I find very useful is to use all the different available log levels and set up some good alerting system based on the levels.

And the last thing, which is kind of more related to your whole stack, is having a centralized place for logs where you can search and set up alerts. You may have noticed that I like to prefix my log entries with the job name as a "namespace". This is REALLY helpful for searching logs and created separated log streams and alerts.

That's all I have, hopefully, if you get all the way down here, this was somehow helpful to you.