While looking after a legacy project, my team and I noticed a few timeouts each day on calls to the database - then the following request would work fine. On investigation, the evidence leaned heavily towards an underlying network infrastructure issue - out of our hands. The issue was obvious because it showed up on a large display for all to see. So with a new project in the pipeline, I wanted to protect against this, and came across what Entity Framework 6 calls "Execution Strategies".

Oh, and if you're worried about the name of the strategy we'll end up using, SqlAzureExecutionStrategy, containing Azure, don't fret. It's named this way because it handles transient issues that could occur across the internet on the way to your Azure SQL instance in the cloud. But since Azure SQL is based on SQL Server, we can use it all the same.

If you just want the implementation for easy copy and paste, head to the bottom of the post for Using Execution Strategies.

Execution Strategies

The core of the documentation is found here and there's not much else to say. Simply, EF6 comes with four out of the box and I'll bring up three:

  1. DefaultSqlExecutionStrategy
    This is the one we implicitly use by default. It doesn't retry at all; it just lets the error bubble up, telling the user that they may want to use some resiliency (if issues arise).
  2. DbExecutionStrategy
    The class to inherit if you want to make your own custom execution strategy.
  3. SqlAzureExecutionStrategy
    This is why we're here. It will automatically retry after some errors with some smarts around SQL errors.

Starting out with SqlAzureExecutionStrategy, let's go for a deep dive!

What Errors Will Be Retried?

With the migration of .NET to GitHub we can go straight to the source for SqlAzureExecutionStrategy, here.

But this class isn't too interesting on the whole. It just inherits from DbExecutionStrategy but has a particular line in the ShouldRetryOn() method:

protected override bool ShouldRetryOn(Exception exception)
{
	return SqlAzureRetriableExceptionDetector.ShouldRetryOn(exception);
}

Interesting. Let's go take a peek at that here and oh yes, here is what we're looking for. 21 different SqlException error codes (and a couple of non-SQL issues) that will result in a retry. The errors span from 49918 (not enough resources), to 40613 (database not available), to 10053 (transport errors) and many more - exactly the types of errors that could be intermittent and that we'd want to silently retry.
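To make the idea concrete, here is a minimal sketch of an error-number check in the same spirit. It is not the EF6 source: transientErrorNumbers and IsTransient are hypothetical names, the set holds only the three codes mentioned above rather than the full list, and 2627 is just my pick for an error that should not be retried.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical subset: only the three error numbers named in this post.
// The real detector in EF6 checks more codes than this.
var transientErrorNumbers = new HashSet<int>
{
    49918, // not enough resources to process the request
    40613, // database not currently available
    10053, // transport-level error
};

// Returns true when the SQL error number is one we would silently retry.
bool IsTransient(int sqlErrorNumber) => transientErrorNumbers.Contains(sqlErrorNumber);

Console.WriteLine(IsTransient(40613)); // True
Console.WriteLine(IsTransient(2627));  // False (a key violation will fail again on retry)
```

The point is that the decision is a plain lookup on the error number, so a custom strategy could use its own list in its ShouldRetryOn() override.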

How Are Errors Retried?

Now that we know which errors will be retried, how does the retrying part work? Back in the SqlAzureExecutionStrategy class, the XML documentation on the default constructor gives it away:

/// <summary>
/// Creates a new instance of <see cref="SqlAzureExecutionStrategy" />.
/// </summary>
/// <remarks>
/// The default retry limit is 5, which means that the total amount of time spent between retries is 26 seconds plus the random factor.
/// </remarks>
public SqlAzureExecutionStrategy()
{
}

But for fun we also have a constructor that might be the giveaway of where to start if we wanted to customise some SqlAzureExecutionStrategy parameters:

/// <summary>
/// Creates a new instance of <see cref="SqlAzureExecutionStrategy" /> with the specified limits for
/// number of retries and the delay between retries.
/// </summary>
/// <param name="maxRetryCount"> The maximum number of retry attempts. </param>
/// <param name="maxDelay"> The maximum delay in milliseconds between retries. </param>
public SqlAzureExecutionStrategy(int maxRetryCount, TimeSpan maxDelay)
    :base(maxRetryCount, maxDelay)
{
}

With both constructors relying on the base class, DbExecutionStrategy, let's see what's there. Looking at the constructors:

/// <summary>
/// Creates a new instance of <see cref="DbExecutionStrategy" />.
/// </summary>
/// <remarks>
/// The default retry limit is 5, which means that the total amount of time spent between retries is 26 seconds plus the random factor.
/// </remarks>
protected DbExecutionStrategy()
    : this(DefaultMaxRetryCount, DefaultMaxDelay)
{
}

/// <summary>
/// Creates a new instance of <see cref="DbExecutionStrategy" /> with the specified limits for number of retries and the delay between retries.
/// </summary>
/// <param name="maxRetryCount"> The maximum number of retry attempts. </param>
/// <param name="maxDelay"> The maximum delay in milliseconds between retries. </param>
protected DbExecutionStrategy(int maxRetryCount, TimeSpan maxDelay)
{
    if (maxRetryCount < 0)
    {
        throw new ArgumentOutOfRangeException("maxRetryCount");
    }
    if (maxDelay.TotalMilliseconds < 0.0)
    {
        throw new ArgumentOutOfRangeException("maxDelay");
    }

    _maxRetryCount = maxRetryCount;
    _maxDelay = maxDelay;
}

On the default constructor we get the same XML comment talking about a retry limit of five and a total of 26 seconds. The constructor itself passes along DefaultMaxRetryCount and DefaultMaxDelay, which are declared as a private constant and a private static readonly field in DbExecutionStrategy:

// <summary>
// The default number of retry attempts, must be nonnegative.
// </summary>
private const int DefaultMaxRetryCount = 5;

// <summary>
// The default maximum random factor, must not be lesser than 1.
// </summary>
private const double DefaultRandomFactor = 1.1;

// <summary>
// The default base for the exponential function used to compute the delay between retries, must be positive.
// </summary>
private const double DefaultExponentialBase = 2;

// <summary>
// The default coefficient for the exponential function used to compute the delay between retries, must be nonnegative.
// </summary>
private static readonly TimeSpan DefaultCoefficient = TimeSpan.FromSeconds(1);

// <summary>
// The default maximum time delay between retries, must be nonnegative.
// </summary>
private static readonly TimeSpan DefaultMaxDelay = TimeSpan.FromSeconds(30);

I've also left in a few of the other fields here because they'll help us understand the next part, which is about exponential backoff and jitter. (You may notice DefaultMaxDelay is actually 30 seconds, but the documentation says something about a total time of 26 seconds? We'll get to that after we learn the next part.)

A Quick Aside: Exponential Backoff and Jitter

There are way smarter people from Microsoft, Amazon, Google and the internet in general who can explain it in depth, but in short:

It's a way to avoid clogging a network: we wait longer before each retry of a request, up to a maximum number of retries, at which point we call the request a failure. Imagine a web service getting hammered, slowly dropping more and more requests; each of those requests might immediately retry, causing more traffic and contributing to the congestion. With exponential backoff we can give the server some time to breathe before another retry - if everyone does this, we end up minimising the worst of the effect.

But then we need to think about something else. Let's say a service comes online and everyone who is interested immediately makes a request at exactly the same time and overwhelms the service. That's cool, we have exponential backoff to solve this - except it won't. Everyone's cool off time is the same, so they will all flood the endpoint again at the same time, causing the same issue. This is where jitter comes in.

Jitter adds a slight amount of randomness to the wait time, slightly staggering calls to further ease up on an endpoint. As a simple example, let's compare three made up request types and look at the milliseconds between each retry: one instantly retrying, one with exponential backoff and the final one with exponential backoff and jitter:

| Attempt | Instant retry (ms) | Exponential backoff (ms) | Exponential backoff + jitter (ms) |
| --- | --- | --- | --- |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 1000 | 1185 |
| 3 | 0 | 2000 | 1821 |
| 4 | 0 | 4000 | 4005 |
| 5 | 0 | 8000 | 8227 |

With the jitter, we potentially ease up on hammering the server at the same time as someone else. Both of these together create a more resilient retry methodology that's better for us and the other side.
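For anyone who prefers code to tables, here is a rough sketch of how numbers like those could be produced. Everything in it is made up for illustration: BackoffMs and BackoffWithJitterMs are hypothetical names, and the jitter range of 0.9 to 1.2 is simply one that happens to cover the example values, not something taken from a library.

```csharp
using System;

// Delay before a given attempt, in milliseconds, without jitter.
// Attempt 1 goes out immediately; after that the wait doubles from 1000 ms.
double BackoffMs(int attempt) =>
    attempt <= 1 ? 0 : 1000 * Math.Pow(2, attempt - 2);

// Same delay with jitter: scaled by a random factor in [0.9, 1.2).
// The range is an arbitrary choice for this sketch.
double BackoffWithJitterMs(int attempt, Random random) =>
    BackoffMs(attempt) * (0.9 + random.NextDouble() * 0.3);

var rng = new Random();
for (var attempt = 1; attempt <= 5; attempt++)
{
    Console.WriteLine(
        $"{attempt}: {BackoffMs(attempt)} ms, with jitter {BackoffWithJitterMs(attempt, rng):F0} ms");
}
```

Two clients running this at the same moment would share the backoff column but almost never the jitter column, which is the whole point.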

Back to Investigating Retries

Now that we've had a primer, we can see how cleanly this type of retry is done in DbExecutionStrategy:

/// <summary>
/// Determines whether the operation should be retried and the delay before the next attempt.
/// </summary>
/// <param name="lastException">The exception thrown during the last execution attempt.</param>
/// <returns>
/// Returns the delay indicating how long to wait for before the next execution attempt if the operation should be retried;
/// <c>null</c> otherwise
/// </returns>
protected internal virtual TimeSpan? GetNextDelay(Exception lastException)
{
    _exceptionsEncountered.Add(lastException);

    var currentRetryCount = _exceptionsEncountered.Count - 1;
    if (currentRetryCount < _maxRetryCount)
    {
        var delta = (Math.Pow(DefaultExponentialBase, currentRetryCount) - 1.0)
                    * (1.0 + _random.NextDouble() * (DefaultRandomFactor - 1.0));

        var delay = Math.Min(
            DefaultCoefficient.TotalMilliseconds * delta,
            _maxDelay.TotalMilliseconds);

        return TimeSpan.FromMilliseconds(delay);
    }

    return null;
}

This is the piece we're interested in right now:

var delta = (Math.Pow(DefaultExponentialBase, currentRetryCount) - 1.0) 
          * (1.0 + _random.NextDouble() * (DefaultRandomFactor - 1.0));

The first set of brackets creates our new exponential backoff number which is multiplied by the second set of brackets, the jitter.

Here's what it could look like. I copied this code and ran it a few times to get an idea of the real world numbers (in seconds) we are looking at, especially the jitter:

| currentRetryCount | Values of delta without jitter | Values of delta with jitter (three runs) |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 1 | 1.019, 1.044, 1.072 |
| 2 | 3 | 3.123, 3.230, 3.008 |
| 3 | 7 | 7.148, 7.298, 7.593 |
| 4 | 15 | 15.417, 15.813, 15.172 |
| 5 | 31 | 33.697, 32.866, 32.953 |

As you can see, the values with jitter sit slightly above the values without jitter, i.e. the original backoff values. If the default values are used in DbExecutionStrategy, technically we would never see 31ish. The retry limit of 5 means GetNextDelay returns null once currentRetryCount reaches 5, and even with a higher limit each individual delay is capped at _maxDelay (30 seconds by default) by this code:

var delay = Math.Min(DefaultCoefficient.TotalMilliseconds * delta,
                    _maxDelay.TotalMilliseconds);

And this is where we get 26 seconds from the documentation instead of the 30 outlined by DefaultMaxDelay. DefaultMaxDelay limits a single wait, while 26 is the total of the waits before the five retries we are allowed: 0 + 1 + 3 + 7 + 15 = 26 seconds, plus jitter.

Bringing it all together, the XML documentation for the DbExecutionStrategy describes the backoff and jitter formula as:

min(random(1, 1.1) * (2 ^ retryCount - 1), maxDelay)
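To double check the arithmetic, here is a small sketch that evaluates that formula in seconds. DelaySeconds and TotalSeconds are hypothetical helper names, and the jitter term is passed in as a plain number (instead of being drawn at random) so the results are repeatable. It assumes the defaults shown earlier: a retry limit of 5 and a max delay of 30 seconds.

```csharp
using System;

// min(random(1, 1.1) * (2 ^ retryCount - 1), maxDelay), in seconds.
// jitter stands in for the random(1, 1.1) term.
double DelaySeconds(int retryCount, double jitter, double maxDelaySeconds = 30) =>
    Math.Min(jitter * (Math.Pow(2, retryCount) - 1), maxDelaySeconds);

// Total wait across the default 5 retries (retryCount 0 to 4).
double TotalSeconds(double jitter)
{
    var total = 0.0;
    for (var retryCount = 0; retryCount < 5; retryCount++)
    {
        total += DelaySeconds(retryCount, jitter);
    }
    return total;
}

Console.WriteLine(TotalSeconds(1.0));    // 26: no jitter, 0 + 1 + 3 + 7 + 15
Console.WriteLine(TotalSeconds(1.1));    // the most jitter possible, roughly 28.6
Console.WriteLine(DelaySeconds(5, 1.0)); // 30: a sixth wait of 31 would be capped
```

So "26 seconds plus the random factor" works out to somewhere between 26 and roughly 28.6 seconds of waiting in total.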

Now that we understand how the attempts are spaced apart, we can get to how the actual EF query itself is run.

The running of the query is in the generic Execute() method which, when extremely stripped down, is just a while loop with a Thread.Sleep():

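while (true)
{
    TimeSpan? delay;
    try
    {
        return operation();
    }
    catch (Exception ex)
    {
        // error handling, heavily simplified: give up unless the error is
        // retriable and GetNextDelay() still hands back a delay
        delay = ShouldRetryOn(ex) ? GetNextDelay(ex) : null;
        if (delay == null)
        {
            throw;
        }
    }

    Thread.Sleep(delay.Value);
}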

Looks like something I could write.
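And to back that claim up, here is my own minimal take on such a loop. It is a sketch, not the EF6 implementation: ExecuteWithRetry and its three parameters are hypothetical names, and the "which errors" and "how long" decisions are passed in as delegates so it runs without EF.

```csharp
using System;
using System.Threading;

// Runs operation, retrying while shouldRetry says the exception is transient
// and nextDelay still returns a value. All names here are hypothetical.
T ExecuteWithRetry<T>(
    Func<T> operation,
    Func<Exception, bool> shouldRetry,
    Func<int, TimeSpan?> nextDelay)
{
    var failures = 0;
    while (true)
    {
        TimeSpan? delay;
        try
        {
            return operation();
        }
        catch (Exception ex) when (shouldRetry(ex))
        {
            // Ask for the next wait; null means the retry limit was reached.
            delay = nextDelay(failures++);
            if (delay == null)
            {
                throw;
            }
        }

        Thread.Sleep(delay.Value);
    }
}

// Usage: fail twice with a timeout, then succeed on the third call.
var calls = 0;
var result = ExecuteWithRetry(
    () => ++calls < 3 ? throw new TimeoutException() : "ok",
    ex => ex is TimeoutException,
    retry => retry < 5 ? TimeSpan.FromMilliseconds(10 * retry) : (TimeSpan?)null);

Console.WriteLine($"{result} after {calls} calls"); // ok after 3 calls
```

Exceptions that shouldRetry rejects never enter the catch block, so they surface to the caller straight away.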

Now that we've gone a little deep into how Execution Strategies work in EF6, how about we use them?

Using Execution Strategies

Before I even knew I was looking for execution strategies, I searched around and found this blog post which got me started on the right track. Digging around more on how to implement them, I found this great post that further dissects how errors are handled, with practical examples. It is also where I saw the code I'm going to use. While writing this post I went looking for its origin, and I believe the code was originally written by Rowan Miller when he was still on the EF team. You can still find it on his blog here, or the bit I'm interested in below:

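using System.Data.Entity;                 // DbConfiguration
using System.Data.Entity.Infrastructure;  // IDbExecutionStrategy, DefaultExecutionStrategy
using System.Data.Entity.SqlServer;       // SqlAzureExecutionStrategy
using System.Runtime.Remoting.Messaging;  // CallContext

public class MyConfiguration : DbConfiguration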
{
    public MyConfiguration()
    {
        this.SetExecutionStrategy("System.Data.SqlClient", () => SuspendExecutionStrategy
            ? (IDbExecutionStrategy)new DefaultExecutionStrategy()
            : new SqlAzureExecutionStrategy());
    }

    public static bool SuspendExecutionStrategy
    {
        get
        {
            return (bool?)CallContext.LogicalGetData("SuspendExecutionStrategy") ?? false;
        }
        set
        {
            CallContext.LogicalSetData("SuspendExecutionStrategy", value);
        }
    }
}

I just renamed the class to something else and dropped it into the same project as my EF classes and it seemed to work! A great win to distract me from deadlines but I suffered from the syndrome of copying and pasting without understanding.

The Gotcha

There was a query I wanted to run with nolock, aka a dirty read. It was for a might-be-needed, yucky table scan looking for a piece of old data in a string. Cool, so I'll just set a TransactionScope for the context and...

To be fair, this is exactly in the Microsoft documentation linked earlier:

User initiated transactions are not supported

But it turns out, simple me at the time copied the right piece of code. The very one you see above with SuspendExecutionStrategy, which does exactly what it says on the tin and swaps in the DefaultExecutionStrategy, which allows for user initiated transactions. Reading Rowan Miller's blog, it's exactly what I'm looking for. It is as simple as putting the following before running any queries:

MyConfiguration.SuspendExecutionStrategy = true;

Then this afterwards:

MyConfiguration.SuspendExecutionStrategy = false;

Though considering this variable, SuspendExecutionStrategy, is stored in the CallContext (info here) I believe it should only be present for the duration of the call, meaning any other threads/processes/contexts should have their own version, which by default is false. So maybe you don't need to put a false call after? Maybe put false if later calls are to be retried in that context? I'm no expert, so I'll follow Mr Miller's process and tidy up after myself, even if there are no other database calls.
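If tidying up is the goal, wrapping the flag in a try/finally means it gets reset even when the query throws. The sketch below shows the pattern on its own. Note the swap: it uses AsyncLocal<bool> as a stand-in for the CallContext-backed property so it can run without EF6 or .NET Framework, and WithSuspendedStrategy is a hypothetical helper name. The last two lines are a statement about AsyncLocal only; I haven't verified that CallContext behaves identically.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the CallContext-backed flag: AsyncLocal<bool> is a different
// API, chosen here only so the sketch runs without EF6 or .NET Framework.
var suspendExecutionStrategy = new AsyncLocal<bool>();

// Hypothetical helper: suspend for the duration of action, then always reset.
void WithSuspendedStrategy(Action action)
{
    suspendExecutionStrategy.Value = true;
    try
    {
        action();
    }
    finally
    {
        suspendExecutionStrategy.Value = false;
    }
}

WithSuspendedStrategy(() => Console.WriteLine(suspendExecutionStrategy.Value)); // True
Console.WriteLine(suspendExecutionStrategy.Value); // False

// A value set inside another task does not leak back to the caller.
Task.Run(() => { suspendExecutionStrategy.Value = true; }).Wait();
Console.WriteLine(suspendExecutionStrategy.Value); // False
```

In real code the action would be the block that opens the transaction and runs the queries, with MyConfiguration.SuspendExecutionStrategy taking the place of the AsyncLocal.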

To End

We've gone from needing some resilience, to how SqlAzureExecutionStrategy manages SQL specific errors, to the internals of EF6 strategies, through a primer on exponential backoff and jitter, to what actually gets used.

Execution Strategies are simple to use, especially SqlAzureExecutionStrategy - just paste it into your project and you're done. If you need user initiated transactions, they're catered for too. Rowan Miller wrote a beautiful piece of code that I'll continue to use whenever I need to use EF6.