The hidden cost of delegates
In this post I talk about choosing the right approach to optimisation, and how sometimes the best approach is the simplest.
Improving performance in C#
The thing that is often misunderstood about GC and JIT runtimes is that they are not inherently slow - especially in ecosystems with GCs and JITs as mature as .NET or Java.
The bulk of performance issues come down to the infinite memory problem. Whenever you use memory that hasn't been statically allocated, you have to manage it somehow - with a GC, reference counting, or malloc and free. GCs have a massive advantage here in that they can do some great optimisations that are hard to achieve with manual memory management, like being generational and concurrent.
C# has a number of ways to help you write performant code - all with the "where applicable" caveat:
- Do less (i.e. don't run code in loops that doesn't need to be run)
- Use the right data structures for your Big O requirements (i.e. don't use a `List` when you need a `HashSet`)
- Use value types to reduce heap allocations and GC pressure
- Pool and reuse heap objects
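The value-type point is worth seeing in isolation. This is a minimal sketch with hypothetical `Point` types (not from the engine): the struct is allocated inline, while every class instance is a separate heap object the GC must track and collect.

```csharp
using System;

// Hypothetical types for illustration only.
struct PointStruct { public int X, Y; }
class PointClass  { public int X, Y; }

static class ValueTypeDemo
{
    public static long BytesAllocatedByStructs(int count)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < count; i++)
        {
            var p = new PointStruct { X = i, Y = i }; // lives on the stack - no heap allocation
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static long BytesAllocatedByClasses(int count)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        var sink = new PointClass[count]; // keep instances alive so nothing is elided
        for (int i = 0; i < count; i++)
        {
            sink[i] = new PointClass { X = i, Y = i }; // one heap object per iteration
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main()
    {
        Console.WriteLine($"structs: {BytesAllocatedByStructs(1000)} B");
        Console.WriteLine($"classes: {BytesAllocatedByClasses(1000)} B");
    }
}
```

On any recent .NET runtime the class loop reports kilobytes of allocation while the struct loop reports none (beyond incidental bookkeeping).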
The problem
For my game engine, the code for the game loop looks something like this:
```csharp
while (running)
{
    game.Update();
    game.Render();
    world.UpdateSubscriptions();
}
```
For the Update and Render methods, I'm batching up groups of entities to be operated over, and then calling a system's update method to process them. For UpdateSubscriptions, I'm parallelising the calls to the event handlers.
The critical point is that each time a method is converted to a delegate (implicitly or explicitly cast to an Action), a new object is created on the heap. This is a major problem when you're calling a method tens or hundreds of thousands of times a frame, as you will see in the final benchmark.
```csharp
public class System
{
    private readonly World m_World;
    private readonly Subscription m_Subscription;

    // This method is called once per frame
    public void Update()
    {
        foreach (var entitySegment in m_Subscription.SegmentEntities)
        {
            // The allocation happens silently when converting the method group to an Action here
            m_World.Scheduler.ScheduleBatch(entitySegment, Update);
        }
    }

    protected virtual void Update(ArraySegment<Entity> entities)
    {
        foreach (var entity in entities)
        {
            Update(entity);
        }
    }

    protected virtual void Update(Entity entity)
    {
    }
}
```
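The hidden allocation is easy to observe directly. In this standalone sketch (hypothetical `Handler` method), every method-group-to-delegate conversion produces a fresh delegate object on the heap - the compiler does not cache conversions of instance methods:

```csharp
using System;

class AllocationDemo
{
    void Handler() { }

    static void Main()
    {
        var demo = new AllocationDemo();

        Action a = demo.Handler; // heap allocation
        Action b = demo.Handler; // another heap allocation

        Console.WriteLine(ReferenceEquals(a, b)); // False - two distinct heap objects
        Console.WriteLine(a == b);                // True - delegates compare by target and method
    }
}
```

The two delegates are equal by value but are separate objects, which is exactly the per-call garbage the scheduler loop above keeps producing.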
As the scheduler is going to be enqueuing and running tens or hundreds of thousands of tasks every frame, this is a problem. We can't just do less (although we can tune the batch sizes). That leaves data structures and value types. So what are our options?
Function pointers
C# 9 introduced function pointers (`delegate*`), a way to call a method without allocating a delegate. They're closer to C-style function pointers than to delegates: just a strongly-typed address, with no target object attached.
```csharp
delegate*<ArraySegment<Entity>, void> m_UpdateMethod;
```
This approach won't work here: taking the address of a method with `&` requires the method to be static, and function pointers can't dispatch to virtual methods. Since I'm using inheritance and referring to non-static data, this isn't an option.
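For completeness, this is what the technique looks like where it does apply - a minimal sketch (hypothetical `Double` method, compiled with unsafe code enabled) of calling a static method through a function pointer:

```csharp
using System;

static class FnPtrDemo
{
    static int Double(int x) => x * 2;

    public static unsafe int Run(int value)
    {
        // Just an address - no delegate object is allocated.
        delegate*<int, int> fn = &Double;
        return fn(value);
    }

    static void Main() => Console.WriteLine(Run(21)); // prints 42
}
```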
Reusing delegates
One of the best ways of improving performance in C# while staying with reference types (and delegates give us no choice there - they're always heap objects) is object pooling: you create a pool of objects of a type and reuse them, rather than creating and destroying them. But since there's no way of reassigning fields within a delegate, we can't use this approach either.
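As a reference point, pooling normally looks like this minimal sketch. It works because a pooled object's fields can be reset between uses - exactly what a delegate's target and method don't allow:

```csharp
using System.Collections.Generic;

// A generic object pool: rent an instance instead of allocating,
// return it for reuse instead of letting the GC collect it.
class ObjectPool<T> where T : class, new()
{
    private readonly Stack<T> m_Items = new();

    // Hand out a recycled instance if one is available, otherwise allocate.
    public T Rent() => m_Items.Count > 0 ? m_Items.Pop() : new T();

    // Give the instance back for later reuse.
    public void Return(T item) => m_Items.Push(item);
}
```

Renting, returning, and renting again yields the same instance, so steady-state use allocates nothing.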
However, we can assign the delegate only once and reuse it. To simplify the code, I've chosen to go for a factory pattern, where the factory is a struct so that it can also avoid hitting the heap.
```csharp
public readonly struct RepeatableBatchScheduledTask(
    ArraySegment<Entity> entities,
    RepeatableBatchScheduledTask.UpdateAction batchAction)
{
    public delegate void UpdateAction(in ArraySegment<Entity> entities);

    public void Execute() => batchAction(entities);
}

public readonly struct RepeatableBatchScheduledTaskFactory(RepeatableBatchScheduledTask.UpdateAction batchAction)
{
    public RepeatableBatchScheduledTask Create(ArraySegment<Entity> entities) => new(entities, batchAction);
}
```
```csharp
public class System
{
    private readonly World m_World;
    private readonly Subscription m_Subscription;
    private readonly RepeatableBatchScheduledTaskFactory m_UpdateTask;

    public System()
    {
        m_UpdateTask = new RepeatableBatchScheduledTaskFactory(Update);
    }

    private void Update()
    {
        foreach (var entitySegment in m_Subscription.SegmentEntities)
        {
            m_World.Scheduler.EnqueueBatchTask(m_UpdateTask.Create(entitySegment));
        }
    }

    protected virtual void Update(in ArraySegment<Entity> entities)
    {
        foreach (var entity in entities)
        {
            Update(entity);
        }
    }

    protected virtual void Update(Entity entity)
    {
    }
}
```
Now we have the best of both worlds: we allocate only one object per system for the Update delegate, which lives for the lifetime of the game, and we stay within the bounds of what the language does best, avoiding an unsafe solution.
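The same pattern can be verified end to end in a self-contained sketch with simplified, hypothetical types (plain `int[]` batches standing in for entity segments; the struct primary constructors require C# 12, as in the code above). The factory captures the delegate once, and `Create` only copies structs, so the hot loop allocates nothing further:

```csharp
using System;

readonly struct BatchTask(int[] data, Action<int[]> action)
{
    public void Execute() => action(data);
}

readonly struct BatchTaskFactory(Action<int[]> action)
{
    public BatchTask Create(int[] data) => new(data, action);
}

static class Demo
{
    static void Process(int[] data) { /* per-batch work */ }

    public static long BytesAllocatedInLoop(int iterations)
    {
        var factory = new BatchTaskFactory(Process); // the single delegate allocation
        var segment = new int[16];

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            factory.Create(segment).Execute(); // structs only - no further allocations
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main() => Console.WriteLine(BytesAllocatedInLoop(100_000));
}
```

In a release build the loop reports zero bytes allocated, mirroring the `SchedulingWithSingleAllocation` rows in the benchmark below.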
The benchmark
Here are the results of the work:
```
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22631.3296)
Intel Core i9-10980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-rc.2.23502.2
  [Host]   : .NET 7.0.17 (7.0.1724.11508), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.17 (7.0.1724.11508), X64 RyuJIT AVX2

Job=.NET 7.0  Runtime=.NET 7.0
```
| Method | Iterations | Mean | Ratio | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|
| SchedulingWithRecurringAllocations | 1000 | 106.81 μs | 1 | 7.8125 | 0.8545 | - | 64.24 KB |
| SchedulingWithSingleAllocation | 1000 | 98.42 μs | 0.92 | 0.1221 | - | - | 1.8 KB |
| SchedulingWithRecurringAllocations | 100000 | 19,666.98 μs | 1 | 750 | 718.75 | - | 6251.77 KB |
| SchedulingWithSingleAllocation | 100000 | 19,374.66 μs | 0.99 | - | - | - | 1.83 KB |
| SchedulingWithRecurringAllocations | 1000000 | 268,410.26 μs | 1 | 9000 | 8500 | 1500 | 62503.09 KB |
| SchedulingWithSingleAllocation | 1000000 | 166,118.27 μs | 0.62 | - | - | - | 2.15 KB |
That's quite the result - far less memory allocated, and a much faster execution time for larger batches of tasks.
Conclusion
The best optimisation is often the simplest. By reusing a delegate, we've managed to reduce the memory allocated and the execution time of our code. This is a great example of how implicit syntactic sugar can hide some serious performance issues, and how by knowing what is happening under the hood, you can make your code faster and more efficient.
To identify heap allocations, I use the Heap Allocations Viewer plugin for Rider.