/ C++OPTIMIZATION

# Optimization without Inlining

Inlining is one of the most important compiler optimizations. We can often write abstractions and thin wrapper functions without incurring any performance penalty, because the compiler will expand the method for us at call site.

If a function is not inlined, conventional wisdom says that the compiler has to assume that the method can modify any global state and change the memory behind any pointer or reference that might have “escaped”.

In this short post, I’ll demonstrate exactly this effect. Furthermore, we will see that even if a function is not inlined, as long as the implementation is visible, some optimizations are still performed and sometimes to great effect.

# Example

Let us consider this simple test function,

// implemented in different TU or otherwise linked in
int foo(int n);

int test(int const& n)
{
auto sum = 0;
sum += foo(n);
sum += foo(n);
return sum;
}


which results in the following assembly:

test(int const&):
push rbp
push rbx
push rax
mov rbx, rdi
mov edi, dword ptr [rdi] // load n from memory
call foo(int)            // call foo
mov ebp, eax
mov edi, dword ptr [rbx] // load n from memory _again_
call foo(int)            // call foo
pop rbx
pop rbp
ret


Unsurprising, foo is called twice. For all we know, it has important side effects that have to be executed twice. Maybe a bit more subtle: n had to be reloaded from memory, because the first call to foo might have changed the memory. (For example, n might be a reference to a global int that foo happens to increment.)

This is really the worst case for the compiler. Nothing about the inner workings of foo is known.

On the other end of the spectrum, we have a small, known foo:

int foo(int n)
{
return n * n;
}

int test(int const& n)
{
auto sum = 0;
sum += foo(n);
sum += foo(n);
return sum;
}


which results in the following assembly:

test(int const&):
mov eax, dword ptr [rdi] // load n from memory
imul eax, eax            // tmp = n*n
add eax, eax             // return tmp + tmp
ret


Not only is the call to foo inlined, further optimizations made sure that n * n is not computed twice.

# To Inline or not to Inline

If foo is part of a different TU, the compiler can obviously not inline the call. Should you have enough patience, link-time optimization might come to the rescue. (It’s usually really expensive, so I can only really recommend it for CI building release versions.)

However, there are many other reasons why foo is not inlined. In general, inlining is a double-edged sword. Each inlined function increases the pressure on the instruction cache and makes the parent function harder to analyze. More local variables mean more register spills. All these can negatively affect performance and compilers have various heuristic to try to make good decisions.

Most small functions will get inlined by default. As the complexity of the function body rises, this will stop at some point. Recursion usually prevents inlining, though compilers are definitely able to inline recursive functions via tail-call elimination.

We can artificially prevent inlining by declaring the function __declspec(noinline) (for msvc) or __attribute__((noinline)) (for gcc/clang):

__attribute__((noinline)) int foo(int n)
{
return n * n;
}

int test(int const& n)
{
auto sum = 0;
sum += foo(n);
sum += foo(n);
return sum;
}


which results in the following assembly:

test(int const&):
mov edi, dword ptr [rdi] // load n from memory
call foo(int)            // tmp = foo(n)
add eax, eax             // return tmp + tmp
ret


So, obviously foo was not inlined. However, the compiler could still see and analyze it. It came to the conclusion that foo does not read or modify global state. foo was flagged as a pure function.

Pure functions are nice. Pure functions return the same result if given the same input. Pure functions do not modify global state, they work solely on their input values. Thus, foo(n) + foo(n) was simplified to tmp = foo(n); tmp + tmp, even without inlining.

# Fun with Loops

The difference becomes even larger with loops:

// different TU
int foo(int n);

int test(int const& n)
{
auto sum = 0;
for (auto i = 0; i <= n; ++i)
sum += foo(n);
return sum;
}


which results in the following assembly:

test(int const&):
push rbp
push r14
push rbx
mov r14, rdi
mov edi, dword ptr [rdi]
xor ebp, ebp
test edi, edi // n < 0 case
js .LBB0_3
mov ebx, -1
.LBB0_2:         // loop body begin
call foo(int)            // call foo
mov edi, dword ptr [r14] // reload n from memory
inc ebx
cmp ebx, edi
jl .LBB0_2     // loop body end
.LBB0_3:
mov eax, ebp
pop rbx
pop r14
pop rbp
ret


With no information, the compiler has to call foo (and load n from memory) in each iteration. Contrast this with:

__attribute__((noinline)) int foo(int n)
{
return n * n;
}

int test(int const& n)
{
auto sum = 0;
for (auto i = 0; i <= n; ++i)
sum += foo(n);
return sum;
}


which results in the following assembly:

test(int const&):
mov edx, dword ptr [rdi]
test edx, edx
js .L5 // special case n < 0
mov edi, edx
call foo(int)  // tmp1 = foo(n)
imul edx, eax  // tmp2 = tmp1 * n
add eax, edx   // return tmp2 + n
ret
.L5:
xor eax, eax // return 0
ret


So gcc is able to optimize the whole thing to basically foo(n) * (n + 1), without inlining. Funnily enough, clang tries (and fails) to be clever with lots of SIMD.

# Conclusion

This is not a long post, but it shows that while inlining is a very important optimization, a non-inlined function is not the end of the world optimization. As long as the function implementation is visible, compilers can and will analyze them so that they don’t have to assume the worst case. This, in turn, re-enables many optimizations such as value numbering, common subexpression elimination, and loop-invariant code motion.

PS: gcc and clang have a variety of function attributes that can be used to enable optimizations, even if the implementation is in a different TU. The two most important ones are __attribute__((const)) and __attribute__((pure)).

(Title image from unsplash)