The idea of this post came from another blogpost which compared the performance of a little benchmark in C, Go and Python. The surprising result in that blog was, that the Go implementation performed much better than the C version.

The benchmark was a simple program took one command line argument and computed the sum of all integers up until the argument.
I wanted to see what was going on so I tried to run it locally and indeed, when invoked with a parameter 100,000,000 it took 0.259 seconds for the C implementation to finish and only 0.140 seconds for the Go version.

The C version of the benchmark is the following:

[code language=”c”]
#include <stdio.h>
#include <stdio.h>
#include <stdlib.h>

main(int argc, char *argv[])
{

int arg1 = 1;

arg1 = atoi(argv[1]);

long a;
long sum = 0;
/* for loop execution */
for( a = 0; a < arg1; a++ )
{
sum += a;
}
printf("sum: %ld\n", sum);
return 0;
}
[/code]

The blog suggested using some optimization flags for compiling the C program, so I tried:

[code language=”bash”]
gcc -O3 bench.c -o bench -march=native
[/code]

The O3 flag tells gcc to use its most aggressive optimization strategies and march=native tells it to take advantage of the local cpu version when compiling to machine code.
This had quite a dramatic effect: Instead of 0.259 seconds, the entire program now took only 0.001 seconds!
And it stayed this low when I increased the parameter to 1,000,000,000. So it seems that the compiler has rewritten our program to an implementation which only takes constant time.

To find out what was causing this amazing speedup I compiled again using the -S flag, so I would get the assembly code. Here it is, including some comments that explain the instructions:

[code language=”c”]
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _main
.align 4, 0x90
_main: ## @main
.cfi_startproc
## BB#0:
pushq %rbp
Ltmp0:
.cfi_def_cfa_offset 16
Ltmp1:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp2:
.cfi_def_cfa_register %rbp
movq 8(%rsi), %rdi
callq _atoi ; Converts the argument to integer, put the result in ax
xorl %esi, %esi
testl %eax, %eax
jle LBB0_2 ; Skip the calculation if ax is 0
## BB#1: ## %.lr.ph
cltq ; the next couple of lines represent the for-loop:
leaq -1(%rax), %rdx ; dx = ax – 1
leaq -2(%rax), %rcx ; cx = ax – 2
mulxq %rcx, %rcx, %rdx ; cx = cx * dx
shldq $63, %rcx, %rdx ; dx = cx / 2
leaq -1(%rax,%rdx), %rsi ; si = dx + ax – 1
LBB0_2:
leaq L_.str(%rip), %rdi ; ready, transform the result to a string
xorl %eax, %eax
callq _printf ; and print it
xorl %eax, %eax
popq %rbp
retq
.cfi_endproc

.section __TEXT,__cstring,cstring_literals
L_.str: ## @.str
.asciz "sum: %ld\n"

.subsections_via_symbols
[/code]

The compiler has transformed the for-loop into a single calculation:
(ax – 1) * (ax-2) / 2 + ax – 1, which can be simplified to: ax * (ax + 1) / 2
This is the well known formula for a partial sum of an arithmetic sequence with ax elements. Gcc has recognized that our loop could be rewritten as a single operation!

It also uses a couple of micro optimizations along the way. For instance, to compute dx = ax – 1, the lines:

[code language=”c”]
mov %rax, %rdx
dec %rdx
[/code]

would have done the trick. However the compiler chooses to do this in a single instruction:

[code language=”c”]
leaq -1(%rax), %rdx
[/code]

This instruction was originally meant for manipulating address pointers with offsets but it can be used to perform simple 64 bit additions as well. On modern cpu’s it performs a faster than the two separate instructions and it saves an instruction.
Another compiler trick is the following line:

[code language=”c”]
shldq $63, %rcx, %rdx
[/code]

The shld instruction performs a shift left on the second operand rcx and the overflow bits are moved into the third operand, rdx. By left-shifting the 64 bits rcx register 63 positions into rdx, it effectively performs an integer division by 2 on rcx where the result ends up in rdx.
This again saves an instruction but this time the result is equally fast as moving rcx into rdx and then dividing rdx by 2 (by doing a shift right).

Conclusion

The conclusion: writing a proper benchmark is tough, comparing the performance of languages is difficult and modern compilers are amazingly clever.
An open question for me remains what kind of algorithm the compiler uses to recognize the nature of our for-loop. Does it use some simple heuristic or is our for-loop a special case of a generic class of loops that can be simplified?