Benchmarking the Benchmarks

An intriguing result

While running some benchmarks on the Nexus 4 and Nexus 7 recently, we came across an interesting result from Antutu. In the floating point test of the 3.X version of the benchmark, the Nexus 4 scored 2570 but the Nexus 7 scored 2692, even though the Nexus 7 runs at a lower clock frequency.

This intrigued us, but before we could do much more than stroke our beards and go "hmmmm", a new version of Antutu was released. Version 4 of the benchmark had a different result with the Nexus 4 scoring around 1500 and the Nexus 7 stuck at 1200.

This is interesting because the hardware has stayed exactly the same; the only change is in what the benchmark is doing. We can't directly compare the numbers between versions because they are likely doing different things, but we can compare the relative performance of the two devices. The 3.X result was surprising, since you would expect the Nexus 4 to outperform the Nexus 7 because of its faster processor and more advanced architecture.
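To make the comparison concrete, here is a minimal sketch that turns the scores quoted above into a Nexus 4 to Nexus 7 ratio for each benchmark version. The dictionary layout and the function name relative_score are our own illustration, and the version 4 figures are the approximate values given in the text.

```python
# Floating point scores quoted in this article (version 4 figures are approximate).
scores = {
    "3.X": {"Nexus 4": 2570, "Nexus 7": 2692},
    "4": {"Nexus 4": 1500, "Nexus 7": 1200},
}


def relative_score(version):
    """Return the Nexus 4 score divided by the Nexus 7 score for one version."""
    result = scores[version]
    return result["Nexus 4"] / result["Nexus 7"]


for version in scores:
    print(f"Antutu {version}: Nexus 4 / Nexus 7 = {relative_score(version):.2f}")
```

A ratio below 1 means the Nexus 7 scored higher. The ratio moves from roughly 0.95 under 3.X to 1.25 under version 4, so the ranking of the two devices flips even though neither device changed.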

Cache performance? Or something else?

As we have previously seen, the Nexus 4 has a smaller L1 cache than the Nexus 7. Perhaps the same performance hit was occurring in this benchmark? Using our Prism Technology we simulated the instruction and data cache performance, but we found no significant differences. We needed to look elsewhere to discover the cause of the discrepancy. The debug information in the executable itself gave us a clue. Profiling with Prism, we saw that the bulk of the time in version 3.X of Antutu was spent in a function called DoFourier, while version 4 spent its time in a function called DoLU. The algorithm used in the benchmark appears to have changed from a Fourier transform to an LU decomposition. Both are perfectly valid, processor-intensive algorithms.

Benchmark evolution

From the Antutu website we have this quote:

"AnTuTu Benchmark V4.0 does improvement on the tests to CPU integer, CPU float-point, memory, 2D/3D, database and SD-card. Via optimizing the algorithm and improving the supporting programs, AnTuTu keeps steps with the times which can accurately test out the performance level of current Android-based smart devices."

So we shouldn't be too surprised that the update to the benchmark appears to have changed the algorithm which is used for the floating point test. As new architectures appear on the market and more cores are added to smartphone CPUs it is inevitable that benchmarks will have to change to keep up.

In the case of the Antutu benchmark, the change to the floating point test appears to be reasonable. Switching out one floating point algorithm for another is not too controversial. It may be that this algorithm is better for testing some aspect of the CPU or is easier to split across multiple cores. So we decided to look a little closer at what the low level code is doing, to gain a better understanding of the change of relative score between versions for the Nexus 4 and 7.

How good is your floating point test?

It is hard to give a phone a score that can be used to compare it to other devices which may have totally different architectures. Likewise drilling down into a CPU and testing how good it is at "floating point" can rapidly become very complicated. There is definitely something to be said for picking a reasonable algorithm that any phone can do and running it through many iterations to get a number for how fast it goes. It is easily comparable across phones and you would hope that a better phone would go faster. But as the slightly odd result from the Antutu 3.X benchmarks shows, it isn't always that simple.

Our previous article looking at low level differences in the FPU between the Cortex-A9 and the Krait demonstrated that there can be performance differences at the instruction level: some FP operations are faster on the Nexus 4, some are faster on the Nexus 7. Can we explain the performance differences for the different benchmarks? Our business is in making code go faster, so we decided to focus on why the scores are different.

The LU decomposition is a pair of tight loops which correspond to two almost identical blocks of assembler, together taking up over 80% of the runtime of the test. At their core, they load values from a matrix, multiply them together, and subtract the product from a running total. The core floating point calculation boils down to a single vmls.f64 instruction, which performs the multiply and subtract. The only other floating point instruction in the loop loads the floating point value into a register. It is a very elegant algorithm once the compiler has optimized it.
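For readers who have not met the algorithm, the following is a minimal sketch of a textbook LU decomposition (unit lower-triangular L, no pivoting), written to show the loop shape described above. It is not Antutu's code: we have only looked at the compiled DoLU function, so the name lu_decompose, the absence of pivoting, and the loop layout are assumptions made for illustration.

```python
def lu_decompose(a):
    """Return L and U packed into one matrix (L has an implicit unit diagonal).

    Illustrative only: no pivoting, so it assumes no zero pivots occur.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy of the input matrix
    for j in range(n):
        # First tight loop: entries of column j above the diagonal (part of U).
        for i in range(j):
            total = lu[i][j]
            for k in range(i):
                total -= lu[i][k] * lu[k][j]  # multiply, then subtract from total
            lu[i][j] = total
        # Second, almost identical loop: the diagonal entry and those below it.
        for i in range(j, n):
            total = lu[i][j]
            for k in range(j):
                total -= lu[i][k] * lu[k][j]  # same multiply-subtract pattern
            lu[i][j] = total
        # Divide the sub-diagonal entries by the pivot to obtain L.
        for i in range(j + 1, n):
            lu[i][j] /= lu[j][j]
    return lu


print(lu_decompose([[4.0, 3.0], [6.0, 3.0]]))
```

Both inner loops consist of the single statement total -= x * y, which is the pattern a compiler can map onto one multiply-subtract instruction such as vmls.f64 plus the loads that feed it.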

Contrast this with the Fourier transform from Antutu 3.X: it spends around half its time in the Pow library function. The most executed block of code takes up about 22% of the instructions for the test and is composed of a long list of FP instructions. Specifically, it contains:

Instruction     Occurrences
vadd.f64        9
vcvt.f64.t13    1
vdiv.f64        1
vldr            12
vmla.f64        10
vmov            11
vmov.f64        5
vmul.f64        11
vsub.f64        13

It is difficult to say which is the most relevant for an average smartphone user: is doing an LU decomposition more representative than a Fourier transform? However, looking at the different types of instruction exercised by the two versions of the benchmark, it would appear that the Antutu 3.X floating point benchmark is the more comprehensive test. It includes a divide and a convert as well as the more usual multiply, add, subtract, and multiply-accumulate. The Antutu 4 benchmark, on the other hand, is only really doing one operation: the multiply-subtract form of multiply-accumulate.
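As a rough way to quantify that difference in coverage, the sketch below tallies the instruction counts listed above for the Antutu 3.X block and compares the variety of arithmetic instructions with the version 4 loop. The mnemonics are copied verbatim from our listing. Treating vldr, vmov and vmov.f64 as data movement and everything else as arithmetic is our own classification, made for this example.

```python
# Instruction counts for the most executed Antutu 3.X block, as listed above.
fourier_block = {
    "vadd.f64": 9,
    "vcvt.f64.t13": 1,
    "vdiv.f64": 1,
    "vldr": 12,
    "vmov": 11,
    "vmov.f64": 5,
    "vmla.f64": 10,
    "vmul.f64": 11,
    "vsub.f64": 13,
}

# Assumption for this sketch: loads and register moves count as data movement.
DATA_MOVEMENT = {"vldr", "vmov", "vmov.f64"}

# The only arithmetic instruction we identified in the Antutu 4 loops.
lu_arithmetic = {"vmls.f64"}

total = sum(fourier_block.values())
movement = sum(n for op, n in fourier_block.items() if op in DATA_MOVEMENT)
arithmetic = total - movement
fourier_arithmetic = {op for op in fourier_block if op not in DATA_MOVEMENT}

print(f"3.X block: {total} FP instructions, {arithmetic} arithmetic, {movement} data movement")
print(f"Distinct arithmetic mnemonics: 3.X = {len(fourier_arithmetic)}, 4 = {len(lu_arithmetic)}")
```

Under this classification, the 3.X block uses six distinct arithmetic mnemonics where the version 4 loops use one.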

Know your tools

A benchmark is a tool for comparing different devices, but it is also a marketing tool and an optimization tool. Having good benchmark scores is important for manufacturers, and a lot of effort is put into squeezing the maximum performance out of a device to make sure it compares favorably with the competition. For benchmarks to be useful they have to test performance thoroughly and should closely represent real user scenarios. A poorly written benchmark may be vulnerable to attack by a vendor keen on getting a headline, thus failing in its goal of informing the end user about the relative performance of different products. The differences in the Antutu benchmark scores show how updates to a benchmark can significantly change which portion of the system is under test. The Antutu 4 floating point test exercises only a fraction of the instructions that the previous version did, and arguably is no longer as useful in informing users about what to expect from devices benchmarked using it.