Newer Is Better?

In the middle of 2012, Qualcomm released the Snapdragon S4 (Krait) processor which powers the Google Nexus 4. AnandTech described it like this:

"Krait will be faster regardless of application, regardless of usage model. You're looking at a generational gap in architecture here, not simply a clock bump."

But is newer always better?

Benchmarks exist to help us compare different products on an equal footing, but a modern CPU has so many constituent parts that it is difficult to summarize its characteristics. Chip manufacturers always have to make trade-offs about how to use a limited amount of silicon, so inevitably there will be situations where something unexpected happens. For example, while comparing the Nexus 4 with the Nexus 7 we discovered an interesting result in the AnTuTu 3.x benchmarks: despite being clocked faster, the Nexus 4 performs worse on the floating point part of the benchmark than the theoretically slower Nexus 7.

On paper, the Nexus 4's Krait processor delivers significant enhancements over the Cortex-A9 architecture of the Nexus 7. It can decode three instructions per cycle, has dual-channel LPDDR2 for higher memory bandwidth, pairs a very fast 4k/4k L0 cache with a 16k/16k L1, is built on a 28nm manufacturing process, and has a 128-bit wide NEON bus against the Cortex-A9's 64-bit one. In an effort to better understand the differences between ARM-based processors, we investigated the low-level performance of the Krait compared to the Cortex-A9. Using simple benchmarks to test the speed of low-level operations, we can start to build up a picture of the strengths and weaknesses of various chips, which is of course fundamental to anyone wanting to extract the maximum performance from a device or platform.

The test

To compare the Krait and the Cortex-A9 we chose the Nexus 4 (1.5GHz quad-core Krait) smartphone and the Nexus 7 (1.2GHz quad-core A9) tablet, and wrote synthetic benchmarks to test performance in a tight loop. To begin with, we restricted ourselves to examining performance while repeatedly executing individual instruction types. We built a loop which performed a series of independent operations and repeated the loop many times to obtain an average; repeating the test with different numbers of operations in the loop gave us more information. By starting with only one operation in the loop and increasing from there, we got information about the latency and throughput of the different instruction types. Running the tests on both devices and calculating the number of cycles in the loop allowed us to normalize the results against CPU frequency, and therefore to tell how good the two FPUs were at executing sequences of the same instruction. We could also check the results against the timing information which ARM publishes for the Cortex-A9.
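
To make the structure concrete, here is a minimal sketch of this kind of timing harness in plain C. The iteration count, the use of clock_gettime and the four-accumulator loop body are illustrative assumptions; a real test of a specific instruction would use hand-written assembly so that the exact instruction mix is controlled.

    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 10000000L

    int main(void)
    {
        /* Four accumulators: the adds within one iteration are
           independent of each other, but each accumulator carries a
           dependency from one iteration to the next. */
        float a = 1.0f, b = 1.0f, c = 1.0f, d = 1.0f;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERATIONS; i++) {
            a += 0.5f;   /* N = 4 operations per iteration */
            b += 0.5f;
            c += 0.5f;
            d += 0.5f;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        /* Use the results so the compiler cannot delete the loop. */
        printf("%f %f %f %f\n", a, b, c, d);
        printf("average: %.3f ns per iteration\n", ns / ITERATIONS);
        return 0;
    }

Multiplying the average time per iteration by the clock frequency gives the cycles-per-iteration figure used in the graphs below.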

This test is certainly not exhaustive, and a modern processor has many features which affect its overall performance in day-to-day use. The goal of these tests is not to give each processor a mark out of ten, but to demonstrate how important the details can be when comparing different devices.

Note: within the body of the loop the ADDs are independent of each other, but there is a dependency between iterations of the loop, so if we have a small number of operations in the loop, we will have to wait for the first operation of the previous iteration to complete before we can start to execute the next one.
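
For a 32-bit ARM target, the inner loop for a four-VADD case might look like the following GCC inline-assembly sketch. The register choices, initial values and function name are ours, to illustrate the loop structure rather than to record the exact code we ran.

    /* Sketch of the N = 4 case for VADD.F32 (GCC inline assembly,
       32-bit ARM with VFPv3). Each add targets a different register,
       so within one iteration nothing depends on anything else; each
       register only depends on its own value from the last iteration. */
    static void vadd_loop(long iterations)
    {
        asm volatile(
            "vmov.f32  s0, #1.0     \n\t"
            "vmov.f32  s1, #1.0     \n\t"
            "vmov.f32  s2, #1.0     \n\t"
            "vmov.f32  s3, #1.0     \n\t"
            "vmov.f32  s4, #0.5     \n\t" /* shared second operand */
            "1:                     \n\t"
            "vadd.f32  s0, s0, s4   \n\t" /* four independent ADDs */
            "vadd.f32  s1, s1, s4   \n\t"
            "vadd.f32  s2, s2, s4   \n\t"
            "vadd.f32  s3, s3, s4   \n\t"
            "subs      %0, %0, #1   \n\t" /* decrement loop counter */
            "bne       1b           \n\t"
            : "+r"(iterations)
            :
            : "s0", "s1", "s2", "s3", "s4", "cc");
    }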

Floating Point Add

[Figure 1: cycles per loop for increasing numbers of VADD instructions, Nexus 4 vs Nexus 7]

Looking at the results of our tests for the VADD (formerly known as FADD) instruction, we see that the two devices take an almost identical number of cycles in all cases. So, normalized for clock frequency, both chips have equivalent performance, and it looks like both devices take 5 cycles to perform an ADD.

[Figure 2: the pipelining behaviour behind the shape of the graph]

This shape of graph is caused by the pipelining of instructions: performing 1 ADD takes 5 cycles, but we can start another ADD on cycle 2, which means that 2 ADDs take only 6. The first time we go through the loop we incur the cost of filling the pipeline, but after that we gain performance. In this example we can see that a loop of 5 operations takes on average 5 cycles, the cost of performing 1. Once we have more than 5 ADDs in a row, we have to wait for an ADD to pop out of the end of the pipeline before we can issue the next. We always have to wait for a minimum of 1 instruction to complete, but the pipeline lets us get started on the calculations of the next iteration, which is why the graph stays flat as we add more instructions until we exceed the latency. The rate of increase after that gives us the throughput of the instruction.
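
The flat-then-rising shape can be captured with a very simple model, which is our own simplification rather than anything ARM publishes: the cost per iteration is roughly the larger of the instruction latency and the time needed just to issue the N operations.

    /* Rough pipeline model: cycles per loop iteration for n_ops
       independent operations, given the instruction latency and the
       issue interval (cycles between issues; 1 = one per cycle). */
    static int cycles_per_iteration(int n_ops, int latency, int issue_interval)
    {
        int issue_cost = n_ops * issue_interval;
        return issue_cost > latency ? issue_cost : latency;
    }

    /* For VADD (latency 5, issue interval 1):
       n_ops = 1..5 -> 5 cycles   (the flat part of the graph)
       n_ops = 6    -> 6 cycles   (rising by 1 cycle per extra ADD) */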

Floating Point Multiply

[Figure 3: cycles per loop for increasing numbers of VMUL instructions, Nexus 4 vs Nexus 7]

For the multiply instruction the result is slightly different. ARM's published timings for the Cortex-A9 give a throughput of 1 and a latency of 5; however, it only achieves a latency of 6 in our tests. The Krait appears to have the advantage here, with a lower latency of 5 for single precision floating point multiplies. This is obviously an advantage on top of any improvement brought about by the increase in clock frequency.

Floating Point MAC (Multiply and Accumulate)

[Figure 4: cycles per loop for increasing numbers of MAC instructions, Nexus 4 vs Nexus 7]

The multiply and accumulate instruction tells a different story. Here the Nexus 7 appears to have the advantage. The published instruction timings for the Cortex-A9 give a throughput of 1 and a latency of 8, which matches the graph exactly. The Krait appears to have a higher latency of 9, which is a small disadvantage on an individual instruction, but it also appears to have a throughput of 2, which means it can only issue a MAC instruction once every 2 cycles. So if you have code that performs a lot of MAC operations, you may not see the full benefit of the Krait's higher clock speed, and it might not be as fast as you expect.
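
To see why this matters in practice, consider a dot product, a classic MAC-heavy kernel. The sketch below (our own example, with n assumed to be a multiple of 4) unrolls the loop with four independent accumulators so that the MAC issue rate, rather than latency, becomes the limit; whether the compiler actually emits VMLA instructions for it depends on the compiler and flags.

    /* MAC-bound kernel: four independent accumulators keep the FPU
       busy, so the MAC issue rate limits performance. */
    float dot4(const float *x, const float *y, int n)
    {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        for (int i = 0; i < n; i += 4) {
            a0 += x[i]     * y[i];     /* each line is one MAC */
            a1 += x[i + 1] * y[i + 1];
            a2 += x[i + 2] * y[i + 2];
            a3 += x[i + 3] * y[i + 3];
        }
        return (a0 + a1) + (a2 + a3);
    }

On a core that can issue one MAC per cycle, each pass through this loop costs roughly four cycles of MAC issue; on a core that can only issue one every two cycles, roughly eight.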

[Figure 5: MAC timings in nanoseconds rather than cycles]

If we remove the normalization for clock frequency and look at nanoseconds instead, the Nexus 7 starts off slightly slower, but quickly gains the advantage thanks to its higher throughput.
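
Putting rough numbers on this, using the latencies and throughputs read off the graphs above (so treat them as estimates): a single dependent MAC costs about 9 cycles / 1.5GHz = 6.0ns on the Nexus 4 against 8 cycles / 1.2GHz ≈ 6.7ns on the Nexus 7, which is why the Nexus 7 starts off behind. But once enough independent MACs are in flight, the Nexus 7 retires one per cycle (≈ 0.83ns each) while the Nexus 4 manages one every two cycles (≈ 1.33ns each), and the lower-clocked chip comes out ahead.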

Why is this important?

It may be argued that testing like this does not accurately represent what real code does, and there is a strong case for that. However, it does illustrate the point that it is not straightforward to comprehensively measure the performance of a processor, which is why benchmarks exist. But although benchmarks allow people to easily compare different products, they are only as good as the tests they run, and if they don't represent real-world workloads well, they may leave users with a misleading impression.