Nexus 4 Cache Performance

It’s not all about the CPU

In SiSoftware’s benchmark comparison of the Samsung Galaxy S3 (Cortex-A9), Galaxy S4 3G (Exynos 5 Octa) and Galaxy S4 LTE (Snapdragon Krait), they remark of the Cortex-A9 comparison on their AES-256 Encrypt/Decrypt benchmark test:

Krait is only 13% faster here, based on the increased memory bandwidth and clock increase we expected more. Its small L0/L1 caches may be to blame.

They highlight a possible problem, and one of the more difficult parts of performance optimization. So we thought we would take a closer look to see what is going on.

The processor speed is only one part of the puzzle when trying to make code run fast. If memory accesses are slow, you cannot feed the processor fast enough to make use of every clock cycle. This is true of applications which process large amounts of data or access data inefficiently, but it can also be true of the instructions that make up the program. Instructions must be fetched from memory just like data, so a very complicated algorithm, or a program that jumps around between many functions, can slow things down more than you might imagine.

The fastest caches live very close to the registers on the chip, and there are separate data and instruction caches for each processor. The larger, slower L2 cache is shared between data and instructions, and across all CPUs in a multicore system. The Krait used in the Nexus 4 has a very fast 4 KB/4 KB L0, a 16 KB/16 KB L1, and a 2 MB L2. This contrasts with the 32 KB/32 KB L1 and 1 MB L2 of the Cortex-A9.

Another difference between the Krait and the A9 is the length of their cache lines. The A9 uses 32-byte lines, the Krait 64-byte lines, which means the Krait pulls in twice as much data whenever it adds something to the cache. That is good if you are accessing lots of data in a row; but if you only use a small portion of each line, you waste cache space on data you never touch.
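The effect of line length on efficiency is easy to sketch. The model below is purely illustrative (the strides and sizes are invented, not measured from either device): it counts how much of each fetched cache line a given access pattern actually uses.

```python
def line_utilization(access_bytes, stride, total, line_len):
    """Model the fraction of each fetched cache line that is actually
    used when reading access_bytes at every stride bytes."""
    lines = set()
    used = 0
    for addr in range(0, total, stride):
        for b in range(addr, addr + access_bytes):
            lines.add(b // line_len)   # record every line the access touches
        used += access_bytes
    fetched = len(lines) * line_len    # every touched line is fetched whole
    return used / fetched

# Reading one 4-byte word every 64 bytes touches a fresh line each time:
# with 64-byte lines only 4/64 of each fetch is useful, with 32-byte
# lines 4/32, so the shorter line wastes half as many bytes per miss.
print(line_utilization(4, 64, 4096, 64))   # 0.0625
print(line_utilization(4, 64, 4096, 32))   # 0.125
```

Dense sequential access, by contrast, uses every byte of every line regardless of line length, which is why longer lines help streaming workloads.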

The SiSoftware article suggests that there may be something about the AES-256 algorithm which exceeds the cache capacity of the Nexus 4. If your data or algorithm does not fit in the L1 cache, you waste time pulling the same data in multiple times. SiSoftware was limited in the article to drawing conclusions from the benchmark numbers. We can perform a deeper analysis using our Prism tool technology, so let's see if they are right.

Replicating the results

To reproduce the results we took an existing implementation of the Rijndael cipher from the MiBench benchmark suite and compiled it for Android. We ran it on both a Nexus 4 and a Nexus 7 to compare the Krait and the Cortex-A9. Since we were comparing complete systems, other differences, such as memory bandwidth, could also play a part. We chose a large block of text as our test data so that the encryption takes about 9-10 seconds on each device.

Test                 Nexus 4 (ns)   Nexus 7 (ns)   Ratio   Nexus 4 (cycles)   Nexus 7 (cycles)
Rijndael – Encrypt   9115916376     10718004000    1.18    13673874564        12861604800
Rijndael – Decrypt   9063695997     10474305000    1.16    13595543995.5      12569166000

Similar to the SiSoftware test, we can see that the Nexus 4 is faster, but, disappointingly, only by around 18%. It is clocked 25% faster, should have faster memory, and may even have other advantages. So something does not seem quite right.
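The cycle columns above are simply the nanosecond timings multiplied by an assumed sustained clock of 1.5 GHz for the Nexus 4 and 1.2 GHz for the Nexus 7, which is also where the 25% figure comes from:

```python
# Encrypt times from the table; the clock rates are the nominal
# maximums for each device, an assumption for this back-of-envelope check.
nexus4_ns, nexus7_ns = 9115916376, 10718004000
nexus4_ghz, nexus7_ghz = 1.5, 1.2

print(nexus7_ns / nexus4_ns)     # ~1.18 observed speedup
print(nexus4_ghz / nexus7_ghz)   # ~1.25 speedup expected from clock alone
print(nexus4_ns * nexus4_ghz)    # cycles column: 13673874564.0
```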

Digging deeper

Now we have these results, we wanted to examine the reasons behind the behaviour. To provide some illumination we turned to the Prism Technology Platform, our profiling and analysis suite. It instruments the application under analysis at runtime and allows us to gather comprehensive information about the memory and processing characteristics of an application in order to identify areas of poor performance. Because it tracks all data accesses, we can build a picture of how well an algorithm is using the cache by modelling the cache’s attributes.

Device    Instructions   Line Length (bytes)   Cache Size (KB)   Cache Lines   L1 Data Compulsory Misses   L1 Data Capacity Misses
Nexus 4   15885136198    64                    16                256           419                         7455912
Nexus 7   15885135967    32                    32                1024          760                         0

Examining the results showed a clear difference. The first time we load a line of data into the cache we have no choice: it will always be a cache miss, so we call it a compulsory miss. These are recorded separately, and show that the Nexus 4 loads fewer, longer cache lines, as expected. The interesting comparison is the number of cache lines the two L1 caches hold. We can see that the Nexus 4 is trying to load more data than fits in its cache. Because of this it suffers capacity misses: times when data has to be reloaded into the cache because it was evicted by something else.

Whenever we miss the L1 cache we pay a performance penalty, because the L2 cache is slower. We do not have timing information for the Krait, but the Cortex-A9 has an L2 latency of 8 cycles compared with 1-2 cycles for L1. So the Nexus 4's performance is being reduced by the size of its cache. Without more detailed architectural information it is hard to estimate exactly what effect the cache misses have on the algorithm, beyond saying that the Nexus 4 will certainly be slowed as a result.
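A minimal sketch of the kind of cache model involved is below. It is fully associative with LRU replacement, a simplification of the real set-associative hardware, and the access stream is invented; only the line length and line count match the Krait's L1 data cache.

```python
from collections import OrderedDict

def simulate_l1(addresses, line_len, num_lines):
    """Fully-associative LRU cache model: classify each miss as
    compulsory (line never seen before) or capacity (line was
    loaded earlier but has since been evicted)."""
    cache = OrderedDict()              # line tag -> None, kept in LRU order
    seen = set()
    compulsory = capacity = 0
    for addr in addresses:
        tag = addr // line_len
        if tag in cache:
            cache.move_to_end(tag)     # hit: refresh LRU position
            continue
        if tag in seen:
            capacity += 1              # reloaded after an eviction
        else:
            compulsory += 1            # first-ever touch of this line
            seen.add(tag)
        cache[tag] = None
        if len(cache) > num_lines:
            cache.popitem(last=False)  # evict the least recently used line
    return compulsory, capacity

# A 32 KB working set streamed twice through a 16 KB cache of 64-byte
# lines (256 lines, as on the Krait) reloads every line on pass two:
accesses = list(range(0, 32768, 64)) * 2
print(simulate_l1(accesses, 64, 256))    # (512, 512)
```

Give the same stream a 32 KB cache (1024 lines) and the capacity misses vanish, which is exactly the pattern the table shows between the two devices.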

We can improve our confidence in that conclusion by adjusting the Rijndael implementation. By default it uses a number of 8 KB lookup tables. Instead we can use slower 2 KB tables, which should make everything fit into the smaller cache of the Nexus 4.
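Some back-of-envelope arithmetic shows why this should help. The exact table layout in the MiBench source varies, so the byte counts below (two 8 KB tables versus two 2 KB tables) are illustrative assumptions, not measurements:

```python
LINE = 64                         # Krait cache line length in bytes
L1_BYTES = 16 * 1024              # Krait L1 data cache size

def lines_needed(nbytes, line_len=LINE):
    """Cache lines occupied by a contiguous region (ceiling division)."""
    return -(-nbytes // line_len)

big_tables   = 2 * 8 * 1024       # assumed: one 8 KB table each for enc/dec
small_tables = 2 * 2 * 1024       # the 2 KB variants

print(lines_needed(big_tables), L1_BYTES // LINE)  # 256 of 256 lines: full
print(lines_needed(small_tables))                  # 64 lines: plenty spare
```

With the large tables the lookup data alone can fill every line of the 16 KB cache, so the input text and stack keep evicting table lines; with the small tables three quarters of the cache is left free.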

Test                                  Nexus 4 (ns)   Nexus 7 (ns)   Ratio   Nexus 4 (cycles)   Nexus 7 (cycles)
Rijndael – Encrypt (smaller tables)   9751472633     14872027000    1.53    14627208949.5      17846432400
Rijndael – Decrypt (smaller tables)   13286006396    19539650000    1.47    19929009594        23447580000

Our changes have made the test run slower on both devices, but the relative performance has shifted in favor of the Nexus 4. With the working set now inside the 16 KB limit of its L1 data cache, the capacity-miss penalty disappears and the Nexus 4 is about 1.5 times faster than the Nexus 7.

Same problem, different place

We went on to survey a range of benchmarks, looking for unusual results to analyze further. One which jumped out at us was the performance of the Nexus 4 on the Mandelbrot part of the Smartbench benchmark. The Nexus 4 was faster overall, but achieved a Mandelbrot score of around 3000, while the Nexus 7 scored over 4200 despite having a slower processor. Further investigation revealed that the benchmark was spending a lot of time in the libskia library. At its lowest level it was drawing the pixels which make up the fractal images, but the associated execution time is spread across a large number of functions. Using Prism we were again able to simulate the behavior of the L1 cache.

Device    Instructions   Compulsory I-Cache Misses   Capacity I-Cache Misses   I-Cache Miss Bytes   Unused Bytes   Unused %
Nexus 4   3236505383     3065                        15431998                  987647872            550337409      55.7%
Nexus 7   3765191562     6892                        2538686                   81237952             35526175       43.7%

This time the difference that stands out is in the instruction cache. The Nexus 4 has about 13 million more cache misses than the Nexus 7. This benchmark fragment executes about a fifth as many instructions as the Rijndael example but suffers roughly twice as many cache misses, so the performance impact of having to go to the L2 cache to fetch instructions will be even greater.
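We can put a rough figure on the stall time. Assuming, and this is only an assumption since the Krait's L2 latency is not published, that each extra instruction-cache miss costs about the 8-cycle L2 latency quoted above for the Cortex-A9:

```python
# Miss counts from the instruction-cache table above.
nexus4_misses = 15431998
nexus7_misses = 2538686
l2_latency_cycles = 8       # Cortex-A9 figure, used as a stand-in estimate

extra_cycles = (nexus4_misses - nexus7_misses) * l2_latency_cycles
print(extra_cycles)         # ~103 million cycles of extra fetch stall
print(extra_cycles / 1.5e9) # ~0.07 s at an assumed 1.5 GHz clock
```

This ignores overlap with out-of-order execution and any L2 misses that go all the way to memory, so treat it as an order-of-magnitude estimate only.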

The Prism Technology Platform gathered detailed information about the program's accesses and modeled the behavior of the cache. This tells us which accesses missed and how much of each cache line is used before it is evicted, information which would otherwise be very hard to collect. The Smartbench example shows that neither the Nexus 4 nor the Nexus 7 uses all of the data it pulls into the cache; in fact the Nexus 4, with its longer cache lines, uses less than half of what it loads into L1. That is wasted space and effort, and improving the efficiency of cache use is one of the advanced optimization strategies that the Prism tools allow us to investigate.
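Those utilization figures follow directly from the table: unused bytes divided by the bytes fetched into the cache. A quick check of the arithmetic:

```python
# Fetched and unused byte counts from the instruction-cache table above.
nexus4_fetched, nexus4_unused = 987647872, 550337409
nexus7_fetched, nexus7_unused = 81237952, 35526175

print(nexus4_unused / nexus4_fetched)   # ~0.557: over half the fetch wasted
print(nexus7_unused / nexus7_fetched)   # ~0.437: shorter lines waste less
```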

Details matter

Performance evaluation and optimization is a complex subject. In the rapidly evolving mobile computing sector, new processors and devices appear frequently, and optimizing not only applications but also the underlying virtual machines and libraries is a constantly shifting problem. The right tools let you gather detailed performance information and move from identifying unusual behavior to discovering its underlying cause; without this low-level information it is difficult even to locate a performance bottleneck, let alone start improving it.

The examples above show the benefits of being able to perform low-level profiling and modeling. They allowed us to confirm the hypothesis made by SiSoftware and narrow our search within the test case to the amount of data the algorithm used. Without such measurements you rely on trial and error, which can take a long time and gives you little understanding of the underlying performance limitations of the system.