Arm has announced three new Armv9-based CPUs: the Arm Cortex-X2, the Cortex-A710, and the Cortex-A510.
Arm’s CPU designs are used in the vast majority of Android smartphones today, with everyone from Google and OnePlus to Samsung and Huawei using the company’s CPUs in some form. These companies license Arm’s CPU cores and use them together with a GPU, NPU, ISP, DSP, etc., to make a system-on-a-chip (SoC). For example, the Snapdragon 888 uses a Cortex-X1, three Cortex-A78 cores, and four Cortex-A55 cores.
Those are all 64-bit Armv8 CPU designs. Arm recently launched its new instruction set architecture (ISA) for the next decade, Armv9. The new architecture is 64-bit and backward compatible with Armv8 but adds future-proofing tech like the Scalable Vector Extension 2 (SVE2) and security-related features like the Memory Tagging Extension (MTE). The move means the company needs to upgrade all three of its mobile CPU tiers to Armv9, so we’re getting three new CPU core designs in one batch. Here’s what we know about them!
Cortex-X2: The performance core gets more performance
The Cortex-X1 was the first CPU core from Arm’s Cortex-X Custom (CXC) program. This focuses on performance over efficiency, even more so than Arm’s traditional big cores. The Cortex-X1 has found its way into the Exynos 2100 and Snapdragon 888 chipsets, serving as the new prime core in these SoCs. Because it is tweaked for performance, there is normally only one X core on a mobile device. However, there is always the potential for multiple Cortex-X cores in an SoC designed for Chromebooks or other laptops.
Now, Arm has revealed the Cortex-X2. It is a 64-bit only (no 32-bit mode) Armv9-based CPU with the potential of a 16% performance improvement over the X1 (if built using the same manufacturing process and clock frequencies).
The company expects the processors using the Cortex-X2 to offer up to a 30% performance boost over 2021’s flagship phones (which use the X1) when other improvements like more cache are taken into account. Arm also says you can expect a 2x boost to machine learning performance over the X1.
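The two figures can be related with some quick arithmetic. The sketch below assumes the gains compound multiplicatively, which is our assumption rather than anything Arm has stated, and estimates how much of the "up to 30%" would have to come from factors outside the core design itself (cache, process, clocks). The function name is made up for illustration.

```python
# Rough sketch: split Arm's "up to 30%" figure into the core's share and the rest.
# Assumption (not from Arm): the individual gains multiply rather than add.

def residual_gain(total_gain: float, core_gain: float) -> float:
    """Return the extra speed-up needed on top of core_gain to reach total_gain."""
    return (1 + total_gain) / (1 + core_gain) - 1


if __name__ == "__main__":
    extra = residual_gain(total_gain=0.30, core_gain=0.16)
    print(f"Implied gain from cache, process, and clocks: {extra:.1%}")
```

Under that assumption, roughly 12% of the uplift would need to come from those other improvements.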
To find the extra performance, the X2 designers have decoupled branch prediction from instruction fetch. This means the branch predictor can run ahead of the fetch, smoothing out gaps that would otherwise appear in the pipeline due to branching. The predictor itself has also been improved and now includes an alternative path predictor. This results in fewer branch mispredictions, which in turn increases performance.
The graph below shows the reduction in branch mispredictions per 1,000 instructions (MPKI) of the X2 compared to the X1.
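For readers unfamiliar with the metric, MPKI is simply the number of mispredicted branches divided by the number of instructions executed, scaled to 1,000 instructions. A minimal sketch follows; the counter values are hypothetical and are not Arm's measurements.

```python
# Minimal sketch of the MPKI metric (mispredictions per 1,000 instructions).
# The counter values below are hypothetical, chosen only to show the arithmetic.

def mpki(mispredictions: int, instructions: int) -> float:
    """Branch mispredictions per 1,000 executed instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return mispredictions * 1000 / instructions


if __name__ == "__main__":
    older_core = mpki(mispredictions=5_000, instructions=1_000_000)
    newer_core = mpki(mispredictions=4_000, instructions=1_000_000)
    reduction = (older_core - newer_core) / older_core
    print(f"MPKI: {older_core:.1f} -> {newer_core:.1f} ({reduction:.0%} fewer misses)")
```

A lower MPKI means fewer pipeline flushes for the same amount of work, which is why the metric is a common way to compare branch predictors.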
The X2 uses a 10-stage pipeline with an increased out-of-order window. Since it is an Armv9 CPU, it implements SVE2, this time with 128-bit vectors. The X2 also improves instruction-level parallelism by increasing the load-store window/structure sizes.
The improved performance can also partially be attributed to increases in cache size. More specifically, while the L2 cache still tops out at 1MB, the L3 cache has been doubled from a maximum of 8MB in the Cortex-X1 and can now support up to 16MB.
Cortex-A710: The big core sips less juice
Arm has also issued a successor to the Cortex-A78, and the company is going with an all-new name in the Cortex-A710.
The Cortex-A710 doesn’t have the same peak performance as the X2, but you still see a respectable 10% performance boost over a Cortex-A78 on the same manufacturing process. But a far bigger improvement is to be had when it comes to machine learning and battery life, as Arm touts a 2x performance gain and 30% efficiency gain, respectively.
Arm has increased the performance by improving the branch predictor accuracy at the front-end of the processor and doubling the capacity of key branch prediction structures, namely the Branch Target Buffer (BTB) and the Global History Buffer (GHB).
For improved efficiency, the A710 is a five-wide core (versus six-wide on the A78) and switches to a 10-stage pipeline (much like the Cortex-X2). In addition, there are changes in the data prefetcher that yield improved coverage and accuracy.
Unlike the X2, the Cortex-A710 also supports AArch32 (i.e., 32-bit apps), a feature that will soon disappear. Arm has announced that by 2023 all its new CPU cores for mobile will be 64-bit only. As on the Cortex-X2, the SVE2 engine is 128 bits wide.
Cortex-A510: Finally, a new little core
Arm hasn’t released a new little core in four years, which is an eternity in smartphone years. Thankfully, the wait is over as the company has launched the Armv9-based Cortex-A510 to pick up where the Cortex-A55 left off.
As you’d expect from a long-overdue upgrade, Arm says the Cortex-A510 brings a 35% performance improvement, a 20% efficiency gain, and a 3x boost to machine learning compared to a Cortex-A55 on the same process.
The company says a combination of a three-wide in-order design (compared to two-wide in the A55), along with branch prediction and data prefetching tech from the Cortex-X project, has contributed to the A510’s improved performance and efficiency. It also uses a three-wide decode, a three-wide issue, features three integer ALU pipelines, and dual load/store pipelines. The load/store pipelines can work as 2x load or 1x load plus 1x store.
The most interesting feature of the Cortex-A510 is its merged-core microarchitecture. Two Cortex-A510 cores can be grouped in a complex. When in a complex, the Cortex-A510 cores share some resources, most notably the L2 cache, the L2 Translation Lookaside Buffer (TLB), and the SIMD engine (meaning floating-point, NEON, and SVE2).
This is a similar idea to simultaneous multithreading (SMT), which you may know as Hyper-Threading, in that parts of the CPU core are shared. However, the Cortex-A510 merged-core microarchitecture is much less drastic. The main parts of the core are still independent, and everything except floating-point and SIMD operations remains on each core. When a core needs to do some vector math, it uses a NEON/SVE2 engine that is shared with the other core in the complex. Some clever fine-grained scheduling between the cores means there is minimal overhead even when both cores are using the vector unit. Under some floating-point heavy benchmarks, Arm is seeing only a 1% dip in math performance.
The advantages of the merged-core microarchitecture setup aren’t so much about performance or energy efficiency, but area. The more transistors in a processor, the more money it costs. This isn’t normally a problem at the high-end. However, price-sensitive phones need to save money wherever possible, right down to how many mm² the CPU core occupies.
Speaking of vector math, since the Cortex-A510 is an Armv9 processor, it implements SVE2. However, unlike the X2 and the A710, the A510 can be built using a 64-bit implementation of SVE2 or a 128-bit one. This gives chip makers the flexibility between area and performance.
Since the Cortex-A510 will also be used in flagship processors, it is possible to create one-core complexes, meaning there are no shared resources. So, to get the best performance from the A510, it needs to use one-core complexes and 128-bit SVE2. An area-conscious version would use two cores per complex and 64-bit SVE2.
There was lots of internal discussion at Arm about the architecture for the Cortex-A510: should it remain an in-order CPU like the Cortex-A53 and Cortex-A55, or should it move to an out-of-order design? In-order designs are very efficient, but the question was, can the desired performance be obtained? The answer is yes; the in-order design was the right way to go for maintaining power efficiency while boosting performance.
To highlight this, Arm makes a comparison to the 2016/2017 Cortex-A73. That CPU design was found in processors like the Qualcomm Snapdragon 835 and phones like the Google Pixel 2. The Cortex-A73 is an 11-stage, out-of-order processor based on Armv8. A smartphone processor that uses just the Cortex-A510 in 2022 will offer 90% of the performance of a Cortex-A73-based smartphone while consuming 35% less power. That also means the Cortex-A510 is faster than the Cortex-A57 and the Cortex-A72! In other words, today’s power-efficiency cores (the little cores) are closing in on the performance levels of past big core CPU designs.
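Those two numbers can be folded into a single performance-per-watt figure. The sketch below is our own back-of-the-envelope arithmetic on Arm's quoted percentages, not a figure Arm has published, and the function name is made up for illustration.

```python
# Back-of-the-envelope sketch: relative performance per watt of the A510
# versus the A73, using only the two percentages quoted by Arm.

def relative_perf_per_watt(relative_performance: float, power_saving: float) -> float:
    """Performance-per-watt ratio versus the baseline core (baseline = 1.0)."""
    relative_power = 1.0 - power_saving
    if relative_power <= 0:
        raise ValueError("power_saving must be below 1.0")
    return relative_performance / relative_power


if __name__ == "__main__":
    ratio = relative_perf_per_watt(relative_performance=0.90, power_saving=0.35)
    print(f"A510 vs A73 performance per watt: {ratio:.2f}x")
```

On that basis, the little core would deliver a bit under 1.4 times the work per unit of power of the older big core.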
Arm has deliberately left the door open for maxed-out configurations of the Cortex-X2 if that is what its partners want to build. There is no technical reason stopping someone from building an octa-core Cortex-X2 processor with up to 16MB L3 cache and 32MB of system-level cache. It would be designed for laptops or even small desktop units. Will someone build such a processor? We can only hope! A potentially more realistic option would be a quad-core Cortex-X2 plus quad-core Cortex-A710 setup, again aimed at Chromebooks or laptops.
We will likely see a repeat of the common 1+3+4 format in the mobile space, but this time with one X2, three A710 cores, and four Cortex-A510 cores. Could this be the setup of Samsung’s mobile processor for the Galaxy S22? Such a processor would theoretically offer a 30% jump in single-core peak performance (thanks to the X2), a 30% increase in sustained efficiency (thanks to the Cortex-A710), and a 35% uplift in little core performance (thanks to the Cortex-A510).
We can expect to see the Cortex-A710 coupled with the Cortex-A510 in either a 4+4 or 2+6 setup for chipmakers who aren’t part of the Cortex-X Custom program. There is also the potential for an octa-core A510 processor or even a quad-core variant. Octa-core Cortex-A53 processors were quite popular, but we didn’t see the same enthusiasm for octa-core Cortex-A55 chips. The Cortex-A510 has the potential to rekindle the passion for such processors, especially considering the area-saving benefits of the merged-core microarchitecture. However, since the Cortex-A510 is 64-bit only, it might limit the appeal in markets that don’t use Google’s services (i.e., haven’t transitioned to 64-bit only apps yet).
When will we see the new CPUs?
Designing modern CPU cores can take years. In fact, the first discussions about the Cortex-A510 took place as early as 2016, and the ideas around the merged-core microarchitecture were being discussed even as far back as the design of the Cortex-A53. The public announcement of these new cores is one of the final steps. Long before we heard about these designs, Arm’s key partners, including Qualcomm, Samsung, and MediaTek, will have already been working with Arm.
This means we can expect to see Armv9 processors announced, using some or all of these cores, towards the end of 2021. Actual phones using these processors might launch as early as the first quarter of 2022.