
Alongside its new Cortex-A77 CPU core, Arm has unveiled a new GPU destined for next-generation smartphone SoCs. The Mali-G77, not to be confused with the new Mali-D77 display processor, marks Arm’s departure from the Bifrost architecture and the move over to Valhall.

We’ll get into the fine details of the new architecture in a moment. First, we’ll leap right into what users should expect in terms of performance gains.

Mali-G77 performance overview

Arm boasts up to a 40 percent graphics performance boost with next-gen Mali-G77 devices compared with today’s Mali-G76 models. This figure takes into account process as well as architectural improvements. The Mali-G77 is configurable from 7 to 16 shader cores, and each core is almost exactly the same size as a G76 core. This means that high-end smartphones will likely ship with GPU core counts similar to today’s, somewhere in the low teens. Handily, this lets us make some speculative performance assessments against existing chipsets.

Looking at the popular GFXBench Manhattan benchmark, a 40 percent performance boost opens up a sizeable lead over current-generation hardware. Qualcomm’s next-generation Adreno chip will need its own significant performance upgrade to keep the playing field level. The tables appear to be turning in Arm’s favor.

Architecture-wise, gaming performance increases 20 to 40%, while machine learning earns a 60% boost

Based on this rather crude ballparking, a 10-core Mali-G77 (a configuration we often see from Huawei) looks to just about edge out this generation’s top-of-the-line mobile graphics hardware. A 12-core configuration, typically seen in Samsung’s Exynos, provides a big lead for Arm’s latest GPU. Of course, real benchmarks will depend on other factors, including process node, GPU cache memory, LPDDR memory configuration, and the type of application you’re testing. So take the above graph with a hefty dose of salt.

In terms of the new architecture alone, Arm states that the Mali-G77 offers an average 30 percent improvement in both energy efficiency and performance density. There’s also a huge 60 percent boost for machine learning applications, thanks to INT8 dot product support. Gaming performance is expected to rise by somewhere between 20 and 40 percent, depending on the title and the type of graphics workload.

To understand exactly how Arm has achieved this performance uplift, let’s take a deeper dive into the architecture.

Meet Valhall, Bifrost’s successor

Valhall is Arm’s second-generation scalar GPU architecture. It is a 16-wide warp execution engine, which essentially means the GPU executes 16 instructions in parallel per cycle, per processing unit, per core. That’s up from 4- and 8-wide warps in Bifrost.
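To put the warp widths in perspective, here is a minimal Python sketch that counts how many lockstep thread groups (warps) a workload breaks into at each width. The widths come from the article, matched to each GPU as described in the execution engine section. The 1,024-thread workload and the function name `warps_needed` are assumptions chosen purely for illustration, and the sketch says nothing about how the hardware actually schedules those warps.

```python
import math

# Warp widths quoted in the article for each GPU generation.
WARP_WIDTHS = {"Mali-G72": 4, "Mali-G76": 8, "Mali-G77": 16}


def warps_needed(num_threads: int, warp_width: int) -> int:
    """Return how many warps of `warp_width` threads cover `num_threads`."""
    if num_threads < 0 or warp_width <= 0:
        raise ValueError("num_threads must be >= 0 and warp_width must be > 0")
    return math.ceil(num_threads / warp_width)


# Hypothetical workload size, not a figure from Arm.
threads = 1024
for gpu, width in WARP_WIDTHS.items():
    print(f"{gpu}: {warps_needed(threads, width)} warps of {width} threads")
```

A wider warp means fewer groups to track for the same number of threads, which is one reason the control logic can be simplified.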

Other new architectural features include dynamic instruction scheduling managed entirely in hardware and an all-new instruction set that retains operational equivalence to Bifrost. Valhall also adds support for Arm’s AFBC 1.3 compression format, FP16 render targets, layered rendering, and vertex shader outputs.

The Mali-G77 does 33% more math in parallel than the G76.

The keys to understanding the major architectural changes are found by examining the execution unit inside the core. This part of the GPU is responsible for number crunching.

Inside the execution engine

In Bifrost, each GPU core contained three execution engines, or two in the case of some lower-end Mali-G52 designs. Each engine contains an i-cache, register file, and warp control unit. In the Mali-G72, each engine handles 4 instructions per cycle, which increased to 8 in last year’s Mali-G76. Spread across the three engines, that allows for 12 and 24 32-bit floating point (FP32) fused multiply-accumulate (FMA) instructions per cycle, per core, respectively.

With Valhall and the Mali-G77, there’s just a single execution engine inside each GPU core. As before, this engine houses the warp control unit, register file, and i-cache, which are now shared across two processing units. Each processing unit handles 16 warp instructions per cycle, for a total throughput of 32 FP32 FMA instructions per cycle, per core. That’s a 33 percent boost to instruction throughput over the Mali-G76.
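The per-core arithmetic behind those figures is easy to check. The sketch below multiplies engines, processing units per engine, and lanes per unit using only the numbers quoted above. The function name `fma_per_core` is made up for this example, and treating each Bifrost engine as a single processing unit is a simplifying assumption.

```python
def fma_per_core(engines: int, units_per_engine: int, lanes_per_unit: int) -> int:
    """FP32 FMA instructions issued per cycle by one GPU core."""
    return engines * units_per_engine * lanes_per_unit


# Bifrost: three engines per core, modeled here as one unit per engine.
g72 = fma_per_core(engines=3, units_per_engine=1, lanes_per_unit=4)   # 12
g76 = fma_per_core(engines=3, units_per_engine=1, lanes_per_unit=8)   # 24
# Valhall: one engine per core, two processing units, 16 lanes each.
g77 = fma_per_core(engines=1, units_per_engine=2, lanes_per_unit=16)  # 32

# Relative gain of the G77 core over the G76 core.
uplift = (g77 - g76) / g76

for name, rate in (("Mali-G72", g72), ("Mali-G76", g76), ("Mali-G77", g77)):
    print(f"{name}: {rate} FP32 FMA per cycle per core")
print(f"G77 over G76: {uplift:.0%}")
```

Note that this is a peak issue rate per core. Real-world gains also depend on core count, clock speed, and how well a workload fills the wider warps.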

Arm has transitioned from three to just one execution unit per GPU core, but there are now two processing units within a G77 core.

In addition, each of these processing units contains two new mathematical function blocks. The new convert unit (CVT) handles basic integer, logic, branch, and conversion instructions. The special function unit (SFU) accelerates integer multiplication, division, square roots, logarithms, and other complex integer functions.

The standard FMA unit has seen a few tweaks, and now supports 16 FP32, 32 FP16, or 64 INT8 dot product instructions per cycle. These optimizations produce the 60 percent performance uplift in machine learning applications.
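As a rough illustration of why narrower data types help machine learning, the Python sketch below scales the quoted rates up to a full core and shows what an INT8 dot product with a wider accumulator computes. It assumes the quoted rates are per processing unit, and that INT8 values are grouped four at a time (64 / 16 lanes). Neither is confirmed by Arm here, and the names `per_core_rate` and `int8_dot4` are invented for this example.

```python
# FMA unit rates quoted in the article, read here as per processing unit (assumption).
RATES_PER_UNIT = {"FP32": 16, "FP16": 32, "INT8": 64}
UNITS_PER_CORE = 2  # two processing units per Mali-G77 core


def per_core_rate(dtype: str) -> int:
    """Operations per cycle for one core at the given data type."""
    return RATES_PER_UNIT[dtype] * UNITS_PER_CORE


def int8_dot4(a, b, acc: int = 0) -> int:
    """Dot product of two 4-element INT8 vectors, added to a wider accumulator.

    The 4-wide grouping is an assumption for illustration only.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("both inputs must have exactly 4 elements")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError("values must fit in a signed 8-bit integer")
    return acc + sum(x * y for x, y in zip(a, b))


for dtype in RATES_PER_UNIT:
    print(f"{dtype}: {per_core_rate(dtype)} per cycle per core")
print(int8_dot4((1, 2, 3, 4), (5, 6, 7, 8)))
```

Neural network inference is dominated by exactly this kind of multiply-and-accumulate work, so packing four times as many INT8 values into each cycle pays off directly.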

The quad texture mapper

The other key change in the Mali-G77 is the introduction of a quad texture mapper, up from a dual texture mapper in the previous generation. The texture mapper is responsible for applying 2D texture images to the 3D polygons in a scene as they are drawn to the screen. It handles sampling, interpolation, and filtering to smooth out angled and moving content and avoid harsh, low-quality edges.

Low-cost anti-aliasing remains in place to assist with image quality, but the doubling of texture performance is the major benefit here. The texture unit now processes 4 bilinear texels per clock, up from 2 previously, or 2 trilinear texels per clock, and handles FP16 and FP32 filtering faster.
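For readers unfamiliar with the term, a "bilinear texel" is a filtered sample that blends the four texels nearest to the sample point. The pure-Python sketch below shows the textbook version of that blend on a tiny single-channel texture. It is not Arm’s hardware implementation, and the function name, clamp-to-edge behavior, and texel-space coordinates are assumptions made for the example.

```python
def bilinear_sample(texture, u: float, v: float) -> float:
    """Blend the four texels around (u, v), given in texel coordinates.

    `texture` is a list of rows of floats; coordinates are clamped to the edges.
    """
    height, width = len(texture), len(texture[0])
    u = min(max(u, 0.0), width - 1)
    v = min(max(v, 0.0), height - 1)

    # Integer corners of the 2x2 footprint, clamped at the right/bottom edge.
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)

    # Fractional position inside the footprint.
    fx, fy = u - x0, v - y0

    # Interpolate horizontally on both rows, then vertically between them.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


tiny_texture = [[0.0, 1.0],
                [2.0, 3.0]]
print(bilinear_sample(tiny_texture, 0.5, 0.5))
```

Trilinear filtering performs this blend on two adjacent mipmap levels and then interpolates between the results, which is consistent with the trilinear rate being half the bilinear one.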

The quad texture mapper is split into two paths, providing a shorter pipeline for threads that hit content in the cache. The miss path, which handles format conversion and texture decompression, features a wider interface to L2 cache. This is also helpful for machine learning workloads that might frequently need to pull in new data from memory.

Bringing everything together in the Mali-G77

Arm has made a number of other tweaks to the Mali-G77 to coincide with the major changes in the Valhall architecture. The control block is simplified thanks to the single execution engine design, while the internal dynamic scheduler allows for more flexible instruction issuing inside each core. Despite the higher throughput in each core, the datapath is also shorter and lower in latency, down to just 4 cycles from 8 previously.

The new design is also better aligned with the Vulkan API, simplifying driver descriptors to lower driver overhead for improved “to the metal” performance.

In summary, the Mali-G77 and Valhall make important changes from Bifrost that promise significant performance boosts for gaming and machine learning applications. Importantly, the design fits within the same power and area budgets as Bifrost, so mobile devices should be able to offer more peak performance without added heat, power draw, or silicon cost. Based on the performance projections, the Mali-G77 should be able to give Qualcomm’s next-gen Adreno a good run for its money.
