At GTC 2021 not long ago, Jensen Huang announced that Nvidia will soon launch Grace, a CPU for HPC and AI workloads based on the Arm architecture. This is big news for the data center, server, and infrastructure industry.
Arm has been winning across a wide range of markets over the past two years and has appeared in data centers more and more often: Amazon’s self-developed Graviton2 processor is based on Arm, as is Fujitsu’s HPC-oriented A64FX. Both were covered in the article “Arm’s Ten-Year PC Journey, and Microsoft’s ‘Ambiguous’”.
But Nvidia’s weight in the data center pushes Arm to the forefront, and it looks as if Arm could soon unseat the x86 architecture that is so deeply rooted in this market. We will publish a separate article reviewing Grace soon, although I personally think the market Grace targets is quite specific. This article sets Grace aside and looks at how far Arm has come in the server market today.
Arm takes the server market seriously for the first time
Arm has long had ambitions to break out of the mobile and embedded markets, but as with Arm’s 10-year journey on the PC, the process has required constant trial and error. Nvidia’s exploration of high-performance CPUs and SoCs did not start today either. More than 10 years ago, Nvidia announced Project Denver, which aimed to work with Arm on CPU products for the HPC (high-performance computing) market.
Nvidia was not alone. Qualcomm launched its Arm-based Centriq processors for the enterprise and server markets; Cavium’s ThunderX was once a well-known Arm server chip; and there were also Broadcom’s Vulcan and AMD’s Opteron A1100, although most of these efforts failed. Judging from the slides shown in the early days of these projects, Intel should have been trembling in a corner long ago. In fact, Arm never really entered the mainstream in this field (although Arm has always claimed the highest market share in “infrastructure” equipment, which includes routers, switches, base stations, servers, and so on).
Only in recent years, with server chips such as the Kunpeng 920 and with the now very active Ampere Computing pushing Arm-based server processors, has Arm gradually become a credible presence in this field. In this year’s GTC keynote, Jensen Huang also announced that Nvidia’s GPUs will be paired with CPUs and SoCs from Ampere Computing, Amazon, MediaTek, and other partners, from cloud to edge to consumer devices. In an interview Huang said that Grace will not greatly affect existing customers, but the move obviously undermines the position of AMD and Intel.
We will not dwell on Arm’s history of trial and error in server, infrastructure, and data center products. Although Arm had long shown interest in testing the server market, it never offered dedicated IP for data center infrastructure: Arm’s IP for this market was basically shared with its IP for the consumer market.
This is understandable. Most chip makers share the bulk of their core IP across different markets within a generation. But it also shows that Arm previously had no clear, credible market plan for infrastructure equipment, which is inherently detrimental to building an ecosystem.
The turning point came in 2018. At TechCon in October of that year, Arm officially announced the Neoverse series of IP for the server market, from cloud to edge. It also disclosed the product roadmap for the following three years, as shown in the figure above. Arm Neoverse can be understood as the server counterpart of Arm Cortex.
This figure lists a Neoverse iteration for each year, such as the 2019 processor core IP code-named Ares, which became Neoverse N1. Arm initially announced a target of a 30% performance improvement for each future generation, which sounds faster than the average pace of the Cortex series next to it and faster than the competition. At a later press conference, Arm said that N1 is actually 60% faster than the 2018 Cosmos platform (although Cosmos reportedly does not refer to a specific architecture), double the original target, based on the SPEC2017 integer test (SPECspeed2017_int_base). By the N2 release this year, the pace of improvement seems to have exceeded expectations again; more specific microarchitecture changes are discussed later.
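To make the roadmap arithmetic concrete, the short sketch below compounds the 30% per-generation target and compares it with the 60% figure quoted for N1. The function name and the three-generation horizon are illustrative assumptions, not anything Arm published.

```python
import math


def compound_gain(per_generation_gain: float, generations: int) -> float:
    """Return the cumulative speedup after compounding a per-generation gain."""
    return (1.0 + per_generation_gain) ** generations


TARGET_GAIN = 0.30  # Arm's stated per-generation target (from the text)
N1_GAIN = 0.60      # N1 over Cosmos, as reported by Arm (from the text)

# Cumulative speedup if the 30% target were met three generations in a row.
# Three generations is an arbitrary horizon chosen for illustration.
three_generation_speedup = compound_gain(TARGET_GAIN, 3)

# How many "target-sized" generations the single N1 step is worth.
n1_in_target_generations = math.log(1.0 + N1_GAIN) / math.log(1.0 + TARGET_GAIN)

print(f"3 generations at 30%: {three_generation_speedup:.2f}x")
print(f"N1 step equals {n1_in_target_generations:.2f} target generations")
```

Under compounding, a single 60% step is worth somewhat less than two consecutive 30% generations (1.3 × 1.3 = 1.69), even though 60% is twice 30% in simple terms.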
In February 2019, Arm officially launched the Neoverse N1 and E1 platforms, which should be regarded as the start of Arm’s transformation in the server market.
Of course, changing the name alone achieves nothing. Previous articles on Arm’s development history have all covered modern Arm core IP, which has made breakthroughs in high performance while keeping power consumption low. Even if the average chip maker cannot reach the level of the Apple M1, the Qualcomm Snapdragon 8cx is already viable in PCs. At the very least, this era is fundamentally different from the Nvidia Project Denver era (the Cortex-A15 era) of 10 years ago.
In 2018, Arm launched the well-known Cortex-A76 core IP; the Snapdragon 8cx that Qualcomm pushed for PCs uses the Cortex-A76 microarchitecture. The Neoverse N1, which came out the following year, is in fact a variant of the Cortex-A76 microarchitecture.
Austin Family Microarchitecture (Optional)
Both come from Arm’s Austin design center, both belong to the Austin family of microarchitectures, and both are based on the ARMv8 instruction set. The subsequent versions of Cortex and Neoverse, including Cortex-A78 and Neoverse V1, should also all belong to the Austin family. The next-generation Neoverse Poseidon, originally expected to reach the market this year, will adopt a new microarchitecture.
Neoverse N1, like Cortex-A76, uses a 4-wide fetch/decode front end and an 11-stage pipeline, which can be shortened to 9 stages when needed. There is little difference between the two in either the front end or the back end.
The big differences between the two lie in the memory subsystem and the interconnect, which is inevitable for a server processor. The L1 instruction cache is fully coherent at the hardware level, which improves performance in virtualized environments. The L2 gains an optional 1MB size (the A76 has 512KB) for memory-sensitive applications.
The memory hierarchy has changed considerably. The N1 CPU connects to the CMN-600 mesh network (CMN-600, the Coherent Mesh Network, is the SoC interconnect IP Arm first released in 2016; as shown in the figure above, a core connects through a CAL to an XP crosspoint of the mesh, and each CAL connects at most two N1 cores, so two cores form a cluster). In Arm’s reference design, the mesh then connects to slices of the system-level cache (SLC), with 2MB per cluster, so the 64-core N1 reference design has 64MB of SLC in total.
This picture, from WikiChip, depicts the structure more clearly.
N1 removes the L3 and snoop-filter logic of the DSU (DynamIQ Shared Unit), and the CPU core connects directly to the CHI interface of the CMN. Communication between the memory controller and the CPU core therefore only needs to traverse the mesh network. This also appears to be a standard arrangement for server CPUs.
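The SLC figures quoted for the reference design follow from simple arithmetic, sketched below. The function name and its default parameters are hypothetical; the inputs (two cores per cluster, 2MB of SLC per cluster, 64 cores) are the values stated above.

```python
import math


def slc_capacity_mb(cores: int, cores_per_cluster: int = 2,
                    slc_mb_per_cluster: int = 2) -> int:
    """Total SLC in MB when every cluster contributes a fixed slice size."""
    # A partially filled cluster still occupies one attachment point,
    # so round the cluster count up.
    clusters = math.ceil(cores / cores_per_cluster)
    return clusters * slc_mb_per_cluster


reference_cores = 64  # core count of Arm's N1 reference design (from the text)
total_slc_mb = slc_capacity_mb(reference_cores)
slc_mb_per_core = total_slc_mb / reference_cores

print(f"{reference_cores} cores -> {total_slc_mb} MB SLC, "
      f"{slc_mb_per_core:.2f} MB per core")
```

The per-core figure is a convenient yardstick when comparing real N1 implementations, whose vendors are free to choose a different SLC size than the reference design.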
Combined with a 7nm process, the design keeps the core area small overall. Another relatively big change in Neoverse N1 is the higher maximum frequency. At launch Arm cited 3.1GHz, with the voltage raised accordingly to deliver higher single-thread performance: the frequency rises by 19% at the cost of 44% more power, which shows that Arm has no magic when it comes to the relationship between frequency and power. Amazon’s Graviton2, a chip based on Neoverse N1, runs its CPU cores at only 2.5GHz.
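The 19% and 44% figures can be related with the textbook dynamic-power approximation P ≈ C·V²·f. The sketch below assumes that model (constant switched capacitance, leakage ignored), which is a simplification and not something Arm stated; the function name is hypothetical.

```python
import math


def implied_voltage_ratio(frequency_ratio: float, power_ratio: float) -> float:
    """Voltage scaling implied by P ~ C * V^2 * f with C held constant.

    Assumption: dynamic power dominates and leakage is ignored.
    """
    return math.sqrt(power_ratio / frequency_ratio)


frequency_ratio = 1.19  # +19% frequency (from the text)
power_ratio = 1.44      # +44% power (from the text)

voltage_ratio = implied_voltage_ratio(frequency_ratio, power_ratio)
# Energy per clock cycle scales as power divided by frequency.
energy_per_cycle_ratio = power_ratio / frequency_ratio

print(f"Implied voltage increase: {(voltage_ratio - 1) * 100:.1f}%")
print(f"Energy per cycle increase: {(energy_per_cycle_ratio - 1) * 100:.1f}%")
```

Under these assumptions the extra 19% of frequency implies roughly a 10% voltage increase and about 21% more energy per cycle, which is why the top of the frequency range is so expensive in power terms.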
Raising the frequency relative to consumer-grade products is the opposite of how traditional server CPU suppliers such as Intel and AMD operate: these two x86 players share their server CPU microarchitectures with their consumer CPUs, but lower the core frequency on server parts. This has a lot to do with how each side is positioned in the consumer market.
Even so, power consumption remains an advantage. Arm previously claimed a total power consumption of about 105W for the 64-core N1 reference design, and disclosed a SPECint_rate2006 throughput score of 1310 and an integer speed (SPECint2006) score of 37 for it, which still demonstrates Arm’s energy-efficiency advantage.
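As a rough illustration of what these numbers imply, the sketch below derives throughput per watt and power per core from the figures Arm quoted. The variable names are hypothetical, and the results are back-of-the-envelope ratios for a reference design, not measured efficiency data.

```python
# Figures quoted by Arm for the 64-core N1 reference design (from the text).
CORES = 64
TOTAL_POWER_W = 105.0
SPECINT_RATE2006 = 1310.0

# Throughput score delivered per watt of total power.
rate_points_per_watt = SPECINT_RATE2006 / TOTAL_POWER_W

# Average power budget per core; this folds the mesh and SLC into the cores.
watts_per_core = TOTAL_POWER_W / CORES

# Throughput score contributed by each core on average.
rate_points_per_core = SPECINT_RATE2006 / CORES

print(f"{rate_points_per_watt:.1f} rate points per watt")
print(f"{watts_per_core:.2f} W per core")
print(f"{rate_points_per_core:.1f} rate points per core")
```

Dividing total power by core count overstates the cores' own consumption, because the 105W total also covers the interconnect and cache.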
For network and storage servers, Arm recommends N1 designs with 8-32 cores and a TDP of 25-65W; for edge devices such as 5G base stations, the target is 16-64 cores and a TDP of 35-105W; for hyperscale data centers, the target is 64-128 cores and a TDP above 150W.
Arm’s Neoverse N1 platform reference design has 64 cores, plus the aforementioned CMN-600 mesh network and 64MB of SLC. Manufactured on TSMC’s 7nm process, the 64-core N1 reference design has a die size close to 400mm². Arm also recommends chiplet-style designs, in which the chiplet dies communicate over a CCIX interconnect.
In addition, the Neoverse N1 platform design can integrate a SmartNIC: accelerating networking remains an important factor in achieving high throughput in today’s data centers (see what Nvidia is promoting now). CMN-600 can connect to fixed-function acceleration IP, and third-party IP connected through CCIX can be kept coherent. Further features, related to server RAS, security, and so on, are not listed here.
How is the efficiency of Arm server CPUs now?
Earlier Arm server processor IP cores failed to become mainstream largely because of their poor performance and efficiency. Performance and power consumption at the hardware level are the basic requirements for competing in this market.
There are not many channels for learning about server CPU performance, and chips like Amazon’s Graviton2 are used only by Amazon itself. However, as Arm has become active in the server market over the past two years, outlets such as AnandTech have begun benchmarking server and infrastructure processors.
In fact, in 2018 AnandTech considered Cavium’s ThunderX2 the first Arm processor in this field that could be compared with Intel and AMD. The arrival of Amazon’s Graviton processors since then has also shown that Arm processors can become mainstream in servers.
Besides Amazon Graviton2, the most representative Neoverse N1 implementation is probably the latest Altra product line from Ampere Computing. Last year’s Altra Q80-33 was positioned against Intel’s and AMD’s high-end server products.
The Altra Q80-33 has 80 cores clocked at up to 3.3GHz, the CMN-600 mesh interconnect, the 1MB-per-core L2 option, and 32MB of SLC, which leaves relatively little SLC per core. I/O and higher system-level details are not covered here. Readers who are interested can look up the peripheral configuration of Mount Jade, the 2-socket 2U rack server built by Ampere.
It is worth noting that this processor’s TDP is 250W. That figure refers not to average power consumption under ordinary loads but to average power consumption under peak conditions, so actual power consumption is below 250W in most cases. AnandTech believes that, measured the way Intel and AMD do, the TDP of the Altra Q80-33 would be around 200W.
By comparison, Intel recently released the Ice Lake-SP Xeon processors, whose high-end model has a TDP of 270W (the top part, the 8380, has 40 cores on the Sunny Cove architecture used in 10th-generation Core), and AMD launched the EPYC processors code-named Milan last month, with a TDP of 280W (up to 64 cores, Zen 3 architecture). Judging only by the prices of the high-end models, the Ampere Altra still offers considerably better price/performance.
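One simple way to put the three flagship parts side by side is TDP per core, sketched below with the core counts and TDP values quoted in this article. The `ServerCpu` class and its field names are hypothetical, and because the vendors define TDP differently, the ratios are only a rough indication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerCpu:
    """Headline figures for one processor, as quoted in the article."""
    name: str
    cores: int
    tdp_w: float

    @property
    def tdp_per_core_w(self) -> float:
        return self.tdp_w / self.cores


cpus = [
    ServerCpu("Ampere Altra Q80-33", cores=80, tdp_w=250.0),
    ServerCpu("Intel Xeon 8380 (Ice Lake-SP)", cores=40, tdp_w=270.0),
    ServerCpu("AMD EPYC (Milan), 64-core", cores=64, tdp_w=280.0),
]

# Sort from the lowest to the highest TDP per core.
ranked = sorted(cpus, key=lambda cpu: cpu.tdp_per_core_w)

for cpu in ranked:
    print(f"{cpu.name}: {cpu.tdp_per_core_w:.3f} W per core")
```

TDP per core says nothing about how much work each core does, so it complements rather than replaces the benchmark results.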
AnandTech recently tested Ice Lake-SP alongside AMD Milan, Ampere Altra, and Amazon Graviton2. The tests cover multi-threaded performance (SPECint2017/SPECfp2017 Base Rate-N), single-threaded performance (SPEC2017 Rate-1), per-core performance (relevant to per-core licensing), Java performance (SPECjbb MultiJVM), LLVM compilation, and NAMD performance. Interested readers can look up the details; the specific results are not listed here (for reasons of space, the figure above shows only integer multi-thread and single-thread performance).
Looking at x86 alone, Intel Xeon processors have been weak across the board in performance since the arrival of AMD Zen 2. The previous generations of AMD EPYC and Intel Xeon were separated by a relatively large performance gap. Intel has caught up a little in this generation, but a gap remains among flagship products. Intel now places more and more emphasis on system-level performance, relying on its strengths in storage, software optimization, and so on to compensate for the weakness of the CPU itself, so AnandTech’s tests may be relatively one-sided. Sapphire Rapids, planned for the second half of this year, is also on its way, but that is off topic.
The Ampere Altra, based on Arm Neoverse N1, can trade blows with AMD’s previous-generation 64-core EPYC (Rome). Neoverse still lags behind x86 in per-core performance. Altra also does poorly in memory-sensitive tests, which is related to its cache configuration (and possibly the mesh interconnect), and Ampere’s overall system solution and dual-socket scaling are still not comparable to Intel’s or AMD’s. However, in compute-heavy workloads Altra benefits from its higher core count, and in energy efficiency, as mentioned above, it has a significant power advantage over x86.
It is particularly worth mentioning that Arm server processors also have a significant price advantage. Ampere also plans to launch the Altra Max this year, with 128 Neoverse N1 cores, the top of Arm’s design target range.
Although Arm processors represented by the Ampere Altra are still inferior to x86 (mainly AMD) in some aspects of performance, they now pose a serious threat to the x86 server market.
It should be pointed out that building the ecosystem is Arm’s top priority, whether through closer cooperation with hardware and software partners or through setting specifications. When Neoverse was released two years ago, Arm also launched the ServerReady compliance certification program to help users deploy Arm server systems safely and compliantly.
Release of Neoverse N2 and V1
At GTC, Nvidia said that the Grace CPU will use a next-generation Neoverse architecture, but did not say which one. As scheduled, Arm released a new generation of Neoverse last September. In addition to N2, the successor to N1, a new V series was added: the Neoverse V1, code-named Zeus.
Neoverse V1 is a performance-oriented microarchitecture based on Cortex-X1. Like Cortex-X1, Neoverse V1 favors performance among the three PPA metrics (performance, power, area), partially sacrificing power and area. Its design direction therefore differs from that of the N series, and V1 has larger caches and core structures. According to Arm, V1 improves IPC by 50% over N1, which is a huge gain in this era. Once actual products raise the frequency, beating x86 in per-core performance should not be a problem.
In addition, V1 will be the first Arm core to support SVE (Scalable Vector Extension), although Fujitsu’s A64FX has already led the way in supporting it. The SIMD width of V1 is half that of A64FX. V1 also introduces support for the Bfloat16 format.
N2, the successor to N1, continues to pursue a balance across PPA. The Cortex microarchitecture corresponding to Neoverse N2, which is code-named Perseus, has not yet been released. Arm reportedly started licensing the N2 architecture at the end of last year. The N2 target design now reaches a maximum of 192 cores, with the TDP raised to 350W. That is a breakthrough in scaling up as well, and Nvidia’s Grace CPU is very likely to use N2.
AnandTech speculates that Neoverse N2 may adopt the ARMv9 instruction set with SVE2 support. In addition, the 5nm chip code-named Poseidon, originally planned for launch this year, is expected to slip to next year. Neoverse’s roadmap is now progressing in an orderly fashion. Even without Nvidia, Arm’s move into the data center market looks long planned, and Nvidia will clearly be the catalyst for it.