IBM did not get any piece of the CORAL-2 contract and neither did Nvidia, and it is highly unlikely that a future Argonne machine that could happen some years hence will be based on IBM Power10 or Power11 CPUs and future Nvidia GPUs. It is much more likely that it will be an all-AMD machine like Frontier and El Capitan. And while no company depends on supercomputer contracts like the CORAL-2 deal to sustain its business, such deals help pay for research and development for future products that can be commercialized for other customers – and sold at much, much higher margins.
Back in August, when some of the details of the El Capitan machine were divulged by Lawrence Livermore, it seemed a bit coy not to talk about what CPUs and GPUs were going to be used in the system. But that was not the intent. There was actually some game theory going on here, which is what you would expect from an organization that does world-class simulations.
“Lawrence Livermore uses best value procurements, and our decision was based on evaluating the options that were available in the timeframe that we needed,” explained Bronis de Supinski, chief technical officer at Livermore Computing, the division of the lab that architects and runs its supercomputers, during a conference call announcing the awarding of the compute engines to AMD. “There were others, and based on the performance that we expect the AMD processors to deliver to our actual workload, our decision was that they would provide by far the best value to the government.”
AMD, Cray, and Lawrence Livermore did not give any more specifics about the El Capitan architecture, except to say that it would use a single-socket Epyc server linked coherently to four Radeon Instinct GPU cards so they can share memory, and that this is a distinguishing feature of the architecture meant to simplify programming. Norrod did say that this Radeon Instinct card was being created in conjunction with key HPC and AI customers like Lawrence Livermore, that it would support all kinds of mixed precision as well as the single and double precision floating point operations that HPC centers require, and that it would also pack a future HBM memory technology. Norrod also said that AMD would be working with Lawrence Livermore to tightly integrate OpenMP into the ROCm programming environment that Oak Ridge will also be helping to widen and deepen on the Frontier system.
All of that extra compute is something that Lawrence Livermore desperately needs because as nuclear weapons in the US stockpile age, we need to run more sophisticated models than can even be done at a reasonable speed on the 150 petaflops Sierra hybrid CPU-GPU system.
“As the nuclear stockpile ages, the complexity of the simulations only increases,” explained de Supinski. “So we need to be able to use larger and larger systems in order to maintain the level of assurance that the nation really needs. And El Capitan, with its significant performance, will meet that need. In particular, it will make it so we can do 3D simulations on a regular basis. So simulations that now require all or a significant portion of Sierra will be able to run routinely, which means that we will be able to have much greater statistical confidence in the results and the model that we use to provide the certification will be more accurate.”
Because it is a hybrid CPU-GPU machine, there is a temptation to think of El Capitan the way Oak Ridge thinks of its current Summit and future Frontier machines, and that is as an AI-HPC supercomputer. But that is not what Sierra and El Capitan are really about. As Lawrence Livermore explained back in August, not only do the existing nuclear weapons need to be simulated to see if they will work – the Nuclear Test Ban Treaty prevents us from blowing one up to know for sure – but the weapons also need to be completely redesigned, reusing their nuclear explosives, without anyone being able to test them and while still knowing they will work. This is an incredibly massive and difficult set of simulations and designs.
“Our workloads are primarily not deep learning models, although we are exploring something we call cognitive simulation, which brings deep learning and other AI models to bear on our workloads by evaluating how they can accelerate our simulations and how they can also improve their accuracy and find where they actually work,” explained de Supinski. “And so for that, we see this system as providing some significant benefits because of those operations. But I think it’s important to understand that the primary goal of this system is large scale physics simulation and not deep learning.”
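NVIDIA's greed in the computing business has gotten a bit too ugly, so being turned down by the people holding the purse strings is perfectly normal.
The DOE's way of operating is that, as long as the requirements are met, whoever is cheapest gets the order, with no regard at all for the people lower down. I would guess the people who use CUDA have already started banging their heads against the wall and smashing their desks.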

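Then again, the vast majority of developers work at the framework level, and the framework versions on ROCm have now caught up. For example, not long ago TensorFlow released TF 2.0 for ROCm (2.4), which is already very close to the cuDNN version. This is not even a translation layer; it is a native vendor implementation.
The ones who will suffer are the people doing low-level tuning and those who bypass the vendor's first-party libraries entirely to write their own low-level acceleration libraries, the ones writing kernel functions: none of their CUDA tuning code can be used anymore.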
