
Application development on Frontier: running software at exascale

How will application programmers leverage the hardware capabilities of exascale systems? See how AMD and HPE answered the question with software tools for developing and porting applications at unprecedented scale.  

By Nicholas Malaya, PhD, Principal Engineer, Exascale Application Performance, AMD 


Oak Ridge National Laboratory's Frontier supercomputer is on track to be the first exascale system in the U.S. Planned to operate at well over one exaflops, Frontier is expected to be one of the most powerful exascale-class systems in the world dedicated to open science.

Powered by AMD EPYC™ CPUs and AMD Instinct™ accelerators, the unprecedented compute power and capabilities of Frontier will be used to drive important scientific discovery in a wide range of fields, from advancing humanity's knowledge in fundamental science, such as particle physics and cosmology, to providing wide societal benefit through the design of more energy-efficient transport systems and more effective drugs and vaccines.

AMD technology powering Frontier

Each node uses a single optimized 3rd Gen AMD EPYC™ CPU and four AMD Instinct™ MI200 GPUs. These are linked together by AMD Infinity Fabric™ technology, which provides a high-speed, memory-coherent fabric between the CPU and GPUs. Each HPE Cray EX235a Accelerator Blade consists of two of these nodes. All the blades in the system are connected via the HPE Slingshot HPC-optimized high-speed interconnect, which forms the backbone of the entire system and also integrates with the HPE ClusterStor E1000 Lustre file system.

How are application programmers expected to leverage the exciting hardware capabilities of Frontier and other exascale systems? Doing so requires close collaboration between AMD and HPE software and application engineering teams to develop software tools that enable exascale application developers to develop and port their applications at unprecedented scale.

Software and management solutions for demanding HPC applications

In this article we focus mainly on two software and management solutions that help users take advantage of the technology in the AMD processor-based HPE Cray EX system. The first is AMD ROCm™, an open-source collection of software encompassing drivers, compilers, and libraries. The second is the HPE Cray Programming Environment, one of the first complete toolchains for exascale systems and the only fully commercially supported programming environment for AMD processor-based systems. Together, these two solutions are used to tune and optimize the most demanding HPC applications.

AMD ROCm

The foundational pillars of AMD exascale software are open-source software, OpenMP® and HIP, HPC abstraction layers and machine learning frameworks, and powerful tools for CPUs and GPUs. At the base of AMD's software enablement for exascale is the ROCm stack. It provides full traceability from the application level down to the AMD CDNA™ architecture instructions that are executed on the device. This open-source environment enables the entire HPC user community to report bugs, request features, and contribute to the ROCm ecosystem. To get started running applications right away, the AMD Infinity Hub is a great resource. It is a repository of optimized GPU applications, conveniently provided in Docker or Singularity containers.

AMD EPYC CPUs and AMD Instinct GPUs also support common HPC and ML frameworks. Kokkos and RAJA are supported on AMD hardware today via the HIP compiler backend. The majority of Frontier C and C++ codes use Kokkos and RAJA to interface with the GPUs via ROCm drivers. AMD also supports major machine learning frameworks such as PyTorch and TensorFlow, permitting scientific workloads to leverage artificial intelligence in their workflows.

HPE Cray Programming Environment

The HPE Cray Programming Environment integrates with AMD ROCm for a seamless user experience. AMD ROCm provides the GPU driver software, which is included in the HPE Cray Programming Environment and which the HPE compilers use to translate high-level languages such as Fortran, C, and C++ into the AMD CDNA machine code that controls the compute cores. A key advantage of the HPE Cray Programming Environment is that it is a holistic solution. It enables software development for the full system (i.e., CPUs, GPUs, and high-speed interconnect) so exascale customers can get optimized performance from their applications. This suite offers customers compilers, debuggers, performance analysis tools, scientific libraries, and MPI.

Although parallelization techniques are the foundation of HPC programs, it's important to find a balance between the serial and parallel segments of an application code. GPU cores are very efficient at performing vector arithmetic operations at high speed. However, data must be transferred to and from CPU-attached memory, and moved by RDMA over the high-speed interconnect, and both carry overhead. As a result, parallel programming is most often a balancing act between traditional CPU algorithms and accelerated vector arithmetic using GPU kernels. Rather than attempting to optimize codes by trial and error, developers can use performance data to identify the regions of an application code that could benefit from OpenMP threads or from multi-node computation using MPI. The performance analysis tools included with the suite are therefore essential to application porting and optimization for exascale systems.
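To make this balancing act concrete, here is a minimal Python sketch that combines Amdahl's law with a fixed data-transfer cost. It is a simplified model only: the function names and the sample timings are illustrative assumptions, not measurements from Frontier and not output from the HPE performance analysis tools.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup when only `parallel_fraction` of the runtime scales with `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def offload_pays_off(cpu_seconds, gpu_seconds, transfer_seconds):
    """Return True if the GPU time plus data movement beats running the region on the CPU."""
    return gpu_seconds + transfer_seconds < cpu_seconds


if __name__ == "__main__":
    # Hypothetical parallel fractions, all evaluated on 64 workers.
    for fraction in (0.50, 0.90, 0.99):
        speedup = amdahl_speedup(fraction, 64)
        print(f"parallel fraction {fraction:.2f}: ideal speedup {speedup:.1f}x")

    # Hypothetical timings: a large region amortizes the transfer, a small one does not.
    print(offload_pays_off(cpu_seconds=2.0, gpu_seconds=0.1, transfer_seconds=0.5))
    print(offload_pays_off(cpu_seconds=0.4, gpu_seconds=0.02, transfer_seconds=0.5))
```

In this model, a code that is 90% parallel gains less than 9x on 64 workers, and a small region can run slower on the GPU once transfer time is counted. Measured performance data is what replaces the assumed numbers in practice.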

Because MPI is a critical contributor to large system scalability, HPE has worked extensively with AMD to develop GPU-optimized RDMA techniques able to take advantage of unified memory. This work helps ensure efficient transfer of GPU data over the HPE Slingshot network. To develop applications that are specifically aimed at next-generation heterogeneous computing, users need a variety of tools and techniques, including compiler vectorization, OpenMP threads with GPU offload, and MPI rank affinity. Drawing on the knowledge and experience of specialist HPC developers, and on extensive collaboration with staff from DOE laboratories and contributors from academia, these techniques have been applied to a wide range of application software for Frontier and beyond.
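As a rough illustration of rank affinity, the Python sketch below maps node-local MPI ranks to GPUs in round-robin order, assuming the four GPUs per node described earlier. The function name and the mapping policy are assumptions made for this example. How a real job obtains its local rank and selects a device depends on the launcher and runtime, which are not shown here.

```python
def assign_gpus(ranks_per_node, gpus_per_node=4):
    """Map each node-local rank to a GPU index in round-robin order (hypothetical policy)."""
    if ranks_per_node < 1 or gpus_per_node < 1:
        raise ValueError("ranks_per_node and gpus_per_node must be at least 1")
    return {rank: rank % gpus_per_node for rank in range(ranks_per_node)}


if __name__ == "__main__":
    # Eight ranks on a four-GPU node: each GPU is shared by two ranks.
    for rank, gpu in assign_gpus(ranks_per_node=8).items():
        print(f"local rank {rank} -> GPU {gpu}")
```

Round-robin is only one possible policy. A block mapping, which keeps consecutive ranks on the same GPU, may suit codes whose neighboring ranks exchange the most data.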

Now it's your turn!

Frontier is certain to be a historic compute system that will bring scientific benefits to all. From all of us at AMD and HPE, we hope this summary of the capabilities our collaboration is bringing to market has excited you about what you can accomplish at exascale. We're eagerly looking forward to seeing all you achieve!

Come learn more about the AMD solutions powering the Exascale Era at the AMD HPC and AI Solutions Hub.

Visit the Discover More Network to hear the latest advancements in enabling HPC and AI at HPE.


Meet our guest blogger Nicholas Malaya, PhD, Principal Engineer, Exascale Application Performance, AMD 

Nicholas Malaya is a Principal Engineer for AMD with an emphasis in software development, algorithms, and HPC. He is AMD's technical lead for exascale application performance, which is focused on ensuring workloads run efficiently on the world's largest supercomputers. Nick's research interests include HPC, computational fluid dynamics, Bayesian inference, and ML/AI. He received his PhD from the University of Texas. Before that, he double majored in Physics and Mathematics at Georgetown University, where he received the Treado medal. His postings are his own opinions and may not represent AMD's positions, strategies, or opinions. Links to third-party sites are provided for convenience and, unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.


Advantage EX Experts
Hewlett Packard Enterprise

twitter.com/hpe_hpc
linkedin.com/showcase/hpe-ai/
hpe.com/info/hpc

AMD, the AMD Arrow logo, EPYC, Instinct, ROCm and combinations thereof are trademarks of Advanced Micro Devices, Inc. The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board. Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc. PyTorch, the PyTorch logo and any related marks are trademarks of Facebook, Inc.

About the Author

AdvEXperts

Our team of Hewlett Packard Enterprise Advantage EX experts helps you dive deep into high performance computing and supercomputing topics.