The white paper was released as part of the Huawei Middle East & Central Asia Tech Carnival 2023, held for the first time in Kazakhstan
Huawei and the IEEE Kazakhstan Subsection have jointly released a white paper on lossless Ethernet, a network solution that supports high-performance computing (HPC) and AI scenarios thanks to its unrivaled advantages in network performance, compatibility, cost-effectiveness and flexibility.
The report, titled HPC Lossless Ethernet and AI Fabric Network Technical White Paper, was launched at the Huawei Network Summit on the first day of Huawei’s annual flagship Middle East & Central Asia (ME&CA) Tech Carnival, which is being held in Almaty, Kazakhstan, for the first time, from June 5 to 7. Under the theme ‘Leading Digital Infrastructure for New Value Together’, the ME&CA Tech Carnival brings together industry opinion leaders to explore the infinite possibilities of digitalization.
Attending the launch ceremony of the white paper were Arthur Wang, President of the Data Center Network Domain, Huawei Data Communication Product Line; Dr. Carlo Molardi, IEEE Kazakhstan Subsection; Liu Gui, Vice President of Enterprise Business Group, Huawei Middle East & Central Asia Region; and Dr. Fahem Al Nuaimi, CEO of Ankabut, the United Arab Emirates.
Around the world, many countries and regions are formulating policies to develop HPC and AI and to promote digital transformation. Lossless Ethernet plays, and will continue to play, a key role in these efforts.
For example, Saudi Arabia has formulated Saudi Vision 2030, aiming to develop HPC and AI technologies to improve global competitiveness. It also established the King Abdulaziz City for Science and Technology (KACST) to support the R&D of HPC and AI projects. Qatar also released National Vision 2030 to propel HPC and AI technologies. The Qatar Computing Research Institute is Qatar’s foremost research institute focusing on HPC and AI with the mission to promote technological innovation. The UAE also established the National AI Strategy 2031 and appointed the world’s first Minister for AI.
Ankabut, the UAE’s advanced national research and education network (NREN), has adopted lossless Ethernet in its supercomputing center to provide computing services for 35 educational institutions and 80 colleges and universities, and to support research in fields such as meteorological prediction and modeling, life sciences, fluid mechanics and oil exploration.
Dr. Fahem Al Nuaimi, CEO of Ankabut, said, “Ankabut wants to play its role in promoting cutting-edge research and development and collaboration for member institutions and the country. An advanced network is crucial for this role. Working with partners like Huawei, we can create a future-proof network that helps advance our mission of making UAE universities leaders in research.”
According to the white paper, lossless Ethernet technology uses intelligent remote direct memory access (RDMA) and network scale load balancing (NSLB), which help to achieve zero packet loss and ultra-high network throughput of 90%.
The report begins with the HPC network architectures: Clos, Multi-Rail, and directly connected topology (DCT). Clos is a multistage architecture, with each switching unit connected to all switching units at its lower stage. It is strict-sense non-blocking, re-arrangeable, and scalable. Multi-Rail adopts cell switching of modular devices to implement absolute load balancing. DCT features ultra-large networking, low costs, and a small number of end-to-end communication hops.
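To make the non-blocking property concrete, the sketch below applies the textbook conditions for a symmetric three-stage Clos network. It is not taken from the white paper: the function name and the parameters `n` (input ports per ingress-stage switch) and `m` (number of middle-stage switches) are assumptions chosen for illustration. The thresholds are the classic ones: m ≥ 2n − 1 for strict-sense non-blocking and m ≥ n for rearrangeably non-blocking.

```python
def classify_clos(n: int, m: int) -> str:
    """Classify a symmetric three-stage Clos network.

    n: input ports per ingress-stage switch (hypothetical parameter name)
    m: number of middle-stage switches (hypothetical parameter name)
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    if m >= 2 * n - 1:
        # Enough middle switches that a free path always exists.
        return "strict-sense non-blocking"
    if m >= n:
        # A path exists, but existing connections may need to be moved.
        return "rearrangeably non-blocking"
    return "blocking"


if __name__ == "__main__":
    for n, m in [(4, 7), (4, 4), (4, 3)]:
        print(f"n={n}, m={m}: {classify_clos(n, m)}")
```

In this model, which category a given fabric falls into depends on how many middle-stage switches are provisioned relative to the ports on each edge switch.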
The paper then details how the software architecture improves HPC and AI application performance in two ways: by optimizing the network itself, and by converging and optimizing the network and the application system together. The objective of network optimization is to maximize the throughput and minimize the latency of the entire network, which involves the following technologies:
- Flow control: Identifies cyclic buffer dependencies and eliminates the necessary conditions for generating them, resolving the priority-based flow control (PFC) deadlock problem and enhancing network reliability.
- Congestion control: Uses an AI algorithm to dynamically adjust the explicit congestion notification (ECN) thresholds, maximizing network bandwidth and minimizing network latency.
- Traffic scheduling: Adopts NSLB technology to load balance network-wide traffic, achieving ultra-high network throughput of 90% and improving AI training efficiency by 20%.
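The flow control item above turns on finding cyclic buffer dependencies. A minimal sketch of that idea, assuming the dependencies are already available as a directed graph, is a depth-first search for a cycle. The graph model, the function name and the buffer labels such as `sw1.q3` are hypothetical and are not drawn from the white paper, which does not describe its detection method here.

```python
from typing import Dict, List, Optional


def find_cyclic_dependency(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of buffer names, or None if the graph is acyclic.

    deps maps a buffer to the buffers it waits on (hypothetical model:
    an edge A -> B means A cannot drain until B drains).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    nodes = set(deps) | {b for targets in deps.values() for b in targets}
    color = {node: WHITE for node in nodes}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, []):
            if color[nxt] == GREY:
                # nxt is already on the current path: a cycle is closed.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(nodes):  # sorted for deterministic results
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    ring = {"sw1.q3": ["sw2.q3"], "sw2.q3": ["sw3.q3"], "sw3.q3": ["sw1.q3"]}
    print(find_cyclic_dependency(ring))
```

Removing any one edge of a reported cycle, for example by moving a flow to a different queue, breaks the dependency loop in this model.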
To converge and optimize the network and application system, in-network computing (INC) technology is used on HPC networks. Designed for Message Passing Interface (MPI) communication, this technology enables network devices to participate in computing, minimizing the job completion time.
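As a rough illustration of what it means for network devices to participate in computing, the toy model below has a single switch sum the vectors contributed by each host and return the total to all of them, analogous to an MPI allreduce with a sum operation. The function name and the single-switch setup are assumptions for illustration only; the white paper's actual INC design is not described here.

```python
from typing import List


def switch_allreduce_sum(host_vectors: List[List[float]]) -> List[List[float]]:
    """Toy in-network reduction (hypothetical helper).

    Each host sends one vector to the switch; the switch adds them
    element by element and sends the same total back to every host.
    """
    if not host_vectors:
        return []
    length = len(host_vectors[0])
    if any(len(vec) != length for vec in host_vectors):
        raise ValueError("all hosts must contribute vectors of equal length")
    # Element-wise sum, computed once "in the network".
    total = [sum(column) for column in zip(*host_vectors)]
    # Every host receives its own copy of the reduced vector.
    return [list(total) for _ in host_vectors]


if __name__ == "__main__":
    print(switch_allreduce_sum([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

In this sketch each host sends one message and receives one message regardless of how many hosts take part, which is the intuition behind shortening job completion time.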
HPC and AI are converging, and lossless Ethernet is ideal for laying a solid foundation for HPC. This will help to fully unleash computing power across industries, promote the prosperity of the digital industry and fuel global digital transformation, underpinning the construction of a fully connected, intelligent world.