New Machine Learning Cluster for ScaDS.AI
In February, a new HPC cluster designed specifically for machine learning will go into user operation at the data center of TU Dresden (LZR - Rechenzentrum Lehmann-Zentrum). It was financed from a special BMBF budget for our competence center ScaDS.AI Dresden/Leipzig.
After a Europe-wide tender, NEC Deutschland GmbH won the bid and began installation in December 2020. At the heart of the system, and the source of its computing power, are a total of 272 NVIDIA A100 GPUs, eight in each of the 34 compute nodes. Their combined theoretical peak floating-point performance is more than 2.6 PFlop/s at 64-bit precision and more than 5.3 PFlop/s at 32-bit precision. This is expected to make the system fast enough for an entry in the upcoming Top500 list in June 2021.
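The headline performance figures can be checked with a little arithmetic. The sketch below assumes per-GPU peak rates of 9.7 TFlop/s (64-bit) and 19.5 TFlop/s (32-bit), taken from NVIDIA's published A100 specifications for non-Tensor-Core operation; these values and all names in the code are illustrative and do not come from the announcement itself.

```python
# Rough check of the cluster's theoretical peak performance.
# Node and GPU counts come from the announcement; the per-GPU peak
# rates are ASSUMED from NVIDIA's published A100 specifications.
NODES = 34
GPUS_PER_NODE = 8
A100_FP64_TFLOPS = 9.7   # assumption: 64-bit peak per GPU, non-Tensor-Core
A100_FP32_TFLOPS = 19.5  # assumption: 32-bit peak per GPU, non-Tensor-Core


def cluster_peak_pflops(per_gpu_tflops, nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Return the summed theoretical peak of all GPUs in PFlop/s."""
    return per_gpu_tflops * nodes * gpus_per_node / 1000.0


total_gpus = NODES * GPUS_PER_NODE
fp64_pflops = cluster_peak_pflops(A100_FP64_TFLOPS)
fp32_pflops = cluster_peak_pflops(A100_FP32_TFLOPS)

print(f"GPUs in total: {total_gpus}")
print(f"Peak at 64-bit: {fp64_pflops:.2f} PFlop/s")
print(f"Peak at 32-bit: {fp32_pflops:.2f} PFlop/s")
```

Under these assumptions the totals are consistent with the stated "more than 2.6" and "more than 5.3" PFlop/s. These are theoretical maxima; measured benchmark results, such as those used for the Top500 ranking, are typically lower.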
Each node also features 1 TB of main memory and 3.2 TB of local NVMe cache to feed data to the GPUs quickly. Fast connectivity to the central HPC storage complex is provided by two HDR InfiniBand ports per node, with a combined bandwidth of 400 Gbps. The maximum power consumption of a node is 4.8 kW. Direct hot-water cooling ensures high energy efficiency and allows the waste heat to be reused.
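To give a sense of scale, the per-node figures can be summed over all 34 nodes. The following is a minimal sketch: the class and field names are hypothetical, the per-port rate of 200 Gbps is the nominal HDR InfiniBand rate, and the totals are simple upper bounds for the compute nodes alone, not measured values for the whole installation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node figures from the announcement (names are illustrative)."""
    memory_tb: float = 1.0      # main memory per node
    nvme_tb: float = 3.2        # local NVMe cache per node
    hdr_ports: int = 2          # HDR InfiniBand ports per node
    gbps_per_port: int = 200    # nominal HDR rate per port
    max_power_kw: float = 4.8   # maximum power draw per node

    @property
    def network_gbps(self) -> int:
        # Combined bandwidth of all ports on one node
        return self.hdr_ports * self.gbps_per_port


NODES = 34
node = NodeSpec()

totals = {
    "main memory (TB)": node.memory_tb * NODES,
    "NVMe cache (TB)": node.nvme_tb * NODES,
    "network bandwidth (Gbps)": node.network_gbps * NODES,
    "max power (kW)": node.max_power_kw * NODES,
}

for name, value in totals.items():
    print(f"{name}: {value:.1f}")
```

The power total covers only the compute nodes at their stated maximum; storage, networking, and cooling infrastructure are not included.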
The new computing cluster will be integrated into the existing high-performance computing infrastructure of the Center for Information Services and High Performance Computing (ZIH) and will primarily be available for the research of the competence center ScaDS.AI Dresden/Leipzig. In particular, highly parallel applications that use artificial intelligence methods for fast data analysis will benefit from this efficient system, which should advance both model development and the expressiveness of analyses. The system is currently in the acceptance phase and is available for initial testing with scientific applications. Access can be requested via an HPC-DA project application on the ZIH HPC web pages: https://tu-dresden.de/zih/hochleistungsrechnen/zugang