Bootcamp on Accelerated Distributed Computing - powered by NVIDIA

Application Deadline: 28.02.2025

Participation is by application only! If you are interested in participating in this event, please apply with your official email address to prove your affiliation. The final participants will be selected and informed after the registration deadline has passed. Priority will be given to members of the Munich Center for Machine Learning (MCML).

This event is part of the "LRZ AI Training Series", a series of courses aimed at the needs and expectations of data analytics, big data & AI users at LRZ. Priority admission to the event will be given to members of MCML.

This course will be organised as an on-site event at LRZ in Garching near Munich, allowing for direct interaction with trainers during hands-on sessions and demos.

There will be no option to join remotely via video conference. Participants are expected to bring their own laptops. There are no PCs installed in the course room!

Contents

The Accelerated Distributed Model Training Bootcamp takes a real-world perspective on how to use GPUs efficiently when training models in a distributed manner. Attendees walk through the system topology to learn the dynamics of multi-GPU and multi-node connections and architecture. They will also learn state-of-the-art strategies for training models in a multi-GPU and multi-node environment using the PyTorch framework. Furthermore, attendees will learn to profile, inspect, analyse and optimise code using NVIDIA® Nsight™ Systems, a tool that helps identify optimisation opportunities and improve the performance of applications running on systems with multiple CPUs and GPUs.

Participation in this bootcamp is strongly recommended for teams who wish to apply for the EuroCC AI Hackathon, taking place in October 2025. For this reason, we strongly recommend that participants apply as a team of two or more, and priority will be given to teams.

Topics

  • Training strategy
    • Data Parallelism
    • Model Parallelism
    • Message Passing
    • Horovod
    • Pipeline Parallelism
    • Mixed Precision
    • ZeRO, Fully Sharded Data Parallelism (FSDP), Mixture-of-Experts (MoE)
    • PyTorch SLURM
  • System Topology
    • Communication concepts
    • Intra-Node Communication Topology
    • NCCL
  • Implementation
    • NeMo Megatron Core/Nemotron
    • Profiler

Preliminary Agenda

All times are in Central European Time (CET).

  • 09:00 - 09:15: Welcome and Introduction
  • 09:15 - 09:30: Cluster connection walkthrough (Demo)
  • 09:30 - 10:30: Fundamentals of accelerated distributed model training methods (Lecture)
  • 10:30 - 11:30: Instructor-led lab walkthrough (Demo)
  • 11:30 - 12:00: Break
  • 12:00 - 13:30: Multi-GPU, multi-node training strategy (Lab)
  • 13:30 - 14:00: Nsight Systems profiling (Lab)
  • 14:00 - 14:15: Wrap-up and Q&A

The bootcamp is co-organised by LRZ, NVIDIA and the OpenACC organization.

Prerequisites

  • Background knowledge of Python programming and the PyTorch framework is required.

Language

English

Lecturers

The lecturers will be from NVIDIA.

Prices and Eligibility

The course is open and free of charge for academic participants from Germany. Priority admission to the event will be given to members of MCML.

Registration

Please apply with your official email address to prove your affiliation. The final participants will be selected and informed after the registration deadline has passed. Priority will be given to members of the Munich Center for Machine Learning (MCML).

Withdrawal Policy

See Withdrawal

Legal Notices

This bootcamp is co-organised with NVIDIA. Some of your personal data will be transferred to NVIDIA (salutation, title, first name, surname, institution, country, email and bootcamp-specific information provided in the registration form). The legal basis is Article 6(1)(b) GDPR. Please also see our data protection notice (in German: https://www.lrz.de/datenschutzerklaerung/).

For registration for LRZ courses and workshops we use the service edoobox from Etzensperger Informatik AG (www.edoobox.com). Etzensperger Informatik AG acts as processor and we have concluded a Data Processing Agreement with them.

See Legal Notices

Course: Bootcamp on Accelerated Distributed Computing - powered by NVIDIA
Number: hdta5w24
Available places: 2
Date: 26.03.2025
Price: EUR 0.00
Location: Leibniz Rechenzentrum, Boltzmannstr. 1, 85748 Garching b. München
Room: Seminarraum 1
Registration deadline: 28.02.2025 13:59
E-mail: education@lrz.de

No. | Date | Time | Teacher | Location | Room | Description
1 | 26.03.2025 | 09:00 – 14:15 | | Leibniz Rechenzentrum | Seminarraum 1 | Lecture