Cake day: June 15th, 2023

  • The main benefit, I think, is massive scalability. For instance, DOE scientists at Argonne National Laboratory are working on training a language model for scientific use. That isn't something you can do on even tens of GPUs for a few hours, which is typical for jobs run on university clusters and similar systems. They're doing it by scaling up to a large portion of ALCF Aurora, an exascale supercomputer.

    Basically, certain problems need both the ability to run jobs on a lot of hardware and the ability to run them for long periods of time (long, but not so long that they crowd out other labs' work). Big clusters like Aurora are helpful for that.