Nvidia DGX Compute Cluster – Center for Applied AI Hub

September 23, 2024

Summary

The Nvidia DGX cluster is a Slurm-based high-performance computing cluster. It consists of 5 Nvidia DGX H100 compute nodes that together provide 1120 CPUs, 10 TB of RAM, and 40 Nvidia H100 GPUs, each with 80 GB of VRAM.

The cluster is designed for GPU-accelerated research, including:

Genomics
Large Language Models (LLMs)
Image generation

Usage

The DGX cluster uses Slurm as its workload manager. Slurm schedules jobs according to the resources available in the cluster. Users define their work in shell scripts and submit them to the Slurm scheduler through queues. As resources become available, Slurm allocates the requested resources, isolates them using cgroups, and runs the scripted work on the user’s behalf. After a job completes, users can review its results and output.
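
The submission workflow described above can be sketched as follows. This is a minimal illustration, not the cluster's official job template: the partition name, job name, resource amounts, and the train.py command are placeholders chosen for the example, while the #SBATCH options themselves are standard Slurm options. The Python helper only renders the text of a batch script; it does not submit anything.

```python
def render_sbatch_script(job_name, command, gpus=1, cpus=8, mem_gb=64,
                         time_limit="01:00:00", partition="gpu"):
    """Return the text of a Slurm batch script requesting the given resources.

    Every default here, including the partition name, is a placeholder;
    the real values come from the cluster documentation.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # hypothetical queue name
        f"#SBATCH --gres=gpu:{gpus}",         # number of GPUs requested
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={time_limit}",       # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x-%j.out",         # %x = job name, %j = job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: one GPU running a placeholder training script.
script = render_sbatch_script("demo-train", "python train.py")
print(script)
```

Saving the printed text to a file such as job.sh and running sbatch job.sh would place the job in the queue; squeue -u $USER then shows its state while it waits for or uses resources.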

Jobs on the cluster run inside containers. Containers are isolated environments that hold the software and packages a job needs. Because these environments are isolated, input and output directories must be mounted into the container from outside it. For further information, see the documentation at https://ukyrcd.atlassian.net/wiki/spaces/RCDDocs/pages and search for ‘dgx’.
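
Mounting directories into a container can be illustrated with a short sketch. The linked documentation is the authority on which container runtime the cluster actually uses; the example below assumes Apptainer purely for illustration, and the image name and all paths are hypothetical. It builds the host:container bind specifications and the resulting command line without executing anything.

```python
import shlex


def build_container_command(image, mounts, command):
    """Build an 'apptainer exec' argument list with GPU access and bind mounts.

    mounts maps host directories to the paths under which they appear
    inside the container. Apptainer is an assumed runtime here; the
    cluster may use a different one.
    """
    args = ["apptainer", "exec", "--nv"]  # --nv exposes the NVIDIA GPUs
    for host_dir, container_dir in mounts.items():
        args += ["--bind", f"{host_dir}:{container_dir}"]
    args.append(image)
    args += shlex.split(command)
    return args


cmd = build_container_command(
    image="pytorch.sif",                       # hypothetical image file
    mounts={
        "/home/user/data": "/input",           # hypothetical input directory
        "/home/user/results": "/output",       # hypothetical output directory
    },
    command="python train.py --data /input --out /output",
)
print(shlex.join(cmd))
```

Inside the container the job reads from /input and writes to /output; because /output is bound to a host directory, the results remain available after the container exits.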

Access

Requesting Access:

To request access to the DGX, use the form at https://ukyrcd.atlassian.net/servicedesk/customer/portal/4/group/16/create/51.

In the form’s description field, please include a project description and list any other members (with linkblue IDs, if applicable) who should be added to your project. Jobs are provisioned and tracked by project, both for internal processes and for external reporting, so please make the project description reasonably detailed.

Ownership

Projects that use the DGX:

Seizure Detection
- “Epilepsy is one of the most common neurological diseases. Up to 30-40% of patients do not respond to medication and suffer from uncontrolled seizures. To better control seizures than existing therapies do, we focus our research on identifying novel therapeutic targets and strategies. To test such novel strategies, we determine the seizure burden in animal epilepsy models using a piezo-video-EEG system. This piezo-video-EEG system provides 3 data streams: motion (piezo), visual (video), and brain electrical activity (encephalography, EEG), each of which records animals 24/7 to capture seizure events. To identify seizures in these data streams promptly, we developed an AI script that is housed on the DGX cluster.”
ADNI Whole Genome Project
- The ADNI whole-genome files, totaling over 30 TB, require processing with Parabricks.
BioNLP Group Projects
- The BioNLP group at UK focuses on two project themes that leverage DGX GPUs for research:
- (1) As autoregressive decoder language models, also called large language models (LLMs), gain mainstream attention for their conversational abilities and zero-/few-shot skills in information extraction, it is not clear whether they will ever beat encoder or encoder-decoder models when ample training data is available. A high-level objective of our lab is to answer this question using different information extraction tasks (named entity recognition, relation extraction) and different datasets across different sizes of LLMs. This will provide general end-user guidelines on when to use smaller encoder or encoder-decoder models rather than resource-intensive LLMs.
- (2) As LLMs gain popularity in healthcare owing to their reasoning abilities in answering multiple-choice medical exam questions, they are also being touted as capable of predictive modeling and risk stratification for various diseases. However, there is a vast technical gap between answering multiple-choice questions and predicting disease risk from long clinical histories. Our goal is to focus on substance use disorders as a starting point to measure the zero-/few-shot and fully supervised capabilities of LLMs in predictive modeling in comparison with less expensive encoder models.
Visium Spatial Transcriptomics
- Spatial transcriptomics is a new technology that allows transcriptomic profiling at close to single-cell resolution with morphological context in individual tissue sections. As a representative of this technology, Visium by 10x Genomics can simultaneously assess up to 5,000 spots within a user-specified 6.5 mm × 6.5 mm capture area of a tissue section to obtain the gene expression profile of each spot. Our group has generated Visium data for a cohort of lung cancer patients from the Markey Cancer Center, and is also generating Visium data for a cohort of pediatric brain cancer patients across Kentucky. The goal of this project is to develop deep learning models to cluster patients into subgroups based on spatial transcriptomic data, and to investigate the association between the subgroup assignment and clinical outcomes such as overall survival and treatment response. The NVIDIA DGX H100 nodes in the DGX cluster will allow us to train our models more efficiently.

Resources

The cluster consists of 5 Nvidia DGX H100 compute nodes that together provide 1120 CPUs, 10 TB of RAM, and 40 Nvidia H100 GPUs, each with 80 GB of VRAM.
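
For sizing a job, it can help to break these totals down per node. The short calculation below is a sketch that assumes the resources are divided evenly across the 5 identical nodes; the totals are the figures quoted above, and the per-node values are derived from them rather than taken from cluster documentation.

```python
# Cluster totals as stated in this document.
nodes = 5
total_cpus = 1120
total_ram_tb = 10
total_gpus = 40
vram_per_gpu_gb = 80

# Derived values, assuming an even split across identical nodes.
cpus_per_node = total_cpus // nodes
ram_per_node_tb = total_ram_tb / nodes
gpus_per_node = total_gpus // nodes
total_vram_gb = total_gpus * vram_per_gpu_gb

print(f"CPUs per node:     {cpus_per_node}")
print(f"RAM per node (TB): {ram_per_node_tb}")
print(f"GPUs per node:     {gpus_per_node}")
print(f"Total VRAM (GB):   {total_vram_gb}")
```

Under that assumption, a job that needs more than 8 GPUs would have to span more than one node.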

Groups using the DGX include: CAAI, the BioNLP group at UK, and the Markey Cancer Center.
