Nvidia DGX Cluster for Accelerated Research

Summary

Project Description: The Nvidia DGX cluster is a state-of-the-art, Slurm-based high-performance computing system. It provides robust computational power, specifically optimized for GPU-intensive research applications. This infrastructure serves as a critical resource for advancing discovery in areas demanding significant parallel processing capabilities.

Problem: Modern scientific research, particularly in fields like genomics, large language models, and image generation, requires immense computational resources that traditional computing environments cannot efficiently provide. The challenge lies in processing vast datasets and training complex, deep learning models that necessitate high-performance GPUs, substantial memory, and efficient workload management. Without such specialized infrastructure, researchers face significant limitations in the scope, speed, and feasibility of their studies.

Existing Solutions: While various computing resources exist, the Nvidia DGX cluster represents a dedicated, high-density GPU computing solution designed to meet the extreme demands of cutting-edge AI and data science. It offers a significant leap in computational throughput compared to general-purpose computing systems, enabling research that would be impractical or impossible otherwise.

Solution: The Nvidia DGX cluster is comprised of five Nvidia DGX H100 compute nodes, collectively offering 1120 CPUs, 10 TB of RAM, and 40 H100 Nvidia GPUs, each equipped with 80GB of VRAM. It leverages Slurm as a workload manager to efficiently schedule and execute jobs. All jobs run within isolated containers, ensuring consistent and reproducible environments for complex software stacks. This architecture enables researchers to define their work via shell scripts and submit them to a powerful scheduler that provisions and isolates requested resources, greatly accelerating the pace of discovery.

Impact: The DGX cluster directly facilitates groundbreaking research that was previously resource-prohibitive. By providing unparalleled GPU acceleration, it enables rapid training of large language models, intricate genomic analyses, and the creation of sophisticated image generation algorithms. This accelerates scientific breakthroughs, fosters innovation, and positions researchers at the forefront of their respective fields, ultimately contributing to advancements in areas such as precision medicine, AI development, and computational biology.

Datasets/Model

Description of Datasets/Models: The Nvidia DGX cluster is an enabling infrastructure that supports research utilizing a diverse range of large-scale datasets and complex models across various disciplines. While the cluster itself does not host specific public datasets or models, it provides the computational power necessary for their processing and training. Examples of data and models handled on the cluster include:

  • Genomic Data: Processing large-scale whole-genome files, such as the 30TB ADNI whole-genome dataset, often requiring specialized tools like Parabricks for accelerated analysis.
  • Large Language Models (LLMs): Training and fine-tuning various LLMs for natural language processing tasks, including named entity recognition, relation extraction, and predictive modeling for health conditions like substance use disorders.
  • Image Data: Processing and generating images for computer vision tasks, including analysis of medical imaging (e.g., EEG, video data for seizure detection) and spatial transcriptomics data (e.g., Visium by 10x Genomics for lung and pediatric brain cancer patients). These projects involve developing deep learning models for classification, regression, and clustering based on these rich datasets.

Researchers using the DGX cluster manage their own datasets, which may be stored on various secure institutional storage systems. DOIs: Specific DOIs for datasets and models are tied to individual research projects utilizing the cluster, rather than the cluster itself. Researchers are encouraged to seek and cite relevant DOIs for the data and models used in their specific work.

Access

How to Use: Access to the Nvidia DGX cluster is managed through a centralized request system. The cluster utilizes Slurm as a workload manager, and jobs are submitted via shell scripts. All work runs within containers, requiring input and output directories to be mounted. Comprehensive documentation regarding job submission, container usage, and other operational details can be found by searching for “dgx” at the following documentation portal: https://ukyrcd.atlassian.net/wiki/spaces/RCDDocs/pages.

Requesting Access: To request access to the DGX cluster, please use the following link: https://ukyrcd.atlassian.net/servicedesk/customer/portal/4/group/16/create/51. When filling out the form, ensure you provide a detailed project description in the designated field. Additionally, include the Linkblue IDs of any other team members who require access to your project on the cluster. Project descriptions are vital for internal processes and external reporting, so please be as specific as possible regarding your research goals and computational needs.

Ownership

Project Status: The Nvidia DGX cluster is a fully operational, ongoing high-performance computing resource managed by the institution. It is actively supporting numerous research initiatives across various departments.

Other Projects Using This: The DGX cluster is a shared resource that currently supports a wide array of projects, including:

  • Seizure Detection: Utilizing AI scripts for real-time analysis of multi-stream physiological data in animal epilepsy models to identify novel therapeutic targets.
  • ADNI Whole Genome Project: Processing extensive genomic data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) for advanced genetic analysis.
  • BioNLP Group Projects: Research focusing on the capabilities of Large Language Models (LLMs) for information extraction and predictive modeling in healthcare, comparing their performance against traditional NLP models.
  • Visium Spatial Transcriptomics: Developing deep learning models to analyze spatial transcriptomic data from cancer patients, aiming to identify disease subgroups and correlate them with clinical outcomes.

Publications: Publications resulting from research conducted on the DGX cluster are attributed to the individual research teams and projects. Researchers are responsible for acknowledging the use of the Nvidia DGX cluster in their publications. Any publications directly resulting from the cluster’s operational development or specific cluster-related research will be disclaimed here upon renewal.

Resources Utilized

Services Used: The primary service utilized is the Nvidia DGX Cluster itself, providing high-performance GPU-accelerated computing. This includes access to:

  • Nvidia DGX H100 compute nodes: 5 nodes total.
  • CPUs: 1120 CPUs.
  • RAM: 10 TB of RAM.
  • GPUs: 40 H100 Nvidia GPUs, each with 80GB of VRAM.
  • Slurm: As the workload manager for job scheduling and resource allocation.
  • Containerization: For isolated and reproducible job environments.

Tags: