Summary
Cat-Vision is a self-service platform for AI-powered image analysis. It is designed to assist researchers working with unlabeled or hard-to-label image datasets, especially in the biomedical domain, by extracting high-dimensional feature representations directly from the images. This makes it particularly valuable where labeled datasets are limited, sensitive, or expensive to produce.
At the core of Cat-Vision is DinoMX, a modular PyTorch-based training framework for self-supervised representation learning with Vision Transformers (ViTs). The pipeline builds upon the DINO (self-distillation with no labels) and DINOv2 frameworks, introduced by Meta in 2021 and 2023, respectively. DinoMX supports self-supervised learning (SSL), a training paradigm in which models learn generalizable visual representations by solving pretext tasks derived from the data itself, without requiring manual labels. Vision foundation models trained via SSL are particularly well-suited for downstream tasks such as classification, segmentation, and similarity search. Because DinoMX supports both pretraining and downstream adaptation, it is a flexible solution for feature extraction, representation learning, and model adaptation across diverse domains.
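To make the SSL paradigm concrete, here is a minimal sketch of a DINO-style self-distillation objective: a student network learns to match a momentum teacher's output on different augmented views of the same unlabeled image. The temperatures and centering follow the spirit of the DINO paper; this is an illustrative sketch, not DinoMX’s exact implementation.

```python
# Hedged sketch of the DINO-style self-distillation objective.
# Temperatures and centering follow the DINO paper in spirit.
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    # Teacher targets: centered, sharpened, and detached (no labels involved).
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()  # cross-entropy between the two views

# Two augmented "views" of the same unlabeled batch produce two logit tensors.
student_out = torch.randn(8, 256)  # student head output for view 1
teacher_out = torch.randn(8, 256)  # momentum-teacher output for view 2
center = torch.zeros(256)          # running mean of teacher outputs
loss = dino_loss(student_out, teacher_out, center)
```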
Cat-Vision is designed for rapid prototyping and low-barrier experimentation. Key capabilities include:
- LoRA Fine-Tuning (via PEFT): Parameter-efficient adaptation for customizing models without retraining the full backbone. LoRA weights can be merged for efficient inference or further training (see the sketch after this list).
- Knowledge Distillation: Transfer learned representations from a large teacher model to a smaller student model for efficient deployment without significant performance loss.
- ClearML Integration: Built-in experiment tracking, logging, and artifact storage for fully reproducible workflows.
- Model Standardization: Shared formats and checkpoints compatible with Hugging Face and other major model repositories.
- Multimodal model development: Integration of DinoMX’s vision encoders with Large Language Models (LLMs) to build multimodal foundation models tailored for clinical AI applications.
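As a hedged illustration of the LoRA workflow mentioned above, the sketch below uses the Hugging Face peft library to attach low-rank adapters to a pretrained ViT and later merge them back into the base weights. The checkpoint name and target modules are illustrative assumptions, not DinoMX’s fixed defaults.

```python
# Minimal sketch: LoRA adaptation of a pretrained ViT encoder via PEFT.
# The checkpoint and target_modules below are illustrative assumptions.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

backbone = AutoModel.from_pretrained("facebook/dinov2-base")  # frozen ViT encoder

lora_config = LoraConfig(
    r=8,                                # low-rank dimension of the adapters
    lora_alpha=16,                      # scaling factor for adapter updates
    target_modules=["query", "value"],  # attention projections to adapt
    lora_dropout=0.1,
)
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()      # only the small adapters are trainable

# ... fine-tune on domain-specific data ...

# Merge the adapters into the base weights for efficient inference,
# or as a starting point for further training.
merged = model.merge_and_unload()
```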
Cat-Vision, the web-based self-service tool, is still in development. DinoMX, the modular and flexible framework described above, is complete and accessible. For more information, please contact ai@uky.edu.
Model & Code
Cat-Vision is powered by DinoMX, a modular PyTorch-based training framework for learning visual representations with Vision Transformers (ViTs). It builds on the DINO and iBOT approaches to self-supervised learning (the two objectives that DINOv2 combines), enabling the extraction of rich, transferable features from unlabeled datasets.
DinoMX replaces traditional convolutional segmentation heads like U-Net with an attention map-based segmentation strategy. Instead of relying on decoder-specific layers, the model uses its native transformer attention maps to localize and interpret image regions. This not only simplifies the architecture but also provides greater transparency and interpretability in model predictions.
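The sketch below illustrates the general idea with a hedged example: the last layer's CLS-to-patch attention of a pretrained ViT is averaged across heads and thresholded into a coarse localization mask. The checkpoint, input file, and threshold are assumptions for illustration; DinoMX’s exact segmentation strategy may differ.

```python
# Hedged sketch: attention-map-based localization with a pretrained ViT.
# Checkpoint, input file, and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

image = Image.open("slide_patch.png")  # placeholder input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Last-layer attention: (batch, heads, tokens, tokens); the CLS token is index 0.
attn = outputs.attentions[-1]
cls_to_patches = attn[0, :, 0, 1:].mean(dim=0)  # average heads, drop the CLS column
side = int(cls_to_patches.numel() ** 0.5)       # patch tokens form a square grid
mask = (cls_to_patches > cls_to_patches.quantile(0.8)).reshape(side, side)
# `mask` is a coarse, patch-level foreground map derived purely from attention.
```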
To adapt pretrained vision encoders to domain-specific tasks, DinoMX integrates parameter-efficient fine-tuning via LoRA (Low-Rank Adaptation). LoRA allows researchers to attach small adapter layers to frozen foundation models, enabling rapid customization at a fraction of the computational cost of full fine-tuning. Similarly, DinoMX facilitates knowledge distillation, enabling smaller models to inherit capabilities from larger pretrained networks, an important feature for researchers working with constrained computational resources.
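As a hedged sketch of the distillation idea, the example below trains a small student ViT to match a larger teacher's global features through a learned projection. The model checkpoints and the cosine-based loss are illustrative assumptions, not DinoMX’s exact recipe.

```python
# Minimal sketch: feature distillation from a large teacher to a small student.
# Checkpoints, projection head, and loss are illustrative assumptions.
import torch
from torch import nn
from transformers import AutoModel

teacher = AutoModel.from_pretrained("facebook/dinov2-large").eval()
student = AutoModel.from_pretrained("facebook/dinov2-small")
project = nn.Linear(student.config.hidden_size, teacher.config.hidden_size)

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(project.parameters()), lr=1e-4
)

pixel_values = torch.randn(2, 3, 224, 224)  # stand-in for a real image batch
with torch.no_grad():
    target = teacher(pixel_values=pixel_values).last_hidden_state[:, 0]  # CLS features

pred = project(student(pixel_values=pixel_values).last_hidden_state[:, 0])
loss = 1 - nn.functional.cosine_similarity(pred, target).mean()  # match the teacher
loss.backward()
optimizer.step()
```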
The entire training pipeline is optimized for distributed training. DinoMX uses two types of configuration files to run a training job; the first designates accelerator attributes that select the distributed training strategy, either FSDP (Fully Sharded Data Parallel) or DDP (Distributed Data Parallel). This ensures that even massive datasets can be processed efficiently. All experiments are fully tracked and reproducible thanks to ClearML integration, which handles experiment logging, model versioning, and orchestration of training jobs across compute backends.
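A hedged sketch of how such a setup typically looks: the same training loop runs under DDP or FSDP depending on an external Hugging Face Accelerate configuration, while ClearML records metrics. The project and task names are placeholders, the model is a stand-in, and DinoMX’s own two-file configuration schema is not reproduced here; a configured ClearML server is assumed.

```python
# Hedged sketch: a strategy-agnostic training loop with experiment tracking.
# The distributed strategy (DDP or FSDP) comes from an external Accelerate
# config (e.g. created with `accelerate config`); ClearML credentials are
# assumed to be configured. All names below are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from clearml import Task

task = Task.init(project_name="cat-vision", task_name="demo-run")
accelerator = Accelerator()

model = nn.Linear(768, 10)  # stand-in for a ViT backbone plus head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = TensorDataset(torch.randn(64, 768), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=8)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (features, labels) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(features), labels)
    accelerator.backward(loss)  # handles gradient sync/sharding per strategy
    optimizer.step()
    optimizer.zero_grad()
    task.get_logger().report_scalar("train", "loss", loss.item(), step)
```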
Outputs from the training pipeline follow a standardized model format compatible with repositories like Hugging Face, supporting streamlined sharing and reuse of pretrained models.
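For illustration, a checkpoint in the standard Hugging Face format can be saved, reloaded, and optionally published as sketched below; the checkpoint and repository names are placeholders.

```python
# Sketch: exporting an encoder in the standard Hugging Face format.
# Checkpoint and repository names are placeholders.
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")  # stand-in for a trained model
model.save_pretrained("./my-encoder")       # writes config.json plus weights
reloaded = AutoModel.from_pretrained("./my-encoder")
# model.push_to_hub("your-org/my-encoder")  # hypothetical repo; requires Hub auth
```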
Through this combination of SSL, efficient fine-tuning, distributed scalability, and ecosystem compatibility, DinoMX offers an innovative solution for developing powerful, task-specific multimodal models at scale.
Projects using Cat-Vision
The initial use case for Cat-Vision focused on neuropathology, where image labeling is labor-intensive and requires subject-matter expertise. As part of the Federated Brain Digital Slide Archive project, Mahmut Gokmen at CAAI developed NP-TEST-0, a Vision Transformer pretrained using DinoMX on real-world neuropathology data from the University of Kentucky. This model supports a range of downstream tasks, including:
- Transfer learning for related biomedical applications
- Tissue segmentation
- Patch-level classification
- Similarity search for large image repositories (see the sketch below)
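As a hedged sketch of the similarity-search task, the example below embeds images with a pretrained ViT encoder and ranks a small corpus by cosine similarity to a query. The checkpoint and file names are placeholder assumptions, standing in for an NP-TEST-0-style encoder.

```python
# Hedged sketch: embedding images with a ViT encoder and ranking by cosine
# similarity. Checkpoint and file names are placeholder assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> torch.Tensor:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # CLS token as a global descriptor

query = embed("query_patch.png")
corpus_paths = ["patch_a.png", "patch_b.png"]  # stand-ins for a repository
corpus = torch.stack([embed(p) for p in corpus_paths])

scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), corpus)
best = corpus_paths[scores.argmax()]  # most similar image to the query
```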
Resources
Email ai@uky.edu for more information.