Self-Supervised Dual-Domain Segmentation for Static and Dynamic Bone Histomorphometry Using LeJEPA

on May 15, 2026

Summary

Bone Histopathology Image Analysis: Scaling Annotation for Bone Analysis

Traditional bone histomorphometry quantifies static volumes and surfaces to diagnose metabolic diseases such as osteoporosis. CAAI is extending this practice by using deep learning to characterize functional states in mineralized bone tissue. The project applies computer vision to semi-automated slide annotation, identifying cell characteristics associated with different bone turnover rates. This shifts the focus from simple structural measurement to a deeper understanding of tissue quality and metabolic trajectory, enabling more personalized short- and long-term treatment planning for complex bone and kidney diseases.

To that end, the project develops a self-supervised dual-domain segmentation pipeline for static and dynamic bone histomorphometry. The system trains a single foundational Vision Transformer encoder on both fluorescence microscopy and Masson’s Trichrome brightfield ROI images, then routes tiles to modality-specific segmentation heads for downstream histomorphometric quantification.

Beyond model development, the project delivers a user-friendly interface that supports exploration and semi-automated annotation of bone histology images. By enabling scalable annotation of mineralized bone tissue states, this tool not only advances automated pathology analysis but also helps generate a unique, high-quality dataset that can serve as a foundational resource for researchers worldwide working in bone and mineral metabolism. The approach is designed to reduce dependence on labor-intensive manual annotation while supporting scalable quantitative analysis from limited labeled data.

The project workflow includes dataset preparation from fluorescence and brightfield ROI images, tile generation, LeJEPA pretraining of a shared vision encoder, extraction of embedding features, domain classification, and routing to Masson’s Trichrome or fluorescence segmentation heads for static and dynamic histomorphometric quantification.

Datasets/Model

This project began with a small, highly specialized dataset of just over 400 pathology images, far fewer than what is typically required to train deep learning models effectively. To mitigate this constraint, we applied a tiling and segmentation-based methodology in combination with the DINOv2 foundation model. This approach substantially increased the number of usable training samples while preserving local structural and spatial features relevant to bone biology despite severe data scarcity.

The dataset consists of microscopy image data from bone histomorphometry studies at the University of Kentucky. It includes two imaging modalities: fluorescence microscopy images and Masson’s Trichrome brightfield images. A total of 16,732 image tiles of size 224 x 224 pixels were derived from 108 fluorescence and 415 brightfield region-of-interest (ROI) images. Each 448 x 448 region produced four non-overlapping quadrant tiles plus one center crop to improve boundary-region representation.
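The quadrant-plus-center tiling described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `tile_region` and the synthetic input array are assumptions, and only the 448 x 448 region size and 224 x 224 tile size come from the description.

```python
import numpy as np

TILE = 224  # tile edge length in pixels, as stated in the text


def tile_region(region: np.ndarray) -> list:
    """Split one 448 x 448 region into four quadrant tiles plus a center crop."""
    h, w = region.shape[:2]
    if (h, w) != (2 * TILE, 2 * TILE):
        raise ValueError(f"expected a 448 x 448 region, got {h} x {w}")
    # Four non-overlapping quadrants, in row-major order.
    tiles = [
        region[y:y + TILE, x:x + TILE]
        for y in (0, TILE)
        for x in (0, TILE)
    ]
    # The center crop straddles the seams shared by the four quadrants.
    off = TILE // 2
    tiles.append(region[off:off + TILE, off:off + TILE])
    return tiles


# Synthetic stand-in for one RGB region (hypothetical data).
region = np.arange(448 * 448 * 3, dtype=np.uint32).reshape(448, 448, 3)
tiles = tile_region(region)
print(len(tiles), tiles[0].shape)
```

The center crop covers pixels that would otherwise appear only at the edges of the quadrant tiles, which is the boundary-coverage benefit the text refers to.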

The model uses a shared self-supervised Vision Transformer (ViT) encoder with register tokens, pretrained through Latent Joint-Embedding Predictive Architecture (LeJEPA). LeJEPA learns spatially rich representations from unlabeled image tiles without requiring pixel-level annotations during pretraining.

After pretraining, the feature space shows a natural separation between fluorescence and brightfield domains. A lightweight linear classifier uses this separation to route each tile to a modality-specific segmentation head.
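A minimal sketch of linear domain routing is shown below. It uses synthetic clusters in place of real encoder embeddings and fits a logistic-regression layer with plain gradient descent. The embedding width, the learning rate, and the names `fit_linear_router` and `route` are all assumptions made for illustration; the project's actual classifier and training procedure are not described in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for encoder embeddings: two well-separated clusters.
DIM = 16  # hypothetical embedding width, chosen only for illustration
fluo = rng.normal(loc=+1.0, scale=0.5, size=(200, DIM))
bright = rng.normal(loc=-1.0, scale=0.5, size=(200, DIM))
X = np.vstack([fluo, bright])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = fluorescence


def fit_linear_router(X, y, lr=0.1, steps=200):
    """Fit logistic regression (a single linear layer) by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(fluorescence)
        grad = p - y                             # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def route(embedding, w, b):
    """Return the name of the segmentation head for one tile embedding."""
    return "fluorescence" if embedding @ w + b > 0 else "brightfield"


w, b = fit_linear_router(X, y)
preds = np.array([route(e, w, b) == "fluorescence" for e in X])
accuracy = (preds == y.astype(bool)).mean()
```

Because the router is a single linear layer, it adds negligible cost at inference and works only as well as the pretrained feature space separates the two domains.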

The fluorescence head performs three-class segmentation of bone, osteoid, and active mineralization fronts. The region between tetracycline double-labels is filled to delineate the interlabel area representing new bone formation, which supports dynamic histomorphometric parameters such as Mineral Apposition Rate (MAR).
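One simple way to illustrate the interlabel fill and a MAR-style calculation is sketched below. It assumes a binary mask in which the two labels run roughly horizontally, fills each column from its first to its last label pixel, and divides the mean filled thickness by the labeling interval. The fill strategy, the function names, the pixel size, and the 10-day interval are all hypothetical; the project's actual post-processing logic may differ. Note that the filled thickness here includes the label pixels themselves.

```python
import numpy as np


def fill_between_labels(label_mask: np.ndarray) -> np.ndarray:
    """Column-wise fill from the first to the last label pixel in each column."""
    filled = np.zeros_like(label_mask, dtype=bool)
    for col in range(label_mask.shape[1]):
        rows = np.flatnonzero(label_mask[:, col])
        if rows.size >= 2:  # assumes both labels are present in this column
            filled[rows[0]:rows[-1] + 1, col] = True
    return filled


def mineral_apposition_rate(filled, um_per_px, interval_days):
    """Mean filled thickness (micrometres) divided by the labeling interval."""
    thickness_px = filled.sum(axis=0)
    thickness_px = thickness_px[thickness_px > 0]  # ignore empty columns
    if thickness_px.size == 0:
        return 0.0
    return float(thickness_px.mean() * um_per_px / interval_days)


# Toy example: two one-pixel-thick horizontal labels on rows 5 and 15.
mask = np.zeros((32, 32), dtype=bool)
mask[5, :] = True
mask[15, :] = True
filled = fill_between_labels(mask)
mar = mineral_apposition_rate(filled, um_per_px=1.0, interval_days=10)
```

A production version would need to handle curved or interrupted labels and single-labeled surfaces, which a column-wise fill does not.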

The brightfield head delineates mineralized bone from osteoid tissue, supporting static histomorphometric measurements such as tissue perimeter and area. Both heads use a progressive CNN decoder that fuses ViT patch-token embeddings with multi-head attention maps. At inference, ROI images are tiled, segmented independently, and reconstructed into full-resolution segmentation maps.
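The tile, segment, and reconstruct step at inference can be sketched as below. This is an assumption-laden illustration: it uses a plain non-overlapping grid with edge padding, and a thresholding stub stands in for the real segmentation head. How the project merges predictions from overlapping center crops is not described, so that part is omitted. The names `segment_roi` and `threshold_stub` are hypothetical.

```python
import numpy as np

TILE = 224  # tile edge length in pixels


def segment_roi(image: np.ndarray, segment_tile) -> np.ndarray:
    """Tile an ROI, segment each tile independently, and stitch the label map."""
    h, w = image.shape[:2]
    # Pad the bottom and right edges so the image divides evenly into tiles.
    pad_h = (-h) % TILE
    pad_w = (-w) % TILE
    pad = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(padded.shape[:2], dtype=np.uint8)
    for y in range(0, padded.shape[0], TILE):
        for x in range(0, padded.shape[1], TILE):
            tile = padded[y:y + TILE, x:x + TILE]
            out[y:y + TILE, x:x + TILE] = segment_tile(tile)
    return out[:h, :w]  # crop the padding away


def threshold_stub(tile: np.ndarray) -> np.ndarray:
    """Placeholder for a segmentation head: per-pixel brightness threshold."""
    return (tile.mean(axis=-1) > 127).astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 500, 3), dtype=np.uint8)
mask = segment_roi(image, threshold_stub)
```

Since the stub is a per-pixel operation, the stitched result matches thresholding the whole image directly, which makes the reconstruction easy to check.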

Evaluated on held-out test sets, the fluorescence model achieves a Dice score of 0.9436 (IoU: 0.8944), while the brightfield model achieves an overall Dice score of 0.8420 (IoU: 0.7560) for mineralized bone and osteoid tissue segmentation, with detection F1-scores exceeding 0.975 in both modalities. This level of performance suggests that the approach is reliable for downstream histomorphometric quantification, including mineralized bone area, unmineralized osteoid area, and Mineral Apposition Rate (MAR).
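For reference, Dice and IoU for a pair of binary masks can be computed as in the sketch below. The function name, the smoothing constant `eps`, and the toy masks are assumptions; the source does not state how the reported scores were averaged across classes or images.

```python
import numpy as np


def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # eps avoids division by zero when both masks are empty.
    dice = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)


# Toy masks: 8 predicted pixels, 8 target pixels, 4 shared.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True
target = np.zeros((4, 4), dtype=bool)
target[1:3, :] = True
dice, iou = dice_and_iou(pred, target)
```

In this toy case the overlap is 4 pixels and the union is 12, so Dice is about 0.5 and IoU about 0.33. Dice is never lower than IoU on the same mask pair, which is consistent with the ordering of the reported scores.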

The system does not use LLMs or tool calling. It is based on custom computer vision training and inference code built around LeJEPA-style self-supervised learning, Vision Transformers, domain routing, and modality-specific segmentation decoders.

Custom code was developed for ROI tiling, dual-domain self-supervised pretraining, feature-space domain routing, segmentation decoding, and full-resolution ROI reconstruction.

A key technical component is the use of a shared self-supervised encoder across heterogeneous imaging modalities while preserving modality-specific segmentation behavior through separate decoder heads. Another important component is the fluorescence post-processing logic that fills the region between tetracycline double-labels to recover the interlabel area used for dynamic histomorphometry.

The project faced two major challenges. The first was the substantial visual difference between fluorescence and brightfield images. Instead of training two completely independent pipelines, the system uses a shared self-supervised encoder to learn common spatial representations, then routes each tile to a modality-specific segmentation head. The second was boundary loss during tile extraction. To mitigate it, each 448 x 448 region was divided into four quadrant tiles and supplemented with a center crop, improving coverage of tissue boundaries during training and inference.

Access

The data used for this project is restricted and not publicly available.

No public Hugging Face dataset, model repository, or code repository is currently available for this project.

A demo is under development and is not yet publicly available.

Ownership

This project has been completed, and we are currently working on a demo to make the system easier to present and evaluate.

This project is related to broader work on self-supervised learning for specialized biomedical imaging domains, including DINO-MX and Vision Foundry. The same general training infrastructure may support other medical imaging and computational pathology projects.

An abstract titled “Self-Supervised Dual-Domain Segmentation for Static and Dynamic Bone Histomorphometry Using LeJEPA” has been prepared with Florence Lima as the presenting author.

Resources Utilized

Mahmut Gokmen (Lead Developer), Emily Collier, and Dr. Cody Bumgardner worked on this project.

Vital to the project’s success is the deep involvement of clinical domain experts from the College of Medicine’s Division of Nephrology, Bone and Mineral Metabolism. These collaborators provided critical definitions and annotations of key biological features as well as expert insight into how positional and spatial relationships between these features correspond to biological function and tissue turnover. This locational knowledge was essential for guiding model development and interpretation, ensuring that computational outputs align with clinically meaningful concepts.

The model was trained using high-performance GPU computing resources. The workflow requires GPU-based training for Vision Transformer self-supervised pretraining and segmentation model optimization. Storage was required for ROI image data, extracted tiles, model checkpoints, logs, and segmentation outputs.

In summary, the project relied on the following resources:

DGX / GPU computing resources
Local or institutional storage for ROI data and extracted tiles
Custom Python/PyTorch-based training and inference pipeline

No LLM Factory or LLM-based services were used.

Tags:

computer vision pathology