Summary
Even with a web interface and clear results and visualizations for each model, it can be difficult to interpret how a model achieved its results or what those results mean for the data itself. CLASSify therefore also provides an explainability score for each feature, indicating that feature’s contribution to the model’s predictions. With these scores, users can see how each column of the data affects the model and may gain new insights into the data itself.
Finally, CLASSify also provides tools for synthetic data generation. Clinical datasets frequently have imbalanced class labels or contain protected information, which calls for synthetically generated data that follows the same patterns and trends as the real data. With this interface, users can generate entirely new datasets, bolster existing data with synthetic examples to balance class labels, or fill missing values with appropriate data.
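To make the idea of balancing class labels with synthetic examples concrete, here is a minimal sketch in plain Python. It is not how CLASSify generates data (CLASSify relies on SDV models); it swaps in a much simpler technique, interpolating between two real rows of the same class, and assumes all features are numeric. The function name and the toy data are hypothetical.

```python
import random
from collections import Counter

def balance_with_synthetic_rows(rows, labels, seed=0):
    """Add interpolated rows until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Real rows belonging to this class; synthetic rows are built from these only.
        pool = [r for r, y in zip(rows, labels) if y == label]
        for _ in range(target - count):
            a, b = rng.choice(pool), rng.choice(pool)
            t = rng.random()
            # A point on the segment between two real rows of the same class.
            new_rows.append([x + t * (z - x) for x, z in zip(a, b)])
            new_labels.append(label)
    return new_rows, new_labels

# Toy dataset (hypothetical): four rows of class 0, two rows of class 1.
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [1.1, 2.2], [5.0, 6.0], [5.4, 6.4]]
labels = [0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = balance_with_synthetic_rows(rows, labels)
print(Counter(balanced_labels))
```

The original rows are kept unchanged and the synthetic rows are appended after them, so the real and generated data can still be told apart.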
CLASSify Specifics
CLASSify relies on a variety of additional tools, such as ClearML for job queueing, Optuna for parameter tuning, and S3 for secure storage. All training and evaluation runs on our DGX cluster, providing quick and efficient processing. The Synthetic Data Vault (SDV) library provides the models used for synthetic data generation. Explainability scores are calculated with SHAP (SHapley Additive exPlanations), which estimates the importance of each feature for each model.
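SHAP scores are based on Shapley values, which split the difference between a model’s prediction and a baseline prediction among the input features. The sketch below computes exact Shapley values for a tiny toy model by enumerating every feature subset. It illustrates the underlying idea only: it does not use the SHAP library, and the toy model, the all-zeros baseline, and the choice to represent an "absent" feature by its baseline value are assumptions made for this example.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`."""
    n = len(instance)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Features in the coalition take the instance's value;
                # all other features stay at the baseline value.
                without_i = [instance[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = instance[i]
                total += weight * (predict(with_i) - predict(without_i))
        values.append(total)
    return values

def toy_model(x):
    # Hypothetical model: linear terms plus an interaction between features 0 and 1.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[1] - 3.0 * x[2]

instance = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that makes them readable as per-feature contributions. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations or model-specific shortcuts.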
The standard CLASSify instance is not HIPAA-compliant, but private, HIPAA-compliant instances can be created on request. Please reach out to ai@uky.edu to learn more.
Available Models
CLASSify currently provides ten unique machine learning models to train and evaluate:
- Random Forest- common ensemble classification algorithm using decision trees
- Gradient Boosting- like random forest, an ensemble of decision trees, but builds the trees sequentially to minimize loss using gradient descent
- Histogram-based Gradient Boosting- variant of gradient boosting that bins continuous features into histograms for faster training
- XGBoost- uses regularization and pruning to prevent overfitting in the gradient boosting algorithm
- Bagging- uses bootstrapped samples to train multiple estimators, typically decision trees
- Logistic Regression- statistical method that fits a linear model to estimate class probabilities
- SGD Classifier- uses stochastic gradient descent to optimize linear model parameters
- K-Nearest Neighbors- compares ‘distance’ between points to determine classes
- Multi-Layer Perceptron- neural network designed for classification
- TabPFN- more complex transformer-based model pretrained for classification of tabular data
Each of these models has customizable parameters. When submitting a job, you can modify them, leave them at their defaults, or use parameter tuning to automatically search for the best-performing parameter combination for each model.
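As a rough illustration of what parameter tuning does, the sketch below implements a small K-Nearest Neighbors classifier in plain Python and picks the number of neighbors `k` that scores best on a held-out validation set. This is an assumption-laden simplification: CLASSify uses Optuna, which samples parameter combinations more cleverly than the exhaustive search shown here, and the function names, candidate values, and toy data are all hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(val_x, val_y))
    return hits / len(val_x)

def tune_k(train_x, train_y, val_x, val_y, candidates):
    """Try every candidate k and return the best one together with all scores."""
    scores = {k: accuracy(train_x, train_y, val_x, val_y, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first candidate with the top score
    return best_k, scores

# Toy data (hypothetical): two clusters plus one mislabeled point at (0.6, 0.6).
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
           [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0], [0.6, 0.6]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
val_x = [[0.5, 0.5], [0.7, 0.7], [5.5, 5.5], [0.2, 0.2]]
val_y = [0, 0, 1, 0]

best_k, scores = tune_k(train_x, train_y, val_x, val_y, candidates=[1, 3, 5, 9])
print(best_k, scores)
```

With a very small `k`, the single mislabeled point dominates nearby predictions. With `k` equal to the size of the training set, every prediction collapses to the overall majority class. The search settles on a value in between, which is the kind of trade-off automated tuning is meant to find.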
Collaborative Projects using CLASSify
Below are a few examples of projects in which researchers have used CLASSify.
- Hepatitis A diagnosis using a variety of demographic and medical data
- Predicting adherence to a remote alcohol monitoring program
- Identifying key predictors of adipose miR-1 levels after exercise
- Word domain clustering using LLM generated vector embeddings
- Classifying osteoporosis/osteopenia with patient data and measurements
Resources
Accessing CLASSify
CLASSify is available on an individual basis on CAAI’s self-service tool website. Before you can get started, you must be granted the necessary permissions from a CAAI Administrator. Please contact us for access or submit our collaboration intake form here.
User Guide
The User Guide provides a detailed overview of the system’s capabilities and functions. Check it out here.
Instructional Video
A tutorial video explaining CLASSify and how to use it can be found here.
Citation
A paper detailing the development and usage of this tool was submitted to and accepted by the American Medical Informatics Association (AMIA) in 2023. This paper can be found here: https://arxiv.org/abs/2310.03618