CLASSify: A Web-based Tool for Machine Learning

October 21, 2024

Summary

Clinicians often produce large amounts of data, from patient metrics to drug component analysis. Classical statistical analysis can provide a peek into data interactions, but in many cases machine learning can provide additional insight into new features. With the recent boom in artificial intelligence models, clinicians have become more interested in applying machine learning to their data. However, many do not have the knowledge and skills needed to train a model and run inference with it effectively. With ML techniques and a user-friendly web interface, we can give these clinicians a way to automatically train many different machine learning models on their tabular data and find which one produces the best results. We therefore present CLASSify as a way for clinicians to bridge the gap to artificial intelligence.

Even with a web interface and clear results and visualizations for each model, it can be difficult to interpret how a model achieved its results or what they could mean for the data itself. The interface therefore also provides an explainability score for each feature, indicating that feature's contribution to the model’s predictions. With these scores, users can see how each column of the data affects the model and may gain new insights into the data itself.

Finally, CLASSify also provides tools for synthetic data generation. Clinical datasets frequently have imbalanced class labels or protected information that necessitates the use of synthetically-generated data that follows the same patterns and trends as real data. With this interface, users can generate entirely new datasets, bolster existing data with synthetic examples to balance class labels, or fill missing values with appropriate data.
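As a rough illustration of what "balancing class labels" means, the sketch below duplicates randomly chosen minority-class rows until every class is the same size. This is naive random oversampling, swapped in here only because it fits in a few lines; it is not how CLASSify or the SDV models generate synthetic rows, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class (naive oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy dataset: four "neg" rows and one "pos" row.
rows = [[5.1, 0], [4.9, 1], [6.2, 0], [5.8, 1], [7.0, 1]]
labels = ["neg", "neg", "neg", "neg", "pos"]

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(Counter(balanced_labels))
```

A generative model goes further than this: instead of copying existing rows, it samples new rows that follow the patterns learned from the real data.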

CLASSify Specifics

A variety of additional tools and programs are used in CLASSify, such as ClearML for job queueing, Optuna for parameter tuning, and S3 for secure storage. All training and evaluation is run on our DGX cluster, providing quick and efficient processing. The Synthetic Data Vault (SDV) library provides the models used for synthetic data generation. Explainability scores are calculated using the SHAP algorithm to identify feature importances for each model.
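SHAP scores are based on Shapley values, which split the difference between a model's prediction for one row and its prediction for a baseline row across the individual features. The sketch below computes exact Shapley values by brute force for a tiny, hypothetical scoring function. It is meant only to show the idea; it does not use the SHAP library and is not CLASSify's implementation, and the model, input row, and baseline are all made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row `x` relative to `baseline`.
    Features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        row = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(row)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(row):
    # Hypothetical linear risk score over three features.
    return 0.5 * row[0] + 2.0 * row[1] - 1.0 * row[2]


x = [4.0, 1.0, 3.0]         # row to explain
baseline = [2.0, 0.0, 3.0]  # reference row, e.g. feature averages
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The per-feature values sum to the difference between the prediction for the row and the prediction for the baseline, which is what makes them readable as contributions. The brute-force loop grows exponentially with the number of features, so practical tools rely on approximations or model-specific shortcuts.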

CLASSify is not HIPAA-compliant, but private, HIPAA-compliant instances can be created on request. Please reach out to ai@uky.edu to learn more.

Available Models

CLASSify currently provides ten unique machine learning models to train and evaluate:

Random Forest - common ensemble classification algorithm using decision trees
Gradient Boosting - tree ensemble like random forest, but builds trees sequentially to minimize loss using gradient descent
Histogram-based Gradient Boosting - gradient boosting optimization that bins continuous variables
XGBoost - gradient boosting implementation that uses regularization and pruning to prevent overfitting
Bagging - uses bootstrapped samples to train multiple estimators, typically decision trees
Logistic Regression - statistical method that fits a linear model to estimate class probabilities
SGD Classifier - uses stochastic gradient descent to optimize linear model parameters
K-Nearest Neighbors - compares the ‘distance’ between points to determine classes
Multi-Layer Perceptron - neural network designed for classification
TabPFN - a more complex, transformer-based model for tabular data

Each of these models has customizable parameters. When submitting a job, you can modify them, leave them at their defaults, or run parameter tuning to automatically determine the best parameter combination for each model.
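To make the workflow concrete, here is a minimal sketch of training several of the models listed above on the same tabular data, comparing them by cross-validated accuracy, and then tuning one parameter. It assumes scikit-learn is installed and uses a generated toy dataset. CLASSify's own pipeline is not shown in this article, so the model settings and scoring here are assumptions, and a plain grid search stands in for the Optuna-based tuning that CLASSify uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy tabular dataset standing in for an uploaded file.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# A few of the model families from the list, with illustrative settings.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best model:", best_name)

# Simple parameter tuning: try several neighbor counts and keep the best.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}, cv=5
)
search.fit(X, y)
print("Best n_neighbors:", search.best_params_["n_neighbors"])
```

Scoring every model with the same cross-validation splits keeps the comparison fair, which is the main point of training many models side by side.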

Collaborative Projects using CLASSify

Below are a few examples of projects for which researchers have used CLASSify.

Hepatitis A diagnosis using a variety of demographic and medical data
Predicting adherence to a remote alcohol monitoring program
Identifying key predictors of adipose miR-1 levels after exercise
Word domain clustering using LLM-generated vector embeddings
Classifying osteoporosis/osteopenia with patient data and measurements

Resources

Accessing CLASSify

CLASSify is available on an individual basis on CAAI’s self-service tool website. Before you can get started, you must be granted the necessary permissions from a CAAI Administrator. Please contact us for access or submit our collaboration intake form here.

User Guide

The User Guide provides a detailed overview of the system's capabilities and functions. Check it out here.

Instructional Video

A tutorial video explaining CLASSify and how to use it can be found here.

Citation

A paper detailing the development and usage of this tool was submitted to and accepted by the American Medical Informatics Association (AMIA) in 2023. This paper can be found here: CLASSify: A Web-Based Tool for Machine Learning
