CAT-Talk: AI-Powered Transcription Platform

Summary

Project Description: CAT-Talk is a cutting-edge, secure, web-based AI platform designed to quickly generate accurate, speaker-labeled, and time-stamped transcriptions from audio and video files. It integrates advanced speech-to-text and speaker-diarization capabilities with summarization and theme-extraction tools, all within secure, HIPAA-compliant infrastructure.

Problem: Healthcare professionals often face a significant administrative burden, particularly with patient note-taking and documentation. Manual transcription of medical conversations is time-consuming, prone to human error, and can detract from direct patient care. Existing solutions typically involve manual processes or less sophisticated transcription tools that struggle with complex conversations, noisy environments, and overlapping dialogue, leading to inefficiencies and potential inaccuracies.

Existing Solutions: The primary existing “solution” for detailed documentation in many healthcare settings is manual note-taking and transcription, which is highly labor-intensive and inefficient. While some basic transcription services exist, they often lack the accuracy, speaker identification, and security features necessary for sensitive medical conversations. CAT-Talk aims to significantly improve upon these manual and less advanced digital methods, streamlining the process and making manual verification more efficient.

Solution: CAT-Talk combines the latest advancements in speech-to-text transcription and speaker labeling (diarization) into a unified platform. It uses PyAnnote speaker-diarization-3.1 for accurate speaker identification and OpenAI’s whisper-large-v3 for high-quality speech-to-text transcription. These two processes are executed and merged into a single, time-stamped, speaker-labeled transcript using WhisperX, which handles the alignment and combination. When a user uploads an audio file via the secure web interface, ClearML manages job scheduling so that transcription and diarization run efficiently in parallel. The resulting transcript is presented for easy human verification in a user-friendly web interface that offers integrated summarization and theme-extraction tools. The entire platform runs on University of Kentucky-owned infrastructure that is NIST SP 800-53 and HIPAA compliant, ensuring robust data security and privacy.
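
The CAT-Talk codebase itself is not public, but the open-source components named above are typically combined roughly as in the sketch below. This is a minimal illustration only, assuming a GPU, a placeholder local audio file, and a Hugging Face access token for the gated pyannote model; exact function locations can differ between WhisperX versions, and CAT-Talk’s actual integration is not shown here.

    # Minimal sketch of the open-source pipeline CAT-Talk builds on; the file path
    # and Hugging Face token are placeholders, not CAT-Talk's real configuration.
    import whisperx

    device = "cuda"              # assumes a GPU is available
    audio_file = "visit.wav"     # placeholder for a file uploaded via the web interface

    # 1. Speech-to-text with Whisper large-v3 (batched inference through WhisperX).
    audio = whisperx.load_audio(audio_file)
    model = whisperx.load_model("large-v3", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # 2. Forced phoneme alignment for accurate word-level timestamps.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 3. Speaker diarization with pyannote speaker-diarization-3.1, then merge the
    #    speaker labels into the aligned transcript.
    diarize_model = whisperx.DiarizationPipeline(
        use_auth_token="YOUR_HF_TOKEN", device=device  # placeholder token
    )
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Each segment now carries start/end times, text, and (where assigned) a speaker label.
    for seg in result["segments"]:
        print(f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {seg.get("speaker", "UNKNOWN")}: {seg["text"]}')

In CAT-Talk these steps are scheduled as ClearML jobs rather than run inline, and the merged transcript is handed to the web interface for verification, summarization, and theme extraction.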

Impact: CAT-Talk provides a scalable, automated, and secure approach to generating and analyzing audio-video transcriptions. By drastically reducing the need for manual transcription, it frees up valuable time for healthcare professionals, allowing them to focus more on patient care. The system’s ability to accurately handle complex, noisy, and overlapping conversations significantly improves the quality and reliability of documentation. This directly contributes to reducing administrative burden, enhancing efficiency, and potentially improving the quality of healthcare delivery by providing readily accessible and analyzable conversational data.

Datasets/Model

Description of Datasets: Development and evaluation of CAT-Talk have drawn on more than 40 hours of simulated medical conversations between patients and providers. This dataset was crucial for training and validating the system’s ability to handle the nuances of clinical dialogue, including specialized terminology, varied speaking styles, and challenging acoustic conditions. It currently comprises audio and video files with associated ground-truth transcriptions and speaker labels.

DOIs: There are no public DOIs for this specific internal dataset due to its protected nature.

Available Models: CAT-Talk’s functionality is built on several advanced open-source models and tools:

  • PyAnnote speaker-diarization-3.1: This model performs speaker labeling (diarization), accurately identifying who is speaking and when. It is an open-source, PyTorch-based pipeline that processes mono audio sampled at 16 kHz and outputs the diarization as an Annotation instance (see the stand-alone sketch after this list).
  • OpenAI’s whisper-large-v3: This is a state-of-the-art automatic speech recognition (ASR) and speech translation model used for the core transcription task. It is a Transformer-based encoder-decoder model trained on massive amounts of audio data, designed to deliver high accuracy across a wide range of languages and audio conditions, even in noisy environments with multiple speakers.
  • WhisperX: This tool combines the outputs from PyAnnote and Whisper. It integrates diarization and transcription tasks into a single time-stamped, speaker-labeled transcript by leveraging forced phoneme alignment and voice activity detection to ensure accurate word-level timestamps and precise speaker segmentation.
  • ClearML: This is an MLOps platform that manages job scheduling for the transcription and diarization processes, ensuring simultaneous and efficient processing when a user uploads an audio file via the web interface. ClearML helps automate, integrate, and scale AI development workflows.
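
As a stand-alone illustration of the diarization component described in the first bullet above, the sketch below runs pyannote/speaker-diarization-3.1 directly. The file name and Hugging Face token are placeholders, and the model’s gated-access terms must be accepted on Hugging Face; within CAT-Talk this model is invoked through WhisperX rather than called directly.

    # Stand-alone use of pyannote/speaker-diarization-3.1 (placeholder token and file).
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
    )

    # The pipeline expects mono 16 kHz audio and returns an Annotation instance.
    diarization = pipeline("visit.wav")

    # Iterate over speaker turns: start/end times plus anonymous speaker labels.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")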

Access

How to use dataset, code, etc: Our transcription platform, CAT-Talk, is available for experimental use within the secure, University of Kentucky-owned infrastructure. Due to the sensitive nature of the data and the HIPAA-compliant environment, direct public access to the dataset or to the underlying code for independent deployment is not available. To request access or inquire about collaboration, please contact vaiden.logan@uky.edu.

Ownership

Project Status: CAT-Talk is an actively ongoing project. The platform is continuously being developed and refined, with current efforts focused on enhancing its capabilities and expanding its applications, particularly within healthcare and historical documentation.

Other projects using this (if applicable): CAT-Talk serves as a collaborative platform for several other projects, demonstrating its versatility and utility across different domains:

  • SpeakEZ: A collaboration with the University of Kentucky’s Nunn Center for Oral History, aimed at leveraging CAT-Talk for the accurate and efficient transcription of oral history interviews, enhancing their accessibility and research value.

Publications: A paper detailing the development and usage of CAT-Talk has been published.

Researchers utilizing CAT-Talk in their work are encouraged to cite the platform and relevant publications appropriately.

Resources Utilized

Cost breakdowns: Detailed cost breakdowns for the development of CAT-Talk are not publicly available as it is an internal project utilizing institutional resources.

Services used: The development and operation of CAT-Talk heavily utilize:

  • NIST SP 800-53 and HIPAA-compliant infrastructure: This ensures the secure handling and processing of sensitive health information.
  • ClearML: For managing job scheduling, workflow automation, and experiment tracking (see the sketch after this list).
  • PyAnnote (open-source model): For speaker diarization.
  • OpenAI Whisper (open-source model): For speech-to-text transcription.
  • WhisperX (open-source tool): For combining diarization and transcription outputs.
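
The snippet below is a hypothetical sketch of how an uploaded file’s transcription job could be registered with ClearML and handed to a worker queue. The project, task, and queue names are illustrative only and do not reflect CAT-Talk’s actual configuration.

    # Hypothetical sketch: register a transcription job with ClearML and hand it to
    # a worker queue. Project, task, and queue names are placeholders.
    from clearml import Task

    task = Task.init(project_name="cat-talk", task_name="transcribe-upload")

    # Record the job parameters (e.g., which uploaded file and model to use).
    task.connect({"audio_file": "visit.wav", "model": "large-v3"})

    # Hand execution to a ClearML agent listening on a (hypothetical) GPU queue;
    # the local process exits and the agent re-runs this script remotely.
    task.execute_remotely(queue_name="gpu", exit_process=True)

    # ... the WhisperX transcription/diarization steps would run here on the agent ...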

The project involves dedicated development staff, though specific FTE allocations are not provided in the summary.