Summary
Combining the latest advancements in speech-to-text transcription and speaker labeling (diarization), we are developing a platform that can quickly produce accurate transcriptions of audio and video files. The goal is to provide a scalable, automated, and secure approach to generating transcriptions and performing analysis on those outputs.
This effort began as a solution to the administrative burden healthcare professionals face, specifically in patient note-taking. Evaluation so far has covered over 40 hours of medical conversations between patients and their providers. The system handles complex conversations, noisy environments, and overlapping dialogue. Above all, it strives to make manual evaluation more efficient.
CAT-Talk is a secure, web-based AI platform offering fast, speaker-labeled, time-stamped transcripts, with integrated summarization and theme-extraction tools, hosted on UK-owned, NIST 800-53- and HIPAA-compliant infrastructure.
Access
Our transcription services platform is available for experimental use. For more information, please reach out to ai@uky.edu.
Collaborative Projects using CAAI’s Transcription Platform
SpeakEZ – A collaboration with UK’s Nunn Center for Oral History
Ambient Listening – An exploration of quality improvement in healthcare settings.
Models
Integral to this system are two highly advanced open-source tools: the PyAnnote speaker-diarization-3.1 model for speaker labeling and OpenAI’s Whisper for transcription. Diarization and transcription tasks are executed in parallel to optimize processing efficiency. When a user uploads an audio file via our web interface, ClearML manages job scheduling, ensuring simultaneous processing. The two outputs are then merged by aligning timestamps and calculating speaker probabilities. Discrepancies between diarization and transcription timestamps are resolved by probabilistic matching, using the Whisper transcription as the source of truth. Speaker identification is further refined through the use of an LLM, which provides additional context-aware adjustments that improve the accuracy of speaker labeling. The result is one unified, time-stamped, speaker-labeled transcript, optimized for simple human verification via a user-friendly web interface.
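The merge step described above can be sketched as follows. This is a minimal, hypothetical illustration, not CAT-Talk's actual implementation: it assumes Whisper produces time-stamped text segments and PyAnnote produces time-stamped speaker turns, and assigns each segment the speaker with the greatest total overlapping duration (overlap acting as an unnormalized speaker probability). All function and field names here are illustrative.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Duration of overlap between two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def merge_transcript(whisper_segments, diarization_turns):
    """Label each transcript segment with its most likely speaker.

    whisper_segments:  [{"start": float, "end": float, "text": str}]
    diarization_turns: [{"start": float, "end": float, "speaker": str}]
    """
    merged = []
    for seg in whisper_segments:
        # Accumulate overlapping duration per speaker; this acts as an
        # unnormalized probability for each candidate speaker.
        scores = {}
        for turn in diarization_turns:
            d = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if d > 0:
                scores[turn["speaker"]] = scores.get(turn["speaker"], 0.0) + d
        # Whisper timestamps are the source of truth: the segment is kept
        # either way, and an unresolved speaker is flagged for downstream
        # LLM refinement or human review.
        speaker = max(scores, key=scores.get) if scores else "UNKNOWN"
        merged.append({**seg, "speaker": speaker})
    return merged
```

In practice, the LLM pass described above would then revisit low-confidence or "UNKNOWN" labels using conversational context.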
Resources
Read more about this development in the linked paper, Toward Automated Clinical Transcriptions.
A training video is available on YouTube.