Combining the latest advancements in speech-to-text transcription and speaker labeling (diarization), we are developing a platform that quickly produces accurate transcriptions of audio and video files. The goal is a scalable, automated, and secure approach to generating transcriptions. Evaluation so far has covered over 40 hours of medical conversations between patients and their providers. The system handles complex conversations, noisy environments, and overlapping dialog. Above all, it aims to make manual review of transcripts more efficient.

The system was designed with security in mind: data is encrypted both in transit and at rest. HIPAA-compliant instances are a future development goal.

Citation

Read more about the development in the linked paper, Toward Automated Clinical Transcriptions.

How it works

Integral to this system are two open-source tools: pyannote's speaker-diarization-3.1 model for speaker labeling and OpenAI's Whisper for transcription.
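The snippet below is a minimal sketch of how these two tools are typically invoked, not the platform's exact code. It assumes the open-source openai-whisper and pyannote.audio packages, a local file name (visit_audio.wav), and a Hugging Face access token for the gated pyannote pipeline.

```python
import whisper
from pyannote.audio import Pipeline

# Transcription: Whisper returns text with segment-level timestamps.
asr_model = whisper.load_model("large-v3")
transcription = asr_model.transcribe("visit_audio.wav")

# Diarization: pyannote labels "who spoke when" as timed speaker turns.
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder token
)
diarization = diarization_pipeline("visit_audio.wav")

# Two independent outputs: timed text segments and timed speaker turns.
for segment in transcription["segments"]:
    print(segment["start"], segment["end"], segment["text"])

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(turn.start, turn.end, speaker)
```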

Diarization and transcription tasks are executed in parallel to optimize processing efficiency. When a user uploads an audio file via our web interface, ClearML manages job scheduling, ensuring the two tasks run simultaneously. The two outputs are then merged by aligning timestamps and calculating speaker probabilities. Discrepancies between diarization and transcription timestamps are resolved through probabilistic matching, with the Whisper transcription treated as the source of truth. Speaker identification is further refined with an LLM, which provides context-aware adjustments that improve the accuracy of speaker labeling.
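The sketch below illustrates the basic idea of the merge step under simplifying assumptions: Whisper-style segments ({"start", "end", "text"}) and pyannote-style speaker turns ((start, end, speaker)), with each transcript segment assigned the speaker that overlaps it the most. The platform's probabilistic matching and LLM-based refinement go beyond this.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the time overlap between two intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def merge(transcript_segments, speaker_turns):
    """Attach a speaker label and a rough confidence to each transcript segment."""
    merged = []
    for seg in transcript_segments:
        # Score every speaker by total overlap with this transcript segment.
        scores = {}
        for start, end, speaker in speaker_turns:
            scores[speaker] = scores.get(speaker, 0.0) + overlap(
                seg["start"], seg["end"], start, end
            )
        total = sum(scores.values())
        # Pick the most-overlapping speaker; fall back to "unknown" when no
        # diarization turn overlaps (a timestamp discrepancy to resolve later).
        best = max(scores, key=scores.get) if total > 0 else "unknown"
        merged.append(
            {
                "start": seg["start"],
                "end": seg["end"],
                "speaker": best,
                "confidence": scores.get(best, 0.0) / total if total else 0.0,
                "text": seg["text"],
            }
        )
    return merged
```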

The result is one unified, time-stamped, speaker-labeled transcript optimized for simple human verification.

Access

Our transcription services platform is still in development. For more information, please reach out to ai@uky.edu.

Collaborative Projects using CAAI’s Transcription Platform

SpeakEZ – transcription, diarization, summarization, theme extraction

Authors