Jitsi Meet: why transcription for electronic medical records requires Jigasi and Vosk
Transcription in Jitsi Meet turned out to be not a 'button' but a separate stack: Jigasi joins the call as a participant, sends audio to Vosk, and stores the…
AI-processed from Habr AI; edited by Hamidun News
Jitsi Meet supports transcription, but in a real product it is not a toggle in the interface; it is a separate infrastructure layer. In the case of automatically filling in an EMR after video consultations, this layer proved to be the most labor-intensive part: it required bringing together Jigasi, XMPP/SIP, Vosk, and LLM post-processing.
Architecture Under the Hood
The basic idea of Jitsi looks simple only on the surface. The video call itself is handled by the Jitsi Meet frontend, the Jitsi Videobridge media server, the Jicofo conference manager, and the Prosody XMPP layer. But transcription doesn't live inside a single button in the interface.
Jigasi is responsible for it: a separate gateway that joins the room as a regular participant, receives the other participants' audio, and forwards the stream to an external speech recognition service. Seen from the interface, this looks simple, and that impression is misleading: the task quickly moves from the "connect an API" level to the infrastructure level.
It is not enough to enable an option in the UI; you have to coordinate several services, network connections, and a separate STT backend. In the case analyzed, that backend was Vosk, running over WebSocket. The approach suits asynchronous processing after a consultation: the system does not have to meet strict real-time latency requirements, and the resulting text can be parsed once the call has ended.
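To check the STT backend independently of Jigasi, it helps to know what a client sends over the socket. The sketch below assumes the message order used by the example client in the vosk-server project, as far as I know it: a JSON config frame, binary PCM chunks, then a JSON end-of-stream frame. The function name `vosk_messages` and the chunk size are illustrative, and the transport itself (any WebSocket client library) is deliberately left out.

```python
import json
from typing import Iterator, Union


def vosk_messages(pcm: bytes, sample_rate: int = 16000,
                  chunk_bytes: int = 8000) -> Iterator[Union[str, bytes]]:
    """Yield the frames a client would send to a Vosk WebSocket server.

    Assumption: the server expects a config frame, raw 16-bit mono PCM
    in binary frames, and a final EOF frame, in that order.
    """
    # 1. Text frame announcing the sample rate of the audio.
    yield json.dumps({"config": {"sample_rate": sample_rate}})
    # 2. Binary frames with the audio, cut into fixed-size chunks.
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
    # 3. Text frame telling the server that the stream is finished.
    yield json.dumps({"eof": 1})
```

Each yielded `str` would go out as a text frame and each `bytes` object as a binary frame. Keeping the sequence separate from the transport makes it easy to test without a running server.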
Where the Scheme Breaks
The main problem is that transcription has several independent points of failure from the start. The Jigasi config, the Jitsi Meet frontend parameters, and the availability of the STT service all have to line up at once. If one layer is misconfigured, the system often does not fail with a clear error; it simply does not deliver the expected result: the bot does not enter the room, the file is not saved, or the text is too poor for practical use. Without reading the logs, such failures are easy to mistake for random glitches.
"Jigasi is a SIP gateway with transcription, not the other way around."
Typical trouble spots:
- a separate XMPP account for Jigasi in Prosody: if there's an error with it, the transcriber bot won't appear in the conference at all;
- permissions on the directory with transcripts: intermediate subtitles may come through, but the final file won't be saved to disk;
- choice of STT model: a basic Vosk model is enough for an MVP, but it handles medical terms, drug names, and dosages worse;
- detection of session end: Jigasi writes the final text only when the room is actually empty, but the downstream pipeline needs a reliable trigger for processing.
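Two of the failure points listed above, directory permissions and STT availability, can be checked before a consultation starts instead of being discovered from a missing file afterwards. This is a minimal sketch, not part of Jigasi: the function names are hypothetical, and the directory, host, and port must come from your own deployment.

```python
import os
import socket
import tempfile


def dir_is_writable(path: str) -> bool:
    """Return True if a file can actually be created inside `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating a real temporary file also catches read-only mounts,
        # which a plain permission-bit check would miss.
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False


def tcp_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(transcript_dir: str, stt_host: str, stt_port: int) -> dict:
    """Collect the checks into one report that is easy to log."""
    return {
        "transcript_dir_writable": dir_is_writable(transcript_dir),
        "stt_reachable": tcp_is_reachable(stt_host, stt_port),
    }
```

The check should run as the same OS user as Jigasi, otherwise the directory test says nothing about the permissions that matter. A reachable port only shows that something is listening, not that the speech model behind it is loaded.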
Another nuance is that saving and sending results are configured separately. One set of parameters controls writing the final text to disk after the consultation is over; another controls forwarding intermediate fragments to participants via XMPP. For a product that fills in the EMR after the fact, reliably getting the final file matters more than showing subtitles in real time. Without that file, the next processing stage has nothing to start from, and the whole automation stalls.
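Since the downstream pipeline depends on the final file, it needs some rule for deciding that the file is complete. One simple heuristic, which is an assumption of this sketch rather than a Jigasi mechanism, is to treat a transcript as final once it is non-empty and has not been modified for a quiet period. The name `transcript_is_final` and the 30-second default are illustrative.

```python
import os
import time
from typing import Optional


def transcript_is_final(path: str, quiet_seconds: float = 30.0,
                        now: Optional[float] = None) -> bool:
    """Heuristic trigger: the file exists, is non-empty, and has been
    left untouched for at least `quiet_seconds`."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    # `now` is injectable so the rule can be tested without sleeping.
    current = time.time() if now is None else now
    return st.st_size > 0 and (current - st.st_mtime) >= quiet_seconds
```

A scheduler or file watcher can poll this function and start LLM processing only when it returns `True`, which avoids picking up a half-written transcript.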
From Text to EMR
Even after Jitsi has been set up successfully, the task is not finished. The Jigasi output is a raw dialog with timestamps: the doctor asks questions, the patient answers, and then come prescriptions and recommendations. In its original form such text is almost useless for a medical record, because the system needs structured entities rather than utterances: complaints, symptom history, medications, dosages, administration schedule, and further actions.
A large layer of transformations therefore remains between speech recognition and the EMR, which is why LLM processing was added on top of STT. The model normalizes the text, corrects some recognition errors based on context, and breaks the result down into fields compatible with FHIR structures.
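The last step, going from the model's reply to FHIR-compatible fields, can be sketched as follows. The JSON shape expected from the LLM (`medications`, `name`, `dosage`) is a hypothetical contract invented for this example, and the output is only a minimal R4-style `MedicationRequest` skeleton, not a complete or validated resource.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Medication:
    name: str
    dosage_text: str  # free text: dose and schedule as dictated


def parse_llm_medications(raw: str) -> List[Medication]:
    """Validate the LLM's JSON reply and reject incomplete entries."""
    data = json.loads(raw)
    meds = []
    for item in data.get("medications", []):
        name = str(item.get("name", "")).strip()
        dosage = str(item.get("dosage", "")).strip()
        if not name or not dosage:
            # Fail loudly instead of writing a half-filled prescription.
            raise ValueError(f"incomplete medication entry: {item!r}")
        meds.append(Medication(name, dosage))
    return meds


def to_medication_request(med: Medication, patient_ref: str) -> dict:
    """Build a draft, R4-style MedicationRequest skeleton."""
    return {
        "resourceType": "MedicationRequest",
        "status": "draft",      # not active until a doctor reviews it
        "intent": "proposal",
        "medicationCodeableConcept": {"text": med.name},
        "subject": {"reference": patient_ref},
        "dosageInstruction": [{"text": med.dosage_text}],
    }
```

Keeping the drug name and dosage as free text is a deliberate simplification: mapping them to coded terminologies is a separate task that this sketch does not attempt.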
After that, the data goes to a frontend form where the doctor checks and confirms the record before it is finally saved to the EMR. Human-in-the-loop review here is not excessive caution but a mandatory requirement: in a clinical scenario, drugs, doses, and prescriptions cannot be written to the chart automatically without review. This is also where the limits of "cheap" STT become visible.
If the base model recognizes domain vocabulary poorly, the rest of the chain starts spending resources on correcting its errors. For a production version, the obvious candidates are heavier Vosk models, a specialized engine such as Deepgram with a medical profile, or a combination of STT and LLM normalization in which the language model compensates for recognition errors. Without one of these, the cost of errors is too high by the time they reach the medical record.
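Because recognition errors can survive all the way to the draft record, the doctor's confirmation step is worth enforcing in code rather than only in the UI. The sketch below is one hypothetical way to do it: all names (`DraftRecord`, `save_to_emr`, `NotReviewedError`) are invented for illustration, and the `store` list stands in for whatever EMR API is actually used.

```python
from dataclasses import dataclass
from typing import Optional


class NotReviewedError(Exception):
    """Raised when an unconfirmed record is about to be saved."""


@dataclass
class DraftRecord:
    fields: dict
    confirmed_by: Optional[str] = None

    def confirm(self, doctor_id: str,
                edited_fields: Optional[dict] = None) -> None:
        # The doctor may correct the extracted fields before confirming.
        if edited_fields is not None:
            self.fields = edited_fields
        self.confirmed_by = doctor_id


def save_to_emr(record: DraftRecord, store: list) -> None:
    """Refuse to persist anything a doctor has not confirmed."""
    if record.confirmed_by is None:
        raise NotReviewedError("record must be confirmed before saving")
    store.append({"fields": record.fields,
                  "confirmed_by": record.confirmed_by})
```

With the gate in the save path, a bug in the frontend or in the pipeline cannot push unreviewed prescriptions into the chart.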
What This Means
The Jitsi Meet story shows a simple thing: for an applied AI product, transcription is a separate subsystem, not a cosmetic feature. An asynchronous scheme with Jigasi and Vosk is enough for an MVP, but production use in medicine requires precise tuning of the entire stack, good logs, control over session completion, and a normalization layer that turns a conversation into data suitable for the EMR. The stricter the domain, the more expensive the illusion that everything can be solved with a single checkbox in the interface.