Hey everyone! I hope you’re having a great day.

I recently compared all the open-source Whisper-based packages that support long-form transcription. Long-form transcription means transcribing audio files that are longer than Whisper’s input limit of 30 seconds. This can be useful if you want to chat with a YouTube video, a podcast, etc.

I compared the following packages:

- OpenAI’s official whisper package
- Hugging Face Transformers
- Hugging Face BetterTransformer (aka Insanely-fast-whisper)
- FasterWhisper
- WhisperX
- Whisper.cpp

I compared them in the following areas:

- Accuracy: using word error rate (WER) and character error rate (CER)
- Efficiency: using VRAM usage and latency

I’ve written a detailed blog post about this. If you just want the results, here they are:

For all metrics, lower is better.

If you have any comments or questions, please leave them below.
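For anyone unfamiliar with the accuracy metrics mentioned above, here is a minimal sketch of how WER and CER are usually defined: the edit distance between the reference and the transcription, divided by the length of the reference (in words for WER, in characters for CER). This is only an illustration and not necessarily how the numbers in the comparison were computed; the function names are made up for this example, and it skips text normalization (casing, punctuation), which real evaluations typically apply first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
```

Both metrics can go above 1.0 when the transcription contains many extra words or characters, which is why they are error rates rather than percentages of correct output.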