Using LLMs to Transcribe and Translate (Part 1)
Languages are hard. They're complex systems of communication that take years to understand and many more to master.
Though I study Japanese when I can (read: I'm terrible at self-study motivation, so I don't study nearly as much as I should), my knowledge and understanding of the spoken and written word could at best be described as rudimentary; more often than not, it's closer to illiterate. Despite my slow progress, I still want to enjoy and understand various Japanese shows and radio programs at a deeper-than-surface level. Official translations outside of high-budget works such as anime TV series or films are few and far between, and while fan translators are excellent and wonderful people who produce amazing, high-quality work, they are still just people: individuals or small groups who translate the content they enjoy as a hobby or by commission. They do incredible work for those of us less linguistically inclined, but sometimes the content I'm looking to understand doesn't have a dedicated, overworked, and thankless translator working into the wee hours to deliver localized translations of those works because they find it fun.
Before we go any further, allow me a moment to be candid: This is not a post about how to replace fan translators or official translators with Large Language Models (LLMs). This is also not a post complaining that XYZ program or broadcast does not have a fan translator working on it yet. The ethics surrounding the training of LLMs are poor, but they are a tool that, when used properly, can aid in both understanding and learning (in this post, in both a technical and a linguistic sense).
If (hopefully when) I'm ever smart enough to actually understand (any of) the nuances of Japanese, I'll be filling in the gaps and publishing translations of the things I find important, too.
For the time being, I've been leveraging one of my favorite language models: whisper. Released by OpenAI, whisper is a multilingual speech recognition model capable of both transcribing speech and translating it into English text. We're interested in both capabilities here: taking the spoken word from a radio show, YouTube live stream, or TV show and converting it to text that we (or I) can understand. The model is small enough and fast enough to run on commodity desktop hardware. With the right machine, it's even capable of real-time transcription and translation of voice to text.
If that sounds either a) cool or b) useful to you, read on. I'll show you how I'm currently leveraging whisper (and other LLMs) to generate reasonable, understandable transcriptions and translations of the spoken word.
This will be a multi-part series; this post is just an introduction to the world of whisper and its tooling.
Environment
From this point on, we're going to get technical.
Whisper is a niche, specialized LLM focused entirely on transcribing speech to text. Its specialization is great for us commonfolk who don't own a data center; its overall size is quite small (several gigabytes for the largest model). It comes in a variety of sizes (tiny, base, small, medium, and large, plus English-only .en variants) suitable for everything from single-board computers like a Raspberry Pi up to high-end desktop and server hardware. The current front-runner version of the model, large-v3, takes up about 3GB on disk, and when expanded into system or video memory consumes about 7GB during workloads. Most tools built on whisper even support CPU-only inference at reasonable speeds.
This has a few advantages:
- any modern computer with a halfway decent CPU can transcribe audio or video using the high-end (high parameter count) whisper models
- a CUDA-compatible (NVIDIA) GPU with >=8GB of VRAM can load the entire model into video memory for high-speed operation
The old laptop you have sitting in the corner can run Whisper, and that's a really cool thing.
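To make that concrete, here's a minimal sketch of what running the model looks like with OpenAI's reference openai-whisper Python package (pip install openai-whisper; it also expects ffmpeg on your PATH). The file name is a placeholder, and large-v3 can be swapped for a smaller model if you're short on memory:

```python
# Minimal sketch using the reference openai-whisper package.
# "interview.mp3" is a placeholder; use any audio file you have handy.
import whisper

# Load the model on the CPU; pass device="cuda" instead if you have a
# compatible NVIDIA GPU with enough VRAM.
model = whisper.load_model("large-v3", device="cpu")

# Transcribe Japanese audio in its original language.
result = model.transcribe("interview.mp3", language="ja")
print(result["text"])
```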
Dependency Hell
One of the most challenging aspects of using whisper (and the tools developed around it) is managing the packages and dependencies required by each tool. To put it mildly, it's dependency hell.
Many machine learning (ML) and LLM tools are built with Python due to its nearly endless suite of third-party libraries and extensions. However, Python is an "interpreted" language rather than a "compiled" one, so instead of the developer shipping you a built and packaged binary (like an .exe on Windows), your machine needs a compatible version of Python, plus every dependency the program's author uses, installed every time you run the tool.
I've dealt with this in two ways: virtual environments and containers (docker). Which you choose is ultimately up to you and your comfort level, but my recommendation is to use containers. You'll save yourself a lot of headache in the long run.
Containers (Podman / Docker)
If you're unfamiliar with containers here's the elevator pitch on why they're useful:
What is a container? Simply put, containers are isolated processes for each of your app's components. Each component - the frontend React app, the Python API engine, and the database - runs in its own isolated environment, completely isolated from everything else on your machine.
They're small, predefined, prepackaged environments that are ready to run the tool installed in them. No managing dependent packages or 10 versions of Python. You run the container, and it runs the software exactly as the developer packaged it. Think of it like a mini operating system that you download from the internet, and run inside your main OS (Windows, macOS, or Linux). Multiple containers can run side by side, each with their own versions of software dependencies that will never conflict with each other.
They solve the "works on my machine" problem for those who don't want to get their hands (and operating systems) dirty.
To enter the magical world of containers, install either podman or docker onto your computer. I recommend podman, which for reasons outside the scope of this post is slowly gaining ground on docker (the "established" corporate player) in the desktop container runtime space, thanks to its excellent feature set and backwards compatibility with the docker ecosystem.
For macOS and Windows, I recommend using podman desktop because it offers a good, UI-driven view of your podman install & running containers. It also offers step-by-step instructions for installing the podman runtime. As a note, on both Windows and Mac, podman (and docker) use a virtual machine to run container processes, and you'll have to configure that as part of the first-time setup. Once it's all configured, you don't have to touch it again.
If you need them, the linux install instructions are here, along with CLI-only installation instructions for both Windows and Mac. Linux users can also make use of podman desktop, if that's your thing.
Python Virtual Environments
If containers aren't up your alley but you smartly don't want to install every required package under the sun into your global system state (otherwise known as dependency hell), python virtual environments will be your friend. I've had success with both virtualenv and anaconda in the past: the former for managing project dependencies from pip (python's package manager), and the latter for managing versioned installations of python itself. Both have trade-offs (and they're far from the only options), but for beginners I'd recommend starting with anaconda. Tools built around AI/LLMs use many different versions of python, so having a version manager for python itself is a good idea.
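If you'd like to see what a virtual environment actually is without installing anything extra, the standard library's venv module (the built-in cousin of virtualenv) does the same basic job; here's a rough sketch, assuming Linux/macOS paths (on Windows the environment's interpreter lives under Scripts\ instead of bin/):

```python
# Rough sketch: create an isolated environment and install whisper into it
# using only the Python standard library. Paths assume Linux/macOS.
import subprocess
import venv

# Equivalent to running `python -m venv whisper-env` in a terminal.
venv.create("whisper-env", with_pip=True)

# Install packages with the environment's own pip so nothing touches the
# global system state.
subprocess.run(
    ["whisper-env/bin/python", "-m", "pip", "install", "openai-whisper"],
    check=True,
)
```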
If you're a developer who regularly works with python, feel free to ignore my advice and advocate your position!
For Windows anaconda users who find themselves needing access to their environments over SSH, execute the following after connecting to your Windows machine:
powershell.exe -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\<youruser>\anaconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\<youruser>\anaconda3' "
This is the same command the "Anaconda Shell" start menu entry runs to launch anaconda.
Some resources to help get started with Anaconda:
GPU Acceleration
Unless you live under a rock, you're no doubt aware that running AI inference workloads (like whisper) on a GPU speeds up the process dramatically. Whisper is a small model that runs fine on a CPU, but if you have a GPU with >=8GB of VRAM, you might as well try to accelerate it.
Each GPU vendor (Nvidia, AMD, Intel) has its own acceleration API (CUDA, ROCm, and oneAPI/OpenVINO, respectively). Each API has its own setup steps for each operating system, so refer to the respective documentation for installation instructions. Make sure you have up-to-date GPU drivers for your card as well. There is also the Vulkan graphics API, which supports GPUs from multiple vendors and can be used as a fallback.
Frankly, the easiest path to GPU acceleration is to use containers. All of the libraries (like CUDA) are installed within the container instead of globally on the operating system, making multiple library versions a non-issue. I'll walk through my setup in a subsequent post, for both Nvidia CUDA and Vulkan.
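The reference openai-whisper implementation runs on top of PyTorch, so a quick sanity check that your GPU (and its drivers) are actually visible looks something like the sketch below; it only covers CUDA, since that's the path PyTorch exposes most directly:

```python
# Quick sanity check that PyTorch can see your GPU. Run it inside the same
# container or virtual environment you'll run whisper from.
import torch

if torch.cuda.is_available():
    print("CUDA GPU detected:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; whisper will fall back to the CPU.")
```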
Transcription
Whisper itself is just a model, so to actually use it we'll need some tools. Fortunately, other developers have built a plethora of them for us to choose from. I'll list a few of my favorites here:
- whisper.cpp - a high-speed C++ command-line whisper execution tool that works with practically every GPU/machine acceleration API currently in use
- stream-translator - a tool for generating real-time subtitles from a live stream (like YouTube or an .m3u8 manifest)
- whisper-webui - a self-hostable web interface for running whisper that takes file uploads from the browser (or YouTube URLs) to generate transcripts/translations. Very new-user friendly.
There are many other tools to transcribe audio to text (and even translate it to English!) using whisper, but these are the ones I've used in the past.
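If you're curious what these tools are doing under the hood, here's a rough sketch of turning openai-whisper's output into an SRT subtitle file by hand; the tools above handle this (and much more) for you, and the file names are placeholders:

```python
# Rough sketch: transcribe a file with openai-whisper and write the resulting
# segments out as an SRT subtitle file. File names are placeholders.
import whisper

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = whisper.load_model("medium")
result = model.transcribe("radio_show.mp3", language="ja")

with open("radio_show.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n")
        srt.write(f"{segment['text'].strip()}\n\n")
```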
Translation
While whisper can translate text to English in the same breath as transcribing in the audio's native language, the translations aren't high quality. They're serviceable, but you'll get better results shoving the transcribed text/subtitle file through another general purpose LLM (with internet access preferred) like Claude Sonnet, Gemini or ChatGPT. (Anecdotally, ChatGPT 5.x's translation quality is far below GPT4, so YMMV. I've had good luck with Claude via Kagi.)
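As an illustration of that second pass, here's a hedged sketch using Anthropic's official Python SDK (pip install anthropic, with an ANTHROPIC_API_KEY in your environment); the model name, prompt, and file name are just examples, and the same idea works with any other provider's SDK, or by pasting the subtitle file into a chat UI by hand:

```python
# Rough sketch: hand a whisper-generated subtitle file to a general-purpose
# LLM for a cleaner translation. The model name and file name are examples;
# very long transcripts may need to be split into chunks first.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("radio_show.srt", "r", encoding="utf-8") as f:
    transcript = f.read()

message = client.messages.create(
    model="claude-sonnet-4-5",  # example model name; check Anthropic's docs for current IDs
    max_tokens=8000,
    system=(
        "Translate the following Japanese subtitles into natural English. "
        "Keep the SRT numbering and timestamps unchanged."
    ),
    messages=[{"role": "user", "content": transcript}],
)

print(message.content[0].text)
```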