Converting "hardcoded" subtitles
Last updated 13 Oct 2023
Last year I was looking for something to watch and noticed a French film on Mubi called In Bed with Victoria (known just as Victoria in France). I seem to remember it having some middling reviews on Letterboxd at the time but I decided to give it a punt anyway. After I'd watched and enjoyed the film, I looked to buy a copy since Mubi used to have a system where films were only available for thirty days. Looking at Blu-ray.com, I was disappointed to see that none of the Blu-rays available had English subtitles included. "No problem," I thought, "I can rip the Blu-ray and graft on some subtitles from somewhere else," so I ordered a French copy of the film. I don't think there were any DVD copies of the film which included English subtitles but there are lots of websites offering subtitles in SubRip format (SRT) and I found various versions for In Bed with Victoria.
I've not had much success with Blu-rays on Linux so all the tools I'm using are on Windows (specifically Windows 10 for this). MakeMKV is a great piece of software for ripping discs to a single Matroska (MKV) file, selecting which audio, video and subtitle streams you want (I have the paid version, which is a one-off purchase, but I think you can still rip Blu-rays even with the free version). You can then put the resulting file into something like HandBrake to convert it and create a smaller file. HandBrake doesn't support SRT as a format to bake in a selectable subtitle stream for the outputted MKV file but I normally watch films using VLC or Kodi/OSMC, which both support SRT files (and if the SRT file has the same filename as the video file apart from the file extension, it will be applied automatically).
When I started applying the downloaded subtitles to the Blu-ray rip, I quickly noticed basic issues with the formatting (eg when two characters are speaking at the same time). I tried downloading some different ones and started to realise there was a more fundamental problem: the translation was totally different to the one I'd seen on Mubi. Spending some more time checking these subtitles, it became clear that the translations weren't just different, they were vastly inferior. I don't know if it's because there's no disc of this film available to buy which includes English subtitles and these were just the results of an automated translation, or whether they were done by a less-skilled subtitler. One of the reasons I wanted to buy the film was so I could introduce it to other people I thought would enjoy it. I could tell these new subtitles would be seriously detrimental to my enjoyment of the film (and to that of anyone else I wanted to introduce the film to).
At this point, the film was still available to view on Mubi, so I watched it again and skipped to the end credits for details of the subtitles. They were credited to TETRAFILM and were written by Sionann O'Neill. I tried contacting TETRAFILM to ask for a copy of the subtitles but received no response; I looked for Sionann but could only find an inactive Twitter account (though I can see she has subsequently had some minor activity on it). Interestingly, I did come across an article about her from 2011 on a San Francisco news site called SFGATE, promoting the release of François Ozon's latest film at the time, Potiche, which she had done the English subtitles for. I used to do captioning for live theatre shows so am more interested in the art of subtitles and captions than most people, but it was an extremely insightful article and it drives home the significance and importance of her work (including the concept of translation vs adaptation, which I never had to deal with):
O'Neill says in her work, she has two mottos. The first: Less is more. "English is often more succinct than French," she says, "So I have to find a way to synthesize the French while being true to the character. I'm always trying to streamline it to have fewer words, because it's terrible to have the eyes down at the subtitles all the time. I want people to forget they're reading subtitles."
Her second motto: Sometimes you have to go further away to get closer. "That's the essence of adaptation," she says, using the French word. "You're adapting it. The literal translation won't cut it. You have to express how that character would put it if they were an American saying it."
Realising that I'd hit a dead end with acquiring soft subtitles of the correct translation, I decided to figure out a way to extract the subtitles from the Mubi stream of the film. I put "hardcoded" in the title of this article in quotes because in the Mubi stream the subtitles are selectable, but I ended up having to deal with them as part of the video stream. Obviously, as part of a paid-for subscription service, the video streams on Mubi use Digital Rights Management (DRM) to protect them. I'd hoped the subtitles might be delivered separately from this but inspecting the Network tab in Firefox while the video played with subtitles showed there was no way to grab them.
Using a popular streaming tool called Open Broadcaster Software (OBS), I was able to capture a screen recording of the film while it played on my computer. I'm not sure how Mubi determines its output quality but when playing it in a browser on a screen at a resolution of 1920x1080 I have never found it to be great, especially not for retaining a copy of the film. The important thing was to capture the subtitles in the video and luckily the film has an aspect ratio of 2.40:1, which meant that playing it back on a 16:9 screen left the subtitle text entirely on a black background. This allowed me to use a tool called VideoSubFinder without changing any of the presets (except for selecting the bottom portion of the video) to output each subtitle as an image with the timestamp in the filename.
Once you have these image files, you need to use Optical Character Recognition (OCR) to convert them into text. Another popular tool called SubtitleEdit is meant to have the functionality to do this using various third-party systems such as Tesseract but I found it to be so slow and unreliable that it was completely unusable. I found a tool on GitHub by a user called Abu3safeer, which is a Python script that lets you use the Google Docs OCR. You need to follow the Google Developers Python quickstart guide to generate a credentials.json file, then when you run the Python script you will be prompted for your Google account credentials before it goes and OCRs all your files. It ran much more quickly than the SubtitleEdit process and produced 1329 text files with no errors.
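The timestamps in the filenames are what make it possible to turn a folder of OCR'd text files back into timed subtitles. SubtitleEdit did this step for me, so the following is only a minimal Python sketch of the idea and not what SubtitleEdit does internally. The filename pattern is an assumption about how VideoSubFinder names its output, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# ASSUMPTION: names look like "0_00_48_715__0_00_50_883_<suffix>.txt",
# i.e. start and end times as H_MM_SS_mmm separated by a double underscore.
NAME = re.compile(
    r"(\d+)_(\d{2})_(\d{2})_(\d{3})__(\d+)_(\d{2})_(\d{2})_(\d{3})"
)


def srt_time(h, m, s, ms):
    """Format the four captured fields as an SRT timestamp (HH:MM:SS,mmm)."""
    return f"{int(h):02d}:{m}:{s},{ms}"


def build_srt(entries):
    """Build SRT text from (filename, text) pairs, ordered by start time."""
    cues = []
    for name, text in entries:
        match = NAME.match(name)
        if not match or not text.strip():
            continue  # skip names that do not fit the assumed pattern, or empty text
        fields = match.groups()
        cues.append((srt_time(*fields[:4]), srt_time(*fields[4:]), text.strip()))
    cues.sort()  # zero-padded timestamps sort correctly as strings
    blocks = [
        f"{number}\n{start} --> {end}\n{text}\n"
        for number, (start, end, text) in enumerate(cues, 1)
    ]
    return "\n".join(blocks)


def build_srt_from_folder(folder):
    """Read every .txt file in a (hypothetical) folder and build SRT text."""
    files = sorted(Path(folder).glob("*.txt"))
    return build_srt((f.name, f.read_text(encoding="utf-8")) for f in files)
```

Calling build_srt_from_folder on the OCR output folder would return the text of an SRT file, which could then be written to disk and opened in SubtitleEdit for the clean-up work.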
Going back to SubtitleEdit, you can import a batch of text files; the ones output by the Python tool retain the timestamps in the filenames so SubtitleEdit knows where to place them. The idea now is to export this as an SRT file but unfortunately we still have some hoops to jump through. Even before we start thoroughly inspecting the quality of the OCR, it's clear that there's a timing issue. Setting the first subtitle and offsetting the rest from there, I can see that by the end of the film the subtitles and video are about 10 seconds out of sync. I checked the framerate of the OBS recording using Tools > Codec Information in VLC and could see it was 30fps whereas the Blu-ray rip was 23.976fps. Using a tool called Subtitle framerate changer, I thought I could simply convert my SRT file from 30 to 23.976fps but the resultant file was way out. I started to think about how the framerate shouldn't matter because the SRT uses timestamps, not frame information, but obviously there was a mismatch. Even though my OBS recording was set to 30fps, whatever framerate Mubi had broadcast the film at would affect the length of the film. Sure enough, I used the tool to convert the SRT from 24 to 23.976fps and the resulting file matched perfectly.
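Converting from 24 to 23.976fps comes down to stretching every timestamp by 24 / 23.976, which is roughly 1.001. I don't know how Subtitle framerate changer is implemented, so this is just a minimal Python sketch of that arithmetic, with function names of my own choosing.

```python
import re

# Matches an SRT timestamp such as 01:02:03,456
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(h, m, s, ms):
    """Convert the four timestamp fields to whole milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def from_ms(total):
    """Format whole milliseconds back into HH:MM:SS,mmm."""
    s, ms = divmod(total, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def rescale_srt(text, source_fps=24.0, target_fps=23.976):
    """Stretch every timestamp in SRT text by source_fps / target_fps."""
    factor = source_fps / target_fps

    def scale(match):
        return from_ms(round(to_ms(*match.groups()) * factor))

    return TIMESTAMP.sub(scale, text)
```

With the default settings a subtitle at exactly one hour moves to 01:00:03,604, so the drift between the two versions grows by about 3.6 seconds for every hour of film. Subtitle numbers and dialogue are left alone because only text shaped like a timestamp is touched.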
With the hard bit out of the way, now I had to actually work on the content of the subtitles. Even though the OCR via Google had been much more successful than the providers in SubtitleEdit, I could see there was still a lot of work needing done. I don't know if it was a problem with how I configured VideoSubFinder or the Python script but they hadn't handled line breaks at all. Sometimes it was easy to see where these should be because the last word of the first line and the first word of the second line would be concatenated, but plenty of them were not so easy to spot. Lines which include two characters speaking at the same time should start with a hyphen and a space, but many of the hyphens had been lost or converted to other characters such as an em dash or interpunct. More critically, many of these instances omitted the second line altogether and there were other short, individual lines which had been missed. I'm not sure whether these were missed by the extraction of the images or the OCR process and I don't have an easy way to tell now.
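I fixed these by hand, but the converted hyphens are the one part of this that could plausibly be automated. As a minimal Python sketch, assuming the stray characters only ever appear at the start of a line, the set of dash-like characters below is illustrative and not a record of what the OCR actually produced.

```python
import re

# Dash-like characters at the start of a line: en dash, em dash,
# interpunct, bullet, or a plain hyphen. ASSUMPTION: illustrative set only.
LEADING_DASH = re.compile(
    r"^[ \t]*[\u2013\u2014\u00b7\u2022-][ \t]*", re.MULTILINE
)


def normalise_dialogue_dashes(cue_text):
    """Rewrite any dash-like character at the start of a line as '- '."""
    return LEADING_DASH.sub("- ", cue_text)
```

This only repairs hyphens that were converted to something else. Hyphens that were lost entirely, missing second lines and missing line breaks still need a person watching the film to put them back.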
After going through all the lines in SubtitleEdit and adding line breaks where I thought necessary, I lined up my OBS copy of the film in VLC with the window set to View > Always on top. This means I can sit it just above the video player in SubtitleEdit, view the subtitles being played at the same time and make changes to the subtitles without losing visibility of VLC. I changed the skip time feature in SubtitleEdit from 0.5 to 3 seconds under the Adjust tab in the bottom left, which matches the amount of time VLC jumps when using Shift + Left/Right. Then I played the two videos together looking for any changes I needed to make, like adding line breaks or lines that were completely missing. Obviously because of the different framerates the videos will go out of sync, but you can briefly pause the faster one periodically, and you might need to stop them altogether if you are doing larger edits.
Once the editing is finished, you can save your work as an SRT file and place it in the same folder as your video file. If you want, you can use a tool like MKVToolNix to add the subtitles as a selectable track in your MKV file.
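MKVToolNix includes a command-line program called mkvmerge, so the muxing step can be scripted as well as done through the GUI. This is a minimal Python sketch that builds and runs the command; the filenames and function names are placeholders of my own, and it assumes mkvmerge is installed and on your PATH.

```python
import subprocess


def mkvmerge_command(video, subtitles, output, language="eng"):
    """Build an mkvmerge argument list that adds an SRT as a subtitle track."""
    return [
        "mkvmerge",
        "-o", output,
        video,
        # --language applies to track 0 of the next input file (the SRT)
        "--language", f"0:{language}",
        subtitles,
    ]


def mux(video, subtitles, output):
    """Run mkvmerge; raises CalledProcessError if it reports a failure."""
    subprocess.run(mkvmerge_command(video, subtitles, output), check=True)
```

For example, mux("film.mkv", "film.srt", "film-subbed.mkv") would write a new file and leave the original rip untouched.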
If you want a copy of the subtitles I have uploaded a copy here: