http://www.bunkus.org/dvdripping4linux/en/separate/subtitles.html
last modified on August 23, 2002
On a DVD subtitles are stored as pictures that are shown on top of the movie by your movie player. That way the authors have a wide choice of how their subtitles look (and it makes subtitles in Asian languages much easier to implement). For us this may or may not be a problem - depending on whether we want to include subtitles directly in the picture or have them as a separate file/stream.
Very often you don't want to be forced to see the subtitles, but that is unavoidable if you include the subtitles in the picture during encoding. Instead you have to extract the subtitles from the DVD into an external file/stream that the user can activate (or not). I will describe the process of converting the DVD subtitles into a text format that is widely used. Text subtitles can be easily scaled by the player (by selecting an appropriate font) and they are really small (most often below 100KB).
For this process you must have transcode and its sources. You need tccat and tcextract from transcode itself and the files in transcode/contrib/subrip from the transcode sources.
Unfortunately no binary package (RPM, deb) that I know of includes subrip so we have to compile and install it ourselves. But this is rather easy.
Here I assume that you've copied your DVD with vobcopy -m, meaning that it has been completely mirrored including the .IFO files. If not, you'll have to adjust the input sources in the commands below.
First let's see which subtitles are available. We can use mplayer for this task:
mplayer -dvd-device /space/st-tng/disc1/ -dvd 1 -vo null -ao null -frames 0 -v 2>&1 | grep sid
This causes mplayer to just print a lot of information about the source and not to play anything at all. It should give you a list of subtitles:
[open] subtitle ( sid ): 0 language: da
[open] subtitle ( sid ): 1 language: de
[open] subtitle ( sid ): 2 language: en
[open] subtitle ( sid ): 3 language: es
[open] subtitle ( sid ): 4 language: fr
[open] subtitle ( sid ): 5 language: it
[open] subtitle ( sid ): 6 language: nl
[open] subtitle ( sid ): 7 language: no
[open] subtitle ( sid ): 8 language: sv
[open] subtitle ( sid ): 9 language: en
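If you don't want to read the list by eye, a small helper can pick the sid for a language. This is only a sketch: the function name sid_for_lang is made up for illustration, and it assumes that each entry sits on its own line in exactly the format shown above.

```sh
# Hypothetical helper: print the first sid whose language matches $1.
# Assumes input lines look like: [open] subtitle ( sid ): 2 language: en
sid_for_lang() {
    awk -v lang="$1" '
        # The language code is the last field; the sid is two fields before it.
        $2 == "subtitle" && $NF == lang { print $(NF - 2); exit }
    '
}
```

You would feed it the mplayer output from above, e.g. by replacing the final grep sid with sid_for_lang en. Note that a language can appear more than once (English has sid 2 and sid 9 in my list); the helper only returns the first match, so check by hand which of the streams you actually want.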
Now that we have the sid (subtitle ID) for the language that we want we can fire up the transcode tools and let them extract the raw subtitle stream:
tccat -i /space/st-tng/disc1/ -T 1 -L | tcextract -x ps1 -t vob -a 0x22 > subs-en
The -a 0x22 is the subtitle stream's hexadecimal number: 0x20 + sid. Here I use the English subtitles (sid 2), so the number is 0x20 + 2 = 0x22.
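The value for -a can also be computed by the shell instead of by hand. A minimal sketch, where stream_id is a made-up helper name and not part of transcode:

```sh
# Hypothetical helper: turn a sid into the hexadecimal stream number (0x20 + sid).
stream_id() {
    printf '0x%x\n' $((0x20 + $1))
}

stream_id 2    # English subtitles in my example, sid 2
```

For sid 2 this prints 0x22, which is the value to pass to tcextract -a. You could also use it inline, e.g. -a "$(stream_id 2)".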
Ok, we have a raw subtitle stream - but what can we do with it? First we have to convert each subtitle entry into a picture. This can be easily done with
subtitle2pgm -o english -c 255,255,0,255 < subs-en
Here's a catch however. With -c you can specify the grey levels used in the conversion. The idea is to make the job for gocr as easy as possible. Therefore you might have to experiment with the parameters - but this is easy, too. I've taken the following samples from my Star Trek - The Next Generation DVD:
As you can see you need a picture that does not contain outlined characters.
subtitle2pgm creates a lot of images - one for each subtitle - and a control file, called english.srtx in my case, that contains the duration for each subtitle. The next step is to let gocr recognize the text:
pgm2txt english
Be warned - gocr will often ask you about characters that it can't recognize. This is normal. Once you're done you should run ispell over all the newly created text files:
ispell -d american english*txt
Adjust the language to your needs, of course.
The last step is to let srttool include the actual text into the .srtx file:
srttool -s -w < english.srtx > english.srt
Voila, you have a working subtitle file. You can use it with e.g.
mplayer -sub english.srt mymovie.avi