New subtitle ripper 

Newsgroups:  gmane.comp.video.transcode.user
Date:        Sun, 20 Aug 2006 22:27:24 +0200

Recently, I discovered that the subtitles on DVDs are stored as images. I also found out that those image subtitles cannot be attached to ogm or mkv files. So I read a few tutorials and ended up installing "media-video/subtitleripper". This program is designed to convert the image subtitles to text subtitles (srt format). It uses a program for (handwritten) text recognition and requires a lot of corrections.

Now to the interesting part. I decided to write my own application to do the job, and it worked out very well, better than I expected. So I decided to release it. You can download it from SourceForge: http://sourceforge.net/projects/sub2text/ This is, however, the first time I have released or even worked on a public project.

Please have a look at what I did and try it for yourselves. I need people to test it.

Christian.Wasserthal

New subtitle ripper 

> Could you shed more light onto it please? I.e.:
>
> - What kind of OCR mechanism are you using?
> - I haven't tried it, but I see that you're using the Java Swing
>   interface. Does it require much user interaction? Any way to
>   automate the process?
> - Does your sub2text depend on some dictionary to do the auto
>   correction?
> - Any error report for misspelled words, etc.?
> - Would it work for other languages than English?

Thank you for replying to my mail :).

  1. The mechanism I am using to recognize characters is the following: I look for connected areas (4-neighborhood) in the 2-color image. One or more areas form a 'shape', which is stored in a small database and associated with a Unicode letter. A 'shape' is represented by a list of pixel positions relative to the seed point (the leftmost of the topmost pixels in the shape).
  2. When the database is empty, every letter found has to be completed by hand (that is, all areas that belong to it have to be selected) and its Unicode representation has to be entered. User interaction is also needed to resolve ambiguous characters (in a lot of DVD fonts the uppercase I and the lowercase L look exactly the same). But the interface is very fast; you don't need to touch the mouse often. The arrow keys and Return do most of the job. (I just thought of using gocr for guessing the letter…)
  3. No. There is also no 'auto'-detection (in the sense of software intuition). But aspell can be used to resolve the ambiguities.
  4. No. Only words containing ambiguous characters are checked in this way.
  5. It _should_ be ready for any Unicode text, so languages other than English should work.
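
To make point 1 a bit more concrete, here is a minimal sketch of the idea in Java. It is not taken from the sub2text sources: the class name ShapeSketch, the methods findAreas, key and parse, and the string-keyed map standing in for the database are all hypothetical. It also treats every connected area as its own shape, whereas a real shape may consist of several areas (the dot and the stem of an "i", for example).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of 4-neighborhood area detection; not the sub2text code. */
class ShapeSketch {

    /** Returns every 4-connected area as a list of {dx, dy} offsets from its seed. */
    static List<List<int[]>> findAreas(boolean[][] image) {
        int height = image.length;
        int width = height == 0 ? 0 : image[0].length;
        boolean[][] visited = new boolean[height][width];
        List<List<int[]>> areas = new ArrayList<>();
        int[][] steps = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}}; // no diagonals

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!image[y][x] || visited[y][x]) continue;
                // Scanning row by row, the first unvisited pixel of an area is
                // the leftmost of its topmost pixels, i.e. the seed point.
                List<int[]> area = new ArrayList<>();
                Deque<int[]> queue = new ArrayDeque<>();
                queue.add(new int[] {x, y});
                visited[y][x] = true;
                while (!queue.isEmpty()) {
                    int[] p = queue.poll();
                    area.add(new int[] {p[0] - x, p[1] - y}); // relative to seed
                    for (int[] s : steps) {
                        int nx = p[0] + s[0], ny = p[1] + s[1];
                        if (nx >= 0 && nx < width && ny >= 0 && ny < height
                                && image[ny][nx] && !visited[ny][nx]) {
                            visited[ny][nx] = true;
                            queue.add(new int[] {nx, ny});
                        }
                    }
                }
                areas.add(area);
            }
        }
        return areas;
    }

    /** Builds a position-independent lookup key from the sorted offsets. */
    static String key(List<int[]> area) {
        List<int[]> sorted = new ArrayList<>(area);
        sorted.sort((a, b) -> a[1] != b[1]
                ? Integer.compare(a[1], b[1]) : Integer.compare(a[0], b[0]));
        StringBuilder sb = new StringBuilder();
        for (int[] p : sorted) sb.append(p[0]).append(',').append(p[1]).append(';');
        return sb.toString();
    }

    /** Turns rows of '#' and '.' into a 2-color image (assumed input format). */
    static boolean[][] parse(String... rows) {
        boolean[][] image = new boolean[rows.length][rows[0].length()];
        for (int y = 0; y < rows.length; y++)
            for (int x = 0; x < rows[y].length(); x++)
                image[y][x] = rows[y].charAt(x) == '#';
        return image;
    }

    public static void main(String[] args) {
        boolean[][] image = parse("#..##", "#...#", "#...#");
        List<List<int[]>> areas = findAreas(image);
        Map<String, String> database = new HashMap<>();
        database.put(key(areas.get(0)), "l"); // as if the user had typed "l"
        for (List<int[]> area : areas) {
            // Unknown shapes would be handed to the user; here they print "?".
            System.out.println(area.size() + " pixels -> "
                    + database.getOrDefault(key(area), "?"));
        }
    }
}
```

Because the offsets are stored relative to the seed point, the same glyph yields the same key wherever it sits in the image, which is what makes a simple database lookup possible.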

I hope I can make a screenshot tutorial soon.

Christian.Wasserthal