Ffmpeg Extract Ass Subtitle

I became a huge subtitle user when I met my wife. We both like to watch a lot of non-English/non-Chinese movies, and while I use English subtitles, she prefers to have subtitles in Chinese most of the time since she can read it faster than English.

Currently, ASS subtitles are only supported if there is an external .ass file, but almost always the subtitle is a stream inside an MKV container (so there are three streams: video, audio, and ASS subtitle). It is currently possible to extract the subtitle stream from the original container and then load it in a second step, but it would be cool to be able to do it in one step instead.

A related command is ffmpeg -i video.mp4 -vf 'ass=mysubtitle.ass' -strict -2 out.mp4. The output shows the subtitles, but the aspect ratio changes and the subtitles are sometimes only partially visible because the scale has changed. I would like to keep the same aspect ratio (16:9) and the same width and height as the input video or, even better, reduce the video size while keeping the same aspect ratio.

Over the years of doing this I've acquired quite a lot of knowledge in this area, and built quite a few tools to help. This post is a way of introducing them to the world, and hopefully it will help anyone in a similar predicament to mine.

Things you will need:

  • ffmpeg, a set of tools to manipulate multimedia data
  • srt, a Python library and set of tools I've written for dealing with SRT files (install with pip install srt)

Conversion from other formats to SRT

The SRT format is by far my favourite subtitle format. Its spec has its oddities (not least that there is no widely accepted formal spec), but in general if you stick to the accepted commonalities of the format between media players, you'll find it's not only simple, but easy to modify and script around.
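To give a feel for how little machinery the format needs, here is a minimal Python sketch that parses the commonly accepted SRT layout (a counter line, a "start --> end" timing line, then one or more text lines, with blocks separated by blank lines) using only the standard library. The function names and the sample text are made up for illustration, and the sketch ignores many real-world quirks that the srt library handles.

```python
import re
from datetime import timedelta

# HH:MM:SS,mmm (some files use '.' instead of ',' before the milliseconds)
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def parse_timestamp(text):
    """Turn 'HH:MM:SS,mmm' into a timedelta."""
    match = TIMESTAMP.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {text!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis
    )


def parse_srt(data):
    """Return a list of (index, start, end, content) tuples."""
    subtitles = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = block.splitlines()
        index = int(lines[0])
        start_text, end_text = lines[1].split("-->")
        content = "\n".join(lines[2:])
        subtitles.append(
            (index, parse_timestamp(start_text), parse_timestamp(end_text), content)
        )
    return subtitles


# Hypothetical sample input, not taken from any real movie.
SAMPLE = """\
1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,000
Second subtitle,
on two lines.
"""

for subtitle in parse_srt(SAMPLE):
    print(subtitle)
```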

If you have another format, like SSA, for example, you'll probably find that ffmpeg does a pretty good job converting it with ffmpeg -i foo.ssa foo.srt.

Acquiring subtitles

I won't go into too much detail on this, since you probably will have good enough luck Googling '[movie] [language] subtitle', but here are some recommendations:

  • If you want to extract existing subtitles that are already in your video file (for example, to mux them with other ones), see Extracting subtitles from a video file, below. This is often the best way since these have already been checked to work with the version of the movie you have.
  • For Chinese subtitles, Shooter is pretty good and frequently updated.
  • Otherwise, Google '[movie] [language] subtitle'.

Fixing encoding problems

All of the SRT tools take UTF-8 as input, since it's a sane, reasonable encoding across the board. You may find that your subtitles are not encoded as UTF-8 and require conversion.

Let's take Chinese subtitles as an example, as they often use country-preferred encoding schemes. Chinese subtitles usually come encoded as Big5 or GB18030.

I personally find that enca is pretty good at detecting the encoding and converting it appropriately. You can call it as enca -c -x UTF8 -L <language iso code> <sub> to convert subtitles to UTF-8 based on encoding detection heuristics, regardless of their source encoding.
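If you already know the source encoding, the conversion step itself is simple. The following Python sketch is not a replacement for enca (it does no language-based detection at all); it only shows the re-encoding, with the source encoding supplied by the caller. The function name to_utf8 is hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode subtitle bytes as UTF-8.

    Assumption: the caller knows the source encoding (for example
    "big5" or "gb18030"). Bytes that are already valid UTF-8 are
    returned unchanged.
    """
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass
    return data.decode(source_encoding).encode("utf-8")


# Round-trip a short string through two common Chinese encodings.
for encoding in ("gb18030", "big5"):
    legacy_bytes = "你好".encode(encoding)
    print(encoding, to_utf8(legacy_bytes, encoding).decode("utf-8"))
```

Trying UTF-8 first is a cheap guard against double conversion, since text in these legacy encodings is very unlikely to also be valid UTF-8.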

Extracting subtitles from a video file

I'll assume you're using a Matroska file, since they're so popular nowadays, but much of this will also apply elsewhere.

Inside an MKV file are multiple streams. They contain things like the video data, the audio data, and subtitles. You can list them with ffprobe:

Looking at the three streams marked 'Subtitle', you can see that we have English, Spanish, and French subtitles available in this MKV.

Say you want to extract the Spanish subtitle to an SRT file. When converting, ffmpeg will pick the first suitable stream that it finds; by default, then, you will get the English subtitle. To avoid this, you can use -map to select the Spanish subtitle for output.

In this case we know that the Spanish subtitle is stream 0:4, so we run this command:
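If you would rather pick out the subtitle streams programmatically than read them by eye, ffprobe can emit JSON (for example with ffprobe -v error -print_format json -show_streams <file>). The Python sketch below filters that JSON for subtitle streams. The sample JSON is invented to mirror the layout described here and is heavily trimmed; real ffprobe output carries many more fields.

```python
import json


def subtitle_streams(ffprobe_json):
    """Return (stream index, language tag) pairs for subtitle streams."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [
        # "und" (undetermined) is used here when no language tag is present.
        (stream["index"], stream.get("tags", {}).get("language", "und"))
        for stream in streams
        if stream.get("codec_type") == "subtitle"
    ]


# Made-up, trimmed sample for illustration only.
SAMPLE_FFPROBE_JSON = """
{"streams": [
  {"index": 0, "codec_type": "video"},
  {"index": 1, "codec_type": "audio", "tags": {"language": "eng"}},
  {"index": 2, "codec_type": "audio", "tags": {"language": "spa"}},
  {"index": 3, "codec_type": "subtitle", "tags": {"language": "eng"}},
  {"index": 4, "codec_type": "subtitle", "tags": {"language": "spa"}},
  {"index": 5, "codec_type": "subtitle", "tags": {"language": "fre"}}
]}
"""

for index, language in subtitle_streams(SAMPLE_FFPROBE_JSON):
    print(f"0:{index} {language}")
```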

We can see that the right subtitle has been selected:

We will use this subtitle for most of the subsequent examples.
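As a rough sketch of the stream selection described above, this Python snippet assembles an ffmpeg command line that maps one specific stream to an output file. It only builds and prints the argument list; actually running it is left commented out. The file names movie.mkv and spanish.srt are placeholders, not the files used in this post.

```python
import shlex


def extract_subtitle_command(source, stream_index, destination):
    """Build an ffmpeg invocation that extracts one stream of input 0."""
    return ["ffmpeg", "-i", source, "-map", f"0:{stream_index}", destination]


# Placeholder file names; stream 4 matches the Spanish stream (0:4) above.
command = extract_subtitle_command("movie.mkv", 4, "spanish.srt")
print(shlex.join(command))

# To actually run it (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(command, check=True)
```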

Stripping HTML-like entities from subtitles

As you can see in the subtitle above, sometimes subtitles contain HTML-like tags, like <b>, <color>, etc. These are not part of the SRT spec; it is up to the media player to interpret them. Since not all media players support them, they are sometimes just shown raw, which looks quite bad.

The srt project contains a tool to deal with this called process, which can perform arbitrary operations on files:
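The kind of operation involved can be sketched in plain Python as a regular expression substitution applied to each subtitle's text. This is an illustration of the idea rather than what the process tool does internally, and the pattern is a deliberately simple assumption: it removes anything that looks like an opening or closing tag starting with a letter.

```python
import re

# Matches things like <i>, </b>, or <font color="#ffffff">.
# Requiring a letter after '<' avoids eating text such as "1 < 2 > 0".
TAG = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags(text):
    """Remove HTML-like tags from subtitle text, keeping their contents."""
    return TAG.sub("", text)


print(strip_tags('<i>Hola</i>, <font color="#ffffff">mundo</font>'))
```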

Correcting time shifts

Getting subtitles from the internet is an imperfect business. There are often a few different packagings of a movie in different markets, some with different intros, some from different original sources, etc. This can result in the subtitles requiring some correction prior to use.

Your media player may contain some rudimentary controls to correct this at runtime, which may suffice for fixed timeshifts, but for linear timeshifts and cases where you need to sync two subtitles exactly prior to muxing, modifying the SRT file directly is a good idea.

The srt project contains two tools to deal with this:

  • fixed-timeshift, which shifts all subtitles by a fixed amount. For example, you may want to shift all subtitles back a certain number of seconds to sync properly with your video.

  • linear-timeshift, which takes two existing time points in the input, and scales all subtitles so that those time points are shifted to the correct values. For example, if you had three subtitles with times 1, 2, and 3, you set the existing times as 1 and 3, and you set the new times as 1 and 5, the new times for those subtitles would be 1, 3, and 5.

    On the command line, 'f' means 'from', 't' means 'to', and the numbers are just the unique ID for each pair of times.
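The arithmetic behind both shifts is small enough to sketch in a few lines of Python. This is a sketch of the underlying maths, not the tools' actual code, and the function names are hypothetical. The linear case fits a straight line through the two (from, to) pairs and applies it to every timestamp, which reproduces the 1, 2, 3 to 1, 3, 5 example above.

```python
from datetime import timedelta


def fixed_shift(times, offset):
    """Shift every timestamp by the same offset (may be negative)."""
    return [t + offset for t in times]


def linear_shift(times, from1, to1, from2, to2):
    """Rescale timestamps so that from1 maps to to1 and from2 maps to to2."""
    # Dividing two timedeltas gives a plain float scale factor.
    scale = (to2 - to1) / (from2 - from1)
    return [to1 + (t - from1) * scale for t in times]


def seconds(value):
    return timedelta(seconds=value)


times = [seconds(1), seconds(2), seconds(3)]

# Existing times 1 and 3 should become 1 and 5.
shifted = linear_shift(times, seconds(1), seconds(1), seconds(3), seconds(5))
print([t.total_seconds() for t in shifted])  # [1.0, 3.0, 5.0]

# Move everything one second earlier.
print([t.total_seconds() for t in fixed_shift(times, seconds(-1))])
```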

Muxing subtitles together

The srt project contains a tool, mux, that takes multiple streams of SRTs and muxes them into one. It also attempts to clamp multiple subtitles to use the same start/end times if they are similar (by default, if they are within 600ms of each other), in order to stop subtitles jumping around the screen when displayed.

Say we wanted to create Spanish/French dual language subs for this movie (having already retrieved a suitable French subtitle in ES-french.srt).
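The clamping idea can be illustrated with a simplified Python sketch. This is an assumption about how such clamping might work, not the actual algorithm used by mux: subtitles from all inputs are sorted by start time, and each one is snapped to the previous subtitle's start (and, if that happened, its end) when the difference is within the tolerance.

```python
from datetime import timedelta


def clamp_times(subs, tolerance=timedelta(milliseconds=600)):
    """Snap near-simultaneous subtitles to shared start/end times.

    subs is a list of (start, end, text) tuples from all inputs combined.
    Simplified sketch: each subtitle is only compared with the one before it.
    """
    result = []
    for start, end, text in sorted(subs, key=lambda sub: sub[0]):
        if result:
            prev_start, prev_end, _ = result[-1]
            if start - prev_start <= tolerance:
                start = prev_start
                # Only touch the end when the start was also snapped,
                # so a subtitle can never end before it begins.
                if abs(end - prev_end) <= tolerance:
                    end = prev_end
        result.append((start, end, text))
    return result


def ms(value):
    return timedelta(milliseconds=value)


# Hypothetical timings: the French line starts 200ms after the Spanish one.
combined = [
    (ms(1000), ms(3000), "spanish line"),
    (ms(1200), ms(3400), "french line"),
    (ms(10000), ms(12000), "later spanish line"),
]
for start, end, text in clamp_times(combined):
    print(start, end, text)
```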

In that case, we'd run something like this:

Removing other languages from dual-language subtitles

This is easier for some languages than others. For example, it's easy to detect and isolate lines containing CJK characters from lines containing (say) English, since their range of characters tends not to intersect.

It's more difficult (and more error prone) to try to detect languages using more advanced heuristics, but there are a few ways that you can do it using srt.

srt has a program called lines-matching, to which you can pass an arbitrary Python function that returns True if the line is to be kept, and False otherwise. This means you can easily build your own heuristics for language based detection, or anything else you want to isolate.

As an example, this is how you would isolate to Chinese lines using hanzidentifier (must be installed):

You can pass -m multiple times for multiple imports. -f is a function that takes one argument, line. In this case, hanzidentifier.has_chinese already takes one argument, so we don't need to do anything complicated.

As a more general solution, there is also langdetect, but since this is heuristic, you may find it gets it wrong some of the time. For example (langdetect must be installed):

Notice that we have to use double quotes instead of single quotes inside the syntax block, since we're already quoting the expression itself with single quotes.
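If you don't want to install anything extra, a one-argument function of the same shape is easy to write yourself. The Python sketch below is a crude stand-in for hanzidentifier.has_chinese, not its implementation: it only checks whether a line contains a character in the main CJK Unified Ideographs block or Extension A. The helper keep_matching_lines is hypothetical and just shows how such a predicate is applied line by line.

```python
def has_cjk(line):
    """Return True if the line contains at least one CJK ideograph.

    Assumption: checking U+4E00..U+9FFF and U+3400..U+4DBF is good enough;
    rarer extension blocks are ignored.
    """
    return any(
        "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"
        for ch in line
    )


def keep_matching_lines(content, keep):
    """Keep only the lines of a subtitle for which keep(line) is True."""
    return "\n".join(line for line in content.splitlines() if keep(line))


# Made-up dual-language subtitle content.
dual = "Where are you going?\n你要去哪里？"
print(keep_matching_lines(dual, has_cjk))
```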

Using the muxed Spanish and French output we generated earlier as input, this outputs the following:

Notice that one line, <i>sur la voie 'B.'</i>, is completely gone. Language detection is not an absolute science, and sometimes langdetect gets it completely wrong, particularly on short sentences without much context and with language-ambiguous words. In this case, langdetect is very unsure what the language is because the content is quite short. Notice that its certainties vary wildly between runs, sometimes even completely omitting French:

One thing you can do if you want to match per-subtitle rather than per-line (which only makes sense if your different languages are actually in different SRT blocks) is use -s/--per-subtitle, which may help to give better context to langdetect. This fixes the problem above:
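The difference between the two modes can be shown with a toy Python sketch. Everything here is hypothetical: looks_french is a deliberately naive word-list heuristic standing in for a real detector such as langdetect, and filter_subtitle only mimics the per-line versus per-subtitle distinction, not the tool's implementation. The point is that a short line with little context can fail on its own yet pass when judged together with the rest of its block.

```python
# A tiny, made-up hint list; a real detector uses far richer statistics.
FRENCH_HINTS = {"le", "la", "les", "est", "sur", "un", "une", "et"}


def looks_french(text):
    """Naive stand-in detector: at least two French hint words present."""
    words = [word.strip(".,!?'\"").lower() for word in text.split()]
    return sum(word in FRENCH_HINTS for word in words) >= 2


def filter_subtitle(content, keep, per_subtitle=False):
    """Apply keep() to the whole block, or to each line separately."""
    if per_subtitle:
        return content if keep(content) else ""
    return "\n".join(line for line in content.splitlines() if keep(line))


# Made-up two-line French subtitle; the second line is short and ambiguous.
block = "Le train est à quai\nvoie B."

print(repr(filter_subtitle(block, looks_french)))
print(repr(filter_subtitle(block, looks_french, per_subtitle=True)))
```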