--- title: "Parsing subtitles in srt format" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Parsing subtitles in srt format} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = FALSE, comment = "#>" ) ``` ```{r setup} library(flexo) ``` ## Parsing subtitles in srt format The [srt](https://www.speechpad.com/captions/srt) subtitle format is a simple representation of subtitles for video which consists of timestamped lines of text. # SRT format example The first 10 lines of dialogue from "It's a Wonderful Life" in `srt` format ```{r} srt <- " 1 00:01:25,210 --> 00:01:28,004 I owe everything to George Bailey. 2 00:01:28,422 --> 00:01:30,298 Help him, dear Father. 3 00:01:30,674 --> 00:01:33,718 Joseph, Jesus and Mary, 4 00:01:33,802 --> 00:01:36,429 help my friend Mr. Bailey. 5 00:01:36,889 --> 00:01:39,515 Help my son George tonight. 6 00:01:40,350 --> 00:01:42,226 He never thinks about himself, God. 7 00:01:42,311 --> 00:01:44,061 That's why he's in trouble. 8 00:01:44,146 --> 00:01:45,313 George is a good guy. 9 00:01:46,482 --> 00:01:47,732 Give him a break, God. 10 00:01:47,816 --> 00:01:49,942 I love him, dear Lord. " ``` # Lex the srt file into tokens ```{r} #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Define the regex for each token #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ srt_regexes <- c( time = "\\d+:\\d+:\\d+,\\d+", link = "\\s*-->\\s*", index = "^\\d+$", text = "^.+?$" ) #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Collapse the file into a single string #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ srt <- paste(enc2utf8(srt), collapse = "\n") #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Split the file by regex, and drop the 'link' between times #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ tokens <- lex(srt, multiline = TRUE, srt_regexes) tokens <- tokens[names(tokens) != 'link'] tokens ``` # Parse raw tokens into a data.frame ```{r} #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Merge together runs of text #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ rl <- unclass(rle(names(tokens))) end <- cumsum(rl$lengths)[rl$values == 'text'] len <- rl$lengths[rl$values == 'text'] start <- end - len + 1 text <- mapply(function(start, end) { paste(tokens[start:end], collapse = "\n") }, start, end) #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Extract index and time vectors #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ indices <- as.integer(tokens[names(tokens) == 'index']) times <- tokens[names(tokens) == 'time'] #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # Munge into data.frame #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ data.frame( index = indices, start = times[c(T, F)], end = times[c(F, T)], text = text, stringsAsFactors = FALSE ) ```