strsplit(perl=TRUE), gregexpr(perl=TRUE) very slow for long strings

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

strsplit(perl=TRUE), gregexpr(perl=TRUE) very slow for long strings

R devel mailing list
While doing some speed testing I noticed that in R-3.2.3 the perl=TRUE
variants of strsplit() and gregexpr() took time proportional to the
square of the number of pattern matches in their input strings.  E.g.,
the attached test function times gsub, strsplit, and gregexpr, with
perl TRUE (PCRE) and FALSE (TRE), when the input string contains 'n'
matches to the given pattern.  Notice the quadratic (in n) time growth
for the StrSplitPCRE and RegExprPCRE columns.

> regex.perf.test(N=2^(11:20))

              N SubTRE SubPCRE StrSplitTRE StrSplitPCRE RegExprTRE RegExprPCRE

elapsed    2048   0.00    0.00        0.00         0.00       0.00        0.00
elapsed    4096   0.00    0.00        0.01         0.00       0.00        0.00
elapsed    8192   0.00    0.00        0.00         0.01       0.00        0.01
elapsed   16384   0.02    0.00        0.00         0.05       0.02        0.08
elapsed   32768   0.00    0.00        0.01         0.16       0.00        0.29
elapsed   65536   0.02    0.01        0.04         0.59       0.01        0.96
elapsed  131072   0.03    0.02        0.08         2.37       0.05        2.43
elapsed  262144   0.06    0.04        0.17         9.58       0.10        9.61
elapsed  524288   0.14    0.05        0.36        39.14       0.21       38.58
elapsed 1048576   0.30    0.08        0.52       155.50       0.40      155.43

I have not looked at R's code, but it is possible that the problem is
caused by PCRE repeatedly scanning (once per match) the entire input
string to make sure it is valid UTF-8.  If so, adding
PCRE_NO_UTF8_CHECK to the flags given to pcre_exec would solve the
problem.  Perhaps R is already doing that in gsub(perl=TRUE).

Here is the test function:

regex.perf.test <- function(N=c(1e4, 2e4, 4e4, 8e4)) {
  makeTestString <- function(n) paste(collapse="",  rep("ab", n))
  s <- lapply(N, makeTestString)
  fns <- list(SubTRE=function(si) gsub("a", "", si, perl=FALSE),
              SubPCRE=function(si) gsub("a", "", si, perl=TRUE),
              StrSplitTRE=function(si) strsplit(si, "a", perl=FALSE),
              StrSplitPCRE=function(si) strsplit(si, "a", perl=TRUE),
              RegExprTRE=function(si) gregexpr("a", si, perl=FALSE),
              RegExprPCRE=function(si) gregexpr("a", si, perl=TRUE))
  times <- lapply(fns, function(fn) sapply(s, function(si)
system.time(fn(si))["elapsed"]))
  do.call("cbind", c(list(N=N), times))
}

Bill Dunlap
TIBCO Software
wdunlap tibco.com

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel