strsplit(perl=TRUE), gregexpr(perl=TRUE) very slow for long strings
While doing some speed testing I noticed that in R-3.2.3 the perl=TRUE
variants of strsplit() and gregexpr() take time proportional to the
square of the number of pattern matches in their input strings. E.g.,
the attached test function times gsub(), strsplit(), and gregexpr(), with
perl=TRUE (PCRE) and perl=FALSE (TRE), on an input string containing n
matches of the given pattern. Notice the quadratic (in n) growth in time
for the StrSplitPCRE and RegExprPCRE columns.
N SubTRE SubPCRE StrSplitTRE StrSplitPCRE RegExprTRE RegExprPCRE
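For readers without the attachment, a timing function of this kind might look roughly like the sketch below. It is not the original test code: the name time_regex, the pattern "a", and the repeated "ab" input string are assumptions chosen only so that the input contains exactly n matches. The column names follow the header above.

```r
# Hypothetical timing harness (illustrative only, not the attached function).
# For each n, builds one string with exactly n matches of `pattern` and
# times gsub(), strsplit(), and gregexpr() with TRE and with PCRE.
time_regex <- function(n_values = c(5000, 10000, 20000), pattern = "a") {
  # Elapsed wall-clock seconds for one lazily evaluated expression.
  elapsed <- function(expr) system.time(expr)[["elapsed"]]

  rows <- lapply(n_values, function(n) {
    # "abab...ab": the default pattern "a" matches exactly n times.
    x <- paste(rep("ab", n), collapse = "")
    data.frame(
      N            = n,
      SubTRE       = elapsed(gsub(pattern, "", x, perl = FALSE)),
      SubPCRE      = elapsed(gsub(pattern, "", x, perl = TRUE)),
      StrSplitTRE  = elapsed(strsplit(x, pattern, perl = FALSE)),
      StrSplitPCRE = elapsed(strsplit(x, pattern, perl = TRUE)),
      RegExprTRE   = elapsed(gregexpr(pattern, x, perl = FALSE)),
      RegExprPCRE  = elapsed(gregexpr(pattern, x, perl = TRUE))
    )
  })

  # One row per value of n.
  do.call(rbind, rows)
}
```

Calling time_regex() with successively doubled values of n makes the growth rate visible: a column that scales linearly should roughly double from row to row, while a quadratic one should roughly quadruple. Timings for small n are dominated by noise, so larger values are more informative.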
I have not looked at R's code, but it is possible that the problem is
caused by PCRE rescanning the entire input string, once per match, to
make sure it is valid UTF-8. If so, adding PCRE_NO_UTF8_CHECK to the
flags given to pcre_exec (after the string has been validated once)
would solve the problem. Perhaps R is already doing that in
gsub(perl=TRUE), which would explain why the SubPCRE column does not
show the same growth.
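The arithmetic behind this guess can be written out as a toy cost model. This is only a sketch of the hypothesis, not R's or PCRE's actual code: the function bytes_validated and its parameters are made up for illustration, and it assumes the string length grows in proportion to the number of matches and that each matching call revalidates the whole string.

```r
# Toy cost model of the hypothesis (hypothetical helper, not real PCRE code).
# Counts how many bytes a UTF-8 validity check would examine in total when
# a string containing `n_matches` matches is searched one match at a time.
bytes_validated <- function(n_matches, bytes_per_match = 2,
                            check_each_call = TRUE) {
  total_len <- n_matches * bytes_per_match
  if (check_each_call) {
    # One full-string validation per match: n * (n * k) bytes, quadratic in n.
    n_matches * total_len
  } else {
    # Validate once up front, then skip the check: linear in n.
    total_len
  }
}

# Ratio of total work when the number of matches doubles.
bytes_validated(2000) / bytes_validated(1000)
bytes_validated(2000, check_each_call = FALSE) /
  bytes_validated(1000, check_each_call = FALSE)
```

Under these assumptions doubling n quadruples the validation work when the check runs on every call, and only doubles it when the check runs once, which matches the pattern seen in the PCRE columns.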