patch for gregexpr(perl=TRUE)

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

patch for gregexpr(perl=TRUE)

Toby Hocking-2
Hi all,

Several people have noticed that gregexpr is very slow for large subject
strings when perl=TRUE is specified.
-
https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
-
http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html

I figured out the issue, which is fixed by changing 1 line of code in
src/main/grep.c -- there is a strlen function call which is currently
inside of the while loop over matches, and the patch moves it before the
loop.
https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch

I made some figures that show the quadratic time complexity before applying
the patch, and the linear time complexity after applying the patch
https://github.com/tdhock/namedCapture-article#19-feb-2019

I would have posted a bug report on bugs.r-project.org but I do not have an
account. So can an R-devel person please either (1) accept this patch, or
(2) give me an account so I can post the patch on the bug tracker?

Finally I would like to mention that Bill Dunlap noticed a similar problem
(time complexity which is quadratic in subject size) for strsplit with
perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix
could be accomplished (because I see that strlen is being called in a while
loop in do_strsplit as well).

Thanks
Toby Dylan Hocking

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: patch for gregexpr(perl=TRUE)

Tomas Kalibera
Thanks, in R-devel 76138, I confirm it speeds up gregexpr() with pcre in
Bill Dunlap's example from
https://stat.ethz.ch/pipermail/r-devel/2017-January/073577.html
(RegExprPCRE column)

The performance problem of StrSplitPCRE does not seem to be due to strlen().

Best
Tomas

On 2/19/19 9:37 PM, Toby Hocking wrote:

> Hi all,
>
> Several people have noticed that gregexpr is very slow for large subject
> strings when perl=TRUE is specified.
> -
> https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
> -
> http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
> - https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html
>
> I figured out the issue, which is fixed by changing 1 line of code in
> src/main/grep.c -- there is a strlen function call which is currently
> inside of the while loop over matches, and the patch moves it before the
> loop.
> https://github.com/tdhock/namedCapture-article/blob/master/linear-time-gregexpr-perl.patch
>
> I made some figures that show the quadratic time complexity before applying
> the patch, and the linear time complexity after applying the patch
> https://github.com/tdhock/namedCapture-article#19-feb-2019
>
> I would have posted a bug report on bugs.r-project.org but I do not have an
> account. So can an R-devel person please either (1) accept this patch, or
> (2) give me an account so I can post the patch on the bug tracker?
>
> Finally I would like to mention that Bill Dunlap noticed a similar problem
> (time complexity which is quadratic in subject size) for strsplit with
> perl=TRUE. My patch does NOT fix that, but I suspect that a similar fix
> could be accomplished (because I see that strlen is being called in a while
> loop in do_strsplit as well).
>
> Thanks
> Toby Dylan Hocking
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel