Bug: time complexity of substring is quadratic as string size and number of substrings increases

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Bug: time complexity of substring is quadratic as string size and number of substrings increases

Toby Hocking-2
Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
yesterday)

I believe that I have found another bug, this time in the substring
function. The use case that I am concerned with is when there is a single
(character scalar) text/subject, and many substrings to extract. For example

substring("AAAA", 1:4, 1:4)

or more generally,

N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)

The problem I observe is that the time complexity is quadratic in N, as
shown on this figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

I expected the time complexity to be linear in N.

The example above may seem contrived/trivial, but it is indeed relevant to
a number of packages (rex, rematch2, namedCapture) which provide functions
that use gregexpr and then substring to extract the text in the captured
sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas
other packages (and the gregexpr function with perl=TRUE after applying the
patch discussed yesterday) have linear time complexity. I believe the
problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R

I suspect that a fix can be accomplished by optimizing the implementation
of substring, for the special case when the text/subject is a single
element (character scalar). Right now I notice that the substring R code
uses rep_len so that the text/subject which is passed to the C code is a
character vector with the same length as the number of substrings to
extract. Maybe the C code is calling strlen for each of these (identical)
text/subject elements?

Anyway, it would be useful to have some feedback to make sure this is
indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
up for an account)

Toby

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Bug: time complexity of substring is quadratic as string size and number of substrings increases

Toby Hocking-2
Update: I have observed that stringi::stri_sub is linear time complexity,
and it computes the same thing as base::substring. figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

To me this is a clear indication of a bug in substring, but again it would
be nice to have some feedback/confirmation before posting on bugzilla.

Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
doing.




On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <[hidden email]> wrote:

> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
> yesterday)
>
> I believe that I have found another bug, this time in the substring
> function. The use case that I am concerned with is when there is a single
> (character scalar) text/subject, and many substrings to extract. For example
>
> substring("AAAA", 1:4, 1:4)
>
> or more generally,
>
> N=1000
> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>
> The problem I observe is that the time complexity is quadratic in N, as
> shown on this figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> I expected the time complexity to be linear in N.
>
> The example above may seem contrived/trivial, but it is indeed relevant to
> a number of packages (rex, rematch2, namedCapture) which provide functions
> that use gregexpr and then substring to extract the text in the captured
> sub-patterns. The figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
> shows the issue: these packages have quadratic time complexity, whereas
> other packages (and the gregexpr function with perl=TRUE after applying the
> patch discussed yesterday) have linear time complexity. I believe the
> problem is the substring function. Source for this figure:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>
> I suspect that a fix can be accomplished by optimizing the implementation
> of substring, for the special case when the text/subject is a single
> element (character scalar). Right now I notice that the substring R code
> uses rep_len so that the text/subject which is passed to the C code is a
> character vector with the same length as the number of substrings to
> extract. Maybe the C code is calling strlen for each of these (identical)
> text/subject elements?
>
> Anyway, it would be useful to have some feedback to make sure this is
> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
> up for an account)
>
> Toby
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Bug: time complexity of substring is quadratic as string size and number of substrings increases

Tomas Kalibera
On 2/20/19 7:55 PM, Toby Hocking wrote:

> Update: I have observed that stringi::stri_sub is linear time complexity,
> and it computes the same thing as base::substring. figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> To me this is a clear indication of a bug in substring, but again it would
> be nice to have some feedback/confirmation before posting on bugzilla.
>
> Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
> doing.

Thanks for the report, I am working on a patch that will address this.

I confirm there is a lot of potential for speedup. On my system,

'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'

spends 96% time in checking if the string is ascii and 3% in strlen();
if we take advantage of the pre-computed value in the ASCII bit, the
speed up is about 40x. Of course, with micro-benchmarks, any performance
limitation can be arbitrarily inflated, users cannot expect to see these
or any close speedups in applications as a result of the patch. The
patch is going to do other easy optimizations that will not complicate
the code, including avoiding the strlen() call (taking advantage of
pre-computed length of R character object).

Best
Tomas

>
>
>
> On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <[hidden email]> wrote:
>
>> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
>> yesterday)
>>
>> I believe that I have found another bug, this time in the substring
>> function. The use case that I am concerned with is when there is a single
>> (character scalar) text/subject, and many substrings to extract. For example
>>
>> substring("AAAA", 1:4, 1:4)
>>
>> or more generally,
>>
>> N=1000
>> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>>
>> The problem I observe is that the time complexity is quadratic in N, as
>> shown on this figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
>> source:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>>
>> I expected the time complexity to be linear in N.
>>
>> The example above may seem contrived/trivial, but it is indeed relevant to
>> a number of packages (rex, rematch2, namedCapture) which provide functions
>> that use gregexpr and then substring to extract the text in the captured
>> sub-patterns. The figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
>> shows the issue: these packages have quadratic time complexity, whereas
>> other packages (and the gregexpr function with perl=TRUE after applying the
>> patch discussed yesterday) have linear time complexity. I believe the
>> problem is the substring function. Source for this figure:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>>
>> I suspect that a fix can be accomplished by optimizing the implementation
>> of substring, for the special case when the text/subject is a single
>> element (character scalar). Right now I notice that the substring R code
>> uses rep_len so that the text/subject which is passed to the C code is a
>> character vector with the same length as the number of substrings to
>> extract. Maybe the C code is calling strlen for each of these (identical)
>> text/subject elements?
>>
>> Anyway, it would be useful to have some feedback to make sure this is
>> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
>> up for an account)
>>
>> Toby
>>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Bug: time complexity of substring is quadratic as string size and number of substrings increases

Tomas Kalibera
An optimized version of substring/substr is now in R-devel (76172).

Best,
Tomas

On 2/22/19 8:16 PM, Tomas Kalibera wrote:

> On 2/20/19 7:55 PM, Toby Hocking wrote:
>> Update: I have observed that stringi::stri_sub is linear time
>> complexity,
>> and it computes the same thing as base::substring. figure
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png 
>>
>> source:
>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R 
>>
>>
>> To me this is a clear indication of a bug in substring, but again it
>> would
>> be nice to have some feedback/confirmation before posting on bugzilla.
>>
>> Also this suggests a fix -- just need to copy whatever
>> stringi::stri_sub is
>> doing.
>
> Thanks for the report, I am working on a patch that will address this.
>
> I confirm there is a lot of potential for speedup. On my system,
>
> 'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
>
> spends 96% time in checking if the string is ascii and 3% in strlen();
> if we take advantage of the pre-computed value in the ASCII bit, the
> speed up is about 40x. Of course, with micro-benchmarks, any
> performance limitation can be arbitrarily inflated, users cannot
> expect to see these or any close speedups in applications as a result
> of the patch. The patch is going to do other easy optimizations that
> will not complicate the code, including avoiding the strlen() call
> (taking advantage of pre-computed length of R character object).
>
> Best
> Tomas
>
>>
>>
>>
>> On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <[hidden email]> wrote:
>>
>>> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
>>> yesterday)
>>>
>>> I believe that I have found another bug, this time in the substring
>>> function. The use case that I am concerned with is when there is a
>>> single
>>> (character scalar) text/subject, and many substrings to extract. For
>>> example
>>>
>>> substring("AAAA", 1:4, 1:4)
>>>
>>> or more generally,
>>>
>>> N=1000
>>> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>>>
>>> The problem I observe is that the time complexity is quadratic in N, as
>>> shown on this figure
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png 
>>>
>>> source:
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R 
>>>
>>>
>>> I expected the time complexity to be linear in N.
>>>
>>> The example above may seem contrived/trivial, but it is indeed
>>> relevant to
>>> a number of packages (rex, rematch2, namedCapture) which provide
>>> functions
>>> that use gregexpr and then substring to extract the text in the
>>> captured
>>> sub-patterns. The figure
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png 
>>>
>>> shows the issue: these packages have quadratic time complexity, whereas
>>> other packages (and the gregexpr function with perl=TRUE after
>>> applying the
>>> patch discussed yesterday) have linear time complexity. I believe the
>>> problem is the substring function. Source for this figure:
>>> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R 
>>>
>>>
>>> I suspect that a fix can be accomplished by optimizing the
>>> implementation
>>> of substring, for the special case when the text/subject is a single
>>> element (character scalar). Right now I notice that the substring R
>>> code
>>> uses rep_len so that the text/subject which is passed to the C code
>>> is a
>>> character vector with the same length as the number of substrings to
>>> extract. Maybe the C code is calling strlen for each of these
>>> (identical)
>>> text/subject elements?
>>>
>>> Anyway, it would be useful to have some feedback to make sure this is
>>> indeed a bug before I post on bugzilla. (btw thanks Martin for
>>> signing me
>>> up for an account)
>>>
>>> Toby
>>>
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel