sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Hervé Pagès-2
Hi,

I have a long numeric vector 'xx' and I want to use sum() to count
the number of elements that satisfy some criterion, like non-zero
values or values lower than a certain threshold, etc.

The problem is: sum() returns an NA (with a warning) if the count
is greater than 2^31. For example:

   > xx <- runif(3e9)
   > sum(xx < 0.9)
   [1] NA
   Warning message:
   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

This already takes a long time, and doing sum(as.numeric(.)) would
take even longer and require allocating 24 GB of memory just to
store an intermediate numeric vector made of 0s and 1s. Plus, having
to do sum(as.numeric(.)) every time I need to count things is not
convenient and is easy to forget.
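For what it's worth, the counting itself needs no intermediate vector at
all: a single pass with a 64-bit accumulator is enough. A minimal C sketch
of the idea (illustrative only, not R's actual code; count_below is a
made-up name):

```c
#include <stddef.h>
#include <stdint.h>

/* Count the elements of x that are below a threshold, using a
 * 64-bit accumulator so the count can exceed 2^31 - 1 without
 * overflowing and without allocating an intermediate 0/1 vector. */
int64_t count_below(const double *x, size_t n, double threshold)
{
    int64_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] < threshold)
            count++;
    return count;
}
```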

It seems that sum() on a logical vector could be modified to return
the count as a double when it cannot be represented as an integer.
Note that length() already does this, so that wouldn't create a
precedent.  Also, FWIW, prod() avoids the problem by always returning
a double, whatever the type of the input (except on a complex
vector).

I can provide a patch if this change sounds reasonable.

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Henrik Bengtsson-5
I second this feature request (it's understandable that this and
possibly other parts of the code were left behind / forgotten after
the introduction of long vectors).

I think mean() avoids full copies, so in the meantime you can work
around this limitation using:

countTRUE <- function(x, na.rm = FALSE) {
  nx <- length(x)
  if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
  nx * mean(x, na.rm = na.rm)
}

(not sure if one needs to worry about rounding errors, i.e. whether
nx * mean(x) always comes out to an exact integer)

x <- rep(TRUE, times = .Machine$integer.max+1)
object.size(x)
## 8589934632 bytes

p <- profmem::profmem( n <- countTRUE(x) )
str(n)
## num 2.15e+09
print(n == .Machine$integer.max + 1)
## [1] TRUE

print(p)
## Rprofmem memory profiling of:
## n <- countTRUE(x)
##
## Memory allocations:
##      bytes calls
## total     0


FYI / related: I've just updated matrixStats::sum2() to support
logicals (develop branch) and I'll also try to update
matrixStats::count() to count beyond .Machine$integer.max.

/Henrik

On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <[hidden email]> wrote:


Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler
>>>>> Hervé Pagès <[hidden email]>
>>>>>     on Fri, 2 Jun 2017 04:05:15 -0700 writes:


This sounds very reasonable,  thank you Hervé, for the report,
and even more for a (small) patch.

Martin


Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler
>>>>> Martin Maechler <[hidden email]>
>>>>>     on Tue, 6 Jun 2017 09:45:44 +0200 writes:

>>>>> Hervé Pagès <[hidden email]>
>>>>>     on Fri, 2 Jun 2017 04:05:15 -0700 writes:


I was made aware of the fact that R treats logical and
integer very often identically in the C code, and in general we
even document that logicals are treated as 0/1/NA integers in
arithmetic.

For the present case that would mean that we should also
safeguard against *integer* overflow in sum(), and that is
not something we have done / wanted to do in the past... speed
being one reason.

So this ends up being more delicate than I had thought at first,
because changing  sum(<logical>)  only would mean that

  sum(LOGI)      and
  sum(as.integer(LOGI))

would start to differ for a logical vector LOGI.

So, for now this is something that must be approached carefully,
and the R Core team may want to discuss it "in private" first.

I'm sorry for having raised possibly unrealistic expectations.
Martin


Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Hervé Pagès-2
Hi Martin,

On 06/07/2017 03:54 AM, Martin Maechler wrote:


No worries. Thanks for taking my proposal into consideration.
Note that the isum() function in src/main/summary.c already uses
a 64-bit accumulator to accommodate intermediate sums > INT_MAX.
So it should be easy to modify the function to overflow only for
much bigger final sums, without altering performance. R_XLEN_T_MAX
seems like the natural threshold.
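For illustration, the kind of change being proposed could look like this
(a hypothetical sketch, not a patch against summary.c; logical_sum64 is a
made-up name): keep the 64-bit accumulation and only fall back to a
double result when the total no longer fits in a 32-bit int.

```c
#include <stdint.h>
#include <limits.h>

/* Hypothetical sketch: sum a 0/1 (logical) vector into a 64-bit
 * accumulator.  Writes the total to *out and returns 1 if the
 * result still fits in a 32-bit int (so the caller can keep
 * returning an integer), or 0 if the caller should return the
 * total as a double instead of warning about overflow. */
int logical_sum64(const unsigned char *x, int64_t n, int64_t *out)
{
    int64_t s = 0;
    for (int64_t i = 0; i < n; i++)
        s += x[i];              /* each element is 0 or 1 */
    *out = s;
    return s <= INT_MAX;
}
```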

Cheers,
H.



Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Henrik Bengtsson-5
Just following up on this old thread since matrixStats 0.53.0 is now
out, which supports this use case:

> x <- rep(TRUE, times = 2^31)

> y <- sum(x)
> y
[1] NA
Warning message:
In sum(x) : integer overflow - use sum(as.numeric(.))

> y <- matrixStats::sum2(x, mode = "double")
> y
[1] 2147483648
> str(y)
 num 2.15e+09

No coercion is taking place, so the memory overhead is zero:

> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
Rprofmem memory profiling of:
y <- matrixStats::sum2(x, mode = "double")

Memory allocations:
      bytes calls
total     0

/Henrik

On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
<[hidden email]> wrote:


Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler
>>>>> Henrik Bengtsson <[hidden email]>
>>>>>     on Thu, 25 Jan 2018 09:30:42 -0800 writes:


Thank you, Henrik, for the reminder.

Back in June, I had mentioned to Hervé and R-devel that
'logical' should remain treated as 'integer', as in all
arithmetic in (S and) R.  Hervé did mention the isum()
function in the C code, which is relevant here and which
already has a LONG_INT accumulator -- *but* if we consider
that sum() takes '...', i.e. a conceptually arbitrary number
of long integer vector arguments, even that accumulator won't
suffice.

Before talking about implementation / patch, I think we should
consider two possible goals of a change --- I agree the status
quo is not a real option:

1) sum(x) for logical and integer x would return a double
      in any case, and overflow could not happen (unless the
      result were larger than .Machine$double.xmax, which I
      think is not possible even with an "arbitrary" nargs()
      of sum()).

2) sum(x) for logical and integer x would return an integer in
       all cases where there is no overflow, including returning
       NA_integer_ in case of NAs.
   If there would be an overflow, it must be detected "in time"
   and the result returned as a double.

The big advantage of 2) is that it is backward compatible in
99.x % of use cases; another advantage is that it may be a tiny
bit more efficient.  Also, in the case of "counting" (logical),
it is nice to get an integer instead of a double when we can --
entirely analogous to the behavior of length(), which returns
an integer whenever possible.

The advantage of 1) is uniformity.

We should (at least provisionally) decide between 1) and 2) and
then go for that.  It could be that going for 1) would have bad
compatibility consequences in package space, because indeed we
had documented that sum() returns an integer for logical and
integer arguments.
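A sketch of what option 2) would mean operationally (hypothetical names,
not R's code; R encodes NA for integers as INT_MIN, which is what the
sentinel below mimics): accumulate in 64 bits, propagate NA unless na.rm,
and let the caller pick the return type from the final value.

```c
#include <stdint.h>
#include <limits.h>

#define MY_NA_INTEGER INT_MIN   /* mimics R's NA_integer_ sentinel */

/* Hypothetical option-2 semantics: returns 0 if the result is NA
 * (an NA was seen and narm is false), else 1 with the total in *out.
 * The caller would then return an integer if *out fits in 32 bits
 * and a double otherwise. */
int int_sum_opt2(const int *x, int64_t n, int narm, int64_t *out)
{
    int64_t s = 0;
    for (int64_t i = 0; i < n; i++) {
        if (x[i] == MY_NA_INTEGER) {
            if (narm) continue;
            return 0;           /* result is NA */
        }
        s += x[i];
    }
    *out = s;
    return 1;
}
```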

I currently don't really have time to
{work on implementing + dealing with the consequences}
for either ..

Martin


Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Hervé Pagès-2
Hi Martin, Henrik,

Thanks for the follow-up.

@Martin: I vote for 2) without *any* hesitation :-)

(and uniformity could be restored at some point in the
future by having prod(), rowSums(), colSums(), and others
align with the behavior of length() and sum())

Cheers,
H.


On 01/27/2018 03:06 AM, Martin Maechler wrote:



Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler
>>>>> Hervé Pagès <[hidden email]>
>>>>>     on Tue, 30 Jan 2018 13:30:18 -0800 writes:


As a matter of fact, I had procrastinated and had already worked
a bit on implementing '2)' over the weekend, and made it work -
more or less.  It needs a bit more work, and I had also considered
replacing the numbers in the current overflow check

        if (ii++ > 1000) { \
            ii = 0; \
            if (s > 9000000000000000L || s < -9000000000000000L) { \
                if(!updated) updated = TRUE; \
                *value = NA_INTEGER; \
                warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
                return updated; \
            } \
        } \

i.e. think of tweaking the '1000' and '9000000000000000L',
but decided to leave these and add comments there about why. For
the moment.
They may look arbitrary, but are not at all: if you multiply
them (which looks correct if we check the sum 's' only every
1000th time, though I am still not sure the constants *are*
correct) you get  9*10^18, which is only slightly smaller than
2^63 - 1, which may be the maximal "LONG_INT" integer we have.
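The arithmetic behind those two constants can be checked at the R
prompt (a quick illustration; note that 2^63 - 1 rounds to 2^63 in
double precision, which does not change the outcome of the comparison):

```r
## Product of the check interval (1000) and the threshold (9e15),
## compared against the largest signed 64-bit integer, 2^63 - 1:
1000 * 9e15              # 9e18, exactly representable as a double
2^63 - 1                 # prints ~9.223372e+18
1000 * 9e15 < 2^63 - 1   # TRUE: a bit of headroom remains
```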

So, in the end, at least for now, we do not quite go all the way
but overflow a bit earlier,... but do potentially gain a bit of
speed, notably with the ITERATE_BY_REGION(..) macros
(which I did not show above).

Will hopefully become available in R-devel real soon now.

Martin

    > Cheers,
    > H.


    > On 01/27/2018 03:06 AM, Martin Maechler wrote:
    >>>>>>> Henrik Bengtsson <[hidden email]>
    >>>>>>> on Thu, 25 Jan 2018 09:30:42 -0800 writes:
    >>
    >> > Just following up on this old thread since matrixStats 0.53.0 is now
    >> > out, which supports this use case:
    >>
    >> >> x <- rep(TRUE, times = 2^31)
    >>
    >> >> y <- sum(x)
    >> >> y
    >> > [1] NA
    >> > Warning message:
    >> > In sum(x) : integer overflow - use sum(as.numeric(.))
    >>
    >> >> y <- matrixStats::sum2(x, mode = "double")
    >> >> y
    >> > [1] 2147483648
    >> >> str(y)
    >> > num 2.15e+09
    >>
    >> > No coercion is taking place, so the memory overhead is zero:
    >>
    >> >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
    >> > Rprofmem memory profiling of:
    >> > y <- matrixStats::sum2(x, mode = "double")
    >>
    >> > Memory allocations:
    >> > bytes calls
    >> > total     0
    >>
    >> > /Henrik
    >>
    >> Thank you, Henrik, for the reminder.
    >>
    >> Back in June, I had mentioned to Hervé and R-devel that
    >> 'logical' should continue to be treated as 'integer', as in all
    >> arithmetic in (S and) R.     Hervé did mention the isum()
    >> function in the C code which is relevant here .. which does have
    >> a LONG INT counter already -- *but* if we consider that sum()
    >> has '...' i.e. a conceptually arbitrary number of long vector
    >> integer arguments that counter won't suffice even there.
    >>
    >> Before talking about implementation / patch, I think we should
    >> consider 2 possible goals of a change --- I agree the status quo
    >> is not a real option
    >>
    >> 1) sum(x) for logical and integer x  would return a double
    >> in any case, and overflow should not happen (unless
    >> the result would be larger than .Machine$double.xmax,
    >> which I think is not possible even with "arbitrary"
    >> nargs() of sum()).
    >>
    >> 2) sum(x) for logical and integer x  should return an integer in
    >> all cases where there is no overflow, including returning
    >> NA_integer_ in case of NAs.
    >> If there is an overflow, it must be detected "in time"
    >> and the result should be double.
    >>
    >> The big advantage of 2) is that it is backward compatible in 99.x %
    >> of use cases, and another advantage that it may be a very small
    >> bit more efficient.  Also, in the case of "counting" (logical),
    >> it is nice to get an integer instead of double when we can --
    >> entirely analogously to the behavior of length() which returns
    >> integer whenever possible.
    >>
    >> The advantage of 1) is uniformity.
    >>
    >> We should (at least provisionally) decide between 1) and 2) and then go for that.
    >> It could be that going for 1) may have bad
    >> compatibility-consequences in package space, because indeed we
    >> had documented sum() would be integer for logical and integer arguments.
    >>
    >> I currently don't really have time to
    >> {work on implementing + dealing with the consequences}
    >> for either ..
    >>
    >> Martin
    >>
    >> > On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
    >> > <[hidden email]> wrote:
    >> >> I second this feature request (it's understandable that this and
    >> >> possibly other parts of the code was left behind / forgotten after the
    >> >> introduction of long vector).
    >> >>
    >> >> I think mean() avoids full copies, so in the meanwhile, you can work
    >> >> around this limitation using:
    >> >>
    >> >> countTRUE <- function(x, na.rm = FALSE) {
    >> >> nx <- length(x)
    >> >> if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
    >> >> nx * mean(x, na.rm = na.rm)
    >> >> }
    >> >>
    >> >> (not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
    >> >>
    >> >> x <- rep(TRUE, times = .Machine$integer.max+1)
    >> >> object.size(x)
    >> >> ## 8589934632 bytes
    >> >>
    >> >> p <- profmem::profmem( n <- countTRUE(x) )
    >> >> str(n)
    >> >> ## num 2.15e+09
    >> >> print(n == .Machine$integer.max + 1)
    >> >> ## [1] TRUE
    >> >>
    >> >> print(p)
    >> >> ## Rprofmem memory profiling of:
    >> >> ## n <- countTRUE(x)
    >> >> ##
    >> >> ## Memory allocations:
    >> >> ##      bytes calls
    >> >> ## total     0
    >> >>
    >> >>
    >> >> FYI / related: I've just updated matrixStats::sum2() to support
    >> >> logicals (develop branch) and I'll also try to update
    >> >> matrixStats::count() to count beyond .Machine$integer.max.
    >> >>
    >> >> /Henrik
    >> >>
    >> >> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <[hidden email]> wrote:
    >> >>> Hi,
    >> >>>
    >> >>> I have a long numeric vector 'xx' and I want to use sum() to count
    >> >>> the number of elements that satisfy some criteria like non-zero
    >> >>> values or values lower than a certain threshold etc...
    >> >>>
    >> >>> The problem is: sum() returns an NA (with a warning) if the count
    >> >>> is greater than 2^31. For example:
    >> >>>
    >> >>> > xx <- runif(3e9)
    >> >>> > sum(xx < 0.9)
    >> >>> [1] NA
    >> >>> Warning message:
    >> >>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
    >> >>>
    >> >>> This already takes a long time and doing sum(as.numeric(.)) would
    >> >>> take even longer and require allocation of 24Gb of memory just to
    >> >>> store an intermediate numeric vector made of 0s and 1s. Plus, having
    >> >>> to do sum(as.numeric(.)) every time I need to count things is not
    >> >>> convenient and is easy to forget.
    >> >>>
    >> >>> It seems that sum() on a logical vector could be modified to return
    >> >>> the count as a double when it cannot be represented as an integer.
    >> >>> Note that length() already does this so that wouldn't create a
    >> >>> precedent. Also, FWIW, prod() avoids the problem by always returning
    >> >>> a double, whatever the type of the input is (except on a complex
    >> >>> vector).
    >> >>>
    >> >>> I can provide a patch if this change sounds reasonable.
    >> >>>
    >> >>> Cheers,
    >> >>> H.
    >> >>>
    >> >>> --
    >> >>> Hervé Pagès
    >>
    >>

    > --
    > Hervé Pagès

    > Program in Computational Biology
    > Division of Public Health Sciences
    > Fred Hutchinson Cancer Research Center
    > 1100 Fairview Ave. N, M1-B514
    > P.O. Box 19024
    > Seattle, WA 98109-1024

    > E-mail: [hidden email]
    > Phone:  (206) 667-5791
    > Fax:    (206) 667-1319


Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Martin Maechler
>>>>> Martin Maechler <[hidden email]>
>>>>>     on Thu, 1 Feb 2018 16:34:04 +0100 writes:

> >>>>> Hervé Pagès <[hidden email]>
> >>>>>     on Tue, 30 Jan 2018 13:30:18 -0800 writes:
>
>     > Hi Martin, Henrik,
>     > Thanks for the follow up.
>
>     > @Martin: I vote for 2) without *any* hesitation :-)
>
>     > (and uniformity could be restored at some point in the
>     > future by having prod(), rowSums(), colSums(), and others
>     > align with the behavior of length() and sum())
>
> As a matter of fact, I had procrastinated and worked at
> implementing '2)' already a bit on the weekend and made it work
> - more or less.  It needs a bit more work, and I had also been considering
> replacing the numbers in the current overflow check
>
>         if (ii++ > 1000) { \
>             ii = 0; \
>             if (s > 9000000000000000L || s < -9000000000000000L) { \
>                 if(!updated) updated = TRUE; \
>                 *value = NA_INTEGER; \
>                 warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
>                 return updated; \
>             } \
>         } \
>
> i.e. think of tweaking the '1000' and '9000000000000000L',
> but decided to leave these and add comments there about why. For
> the moment.
> They may look arbitrary, but are not at all: If you multiply
> them (which looks correct, if we check the sum 's' only every 1000-th
> time ...((still not sure they *are* correct))) you get  9*10^18
> which is only slightly smaller than  2^63 - 1 which may be the
> maximal "LONG_INT" integer we have.
>
> So, in the end, at least for now, we do not quite go all the way
> but overflow a bit earlier,... but do potentially gain a bit of
> speed, notably with the ITERATE_BY_REGION(..) macros
> (which I did not show above).
>
> Will hopefully become available in R-devel real soon now.
>
> Martin

After finishing that... I challenged myself to do better still,
namely achieve "no overflow" at all (even for large or many
integer/logical arguments), and so introduced  irsum()  which uses a
double precision accumulator for integer/logical  ... but which is
really only used when the 64-bit int accumulator gets close to
overflow.
The resulting code is not really beautiful, and also contains a
comment     " (a waste, rare; FIXME ?) "
If anybody feels like finding a more elegant version without the
"waste" case, go ahead and be our guest !
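For readers who do not want to dig into the C code, the core idea of
irsum() can be sketched at the R level with a chunked double-precision
accumulator. This is an illustration only, not the actual
implementation; the function name count2 and the chunk size here are
made up:

```r
## Count TRUE values in a (possibly long) logical vector without
## integer overflow and without allocating a full double copy of 'x':
## each chunk's sum is a small integer, and the running total is kept
## in a double, which counts exactly up to 2^53.
count2 <- function(x, na.rm = FALSE, chunk = 1e6) {
  s <- 0                    # double-precision accumulator
  n <- length(x)
  from <- 1                 # double index, so it works beyond 2^31
  while (from <= n) {
    to <- min(from + chunk - 1, n)
    s <- s + sum(x[from:to], na.rm = na.rm)   # chunk sum: small integer
    from <- to + 1
  }
  if (is.na(s)) return(NA_integer_)           # NA propagates as in sum()
  # Behaviour 2): integer when the count fits, double otherwise
  if (s <= .Machine$integer.max) as.integer(s) else s
}
```

So count2(rep(TRUE, 3e9)) would return 3e9 as a double, allocating
only chunk-sized temporaries along the way.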

Testing the code does need access to a platform with enough GB
RAM, say 32 (and I have run the checks only on servers with >
100 GB RAM). This concerns the new checks at the (current) end
of <R-devel_R>/tests/reg-large.R

In R-devel svn rev >= 74208  for a few minutes now.

Martin
