# sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

10 messages

## sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Hi,

I have a long numeric vector 'xx' and I want to use sum() to count the number of elements that satisfy some criterion, such as non-zero values or values lower than a certain threshold.

The problem is: sum() returns an NA (with a warning) if the count is 2^31 or greater. For example:

    > xx <- runif(3e9)
    > sum(xx < 0.9)
    [1] NA
    Warning message:
    In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

This already takes a long time, and doing sum(as.numeric(.)) would take even longer and require allocating 24 GB of memory just to store an intermediate numeric vector made of 0s and 1s. Plus, having to do sum(as.numeric(.)) every time I need to count things is not convenient and is easy to forget.

It seems that sum() on a logical vector could be modified to return the count as a double when it cannot be represented as an integer. Note that length() already does this, so that wouldn't create a precedent. Also, FWIW, prod() avoids the problem by always returning a double, whatever the type of the input is (except on a complex vector).

I can provide a patch if this change sounds reasonable.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone: (206) 667-5791
Fax: (206) 667-1319

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
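One way to get such a count without the large intermediate vector is to evaluate the condition block by block and accumulate the partial counts in a double. The following is only a sketch under stated assumptions: `count_if_chunked()` and its `pred` and `chunk_size` arguments are hypothetical names (not part of base R), and the default block size is arbitrary.

```r
# Hypothetical helper: count the elements of x for which pred() is TRUE,
# one block at a time, so that only chunk_size elements are materialised at once.
count_if_chunked <- function(x, pred, chunk_size = 1e7) {
  stopifnot(is.function(pred), chunk_size >= 1, chunk_size < 2^31)
  n <- length(x)
  total <- 0          # double accumulator: exact for counts far beyond 2^31
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # A block holds fewer than 2^31 elements, so its integer count cannot overflow.
    # NA results from pred() are not counted.
    total <- total + sum(pred(x[start:end]), na.rm = TRUE)
    start <- end + 1
  }
  total
}

# Small-scale usage; the same call shape would apply to a long vector.
xx <- c(0.1, 0.95, 0.5, 0.99, 0.3)
n_below <- count_if_chunked(xx, function(v) v < 0.9, chunk_size = 2)
print(n_below)   # 3
```

The trade-off is an R-level loop plus one temporary block per iteration, in exchange for a result that is always a double and never needs the full 0/1 vector in memory.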

## Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

I second this feature request (it's understandable that this and possibly other parts of the code were left behind / forgotten after the introduction of long vectors).

I think mean() avoids full copies, so in the meantime, you can work around this limitation using:

    countTRUE <- function(x, na.rm = FALSE) {
      nx <- length(x)
      # Short vectors: the integer sum cannot overflow
      if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
      # Long vectors: rescale the proportion of TRUEs back to a count (a double).
      # Exact only when x has no NAs: with na.rm = TRUE, mean() divides by the
      # number of non-NA elements, not by nx.
      nx * mean(x, na.rm = na.rm)
    }

(not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)

    x <- rep(TRUE, times = .Machine$integer.max+1)
    object.size(x)
    ## 8589934632 bytes

    p <- profmem::profmem( n <- countTRUE(x) )
    str(n)
    ## num 2.15e+09
    print(n == .Machine$integer.max + 1)
    ## [1] TRUE

    print(p)
    ## Rprofmem memory profiling of:
    ## n <- countTRUE(x)
    ##
    ## Memory allocations:
    ##      bytes calls
    ## total     0

FYI / related: I've just updated matrixStats::sum2() to support logicals (develop branch) and I'll also try to update matrixStats::count() to count beyond .Machine$integer.max.

/Henrik

On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <[hidden email]> wrote:
> It seems that sum() on a logical vector could be modified to return
> the count as a double when it cannot be represented as an integer.
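On the rounding question raised with the mean()-based workaround: the product length(x) * mean(x) goes through a division and a multiplication in floating point, so it can land a tiny fraction away from the true count. Each operation contributes a relative error of at most about 2^-53, which stays far below 0.5 in absolute terms for any count a vector can actually hold, so rounding to the nearest whole number should recover the count. A minimal sketch, assuming the input has no NA values; `count_true_via_mean()` is a hypothetical name:

```r
# Hypothetical variant of the mean()-based workaround; assumes x contains no NAs.
count_true_via_mean <- function(x) {
  stopifnot(is.logical(x), !anyNA(x))
  if (length(x) == 0L) return(0)   # mean(logical(0)) is NaN, so handle it up front
  # length(x) * mean(x) may differ from the true count in the last bits;
  # round() snaps it back to the nearest whole number.
  round(length(x) * mean(x))
}

# 1 TRUE among 49 elements: 1/49 is not exactly representable as a double.
x49 <- c(TRUE, rep(FALSE, 48))
n49 <- count_true_via_mean(x49)
print(n49)   # 1
```

The result is still a double, so comparisons such as `n49 == 1` behave as expected, while `identical(n49, 1L)` would not.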

## Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

>>>>> Hervé Pagès <[hidden email]>
>>>>>     on Fri, 2 Jun 2017 04:05:15 -0700 writes:

    > It seems that sum() on a logical vector could be modified
    > to return the count as a double when it cannot be
    > represented as an integer.

    > I can provide a patch if this change sounds reasonable.

This sounds very reasonable, thank you Hervé, for the report, and even more for a (small) patch.

Martin

## Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

>>>>> Martin Maechler <[hidden email]>
>>>>>     on Tue, 6 Jun 2017 09:45:44 +0200 writes:

    > This sounds very reasonable, thank you Hervé, for the
    > report, and even more for a (small) patch.

I was made aware of the fact that R very often treats logical and integer identically in the C code, and in general we even mention that logicals are treated as 0/1/NA integers in arithmetic.

For the present case, that would mean that we should also safeguard against *integer* overflow in sum(.), and that is not something we have done / wanted to do in the past... Speed being one reason.

So this ends up being more delicate than I had thought at first, because changing sum() for logical input only would mean that

    sum(LOGI)      and
    sum(as.integer(LOGI))

would start to differ for a logical vector LOGI.

So, for now, this is something that must be approached carefully, and the R Core team may want to discuss it "in private" first.

I'm sorry for having raised possibly unrealistic expectations.

Martin
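The equivalence between logicals and 0/1/NA integers mentioned here can be checked at small scale. The snippet below is only an illustration on a four-element vector (the variable names are arbitrary); it shows the property that a logical-only change to sum() would break for long inputs, namely that summing a logical vector and summing its integer coercion give identical results.

```r
lgl <- c(TRUE, FALSE, TRUE, NA)

# Arithmetic on logicals yields integers, with TRUE as 1L and FALSE as 0L.
two <- TRUE + TRUE
print(typeof(two))   # "integer"

# sum() gives the same integer result on the logical vector and on its
# integer coercion.
s_lgl <- sum(lgl, na.rm = TRUE)
s_int <- sum(as.integer(lgl), na.rm = TRUE)
print(identical(s_lgl, s_int))   # TRUE

# NA propagates the same way in both representations.
print(is.na(sum(lgl)) && is.na(sum(as.integer(lgl))))   # TRUE
```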

## Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Hi Martin,

On 06/07/2017 03:54 AM, Martin Maechler wrote:
> I'm sorry for having raised possibly unrealistic expectations.

No worries. Thanks for taking my proposal into consideration.

Note that the isum() function in src/main/summary.c already uses a 64-bit accumulator to accommodate intermediate sums > INT_MAX. So it should be easy to modify the function so that it overflows only for much bigger final sums, without altering performance. R_XLEN_T_MAX seems like the natural threshold.

Cheers,
H.
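The return rule being proposed (accumulate in a wide type, hand back an integer when the total fits and a double otherwise) can be mimicked in plain R for illustration. This is a sketch only, not the code in summary.c: `sum_wide()` is a hypothetical name, a double stands in for the C-level 64-bit accumulator, the element-wise loop is far too slow for real long vectors, and the upper threshold suggested in the message (R_XLEN_T_MAX) is left out.

```r
# Hypothetical illustration of "integer when it fits, double otherwise".
sum_wide <- function(x) {
  stopifnot(is.logical(x) || is.integer(x))
  total <- 0                          # double used as a wide accumulator
  for (v in x) total <- total + v     # an NA element makes the total NA
  if (is.na(total)) return(NA_integer_)
  # R integers span -.Machine$integer.max .. .Machine$integer.max
  if (abs(total) <= .Machine$integer.max) as.integer(total) else total
}

small <- sum_wide(c(TRUE, TRUE, FALSE))          # fits: returned as integer
big   <- sum_wide(c(.Machine$integer.max, 1L))   # does not fit: returned as double
print(typeof(small))   # "integer"
print(typeof(big))     # "double"
print(big)             # 2147483648
```

Applying the same rule to logical and integer input alike keeps sum(LOGI) and sum(as.integer(LOGI)) in agreement, which is the concern raised earlier in the thread.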

## Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

Just following up on this old thread since matrixStats 0.53.0 is now out, which supports this use case:

    > x <- rep(TRUE, times = 2^31)

    > y <- sum(x)
    > y
    [1] NA
    Warning message:
    In sum(x) : integer overflow - use sum(as.numeric(.))

    > y <- matrixStats::sum2(x, mode = "double")
    > y
    [1] 2147483648
    > str(y)
     num 2.15e+09

No coercion is taking place, so the memory overhead is zero:

    > profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
    Rprofmem memory profiling of:
    y <- matrixStats::sum2(x, mode = "double")

    Memory allocations:
          bytes calls
    total     0

/Henrik

On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson <[hidden email]> wrote:
> FYI / related: I've just updated matrixStats::sum2() to support
> logicals (develop branch) and I'll also try to update
> matrixStats::count() to count beyond .Machine$integer.max.

## Re: sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
