Any chance R will ever get beyond the 2^31-1 vector size limit?


Any chance R will ever get beyond the 2^31-1 vector size limit?

Matthew Keller
Hi all,

My institute will hopefully be working on cutting-edge genetic
sequencing data by the Fall of 2010. The datasets will be tens of
gigabytes in size and growing. I'd like to use R to do the primary
analyses. That part is OK, because we can just throw money at the
problem and get lots of RAM running 64-bit R. However, we still run up
against the fact that vectors in R cannot contain more than 2^31-1
elements. I know there are "ways around" this issue, and trust me, I
think I've tried them all (e.g., bringing in portions of the data at a
time; using large-dataset packages in R; using SQL databases, etc.).
But all these 'solutions' are, at the end of the day, much, much more
cumbersome, programming-wise, than just doing things in native R.
Maybe that's just the cost of doing what I'm doing. But my questions,
which may well be naive (I'm not a computer programmer), are:

1) Is there an *inherent* limit to vectors being < 2^31-1 elements
long? That is, in an alternative history of R's development, would it
have been feasible for R not to have this limitation?

2) Is there any possibility that this limit will be overcome in future
revisions of R?

I'm very, very grateful to the people who have spent important parts of
their professional lives developing R. I don't think anyone back in,
say, 1995 could have foreseen that datasets would be far larger than
2^31-1 elements. For better or worse, however, in many fields of
science that is routinely the case today. *If* it's possible to get
around this limit, then I'd like to know whether the R Development Team
takes seriously the needs of large-data users, or whether they feel
that (perhaps not mutually exclusively) developing such capacity is
best left to ad hoc R packages and alternative analysis programs.

Best,

Matt



--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com


Re: Any chance R will ever get beyond the 2^31-1 vector size limit?

Duncan Murdoch
On 09/04/2010 7:38 PM, Matthew Keller wrote:

> Hi all,
>
> My institute will hopefully be working on cutting-edge genetic
> sequencing data by the Fall of 2010. The datasets will be 10's of GB
> large and growing. I'd like to use R to do primary analyses. This is
> OK, because we can just throw $ at the problem and get lots of RAM
> running on 64 bit R. However, we are still running up against the fact
> that vectors in R cannot contain more than 2^31-1. I know there are
> "ways around" this issue, and trust me, I think I've tried them all
> (e.g., bringing in portions of the data at a time; using large-dataset
> packages in R; using SQL databases, etc). But all these 'solutions'
> are, at the end of the day, much much more cumbersome,
> programming-wise, than just doing things in native R. Maybe that's
> just the cost of doing what I'm doing. But my questions, which  may
> well be naive (I'm not a computer programmer), are:
>
> 1) Is there an *inherent* limit to vectors being < 2^31-1 long? I.e.,
> in an alternative history of R's development, would it have been
> feasible for R to not have had this limitation?

The problem is that we use "int" as a vector index.  On most platforms,
that's a signed 32 bit integer, with max value 2^31-1.
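
You can see that ceiling directly from the R prompt (the exact error
message for an over-long allocation varies by R version):

    .Machine$integer.max          # 2147483647, i.e. 2^31 - 1
    .Machine$integer.max + 1L     # NA, with an integer overflow warning
    ## At the time of writing, requesting a longer vector simply fails, e.g.
    ## x <- numeric(2^31)         # error: vector size too large (wording varies)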


>
> 2) Is there any possibility that this limit will be overcome in future
> revisions of R?


Of course, R is open source.  You could rewrite all of the internal code
tomorrow to use 64 bit indexing.

Will someone else do it for you?  Even that is possible.  One problem
is that this would make all of your data incompatible with older
versions of R.  And back to the original question: are you willing to
pay for the development?  Then go ahead; you can have it tomorrow (or
later, if your budget is limited).  Are you waiting for someone else to
do it for free?  Then you need to wait for someone who knows how to do
it to want to do it.


> I'm very very grateful to the people who have spent important parts of
> their professional lives developing R. I don't think anyone back in,
> say, 1995, could have foreseen that datasets would be >>2^32-1 in
> size. For better or worse, however, in many fields of science, that is
> routinely the case today. *If* it's possible to get around this limit,
> then I'd like to know whether the R Development Team takes seriously
> the needs of large data users, or if they feel that (perhaps not
> mutually exclusively) developing such capacity is best left up to ad
> hoc R packages and alternative analysis programs.

There are many ways around the limit today.  Put your data in a data
frame with many columns, each of length 2^31-1 or less.  Put your data
in a database and process it a block at a time.  Etc.
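
A minimal sketch of the block-at-a-time idea, assuming a hypothetical
one-column text file "big_values.txt" and a simple running-mean
reduction:

    ## Read one numeric column in blocks, accumulating a running sum and count
    ## so that no single vector ever approaches the 2^31-1 element limit.
    con <- file("big_values.txt", open = "r")
    block_size <- 1e6
    total <- 0; n <- 0
    repeat {
      block <- scan(con, what = numeric(), n = block_size, quiet = TRUE)
      if (length(block) == 0) break
      total <- total + sum(block)
      n <- n + length(block)
    }
    close(con)
    total / n    # overall mean, without ever holding the full data in memory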

Duncan Murdoch



Re: Any chance R will ever get beyond the 2^31-1 vector size limit?

Martin Morgan
On 04/09/2010 05:36 PM, Duncan Murdoch wrote:
> On 09/04/2010 7:38 PM, Matthew Keller wrote:
>> Hi all,
>>
>> My institute will hopefully be working on cutting-edge genetic
>> sequencing data by the Fall of 2010. The datasets will be 10's of GB
>> large and growing. I'd like to use R to do primary analyses. This is
>> OK, because we can just throw $ at the problem and get lots of RAM
>> running on 64 bit R. However, we are still running up against the fact
>> that vectors in R cannot contain more than 2^31-1. I know there are

Duncan provided an answer (maybe not too encouraging!) to your question;
I'll postulate that the two large sequence-style data sets are SNPs and
'next generation' sequencing, and that for these the picture is quite
positive.

SNPs seem more likely to hit (or may already have hit) the 2^31-1
limit, where one wants to represent, say, 2 million SNPs assayed in
tens of thousands of individuals. But here it's very natural to process
the SNPs either independently or per chromosome (linkage group / ...);
Clayton's snpMatrix package (on Bioconductor) and the ncdf package
(ncdf4, I think, was recently released) make for excellent handling of
this data; throw in a bit of multicore or Rmpi and you'll be very
satisfied.
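
A toy sketch of that per-chromosome split, with tiny placeholder
genotype matrices and mclapply() (provided by the multicore package at
the time; now in parallel):

    library(parallel)   # mclapply(); the 'multicore' package provided it originally
    ## Tiny stand-ins for per-chromosome genotype matrices (individuals x SNPs, coded 0/1/2)
    geno_by_chr <- list(
      chr1 = matrix(sample(0:2, 20, replace = TRUE), nrow = 5),
      chr2 = matrix(sample(0:2, 20, replace = TRUE), nrow = 5)
    )
    ## Allele frequency per SNP, one chromosome at a time, so no single
    ## object needs to approach the 2^31-1 element limit
    freq_by_chr <- mclapply(geno_by_chr,
                            function(g) colMeans(g, na.rm = TRUE) / 2,
                            mc.cores = 2)   # use mc.cores = 1 on Windows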

Next-gen sequencing generates reads (character strings), but only about
20 million reads per lane of an Illumina flow cell, so even a full flow
cell (8 lanes) is about an order of magnitude below the 2^31-1 reads
that would overflow a character vector. And the Biostrings package in
Bioconductor has DNAStringSet, which has already addressed the 2^31-1
limit (in the development version, to be released shortly after the
next R release) and offers diverse sequence-manipulation methods. I'd
also think that in this case any memory limitations are likely amenable
to sequential processing (e.g., by lane of a flow cell). Further, much
of the 'interesting' analysis from R's perspective happens after the
data are reduced (e.g., to 'coverage' vectors, or to count data over a
relatively small number of regions, in the tens of thousands).
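
For concreteness, a tiny DNAStringSet sketch (three placeholder reads
standing in for the millions on a real flow cell):

    library(Biostrings)       # Bioconductor
    reads <- DNAStringSet(c("ACGTTTGA", "GGGCATCA", "ACGTACGT"))
    width(reads)              # read lengths
    alphabetFrequency(reads)  # per-read base counts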

Which is not to say that truly large problems (thousands of fully
sequenced human genomes) aren't just around the corner. But here the
likely solution is a clever disk-based, database-like representation,
probably one that transcends the R community (BAM files and bigWig, for
which see the Bioconductor Rsamtools and rtracklayer packages in the
'devel' branch, represent perhaps a first pass at this kind of
approach).
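
A hedged sketch of that region-at-a-time approach with Rsamtools (the
BAM file name is hypothetical, and the exact API has evolved since):

    library(GenomicRanges)
    library(Rsamtools)                                     # Bioconductor
    region <- GRanges("chr1", IRanges(1, 1e6))             # pull one region at a time
    param  <- ScanBamParam(which = region, what = c("pos", "qwidth"))
    aln    <- scanBam("reads.bam", param = param)          # only this region's reads in memory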

Certainly limited vector lengths add challenges to managing this
information. But I don't think these challenges are difficult to
overcome, nor do I think that they are likely to outweigh the benefits
that R offers in terms of accessible advanced statistical analysis
coupled with flexible and integrated work flows.

Martin

>> "ways around" this issue, and trust me, I think I've tried them all
>> (e.g., bringing in portions of the data at a time; using large-dataset
>> packages in R; using SQL databases, etc). But all these 'solutions'
>> are, at the end of the day, much much more cumbersome,
>> programming-wise, than just doing things in native R. Maybe that's
>> just the cost of doing what I'm doing. But my questions, which  may
>> well be naive (I'm not a computer programmer), are:
>>
>> 1) Is there an *inherent* limit to vectors being < 2^31-1 long? I.e.,
>> in an alternative history of R's development, would it have been
>> feasible for R to not have had this limitation?
>
> The problem is that we use "int" as a vector index.  On most platforms,
> that's a signed 32 bit integer, with max value 2^31-1.
>
>
>>
>> 2) Is there any possibility that this limit will be overcome in future
>> revisions of R?
>
>
> Of course, R is open source.  You could rewrite all of the internal code
> tomorrow to use 64 bit indexing.
>
> Will someone else do it for you?  Even that is possible.  One problem
> are that this will make all of your data incompatible with older
> versions of R.  And back to the original question:  are you willing to
> pay for the development?  Then go ahead, you can have it tomorrow (or
> later, if your budget is limited).  Are you waiting for someone else to
> do it for free?  Then you need to wait for someone who knows how to do
> it to want to do it.
>
>
>> I'm very very grateful to the people who have spent important parts of
>> their professional lives developing R. I don't think anyone back in,
>> say, 1995, could have foreseen that datasets would be >>2^32-1 in
>> size. For better or worse, however, in many fields of science, that is
>> routinely the case today. *If* it's possible to get around this limit,
>> then I'd like to know whether the R Development Team takes seriously
>> the needs of large data users, or if they feel that (perhaps not
>> mutually exclusively) developing such capacity is best left up to ad
>> hoc R packages and alternative analysis programs.
>
> There are many ways around the limit today.  Put your data in a
> dataframe with many columns each of length 2^31-1 or less.  Put your
> data in a database, and process it a block at a time.  Etc.
>
> Duncan Murdoch
>
>>
>> Best,
>>
>> Matt
>>
>>
>>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


Re: Any chance R will ever get beyond the 2^31-1 vector size limit?

Matthew Keller
In reply to this post by Duncan Murdoch
Hi Duncan and R users,

Duncan, thank you for taking the time to respond. I've had several
other comments off the list, and I'd like to summarize what they had to
say, although I won't give sources since I assume there was a reason
people chose not to respond to the whole list. The long and short of it
is that there is hope for people who want R to get beyond the 2^31-1
vector size limit.

First off, I received a couple of responses from people who wanted to
commiserate and who asked me to summarize what I learned. Here you go.

Second, the bigmemory and ff packages can both help with memory issues.
I've had success using bigmemory before and found it quite intuitive.
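
For anyone interested, a minimal file-backed bigmemory sketch (small
placeholder dimensions; the file names are made up):

    library(bigmemory)
    ## The data live in a memory-mapped file on disk, not in an ordinary R vector
    x <- filebacked.big.matrix(nrow = 1000, ncol = 100, type = "double",
                               backingfile = "geno.bin",
                               descriptorfile = "geno.desc")
    x[1, 1:5] <- rnorm(5)    # indexed like a regular matrix
    mean(x[, 1])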

Third, one knowledgeable responder doubted that changing the 2^31-1
limit would 'break' old datasets. He says, "This might be true for
isolated cases of objects stored in binary formats or in workspaces,
but I don't see that as anywhere near as important as the change you
(and we) would like to see."

Fourth, another knowledgeable responder felt that, given the demand
driven by the huge increases in dataset sizes, this limitation would
likely be overcome within the next few years.

Best,

Matt




--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com


Re: Any chance R will ever get beyond the 2^31-1 vector size limit?

Thomas Lumley


There is one thing that would definitely break. Quite a bit of compiled code relies on the fact that the R integer type, the type used to index arrays, and the C int type are all the same.  The C int type won't change, so if the type used to index arrays changes, the R integer type will end up differing from at least one of them.

Suppose you now write

    .C("foo", as.double(x), as.integer(length(x)))

to call a C function with the prototype

    void foo(double* x, int* n)

If length(x) could be larger than 2^31, this isn't going to work: either as.integer() will fail, or it will succeed and produce a value that doesn't fit in an int.
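
The as.integer() half of that failure is easy to demonstrate at the
prompt (the exact warning text varies by version):

    as.integer(2^31 - 1)   # 2147483647, still representable
    as.integer(2^31)       # NA, with a warning about the integer range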


    -thomas





Thomas Lumley
Assoc. Professor, Biostatistics
University of Washington, Seattle
[hidden email]
