Efficient way to subset rows in R for dataset with 10^7 columns

Efficient way to subset rows in R for dataset with 10^7 columns

jackarnestad
I have a data.table with dimensions 100 by 10^7.

When I do

    trainIndex <-
      caret::createDataPartition(
        df$status,
        p = .9,
        list = FALSE,
        times = 1
      )
    outerTrain <- df[trainIndex]
    outerTest  <- df[-trainIndex]

Subsetting the rows of df takes over 20 minutes.

What is the best way to efficiently subset this?

Thanks!

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Efficient way to subset rows in R for dataset with 10^7 columns

Jeff Newmiller
You have 10^7 columns? That process is bound to be slow.
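
A rough size estimate helps show why. The sketch below assumes every column is numeric (8 bytes per value), which the thread does not state. The variable names are illustrative only.

```r
# Back-of-envelope size of a 100 x 10^7 table.
# Assumption: all columns are doubles (8 bytes per value).
n_rows <- 100
n_cols <- 1e7
bytes_per_double <- 8

# Raw payload only, ignoring any per-column overhead.
data_gib <- n_rows * n_cols * bytes_per_double / 2^30

# Measure the actual per-column footprint on a small sample and scale up.
# This also captures R's per-vector header and the column names.
sample_cols <- 1000
sample_df <- as.data.frame(matrix(rnorm(n_rows * sample_cols), nrow = n_rows))
per_col_bytes <- as.numeric(object.size(sample_df)) / sample_cols
est_gib <- per_col_bytes * n_cols / 2^30

cat(sprintf("raw values:              %.2f GiB\n", data_gib))
cat(sprintf("estimated with overhead: %.2f GiB\n", est_gib))
```

Under that assumption the raw values alone come to about 7.45 GiB, and a 90% row subset is a second copy of nearly the same size, so the split needs roughly twice the memory of the table itself.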

On April 13, 2018 5:31:32 PM PDT, Jack Arnestad <[hidden email]> wrote:

>I have a data.table with dimensions 100 by 10^7.
>
>When I do
>
>    trainIndex <-
>      caret::createDataPartition(
>        df$status,
>        p = .9,
>        list = FALSE,
>        times = 1
>      )
>    outerTrain <- df[trainIndex]
>    outerTest  <- df[-trainIndex]
>
>Subsetting the rows of df takes over 20 minutes.
>
>What is the best way to efficiently subset this?
>
>Thanks!

--
Sent from my phone. Please excuse my brevity.
Re: Efficient way to subset rows in R for dataset with 10^7 columns

Jeff Newmiller
Oh, there are ways, but the constraining issue here is moving data (memory bandwidth), and data.table is probably already the fastest mechanism for doing that. If you have a computer with four or more real cores, you can try handling a subset of the columns in each task and cbind the results afterward, but it will be hard to accomplish without making extra copies of the data. You are probably already using virtual memory, which is swapped to and from hard disk storage as needed.
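
The column-chunk idea could be sketched roughly as follows. This is a toy-sized illustration, not a tuned solution. It uses a base data.frame so that it runs without extra packages, a plain `sample()` call as a stand-in for `caret::createDataPartition()`, and the `parallel` package that ships with R. The chunk count, core count, and all variable names are assumptions.

```r
library(parallel)  # ships with base R

# Toy stand-in for the real table: 100 rows, 2000 numeric columns.
set.seed(1)
n_rows <- 100
n_cols <- 2000
df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

# Stand-in for caret::createDataPartition(): a plain 90% random row sample.
train_idx <- sort(sample(n_rows, size = round(0.9 * n_rows)))

# Split the column positions into contiguous chunks, one per task.
n_chunks <- 4
col_chunks <- split(seq_len(n_cols),
                    cut(seq_len(n_cols), n_chunks, labels = FALSE))

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task subsets the rows of its own block of columns.
train_parts <- mclapply(
  col_chunks,
  function(cols) df[train_idx, cols, drop = FALSE],
  mc.cores = n_cores
)
test_parts <- mclapply(
  col_chunks,
  function(cols) df[-train_idx, cols, drop = FALSE],
  mc.cores = n_cores
)

# Reassemble the blocks in their original column order.
# unname() stops cbind() from prefixing column names with the chunk labels.
outer_train <- do.call(cbind, unname(train_parts))
outer_test  <- do.call(cbind, unname(test_parts))
```

Each worker has to send its block back to the parent process, and `cbind` then builds the combined result, so this approach does make the extra copies mentioned above. Whether it beats a single-threaded subset on the full table would need to be measured.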

Working in Spark with a distributed file system such as Hadoop's HDFS might solve some of these problems... but I haven't done real work with such tools.

On April 13, 2018 6:31:32 PM PDT, Jack Arnestad <[hidden email]> wrote:

>Yes unfortunately. The goal of the "outer" is to do feature selection
>before fitting it to a model.
>
>Is there a way it could be parallelized?
>
>Thanks!
>
>On Fri, Apr 13, 2018 at 9:08 PM, Jeff Newmiller <[hidden email]> wrote:
>
>> You have 10^7 columns? That process is bound to be slow.

--
Sent from my phone. Please excuse my brevity.