From long to wide format

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

From long to wide format

Jorge I Velez
Dear R-help,

I am working with some data stored as "filename.txt.gz" in my working
directory.
After reading the data in using read.table(), I can see that each of them
has four columns (variable, id, outcome, and rate) and the following
structure:

# sample data
x2 <- data.frame(variable = rep(paste0('x', 1:100), each = 100), id =
rep(paste0('p', 1:100), 100), outcome = sample(0:2, 10000, TRUE), rate =
runif(10000, 0.5, 1))
str(x2)

Each variable, i.e., x1, x2,..., x100 is repeated as many times as the
number of unique IDs (100 in this example).  What I would like to do is to
transform the data above
in a long format.  I can do this by using

# reshape
wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
                timevar = "variable", direction = "wide")
str(wide)

# or a "hack" with mclapply:

require(parallel)
sel <- as.character(unique(x2$variable))
id <- as.character(unique(x2$id))
X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
X[, 1] <- id
colnames(X) <- c('id', sel)
r <- mclapply(seq_along(sel), function(i){
                        out <- x2[x2$variable == sel[i], ][, 3]
                        }, mc.cores = 4)
X[, -1] <- do.call(rbind, r)
X

However, I was wondering if it is possible to come up with another solution
, hopefully faster than these
.  Unfortunately, either one of these takes a very long time to process,
specially when the number of variables is very large
(> 250,000) and the number of ids is ~2000.

I would very much appreciate your suggestions.   At the end of this message
is my sessionInfo().

Thank you very much in advance.

Best regards,
Jorge Velez.-


R>  sessionInfo()

R version 3.0.2 Patched (2013-12-11 r64449)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] graphics  grDevices utils     datasets  parallel  compiler  stats
[8] methods   base

other attached packages:
[1] knitr_1.6.3            ggplot2_1.0.0          slidifyLibraries_0.3.1
[4] slidify_0.3.52

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10
 [5] grid_3.0.2       gtable_0.1.2     markdown_0.7.1   MASS_7.3-33
 [9] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2
[13] reshape2_1.4     scales_0.2.4     stringr_0.6.2    tools_3.0.2
[17] whisker_0.4      yaml_2.1.13

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: From long to wide format

arun kirshna


Hi Jorge,

You may try:
library(dplyr)
library(tidyr)

#Looks like this is faster than the other methods.
system.time({wide1 <- x2%>%
        select(-rate) %>%
        mutate(variable=factor(variable, levels=unique(variable)),id=factor(id, levels=unique(id))) %>%                 
            spread(variable,outcome)
colnames(wide1)[-1] <- paste("outcome",colnames(wide1)[-1],sep=".")
})

 #user  system elapsed
 #  0.006    0.00    0.006



system.time(wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
               timevar = "variable", direction = "wide"))
 #user  system elapsed
 # 0.169   0.000   0.169



system.time({
sel <- unique(x2$variable)
id <- unique(x2$id)
X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
X[, 1] <- id
colnames(X) <- c('id', sel)
r <- mclapply(seq_along(sel), function(i){
                        out <- x2[x2$variable == sel[i], ][, 3]
                        }, mc.cores = 4)
X[, -1] <- do.call(rbind, r)
X
})

# user  system elapsed
#  0.125   0.011   0.074


 wide2 <- wide1
wide2$id <- as.character(wide2$id)
 wide$id <- as.character(wide$id)
all.equal(wide, wide2, check.attributes=F)
#[1] TRUE

A.K.



On Sunday, June 29, 2014 11:48 PM, Jorge I Velez <[hidden email]> wrote:
Dear R-help,

I am working with some data stored as "filename.txt.gz" in my working
directory.
After reading the data in using read.table(), I can see that each of them
has four columns (variable, id, outcome, and rate) and the following
structure:

# sample data
x2 <- data.frame(variable = rep(paste0('x', 1:100), each = 100), id =
rep(paste0('p', 1:100), 100), outcome = sample(0:2, 10000, TRUE), rate =
runif(10000, 0.5, 1))
str(x2)

Each variable, i.e., x1, x2,..., x100 is repeated as many times as the
number of unique IDs (100 in this example).  What I would like to do is to
transform the data above
in a long format.  I can do this by using

# reshape
wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
                timevar = "variable", direction = "wide")
str(wide)

# or a "hack" with mclapply:

require(parallel)
sel <- as.character(unique(x2$variable))
id <- as.character(unique(x2$id))
X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
X[, 1] <- id
colnames(X) <- c('id', sel)
r <- mclapply(seq_along(sel), function(i){
                        out <- x2[x2$variable == sel[i], ][, 3]
                        }, mc.cores = 4)
X[, -1] <- do.call(rbind, r)
X

However, I was wondering if it is possible to come up with another solution
, hopefully faster than these
.  Unfortunately, either one of these takes a very long time to process,
specially when the number of variables is very large
(> 250,000) and the number of ids is ~2000.

I would very much appreciate your suggestions.   At the end of this message
is my sessionInfo().

Thank you very much in advance.

Best regards,
Jorge Velez.-


R>  sessionInfo()

R version 3.0.2 Patched (2013-12-11 r64449)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] graphics  grDevices utils     datasets  parallel  compiler  stats
[8] methods   base

other attached packages:
[1] knitr_1.6.3            ggplot2_1.0.0          slidifyLibraries_0.3.1
[4] slidify_0.3.52

loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10
[5] grid_3.0.2       gtable_0.1.2     markdown_0.7.1   MASS_7.3-33
[9] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2
[13] reshape2_1.4     scales_0.2.4     stringr_0.6.2    tools_3.0.2
[17] whisker_0.4      yaml_2.1.13

    [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: From long to wide format

Jorge I Velez
Hi Arun,

Thank you very much for your suggestion.

While running some tests, I came across the following:

# sample data
n <- 2000
p <- 1000
x2 <- data.frame(variable = rep(paste0('x', 1:p), each = n), id =
rep(paste0('p', 1:p), n), outcome = sample(0:2, n*p, TRUE), rate =
runif(n*p, 0.5, 1))
str(x2)

library(dplyr)
library(tidyr)

# Arun's suggestion
system.time({wide1 <- x2%>%
        select(-rate) %>%
        mutate(variable=factor(variable,
levels=unique(variable)),id=factor(id, levels=unique(id))) %>%

            spread(variable,outcome)
colnames(wide1)[-1] <- paste("outcome",colnames(wide1)[-1],sep=".")
})

# Error: C stack usage  18920219 is too close to the limit
# Timing stopped at: 13.833 0.251 14.085


Do you happen to know what can be done to avoid this?

Thank you.

Best,
Jorge.-


On Mon, Jun 30, 2014 at 6:51 PM, arun <[hidden email]> wrote:

>
>
> Hi Jorge,
>
> You may try:
> library(dplyr)
> library(tidyr)
>
> #Looks like this is faster than the other methods.
> system.time({wide1 <- x2%>%
>         select(-rate) %>%
>         mutate(variable=factor(variable,
> levels=unique(variable)),id=factor(id, levels=unique(id)))
> %>%
>             spread(variable,outcome)
> colnames(wide1)[-1] <- paste("outcome",colnames(wide1)[-1],sep=".")
> })
>
>  #user  system elapsed
>  #  0.006    0.00    0.006
>
>
>
> system.time(wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
>                timevar = "variable", direction = "wide"))
>  #user  system elapsed
>  # 0.169   0.000   0.169
>
>
>
> system.time({
> sel <- unique(x2$variable)
> id <- unique(x2$id)
> X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
> X[, 1] <- id
> colnames(X) <- c('id', sel)
> r <- mclapply(seq_along(sel), function(i){
>                         out <- x2[x2$variable == sel[i], ][, 3]
>                         }, mc.cores = 4)
> X[, -1] <- do.call(rbind, r)
> X
> })
>
> # user  system elapsed
> #  0.125   0.011   0.074
>
>
>  wide2 <- wide1
> wide2$id <- as.character(wide2$id)
>  wide$id <- as.character(wide$id)
> all.equal(wide, wide2, check.attributes=F)
> #[1] TRUE
>
> A.K.
>
>
>
> On Sunday, June 29, 2014 11:48 PM, Jorge I Velez <[hidden email]>
> wrote:
> Dear R-help,
>
> I am working with some data stored as "filename.txt.gz" in my working
> directory.
> After reading the data in using read.table(), I can see that each of them
> has four columns (variable, id, outcome, and rate) and the following
> structure:
>
> # sample data
> x2 <- data.frame(variable = rep(paste0('x', 1:100), each = 100), id =
> rep(paste0('p', 1:100), 100), outcome = sample(0:2, 10000, TRUE), rate =
> runif(10000, 0.5, 1))
> str(x2)
>
> Each variable, i.e., x1, x2,..., x100 is repeated as many times as the
> number of unique IDs (100 in this example).  What I would like to do is to
> transform the data above
> in a long format.  I can do this by using
>
> # reshape
> wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
>                 timevar = "variable", direction = "wide")
> str(wide)
>
> # or a "hack" with mclapply:
>
> require(parallel)
> sel <- as.character(unique(x2$variable))
> id <- as.character(unique(x2$id))
> X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
> X[, 1] <- id
> colnames(X) <- c('id', sel)
> r <- mclapply(seq_along(sel), function(i){
>                         out <- x2[x2$variable == sel[i], ][, 3]
>                         }, mc.cores = 4)
> X[, -1] <- do.call(rbind, r)
> X
>
> However, I was wondering if it is possible to come up with another solution
> , hopefully faster than these
> .  Unfortunately, either one of these takes a very long time to process,
> specially when the number of variables is very large
> (> 250,000) and the number of ids is ~2000.
>
> I would very much appreciate your suggestions.   At the end of this message
> is my sessionInfo().
>
> Thank you very much in advance.
>
> Best regards,
> Jorge Velez.-
>
>
> R>  sessionInfo()
>
> R version 3.0.2 Patched (2013-12-11 r64449)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>
> attached base packages:
> [1] graphics  grDevices utils     datasets  parallel  compiler  stats
> [8] methods   base
>
> other attached packages:
> [1] knitr_1.6.3            ggplot2_1.0.0          slidifyLibraries_0.3.1
> [4] slidify_0.3.52
>
> loaded via a namespace (and not attached):
> [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10
> [5] grid_3.0.2       gtable_0.1.2     markdown_0.7.1   MASS_7.3-33
> [9] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2
> [13] reshape2_1.4     scales_0.2.4     stringr_0.6.2    tools_3.0.2
> [17] whisker_0.4      yaml_2.1.13
>
>     [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: From long to wide format

arun kirshna
HI Jorge,

I was able to reproduce the error.  The link below provides a way to adjust the stack. I didn't test it. 


http://stackoverflow.com/questions/14719349/error-c-stack-usage-is-too-close-to-the-limit
Also check this link

http://stackoverflow.com/questions/13245019/how-to-change-the-stack-size-using-ulimit-or-per-process-on-mac-os-x-for-a-c-or


A.K.

On Monday, June 30, 2014 10:08 PM, Jorge I Velez <[hidden email]> wrote:



Hi Arun,

Thank you very much for your suggestion.   

While running some tests, I came across the following:

# sample data
n <- 2000
p <- 1000
x2 <- data.frame(variable = rep(paste0('x', 1:p), each = n), id = rep(paste0('p', 1:p), n), outcome = sample(0:2, n*p, TRUE), rate = runif(n*p, 0.5, 1))
str(x2)

library(dplyr)
library(tidyr)

# Arun's suggestion
system.time({wide1 <- x2%>%
        select(-rate) %>%
        mutate(variable=factor(variable, levels=unique(variable)),id=factor(id, levels=unique(id))) %>%                 
            spread(variable,outcome)
colnames(wide1)[-1] <- paste("outcome",colnames(wide1)[-1],sep=".")
})

# Error: C stack usage  18920219 is too close to the limit
# Timing stopped at: 13.833 0.251 14.085


Do you happen to know what can be done to avoid this?

Thank you.

Best,
Jorge.-



On Mon, Jun 30, 2014 at 6:51 PM, arun <[hidden email]> wrote:


>
>Hi Jorge,
>
>You may try:
>library(dplyr)
>library(tidyr)
>
>#Looks like this is faster than the other methods.
>system.time({wide1 <- x2%>%
>        select(-rate) %>%
>        mutate(variable=factor(variable, levels=unique(variable)),id=factor(id, levels=unique(id))) %>%                 
>            spread(variable,outcome)
>colnames(wide1)[-1] <- paste("outcome",colnames(wide1)[-1],sep=".")
>})
>
> #user  system elapsed
> #  0.006    0.00    0.006
>
>
>
>system.time(wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
>               timevar = "variable", direction = "wide"))
> #user  system elapsed
> # 0.169   0.000   0.169
>
>
>
>system.time({
>sel <- unique(x2$variable)
>id <- unique(x2$id)
>
>X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
>X[, 1] <- id
>colnames(X) <- c('id', sel)
>r <- mclapply(seq_along(sel), function(i){
>                        out <- x2[x2$variable == sel[i], ][, 3]
>                        }, mc.cores = 4)
>X[, -1] <- do.call(rbind, r)
>X
>})
>
># user  system elapsed
>#  0.125   0.011   0.074
>
>
> wide2 <- wide1
>wide2$id <- as.character(wide2$id)
> wide$id <- as.character(wide$id)
>all.equal(wide, wide2, check.attributes=F)
>#[1] TRUE
>
>A.K.
>
>
>
>
>On Sunday, June 29, 2014 11:48 PM, Jorge I Velez <[hidden email]> wrote:
>Dear R-help,
>
>I am working with some data stored as "filename.txt.gz" in my working
>directory.
>After reading the data in using read.table(), I can see that each of them
>has four columns (variable, id, outcome, and rate) and the following
>structure:
>
># sample data
>x2 <- data.frame(variable = rep(paste0('x', 1:100), each = 100), id =
>rep(paste0('p', 1:100), 100), outcome = sample(0:2, 10000, TRUE), rate =
>runif(10000, 0.5, 1))
>str(x2)
>
>Each variable, i.e., x1, x2,..., x100 is repeated as many times as the
>number of unique IDs (100 in this example).  What I would like to do is to
>transform the data above
>in a long format.  I can do this by using
>
># reshape
>wide <- reshape(x2[, -4], v.names = "outcome", idvar = "id",
>                timevar = "variable", direction = "wide")
>str(wide)
>
># or a "hack" with mclapply:
>
>require(parallel)
>sel <- as.character(unique(x2$variable))
>id <- as.character(unique(x2$id))
>X <- matrix(NA, ncol = length(sel) + 1, nrow = length(id))
>X[, 1] <- id
>colnames(X) <- c('id', sel)
>r <- mclapply(seq_along(sel), function(i){
>                        out <- x2[x2$variable == sel[i], ][, 3]
>                        }, mc.cores = 4)
>X[, -1] <- do.call(rbind, r)
>X
>
>However, I was wondering if it is possible to come up with another solution
>, hopefully faster than these
>.  Unfortunately, either one of these takes a very long time to process,
>specially when the number of variables is very large
>(> 250,000) and the number of ids is ~2000.
>
>I would very much appreciate your suggestions.   At the end of this message
>is my sessionInfo().
>
>Thank you very much in advance.
>
>Best regards,
>Jorge Velez.-
>
>
>R>  sessionInfo()
>
>R version 3.0.2 Patched (2013-12-11 r64449)
>Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
>locale:
>[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>
>attached base packages:
>[1] graphics  grDevices utils     datasets  parallel  compiler  stats
>[8] methods   base
>
>other attached packages:
>[1] knitr_1.6.3            ggplot2_1.0.0          slidifyLibraries_0.3.1
>[4] slidify_0.3.52
>
>loaded via a namespace (and not attached):
>[1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10
>[5] grid_3.0.2       gtable_0.1.2     markdown_0.7.1   MASS_7.3-33
>[9] munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2
>[13] reshape2_1.4     scales_0.2.4     stringr_0.6.2    tools_3.0.2
>[17] whisker_0.4      yaml_2.1.13
>
>
>    [[alternative HTML version deleted]]
>
>______________________________________________
>[hidden email] mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.