cbind alternate


cbind alternate

Mary Kindall
I have two one-dimensional lists of elements and want to cbind them and
then write the result to a file. Both lists have more than a million
entries. R is taking a long time to perform this operation.

Is there any alternate way to perform cbind?

x = table1[1:1000000, 1]
y = table2[1:1000000, 5]

z = cbind(x, y)   # hangs the machine

write.table(z, 'out.txt')



--
-------------
Mary Kindall
Yorktown Heights, NY
USA


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: cbind alternate

Bos, Roger-2
You could break the data into chunks, so you cbind and save 50,000
observations at a time. That should be less taxing on your machine's
memory.
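That chunked approach might be sketched as follows (a hypothetical illustration: the toy `table1`/`table2` below stand in for the poster's real tables, whose contents were not shown):

```r
# Toy stand-ins for the poster's table1/table2 (real contents unknown)
table1 <- data.frame(id = 1:200000)
table2 <- data.frame(a = 0, b = 0, c = 0, d = 0, val = rnorm(200000))

chunk <- 50000                       # rows written per pass
n     <- nrow(table1)
for (i in seq(1, n, by = chunk)) {
    j <- min(i + chunk - 1, n)
    z <- cbind(table1[i:j, 1], table2[i:j, 5])   # small, short-lived matrix
    write.table(z, "out.txt", append = (i > 1),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
}
```

Only one 50,000-row chunk is ever held in memory at a time; the output file grows by appending.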

-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
On Behalf Of Mary Kindall
Sent: Friday, January 06, 2012 12:43 PM
To: [hidden email]
Subject: [R] cbind alternate

I have two one-dimensional lists of elements and want to cbind them
and then write the result to a file. Both lists have more than a
million entries. R is taking a long time to perform this operation.

Is there any alternate way to perform cbind?

x = table1[1:1000000, 1]
y = table2[1:1000000, 5]

z = cbind(x, y)   # hangs the machine

write.table(z, 'out.txt')



--
-------------
Mary Kindall
Yorktown Heights, NY
USA


Re: cbind alternate

Marc Schwartz-3
In reply to this post by Mary Kindall
On Jan 6, 2012, at 11:43 AM, Mary Kindall wrote:

> I have two one-dimensional lists of elements and want to cbind them and
> then write the result to a file. Both lists have more than a million
> entries. R is taking a long time to perform this operation.
>
> Is there any alternate way to perform cbind?
>
> x = table1[1:1000000, 1]
> y = table2[1:1000000, 5]
>
> z = cbind(x, y)   # hangs the machine
>
> write.table(z, 'out.txt')



The issue is not the use of cbind(), but that write.table() can be slow with data frames, where each column may be a different class (data type) and requires separate formatting for output. This is referenced in the Note section of ?write.table:

write.table can be slow for data frames with large numbers (hundreds or more) of columns: this is inevitable as each column could be of a different class and so must be handled separately. If they are all of the same class, consider using a matrix instead.


I suspect in this case, while you don't have a large number of columns, you do have a large number of rows, so that there is a tradeoff.

If all of the columns in your source tables are of the same type (e.g. all numeric), coerce 'z' to a matrix and then try using write.table().

z <- matrix(rnorm(1000000 * 6), ncol = 6)

> str(z)
 num [1:1000000, 1:6] -0.713 0.79 -0.538 0.945 1.621 ...

> system.time(write.table(z, file = "test.txt"))
   user  system elapsed
 12.664   0.292  13.029


The resultant file is about 118 MB on my system.

HTH,

Marc Schwartz


Re: cbind alternate

David Winsemius
In reply to this post by Mary Kindall

On Jan 6, 2012, at 12:43 PM, Mary Kindall wrote:

> I have two one-dimensional lists of elements and want to cbind them and
> then write the result to a file. Both lists have more than a million
> entries. R is taking a long time to perform this operation.
>
> Is there any alternate way to perform cbind?
>
> x = table1[1:1000000, 1]
> y = table2[1:1000000, 5]
>
> z = cbind(x, y)   # hangs the machine

You should have been able to bypass the intermediate steps with just:

z = cbind(table1[1:1000000, 1], table2[1:1000000, 5])

Whether you will have sufficient contiguous memory for that object at
the moment, or even after rm(x), rm(y), is in doubt, but had you not
created the unneeded x and y, you _might_ have succeeded in your
limited environment. (Real answer: Buy more RAM.)

I speculate that you are on Windows, and so refer you to the R for
Windows FAQ for further reading about memory limits.


>
> write.table(z, 'out.txt')

I do not know of a way to bypass the requirement of a named object to
pass to write.table, but testing suggests that you could try:

write(t(cbind(table1[1:1000000, 1], table2[1:1000000, 5])), "test.txt", 2)

write() does not require a named object, but it is less inquisitive
than write.table and by default will give you a transposed matrix with
5 columns, which would really mess things up, so you need to transpose
and specify the number of columns. (And that may not save any space
over creating a "z" object.)

In another thread today, master R programmer Bill Dunlap offered this
strategy (with minor modifications by me for your situation):

###
f1 <- function(n, fileName) {
    unlink(fileName)
    system.time({
        fileConn <- file(fileName, "wt")
        on.exit(close(fileConn))
        for (i in seq_len(n)) cat(table1[i, 1], " ",
                                  table2[i, 5],
                                  "\n", file = fileConn)
    })
}

f1(1000000, 'out.txt')

#------------
--

David Winsemius, MD
West Hartford, CT


Re: cbind alternate

Rui Barradas
In reply to this post by Mary Kindall
Hello,

I believe this function can handle a problem of that size, or bigger.

It does NOT create the full matrix; it just writes to the file a certain number of lines at a time.


write.big.matrix <- function(x, y, outfile, nmax=1000){

        if(file.exists(outfile)) unlink(outfile)
        testf <- file(outfile, "at")   # or "wt" - "write text"
        on.exit(close(testf))

        step <- nmax                         # how many at a time
        inx  <- seq(1, length(x), by=step)   # index into 'x' and 'y'
        mat  <- matrix(0, nrow=step, ncol=2) # create a work matrix

        # do it 'nmax' rows per iteration
        for(i in inx){
                mat <- cbind(x[i:(i+step-1)], y[i:(i+step-1)])
                write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
        }

        # and now the remainder
        mat <- NULL
        mat <- cbind(x[(i+1):length(x)], y[(i+1):length(y)])
        write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)

        # return the output filename
        outfile
}

x <- 1:1e6                              # a numeric vector
y <- sample(letters, 1e6, replace=TRUE) # and a character vector
length(x);length(y)                     # of the same length
fl <- "test.txt"                        # output file

system.time(write.big.matrix(x, y, outfile=fl))


On my system it takes (sample output)

   user  system elapsed
   1.59    0.04    1.65

and can handle different types of data. In the example, numeric and character.

If you also need the matrix in memory, try 'cbind' first, without writing to a file.
If that is still slow, adapt the code above to keep inserting chunks into an output matrix.
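One way to adapt it along those lines (a sketch, assuming both columns are numeric so a single preallocated matrix can hold them):

```r
x <- 1:1e6        # example numeric vectors of a million entries
y <- rnorm(1e6)

step <- 1e5
z <- matrix(0, nrow = length(x), ncol = 2)   # preallocate the result once
for (i in seq(1, length(x), by = step)) {
    j <- min(i + step - 1, length(x))
    z[i:j, ] <- cbind(x[i:j], y[i:j])        # fill in place, chunk by chunk
}
```

Preallocating and filling in place avoids repeatedly growing or copying the large object.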

Rui Barradas



Re: cbind alternate

Marc Schwartz-3
In reply to this post by Marc Schwartz-3

On Jan 6, 2012, at 12:39 PM, Marc Schwartz wrote:

> On Jan 6, 2012, at 11:43 AM, Mary Kindall wrote:
>
>> I have two one dimensional list of elements and want to perform cbind and
>> then write into a file. The number of entries are more than a million in
>> both lists. R is taking a lot of time performing this operation.
>>
>> Is there any alternate way to perform cbind?
>>
>> x = table1[1:1000000,1]
>> y = table2[1:1000000,5]
>>
>> z = cbind(x,y)   //hanging the machine
>>
>> write.table(z,'out.txt)
>

Apologies, I mis-read where the hang up was. It is in the use of cbind() prior to calling write.table(), not in write.table() itself.

Not sure why that part is taking a long time unless, as already mentioned, you are short on available memory. This runs quickly for me:

x <- matrix(rnorm(1000000 * 3), ncol = 3)
y <- matrix(rnorm(1000000 * 3), ncol = 3)
 
> system.time(z <- cbind(x, y))
   user  system elapsed
  0.039   0.025   0.065

> str(z)
 num [1:1000000, 1:6] -0.5102 1.8776 2.4635 0.2982 0.0901 ...


To give an example with two data frames containing differing data types, let's use the built-in 'iris' data set, which has 5 columns and 150 rows by default. Let's create a new version with over a million rows:

iris.new <- iris[rep(seq(nrow(iris)), 7000), ]

> str(iris.new)
'data.frame': 1050000 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


> system.time(iris.new2 <- cbind(iris.new, iris.new))
   user  system elapsed
  5.289   0.282   5.658


> str(iris.new2)
'data.frame': 1050000 obs. of  10 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


You might verify the structures of your 'x' and 'y' to be sure that there is not something amiss with either one.

HTH,

Marc Schwartz


Re: cbind alternate

Rui Barradas
In reply to this post by Mary Kindall
Sorry Mary,

My function would write the remainder twice; I had only tested it
with multiples of the chunk size
(and without looking at the lengthy output carefully).

Now checked:

write.big.matrix <- function(x, y, outfile, nmax=1000){

    if(file.exists(outfile)) unlink(outfile)
    testf <- file(outfile, "at")   # or "wt" - "write text"
    on.exit(close(testf))

    step <- nmax                              # how many at a time
    inx  <- seq(1, length(x)-step, by=step)   # index into 'x' and 'y'

    mat  <- matrix(0, nrow=step, ncol=2) # create a work matrix

    # do it 'nmax' rows per iteration
    for(i in inx){
        mat <- cbind(x[i:(i+step-1)], y[i:(i+step-1)])
        write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }

    # and now the remainder
    if(i+step < length(x)){
        mat <- NULL
        mat <- cbind(x[(i+step):length(x)], y[(i+step):length(y)])
        write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
    # return the output filename
    outfile
}

x <- 1:(1e6 + 1234)                             # a numeric vector
y <- sample(letters, 1e6 + 1234, replace=TRUE)  # and a character vector
length(x);length(y)                             # of the same length
fl <- "test.txt"                                # output file

system.time(write.big.matrix(x, y, outfile=fl, nmax=100))

   user  system elapsed
   3.04    0.06    3.09

system.time(write.big.matrix(x, y, outfile=fl))

   user  system elapsed
   1.64    0.12    1.76

Rui Barradas

Re: cbind alternate

jholtman
In reply to this post by Mary Kindall
What is it you want to do with the data after you save it?  Are you
just going to read it back into R?  If so, consider using save/load.
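For example (a sketch; saveRDS/readRDS would work equally well for a single object):

```r
x <- 1:1e6
y <- rnorm(1e6)
z <- cbind(x, y)

save(z, file = "out.RData")   # compact binary, much faster than write.table
rm(z)
load("out.RData")             # restores an object named 'z'
```

The binary round trip skips all the text formatting that makes write.table/read.table slow, at the cost of the file only being readable from R.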

On Fri, Jan 6, 2012 at 12:43 PM, Mary Kindall <[hidden email]> wrote:

> I have two one-dimensional lists of elements and want to cbind them and
> then write the result to a file. Both lists have more than a million
> entries. R is taking a long time to perform this operation.
>
> Is there any alternate way to perform cbind?
>
> x = table1[1:1000000, 1]
> y = table2[1:1000000, 5]
>
> z = cbind(x, y)   # hangs the machine
>
> write.table(z, 'out.txt')
>
>
>
> --
> -------------
> Mary Kindall
> Yorktown Heights, NY
> USA
>



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
