

I have two one-dimensional lists of elements and want to cbind them and
then write the result to a file. There are more than a million entries in
both lists. R is taking a long time to perform this operation.
Is there an alternative to cbind?
x = table1[1:1000000,1]
y = table2[1:1000000,5]
z = cbind(x,y) # hangs the machine
write.table(z,'out.txt')


Mary Kindall
Yorktown Heights, NY
USA
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


You could break the data into chunks, so you cbind and save 50,000
observations at a time. That should be less taxing on your machine and
memory.
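A minimal sketch of the chunked approach described above. The vectors x and y are small made-up stand-ins for table1[, 1] and table2[, 5], and the output path is illustrative; only the 50,000-row chunk size comes from the suggestion itself.

```r
# Hypothetical stand-ins for the poster's table1[, 1] and table2[, 5]
x <- seq_len(120000)
y <- rnorm(120000)

chunk_size <- 50000                     # rows per chunk, as suggested above
out_file <- tempfile(fileext = ".txt")  # illustrative output path

for (s in seq(1, length(x), by = chunk_size)) {
  idx <- s:min(s + chunk_size - 1, length(x))
  # cbind only the current chunk, then append it to the file;
  # the first chunk (s == 1) creates or overwrites the file
  write.table(cbind(x[idx], y[idx]), file = out_file,
              append = (s > 1), quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}
```

Only one chunk-sized matrix exists at any time, so peak memory use should depend on chunk_size rather than on the total number of rows.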
-----Original Message-----
From: [hidden email] [mailto: [hidden email]]
On Behalf Of Mary Kindall
Sent: Friday, January 06, 2012 12:43 PM
To: [hidden email]
Subject: [R] cbind alternate


On Jan 6, 2012, at 11:43 AM, Mary Kindall wrote:
> Is there any alternate way to perform cbind?
>
> x = table1[1:1000000,1]
> y = table2[1:1000000,5]
>
> z = cbind(x,y) # hangs the machine
>
> write.table(z,'out.txt')
The issue is not the use of cbind(), but that write.table() can be slow with data frames, where each column may be a different class (data type) and requires separate formatting for output. This is referenced in the Note section of ?write.table:
write.table can be slow for data frames with large numbers (hundreds or more) of columns: this is inevitable as each column could be of a different class and so must be handled separately. If they are all of the same class, consider using a matrix instead.
I suspect that in this case, although you don't have a large number of columns, you do have a large number of rows, so a similar cost may apply.
If all of the columns in your source tables are of the same type (e.g., all numeric), coerce 'z' to a matrix and then try write.table() again.
> z <- matrix(rnorm(1000000 * 6), ncol = 6)
> str(z)
 num [1:1000000, 1:6] 0.713 0.79 0.538 0.945 1.621 ...
> system.time(write.table(z, file = "test.txt"))
   user  system elapsed
 12.664   0.292  13.029
The resultant file is about 118 Mb on my system.
HTH,
Marc Schwartz


On Jan 6, 2012, at 12:43 PM, Mary Kindall wrote:
> x = table1[1:1000000,1]
> y = table2[1:1000000,5]
>
> z = cbind(x,y) # hangs the machine
You should have been able to bypass the intermediate steps with just:
z = cbind(table1[1:1000000,1], table2[1:1000000,5])
Whether you will have sufficient contiguous memory for that object at
the moment, or even after rm(x) and rm(y), is in doubt, but had you not
created the unneeded x and y, you _might_ have succeeded in your
limited environment. (Real answer: Buy more RAM.)
I speculate that you are on Windows, and so refer you to the R-Win FAQ
for further reading about memory limits.
>
> write.table(z,'out.txt')
I do not know of a way to bypass the requirement of a named object to
pass to write.table, but testing suggests that you could try:
write(t(cbind(table1[1:1000000,1], table2[1:1000000,5])),
      "test.txt", 2)
write() does not require a named object, but it is less inquisitive than
write.table and will give you a transposed matrix with 5 columns by
default, which will really mess things up, so you need to transpose and
specify the number of columns. (And that may not save any space over
creating a "z" object.)
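To illustrate why the transpose is needed, here is a small sketch on a made-up 3-row matrix (the names a, b, m and f are illustrative only). write() emits values in column-major order, so without t() the rows of the file do not match the rows of the matrix.

```r
a <- 1:3
b <- c(10, 20, 30)
m <- cbind(a, b)
f <- tempfile()  # illustrative output path

# Without t(): values are emitted column by column, two per line
write(m, f, ncolumns = 2)
readLines(f)   # "1 2"  "3 10"  "20 30"

# With t(): each output line is one row of the original matrix
write(t(m), f, ncolumns = 2)
readLines(f)   # "1 10"  "2 20"  "3 30"
```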
So there is another thread today to which master R programmer Bill
Dunlap has offered this strategy (with minor modifications to your
situation by me):
###
f1 <- function (n, fileName) {
    unlink(fileName)
    system.time({
        fileConn <- file(fileName, "wt")
        on.exit(close(fileConn))
        for (i in seq_len(n)) cat(table1[i, 1], " ",
                                  table2[i, 5],
                                  "\n", file = fileConn)
    })
}
f1(1000000, 'out.txt')
###
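The loop above calls cat() once per row. A vectorized variant, which is not from the thread and is offered only as a sketch, builds all lines with paste() and writes them with a single writeLines() call. It assumes both columns are plain atomic vectors; x and y below are made-up stand-ins for table1[, 1] and table2[, 5].

```r
# Hypothetical stand-ins for table1[, 1] and table2[, 5]
x <- seq_len(50000)
y <- sample(letters, 50000, replace = TRUE)

out_file <- tempfile(fileext = ".txt")  # illustrative output path

# paste() builds one "x y" string per row (no matrix is created);
# writeLines() then writes the whole character vector in one call
writeLines(paste(x, y), con = out_file)
```

Numbers are formatted by as.character() here, so the output may differ slightly from what write.table() would produce for the same values.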

David Winsemius, MD
West Hartford, CT


Hello,
I believe this function can handle a problem of that size, or bigger.
It does NOT create the full matrix, just writes it to a file, a certain number of lines at a time.
write.big.matrix <- function(x, y, outfile, nmax=1000){
    if(file.exists(outfile)) unlink(outfile)
    testf <- file(outfile, "at")  # or "wt" - "write text"
    on.exit(close(testf))
    step <- nmax  # how many at a time
    inx <- seq(1, length(x), by=step)  # index into 'x' and 'y'
    mat <- matrix(0, nrow=step, ncol=2)  # create a work matrix
    # do it 'nmax' rows per iteration
    for(i in inx){
        mat <- cbind(x[i:(i+step-1)], y[i:(i+step-1)])
        write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
    # and now the remainder
    mat <- NULL
    mat <- cbind(x[(i+1):length(x)], y[(i+1):length(y)])
    write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
    # return the output filename
    outfile
}
x <- 1:1e6  # a numeric vector
y <- sample(letters, 1e6, replace=TRUE)  # and a character vector
length(x);length(y)  # of the same length
fl <- "test.txt"  # output file
system.time(write.big.matrix(x, y, outfile=fl))
On my system it takes (sample output)
   user  system elapsed
   1.59    0.04    1.65
and can handle different types of data. In the example, numeric and character.
If you also need the matrix, try to use 'cbind' first, without writing to a file.
If it's still slow, adapt the code above to keep inserting chunks in an output matrix.
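A rough sketch of that last suggestion, filling a preallocated output matrix chunk by chunk. The helper name cbind_in_chunks is made up for illustration, and the code assumes x and y are non-empty atomic vectors of equal length. The final matrix is as large as the one cbind() would build, so this changes how the matrix is filled, not how much memory it needs in the end.

```r
# Hypothetical helper (not from the thread): fill a preallocated matrix
# nmax rows at a time instead of calling cbind() on the full vectors.
cbind_in_chunks <- function(x, y, nmax = 1000) {
  stopifnot(length(x) == length(y), length(x) > 0)
  n <- length(x)
  # An NA of the common type of x and y, so the matrix is allocated
  # once with the right storage mode (e.g. character for mixed input)
  fill <- c(x[0], y[0])[1]
  out <- matrix(fill, nrow = n, ncol = 2)
  for (s in seq(1, n, by = nmax)) {
    idx <- s:min(s + nmax - 1, n)
    out[idx, 1] <- x[idx]
    out[idx, 2] <- y[idx]
  }
  out
}

x <- 1:2500
y <- sample(letters, 2500, replace = TRUE)
z <- cbind_in_chunks(x, y, nmax = 1000)
dim(z)
```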
Rui Barradas


On Jan 6, 2012, at 12:39 PM, Marc Schwartz wrote:
> On Jan 6, 2012, at 11:43 AM, Mary Kindall wrote:
>
>> z = cbind(x,y) # hangs the machine
>>
>> write.table(z,'out.txt')
>
Apologies, I misread where the hang-up was. It is in the use of cbind() prior to calling write.table(), not in write.table() itself.
I am not sure why that part is taking a long time unless, as already mentioned, you are short on available memory. This runs quickly for me:
> x <- matrix(rnorm(1000000 * 3), ncol = 3)
> y <- matrix(rnorm(1000000 * 3), ncol = 3)
> system.time(z <- cbind(x, y))
   user  system elapsed
  0.039   0.025   0.065
> str(z)
 num [1:1000000, 1:6] 0.5102 1.8776 2.4635 0.2982 0.0901 ...
To give an example with two data frames containing differing data types, let's use the built-in 'iris' data set, which has 5 columns and 150 rows by default. Let's create a new version with over a million rows:
> iris.new <- iris[rep(seq(nrow(iris)), 7000), ]
> str(iris.new)
'data.frame': 1050000 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> system.time(iris.new2 <- cbind(iris.new, iris.new))
   user  system elapsed
  5.289   0.282   5.658
> str(iris.new2)
'data.frame': 1050000 obs. of 10 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
You might verify the structures of your 'x' and 'y' to be sure that there is not something amiss with either one.
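One way to run that check, sketched on a tiny made-up data frame (table1 here is a hypothetical stand-in, not the poster's data). It also shows one thing that can go quietly wrong: the default cbind() method discards classes, so a factor contributes its internal integer codes rather than its labels.

```r
# Hypothetical stand-in for one of the poster's tables
table1 <- data.frame(id = factor(c("b", "a", "c")),
                     value = c(1.5, 2.5, 3.5))

x <- table1[1:3, 1]
y <- table1[1:3, 2]
str(x)   # a factor, not a plain character vector
str(y)   # a numeric vector

# The factor is replaced by its integer codes in the result
m_codes <- cbind(x, y)
m_codes

# Converting explicitly keeps the labels (as a character matrix)
m_labels <- cbind(as.character(x), y)
m_labels
```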
HTH,
Marc Schwartz


Sorry Mary,
My function would write the remainder twice; I had only tested it
with multiples of the chunk size
(and without looking at the lengthy output carefully).
Now checked:
write.big.matrix <- function(x, y, outfile, nmax=1000){
    if(file.exists(outfile)) unlink(outfile)
    testf <- file(outfile, "at")  # or "wt" - "write text"
    on.exit(close(testf))
    step <- nmax  # how many at a time
    inx <- seq(1, length(x)-step, by=step)  # index into 'x' and 'y'
    mat <- matrix(0, nrow=step, ncol=2)  # create a work matrix
    # do it 'nmax' rows per iteration
    for(i in inx){
        mat <- cbind(x[i:(i+step-1)], y[i:(i+step-1)])
        write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
    # and now the remainder
    # ('<=' rather than '<', so that a single leftover row is still written)
    if(i+step <= length(x)){
        mat <- NULL
        mat <- cbind(x[(i+step):length(x)], y[(i+step):length(y)])
        write.table(mat, file=testf, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
    # return the output filename
    outfile
}
x <- 1:(1e6 + 1234)  # a numeric vector
y <- sample(letters, 1e6 + 1234, replace=TRUE)  # and a character vector
length(x);length(y)  # of the same length
fl <- "test.txt"  # output file
system.time(write.big.matrix(x, y, outfile=fl, nmax=100))
   user  system elapsed
   3.04    0.06    3.09
system.time(write.big.matrix(x, y, outfile=fl))
   user  system elapsed
   1.64    0.12    1.76
Rui Barradas


What is it you want to do with the data after you save it? Are you
just going to read it back into R? If so, consider using save/load.
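A minimal sketch of the save/load route, using a small made-up matrix in place of the real object and an illustrative temporary file path:

```r
z <- cbind(x = 1:5, y = (1:5)^2)            # small stand-in for the real object
rdata_file <- tempfile(fileext = ".RData")  # illustrative path

save(z, file = rdata_file)  # binary format, no text formatting of each value
rm(z)
load(rdata_file)            # restores the object under its original name 'z'
str(z)
```

Because the object is stored in R's own binary format, reading it back does not involve parsing text, which is the point of the suggestion when the file is only ever going to be read by R.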
On Fri, Jan 6, 2012 at 12:43 PM, Mary Kindall < [hidden email]> wrote:
> Is there any alternate way to perform cbind?

Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

