data.table is asking for help


data.table is asking for help

Ron Hylton

The code below generates the warning:

 

In setkeyv(x, cols, verbose = verbose) :

  Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

 

This is my first attempt at using data.table so I probably did something dumb, but maybe that's useful for someone.  The first case is the one that gives the warnings.

 

I’m also surprised at the timings.  I wrote the original algorithm using data.frame and ddply, and I expected data.table to be substantially faster; the opposite is true.

 

The algorithm does the following:  Certain columns in the table are keys and others are values, in the sense that each row with the same set of keys should have the same set of values.  Find all the key sets for which this is not true and return the key sets + conflicting value sets.

 

Insight into the performance would be appreciated.

 

Regards,

Ron

 

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: data.table is asking for help

Arunkumar Srinivasan

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you’re setting a key on .SD, which should normally not be allowed. What happens is that when you set the key the first time, there’s no key set (here) and therefore the key is set on all the columns x1, x2 and x3.

Now, when the next group (in the by= loop) is passed to your function, .SD still has its key set to x1,x2,x3 (because setkey modifies the object by reference), but it has obtained new data corresponding to this group. setkey then sees that the key is already set, in which case the row order should already be 1:n. It isn’t, because the new group’s data isn’t sorted, so data.table rebuilds the key and warns. That’s why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  setattr(f, 'sorted', NULL)
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we reset the key of f (which is the same object as .SD, as it’s only modified by reference) to NULL every time afterwards, so that .SD for the next group will not have the key set.

The ideal scenario here, IIUC, is that setkey(.SD), or setkey on anything pointing to .SD, should not be possible (locking the binding doesn’t seem to affect things done by reference). .SD should however retain the key of the data.table, if a key was set, wherever possible.
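
For illustration, one way to sidestep the problem is to set the key on a copy of the group rather than on .SD itself. This is only a sketch: the function name conflictsTableCopy and the toy table are made up for this example, and copying every group costs extra time and memory.

```r
library(data.table)

# Hypothetical variant of conflictsTable1: key a copy, never .SD itself
conflictsTableCopy <- function(f) {
  g <- copy(f)      # deep copy, so .SD is left untouched
  setkey(g)         # key (and sort) the copy on all of its columns
  u <- unique(g)
  if (nrow(u) == 1L) return(NULL)
  u
}

# Toy data (an assumption): id "a" has conflicting values, id "b" does not
small <- data.table(id = c("a", "a", "b", "b"), x1 = c(1, 2, 3, 3))
setkey(small, id)

res <- small[, conflictsTableCopy(.SD), by = id]
print(res)
```

Only the rows for id "a" should come back, since the two rows for "b" are identical.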


Arun


Re: data.table is asking for help

Ron Hylton

I suspected it was something like this.  As one clarification, there is a setkey(test,id) before any setkey(.SD).  If setkey(test,id) is changed to setkey(test), so that all columns are in the original data.table key, then the warning goes away.

 

However there’s another aspect.  While I’m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

 


Re: data.table is asking for help

Arunkumar Srinivasan

> However there’s another aspect.  While I’m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory even). Therefore, as a design feature, it trades away "referential transparency" in order to manipulate data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time they go hand-in-hand).

This is perhaps the biggest design choice one needs to be aware of when choosing and working with data.tables. It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference. The only other way, outside the "set*" functions, is the `:=` operator.
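
A small sketch of what by-reference modification means for function arguments. The function names sort_by_ref and sort_on_copy are made up for this example; setkey, key and copy are the usual data.table functions.

```r
library(data.table)

# Hypothetical helper: setkey() reorders the caller's table as well
sort_by_ref <- function(f) {
  setkey(f, a)
  invisible(NULL)
}

# Hypothetical helper: work on a copy so the caller's table is untouched
sort_on_copy <- function(f) {
  g <- copy(f)
  setkey(g, a)
  g
}

dt1 <- data.table(a = c(3, 1, 2))
sort_by_ref(dt1)
print(dt1$a)     # dt1 itself is now sorted and keyed by a
print(key(dt1))

dt2 <- data.table(a = c(3, 1, 2))
sorted <- sort_on_copy(dt2)
print(dt2$a)     # dt2 keeps its original order and has no key
print(sorted$a)
```

So if you want the usual R semantics inside a function, take a copy() first and pay the cost of the copy.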


HTH

Arun



Re: data.table is asking for help

Ron Hylton

The performance is what puzzles me; the results are correct so the warnings don’t matter, and not all the variations I’ve tried have warnings.  On the real dataset (~800,000 rows) data.table takes about 1.5 times longer than data.frame + ddply.  I expected it to be substantially faster.

 


Re: data.table is asking for help

Arunkumar Srinivasan

The j-expression is evaluated from within C for each group (unless it’s optimised with GForce, a new initiative in data.table), and eval(.SD) or eval(anything(.SD)) is costly.

You can get around it by listing the columns yourself and using .I instead, as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
#  0.140   0.001   0.142  

Takes about 0.14 seconds.
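
In case the .I idiom is unfamiliar, here is a minimal sketch on a made-up toy table (the table and the column names g and v are assumptions for illustration). The inner call returns row numbers of the original table per group, and the outer call subsets by them, so .SD is never materialised.

```r
library(data.table)

toy <- data.table(g = c("a", "a", "b"), v = c(1, 5, 3))

# .I holds the row numbers (in toy) of the rows in the current group;
# here we pick, per group, the row number of the largest v
idx <- toy[, .I[which.max(v)], by = g]$V1

# Subset the original table once, using the collected row numbers
print(toy[idx])
```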


An even faster way is:

system.time({
ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)  
ans = ans[, .N, by=names(ans)]                  # (2)  
ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)
})

#  0.026   0.000   0.027  

The idea for the second case is:

(1) Remove all entries where there’s just one row corresponding to that id.
(2) Aggregate this result by all the columns, which collapses identical rows into one; the row count ends up in the column N (we won’t need this column though).
(3) Now aggregate by id. If an id has just one row at this point, it originally had more than one row (the filtering in step (1) ensures this), but all of them were identical, so we don’t need it. We therefore keep only the ids where .N > 1L.
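
To make the three steps concrete, here they are traced on a tiny made-up table (the toy data and the names step1, step2, step3 are assumptions for illustration; the expressions are the ones above).

```r
library(data.table)

# id "a" has conflicting values, "b" has identical rows, "c" has one row
toy <- data.table(
  id = c("a", "a", "b", "b", "c"),
  x1 = c(1, 2, 3, 3, 4)
)

# (1) drop ids that occur only once ("c")
step1 <- toy[toy[, .I[.N > 1L], by = id]$V1]

# (2) collapse identical rows; N counts how often each distinct row occurred
step2 <- step1[, .N, by = names(step1)]

# (3) keep ids that still have more than one distinct row ("a")
step3 <- step2[step2[, .I[.N > 1L], by = id]$V1]
print(step3)
```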

HTH


Arun


Re: data.table is asking for help

Arunkumar Srinivasan

A slightly simpler version of the 2nd solution is:

system.time({
ans = test[, .N, by=names(test)]
ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
#  0.019   0.000   0.019  

The answers are identical; you can check this with:

ans[, N := NULL]
setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE


Arun

From: Arunkumar Srinivasan [hidden email]
Reply: Arunkumar Srinivasan [hidden email]
Date: June 14, 2014 at 4:34:15 AM
To: Ron Hylton [hidden email], [hidden email] [hidden email]
Subject:  Re: [datatable-help] data.table is asking for help

The j-expression is evaluated from within C for each group (unless they’re optimised with GForce - a new initiative in data.table). And eval(.SD) or eval(anything(.SD)) is costly.

You can get around it by listing the columns by yourself and using .I instead, as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
#  0.140   0.001   0.142   

Takes about 0.14 seconds.


An even faster way is:

system.time({
ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)   
ans = ans[, .N, by=names(ans)]                  # (2)   
ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)
})

#  0.026   0.000   0.027   

The idea for the second case is:

(1) remove all entries where there’s just 1 row corresponding to that id.
(2) Aggregate this result by all the columns now and get the number of rows in the column N (we won’t have to use this column though).
(3) Now, if we aggregate by id and if any id has just 1 row, then it’d mean that that id has had more than 1 rows (step (1) filtering ensures this), but all of them are same and we don’t need them. So we just filter for those where .N > 1L.

HTH


Arun

From: Ron Hylton [hidden email]
Reply: Ron Hylton [hidden email]
Date: June 14, 2014 at 3:30:55 AM
To: [hidden email] [hidden email]
Subject:  Re: [datatable-help] data.table is asking for help

The performance is what puzzles me; the results are correct so the warnings don’t matter, and not all the variations I’ve tried have warnings.  On the real dataset (~800,000 rows) datatable takes about 1.5 times longer than dataframe + ddply.  I expected it to be substantially faster.

 

From: Arunkumar Srinivasan [mailto:[hidden email]]
Sent: Friday, June 13, 2014 8:57 PM
To: Ron Hylton; [hidden email]
Subject: Re: [datatable-help] data.table is asking for help

 

However there’s another aspect.  While I’m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory, even). Therefore, as a design feature, it trades in "referential transparency" for manipulating data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time they go hand in hand).

This is perhaps the biggest design choice one needs to be aware of when choosing and working with data.tables. It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference, and the only non-"set*" function that does so is the `:=` operator.
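
As a small illustration of what modifying by reference means in practice, here is a minimal sketch. The table, column and function names are made up for the example. It shows `:=` changing a table through a second name and from inside a function, and `copy()` as the explicit way to opt out.

```r
library(data.table)

DT <- data.table(a = 1:3)

# Plain assignment does not copy: both names refer to the same table.
alias <- DT
alias[, b := a * 2L]          # := adds the column by reference
print(names(DT))              # DT has gained column b as well

# copy() makes an independent table, so DT is left alone.
safe <- copy(DT)
safe[, extra := 0L]
print(names(DT))              # no column named extra here

# The same holds for function arguments (add_flag is a hypothetical helper).
add_flag <- function(d) {
  d[, flag := TRUE]           # modifies the caller's table
  invisible(NULL)
}
add_flag(DT)
print(names(DT))              # DT now has column flag
```

This is why calling setkey on a function argument that is really .SD affects what data.table sees for the next group.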

 

HTH

Arun


From: Ron Hylton [hidden email]
Reply: Ron Hylton [hidden email]
Date: June 14, 2014 at 2:52:04 AM
To: [hidden email] [hidden email]
Subject:  Re: [datatable-help] data.table is asking for help



I suspected it was something like this.  As one clarification, there is a setkey(test,id) before any setkey(.SD).   If setkey(test,id) is changed to setkey(test) so all columns are in the original datatable key then the warning goes away.

 

However there’s another aspect.  While I’m relatively new to R my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.

 

From: Arunkumar Srinivasan [[hidden email]]
Sent: Friday, June 13, 2014 8:23 PM
To: Ron Hylton; [hidden email]
Subject: Re: [datatable-help] data.table is asking for help

 

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you’re setting a key on .SD, which should normally not be allowed. The first time you call setkey, there’s no key set (here), and therefore the key is set on all the columns: x1, x2 and x3.

Now, when the next group (in the by=) is passed to your function, it already has the key set to x1,x2,x3 (because setkey modifies the object by reference), but .SD now holds new data corresponding to this group. data.table goes to sort this data and finds the key already set; if the key is set, then the order must already be 1:n. But it isn’t, as this data isn’t sorted. data.table warns in those scenarios, and that’s why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))        # setkey(f) keys .SD itself, by reference
  setattr(f, 'sorted', NULL)    # drop the key again before the next group
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we reset the key of f (which is the same object as .SD, as it’s only modified by reference) to NULL every time afterwards, so that .SD for the next group will not have the key set.
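
To make the role of the 'sorted' attribute concrete, here is a minimal sketch on a made-up one-column table (not the data from this thread). It only assumes that key() reports whatever is stored in that attribute.

```r
library(data.table)

DT <- data.table(x = c(3, 1, 2))

setkey(DT, x)                  # sorts DT in place and records the key
key_before <- key(DT)
print(key_before)

setattr(DT, "sorted", NULL)    # remove the key marker, by reference
key_after <- key(DT)
print(key_after)

print(DT$x)                    # the rows stay in sorted order; only the marker is gone
```

The marker and the actual row order are independent pieces of state, which is how .SD could end up claiming a key that its freshly loaded rows did not satisfy.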

The ideal scenario here, IIUC, is that setkey(.SD), or setkey on anything pointing to .SD, should not be possible (locking the binding doesn’t seem to affect things done by reference). .SD should, however, retain the key of the data.table, if a key was set, wherever possible.

 

Arun




Re: data.table is asking for help

Arunkumar Srinivasan

Sorry. But we can simplify it even further:

The first step (aggregating by all the columns) is just unique(test). So, we can do:

system.time({
ans = unique(test)
ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
#  0.016   0.000   0.016  

Identical?

setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE
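
For readers who want to try the idea on something small, here is a hedged sketch on made-up toy data. It passes `by = names(toy)` to unique() explicitly, on the assumption that the columns used for de-duplication by default may differ between data.table releases (older ones, as far as I recall, used the key columns).

```r
library(data.table)

# Toy data: id "a" has two different rows, id "b" has two identical rows,
# id "c" has a single row.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4))

u <- unique(toy, by = names(toy))                # one row per distinct (id, x1)
conflicts <- u[u[, .I[.N > 1L], by = id]$V1]     # ids with more than one distinct row
print(conflicts)
```

Only id "a" survives, since it is the only id whose rows disagree.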


Arun

From: Arunkumar Srinivasan [hidden email]
Reply: Arunkumar Srinivasan [hidden email]
Date: June 14, 2014 at 4:42:31 AM
To: Ron Hylton [hidden email], [hidden email] [hidden email]
Subject:  Re: [datatable-help] data.table is asking for help

A slightly simpler version of the 2nd solution is:

system.time({
ans = test[, .N, by=names(test)]
ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
#  0.019   0.000   0.019   

The answers are identical; you can check this by doing:

ans[, N := NULL]
setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE


Arun

From: Arunkumar Srinivasan [hidden email]
Reply: Arunkumar Srinivasan [hidden email]
Date: June 14, 2014 at 4:34:15 AM
To: Ron Hylton [hidden email], [hidden email] [hidden email]
Subject:  Re: [datatable-help] data.table is asking for help

The j-expression is evaluated from within C for each group (unless it is optimised with GForce, a new initiative in data.table), and eval(.SD) or eval(anything(.SD)) is costly.

You can get around it by listing the columns yourself and using .I instead, as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
#  0.140   0.001   0.142    


Takes about 0.14 seconds.
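
One caveat, sketched on made-up toy data: base unique() applied to a list of columns compares whole columns rather than rows, so taking `[[1L]]` of the result gives back the first column, and the condition above amounts to a test on group size. With random numeric data, where identical rows essentially never occur, the two tests agree. The variant below counts distinct rows with uniqueN(), which is an assumption about the data.table version in use (it exists in current releases), and it is not tuned for speed.

```r
library(data.table)

toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4),
                  x2 = c(5, 5, 6, 6, 7))

# Condition as written in the thread: effectively "group has more than one row".
by_size <- toy[toy[, .I[length(unique(list(x1, x2))[[1L]]) > 1L], by = id]$V1]

# Variant that counts distinct (x1, x2) rows within each group.
by_rows <- toy[toy[, .I[uniqueN(data.table(x1, x2)) > 1L], by = id]$V1]

print(by_size$id)   # includes id "b", whose two rows are identical
print(by_rows$id)   # only id "a"
```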


An even faster way is:

system.time({
ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)    
ans = ans[, .N, by=names(ans)]                  # (2)    
ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)
})

#  0.026   0.000   0.027    


The idea for the second case is:

(1) Remove all entries where there’s just one row for that id.
(2) Aggregate this result by all the columns, which collapses identical rows and stores their count in the column N (we won’t need this column, though).
(3) Now aggregate by id. If an id has just one row at this point, it originally had more than one row (the filtering in step (1) ensures this), but all of them were identical, so we don’t need it. So we keep only the ids where .N > 1L.
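
The three steps can be traced on a tiny made-up table (the data below is for illustration only and is not the test set from this thread):

```r
library(data.table)

# id "a": two different rows; id "b": two identical rows; id "c": one row.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4))

ans <- toy[toy[, .I[.N > 1L], by = id]$V1]    # (1) drops id "c"
step1_rows <- nrow(ans)

ans <- ans[, .N, by = names(ans)]             # (2) collapses the two "b" rows into one
step2_rows <- nrow(ans)

ans <- ans[ans[, .I[.N > 1L], by = id]$V1]    # (3) drops id "b", keeps id "a"
ans[, N := NULL]                              # the count column is not needed
print(ans)
```

After step (1) there are four rows, after step (2) three, and after step (3) only the two conflicting rows for id "a" remain.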

HTH


Arun


Re: data.table is asking for help

Ron Hylton

Thanks, that’s very helpful.

 


Re: data.table is asking for help

Matthew Dowle

Hi Ron,

Thanks for highlighting this.  Two changes now in v1.9.3 on GitHub:

  • setkey on .SD is now an error, rather than warnings for each group about rebuilding the key. The new error is similar to when attempting to use := in a .SD subquery: ".SD is locked. Using set*() functions on .SD is reserved for possible future use; a tortuously flexible way to modify the original data by group." Thanks to Ron Hylton for highlighting the issue on datatable-help here.

  • Looping calls to unique(DT), such as in DT[,unique(.SD),by=group], are now faster because they avoid the internal overhead of calling [.data.table. Thanks again to Ron Hylton for highlighting this in the same thread. His example is reduced from 28 sec to 9 sec, with identical results.


I now get the following (on my slow netbook) with no changes to your code.

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))   #  were warnings,    now error
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))   #  was 28s, now 9s
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))   # 13s

This just fixes the surprises, basically.   Clearly, Arun's way of using data.table is better and orders of magnitude faster.
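
For code that really does need a keyed copy of each group, one workaround is to key a copy of .SD instead of .SD itself. The sketch below uses made-up data and assumes that copy() returns an unlocked table; it is expected to be slow on many groups, since every group is copied.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"),
                 v = c(2, 1, 3))

# setkey() is applied to a fresh copy of each group's data, not to .SD.
sorted_by_group <- DT[, setkey(copy(.SD), v), by = g]
print(sorted_by_group)

# DT itself is left untouched: no key, original row order.
print(key(DT))
print(DT$v)
```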

Matt



Re: data.table is asking for help

Michael Smith
Hi Matt,

There was recently another discussion on using setkey on .SD here:

  http://r.789695.n4.nabble.com/setkey-on-SD-td4690283.html

So the following code won't work any more in the current 1.9.3 dev
version. I think the idea of using setkey in a "chain" of data.tables
was nice, since it allows setting the key temporarily.

The basic idea is taken from the comment here:


http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917


A <-
  data.table(
    x = c(1, 2, 3, 4, 5),
    y = letters[1:5])
B <-
  data.table(
    x = c(1, 2, 3, 1, 4),
    f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
    z = 101:105)
B[, setkey(.SD, x)][
  , .SD[A, roll = TRUE, rollends = FALSE], by = f][
    , setkey(.SD, x)]
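
With setkey on .SD now an error, roughly the same chain can be written by keying a copy and naming the join column explicitly. This is only a sketch: it assumes a data.table version newer than the 1.9.3 discussed here, one that supports the on= argument, and the name res is just for illustration.

```r
library(data.table)

A <- data.table(x = c(1, 2, 3, 4, 5), y = letters[1:5])
B <- data.table(x = c(1, 2, 3, 1, 4),
                f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
                z = 101:105)

# Key a copy so that B itself is left unkeyed (the "temporary key" idea).
res <- setkey(copy(B), x)[
  # on = "x" names the join column, so .SD does not need a key of its own
  , .SD[A, on = "x", roll = TRUE, rollends = FALSE], by = f]

# The result of the chain is a new data.table, so keying it directly is fine.
setkey(res, x)
print(res)
```

The copy costs memory, so for a large table it may be preferable to simply setkey(B, x) once and leave it keyed.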


Thanks,

M



On 06/18/2014 01:03 AM, Matt Dowle wrote:

>
> Hi Ron,
>
> Thanks for highlighting this.  Two changes now in v1.9.3 on GitHub:
>
>   *
>
>     |setkey| on |.SD| is now an error, rather than warnings for each
>     group about rebuilding the key. The new error is similar to when
>     attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using
>     set*() functions on .SD is reserved for possible future use; a
>     tortuously flexible way to modify the original data by
>     group."| Thanks to Ron Hylton for highlighting the issue on
>     datatable-help here
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>
>   *
>
>     Looping calls to |unique(DT)| such as
>     in |DT[,unique(.SD),by=group]| is now faster by avoiding internal
>     overhead of calling |[.data.table|. Thanks again to Ron Hylton for
>     highlighting in the same thread
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>     His example is reduced from 28 sec to 9 sec, with identical results.
>
>
> I now get the following (on my slow netbook) with no changes to your code.
>
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))   # were warnings, now error
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))   # was 28s, now 9s
> print(system.time(uf <- ddply(test, .(id), conflictsFrame)))     # 13s
>
> This just fixes the surprises, basically.   Clearly Arun uses data.table
> in a better way which is orders of magnitude faster.
>
> Matt
>
>
> On 14/06/14 03:58, Ron Hylton wrote:
>>
>> Thanks, that's very helpful.
>>
>>  
>>
>> *From:*Arunkumar Srinivasan [mailto:[hidden email]]
>> *Sent:* Friday, June 13, 2014 10:46 PM
>> *To:* Ron Hylton; [hidden email]
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>  
>>
>> Sorry. But we can simplify it even further:
>>
>> The first step is just |unique(test)|. So, we can do:
>>
>> |system.time({|
>> |ans = unique(test)|
>> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>> |})|
>> |#  0.016   0.000   0.016  |
>>
>> Identical?
>>
>> |setkey(ans)|
>> |setkey(ut1)|
>> |identical(ans, ut1) # [1] TRUE|
>>
>>  
>>
>> Arun
>>
>>
>> From: Arunkumar Srinivasan [hidden email]
>> <mailto:[hidden email]>
>> Reply: Arunkumar Srinivasan [hidden email]
>> <mailto:[hidden email]>
>> Date: June 14, 2014 at 4:42:31 AM
>> To: Ron Hylton [hidden email] <mailto:[hidden email]>,
>> [hidden email]
>> <mailto:[hidden email]>
>> [hidden email]
>> <mailto:[hidden email]>
>> Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>     A slightly simpler version of the 2nd solution is:
>>
>>     |system.time({|
>>
>>     |ans = test[, .N, by=names(test)]|
>>
>>     |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>>
>>     |})|
>>
>>     |#  0.019   0.000   0.019   |
>>
>>      
>>
>>     The answers are identical, you can check this by doing:
>>
>>     |ans[, N := NULL]|
>>
>>     |setkey(ans)|
>>
>>     |setkey(ut1)|
>>
>>     |identical(ans, ut1) # [1] TRUE|
>>
>>      
>>
>>      
>>
>>     Arun
>>
>>
>>     From: Arunkumar Srinivasan [hidden email]
>>     <mailto:[hidden email]>
>>     Reply: Arunkumar Srinivasan [hidden email]
>>     <mailto:[hidden email]>
>>     Date: June 14, 2014 at 4:34:15 AM
>>     To: Ron Hylton [hidden email] <mailto:[hidden email]>,
>>     [hidden email]
>>     <mailto:[hidden email]>
>>     [hidden email]
>>     <mailto:[hidden email]>
>>     Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>         The j-expression is evaluated from within C for each group
>>         (unless they’re optimised with GForce - a new initiative in
>>         data.table). And |eval(.SD)| or |eval(anything(.SD))| is costly.
>>
>>         You can get around it by listing the columns by yourself and
>>         using |.I| instead, as follows:
>>
>>         |test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]|
>>
>>         |#  0.140   0.001   0.142    |
>>
>>          
>>
>>          
>>
>>         Takes about 0.14 seconds.
>>
>>         ------------------------------------------------------------------------
>>
>>         An even faster way is:
>>
>>         |system.time({|
>>
>>         |ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)    |
>>
>>         |ans = ans[, .N, by=names(ans)]                  # (2)    |
>>
>>         |ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)|
>>
>>         |})|
>>
>>         | |
>>
>>         |#  0.026   0.000   0.027    |
>>
>>          
>>
>>          
>>
>>         The idea for the second case is:
>>
>>         (1) remove all entries where there’s just 1 row corresponding
>>         to that |id|.
>>         (2) Aggregate this result by all the columns now and get the
>>         number of rows in the column |N| (we won’t have to use this
>>         column though).
>>         (3) Now, if we aggregate by |id| and any id has just 1 row,
>>         it means that this |id| originally had more than 1 row (the
>>         filtering in step (1) ensures this), but all of them are
>>         identical and we don’t need them. So we just keep those
>>         where .N > 1L.
>>
>>         HTH
>>
>>          
>>
>>         Arun
>>
>>
>>         From: Ron Hylton [hidden email] <mailto:[hidden email]>
>>         Reply: Ron Hylton [hidden email] <mailto:[hidden email]>
>>         Date: June 14, 2014 at 3:30:55 AM
>>         To: [hidden email]
>>         <mailto:[hidden email]>
>>         [hidden email]
>>         <mailto:[hidden email]>
>>         Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>             The performance is what puzzles me; the results are
>>             correct so the warnings don’t matter, and not all the
>>             variations I’ve tried have warnings.  On the real dataset
>>             (~800,000 rows) datatable takes about 1.5 times longer
>>             than dataframe + ddply.  I expected it to be substantially
>>             faster.
>>
>>              
>>
>>             *From:* Arunkumar Srinivasan [mailto:[hidden email]]
>>             *Sent:* Friday, June 13, 2014 8:57 PM
>>             *To:* Ron Hylton;
>>             [hidden email]
>>             <mailto:[hidden email]>
>>             *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>              
>>
>>                 However there’s another aspect.  While I’m relatively
>>                 new to R my understanding is that a function argument
>>                 should be modifiable within the function body without
>>                 affecting the caller, which perhaps conflicts with the
>>                 behavior of .SD.
>>
>>             `data.table` is designed with *really large* data sets
>>             in mind (> 100 or 200 GB in memory even). Therefore,
>>             as a design feature, it trades in "referential
>>             transparency" for manipulating data objects *as
>>             efficiently as possible* in terms of both *speed* and
>>             *memory usage* (most of the time they go hand-in-hand).
>>
>>             This is perhaps the biggest design choice one needs to be
>>             aware of when working with/choosing data.tables. It is
>>             possible to modify objects by reference using data.table:
>>             all the functions that begin with "set*" modify objects by
>>             reference. The only other function that does so is the
>>             `:=` operator.
>>
>>              
>>
>>             HTH
>>
>>             Arun
>>
>>
>>             From: Ron Hylton [hidden email]
>>             <mailto:[hidden email]>
>>             Reply: Ron Hylton [hidden email]
>>             <mailto:[hidden email]>
>>             Date: June 14, 2014 at 2:52:04 AM
>>             To: [hidden email]
>>             <mailto:[hidden email]>
>>             [hidden email]
>>             <mailto:[hidden email]>
>>             Subject:  Re: [datatable-help] data.table is asking for help
>>
>>              
>>
>>                 I suspected it was something like this.  As one
>>                 clarification, there is a setkey(test,id) before any
>>                 setkey(.SD).   If setkey(test,id) is changed to
>>                 setkey(test) so all columns are in the original
>>                 datatable key then the warning goes away.
>>
>>                  
>>
>>                 However there’s another aspect.  While I’m relatively
>>                 new to R my understanding is that a function argument
>>                 should be modifiable within the function body without
>>                 affecting the caller, which perhaps conflicts with the
>>                 behavior of .SD.
>>
>>                  
>>
>>                 *From:* Arunkumar Srinivasan
>>                 [mailto:[hidden email]]
>>                 *Sent:* Friday, June 13, 2014 8:23 PM
>>                 *To:* Ron Hylton;
>>                 [hidden email]
>>                 <mailto:[hidden email]>
>>                 *Subject:* Re: [datatable-help] data.table is asking
>>                 for help
>>
>>                  
>>
>>                 Nicely reproducible post. Reproducible in v1.9.3
>>                 (latest commit) as well.
>>
>>                 This is a tricky one. It happens because you’re
>>                 setting key on |.SD| which should normally not be
>>                 allowed. What happens is, when you set key the first
>>                 time, there’s no key set (here) and therefore key is
>>                 set on all the columns |x1|, |x2| and |x3|.
>>
>>                 Now, when the next group (in the |by=.|) is passed to
>>                 your function, it’ll have the |key| already set to
>>                 |x1,x2,x3| (because |setkey| modifies the object by
>>                 reference), but |.SD| has obtained *new* data
>>                 corresponding to /this/ group. |data.table| sorts
>>                 this data while believing it is already keyed; if the
>>                 key were valid, the computed order would be 1:n. It
>>                 isn’t, as this data isn’t sorted, and |data.table|
>>                 warns in those scenarios. That’s why you get the
>>                 warning.
>>
>>                 To verify this, you can try:
>>
>>                 |conflictsTable1 <- function(f, address) {|
>>
>>                 |  u <- unique(setkey(f))|
>>
>>                 |  setattr(f, 'sorted', NULL)|
>>
>>                 |  if (nrow(u) == 1) return(NULL)|
>>
>>                 |  u|
>>
>>                 |}|
>>
>>                 Basically, we reset the key of |f| (which is the same
>>                 object as |.SD|, since it’s only modified by
>>                 reference) to |NULL| every time afterwards, so that
>>                 |.SD| for the new group will not have the key set.
>>
>>                 The ideal scenario here, IIUC, is that |setkey(.SD)|
>>                 or things pointing to |.SD| should not be possible
>>                 (locking binding doesn’t seem to affect things done by
>>                 reference..). |.SD| however should retain the key of
>>                 the data.table, if a key was set, wherever possible.
>>
>>                  
>>
>>                 Arun
>>
>>
>>                 From: Ron Hylton [hidden email]
>>                 <mailto:[hidden email]>
>>                 Reply: Ron Hylton [hidden email]
>>                 <mailto:[hidden email]>
>>                 Date: June 14, 2014 at 1:55:53 AM
>>                 To: [hidden email]
>>                 <mailto:[hidden email]>
>>                 [hidden email]
>>                 <mailto:[hidden email]>
>>                 Subject:  [datatable-help] data.table is asking for help
>>
>>                  
>>
>>                     The code below generates the warning:
>>
>>                      
>>
>>                     In setkeyv(x, cols, verbose = verbose) :
>>
>>                       Already keyed by this key but had invalid row
>>                     order, key rebuilt. If you didn't go under the
>>                     hood please let datatable-help know so the root
>>                     cause can be fixed.
>>
>>                      
>>
>>                     This is my first attempt at using datatable so I
>>                     probably did something dumb, but maybe that‘s
>>                     useful for someone.  The first case is the one
>>                     that gives the warnings.
>>
>>                      
>>
>>                     I’m also surprised at the timings.  I wrote the
>>                     original algorithm using dataframe & ddply and I
>>                     expected datatable to be substantially faster; the
>>                     opposite is true.
>>
>>                      
>>
>>                     The algorithm does the following:  Certain columns
>>                     in the table are keys and others are values in the
>>                     sense that each row with the same set of keys
>>                     should have the same set of values.  Find all the
>>                     key sets for which this is not true and return the
>>                     keys sets + conflicting value sets.
>>
>>                      
>>
>>                     Insight into the performance would be appreciated.
>>
>>                      
>>
>>                     Regards,
>>
>>                     Ron
>>
>>                      
>>
>>                     library(data.table)
>>
>>                     library(plyr)
>>
>>                      
>>
>>                     conflictsTable1 <- function(f) {
>>
>>                       u <- unique(setkey(f))
>>
>>                       if (nrow(u) == 1) return(NULL)
>>
>>                       u
>>
>>                     }
>>
>>                      
>>
>>                     conflictsTable2 <- function(f) {
>>
>>                       u <- unique(f)
>>
>>                       if (nrow(u) == 1) return(NULL)
>>
>>                       u
>>
>>                     }
>>
>>                      
>>
>>                     conflictsFrame <- function(f) {
>>
>>                       u <- unique(f)
>>
>>                       if (nrow(u) == 1) return(NULL)
>>
>>                       u
>>
>>                     }
>>
>>                      
>>
>>                     N <- 10000
>>
>>                     test <-
>>                     data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>>                     x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>>
>>                      
>>
>>                     setkey(test,id)
>>
>>                      
>>
>>                     print(system.time(ut1 <- test[,
>>                     conflictsTable1(.SD), by=id]))
>>
>>                      
>>
>>                     print(system.time(ut2 <- test[,
>>                     conflictsTable2(.SD), by=id]))
>>
>>                      
>>
>>                     print(system.time(uf <- ddply(test, .(id),
>>                     conflictsFrame)))
>>