|
I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy in all cases.
> DT.1 <- data.table(x=1, y=1)
> DT.2 <- DT.1 > > # Both DT.1 and DT.2 are changed.
> DT.2[, y := NULL] x [1,] 1
> DT.1 x [1,] 1
> DT.2 x [1,] 1
> > # Only DT.2 is changed > DT.2[, y := x]
x y [1,] 1 1 > DT.1
x [1,] 1 > DT.2
x y [1,] 1 1 _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
Interesting one. Adding columns is a bit different to deleting and
modifying columns. Here's how it works. Could make changes, could document it, or both, what do people think? Just like data.frame there is a list vector holding pointers to the column vectors. A delete column op is done with a memmove to budge up the column pointers above the column by one place. That leaves a gap at the end. The length attribute of that vector (ncol(DT)) is then decremented and the spare 4 bytes (or 8 on 64bit) are left unused at the end. An add column can't be fully by reference because the list vector is full. A new list vector has to be allocated, one slot larger, the old pointers memcpy'd over, and the last spot assigned the pointer to the new column vector. This copying is negligible because it's a small list of pointers fitting well within one page. [Unless, there are many 1000's of columns, which is why it's done as efficiently as possible using memcpy]. Aside : There is little known (I guess) distinction between length and truelength in R internals. Base R doesn't use it, but we could in data.table. A delete column sets length but leaves truelength one larger. When the next add column comes along, it could just do the budge up and insert the column. That may not be so advantageous for (a small number) of columns, but the same logic could work for insert() and delete()ing rows. Of course, this would mean whether a visible copy or not is taken depends on what happened previously, rather than the syntax. That's something we've disliked before, in the same way we dislike drop=TRUE behaviour and so dropped drop. One way to approach this might be to advise ":= add *may* not copy. Best to assume it doesn't; use copy()". If you get in the habbit of "DT2=copy(DT)" then that'll take a deep copy at the time and you're safe. To illustrate the partial (maybe shallow copy is better word), consider the following : > DT = data.table(1:2,3:4) > DT2=DT > DT2[,y:=10L] V1 V2 y [1,] 1 3 10 [2,] 2 4 10 > DT V1 V2 [1,] 1 3 [2,] 2 4 > DT2 V1 V2 y [1,] 1 3 10 [2,] 2 4 10 > DT2[1,V1:=99L] V1 V2 y [1,] 99 3 10 [2,] 2 4 10 > DT V1 V2 [1,] 99 3 [2,] 2 4 > Matthew On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote: > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy in > all cases. > > > > DT.1 <- data.table(x=1, y=1) > > DT.2 <- DT.1 > > > > # Both DT.1 and DT.2 are changed. > > DT.2[, y := NULL] > x > [1,] 1 > > DT.1 > x > [1,] 1 > > DT.2 > x > [1,] 1 > > > > # Only DT.2 is changed > > DT.2[, y := x] > x y > [1,] 1 1 > > DT.1 > x > [1,] 1 > > DT.2 > x y > [1,] 1 1 > > > _______________________________________________ > datatable-help mailing list > [hidden email] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
From the user's perspective, DT2 <- DT should either be a new copy or a new reference. Anything in between is confusing.
How about this - add a new argument to data.table(), say max.cols. max.cols defaults to a couple orders of magnitude above the initial number of columns. data.table allocates enough memory for max.cols column pointers. If you try to add more than max.cols columns, it is either an error, or it creates a copy and produces a warning.
On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle <[hidden email]> wrote: Interesting one. Adding columns is a bit different to deleting and _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote: > >From the user's perspective, DT2 <- DT should either be a new copy or > a new reference. Anything in between is confusing. Agreed. With picky caveat: even in base it's not at this point the copy is taken. It's later: copy-on-write. It's setkey and := that don't copy on write, not the (earlier) <-. > How about this - add a new argument to data.table(), say max.cols. > max.cols defaults to a couple orders of magnitude above the initial > number of columns. data.table allocates enough memory for max.cols > column pointers. If you try to add more than max.cols columns, it is > either an error, or it creates a copy and produces a warning. Very nice idea. To over allocate by default so that := can add columns fully by reference most of the time seems good to me since there's a very low cost to over allocating the vector of column pointers. Create the (shallow copy) and issue a warning, I'm thinking, not error. The "max.cols" names seems a bit absolute, could it be "alloc.cols"? We could have alloc(DT,2,ncol) or rowalloc(DT,n) and colalloc(DT,n), or realloc(...) so users can over alloc themselves before a loop that adds columns or inserts rows. tables() could also report truenrow, and truencol as well as nrow and ncol. What should alloc.cols be, by default? How about: max(100,2*ncol) What about as.data.table.data.frame()? Should that over-allocate, too, or for speed just change the class attribute as it does now. Maybe checking NAMED would work, in addition. If NAMED was 0, no need to warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the warning be necessary. > > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle > <[hidden email]> wrote: > Interesting one. Adding columns is a bit different to deleting > and > modifying columns. Here's how it works. Could make changes, > could > document it, or both, what do people think? > > Just like data.frame there is a list vector holding pointers > to the > column vectors. A delete column op is done with a memmove to > budge up > the column pointers above the column by one place. That leaves > a gap at > the end. The length attribute of that vector (ncol(DT)) is > then > decremented and the spare 4 bytes (or 8 on 64bit) are left > unused at the > end. > > An add column can't be fully by reference because the list > vector is > full. A new list vector has to be allocated, one slot larger, > the old > pointers memcpy'd over, and the last spot assigned the pointer > to the > new column vector. This copying is negligible because it's a > small list > of pointers fitting well within one page. [Unless, there are > many 1000's > of columns, which is why it's done as efficiently as possible > using > memcpy]. > > Aside : There is little known (I guess) distinction between > length and > truelength in R internals. Base R doesn't use it, but we could > in > data.table. A delete column sets length but leaves truelength > one > larger. When the next add column comes along, it could just do > the budge > up and insert the column. That may not be so advantageous for > (a small > number) of columns, but the same logic could work for > insert() and > delete()ing rows. Of course, this would mean whether a > visible copy or > not is taken depends on what happened previously, rather than > the > syntax. That's something we've disliked before, in the same > way we > dislike drop=TRUE behaviour and so dropped drop. One way to > approach > this might be to advise ":= add *may* not copy. Best to assume > it > doesn't; use copy()". If you get in the habbit of > "DT2=copy(DT)" then > that'll take a deep copy at the time and you're safe. > > To illustrate the partial (maybe shallow copy is better word), > consider > the following : > > > DT = data.table(1:2,3:4) > > DT2=DT > > DT2[,y:=10L] > V1 V2 y > [1,] 1 3 10 > [2,] 2 4 10 > > DT > V1 V2 > [1,] 1 3 > [2,] 2 4 > > DT2 > V1 V2 y > [1,] 1 3 10 > [2,] 2 4 10 > > DT2[1,V1:=99L] > V1 V2 y > [1,] 99 3 10 > [2,] 2 4 10 > > DT > V1 V2 > [1,] 99 3 > [2,] 2 4 > > > > Matthew > > > On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote: > > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a > copy in > > all cases. > > > > > > > DT.1 <- data.table(x=1, y=1) > > > DT.2 <- DT.1 > > > > > > # Both DT.1 and DT.2 are changed. > > > DT.2[, y := NULL] > > x > > [1,] 1 > > > DT.1 > > x > > [1,] 1 > > > DT.2 > > x > > [1,] 1 > > > > > > # Only DT.2 is changed > > > DT.2[, y := x] > > x y > > [1,] 1 1 > > > DT.1 > > x > > [1,] 1 > > > DT.2 > > x y > > [1,] 1 1 > > > > > > > _______________________________________________ > > datatable-help mailing list > > [hidden email] > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle <[hidden email]> wrote:
Hmm, I would prefer for these to have the same behavior.
Fine with me.
Yeah, I think any method of creating a data table should over-allocate. If people want the speed gains, they can set explicitly set alloc.cols.
_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:
> On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle > <[hidden email]> wrote: > > On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote: > > >From the user's perspective, DT2 <- DT should either be a > new copy or > > a new reference. Anything in between is confusing. > > > Agreed. With picky caveat: even in base it's not at this point > the copy > is taken. It's later: copy-on-write. It's setkey and := that > don't copy > on write, not the (earlier) <-. > > > Hmm, I would prefer for these to have the same behavior. Not sure I follow, please expand. > > > > How about this - add a new argument to data.table(), say > max.cols. > > max.cols defaults to a couple orders of magnitude above the > initial > > number of columns. data.table allocates enough memory for > max.cols > > column pointers. If you try to add more than max.cols > columns, it is > > either an error, or it creates a copy and produces a > warning. > > > Very nice idea. To over allocate by default so that := can add > columns > fully by reference most of the time seems good to me since > there's a > very low cost to over allocating the vector of column > pointers. Create > the (shallow copy) and issue a warning, I'm thinking, not > error. The > "max.cols" names seems a bit absolute, could it be > "alloc.cols"? We > could have alloc(DT,2,ncol) or rowalloc(DT,n) and > colalloc(DT,n), or > realloc(...) so users can over alloc themselves before a loop > that adds > columns or inserts rows. tables() could also report truenrow, > and > truencol as well as nrow and ncol. What should alloc.cols be, > by > default? How about: max(100,2*ncol) > > > Fine with me. > > What about as.data.table.data.frame()? Should that > over-allocate, too, > or for speed just change the class attribute as it does now. > > > Yeah, I think any method of creating a data table should > over-allocate. If people want the speed gains, they can set > explicitly set alloc.cols. > > > > Maybe checking NAMED would work, in addition. If NAMED was 0, > no need to > warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - > would the > warning be necessary. > > > > > > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle > > <[hidden email]> wrote: > > Interesting one. Adding columns is a bit different > to deleting > > and > > modifying columns. Here's how it works. Could make > changes, > > could > > document it, or both, what do people think? > > > > Just like data.frame there is a list vector holding > pointers > > to the > > column vectors. A delete column op is done with a > memmove to > > budge up > > the column pointers above the column by one place. > That leaves > > a gap at > > the end. The length attribute of that vector > (ncol(DT)) is > > then > > decremented and the spare 4 bytes (or 8 on 64bit) > are left > > unused at the > > end. > > > > An add column can't be fully by reference because > the list > > vector is > > full. A new list vector has to be allocated, one > slot larger, > > the old > > pointers memcpy'd over, and the last spot assigned > the pointer > > to the > > new column vector. This copying is negligible > because it's a > > small list > > of pointers fitting well within one page. [Unless, > there are > > many 1000's > > of columns, which is why it's done as efficiently as > possible > > using > > memcpy]. > > > > Aside : There is little known (I guess) distinction > between > > length and > > truelength in R internals. Base R doesn't use it, > but we could > > in > > data.table. A delete column sets length but leaves > truelength > > one > > larger. When the next add column comes along, it > could just do > > the budge > > up and insert the column. That may not be so > advantageous for > > (a small > > number) of columns, but the same logic could work > for > > insert() and > > delete()ing rows. Of course, this would mean > whether a > > visible copy or > > not is taken depends on what happened previously, > rather than > > the > > syntax. That's something we've disliked before, in > the same > > way we > > dislike drop=TRUE behaviour and so dropped drop. One > way to > > approach > > this might be to advise ":= add *may* not copy. Best > to assume > > it > > doesn't; use copy()". If you get in the habbit of > > "DT2=copy(DT)" then > > that'll take a deep copy at the time and you're > safe. > > > > To illustrate the partial (maybe shallow copy is > better word), > > consider > > the following : > > > > > DT = data.table(1:2,3:4) > > > DT2=DT > > > DT2[,y:=10L] > > V1 V2 y > > [1,] 1 3 10 > > [2,] 2 4 10 > > > DT > > V1 V2 > > [1,] 1 3 > > [2,] 2 4 > > > DT2 > > V1 V2 y > > [1,] 1 3 10 > > [2,] 2 4 10 > > > DT2[1,V1:=99L] > > V1 V2 y > > [1,] 99 3 10 > > [2,] 2 4 10 > > > DT > > V1 V2 > > [1,] 99 3 > > [2,] 2 4 > > > > > > > Matthew > > > > > > On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji > wrote: > > > I think this is a bug. DT.2 <- DT.1 doesn't seem > to make a > > copy in > > > all cases. > > > > > > > > > > DT.1 <- data.table(x=1, y=1) > > > > DT.2 <- DT.1 > > > > > > > > # Both DT.1 and DT.2 are changed. > > > > DT.2[, y := NULL] > > > x > > > [1,] 1 > > > > DT.1 > > > x > > > [1,] 1 > > > > DT.2 > > > x > > > [1,] 1 > > > > > > > > # Only DT.2 is changed > > > > DT.2[, y := x] > > > x y > > > [1,] 1 1 > > > > DT.1 > > > x > > > [1,] 1 > > > > DT.2 > > > x y > > > [1,] 1 1 > > > > > > > > > > > _______________________________________________ > > > datatable-help mailing list > > > [hidden email] > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle <[hidden email]> wrote:
I would like for DT[, x := foo] and DT$x <- foo to have the same behavior. i.e. if one preserves the reference, so should the other.
_______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
On Fri, 2011-10-28 at 17:59 -0700, Muhammad Waliji wrote:
> On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle > <[hidden email]> wrote: > On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote: > > On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle > > <[hidden email]> wrote: > > > > On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji > wrote: > > > >From the user's perspective, DT2 <- DT should > either be a > > new copy or > > > a new reference. Anything in between is > confusing. > > > > > > Agreed. With picky caveat: even in base it's not at > this point > > the copy > > is taken. It's later: copy-on-write. It's setkey > and := that > > don't copy > > on write, not the (earlier) <-. > > > > > > Hmm, I would prefer for these to have the same behavior. > > > Not sure I follow, please expand. > > > I would like for DT[, x := foo] and DT$x <- foo to have the same > behavior. i.e. if one preserves the reference, so should the other. > Agreed, I would too (at least I did to start with). However, it seems a base R thing that <- copies via `*tmp*` and isn't possible to make it work by reference. See the (long) thread on r-devel here : http://r.789695.n4.nabble.com/speeding-up-perception-tp3640920p3640920.html I'm pretty happy with :=, honestly. It's nice that <-, <<- and := all do different and useful things within j. Changes to the meaning of [<- might not have been backwards compatible, so := provides without forcing. Matthew _______________________________________________ datatable-help mailing list [hidden email] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help |
|
In reply to this post by mhwaliji
Done, committed to v1.7.3 on R-Forge. Matthew |
| Powered by Nabble | Edit this page |
