Quantcast

Copy on assign broken in some cases

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Copy on assign broken in some cases

mhwaliji
I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a copy in all cases.

> DT.1 <- data.table(x=1, y=1)
> DT.2 <- DT.1
> # Both DT.1 and DT.2 are changed.
> DT.2[, y := NULL]
     x
[1,] 1
> DT.1
     x
[1,] 1
> DT.2
     x
[1,] 1
> # Only DT.2 is changed
> DT.2[, y := x]
     x y
[1,] 1 1
> DT.1
     x
[1,] 1
> DT.2
     x y
[1,] 1 1


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

Matthew Dowle
Interesting one. Adding columns is a bit different to deleting and
modifying columns. Here's how it works. Could make changes, could
document it, or both, what do people think?

Just like data.frame there is a list vector holding pointers to the
column vectors. A delete column op is done with a memmove to budge up
the column pointers above the column by one place. That leaves a gap at
the end. The length attribute of that vector (ncol(DT)) is then
decremented and the spare 4 bytes (or 8 on 64bit) are left unused at the
end.

An add column can't be fully by reference because the list vector is
full. A new list vector has to be allocated, one slot larger, the old
pointers memcpy'd over, and the last spot assigned the pointer to the
new column vector.  This copying is negligible because it's a small list
of pointers fitting well within one page. [Unless, there are many 1000's
of columns, which is why it's done as efficiently as possible using
memcpy].

Aside : There is little known (I guess) distinction between length and
truelength in R internals. Base R doesn't use it, but we could in
data.table. A delete column sets length but leaves truelength one
larger. When the next add column comes along, it could just do the budge
up and insert the column. That may not be so advantageous for (a small
number) of columns,  but the same logic could work for insert() and
delete()ing rows.  Of course, this would mean whether a visible copy or
not is taken depends on what happened previously, rather than the
syntax. That's something we've disliked before, in the same way we
dislike drop=TRUE behaviour and so dropped drop. One way to approach
this might be to advise ":= add *may* not copy. Best to assume it
doesn't; use copy()". If you get in the habbit of "DT2=copy(DT)" then
that'll take a deep copy at the time and you're safe.

To illustrate the partial (maybe shallow copy is better word), consider
the following :

> DT = data.table(1:2,3:4)
> DT2=DT
> DT2[,y:=10L]
     V1 V2  y
[1,]  1  3 10
[2,]  2  4 10
> DT
     V1 V2
[1,]  1  3
[2,]  2  4
> DT2
     V1 V2  y
[1,]  1  3 10
[2,]  2  4 10
> DT2[1,V1:=99L]
     V1 V2  y
[1,] 99  3 10
[2,]  2  4 10
> DT
     V1 V2
[1,] 99  3
[2,]  2  4
>

Matthew


On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:

> I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a copy in
> all cases.
>
>
> > DT.1 <- data.table(x=1, y=1)
> > DT.2 <- DT.1
> >
> > # Both DT.1 and DT.2 are changed.
> > DT.2[, y := NULL]
>      x
> [1,] 1
> > DT.1
>      x
> [1,] 1
> > DT.2
>      x
> [1,] 1
> >
> > # Only DT.2 is changed
> > DT.2[, y := x]
>      x y
> [1,] 1 1
> > DT.1
>      x
> [1,] 1
> > DT.2
>      x y
> [1,] 1 1
>
>
> _______________________________________________
> datatable-help mailing list
> [hidden email]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

mhwaliji
From the user's perspective, DT2 <- DT should either be a new copy or a new reference.  Anything in between is confusing.

How about this - add a new argument to data.table(), say max.cols.  max.cols defaults to a couple orders of magnitude above the initial number of columns.  data.table allocates enough memory for max.cols column pointers.  If you try to add more than max.cols columns, it is either an error, or it creates a copy and produces a warning.

On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle <[hidden email]> wrote:
Interesting one. Adding columns is a bit different to deleting and
modifying columns. Here's how it works. Could make changes, could
document it, or both, what do people think?

Just like data.frame there is a list vector holding pointers to the
column vectors. A delete column op is done with a memmove to budge up
the column pointers above the column by one place. That leaves a gap at
the end. The length attribute of that vector (ncol(DT)) is then
decremented and the spare 4 bytes (or 8 on 64bit) are left unused at the
end.

An add column can't be fully by reference because the list vector is
full. A new list vector has to be allocated, one slot larger, the old
pointers memcpy'd over, and the last spot assigned the pointer to the
new column vector.  This copying is negligible because it's a small list
of pointers fitting well within one page. [Unless, there are many 1000's
of columns, which is why it's done as efficiently as possible using
memcpy].

Aside : There is little known (I guess) distinction between length and
truelength in R internals. Base R doesn't use it, but we could in
data.table. A delete column sets length but leaves truelength one
larger. When the next add column comes along, it could just do the budge
up and insert the column. That may not be so advantageous for (a small
number) of columns,  but the same logic could work for insert() and
delete()ing rows.  Of course, this would mean whether a visible copy or
not is taken depends on what happened previously, rather than the
syntax. That's something we've disliked before, in the same way we
dislike drop=TRUE behaviour and so dropped drop. One way to approach
this might be to advise ":= add *may* not copy. Best to assume it
doesn't; use copy()". If you get in the habbit of "DT2=copy(DT)" then
that'll take a deep copy at the time and you're safe.

To illustrate the partial (maybe shallow copy is better word), consider
the following :

> DT = data.table(1:2,3:4)
> DT2=DT
> DT2[,y:=10L]
    V1 V2  y
[1,]  1  3 10
[2,]  2  4 10
> DT
    V1 V2
[1,]  1  3
[2,]  2  4
> DT2
    V1 V2  y
[1,]  1  3 10
[2,]  2  4 10
> DT2[1,V1:=99L]
    V1 V2  y
[1,] 99  3 10
[2,]  2  4 10
> DT
    V1 V2
[1,] 99  3
[2,]  2  4
>

Matthew


On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
> I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a copy in
> all cases.
>
>
> > DT.1 <- data.table(x=1, y=1)
> > DT.2 <- DT.1
> >
> > # Both DT.1 and DT.2 are changed.
> > DT.2[, y := NULL]
>      x
> [1,] 1
> > DT.1
>      x
> [1,] 1
> > DT.2
>      x
> [1,] 1
> >
> > # Only DT.2 is changed
> > DT.2[, y := x]
>      x y
> [1,] 1 1
> > DT.1
>      x
> [1,] 1
> > DT.2
>      x y
> [1,] 1 1
>
>
> _______________________________________________
> datatable-help mailing list
> [hidden email]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

Matthew Dowle

On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> >From the user's perspective, DT2 <- DT should either be a new copy or
> a new reference.  Anything in between is confusing.

Agreed. With picky caveat: even in base it's not at this point the copy
is taken. It's later: copy-on-write. It's setkey and := that don't copy
on write, not the (earlier) <-.

> How about this - add a new argument to data.table(), say max.cols.
> max.cols defaults to a couple orders of magnitude above the initial
> number of columns.  data.table allocates enough memory for max.cols
> column pointers.  If you try to add more than max.cols columns, it is
> either an error, or it creates a copy and produces a warning.

Very nice idea. To over allocate by default so that := can add columns
fully by reference most of the time seems good to me since there's a
very low cost to over allocating the vector of column pointers. Create
the (shallow copy) and issue a warning, I'm thinking, not error. The
"max.cols" names seems a bit absolute, could it be "alloc.cols"?  We
could have alloc(DT,2,ncol) or rowalloc(DT,n) and colalloc(DT,n), or
realloc(...) so users can over alloc themselves before a loop that adds
columns or inserts rows.  tables() could also report truenrow, and
truencol as well as nrow and ncol.  What should alloc.cols be, by
default? How about:  max(100,2*ncol)

What about as.data.table.data.frame()?  Should that over-allocate, too,
or for speed just change the class attribute as it does now.

Maybe checking NAMED would work, in addition. If NAMED was 0, no need to
warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the
warning be necessary.


>
> On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> <[hidden email]> wrote:
>         Interesting one. Adding columns is a bit different to deleting
>         and
>         modifying columns. Here's how it works. Could make changes,
>         could
>         document it, or both, what do people think?
>        
>         Just like data.frame there is a list vector holding pointers
>         to the
>         column vectors. A delete column op is done with a memmove to
>         budge up
>         the column pointers above the column by one place. That leaves
>         a gap at
>         the end. The length attribute of that vector (ncol(DT)) is
>         then
>         decremented and the spare 4 bytes (or 8 on 64bit) are left
>         unused at the
>         end.
>        
>         An add column can't be fully by reference because the list
>         vector is
>         full. A new list vector has to be allocated, one slot larger,
>         the old
>         pointers memcpy'd over, and the last spot assigned the pointer
>         to the
>         new column vector.  This copying is negligible because it's a
>         small list
>         of pointers fitting well within one page. [Unless, there are
>         many 1000's
>         of columns, which is why it's done as efficiently as possible
>         using
>         memcpy].
>        
>         Aside : There is little known (I guess) distinction between
>         length and
>         truelength in R internals. Base R doesn't use it, but we could
>         in
>         data.table. A delete column sets length but leaves truelength
>         one
>         larger. When the next add column comes along, it could just do
>         the budge
>         up and insert the column. That may not be so advantageous for
>         (a small
>         number) of columns,  but the same logic could work for
>         insert() and
>         delete()ing rows.  Of course, this would mean whether a
>         visible copy or
>         not is taken depends on what happened previously, rather than
>         the
>         syntax. That's something we've disliked before, in the same
>         way we
>         dislike drop=TRUE behaviour and so dropped drop. One way to
>         approach
>         this might be to advise ":= add *may* not copy. Best to assume
>         it
>         doesn't; use copy()". If you get in the habbit of
>         "DT2=copy(DT)" then
>         that'll take a deep copy at the time and you're safe.
>        
>         To illustrate the partial (maybe shallow copy is better word),
>         consider
>         the following :
>        
>         > DT = data.table(1:2,3:4)
>         > DT2=DT
>         > DT2[,y:=10L]
>             V1 V2  y
>         [1,]  1  3 10
>         [2,]  2  4 10
>         > DT
>             V1 V2
>         [1,]  1  3
>         [2,]  2  4
>         > DT2
>             V1 V2  y
>         [1,]  1  3 10
>         [2,]  2  4 10
>         > DT2[1,V1:=99L]
>             V1 V2  y
>         [1,] 99  3 10
>         [2,]  2  4 10
>         > DT
>             V1 V2
>         [1,] 99  3
>         [2,]  2  4
>         >
>        
>         Matthew
>        
>        
>         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
>         > I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a
>         copy in
>         > all cases.
>         >
>         >
>         > > DT.1 <- data.table(x=1, y=1)
>         > > DT.2 <- DT.1
>         > >
>         > > # Both DT.1 and DT.2 are changed.
>         > > DT.2[, y := NULL]
>         >      x
>         > [1,] 1
>         > > DT.1
>         >      x
>         > [1,] 1
>         > > DT.2
>         >      x
>         > [1,] 1
>         > >
>         > > # Only DT.2 is changed
>         > > DT.2[, y := x]
>         >      x y
>         > [1,] 1 1
>         > > DT.1
>         >      x
>         > [1,] 1
>         > > DT.2
>         >      x y
>         > [1,] 1 1
>         >
>         >
>        
>         > _______________________________________________
>         > datatable-help mailing list
>         > [hidden email]
>         >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>        
>        
>
>


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

mhwaliji
On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle <[hidden email]> wrote:

On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> >From the user's perspective, DT2 <- DT should either be a new copy or
> a new reference.  Anything in between is confusing.

Agreed. With picky caveat: even in base it's not at this point the copy
is taken. It's later: copy-on-write. It's setkey and := that don't copy
on write, not the (earlier) <-.

Hmm, I would prefer for these to have the same behavior.
 

> How about this - add a new argument to data.table(), say max.cols.
> max.cols defaults to a couple orders of magnitude above the initial
> number of columns.  data.table allocates enough memory for max.cols
> column pointers.  If you try to add more than max.cols columns, it is
> either an error, or it creates a copy and produces a warning.

Very nice idea. To over allocate by default so that := can add columns
fully by reference most of the time seems good to me since there's a
very low cost to over allocating the vector of column pointers. Create
the (shallow copy) and issue a warning, I'm thinking, not error. The
"max.cols" names seems a bit absolute, could it be "alloc.cols"?  We
could have alloc(DT,2,ncol) or rowalloc(DT,n) and colalloc(DT,n), or
realloc(...) so users can over alloc themselves before a loop that adds
columns or inserts rows.  tables() could also report truenrow, and
truencol as well as nrow and ncol.  What should alloc.cols be, by
default? How about:  max(100,2*ncol)

Fine with me. 

What about as.data.table.data.frame()?  Should that over-allocate, too,
or for speed just change the class attribute as it does now.

Yeah, I think any method of creating a data table should over-allocate.  If people want the speed gains, they can set explicitly set alloc.cols.


Maybe checking NAMED would work, in addition. If NAMED was 0, no need to
warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the
warning be necessary.


>
> On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> <[hidden email]> wrote:
>         Interesting one. Adding columns is a bit different to deleting
>         and
>         modifying columns. Here's how it works. Could make changes,
>         could
>         document it, or both, what do people think?
>
>         Just like data.frame there is a list vector holding pointers
>         to the
>         column vectors. A delete column op is done with a memmove to
>         budge up
>         the column pointers above the column by one place. That leaves
>         a gap at
>         the end. The length attribute of that vector (ncol(DT)) is
>         then
>         decremented and the spare 4 bytes (or 8 on 64bit) are left
>         unused at the
>         end.
>
>         An add column can't be fully by reference because the list
>         vector is
>         full. A new list vector has to be allocated, one slot larger,
>         the old
>         pointers memcpy'd over, and the last spot assigned the pointer
>         to the
>         new column vector.  This copying is negligible because it's a
>         small list
>         of pointers fitting well within one page. [Unless, there are
>         many 1000's
>         of columns, which is why it's done as efficiently as possible
>         using
>         memcpy].
>
>         Aside : There is little known (I guess) distinction between
>         length and
>         truelength in R internals. Base R doesn't use it, but we could
>         in
>         data.table. A delete column sets length but leaves truelength
>         one
>         larger. When the next add column comes along, it could just do
>         the budge
>         up and insert the column. That may not be so advantageous for
>         (a small
>         number) of columns,  but the same logic could work for
>         insert() and
>         delete()ing rows.  Of course, this would mean whether a
>         visible copy or
>         not is taken depends on what happened previously, rather than
>         the
>         syntax. That's something we've disliked before, in the same
>         way we
>         dislike drop=TRUE behaviour and so dropped drop. One way to
>         approach
>         this might be to advise ":= add *may* not copy. Best to assume
>         it
>         doesn't; use copy()". If you get in the habbit of
>         "DT2=copy(DT)" then
>         that'll take a deep copy at the time and you're safe.
>
>         To illustrate the partial (maybe shallow copy is better word),
>         consider
>         the following :
>
>         > DT = data.table(1:2,3:4)
>         > DT2=DT
>         > DT2[,y:=10L]
>             V1 V2  y
>         [1,]  1  3 10
>         [2,]  2  4 10
>         > DT
>             V1 V2
>         [1,]  1  3
>         [2,]  2  4
>         > DT2
>             V1 V2  y
>         [1,]  1  3 10
>         [2,]  2  4 10
>         > DT2[1,V1:=99L]
>             V1 V2  y
>         [1,] 99  3 10
>         [2,]  2  4 10
>         > DT
>             V1 V2
>         [1,] 99  3
>         [2,]  2  4
>         >
>
>         Matthew
>
>
>         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
>         > I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a
>         copy in
>         > all cases.
>         >
>         >
>         > > DT.1 <- data.table(x=1, y=1)
>         > > DT.2 <- DT.1
>         > >
>         > > # Both DT.1 and DT.2 are changed.
>         > > DT.2[, y := NULL]
>         >      x
>         > [1,] 1
>         > > DT.1
>         >      x
>         > [1,] 1
>         > > DT.2
>         >      x
>         > [1,] 1
>         > >
>         > > # Only DT.2 is changed
>         > > DT.2[, y := x]
>         >      x y
>         > [1,] 1 1
>         > > DT.1
>         >      x
>         > [1,] 1
>         > > DT.2
>         >      x y
>         > [1,] 1 1
>         >
>         >
>
>         > _______________________________________________
>         > datatable-help mailing list
>         > [hidden email]
>         >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>




_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

Matthew Dowle
On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:

> On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle
> <[hidden email]> wrote:
>        
>         On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
>         > >From the user's perspective, DT2 <- DT should either be a
>         new copy or
>         > a new reference.  Anything in between is confusing.
>        
>        
>         Agreed. With picky caveat: even in base it's not at this point
>         the copy
>         is taken. It's later: copy-on-write. It's setkey and := that
>         don't copy
>         on write, not the (earlier) <-.
>
>
> Hmm, I would prefer for these to have the same behavior.

Not sure I follow, please expand.

>  
>        
>         > How about this - add a new argument to data.table(), say
>         max.cols.
>         > max.cols defaults to a couple orders of magnitude above the
>         initial
>         > number of columns.  data.table allocates enough memory for
>         max.cols
>         > column pointers.  If you try to add more than max.cols
>         columns, it is
>         > either an error, or it creates a copy and produces a
>         warning.
>        
>        
>         Very nice idea. To over allocate by default so that := can add
>         columns
>         fully by reference most of the time seems good to me since
>         there's a
>         very low cost to over allocating the vector of column
>         pointers. Create
>         the (shallow copy) and issue a warning, I'm thinking, not
>         error. The
>         "max.cols" names seems a bit absolute, could it be
>         "alloc.cols"?  We
>         could have alloc(DT,2,ncol) or rowalloc(DT,n) and
>         colalloc(DT,n), or
>         realloc(...) so users can over alloc themselves before a loop
>         that adds
>         columns or inserts rows.  tables() could also report truenrow,
>         and
>         truencol as well as nrow and ncol.  What should alloc.cols be,
>         by
>         default? How about:  max(100,2*ncol)
>
>
> Fine with me.
>        
>         What about as.data.table.data.frame()?  Should that
>         over-allocate, too,
>         or for speed just change the class attribute as it does now.
>
>
> Yeah, I think any method of creating a data table should
> over-allocate.  If people want the speed gains, they can set
> explicitly set alloc.cols.
>
>
>        
>         Maybe checking NAMED would work, in addition. If NAMED was 0,
>         no need to
>         warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) -
>         would the
>         warning be necessary.
>        
>        
>         >
>         > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
>         > <[hidden email]> wrote:
>         >         Interesting one. Adding columns is a bit different
>         to deleting
>         >         and
>         >         modifying columns. Here's how it works. Could make
>         changes,
>         >         could
>         >         document it, or both, what do people think?
>         >
>         >         Just like data.frame there is a list vector holding
>         pointers
>         >         to the
>         >         column vectors. A delete column op is done with a
>         memmove to
>         >         budge up
>         >         the column pointers above the column by one place.
>         That leaves
>         >         a gap at
>         >         the end. The length attribute of that vector
>         (ncol(DT)) is
>         >         then
>         >         decremented and the spare 4 bytes (or 8 on 64bit)
>         are left
>         >         unused at the
>         >         end.
>         >
>         >         An add column can't be fully by reference because
>         the list
>         >         vector is
>         >         full. A new list vector has to be allocated, one
>         slot larger,
>         >         the old
>         >         pointers memcpy'd over, and the last spot assigned
>         the pointer
>         >         to the
>         >         new column vector.  This copying is negligible
>         because it's a
>         >         small list
>         >         of pointers fitting well within one page. [Unless,
>         there are
>         >         many 1000's
>         >         of columns, which is why it's done as efficiently as
>         possible
>         >         using
>         >         memcpy].
>         >
>         >         Aside : There is little known (I guess) distinction
>         between
>         >         length and
>         >         truelength in R internals. Base R doesn't use it,
>         but we could
>         >         in
>         >         data.table. A delete column sets length but leaves
>         truelength
>         >         one
>         >         larger. When the next add column comes along, it
>         could just do
>         >         the budge
>         >         up and insert the column. That may not be so
>         advantageous for
>         >         (a small
>         >         number) of columns,  but the same logic could work
>         for
>         >         insert() and
>         >         delete()ing rows.  Of course, this would mean
>         whether a
>         >         visible copy or
>         >         not is taken depends on what happened previously,
>         rather than
>         >         the
>         >         syntax. That's something we've disliked before, in
>         the same
>         >         way we
>         >         dislike drop=TRUE behaviour and so dropped drop. One
>         way to
>         >         approach
>         >         this might be to advise ":= add *may* not copy. Best
>         to assume
>         >         it
>         >         doesn't; use copy()". If you get in the habbit of
>         >         "DT2=copy(DT)" then
>         >         that'll take a deep copy at the time and you're
>         safe.
>         >
>         >         To illustrate the partial (maybe shallow copy is
>         better word),
>         >         consider
>         >         the following :
>         >
>         >         > DT = data.table(1:2,3:4)
>         >         > DT2=DT
>         >         > DT2[,y:=10L]
>         >             V1 V2  y
>         >         [1,]  1  3 10
>         >         [2,]  2  4 10
>         >         > DT
>         >             V1 V2
>         >         [1,]  1  3
>         >         [2,]  2  4
>         >         > DT2
>         >             V1 V2  y
>         >         [1,]  1  3 10
>         >         [2,]  2  4 10
>         >         > DT2[1,V1:=99L]
>         >             V1 V2  y
>         >         [1,] 99  3 10
>         >         [2,]  2  4 10
>         >         > DT
>         >             V1 V2
>         >         [1,] 99  3
>         >         [2,]  2  4
>         >         >
>         >
>         >         Matthew
>         >
>         >
>         >         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji
>         wrote:
>         >         > I think this is a bug.  DT.2 <- DT.1 doesn't seem
>         to make a
>         >         copy in
>         >         > all cases.
>         >         >
>         >         >
>         >         > > DT.1 <- data.table(x=1, y=1)
>         >         > > DT.2 <- DT.1
>         >         > >
>         >         > > # Both DT.1 and DT.2 are changed.
>         >         > > DT.2[, y := NULL]
>         >         >      x
>         >         > [1,] 1
>         >         > > DT.1
>         >         >      x
>         >         > [1,] 1
>         >         > > DT.2
>         >         >      x
>         >         > [1,] 1
>         >         > >
>         >         > > # Only DT.2 is changed
>         >         > > DT.2[, y := x]
>         >         >      x y
>         >         > [1,] 1 1
>         >         > > DT.1
>         >         >      x
>         >         > [1,] 1
>         >         > > DT.2
>         >         >      x y
>         >         > [1,] 1 1
>         >         >
>         >         >
>         >
>         >         > _______________________________________________
>         >         > datatable-help mailing list
>         >         > [hidden email]
>         >         >
>         >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>         >
>         >
>         >
>         >
>        
>        
>        
>


_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

mhwaliji


On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle <[hidden email]> wrote:
On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:
> On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle
> <[hidden email]> wrote:
>
>         On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
>         > >From the user's perspective, DT2 <- DT should either be a
>         new copy or
>         > a new reference.  Anything in between is confusing.
>
>
>         Agreed. With picky caveat: even in base it's not at this point
>         the copy
>         is taken. It's later: copy-on-write. It's setkey and := that
>         don't copy
>         on write, not the (earlier) <-.
>
>
> Hmm, I would prefer for these to have the same behavior.

Not sure I follow, please expand.

I would like for DT[, x := foo] and DT$x <- foo to have the same behavior.  i.e. if one preserves the reference, so should the other.
 

>
>
>         > How about this - add a new argument to data.table(), say
>         max.cols.
>         > max.cols defaults to a couple orders of magnitude above the
>         initial
>         > number of columns.  data.table allocates enough memory for
>         max.cols
>         > column pointers.  If you try to add more than max.cols
>         columns, it is
>         > either an error, or it creates a copy and produces a
>         warning.
>
>
>         Very nice idea. To over allocate by default so that := can add
>         columns
>         fully by reference most of the time seems good to me since
>         there's a
>         very low cost to over allocating the vector of column
>         pointers. Create
>         the (shallow copy) and issue a warning, I'm thinking, not
>         error. The
>         "max.cols" names seems a bit absolute, could it be
>         "alloc.cols"?  We
>         could have alloc(DT,2,ncol) or rowalloc(DT,n) and
>         colalloc(DT,n), or
>         realloc(...) so users can over alloc themselves before a loop
>         that adds
>         columns or inserts rows.  tables() could also report truenrow,
>         and
>         truencol as well as nrow and ncol.  What should alloc.cols be,
>         by
>         default? How about:  max(100,2*ncol)
>
>
> Fine with me.
>
>         What about as.data.table.data.frame()?  Should that
>         over-allocate, too,
>         or for speed just change the class attribute as it does now.
>
>
> Yeah, I think any method of creating a data table should
> over-allocate.  If people want the speed gains, they can set
> explicitly set alloc.cols.
>
>
>
>         Maybe checking NAMED would work, in addition. If NAMED was 0,
>         no need to
>         warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) -
>         would the
>         warning be necessary.
>
>
>         >
>         > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
>         > <[hidden email]> wrote:
>         >         Interesting one. Adding columns is a bit different
>         to deleting
>         >         and
>         >         modifying columns. Here's how it works. Could make
>         changes,
>         >         could
>         >         document it, or both, what do people think?
>         >
>         >         Just like data.frame there is a list vector holding
>         pointers
>         >         to the
>         >         column vectors. A delete column op is done with a
>         memmove to
>         >         budge up
>         >         the column pointers above the column by one place.
>         That leaves
>         >         a gap at
>         >         the end. The length attribute of that vector
>         (ncol(DT)) is
>         >         then
>         >         decremented and the spare 4 bytes (or 8 on 64bit)
>         are left
>         >         unused at the
>         >         end.
>         >
>         >         An add column can't be fully by reference because
>         the list
>         >         vector is
>         >         full. A new list vector has to be allocated, one
>         slot larger,
>         >         the old
>         >         pointers memcpy'd over, and the last spot assigned
>         the pointer
>         >         to the
>         >         new column vector.  This copying is negligible
>         because it's a
>         >         small list
>         >         of pointers fitting well within one page. [Unless,
>         there are
>         >         many 1000's
>         >         of columns, which is why it's done as efficiently as
>         possible
>         >         using
>         >         memcpy].
>         >
>         >         Aside : There is little known (I guess) distinction
>         between
>         >         length and
>         >         truelength in R internals. Base R doesn't use it,
>         but we could
>         >         in
>         >         data.table. A delete column sets length but leaves
>         truelength
>         >         one
>         >         larger. When the next add column comes along, it
>         could just do
>         >         the budge
>         >         up and insert the column. That may not be so
>         advantageous for
>         >         (a small
>         >         number) of columns,  but the same logic could work
>         for
>         >         insert() and
>         >         delete()ing rows.  Of course, this would mean
>         whether a
>         >         visible copy or
>         >         not is taken depends on what happened previously,
>         rather than
>         >         the
>         >         syntax. That's something we've disliked before, in
>         the same
>         >         way we
>         >         dislike drop=TRUE behaviour and so dropped drop. One
>         way to
>         >         approach
>         >         this might be to advise ":= add *may* not copy. Best
>         to assume
>         >         it
>         >         doesn't; use copy()". If you get in the habbit of
>         >         "DT2=copy(DT)" then
>         >         that'll take a deep copy at the time and you're
>         safe.
>         >
>         >         To illustrate the partial (maybe shallow copy is
>         better word),
>         >         consider
>         >         the following :
>         >
>         >         > DT = data.table(1:2,3:4)
>         >         > DT2=DT
>         >         > DT2[,y:=10L]
>         >             V1 V2  y
>         >         [1,]  1  3 10
>         >         [2,]  2  4 10
>         >         > DT
>         >             V1 V2
>         >         [1,]  1  3
>         >         [2,]  2  4
>         >         > DT2
>         >             V1 V2  y
>         >         [1,]  1  3 10
>         >         [2,]  2  4 10
>         >         > DT2[1,V1:=99L]
>         >             V1 V2  y
>         >         [1,] 99  3 10
>         >         [2,]  2  4 10
>         >         > DT
>         >             V1 V2
>         >         [1,] 99  3
>         >         [2,]  2  4
>         >         >
>         >
>         >         Matthew
>         >
>         >
>         >         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji
>         wrote:
>         >         > I think this is a bug.  DT.2 <- DT.1 doesn't seem
>         to make a
>         >         copy in
>         >         > all cases.
>         >         >
>         >         >
>         >         > > DT.1 <- data.table(x=1, y=1)
>         >         > > DT.2 <- DT.1
>         >         > >
>         >         > > # Both DT.1 and DT.2 are changed.
>         >         > > DT.2[, y := NULL]
>         >         >      x
>         >         > [1,] 1
>         >         > > DT.1
>         >         >      x
>         >         > [1,] 1
>         >         > > DT.2
>         >         >      x
>         >         > [1,] 1
>         >         > >
>         >         > > # Only DT.2 is changed
>         >         > > DT.2[, y := x]
>         >         >      x y
>         >         > [1,] 1 1
>         >         > > DT.1
>         >         >      x
>         >         > [1,] 1
>         >         > > DT.2
>         >         >      x y
>         >         > [1,] 1 1
>         >         >
>         >         >
>         >
>         >         > _______________________________________________
>         >         > datatable-help mailing list
>         >         > [hidden email]
>         >         >
>         >
>         https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>         >
>         >
>         >
>         >
>
>
>
>




_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

Matthew Dowle
On Fri, 2011-10-28 at 17:59 -0700, Muhammad Waliji wrote:

> On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle
> <[hidden email]> wrote:
>         On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:
>         > On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle
>         > <[hidden email]> wrote:
>         >
>         >         On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji
>         wrote:
>         >         > >From the user's perspective, DT2 <- DT should
>         either be a
>         >         new copy or
>         >         > a new reference.  Anything in between is
>         confusing.
>         >
>         >
>         >         Agreed. With picky caveat: even in base it's not at
>         this point
>         >         the copy
>         >         is taken. It's later: copy-on-write. It's setkey
>         and := that
>         >         don't copy
>         >         on write, not the (earlier) <-.
>         >
>         >
>         > Hmm, I would prefer for these to have the same behavior.
>        
>        
>         Not sure I follow, please expand.
>
>
> I would like for DT[, x := foo] and DT$x <- foo to have the same
> behavior.  i.e. if one preserves the reference, so should the other.
>  

Agreed, I would too (at least I did to start with). However, it seems a
base R thing that <- copies via `*tmp*` and isn't possible to make it
work by reference.  See the (long) thread on r-devel here :

http://r.789695.n4.nabble.com/speeding-up-perception-tp3640920p3640920.html 

I'm pretty happy with :=, honestly. It's nice that <-, <<- and := all do
different and useful things within j. Changes to the meaning of [<-
might not have been backwards compatible, so := provides without
forcing.

Matthew



_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Copy on assign broken in some cases

Matthew Dowle
In reply to this post by mhwaliji
mhwaliji wrote
How about this - add a new argument to data.table(), say
max.cols. max.cols defaults to a couple orders of magnitude
above the initial number of columns.  data.table allocates enough
memory for max.cols column pointers.  If you try to add more
than max.cols columns, it is either an error, or it creates a copy
and produces a warning.
mdowle wrote
Very nice idea. To over allocate by default so that := can add
columns fully by reference most of the time seems good to me since
there's a very low cost to over allocating the vector of column
pointers. Create the (shallow copy) and issue a warning, I'm thinking, not
error. The "max.cols" names seems a bit absolute, could it be
"alloc.cols"?  We could have alloc(DT,2,ncol) or rowalloc(DT,n) and
colalloc(DT,n), or realloc(...) so users can over alloc themselves before a loop
that adds columns or inserts rows.  tables() could also report truenrow,
and truencol as well as nrow and ncol.  What should alloc.cols be,
by default? How about:  max(100,2*ncol)
mhwaliji wrote
Fine with me.
Done, committed to v1.7.3 on R-Forge.

Matthew
Loading...