
speeding up perception

speeding up perception

ivo welch
Dear R developers: R is supposed to be slow for iterative
calculations.  Actually, it isn't.  Matrix operations are fast.  It is
data frame operations that are slow.

R <- 1000
C <- 1000

example <- function(m) {
  cat("rows: ")
  cat(system.time(for (r in 1:R) m[r, 20] <- sqrt(abs(m[r, 20])) + rnorm(1)), "\n")
  cat("columns: ")
  cat(system.time(for (c in 1:C) m[20, c] <- sqrt(abs(m[20, c])) + rnorm(1)), "\n")
  if (is.data.frame(m)) {
    cat("df: columns as names: ")
    cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
  }
}

cat("\n**** Now as matrix\n")
example( matrix( rnorm(C*R), nrow=R ) )

cat("\n**** Now as data frame\n")
example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )


When m is a data frame, the operation is about 300 times slower than
when m is a matrix.  The program is basically accessing 1000 numbers.
When m is a data frame, the speed of R is about 20 accesses per second
on a Mac Pro.  This is pretty pathetic.

I do not know the R internals, so the following is pure speculation.
I understand that an index calculation is faster than a vector lookup
for arbitrary-size objects, but it seems as if R relies on search to
find its elements.  Maybe there isn't even a basic vector lookup
table.  A vector lookup table should be possible at least along the
dimension of consecutive storage.  Another possible improvement would
be an operation that adds an attribute to the data frame containing a
full index table for quick lookup.  (If the index table is present, it
could be used; otherwise R could simply fall back to the existing
internal mechanism.)

I think faster data frame access would significantly improve the
impression that R makes on novices.  Just my 5 cents.

/iaw
----
Ivo Welch ([hidden email])
http://www.ivo-welch.info/
J. Fred Weston Professor of Finance
Anderson School at UCLA, C519


Re: speeding up perception

Simon Urbanek
This is just a quick, incomplete response, but the main misconception is really the use of data.frames. If you don't need the elaborate mechanics of data frames, in particular the management of row names, then they are definitely the wrong tool, because most of the overhead exists exactly to manage the row names and you pay a substantial penalty for it. Just drop that one feature and you get timings similar to a matrix:

> m=matrix(rnorm(C*R), nrow=R)
> example(m)
rows: 0.015 0 0.015 0 0
columns: 0.01 0 0.01 0 0

> l=as.list(as.data.frame( matrix( rnorm(C*R), nrow=R ) ))
> example(l)
rows: 0.015 0 0.016 0 0
columns: 0.012 0 0.011 0 0

(with example modified to use m[[y]][x] instead of m[x,y])

I would not be surprised if many people use data.frames for the convenience of the matrix subsetting/subassignment operators and don't really care about the row names; for all those uses data.frames are the wrong tool. (Just look at `[.data.frame` and `[<-.data.frame`.)

As a side note, it's a bit pointless to compare the performance to matrices, as they impose a much more rigid structure (all columns have the same type); if you use data frames in such special (rare) cases, it's really your fault ;). So the bottom line is to educate users not to use data frames where they are not needed, and/or to provide alternatives (and there may be some things coming up, too).
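
For instance, a minimal sketch of the round-trip (assuming nothing in
between relies on data.frame semantics such as row names):

df <- as.data.frame(matrix(rnorm(1000 * 1000), nrow = 1000))
l <- as.list(df)                  # plain list of column vectors
for (i in 1:1000) l[[1]][i] <- 1  # fast: no data.frame dispatch involved
df <- as.data.frame(l)            # restore the data.frame when done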

And as I said, this is just a quick note, so carry on and comment on the original question ;).

Cheers,
Simon




Re: speeding up perception

Robert Stojnic

Hi Simon,


I tried to find some documentation on why there needs to be extra row
name handling when one is just assigning values into the column of a
data frame, but couldn't find any. For a while I stared at the code of
`[<-.data.frame` but couldn't figure it out myself. Can you please
summarise what exactly is going on when one does m[1, 1] <- 1 where m
is a data frame?

I found that the performance differs significantly with the number of
columns. For instance:

# reassign first column to 1
example <- function(m){
     for(i in 1:1000)
         m[i,1] <- 1
}

m <- as.data.frame(matrix(0, ncol=2, nrow=1000))
system.time( example(m) )

    user  system elapsed
   0.164   0.000   0.163

m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
system.time( example(m) )

    user  system elapsed
  34.634   0.004  34.765

When m is a matrix, both run well under 0.1s.

Increasing the number of rows (but not the number of iterations) leads
to some increase in time, but nothing as drastic as increasing the
number of columns. Using m[[y]][x] in this case doesn't help either.

example2 <- function(m){
     for(i in 1:1000)
         m[[1]][i] <- 1
}

m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
system.time( example2(m) )

    user  system elapsed
  36.007   0.148  36.233


r.


Re: speeding up perception

Simon Urbanek
Robert,

it's not the handling of row names per se that causes the slowdown, but my point was that if what you need is just a matrix-like structure with different column types, you may want to use lists instead; for equal column types you're better off with a matrix.

But to address your point: one of the reasons subassignments on data frames are slow is that they need extra copies of the data frame for method dispatch. Data frames are lists of column vectors, so the penalty grows with the number of columns. Rows play no significant (additional) role, because those are simply operations on the column vectors (which need to be copied on modification in any case).

In practice it would not matter as much unless users do stupid things like the example loop. In that case the list holding the columns is copied twice for every single value of i, which is deadly. Obviously the sensible thing to do, m[1:1000,1] <- 1, does not have that issue.
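
To see the contrast, a quick sketch (timings are illustrative only and
will vary by machine):

m <- as.data.frame(matrix(0, ncol = 1000, nrow = 1000))
system.time(for (i in 1:1000) m[i, 1] <- 1)  # copies m on every iteration
system.time(m[1:1000, 1] <- 1)               # one dispatch, one copy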

So to illustrate part of the data.frame penalty effect consider simply falling back to lists in the assignment:

> example2 <- function(m){
+    for(i in 1:1000)
+        m[[1]][i] <- 1
+ }
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
   user  system elapsed
 44.359  13.608  58.011

> ### using a list is very fast as illustrated before:
> m <- as.list(as.data.frame(matrix(0, ncol=1000, nrow=1000)))
> system.time( example2(m) )
   user  system elapsed
   0.01    0.00    0.01

> ### now try to fall back to a list for each iteration (part of what the data frames have to do):
> example3 <- function(m){
+    for(i in 1:1000) {
+        oc <- class(m)
+        class(m) <- NULL
+        m[[1]][i] <- 1
+        class(m) <- oc
+    }
+ }
> system.time( example3(m) )
   user  system elapsed
 19.080   2.251  21.335

So just the simple fact that you unclass and re-class the object gives you half of the penalty that data.frames incur even if you're dealing with a list. Add the additional logic that data frames have to go through and you have the full picture.

So, as I was saying earlier, if you want to loop subassignments over many elements: don't do that in the first place, but if you do, use lists or matrices, NOT data frames.

Cheers,
Simon




Re: speeding up perception

ivo welch
In reply to this post by ivo welch
Thank you, Simon.  This was very interesting indeed.  I also now
understand how far out of my depth I am here.

Fortunately, as an end user, *I* now obviously know how to avoid the
problem.  I particularly like the as.list() transformation and back to
as.data.frame() to speed things up without loss of (much)
functionality.


More broadly, I view the avoidance of individual access through the
use of apply and vector operations as a mixed "IQ test" and "knowledge
test" (which I often fail).  However, even for the most clever, there
are situations where the KISS programming principle makes explicit
loops preferable.  Personally, I would have preferred it if R had, in
its standard "statistical data set" data structure, forgone the row
names feature in exchange for retaining fast direct access.  R could
have reserved its current implementation, with row names but slow
access, for a less common (possibly pseudo-inheriting) data structure.


If end users commonly iterate over a data frame, which I would guess
to be the case, then the impression R makes on (novice) end users
could be greatly enhanced if the extreme penalties were eliminated or
at least flagged.  For example, I wonder if modest special internal
code could store data frames internally and transparently as lists of
vectors UNTIL a row name is assigned to.  Easier and uglier, a simple
but specific warning message could be issued, with a suggestion, when
there is an individual read/write into a data frame ("Warning: data
frames are much slower than lists of vectors for individual element
access").


I would also suggest changing "An Introduction to R", section 6.3,
from "A data frame may for many purposes be regarded as a matrix with
columns possibly of differing modes and attributes. It may be
displayed in matrix form, and its rows and columns extracted using
matrix indexing conventions." to "A data frame may for many purposes
be regarded as a matrix with columns possibly of differing modes and
attributes. It may be displayed in matrix form, and its rows and
columns extracted using matrix indexing conventions.  However, data
frames can be much slower than matrices or even lists of vectors
(which, like data frames, can contain different types of columns) when
individual elements need to be accessed."  Reading about it
immediately upon introduction would flag the problem in a more visible
manner.


regards,

/iaw


Re: speeding up perception

Timothée Carayol
Hi --

It's my first post on this list; as a relatively new user with little
knowledge of R internals, I am a bit intimidated by the depth of some
of the discussions here, so please forgive me if I say something
incredibly silly.

I feel that someone at this point should mention Matthew Dowle's
excellent data.table package
(http://cran.r-project.org/web/packages/data.table/index.html) which
seems to me to address many of the inefficiencies of data.frame.
data.tables have no row names; and operations that only need data from
one or two columns are (I believe) just as quick whether the total
number of columns is 5 or 1000. This results in very quick operations
(and, often, elegant code as well).
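
For flavour, a minimal sketch (I believe the syntax is roughly this,
though the details may be off for the current CRAN version):

library(data.table)
DT <- data.table(x = rnorm(1e6), grp = rep(1:1000, each = 1000))
DT[, mean(x), by = "grp"]   # grouped mean, no row names involved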

Regards
Timothee



Re: speeding up perception

Simon Urbanek
Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

> Hi --
>
> It's my first post on this list; as a relatively new user with little
> knowledge of R internals, I am a bit intimidated by the depth of some
> of the discussions here, so please spare me if I say something
> incredibly silly.
>
> I feel that someone at this point should mention Matthew Dowle's
> excellent data.table package
> (http://cran.r-project.org/web/packages/data.table/index.html) which
> seems to me to address many of the inefficiencies of data.frame.
> data.tables have no row names; and operations that only need data from
> one or two columns are (I believe) just as quick whether the total
> number of columns is 5 or 1000. This results in very quick operations
> (and, often, elegant code as well).
>

I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand, since it simply does a pass-through to the data frame methods for subassignments and thus suffers from the same problems. (In fact there is a rather stark asymmetry in how it handles subsetting vs subassignment, which is a bit surprising; if I read the code correctly you can't use the same indexing in both.) In fact I would propose that it not do that but handle the simple cases itself more efficiently, without unneeded copies. That would indeed make it a very interesting alternative.

Cheers,
Simon




Re: speeding up perception

Tim Hesterberg
In reply to this post by ivo welch
I've written a "dataframe" package that replaces existing methods for
data frame creation and subscripting with versions that use less
memory.  For example, as.data.frame(a vector) makes 4 copies of the
data in R 2.9.2, and 1 copy with the package.  There is a small speed
gain.

I and others have been using it at Google for some years, and it is
time either to put it on CRAN or to move it into R.

R core folks - would you prefer that this be released to CRAN, or
would you like to consider merging it directly into R?

I took existing functions, and did some hacks to reduce the number of
times R copies objects.  Some of it is ugly.  This could be done more
efficiently, and with cleaner code, with some changes or hooks in R
internal code, but I'm not prepared to do that.

I often use lists instead of data frames.  In another package I have a
'subscriptRows' function that subscripts a list as if it were
a data frame.  I could merge that into the dataframe package.
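
For illustration only (subscriptRows itself isn't released, and the
real function may do more), the core idea is roughly a one-liner:

subscriptRows <- function(l, i) lapply(l, `[`, i)  # rows i of each column
d <- list(x = 1:5, y = letters[1:5])
subscriptRows(d, 2:3)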

Memory use - number of copies made
#                               R 2.9.2                 library(dataframe)
#       as.data.frame(y)        4                       1
#       data.frame(y)           8                       3
#       data.frame(y, z)        8                       3
#       as.data.frame(l)        10                      3
#       data.frame(l)           15                      5
#       d$z <- z                3,2                     1,1
#       d[["z"]] <- z           4,3                     2,1
#       d[, "z"] <- z           6,4,2                   2,2,1
#       d["z"] <- z             6,5,2                   2,2,1
#       d["z"] <- list(z=z)     6,3,2                   2,2,1
#       d["z"] <- Z #list(z=z)  6,2,2                   2,1,1
#       a <- d["y"]             2                       1
#       a <- d[, "y", drop=F]   2                       1
# y and z are vectors, Z and l are lists, and d a data frame.
# Where two numbers are given, they refer to:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number counts copies made of an integer vector of row names.

#                      -------  seconds (multiple repetitions) -------
#                      creation/column subscripting     row subscripting
# R 2.9.2            : 34.2 43.9 43.3                   10.6 13.0
# library(dataframe) : 22.5 21.8 21.8                    9.7  9.5  9.5
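
To reproduce such counts yourself, tracemem reports each duplication
of a traced object (it requires an R build with memory profiling
enabled, and exact counts vary by R version). A sketch:

y <- rnorm(1e6)
tracemem(y)             # prints the address being traced
d <- as.data.frame(y)   # each tracemem[...] line printed is one copy of y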

I reported one of the simpler hacks to this list earlier, and it
was included in some version of R after 2.9.2, so the current version
of R isn't as bad as 2.9.2.


Re: [datatable-help] speeding up perception

Matthew Dowle
In reply to this post by Simon Urbanek

Simon,

Thanks for the great suggestion. I've written a skeleton assignment
function for data.table which incurs no copies and which works for
this case. For completeness, if I understand correctly, this is for:
  i) convenience of new users who don't know how to vectorize yet;
  ii) more complex examples which can't be vectorized.

Before:

> system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
 12.792   0.488  13.340

After:

> system.time(for (r in 1:R) DT[r,20] <- 1.0)
   user  system elapsed
  2.908   0.020   2.935

Where this can be reduced further as follows:

> system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
   user  system elapsed
  0.132   0.000   0.131
>

Still working on it. When it doesn't break other data.table tests, I'll
commit to R-Forge ...

Matthew




Re: [datatable-help] speeding up perception

Matthew Dowle
Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?  A small reproducible example in vanilla R 2.13.0:

> x = list(a=1:10000,b=1:10000)
> class(x) = "newclass"
> "[<-.newclass" = function(x,i,j,value) x      # i.e. do nothing
> tracemem(x)
[1] "<0xa1ec758>"
> x[1,2] = 42L
tracemem[0xa1ec758 -> 0xa1ec558]:    # but, x is still copied, why?
>

I've tried returning NULL from [<-.newclass, but then x gets assigned
NULL:

> "[<-.newclass" = function(x,i,j,value) NULL
> x[1,2] = 42L
tracemem[0xa1ec558 -> 0x9c5f318]:
> x
NULL
>

Any pointers much appreciated. If that copy is preventable, it should
save the user from needing the `[<-.data.table`(...) syntax to get the
best speed (20 times faster on the small example used so far).

Matthew




Re: [datatable-help] speeding up perception

Simon Urbanek

On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

> Is there a way to prevent '[<-' from copying x?

Good point, and conceptually, no. It's a subassignment after all (see R-lang section 3.4.4): it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.

Now, a conceptual copy doesn't mean a real copy in R, since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or are safe. The default subassign method uses that feature, which means it can afford not to duplicate if there is only one reference; then it is safe not to duplicate, because we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.
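
For example (tracemem again requires a memory-profiling build, and
whether the first assignment copies depends on how many references the
matrix has at that point):

m <- matrix(0, 1000, 1000)
tracemem(m)
m[1, 1] <- 1   # possibly one copy reported here
m[2, 1] <- 2   # silent: modified in place, only one reference remains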

Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain, so there will always be two references and duplication is necessary. Since we only have 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around it (unless we handle subassignment methods in some special way).

Cheers,
Simon





Re: [datatable-help] speeding up perception

Luke Tierney
In reply to this post by Matthew Dowle
On Tue, 5 Jul 2011, Matthew Dowle wrote:

>> x = list(a=1:10000,b=1:10000)
>> class(x) = "newclass"
>> "[<-.newclass" = function(x,i,j,value) x      # i.e. do nothing
>> tracemem(x)
> [1] "<0xa1ec758>"
>> x[1,2] = 42L
> tracemem[0xa1ec758 -> 0xa1ec558]:    # but, x is still copied, why?
>>
This one is a red herring -- the class(x) = "newclass" assignment
bumps up the NAMED value, and as a result the following assignment
needs to duplicate. (The primitive class<- could be modified to avoid
the NAMED bump, but it's fairly intricate code, so I'm not going to
look into it now.)
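
If you want to watch the bump, the unexported inspect internal shows
it (output format varies across R versions):

x <- list(a = 1:10000, b = 1:10000)
.Internal(inspect(x))   # NAM(1) on the freshly created list
class(x) <- "newclass"
.Internal(inspect(x))   # NAM(2): the next subassignment must duplicate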

[A bit more later in reply to Simon's message]

luke

--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: [datatable-help] speeding up perception

Luke Tierney
In reply to this post by Simon Urbanek
On Tue, 5 Jul 2011, Simon Urbanek wrote:

>
> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>
>> Simon (and all),
>>
>> I've tried to make assignment as fast as calling `[<-.data.table`
>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>> copying x?
>
> Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to
>
> `*tmp*` <- x
> x <- `[<-`(`*tmp*`, i, j, value)
> rm(`*tmp*`)
>
> so there is always a copy involved.
>
> Now, a conceptual copy doesn't mean a real copy in R, since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature, which means it can afford not to duplicate if there is only one reference -- then it's safe not to duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.
>
> Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain, so there will always be two references and duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around it (unless we handle subassignment methods in some special way).
I don't believe dispatch is bumping NAMED (and a quick experiment
seems to confirm this though I don't guarantee I did that right). The
issue is that a replacement function implemented as a closure, which
is the only option for a package, will always see NAMED on the object
to be modified as 2 (because the value is obtained by forcing the
argument promise) and so any R level assignments will duplicate.  This
also isn't really an issue of imprecise reference counting -- there
really are (at least) two legitimate references -- one through the
argument and one through the caller's environment.
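
For concreteness, a minimal sketch of that effect (the replacement
function name `setfirst<-` is made up purely for illustration, and this
assumes an R build with tracemem enabled):

`setfirst<-` <- function(x, value) { x[1] <- value; x }
v <- 1:10
tracemem(v)         # start tracing copies of v
setfirst(v) <- 42L  # tracemem reports a copy: forcing the argument
                    # promise means the closure sees x with NAMED==2,
                    # so x[1] <- value duplicates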

It would be good if we could come up with a way for packages to be
able to define replacement functions that do not duplicate in cases
where we really don't want them to, but this would require coming up
with some sort of protocol, minimally involving an efficient way to
detect whether a replacement function is being called in a replacement
context or directly.

There are some replacement functions that use C code to cheat, but
these may create problems if called directly, so I won't advertise
them.

Best,

luke

> [...]
--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: [datatable-help] speeding up perception

David Winsemius

On Jul 5, 2011, at 7:18 PM, <[hidden email]> wrote:

> [...]
>
> It would be good if we could come up with a way for packages to be
> able to define replacement functions that do not duplicate in cases
> where we really don't want them to, but this would require coming up
> with some sort of protocol, minimally involving an efficient way to
> detect whether a replacement function is being called in a replacement
> context or directly.

Would "$<-" always satisfy that condition? It would be a big help to me
if it could be designed to avoid duplicating the rest of the data.frame.
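
A quick tracemem sketch of the current behaviour I mean (a throwaway
base-R example, not specific to any package):

x <- data.frame(a = 1:100000, b = 1:100000)
tracemem(x)
x$a[1] <- 42L   # tracemem reports copying here too, so "$<-" as
                # currently dispatched does not seem to avoid it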

--

> [...]

David Winsemius, MD
West Hartford, CT


Re: [datatable-help] speeding up perception

Simon Urbanek
No subassignment function satisfies that condition, because you can always call it directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.

David, just to clarify - the data frame content is not copied; we are talking about the vector holding the columns.

Cheers,
Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius <[hidden email]> wrote:

> [...]
>
> Would "$<-" always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.
>
> [...]


Re: [datatable-help] speeding up perception

Matthew Dowle

On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
> No subassignment function satisfies that condition, because you can always call it directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.
>
> David, just to clarify - the data frame content is not copied; we are talking about the vector holding the columns.

If it is just the vector holding the columns that is copied (and not the
columns themselves), why does n make a difference in this test (on R
2.13.0)?

> n = 1000
> x = data.frame(a=1:n,b=1:n)
> system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
  0.628   0.000   0.628
> n = 100000
> x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
> system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
 20.145   1.232  21.455
>

With $<- :

> n = 1000
> x = data.frame(a=1:n,b=1:n)
> system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
  0.304   0.000   0.307
> n = 100000
> x = data.frame(a=1:n,b=1:n)
> system.time(for (i in 1:1000) x$a[1] <- 42L)
   user  system elapsed
 37.586   0.388  38.161
>

If it's because the 1st column (the only one assigned to in this test)
needs to be copied, that magnitude of slowdown doesn't seem consistent
with the time a plain vector copy of the 1st column takes :

> n=100000
> v = 1:n
> system.time(for (i in 1:1000) v[1] <- 42L)
   user  system elapsed
  0.016   0.000   0.017
> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
   user  system elapsed
  1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is
assigned to :

> n=100000
> x = data.frame(rep(list(1:n),100))
> dim(x)
[1] 100000    100
> system.time(for (i in 1:1000) x[1,1] <- 42L)
   user  system elapsed
167.974  50.903 219.711
>



> [...]


Re: [datatable-help] speeding up perception

Simon Urbanek
Interesting, and I stand corrected:

> x = data.frame(a=1:n,b=1:n)
> .Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...

> x[1,1]=42L
> .Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...

> x[[1]][1]=42L
> .Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...

> x[[1]][1]=42L
> .Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently it becomes necessary at some point ...

Cheers,
Simon


On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:

> [...]


Re: [datatable-help] speeding up perception

luke-tierney
On Wed, 6 Jul 2011, Simon Urbanek wrote:

> [...]
>
> I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently it becomes necessary at some point ...


The internals assume in various places that deep copies are made (one
of the reasons NAMED settings are not propagated to sub-structure).
The main issues are avoiding cycles and that there is no easy way to
check for sharing.  There may be some circumstances in which a shallow
copy would be OK, but making sure it would be in all cases is probably
more trouble than it is worth at this point. (I've tried this in the
past in a few cases and always had to back off.)


Best,

luke

> [...]

--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


Re: [datatable-help] speeding up perception

Matthew Dowle
Thanks for the replies and info. An attempt at fast
assign is now committed to data.table v1.6.3 on
R-Forge. From NEWS :

o   Fast update is now implemented, FR#200.
    DT[i,j]<-value is now handled by data.table in C rather
    than falling through to data.frame methods.
   
    Thanks to Ivo Welch for raising speed issues on r-devel,
    to Simon Urbanek for the suggestion, and Luke Tierney and
    Simon for information on R internals.

    [<- syntax still incurs one working copy of the whole
    table (as of R 2.13.0) due to R's [<- dispatch mechanism
    copying to `*tmp*`, so, for ultimate speed and brevity,
    'within' syntax is now available as follows.
       
o   A new 'within' argument has been added to [.data.table,
    by default TRUE. It is very similar to the within()
    function in base R. If an assignment appears in j, it
    assigns to the column of DT, by reference; e.g.,
         
    DT[i,colname<-value]
       
    This syntax makes no copies of any part of memory at all.
       
    > m = matrix(1,nrow=100000,ncol=100)
    > DF = as.data.frame(m)
    > DT = as.data.table(m)
    > system.time(for (i in 1:1000) DF[1,1] <- 3)
       user  system elapsed
    287.730 323.196 613.453
    > system.time(for (i in 1:1000) DT[1,V1 <- 3])
       user  system elapsed
      1.152   0.004   1.161         # 528 times faster

Please note :
       
    *******************************************************
    **  Within syntax is presently highly experimental.  **
    *******************************************************

http://datatable.r-forge.r-project.org/


On Wed, 2011-07-06 at 09:08 -0500, [hidden email] wrote:

> [...]


Re: [datatable-help] speeding up perception

Simon Urbanek
Matthew,

I was hoping I had misunderstood your first proposal, but I suspect I did not ;).

Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to evaluate to
{ V1 <- 3; DT[1, V1] }
thus returning the first element of the third column.
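
A quick base-R illustration of that expectation, with a plain data.frame
and a throwaway variable name j invented just for the example:

df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
df[1, j <- 3]   # evaluates j <- 3 in the caller, then df[1, 3]: returns 5
j               # 3 -- the assignment has leaked into the calling environment

which is exactly the { j <- 3; df[1, j] } reading.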

I do understand that within(foo, expr, ...) was the motivation for passing expressions, but unlike within() the subsetting operator [ is not expected to take an expression as its second argument. Such abuse is quite unexpected and I would say dangerous.

That said, I don't think it works, either. Taking your example and data.table from R-Forge:

> m = matrix(1,nrow=100000,ncol=100)
> DF = as.data.frame(m)
> DT = as.data.table(m)
> for (i in 1:1000) DT[1,V1 <- 3]
> DT[1,]
     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1

as you can see, DT is not modified.

Also I suspect there is something quite amiss because even trivial things don't work:

> DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
> DT[1:4,1:4]
[1] 1 2 3 4


When I first saw your proposal, I thought you had rather something like
within(DT, V1[1] <- 3)
in mind, which looks innocent enough but performs terribly (note that I had to scale down the loop by a factor of 100!!!):

> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

With the for loop inside, something like within(DT, for (i in 1:1000) V1[i] <- 3) performs reasonably:

> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because the expression is evaluated in a different environment so within() won't actually change the object in the  global environment - it also interacts with the possible duplication)
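
A minimal sketch of that caveat, on a plain data.frame:

df <- data.frame(V1 = 1:3)
within(df, V1[1] <- 3L)        # returns (and prints) a modified copy ...
df$V1[1]                       # ... but df itself still holds 1
df <- within(df, V1[1] <- 3L)  # the change only sticks on reassignment

so a system.time() around within() alone excludes the cost of binding
the result back, and leaves the global object untouched.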

Cheers,
Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

> Thanks for the replies and info. An attempt at fast
> assign is now committed to data.table v1.6.3 on
> R-Forge. From NEWS :
>
> o   Fast update is now implemented, FR#200.
>    DT[i,j]<-value is now handled by data.table in C rather
>    than falling through to data.frame methods.
>
>    Thanks to Ivo Welch for raising speed issues on r-devel,
>    to Simon Urbanek for the suggestion, and Luke Tierney and
>    Simon for information on R internals.
>
>    [<- syntax still incurs one working copy of the whole
>    table (as of R 2.13.0) due to R's [<- dispatch mechanism
>    copying to `*tmp*`, so, for ultimate speed and brevity,
>    'within' syntax is now available as follows.
>
> o   A new 'within' argument has been added to [.data.table,
>    by default TRUE. It is very similar to the within()
>    function in base R. If an assignment appears in j, it
>    assigns to the column of DT, by reference; e.g.,
>
>    DT[i,colname<-value]
>
>    This syntax makes no copies of any part of memory at all.
>
>> m = matrix(1,nrow=100000,ncol=100)
>> DF = as.data.frame(m)
>> DT = as.data.table(m)
>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>       user  system elapsed
>    287.730 323.196 613.453
>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>       user  system elapsed
>      1.152   0.004   1.161         # 528 times faster
>
> Please note :
>
>    *******************************************************
>    **  Within syntax is presently highly experimental.  **
>    *******************************************************
>
> http://datatable.r-forge.r-project.org/
>
>
> On Wed, 2011-07-06 at 09:08 -0500, [hidden email] wrote:
>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>
>>> Interesting, and I stand corrected:
>>>
>>>> x = data.frame(a=1:n,b=1:n)
>>>> .Internal(inspect(x))
>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>> @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>> @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>> x[1,1]=42L
>>>> .Internal(inspect(x))
>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>> @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>> @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>> x[[1]][1]=42L
>>>> .Internal(inspect(x))
>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>> @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>> @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>> x[[1]][1]=42L
>>>> .Internal(inspect(x))
>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>> @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>> @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>
>>> I have R to release ;) so I won't be looking into this right now, but it's something worth investigating ... Since all the inner contents have NAMED=0 I would not expect any duplication to be needed, but apparently becomes so is at some point ...
>>
>>
>> The internals assume in various places that deep copies are made (one
>> of the reasons NAMED setings are not propagated to sub-sturcture).
>> The main issues are avoiding cycles and that there is no easy way to
>> check for sharing.  There may be some circumstances in which a shallow
>> copy would be OK but making sure it would be in all cases is probably
>> more trouble than it is worth at this point. (I've tried this in the
>> past in a few cases and always had to back off.)
>>
>>
>> Best,
>>
>> luke
>>
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>
>>>>
>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>> No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.
>>>>>
>>>>> David, just to clarify - the data frame content is not copied; we are talking about the vector holding the columns.
>>>>
>>>> If it is just the vector holding the columns that is copied (and not the
>>>> columns themselves), why does n make a difference in this test (on R
>>>> 2.13.0)?
>>>>
>>>>> n = 1000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>  user  system elapsed
>>>> 0.628   0.000   0.628
>>>>> n = 100000
>>>>> x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>  user  system elapsed
>>>> 20.145   1.232  21.455
>>>>>
>>>>
>>>> With $<- :
>>>>
>>>>> n = 1000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>  user  system elapsed
>>>> 0.304   0.000   0.307
>>>>> n = 100000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>  user  system elapsed
>>>> 37.586   0.388  38.161
>>>>>
>>>>
>>>> If it's because only the 1st column needs to be copied (since that's
>>>> the one being assigned to in this test), that magnitude of slowdown
>>>> doesn't seem consistent with the time of a vector copy of the 1st
>>>> column :
>>>>
>>>>> n=100000
>>>>> v = 1:n
>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>  user  system elapsed
>>>> 0.016   0.000   0.017
>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>  user  system elapsed
>>>> 1.816   1.076   2.900
>>>>
>>>> Finally, increasing the number of columns, again only the 1st is
>>>> assigned to :
>>>>
>>>>> n=100000
>>>>> x = data.frame(rep(list(1:n),100))
>>>>> dim(x)
>>>> [1] 100000    100
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>  user  system elapsed
>>>> 167.974  50.903 219.711
>>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <[hidden email]> wrote:
>>>>>
>>>>>>
>>>>>> On Jul 5, 2011, at 7:18 PM, <[hidden email]> <[hidden email]> wrote:
>>>>>>
>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>
>>>>>>>>> Simon (and all),
>>>>>>>>>
>>>>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>>>>> copying x?
>>>>>>>>
>>>>>>>> Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to
>>>>>>>>
>>>>>>>> `*tmp*` <- x
>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>> rm(`*tmp*`)
>>>>>>>>
>>>>>>>> so there is always a copy involved.
>>>>>>>>
>>>>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.
>>>>>>>>
>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain, so there will always be two references and duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around it (unless we handle subassignment methods in some special way).
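>>>>>>>>
>>>>>>>> A rough sketch of the matrix case (untested; tracemem needs a build
>>>>>>>> with memory profiling):
>>>>>>>>
>>>>>>>> m <- matrix(0, 1000, 1000)
>>>>>>>> tracemem(m)
>>>>>>>> m[1,1] <- 1   # may report a single copy, depending on NAMED here
>>>>>>>> m[1,1] <- 2   # should report nothing: one reference only, so the
>>>>>>>>               # default method can modify in place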
>>>>>>>
>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>> seems to confirm this though I don't guarantee I did that right). The
>>>>>>> issue is that a replacement function implemented as a closure, which
>>>>>>> is the only option for a package, will always see NAMED on the object
>>>>>>> to be modified as 2 (because the value is obtained by forcing the
>>>>>>> argument promise) and so any R level assignments will duplicate.  This
>>>>>>> also isn't really an issue of imprecise reference counting -- there
>>>>>>> really are (at least) two legitimate references -- one through the
>>>>>>> argument and one through the caller's environment.
>>>>>>>
>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>> able to define replacement functions that do not duplicate in cases
>>>>>>> where we really don't want them to, but this would require coming up
>>>>>>> with some sort of protocol, minimally involving an efficient way to
>>>>>>> detect whether a replacement function is being called in a replacement
>>>>>>> context or directly.
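>>>>>>>
>>>>>>> A toy illustration (sketch only; `first<-` is a made-up name):
>>>>>>>
>>>>>>> `first<-` <- function(x, value) { x[[1]][1] <- value; x }
>>>>>>> d <- data.frame(a=1:3)
>>>>>>> tracemem(d)
>>>>>>> first(d) <- 42L  # forcing the promise means NAMED(x) == 2 inside the
>>>>>>>                  # closure, so the R-level assignment duplicates d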
>>>>>>
>>>>>> Would "$<-" always satisfy that condition. It would be big help to me if it could be designed to avoid duplication the rest of the data.frame.
>>>>>>
>>>>>> --
>>>>>>
>>>>>>>
>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>> them.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> luke
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Simon
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Luke Tierney
>>>>>>> Statistics and Actuarial Science
>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>> University of Iowa                  Phone:             319-335-3386
>>>>>>> Department of Statistics and        Fax:               319-335-3017
>>>>>>> Actuarial Science
>>>>>>> 241 Schaeffer Hall                  email:      [hidden email]
>>>>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>>>> ______________________________________________
>>>>>>> [hidden email] mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>
>>>>>> David Winsemius, MD
>>>>>> West Hartford, CT
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> Luke Tierney
>> Statistics and Actuarial Science
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>    Actuarial Science
>> 241 Schaeffer Hall                  email:      [hidden email]
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>
>
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel