which() vs. just logical selection in df


which() vs. just logical selection in df

kman-4
Hi R-helpers,

Does anyone know why adding which() makes the select call more
efficient than just using logical selection in a dataframe? Doesn't
which() technically add another conversion/function call on top of the
logical selection? Here is a reproducible example with a slight
difference in timing.

# Surrogate data - the timing here isn't interesting
urltext <- paste("https://drive.google.com/",
                 "uc?id=1AZ-s1EgZXs4M_XF3YYEaKjjMMvRQ7",
                 "-h8&export=download", sep="")
download.file(url=urltext, destfile="tempfile.csv") # download file first
dat <- read.csv("tempfile.csv", stringsAsFactors = FALSE, header=TRUE,
                nrows=2.5e6) # read the file; 'nrows' is a slight overestimate
dat <- dat[,1:3] # select just the first 3 columns
head(dat, 10) # print the first 10 rows

# Select using which() as the final step ~ 90ms total time on my macbook air
system.time(
  head(
    dat[which(dat$gender2=="other"), ]),
  gcFirst=TRUE)

# Select skipping which() ~130ms total time
system.time(
  head(
    dat[dat$gender2=="other", ]),
  gcFirst=TRUE)

I would have expected the second version, without which(), to be more
efficient. However, every time I run these, the first version, with
which(), is faster by about 20ms of system time and 20ms of
user time. Does anyone know why this is?

Cheers!
Keith

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: which() vs. just logical selection in df

glsnow
I would suggest using the microbenchmark package to do the time
comparison.  It runs each expression many times, which gives a more
meaningful comparison than a single system.time() call.
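As a rough sketch of what that comparison could look like, the following uses a small synthetic data frame in place of the downloaded file. The column values, the row count, and the helper names with_which() and without_which() are assumptions made for illustration, not part of the original post.

```r
# Synthetic stand-in for the poster's data (sizes and values are assumptions).
set.seed(1)
n <- 1e5
dat <- data.frame(
  y       = rnorm(n),
  gender2 = sample(c("female", "male", "other"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helpers wrapping the two selection styles from the question.
with_which    <- function() dat[which(dat$gender2 == "other"), ]
without_which <- function() dat[dat$gender2 == "other", ]

# microbenchmark is a CRAN package, so only run it if it is installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(with_which(), without_which(),
                                       times = 20L))
}
```

Putting both expressions in a single microbenchmark() call reports them side by side in one table, which makes the comparison easier to read.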

One possible reason for the difference is the number of missing values
in your data (along with the number of columns).  Consider the
difference in the following results:

> x <- c(1,2,NA)
> x[x==1]
[1]  1 NA
> x[which(x==1)]
[1] 1
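The same effect shows up when selecting rows of a data frame, which is the case in the original question. A minimal sketch, using a made-up three-row data frame (the column name label is an assumption for illustration):

```r
df <- data.frame(x = c(1, 2, NA), label = c("a", "b", "c"))

# Logical index: the NA produced by the comparison yields an all-NA row.
by_logical <- df[df$x == 1, ]

# which() drops the NA, so only the genuine match is returned.
by_which <- df[which(df$x == 1), ]

nrow(by_logical)  # 2
nrow(by_which)    # 1
```

With many missing values and several columns, the logical form has to build those extra NA rows, which is one way the two calls can differ in cost as well as in result.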






--
Gregory (Greg) L. Snow Ph.D.
[hidden email]


Re: which() vs. just logical selection in df

kman-4
In reply to this post by kman-4
Hi Dr. Snow, & R-helpers,

Thank you for your reply! I hadn't heard of the {microbenchmark}
package & was excited to try it! Thank you for the suggestion! I did
check the source for which() beforehand, which includes a step that
removes NAs, and I didn't have any missing values or NAs:

sum(is.na(dat$gender2))
sum(is.na(dat$gender))
sum(is.na(dat$y))

[1] 0
[1] 0
[1] 0

I still had a 10ms difference in the value returned by microbenchmark
between the following methods: one with and one without using which().
The difference is reversed from what I expected, since which() is an
extra step.

library(microbenchmark)

microbenchmark(
  head(
    dat[which(dat$gender2=="other"), ]), times=100L)
microbenchmark(
  head(
    dat[dat$gender2=="other", ]), times=100L)

                                                 min        lq      mean
head(dat[which(dat$gender2 == "other"), ])  62.93803  74.25939   88.4704
head(dat[dat$gender2 == "other", ])         71.8914   87.95844  103.7231

Is which() invoking C-level code by chance, making it slightly faster
on average? The difference likely becomes important on terabytes of
data. The addition of which() still seems superfluous to me, and I'd
like to know whether it's considered best practice to keep it. What is
R invoking when which() isn't called explicitly? Is R invoking which()
eventually anyway?

Cheers!
Keith



Re: which() vs. just logical selection in df

Bert Gunter-2
Inline.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Oct 14, 2020 at 3:23 PM 1/k^c <[hidden email]> wrote:

> Is which() invoking c-level code by chance, making it slightly faster
> on average?

You do not need to ask such questions. R is open source, so just look!

> which
function (x, arr.ind = FALSE, useNames = TRUE)
{
    wh <- .Internal(which(x))   ## C code
    if (arr.ind && !is.null(d <- dim(x)))
        arrayInd(wh, d, dimnames(x), useNames = useNames)
    else wh
}
<bytecode: 0x7fcdba0b8e80>
<environment: namespace:base>
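To make the printed source a little more concrete, here is a small sketch of what which() returns for a logical vector and for a logical matrix. The variable names are made up for illustration.

```r
# which() returns the integer positions of the TRUE elements;
# NA and FALSE are both dropped.
flags <- c(TRUE, NA, FALSE, TRUE)
which(flags)                      # 1 4

# For a matrix, positions are counted column by column by default.
m <- matrix(c(TRUE, FALSE, FALSE, TRUE), nrow = 2)
which(m)                          # 1 4

# With arr.ind = TRUE the arrayInd() branch converts the positions
# into row/column pairs.
pos <- which(m, arr.ind = TRUE)
pos
```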


Re: which() vs. just logical selection in df

kman-4
Hi Bert,

Thank you very much! I was unaware that .Internal() referred to C code.

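I think I figured out the difference. which() reduces the index to
just the positions of the matching records before the subsetting step,
so dat[index2,] works from a 666,667-element integer index. With
logical indexing, the full 2,000,000-element logical vector is passed
to the subsetting step, and the reduction to the matching records
happens last.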

> length(index1<-dat$gender2=="other")
[1] 2000000
> length(index2<-which(index1))
[1] 666667
> nrow(dat[index1,])   # nrow(), since length() of a data frame counts columns
[1] 666667
> nrow(dat[index2,])
[1] 666667

# Build the logical index: 2e6 records, ~13ms.
microbenchmark(index1<-dat$gender2=="other", times=100L)
# Extra time for which(): ~5ms.
microbenchmark(index2<-which(index1), times=100L)
# Time to return just the TRUE records using the whole 2e6 index: ~99ms.
microbenchmark(dat[index1,], times=100L)
# Time to return all records from the shorter index: ~64ms.
microbenchmark(dat[index2,], times=100L)
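For anyone who wants to try this decomposition without downloading the file, here is a rough sketch on synthetic data. The row count, the column contents, and the use of base system.time() in place of microbenchmark are all assumptions made for illustration; timings will vary by machine, so none are quoted.

```r
set.seed(42)
n <- 3e5  # smaller than the 2e6 rows in the thread, to keep it quick
dat <- data.frame(
  y       = rnorm(n),
  gender  = sample(c("f", "m"), n, replace = TRUE),
  gender2 = rep(c("female", "male", "other"), length.out = n),
  stringsAsFactors = FALSE
)

index1 <- dat$gender2 == "other"   # logical, one element per row
index2 <- which(index1)            # integer positions of the TRUE elements

length(index1)                     # n
length(index2)                     # n / 3

# Time only the subsetting step for each kind of index.
t_logical <- system.time(for (i in 1:10) dat[index1, ])[["elapsed"]]
t_integer <- system.time(for (i in 1:10) dat[index2, ])[["elapsed"]]
c(logical = t_logical, integer = t_integer)
```

Both indexes select the same rows here because the column has no NAs; only the cost of the subsetting step is being compared.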

Cheers,
Keith

