Strange behavior when sampling rows of a data frame


Sébastien Lahaie
I ran into some strange behavior in R when trying to assign a treatment to
rows in a data frame. I'm wondering whether any R experts can explain
what's going on.

First, let's assign a treatment to 3 out of 10 rows as follows.

> df <- data.frame(unit = 1:10)
> df$treated <- FALSE
> s <- sample(nrow(df), 3)
> df[s,]$treated <- TRUE
> df
   unit treated
1     1   FALSE
2     2    TRUE
3     3   FALSE
4     4   FALSE
5     5    TRUE
6     6   FALSE
7     7    TRUE
8     8   FALSE
9     9   FALSE
10   10   FALSE

This is as expected. Now we'll just skip the intermediate step of saving
the sampled indices, and apply the treatment directly as follows.

> df <- data.frame(unit = 1:10)
> df$treated <- FALSE
> df[sample(nrow(df), 3),]$treated <- TRUE
> df
   unit treated
1     6    TRUE
2     2   FALSE
3     3   FALSE
4     9    TRUE
5     5   FALSE
6     6   FALSE
7     7   FALSE
8     5    TRUE
9     9   FALSE
10   10   FALSE

Now the data frame still has 10 rows with 3 assigned to the treatment. But
the units are garbled. Units 1 and 4 have disappeared, for instance, and
there are duplicates for 6 and 9, one assigned to treatment and the other
to control. Why would this happen?

Thanks,
Sebastien

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Strange behavior when sampling rows of a data frame

Rui Barradas
Hello,

I don't have an answer as to why this happens, but it seems like
a bug. But where?

In `[<-.data.frame` or in `[<-.default`?

A solution is to subset and assign the vector:


set.seed(2020)
df2 <- data.frame(unit = 1:10)
df2$treated <- FALSE

df2$treated[sample(nrow(df2), 3)] <- TRUE
df2
#  unit treated
#1     1   FALSE
#2     2   FALSE
#3     3   FALSE
#4     4   FALSE
#5     5   FALSE
#6     6    TRUE
#7     7    TRUE
#8     8    TRUE
#9     9   FALSE
#10   10   FALSE


Or


set.seed(2020)
df3 <- data.frame(unit = 1:10)
df3$treated <- FALSE

df3[sample(nrow(df3), 3), "treated"] <- TRUE
df3
# result as expected


Hope this helps,

Rui  Barradas




Re: Strange behavior when sampling rows of a data frame

R help mailing list-2
The first subscript argument is getting evaluated twice.
> trace(sample)
> set.seed(2020); df[i<-sample(10,3), ]$Treated <- TRUE
trace: sample(10, 3)
trace: sample(10, 3)
> i
[1]  1 10  4
> set.seed(2020); sample(10,3)
trace: sample(10, 3)
[1] 7 6 8
> sample(10,3)
trace: sample(10, 3)
[1]  1 10  4
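
The same double evaluation can be seen without trace(). This is only a sketch: draw_rows and the calls counter are hypothetical helpers introduced here for illustration, not anything from base R.

```r
# Hypothetical helper: draws row indices and counts how often it is called.
calls <- 0
draw_rows <- function(n, k) {
  calls <<- calls + 1
  sample(n, k)
}

# Index expression written inline inside the complex assignment.
df <- data.frame(unit = 1:10, treated = FALSE)
df[draw_rows(nrow(df), 3), ]$treated <- TRUE
calls_complex <- calls   # number of times the index expression was evaluated

# Index drawn once, saved, and then reused.
calls <- 0
df <- data.frame(unit = 1:10, treated = FALSE)
s <- draw_rows(nrow(df), 3)
df[s, ]$treated <- TRUE
calls_saved <- calls

print(c(complex = calls_complex, saved = calls_saved))
```

Saving the index first, as in the original poster's first version, means the random draw happens once and the same rows are read and written back.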

Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: Strange behavior when sampling rows of a data frame

Rui Barradas
Hello,


Thanks, I hadn't thought of that.

But why? Is it evaluated once before the assignment and a second time
when the assignment occurs?

Tracing both sample and `[<-` shows 2 calls to sample.


trace(sample)
trace(`[<-`)
df[sample(nrow(df), 3),]$treated <- TRUE
trace: sample(nrow(df), 3)
trace: `[<-`(`*tmp*`, sample(nrow(df), 3), , value = list(unit = c(7L,
6L, 8L), treated = c(TRUE, TRUE, TRUE)))
trace: sample(nrow(df), 3)


Regards,

Rui Barradas



Re: Strange behavior when sampling rows of a data frame

R help mailing list-2
It is a bug that has been present in R since at least R-2.14.0 (the oldest
that I have installed on my laptop).

Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: [External] Re: Strange behavior when sampling rows of a data frame

luke-tierney
The behavior has been there much longer than that in R and it's been a
known issue with complex assignment for a long time (not the only
one). You're in a better position than I to know how Splus handles this.

The complex assignment expression

     df[<index>, ]$treated <- TRUE

is basically evaluated as

     tmp <- df[<index>, ]
     tmp$treated <- TRUE
     df[<index>, ] <- tmp

So the <index> argument is evaluated twice. This is always a little
inefficient, and it is probably not what you want if the index argument
has side effects. So the main take-away is:

     Don't use index arguments with side effects in complex assignments.
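
To make the effect reproducible without random numbers, here is a small sketch of the expansion written out by hand. The function next_rows is a hypothetical stand-in for an index expression with a side effect: it returns rows 1:3 on its first call and rows 8:10 on any later call.

```r
# Hypothetical index function with a side effect: a different answer per call.
n_calls <- 0
next_rows <- function() {
  n_calls <<- n_calls + 1
  if (n_calls == 1) 1:3 else 8:10
}

# The three steps written out by hand.
df <- data.frame(unit = 1:10, treated = FALSE)
tmp <- df[next_rows(), ]      # first evaluation: reads rows 1:3
tmp$treated <- TRUE
df[next_rows(), ] <- tmp      # second evaluation: writes over rows 8:10
by_hand <- df

# The same operation as a single complex assignment.
n_calls <- 0
df <- data.frame(unit = 1:10, treated = FALSE)
df[next_rows(), ]$treated <- TRUE
one_line <- df

print(by_hand)
print(one_line)
```

In both cases rows 8 to 10 end up holding copies of units 1 to 3 marked as treated, which is the same kind of garbling as in the original report: rows are read from one place and written back to another.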

It is in principle possible, when standard evaluation is in use, to
capture the value of <index> from the first evaluation and re-use for
the second. But, for better or worse, assignment methods can and do
use non-standard evaluation for the index arguments, and it would be
very hard for authors of such methods to avoid this. So changing to
avoid multiple index evaluation would always have to come with an
asterisk.

There are other issues with complex assignment as implemented
currently that have higher priority but are also quite tricky to
address. Possibly this one can be addressed at the same time.

Best,

luke


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   [hidden email]
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Re: Strange behavior when sampling rows of a data frame

Daniel Nordlund-3
In reply to this post by Sébastien Lahaie
Sébastien,

You have received good explanations of what is going on with your code. 
I think you can get what you want by making a simple modification of
your treatment assignment statement. At least it works for me.

df[sample(nrow(df),3), 'treated'] <- TRUE

Hope this is helpful,

Dan

--
Daniel Nordlund
Port Townsend, WA  USA


Re: Strange behavior when sampling rows of a data frame

Sébastien Lahaie
Thank you all for the responses, these are the insights I was hoping for.
There are many ways to get this right, and I happened to run into one that
has a glitch. I see from Luke's explanation how the strange output came
about. Glad to hear that this bug/behavior is already known.
