dplyr: summarise across using variable names and a condition

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

dplyr: summarise across using variable names and a condition

R help mailing list-2
Hello All,

Would like to be able to summarize across in dplyr using variable names and a condition. Below is an example "have" data set followed by an example "need" data set. After that, I've got a vector of numeric variable names. After that, I've got the very humble beginnings of a dplyr-based solution.

What I think I need to be able to do is to submit my variable names to dplyr and then to have a conditional function. If the variable is is in my list of names, calculate the mean and the std. If not, then calculate the mean but label it as a proportion. The question is how to do that. It appears that using variable names might involve !!, or possibly enquo, or possibly quo, but I haven't had much success with these. I imagine I might have been very close but not quite have gotten it. The conditional part seems less difficult but I'm not quite sure how to do that either.

Help with this would be greatly appreciated.

Thanks,

Paul


have <- structure(list(
        ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
                 "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
        age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
                 73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
        age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
                 71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
        gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
                        1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
        gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
                          0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
        race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
                       1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
        race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                       0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
        race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
                       0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
        row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))
 

need <-structure(list(
       age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
       age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
       gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
       race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
       race_other_prop = 0.192307692307692),
       row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

vars_num <-  c("age1", "age2")

library(magrittr)
library(dplyr)

have %>%
  summarise(across(
  .cols = !contains("ptno"),
  .fns = list(mean = mean, std = sd),
  .names = "{col}_{fn}"
))

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: dplyr: summarise across using variable names and a condition

Rui Barradas
Hello,

Here is a way of doing what the question asks for. There might be
others, simpler, but this one works.

have %>%
   summarise(across(
     .cols = !contains("ptno"),
     .fns = list(mean = mean, std = sd),
     .names = "{col}_{fn}"
   )) %>%
   select(
     -matches("^gender_.*_std$"),
     -matches("^race_.*_std$")
   ) %>%
   rename_with(
     .cols = matches("^gender|^race"),
     ~sub("mean$", "prop", .x)
   ) %>%
   all.equal(need)
#[1] TRUE


Hope this helps,

Rui Barradas

Às 13:47 de 26/03/21, Paul Miller via R-help escreveu:

> Hello All,
>
> Would like to be able to summarize across in dplyr using variable names and a condition. Below is an example "have" data set followed by an example "need" data set. After that, I've got a vector of numeric variable names. After that, I've got the very humble beginnings of a dplyr-based solution.
>
> What I think I need to be able to do is to submit my variable names to dplyr and then to have a conditional function. If the variable is is in my list of names, calculate the mean and the std. If not, then calculate the mean but label it as a proportion. The question is how to do that. It appears that using variable names might involve !!, or possibly enquo, or possibly quo, but I haven't had much success with these. I imagine I might have been very close but not quite have gotten it. The conditional part seems less difficult but I'm not quite sure how to do that either.
>
> Help with this would be greatly appreciated.
>
> Thanks,
>
> Paul
>
>
> have <- structure(list(
>          ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
>                   "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
>          age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
>                   73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
>          age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
>                   71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
>          gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
>                          1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
>          gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
>                            0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
>          race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
>                         1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
>          race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>                         0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
>          race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
>                         0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
>          row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))
>  
>
> need <-structure(list(
>         age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
>         age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
>         gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
>         race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
>         race_other_prop = 0.192307692307692),
>         row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
>
> vars_num <-  c("age1", "age2")
>
> library(magrittr)
> library(dplyr)
>
> have %>%
>    summarise(across(
>    .cols = !contains("ptno"),
>    .fns = list(mean = mean, std = sd),
>    .names = "{col}_{fn}"
> ))
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Dynamically defining dplyr across statements (was dplyr: summarise across using variable names and a condition)

R help mailing list-2
Hello All,

Thanks Rui for your response to my question. I agree that it is possible to use a workaround. Get most of what you want and then tidy it up afterwards. I too have a workaround that I have pasted below. I wanted to avoid that initially. I felt I was only using a workaround because I hadn't yet figured out how to use the dplyr software properly.

Determined how to summarise across conditionally during the weekend. That led me to rename my question as learning this changed the nature of the problem.

Below are my "have" and "need" data sets from before, for which I've reordered columns. After that, are some vectors of variable names that are already defined in my code and which I thought might be helpful in producing a solution. After that, is dplyr code that summarizes across conditionally. If the data being submitted to this code were always going to be the same, this would work perfectly. That's not the case though. So the across statements that are needed will be data dependent. Last, I've pasted my version of a workaround. This should work for any dataset.

Ideally, I'd like to get a solution that builds on the summarise across code below. It seems likely that would involve dynamically creating the various across statements though, and that might wind up being a lot more complicated and verbose than my workaround. Another possibility might be to do this in one pass using non-dplyr code. If neither of those options works out, it may be that the workaround is actually the way to go.

Thanks,

Paul

#### Have and need data ####

library(magrittr)
library(dplyr)

have <- structure(list(
  ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
           "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
           73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
  age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
           71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
  gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
                  1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
  gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
                    0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
  race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
                 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
                 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))

have <- have %>%
  select(ptno, age1, gender_male, gender_female, age2, everything())

need <-structure(list(
  age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
  age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
  gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
  race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
  race_other_prop = 0.192307692307692),
  row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

need <- need %>%
  select(age1_mean, age1_std, gender_male_prop, gender_female_prop, age2_mean, age2_std, everything())

#### Vectors of variable names ####

vars_num <-  c("age1", "age2")
vars_dmy <-  c("gender", "race")
vars_all <-  c("age1", "age2","gender", "race")

#### dplyr conditional summarize across ####

have %>%
  summarize(
     across(2:2, list(mean = mean, std = sd)),
     across(3:4, list(prop = mean)),
     across(5:5, list(mean = mean, std = sd)),
     across(6:8, list(prop = mean))
  ) %>%
  all.equal(need) 

#### Workaround ####

have %>%
  summarise(across(
     .cols = !contains("chai_patient_id"),
     .fns = list(mean = mean, std = sd),
     .names = "{col}_{fn}"
  )) %>%
  select(starts_with(vars_num) | ends_with("mean")) %>%
  rename_at(vars(!starts_with(vars_num)), list(~ str_replace(., "mean$", "prop"))) %>%
  all.equal(need)


On Friday, March 26, 2021, 1:08:58 p.m. EDT, Rui Barradas <[hidden email]> wrote:

Hello,

Here is a way of doing what the question asks for. There might be
others, simpler, but this one works.

have %>%
  summarise(across(
    .cols = !contains("ptno"),
    .fns = list(mean = mean, std = sd),
    .names = "{col}_{fn}"
  )) %>%
  select(
    -matches("^gender_.*_std$"),
    -matches("^race_.*_std$")
  ) %>%
  rename_with(
    .cols = matches("^gender|^race"),
    ~sub("mean$", "prop", .x)
  ) %>%
  all.equal(need)
#[1] TRUE


Hope this helps,

Rui Barradas

Às 13:47 de 26/03/21, Paul Miller via R-help escreveu:

> Hello All,
>
> Would like to be able to summarize across in dplyr using variable names and a condition. Below is an example "have" data set followed by an example "need" data set. After that, I've got a vector of numeric variable names. After that, I've got the very humble beginnings of a dplyr-based solution.
>
> What I think I need to be able to do is to submit my variable names to dplyr and then to have a conditional function. If the variable is is in my list of names, calculate the mean and the std. If not, then calculate the mean but label it as a proportion. The question is how to do that. It appears that using variable names might involve !!, or possibly enquo, or possibly quo, but I haven't had much success with these. I imagine I might have been very close but not quite have gotten it. The conditional part seems less difficult but I'm not quite sure how to do that either.
>
> Help with this would be greatly appreciated.
>
> Thanks,
>
> Paul
>
>
> have <- structure(list(
>          ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
>                   "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
>          age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
>                   73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
>          age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
>                   71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
>          gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
>                          1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
>          gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
>                            0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
>          race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
>                         1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
>          race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
>                         0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
>          race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
>                         0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
>          row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))

>
> need <-structure(list(
>         age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
>         age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
>         gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
>         race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
>         race_other_prop = 0.192307692307692),
>         row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
>
> vars_num <-  c("age1", "age2")
>
> library(magrittr)
> library(dplyr)
>
> have %>%
>    summarise(across(
>    .cols = !contains("ptno"),
>    .fns = list(mean = mean, std = sd),
>    .names = "{col}_{fn}"

> ))
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.