Summarizing select columns in a data frame

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Summarizing select columns in a data frame

Bernard McGarvey
I have a data frame that consists of several factor columns say A, B, C, D, and E and several columns containing numerical data, say X1, X2, .... X10. I would like to create statistics of some of the numerical columns by some of the factor columns. For example,

Calculate the mean, min, and max of variables X1 and X7, by factors A, and E. The results should look like the table below:

Factor A Factor E     mean(X1) min(x1) max(X1) mean(X7) min(x7) max(X7) mean(X10) min(x10) max(X10)
A1        E1
A1        E2
A1        E3
A2        E1
A2        E2
A2        E3

I would like the results to be returned to a data frame or other object that I can write out using the write.csv function. I have looked at the summarize and numSummary functions but they do not appear to be flexible enough to do the above.

Any help would be appreciated,

Thanks

Bernard McGarvey
Director, Fort Myers Beach Lions Foundation, Inc.
Retired (Lilly Engineering Fellow).

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Summarizing select columns in a data frame

Bert Gunter-2
There are literally tons of ways to do this sort of thing in R.

In base R ?tapply and friends, especially ?ave and ?by that may be close to
what you want.
But there is a whole parallel universe -- the so-called "tidyverse set of
packages -- that many folks prefer.
This link takes you down that rabbit hole: https://dplyr.tidyverse.org/

There are still others (e.g. the data.table package). You should expect to
invest a little time in learning whichever you choose. You may wish to also
search a bit for tutorials on your choice -- there are many good ones out
there.


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jan 17, 2021 at 12:18 PM Bernard McGarvey <
[hidden email]> wrote:

> I have a data frame that consists of several factor columns say A, B, C,
> D, and E and several columns containing numerical data, say X1, X2, ....
> X10. I would like to create statistics of some of the numerical columns by
> some of the factor columns. For example,
>
> Calculate the mean, min, and max of variables X1 and X7, by factors A, and
> E. The results should look like the table below:
>
> Factor A Factor E     mean(X1) min(x1) max(X1) mean(X7) min(x7) max(X7)
> mean(X10) min(x10) max(X10)
> A1        E1
> A1        E2
> A1        E3
> A2        E1
> A2        E2
> A2        E3
>
> I would like the results to be returned to a data frame or other object
> that I can write out using the write.csv function. I have looked at the
> summarize and numSummary functions but they do not appear to be flexible
> enough to do the above.
>
> Any help would be appreciated,
>
> Thanks
>
> Bernard McGarvey
> Director, Fort Myers Beach Lions Foundation, Inc.
> Retired (Lilly Engineering Fellow).
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Summarizing select columns in a data frame

Bernard McGarvey
Thanks Bert

Bernard
Sent from my iPhone so please excuse the spelling!"

> On Jan 17, 2021, at 3:48 PM, Bert Gunter <[hidden email]> wrote:
>
> 
> There are literally tons of ways to do this sort of thing in R.
>
> In base R ?tapply and friends, especially ?ave and ?by that may be close to what you want.
> But there is a whole parallel universe -- the so-called "tidyverse set of packages -- that many folks prefer.
> This link takes you down that rabbit hole: https://dplyr.tidyverse.org/
>
> There are still others (e.g. the data.table package). You should expect to invest a little time in learning whichever you choose. You may wish to also search a bit for tutorials on your choice -- there are many good ones out there.
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>> On Sun, Jan 17, 2021 at 12:18 PM Bernard McGarvey <[hidden email]> wrote:
>> I have a data frame that consists of several factor columns say A, B, C, D, and E and several columns containing numerical data, say X1, X2, .... X10. I would like to create statistics of some of the numerical columns by some of the factor columns. For example,
>>
>> Calculate the mean, min, and max of variables X1 and X7, by factors A, and E. The results should look like the table below:
>>
>> Factor A Factor E     mean(X1) min(x1) max(X1) mean(X7) min(x7) max(X7) mean(X10) min(x10) max(X10)
>> A1        E1
>> A1        E2
>> A1        E3
>> A2        E1
>> A2        E2
>> A2        E3
>>
>> I would like the results to be returned to a data frame or other object that I can write out using the write.csv function. I have looked at the summarize and numSummary functions but they do not appear to be flexible enough to do the above.
>>
>> Any help would be appreciated,
>>
>> Thanks
>>
>> Bernard McGarvey
>> Director, Fort Myers Beach Lions Foundation, Inc.
>> Retired (Lilly Engineering Fellow).
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Summarizing select columns in a data frame

David Winsemius
In reply to this post by Bernard McGarvey

On 1/17/21 12:15 PM, Bernard McGarvey wrote:

> I have a data frame that consists of several factor columns say A, B, C, D, and E and several columns containing numerical data, say X1, X2, .... X10. I would like to create statistics of some of the numerical columns by some of the factor columns. For example,
>
> Calculate the mean, min, and max of variables X1 and X7, by factors A, and E. The results should look like the table below:
>
> Factor A Factor E     mean(X1) min(x1) max(X1) mean(X7) min(x7) max(X7) mean(X10) min(x10) max(X10)
> A1        E1
> A1        E2
> A1        E3
> A2        E1
> A2        E2
> A2        E3
>
> I would like the results to be returned to a data frame or other object that I can write out using the write.csv function. I have looked at the summarize and numSummary functions but they do not appear to be flexible enough to do the above.


The `aggregate` function will do the subsetting and function application.

 > dfrm <- cbind(dfrm, matrix(rnorm(600), ncol=10 ) ); names(dfrm)[3:12]
<- paste0("X", 1:10)
 > str(dfrm)
'data.frame':    60 obs. of  12 variables:
  $ Factor_A: Factor w/ 2 levels "A1","A2": 1 1 1 2 2 2 1 1 1 2 ...
  $ Factor_B: Factor w/ 3 levels "E1","E2","E3": 1 2 3 1 2 3 1 2 3 1 ...
  $ X1      : num  -0.02116 -0.00049 0.12875 -0.05412 0.51886 ...
  $ X2      : num  1.6799 -0.0963 -0.5727 -0.3638 -0.322 ...
  $ X3      : num  -0.349 0.267 -0.666 -0.329 0.902 ...
  $ X4      : num  0.1125 -0.5384 0.0924 0.6849 -0.4194 ...
  $ X5      : num  -0.421 0.372 1.316 1.323 -0.03 ...
  $ X6      : num  -0.0767 1.4972 0.1967 -0.7092 -1.0943 ...
  $ X7      : num  0.1771 -0.2136 -1.0818 -0.0671 2.0015 ...
  $ X8      : num  1.456 -0.383 -0.47 0.965 0.569 ...
  $ X9      : num  -1.795 -0.4546 0.0069 1.2245 -0.395 ...
  $ X10     : num  -1.931 1.708 0.274 0.73 -0.995 ...



  aggregate(  dfrm[ ,  c("X1", "X7", "X10")],    # columns to analyze

                       dfrm[ c("Factor_A", "Factor_B")],  # classifying
columns

                       FUN=function (x) c(mn =mean(x), min=min(x),
max=max(x) ) )  # desired "summarizers"

#--- result----

   Factor_A Factor_B        X1.mn       X1.min       X1.max X7.mn     
X7.min      X7.max
1       A1       E1  0.187513792 -0.866094155  2.310960164 0.22489729
-0.91442493  1.94095786
2       A2       E1  0.078361707 -1.515410191  1.382420050 -0.51309155
-1.67026123  0.70869034
3       A1       E2 -0.267416858 -1.995131138  1.392115793 -0.04772929
-2.45426692  2.02225946
4       A2       E2 -0.069807208 -0.703073589  1.879448658 -0.37770923
-2.66221239  2.00152154
5       A1       E3 -0.007800886 -1.297561250  1.216627848 -0.30395411
-1.08181218  1.09764895
6       A2       E3 -0.054466856 -1.577891927  1.674719118 0.35594015
-1.20865279  2.25765422
       X10.mn    X10.min    X10.max
1 -0.3458888 -2.0312811  1.1483179
2 -0.1021727 -1.3230372  0.8045472
3  0.3514645 -3.2334010  1.7075298
4 -0.4988984 -2.1091311  0.5857192
5  0.2297461 -1.1336967  0.8483935

6  0.3700621 -1.5609424  2.2792024


--

David

>
> Any help would be appreciated,
>
> Thanks
>
> Bernard McGarvey
> Director, Fort Myers Beach Lions Foundation, Inc.
> Retired (Lilly Engineering Fellow).
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Summarizing select columns in a data frame

Bernard McGarvey
Thanks David

Bernard
Sent from my iPhone so please excuse the spelling!"

> On Jan 17, 2021, at 5:59 PM, David Winsemius <[hidden email]> wrote:
>
> 
>> On 1/17/21 12:15 PM, Bernard McGarvey wrote:
>> I have a data frame that consists of several factor columns say A, B, C, D, and E and several columns containing numerical data, say X1, X2, .... X10. I would like to create statistics of some of the numerical columns by some of the factor columns. For example,
>>
>> Calculate the mean, min, and max of variables X1 and X7, by factors A, and E. The results should look like the table below:
>>
>> Factor A Factor E     mean(X1) min(x1) max(X1) mean(X7) min(x7) max(X7) mean(X10) min(x10) max(X10)
>> A1        E1
>> A1        E2
>> A1        E3
>> A2        E1
>> A2        E2
>> A2        E3
>>
>> I would like the results to be returned to a data frame or other object that I can write out using the write.csv function. I have looked at the summarize and numSummary functions but they do not appear to be flexible enough to do the above.
>
>
> The `aggregate` function will do the subsetting and function application.
>
> > dfrm <- cbind(dfrm, matrix(rnorm(600), ncol=10 ) ); names(dfrm)[3:12] <- paste0("X", 1:10)
> > str(dfrm)
> 'data.frame':    60 obs. of  12 variables:
>  $ Factor_A: Factor w/ 2 levels "A1","A2": 1 1 1 2 2 2 1 1 1 2 ...
>  $ Factor_B: Factor w/ 3 levels "E1","E2","E3": 1 2 3 1 2 3 1 2 3 1 ...
>  $ X1      : num  -0.02116 -0.00049 0.12875 -0.05412 0.51886 ...
>  $ X2      : num  1.6799 -0.0963 -0.5727 -0.3638 -0.322 ...
>  $ X3      : num  -0.349 0.267 -0.666 -0.329 0.902 ...
>  $ X4      : num  0.1125 -0.5384 0.0924 0.6849 -0.4194 ...
>  $ X5      : num  -0.421 0.372 1.316 1.323 -0.03 ...
>  $ X6      : num  -0.0767 1.4972 0.1967 -0.7092 -1.0943 ...
>  $ X7      : num  0.1771 -0.2136 -1.0818 -0.0671 2.0015 ...
>  $ X8      : num  1.456 -0.383 -0.47 0.965 0.569 ...
>  $ X9      : num  -1.795 -0.4546 0.0069 1.2245 -0.395 ...
>  $ X10     : num  -1.931 1.708 0.274 0.73 -0.995 ...
>
>
>
>  aggregate(  dfrm[ ,  c("X1", "X7", "X10")],    # columns to analyze
>
>                       dfrm[ c("Factor_A", "Factor_B")],  # classifying columns
>
>                       FUN=function (x) c(mn =mean(x), min=min(x), max=max(x) ) )  # desired "summarizers"
>
> #--- result----
>
>   Factor_A Factor_B        X1.mn       X1.min       X1.max X7.mn      X7.min      X7.max
> 1       A1       E1  0.187513792 -0.866094155  2.310960164 0.22489729 -0.91442493  1.94095786
> 2       A2       E1  0.078361707 -1.515410191  1.382420050 -0.51309155 -1.67026123  0.70869034
> 3       A1       E2 -0.267416858 -1.995131138  1.392115793 -0.04772929 -2.45426692  2.02225946
> 4       A2       E2 -0.069807208 -0.703073589  1.879448658 -0.37770923 -2.66221239  2.00152154
> 5       A1       E3 -0.007800886 -1.297561250  1.216627848 -0.30395411 -1.08181218  1.09764895
> 6       A2       E3 -0.054466856 -1.577891927  1.674719118 0.35594015 -1.20865279  2.25765422
>       X10.mn    X10.min    X10.max
> 1 -0.3458888 -2.0312811  1.1483179
> 2 -0.1021727 -1.3230372  0.8045472
> 3  0.3514645 -3.2334010  1.7075298
> 4 -0.4988984 -2.1091311  0.5857192
> 5  0.2297461 -1.1336967  0.8483935
>
> 6  0.3700621 -1.5609424  2.2792024
>
>
> --
>
> David
>
>>
>> Any help would be appreciated,
>>
>> Thanks
>>
>> Bernard McGarvey
>> Director, Fort Myers Beach Lions Foundation, Inc.
>> Retired (Lilly Engineering Fellow).
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.