Calculating SD according to groups of rows

Boyce
Hi all,

I know this is probably basic, but I have proven to be a slow learner in any
programming language. Anyhow, how can I calculate the SD for each person in
my table? I have two patients in this R data.frame, 7200 and 23955.
I extracted this from a relational database, but am I better off attempting
to compute SD in SQL, or is this easily accomplished in R?


   SUBJECT_ID  HR
1        7200 158
2        7200 165
3        7200 138
4        7200 152
5        7200 139
6        7200 157
7        7200 186
8       23955 167
9       23955 162
10      23955 171
11      23955 139
12      23955 170
13      23955 177
14      23955 180
15      23955 176
16      23955 172
17      23955 179
18      23955 181
19      23955 169
20      23955 168
21      23955 185
22      23955 181
23      23955 191
24      23955 179
25      23955 178
26      23955 184
27      23955 179
28      23955 172
29      23955 173
30      23955 182
31      23955 174

So, what I would want is a table of 800 patients with an SD for their heart
rates:

subject id    Heart Rate SD
7200          20 (for example)
23955         18 (for example)

Thank you!

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Calculating SD according to groups of rows

Jorge I Velez
Dear pufftissue,

If your data set is a data.frame called 'x', one approach could be:

# Data set
x <- read.table('clipboard', header = TRUE)

# Calculations
tapply(x$HR, x$SUBJECT_ID, sd, na.rm = TRUE)
    7200    23955
16.39977 10.03896

See ?tapply and/or ?ave for more information.
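
For anyone who wants to reproduce the numbers above without the clipboard, here is a minimal sketch that rebuilds the posted data by hand. The object names (hr_7200, hr_23955, sd_by_subject, HR_SD) are made up for illustration; the ave() call is one way to act on the ?ave pointer, attaching each subject's SD to every row.

```r
# Heart rates copied from the table in the original post
hr_7200  <- c(158, 165, 138, 152, 139, 157, 186)
hr_23955 <- c(167, 162, 171, 139, 170, 177, 180, 176, 172, 179, 181, 169,
              168, 185, 181, 191, 179, 178, 184, 179, 172, 173, 182, 174)

x <- data.frame(
  SUBJECT_ID = rep(c(7200, 23955), times = c(length(hr_7200), length(hr_23955))),
  HR         = c(hr_7200, hr_23955)
)

# One SD per subject, returned as a named one-dimensional array
sd_by_subject <- tapply(x$HR, x$SUBJECT_ID, sd)
print(sd_by_subject)

# ave() returns a vector as long as the input, so each row
# carries the SD of the subject it belongs to
x$HR_SD <- ave(x$HR, x$SUBJECT_ID, FUN = sd)
head(x)
```

tapply() gives one value per group, while ave() keeps the original row count, which is handy when the SD is needed next to the raw readings.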

HTH,

Jorge


On Wed, Nov 19, 2008 at 11:59 PM, pufftissue pufftissue <
[hidden email]> wrote:

> [...]

Re: Calculating SD according to groups of rows

Simon Blomberg-4
In reply to this post by Boyce
How about:

with(dat, tapply(HR, SUBJECT_ID, sd))

Assuming your data frame is named dat.

On Wed, 2008-11-19 at 23:59 -0500, pufftissue pufftissue wrote:

> [...]

--
Simon Blomberg, BSc (Hons), PhD, MAppStat.
Lecturer and Consultant Statistician
Faculty of Biological and Chemical Sciences
The University of Queensland
St. Lucia Queensland 4072
Australia
Room 320 Goddard Building (8)
T: +61 7 3365 2506
http://www.uq.edu.au/~uqsblomb
email: S.Blomberg1_at_uq.edu.au

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

The combination of some data and an aching desire for
an answer does not ensure that a reasonable answer can
be extracted from a given body of data. - John Tukey.

Re: Calculating SD according to groups of rows

Boyce
Thank you for your help!
It works beautifully, but how would I transpose the results?

What I am getting is indeed:

    7200    23955    34563     8934
16.39977 10.03896   11.234    14.02

I'd like the final output to be:

subject_id    hr_standard_deviation
7200          16.39977
23955         10.03896
34563         11.234
8934          14.02

What I have realized so far is:

tapply() is outputting an array, so I need to transpose the array.
Then I read that you need to use aperm to transpose an array, but I don't
know what permutation vector I should choose.
Then even if it is transposed, I'd like to retitle the columns to be
'subject_id' and 'hr_standard_deviation'.

I tried to convert the array to a table and then to a matrix, but I'm not
sure what the best way to go is from this point.

Thanks again!



On Thu, Nov 20, 2008 at 12:17 AM, Simon Blomberg <[hidden email]> wrote:

> How about:
>
> with(dat, tapply(HR, SUBJECT_ID, sd))
>
> Assuming your data frame is named dat.

Re: Calculating SD according to groups of rows

Dieter Menne
pufftissue pufftissue <pufftissue <at> gmail.com> writes:

> [...]

The hard way could go like this; I personally got used to it, but I admit
it is one of the things that are unusually difficult in R.

dat = data.frame(SUBJECT_ID=sample(letters[1:5],100,TRUE),HR=rnorm(100))
sd.list = with(dat, tapply(HR, SUBJECT_ID, sd))
data.frame(SUBJECT_ID=rownames(sd.list),sd=sd.list)
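
Applied to the column names asked for in the question, the same idea can be sketched as follows. The small dat below is a cut-down stand-in for the real table, and sd_by_subject and result are illustrative names only. No transposing or aperm is needed, because the group labels are already stored as the names of the tapply() result.

```r
# Cut-down stand-in for the real data (first three readings per subject)
dat <- data.frame(
  SUBJECT_ID = c(7200, 7200, 7200, 23955, 23955, 23955),
  HR         = c(158, 165, 138, 167, 162, 171)
)

# Named one-dimensional array: names are the subject ids
sd_by_subject <- with(dat, tapply(HR, SUBJECT_ID, sd))

# Build the two-column "long" table directly from names and values
result <- data.frame(
  subject_id            = names(sd_by_subject),
  hr_standard_deviation = as.vector(sd_by_subject),
  stringsAsFactors      = FALSE
)
print(result)
```

Note that subject_id comes out as character here, since names are always strings; wrap it in as.numeric() if numeric ids are needed.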

I think Hadley Wickham tried to make life easier with the plyr package,
so I thought something like the below would work out of the box.
However, there must be something wrong with the syntax; the
result is only "approximately" correct.

Dieter

library(plyr)
daply(dat,.(SUBJECT_ID),sd)
ddply(dat,.(SUBJECT_ID),sd)

Re: Calculating SD according to groups of rows

PIKAL Petr
What about aggregate?

with(dat, aggregate(HR, list(sub_id=SUBJECT_ID), sd))

should give the required final output form.
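
A variation on the same idea, sketched with the formula interface of aggregate(), which is available in current versions of R. The cut-down dat and the name hr_sd are only for illustration. The result is already a data frame with one row per subject, so only the column names need changing.

```r
# Cut-down stand-in for the real data (first three readings per subject)
dat <- data.frame(
  SUBJECT_ID = c(7200, 7200, 7200, 23955, 23955, 23955),
  HR         = c(158, 165, 138, 167, 162, 171)
)

# "HR by SUBJECT_ID": one row per subject, columns SUBJECT_ID and HR
hr_sd <- aggregate(HR ~ SUBJECT_ID, data = dat, FUN = sd)

# Retitle the columns as requested in the question
names(hr_sd) <- c("subject_id", "hr_standard_deviation")
print(hr_sd)
```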

Regards

Petr

[hidden email] wrote on 20.11.2008 09:20:36:

> [...]

Re: Calculating SD according to groups of rows

hadley wickham
In reply to this post by Dieter Menne
On Thu, Nov 20, 2008 at 2:20 AM, Dieter Menne
<[hidden email]> wrote:

> [...]
>
> library(plyr)
> daply(dat,.(SUBJECT_ID),sd)
> ddply(dat,.(SUBJECT_ID),sd)

Well that calculates sd on the whole data frame.  (Like sd(dat)). You
probably want:

ddply(dat,.(SUBJECT_ID), numcolwise(sd))

which calculates sd for numeric columns only, or

ddply(dat,.(SUBJECT_ID), function(df) sd(df$HR))

which calculates it for HR explicitly.
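
If a named result column is wanted, a further option in current plyr releases is to pass summarise to ddply. This is only a sketch, assuming plyr is installed; the cut-down dat and the name hr_sd are illustrative, and the call is guarded so the script still runs when plyr is missing.

```r
# Cut-down stand-in for the real data (first three readings per subject)
dat <- data.frame(
  SUBJECT_ID = c(7200, 7200, 7200, 23955, 23955, 23955),
  HR         = c(158, 165, 138, 167, 162, 171)
)

if (requireNamespace("plyr", quietly = TRUE)) {
  library(plyr)
  # summarise evaluates sd(HR) within each subject's piece of the data
  # frame and names the resulting column
  hr_sd <- ddply(dat, .(SUBJECT_ID), summarise,
                 hr_standard_deviation = sd(HR))
  print(hr_sd)
}
```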


Hadley

--
http://had.co.nz/

Re: Calculating SD according to groups of rows

Dieter Menne
hadley wickham <h.wickham <at> gmail.com> writes:

> > library(plyr)
> > dat = data.frame(SUBJECT_ID=sample(letters[1:5],100,TRUE),HR=rnorm(100))
> > daply(dat,.(SUBJECT_ID),sd)
> > ddply(dat,.(SUBJECT_ID),sd)
>
> Well that calculates sd on the whole data frame.  (Like sd(dat)).

Not really, it looks like the breakdown is somehow done:

> library(plyr)
> dat = data.frame(SUBJECT_ID=sample(letters[1:5],100,TRUE),HR=rnorm(100))
> daply(dat,.(SUBJECT_ID),sd)
         
SUBJECT_ID SUBJECT_ID        HR
         a         NA 1.0488930
         b         NA 0.9110685
         c         NA 1.0776996
         d         NA 1.1724009
         e         NA 0.9455105
Warning messages:
1: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
..more warnings

> ddply(dat,.(SUBJECT_ID),sd)
  SUBJECT_ID        HR
1         NA 1.0488930
2         NA 0.9110685
3         NA 1.0776996
4         NA 1.1724009
5         NA 0.9455105
Warning messages:
1: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion

That's what I meant by "almost correct". Your suggestion works, but given
this close miss, wouldn't it be good to make numcolwise(sd) the default?

Dieter

Re: Calculating SD according to groups of rows

hadley wickham
On Thu, Nov 20, 2008 at 10:04 AM, Dieter Menne
<[hidden email]> wrote:

> [...]
>
> That's what I meant by "almost correct". Your suggestion works, but wouldn't it
> be a good default to make numcolwise(sd) the default with this close miss?

I have considered it, but I think it would make it harder to use plyr for
the more complicated problems where it really shines. Being able to
work with the whole data frame, instead of just some subset of the
columns, makes it possible to do much more. For example, because
aggregate operates on one column at a time, you can't calculate the
correlation between variables. Given a data frame you can always
operate on a column at a time, but given a column at a time, you cannot
operate on the data frame as a whole. Plyr chooses to supply your
aggregation function with the whole data frame, and then provides
functions (colwise, numcolwise, catcolwise) that make it easy to
operate column-wise.
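
To make the whole-data-frame point concrete, here is a rough base R sketch of the same split-apply idea, without plyr. The SBP column and all values are invented purely for illustration. Each function call receives a complete per-subject data frame, so it can use two columns at once (a correlation) or just one (an SD).

```r
# Hypothetical data: SBP is an invented second measurement per reading
dat <- data.frame(
  SUBJECT_ID = rep(c("a", "b"), each = 5),
  HR         = c(60, 65, 70, 75, 80, 90, 85, 80, 75, 70),
  SBP        = c(110, 112, 118, 121, 125, 140, 141, 133, 131, 128)
)

# One data frame per subject
pieces <- split(dat, dat$SUBJECT_ID)

# Needs two columns at once, so it requires the whole piece
cor_by_subject <- sapply(pieces, function(df) cor(df$HR, df$SBP))

# A single column is still easy to pick out of the whole piece
sd_by_subject <- sapply(pieces, function(df) sd(df$HR))

print(cor_by_subject)
print(sd_by_subject)
```

With plyr, the corresponding call would be something along the lines of ddply(dat, .(SUBJECT_ID), function(df) cor(df$HR, df$SBP)).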

Hadley

--
http://had.co.nz/
