2 D density plot interpretation and manipulating the data

anikaM
Hello,

I have a data frame like this:

> head(SNP)
               mean      var     sd
FQC.10090295 0.0327 0.002678 0.0517
FQC.10119363 0.0220 0.000978 0.0313
FQC.10132112 0.0275 0.002088 0.0457
FQC.10201128 0.0169 0.000289 0.0170
FQC.10208432 0.0443 0.004081 0.0639
FQC.10218466 0.0116 0.000131 0.0115
...

and I am creating a plot like this:

s <- ggplot(SNP, mapping = aes(x = mean, y = var))
s <- s +  geom_density_2d() + geom_point() + my.theme + ggtitle("SNPs")
s

I get the attached plot.

My questions are:
1. How do I interpret inclusion versus exclusion within the ellipses/contours?

2. How do I extract from my data frame the points that lie outside the ellipses?

Thanks
Ana

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Attachment: snps.pdf (37K)

Re: 2 D density plot interpretation and manipulating the data

anikaM
My understanding is that this represents a bivariate normal
approximation of the data, which uses the kernel density function to
test for inclusion within a level set. (Please correct me.)

To exclude the outliers that fall outside these ellipses/contours, is it
advisable to do something like this:

SNP$density <- get_density(SNP$mean, SNP$var)
> summary(SNP$density)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     383     696     738    1170    1789

where get_density() is the function from here:
https://slowkow.com/notes/ggplot2-color-by-density/

and then do something like this:

a <- SNP[SNP$density > 400, ]

and plot it again:

p <- ggplot(a, mapping = aes(x = mean, y = var))
p <- p +  geom_density_2d() + geom_point() + my.theme + ggtitle("SNPS_red")
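
For readers following along, a helper such as get_density() can be sketched as below. This is an assumption about what the linked page provides (a grid-based kernel density estimate from MASS::kde2d(), looked up at each point), not a verified copy of it. The simulated SNP data frame is hypothetical and only stands in for the real data.

```r
library(MASS)  # provides kde2d()

# Assumed shape of the helper: estimate a 2D kernel density on a grid,
# then look up the grid cell that each (x, y) point falls into.
get_density <- function(x, y, ...) {
  dens <- MASS::kde2d(x, y, ...)
  ix <- findInterval(x, dens$x)
  iy <- findInterval(y, dens$y)
  dens$z[cbind(ix, iy)]
}

# Hypothetical right-skewed data standing in for the real SNP table.
set.seed(1)
SNP <- data.frame(mean = rgamma(500, shape = 2, rate = 80))
SNP$var <- SNP$mean^2 * rgamma(500, shape = 4, rate = 2)

# One density value per row; n is the grid size passed on to kde2d().
SNP$density <- get_density(SNP$mean, SNP$var, n = 100)
range(SNP$density)  # unrounded values; summary() may print fewer digits
```

Because the values are read from a grid, they are approximations of the density at each point, and any threshold such as 400 is only meaningful relative to the scale of these values.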

In reply to the post by Ana Marija of Thu, Oct 8, 2020 at 3:52 PM.

Re: 2 D density plot interpretation and manipulating the data

aBBy Spurdle, ⍺XY
> My understanding is that this represents bivariate normal
> approximation of the data which uses the kernel density function to
> test for inclusion within a level set. (please correct me)

You can fit a bivariate normal distribution by computing five parameters:
two means, two standard deviations (or two variances), and one
correlation (or covariance) coefficient.
The bivariate normal *has* elliptical contours.

A kernel density estimate is usually regarded as an estimate of an
unknown density function.
Often they use a normal (or Gaussian) kernel, but I wouldn't describe
them as normal approximations.
In general, bivariate kernel density estimates do *not* have
elliptical contours.
But in saying that, if the data is close to normality, then contours
will be close to elliptical.

Kernel density estimates do not test for inclusion, as such.
(But technically, there are some exceptions to that).

I'm not sure what you're trying to achieve here.
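
The five-parameter fit described above can be sketched as follows. The data are simulated and hypothetical, and using the squared Mahalanobis distance to decide whether a point lies inside a fitted ellipse is one common choice, not something stated in the thread. It also assumes bivariate normality, which may not hold for skewed data.

```r
# Hypothetical right-skewed data; replace x and y with the real columns.
set.seed(1)
x <- rgamma(500, shape = 2, rate = 80)
y <- x^2 * rgamma(500, shape = 4, rate = 2)

# The five parameters of a fitted bivariate normal.
params <- c(mean_x = mean(x), mean_y = mean(y),
            sd_x = sd(x), sd_y = sd(y), cor_xy = cor(x, y))

# Squared Mahalanobis distance of each point from the fitted centre.
xy <- cbind(x, y)
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# Under bivariate normality, d2 follows a chi-squared distribution with
# 2 degrees of freedom, so this ellipse would hold about 95% of the points.
inside <- d2 <= qchisq(0.95, df = 2)
mean(inside)  # share of points inside the nominal 95% ellipse
```

For skewed data the share inside the ellipse can differ noticeably from 95%, which is one way to see that an elliptical contour is a poor description of such data.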

Re: 2 D density plot interpretation and manipulating the data

anikaM
Hi Abby,

Thank you for getting back to me and for this useful information.

I'm trying to detect the outliers in my distribution based on mean and
variance. Can I see that from the plot I provided? Would outliers be
outside of the ellipses? If so, how do I extract those from my data frame,
and based on which parameter?

So I am trying to connect outliers based on what the plot is showing:
s <- ggplot(SNP, mapping = aes(x = mean, y = var))
s <- s +  geom_density_2d() + geom_point() + my.theme + ggtitle("SNPs")

versus what is in the data:

> head(SNP)
               mean      var     sd
FQC.10090295 0.0327 0.002678 0.0517
FQC.10119363 0.0220 0.000978 0.0313
FQC.10132112 0.0275 0.002088 0.0457
FQC.10201128 0.0169 0.000289 0.0170
FQC.10208432 0.0443 0.004081 0.0639
FQC.10218466 0.0116 0.000131 0.0115
...

The distribution is not normal; it is right-skewed.

Cheers,
Ana

In reply to the post by Abby Spurdle of Fri, Oct 9, 2020 at 2:13 AM.

Re: 2 D density plot interpretation and manipulating the data

Bert Gunter-2
I recommend that you consult with a local statistical expert. Much of what
you say (outliers?!?) seems to make little sense, and your statistical
knowledge seems minimal. Perhaps more to the point, none of your questions
can be properly answered without subject matter context, which this list is
not designed to provide. That's why I believe you need local expertise.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


In reply to the post by Ana Marija of Fri, Oct 9, 2020 at 8:25 AM.

Re: 2 D density plot interpretation and manipulating the data

anikaM
Hi Bert,

Another confrontational response from you...

You might have noticed that I use the word "outlier" carefully in this
post and only in relation to the plotted ellipses. I do not know the
underlying algorithm of geom_density_2d() and therefore I am having an
issue of how to interpret the plot. I was hoping someone here knows
that and can help me.

Ana

In reply to the post by Bert Gunter of Fri, Oct 9, 2020 at 11:31 AM.

Re: 2 D density plot interpretation and manipulating the data

aBBy Spurdle, ⍺XY
You could assign a density value to each point.
Maybe you've done that already...?

Then trim the lowest n (number of) data points,
or trim the lowest p (proportion of) data points.

e.g.
Remove the data points with the 20 lowest density values.
Or remove the data points with the lowest 5% of density values.

I'll let you decide whether that is a good idea or a bad idea.
And if it's a good idea, then how much to trim.
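
The two trimming options above can be sketched as follows. The data are simulated and hypothetical, the density lookup on a MASS::kde2d() grid is an assumption about how the per-point densities were obtained, and the values 20 and 5% are simply the ones used as examples above.

```r
library(MASS)  # provides kde2d()

# Hypothetical data; replace x and y with the real columns.
set.seed(1)
x <- rgamma(500, shape = 2, rate = 80)
y <- x^2 * rgamma(500, shape = 4, rate = 2)

# Density value for each point, read from a kde2d() grid.
dens <- kde2d(x, y, n = 100)
d <- dens$z[cbind(findInterval(x, dens$x), findInterval(y, dens$y))]

# Option 1: drop the n points with the lowest density values.
n_trim <- 20
keep_n <- rank(d, ties.method = "first") > n_trim

# Option 2: drop the points in the lowest proportion p of density values.
p_trim <- 0.05
keep_p <- d > quantile(d, probs = p_trim)

trimmed_n <- data.frame(x, y)[keep_n, ]
trimmed_p <- data.frame(x, y)[keep_p, ]
c(nrow(trimmed_n), nrow(trimmed_p))
```

Trimming by rank removes exactly n points, while trimming by quantile can remove slightly more or fewer than the nominal proportion when density values are tied.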


In reply to the post by Ana Marija of Sat, Oct 10, 2020 at 5:47 AM.

Re: 2 D density plot interpretation and manipulating the data

anikaM
Hi Abby,

Thanks for getting back to me. Yes, I believe I did that by doing this:

SNP$density <- get_density(SNP$mean, SNP$var)
> summary(SNP$density)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0     383     696     738    1170    1789

where get_density() is the function from here:
https://slowkow.com/notes/ggplot2-color-by-density/

and keep only entries with density > 400:

a <- SNP[SNP$density > 400, ]

and plot it again:

p <- ggplot(a, mapping = aes(x = mean, y = var))
p <- p +  geom_density_2d() + geom_point() + my.theme + ggtitle("SNPS_red")

and I can probably increase that threshold...

Any idea how I should interpret the data points that remain within
the ellipses?

In reply to the post by Abby Spurdle of Fri, Oct 9, 2020 at 6:09 PM.

Attachment: snps_red.pdf (39K)

Re: 2 D density plot interpretation and manipulating the data

aBBy Spurdle, ⍺XY
> SNP$density <- get_density(SNP$mean, SNP$var)
> > summary(SNP$density)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>       0     383     696     738    1170    1789

This doesn't look accurate.
The density values shouldn't all be integers.
And I wouldn't expect the smallest density to be zero, if using a
Gaussian kernel.

Are these values rounded or formatted?
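
One way to check whether the printed summary was rounded for display is sketched below. The vector d is hypothetical and stands in for SNP$density. The point is only that summary() output may show fewer digits than the stored values have.

```r
# Hypothetical density values; in the thread these would be SNP$density.
d <- c(0.0004, 383.27, 695.61, 1170.42, 1789.03)

# Default display, which may round to a few significant digits.
print(summary(d))

# The same quantiles shown with more digits.
format(quantile(d), digits = 10)

# Direct checks on the stored values.
any(d != round(d))  # TRUE when some values are not whole numbers
min(d) == 0         # TRUE only when the smallest value is exactly zero
```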

(Recombined Excerpts)
> and keep only entries with density > 400
> a=SNP[SNP$density>400,]
> Any idea how do I interpret data points that are left contained within
> the ellipses?

Reiterating, they're contour lines, and in general they will *not* be ellipses.

You could work out the proportion of "densities" > 400.

    d <- SNP$density
    p.remain <- length (d [d > 400]) / length (d)
    p.remain

Or a more succinct version:

    p.remain <- sum (SNP$density > 400) / nrow (SNP)

Then you can say that you've plotted the proportion <p.remain> of the data with the highest densities.

Re: 2 D density plot interpretation and manipulating the data

David Winsemius
In reply to this post by anikaM
I’m wondering if you do any searching of the Web or use the help facilities before asking questions? When I posed the question to Google’s search facilities I immediately was directed to, unsurprisingly, the help page text in a webpage format:

https://ggplot2.tidyverse.org/reference/geom_density_2d.html

In many situations I find it very difficult to pry code out of ggplot functions, but that help page says it’s a routine found in the MASS package which is very well documented.

David.
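
For readers who want to see the MASS routine at work, here is a minimal sketch. It assumes that the defaults of geom_density_2d() correspond to MASS::kde2d() with per-axis bandwidths from bandwidth.nrd() and a 100 x 100 grid. The data are simulated and hypothetical, and the 5% level is an arbitrary choice for illustration.

```r
library(MASS)  # provides kde2d() and bandwidth.nrd()

# Hypothetical data; replace x and y with the real columns.
set.seed(1)
x <- rgamma(500, shape = 2, rate = 80)
y <- x^2 * rgamma(500, shape = 4, rate = 2)

# Assumed ggplot2 defaults: per-axis bandwidth.nrd() and a 100 x 100 grid.
h <- c(bandwidth.nrd(x), bandwidth.nrd(y))
dens <- kde2d(x, y, h = h, n = 100)

# Estimated density at each data point, read from the grid cell it falls in.
d <- dens$z[cbind(findInterval(x, dens$x), findInterval(y, dens$y))]

# Choose one contour level, here the density below which 5% of points fall.
lev <- unname(quantile(d, probs = 0.05))
cl <- contourLines(dens$x, dens$y, dens$z, levels = lev)

# Points whose density is below that level lie outside the contour,
# up to the approximation introduced by the grid lookup.
outside <- d < lev
head(data.frame(x, y)[outside, ])
```

Each contour is a line of constant estimated density, so being inside or outside it is a statement about local point density rather than a statistical test.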

In reply to the post by Ana Marija of Oct 9, 2020, at 4:23 PM.
