about ECDF display in ggplot2

Bogdan Tanasa
Dear all,

I would appreciate your advice, suggestions, or comments on the following:

1 -- I start from a vector that contains LENGTHS (numeric values from 1 to
10 000).

2 -- I display the ECDF using the following R code, with some "limits":

BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
           1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)

ggplot(x, aes(LENGTH)) +
          stat_ecdf(geom = "point") +
          scale_x_continuous(name = "LENGTH of DEL",
                             breaks = BREAKS,
                             limits=c(0, 500))

3 -- I get the following warning message: "Warning message: Removed
109 rows containing non-finite values (stat_ecdf)."

The question is: are these 109 values removed only from the VISUALIZATION
because I set the "limits", or are they removed from the statistical
CALCULATION?

4 -- In contrast, if I use the standard R function plot(ecdf(...)), there is
no warning message:

plot(ecdf(x$LENGTH), xlab="DEL LENGTH",
                     ylab="Fraction of DEL", main="DEL", xlim=c(0,500),
                     col = "dark red")

Thanks a lot !

-- bogdan

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: about ECDF display in ggplot2

Jeff Newmiller
It is a feature of ggplot that points excluded by limits raise warnings, while base graphics do not.

You may find that coord_cartesian(xlim = c(0, 500)) works better with ggplot: it zooms the view without dropping data, so the curve drawn within the viewport still reflects the points that lie outside the limits.
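
A minimal sketch of that difference, using made-up data (the `toy` data frame and its values are assumptions for illustration, not the poster's data). Limits set on the scale turn out-of-range values into NA before stat_ecdf() runs, so the ECDF is computed on the remaining values only; coord_cartesian() only zooms the view, so the ECDF is still computed on all values. `pad = FALSE` is used here only so that the extra (-Inf, 0) and (Inf, 1) points are not drawn.

```r
library(ggplot2)

# Hypothetical toy data: 6 of the 10 values are <= 500.
toy <- data.frame(LENGTH = c(10, 20, 50, 100, 300, 400, 1000, 5000, 20000, 1e6))

# Scale limits: values outside [0, 500] are discarded BEFORE the ECDF is
# computed (ggplot2 warns about the removed rows), so the ECDF value at
# x = 400 is 6/6.
p_scale <- ggplot(toy, aes(LENGTH)) +
  stat_ecdf(geom = "point", pad = FALSE) +
  scale_x_continuous(limits = c(0, 500))

# coord_cartesian: the ECDF is computed on all 10 values and the plot is
# merely zoomed to [0, 500], so the ECDF value at x = 400 is 6/10.
p_zoom <- ggplot(toy, aes(LENGTH)) +
  stat_ecdf(geom = "point", pad = FALSE) +
  coord_cartesian(xlim = c(0, 500))

print(p_scale)  # warns that rows were removed
print(p_zoom)
```

The second plot should correspond to what plot(ecdf(...), xlim = c(0, 500)) shows in base graphics, since both only restrict the visible range.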

There may be other problems with your data that your non-reproducible example does not show, and sending R code in HTML-formatted email usually corrupts it, so please follow the recommendations in the Posting Guide next time you post.

On July 6, 2018 4:32:41 PM PDT, Bogdan Tanasa <[hidden email]> wrote:


Re: about ECDF display in ggplot2

Bogdan Tanasa
Dear Jeff,

thank you for your email.

Yes, in order to be more descriptive/comprehensive, please find attached to
my email the following files (my apologies ... I am sending these as
attachments, as I do not have a web server running at this moment) :

-- the R script (R_script_display_ECDF.R) that reads the file "LENGTH" and
outputs ECDF figure by using the standard R function or ggplot2.

-- the display of ECDF by using standard R function
("display.R.ecdf.LENGTH.pdf")

-- the display of ECDF by using ggplot2 ("display.ggplot2.ecdf.LENGTH.pdf")

The ECDF over xlim(0,500) looks very different in plot(ecdf) and in
ggplot2. Could you please advise why, and what I should change in my ggplot2
code?

thanks a lot,

- bogdan

ps : the R code is also written below :

library("ggplot2")

file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)

############################# display with PLOT FUNCTION:

pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')

plot(ecdf(file$LENGTH), xlab="DEL SIZE",
                     ylab="fraction of DEL",
                     main="LENGTH of DEL",
                     xlim=c(0,500),
                     col = "dark red", axes = FALSE)

ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
axis(2, at=ticks_y, labels=ticks_y, col.axis="red")

ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")

dev.off()

############################# display in GGPLOT2 :

BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
           1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)

barfill <- "#4271AE"
barlines <- "#1F3552"

pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')

ggplot(file, aes(LENGTH)) +
          stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
          scale_x_continuous(name = "LENGTH of DEL",
                             breaks = BREAKS,
                             limits=c(0, 500)) +
          scale_y_continuous(name = "FRACTION") +
          ggtitle("ECDF of LENGTH") +
          theme_bw() +
          theme(legend.position = "bottom", legend.direction = "horizontal",
               legend.box = "horizontal",
               legend.key.size = unit(1, "cm"),
               axis.title = element_text(size = 12),
               legend.text = element_text(size = 9),
               legend.title=element_text(face = "bold", size = 9))

dev.off()

On Sat, Jul 7, 2018 at 9:47 PM, Jeff Newmiller <[hidden email]>
wrote:


Re: about ECDF display in ggplot2

Jeff Newmiller
Thank you for making the effort... but most attachments get stripped on
the mailing list. Using the reprex package as I suggested and putting the
result into the email is by far the safest approach. Since I received your
email directly, I did get the attachments. Below is my reproducible
example... to serve as an example for how you can get help from everyone
on the list rather than just the few you are responding to.

My summary comment is that you have to decide whether the LENGTH values
greater than 500 are relevant... and if they are not, you REALLY SHOULD create
a data set that excludes them. Then you won't have to create
"fake" axes, and you won't get ggplot warnings.

Note: The reprex package allows you to confirm that the example is in fact
reproducible, so technically it is not necessary to include the plot
images in the question. However, reprex used to conveniently support
putting the images on the imgur website, and for some reason it no longer
does that, so just run the example interactively to see the graphs.
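
To make the summary comment concrete, here is a small base-R sketch with made-up values (the vector `len` is an assumption for illustration, not the poster's data). It shows that the ECDF of the full data and the ECDF of the data restricted to values below 500 are different functions, so the choice has to be made explicitly rather than through axis limits.

```r
# Hypothetical lengths: 6 of the 10 values are below 500.
len <- c(10, 20, 50, 100, 300, 400, 1000, 5000, 20000, 1e6)

ecdf_all <- ecdf(len)              # ECDF over all 10 values
ecdf_500 <- ecdf(len[len < 500])   # ECDF over the 6 values below 500

# Fraction of values <= 400 under each definition
frac_all <- ecdf_all(400)   # 6 / 10
frac_500 <- ecdf_500(400)   # 6 / 6

print(c(all = frac_all, below_500 = frac_500))
```

In these terms, plot(ecdf(x), xlim = c(0, 500)) draws part of `ecdf_all`, while ggplot with limits set on the x scale draws a curve computed like `ecdf_500`, which is why the two displays can look very different over the same range.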

#######
############################################################
############################################################

library("ggplot2")

# "file" is the name of a very fundamental function in base R. Re-using
# that name for a data value is at best confusing to anyone reading your
# code and at worst will prevent you from using that function.
#file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)

# Instead of giving us a file, keep the data within the example
# DF <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
# set.seed( 42 )
# also shrink the size of the data for the example... we almost
# never need all of it
# dput( DF[ sample( seq.int( nrow( DF ) ), size = 200 ), , drop=FALSE ] )
DF <- structure(list(LENGTH = c(6813L, 56035L, 123997L, 281L, 851L, 1072L,
           72196L, 21L, 304L, 110L, 198L, 5922L, 283L, 199348L, 109L,
           3317104L, 106L, 37642146L, 82641L, 20L, 125911L, 354L, 11625388L,
           330L, 9811711L, 18L, 35L, 39897L, 27L, 277L, 79L, 2657L, 17L,
           26L, 23L, 248L, 3634L, 21L, 324L, 206L, 328L, 42L, 286L, 6042409L,
           24L, 36L, 2879L, 18L, 301L, 90684L, 4296636L, 43L, 1222L, 4536L,
           3281L, 324L, 393L, 3754L, 98824541L, 459L, 18L, 1081L, 175L,
           970L, 17L, 219L, 235558L, 1167315L, 25L, 623L, 2517515L, 32L,
           217L, 29L, 17L, 1744L, 18L, 39L, 26L, 77L, 41L, 22L, 311L, 119015225L,
           146413L, 22L, 19L, 301L, 373L, 2240L, 6439L, 128L, 18L, 257L,
           783L, 5169L, 31608038L, 325L, 1533L, 25L, 69344L, 54L, 10651L,
           31L, 335062L, 1854019L, 7153L, 38605567L, 51L, 23L, 16L, 301L,
           79L, 313L, 18L, 29L, 39L, 22L, 17L, 306L, 67L, 280L, 324L, 158L,
           93L, 2561L, 302L, 134578L, 328L, 9002L, 969051L, 34L, 20L, 309L,
           355L, 28L, 9461327L, 18627013L, 305L, 64L, 18L, 2730L, 28L, 246L,
           911L, 28L, 241483L, 154691L, 58891L, 55L, 456362L, 281L, 276L,
           51L, 26L, 106821L, 313L, 78L, 29L, 400L, 61171382L, 200L, 101L,
           220331L, 128L, 325L, 28L, 22L, 325L, 2330L, 5879L, 24L, 36L,
           23L, 51L, 26L, 32584707L, 1672L, 13939L, 315L, 20L, 580785L,
           42795L, 49193543L, 695L, 48568156L, 55634L, 207L, 318L, 22056L,
           3670420L, 4815387L, 309L, 17L, 3143160L, 431L, 1164L, 33L, 5503L,
           4166L)), .Names = "LENGTH", row.names = c(8283L, 8484L, 2591L,
           7517L, 5808L, 4698L, 6665L, 1219L, 5944L, 6378L, 4140L, 6503L,
           8452L, 2310L, 4180L, 8497L, 8842L, 1062L, 4293L, 5063L, 8168L,
           1253L, 8932L, 8550L, 745L, 4643L, 3523L, 8177L, 4035L, 7545L,
           6657L, 7319L, 3502L, 6181L, 36L, 7513L, 67L, 1873L, 8174L, 5516L,
           3422L, 3928L, 338L, 8773L, 3891L, 8627L, 7997L, 5765L, 8745L,
           5573L, 3003L, 3122L, 3588L, 7064L, 351L, 6739L, 6095L, 1541L,
           2349L, 4628L, 6077L, 8839L, 6830L, 5094L, 7639L, 1704L, 2439L,
           7443L, 6230L, 2162L, 387L, 1262L, 1944L, 4306L, 1773L, 6460L,
           71L, 3371L, 4618L, 15L, 5220L, 1417L, 3222L, 5792L, 6960L, 5056L,
           2096L, 807L, 768L, 2737L, 5983L, 3L, 1870L, 8361L, 8294L, 6577L,
           2984L, 4614L, 6664L, 5545L, 5608L, 1945L, 1939L, 3482L, 8435L,
           8615L, 6621L, 6561L, 4793L, 21L, 5447L, 7484L, 6721L, 4048L,
           4790L, 4804L, 13L, 3179L, 5471L, 7407L, 3187L, 3669L, 5123L,
           5267L, 6427L, 3527L, 8207L, 8593L, 2085L, 6467L, 8065L, 5385L,
           5635L, 8363L, 7587L, 5172L, 7326L, 1015L, 6817L, 5560L, 1324L,
           716L, 4136L, 6945L, 6536L, 7281L, 1516L, 8415L, 2616L, 1328L,
           6406L, 2886L, 6933L, 3511L, 6040L, 6905L, 1672L, 259L, 1208L,
           6051L, 8315L, 4896L, 5351L, 1752L, 4759L, 1597L, 4017L, 2818L,
           1033L, 1654L, 6483L, 3659L, 3678L, 4266L, 3797L, 1212L, 7322L,
           5258L, 7052L, 6826L, 8147L, 7655L, 2813L, 2300L, 6584L, 6629L,
           8140L, 7034L, 1183L, 2551L, 1726L, 6950L, 1143L, 1144L, 641L,
           471L, 4712L, 995L, 6582L, 6476L), class = "data.frame")


############################# display with PLOT FUNCTION:


# saving files should be avoided in reproducible examples... especially files
# that cannot be transmitted through the R-help mailing list such as pdf files
#pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')

# Your original plot commands below create a fake impression of the data by
# falsifying the axes. If you really are only interested in data points less
# than 500, you should be explicit about creating a data set containing only
# such constrained values before plotting them.
plot(ecdf(DF$LENGTH), xlab="DEL SIZE",
                      ylab="fraction of DEL",
                      main="LENGTH of DEL",
                      xlim=c(0,500),
                      col = "dark red", axes = FALSE)
ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-1.png)

# my recommendation
DF500 <- subset( DF, LENGTH < 500 )
plot( ecdf( DF500$LENGTH )
     , xlab = "DEL SIZE"
     , ylab = "fraction of DEL"
     , main = "LENGTH of DEL"
     , col = "dark red"
     )

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-2.png)

# alternatively
plot( ecdf( DF$LENGTH )
     , xlab = "DEL SIZE"
     , ylab = "fraction of DEL"
     , main = "LENGTH of DEL"
     , col = "dark red"
     , xlim=c( 1, 1e9 )
     , log="x"
     )

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-3.png)



#dev.off()

############################# display in GGPLOT2 :

BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
            1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)

barfill <- "#4271AE"
barlines <- "#1F3552"

#pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')

# ggplot's limits behavior is enabling your false representation of the data, but it
# warns you of the data removal
ggplot(DF, aes(LENGTH)) +
           stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
           scale_x_continuous(name = "LENGTH of DEL",
                              breaks = BREAKS,
                              limits=c(0, 500)
                              ) +
           scale_y_continuous(name = "FRACTION") +
           ggtitle("ECDF of LENGTH") +
           theme_bw() +
           theme(legend.position = "bottom", legend.direction = "horizontal",
                legend.box = "horizontal",
                legend.key.size = unit(1, "cm"),
                axis.title = element_text(size = 12),
                legend.text = element_text(size = 9),
                legend.title=element_text(face = "bold", size = 9))
#> Warning: Removed 80 rows containing non-finite values (stat_ecdf).

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-4.png)


# my recommendation
ggplot(DF500, aes(LENGTH)) +
   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
   scale_x_continuous(name = "LENGTH of DEL",
                      breaks = BREAKS ) +
   scale_y_continuous(name = "FRACTION") +
   ggtitle("ECDF of LENGTH") +
   theme_bw() +
   theme(legend.position = "bottom", legend.direction = "horizontal",
         legend.box = "horizontal",
         legend.key.size = unit(1, "cm"),
         axis.title = element_text(size = 12),
         legend.text = element_text(size = 9),
         legend.title=element_text(face = "bold", size = 9))

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-5.png)

# or for the un-filtered data
ggplot(DF, aes(LENGTH)) +
   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
   scale_x_log10( name = "LENGTH of DEL") +
   scale_y_continuous(name = "FRACTION") +
   ggtitle("ECDF of LENGTH") +
   theme_bw() +
   theme(legend.position = "bottom", legend.direction = "horizontal",
         legend.box = "horizontal",
         legend.key.size = unit(1, "cm"),
         axis.title = element_text(size = 12),
         legend.text = element_text(size = 9),
         legend.title=element_text(face = "bold", size = 9))

#' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-6.png)


#dev.off()

#' Created on 2018-07-09 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
#######
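
The commented-out dput() step near the top of the example can be sketched as follows. The data frame `big` is a made-up stand-in for the poster's LENGTH file, and the sample size of 5 is an arbitrary choice for illustration.

```r
# Hypothetical stand-in for the real data read from a file.
big <- data.frame(LENGTH = c(17L, 22L, 301L, 5922L, 82641L, 3317104L, 28L, 400L))

# Draw a small random sample so the example stays short;
# set.seed() makes the same rows come out on every run.
set.seed(42)
small <- big[sample(seq_len(nrow(big)), size = 5), , drop = FALSE]

# dput() prints R code that rebuilds the object; paste that text
# into the email instead of attaching a file.
dput(small)

# Round trip: evaluate the deparsed text and compare with the original.
rebuilt <- eval(parse(text = deparse(small)))
print(all.equal(rebuilt, small))
```

Keeping the data inside the message this way means readers of the list can run the example even when attachments are stripped.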

On Sun, 8 Jul 2018, Bogdan Tanasa wrote:


---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------

Re: about ECDF display in ggplot2

Bogdan Tanasa
Dear Jeff,

thank you for all your time, and very precious help.

with best regards.

-- bogdan

On Mon, Jul 9, 2018 at 1:41 AM, Jeff Newmiller <[hidden email]>
wrote:

> Thank you for making the effort... but most attachments get stripped on
> the mailing list. Using the reprex package as I suggested and putting the
> result into the email is by far the safest approach. Since I received your
> email directly, I did get the attachments. Below is my reproducible
> example... to serve as an example for how you can get help from everyone on
> the list rather than just the few you are responding to.
>
> My summary comment is that you have to decide whether the LENGTH values
> greater than 500 are relevant... and if they are, you REALLY SHOULD create
> a data set that is limited in this fashion. Then you won't have to create
> "fake" axes, and you won't get ggplot warnings.
>
> Note: The reprex package allows you to confirm that the example is in fact
> reproducible, so technically it is not necessary to include the plot images
> in the question. However, reprex used to conveniently support putting the
> images on the imgur website, and for some reason it no longer does that, so
> just run the example interactively to see the graphs.
>
> #######
> ############################################################
> ############################################################
>
> library("ggplot2")
>
> # "file" is the name of a very fundamental function in base R. Re-using
> # that name for a data value is at best confusing to anyone reading your
> # code and at worst will prevent you from using that function.
> #file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
>
> # Instead of giving us a file, keep the data within the example
> # DF <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
> # set.seed( 42 )
> # also shrink the size of the data for the example... we almost
> # never need all of it
> # dput( DF[ sample( seq.int( nrow( DF ) ), size = 200 ), , drop=FALSE ] )
> DF <- structure(list(LENGTH = c(6813L, 56035L, 123997L, 281L, 851L, 1072L,
>           72196L, 21L, 304L, 110L, 198L, 5922L, 283L, 199348L, 109L,
>           3317104L, 106L, 37642146L, 82641L, 20L, 125911L, 354L, 11625388L,
>           330L, 9811711L, 18L, 35L, 39897L, 27L, 277L, 79L, 2657L, 17L,
>           26L, 23L, 248L, 3634L, 21L, 324L, 206L, 328L, 42L, 286L,
> 6042409L,
>           24L, 36L, 2879L, 18L, 301L, 90684L, 4296636L, 43L, 1222L, 4536L,
>           3281L, 324L, 393L, 3754L, 98824541L, 459L, 18L, 1081L, 175L,
>           970L, 17L, 219L, 235558L, 1167315L, 25L, 623L, 2517515L, 32L,
>           217L, 29L, 17L, 1744L, 18L, 39L, 26L, 77L, 41L, 22L, 311L,
> 119015225L,
>           146413L, 22L, 19L, 301L, 373L, 2240L, 6439L, 128L, 18L, 257L,
>           783L, 5169L, 31608038L, 325L, 1533L, 25L, 69344L, 54L, 10651L,
>           31L, 335062L, 1854019L, 7153L, 38605567L, 51L, 23L, 16L, 301L,
>           79L, 313L, 18L, 29L, 39L, 22L, 17L, 306L, 67L, 280L, 324L, 158L,
>           93L, 2561L, 302L, 134578L, 328L, 9002L, 969051L, 34L, 20L, 309L,
>           355L, 28L, 9461327L, 18627013L, 305L, 64L, 18L, 2730L, 28L, 246L,
>           911L, 28L, 241483L, 154691L, 58891L, 55L, 456362L, 281L, 276L,
>           51L, 26L, 106821L, 313L, 78L, 29L, 400L, 61171382L, 200L, 101L,
>           220331L, 128L, 325L, 28L, 22L, 325L, 2330L, 5879L, 24L, 36L,
>           23L, 51L, 26L, 32584707L, 1672L, 13939L, 315L, 20L, 580785L,
>           42795L, 49193543L, 695L, 48568156L, 55634L, 207L, 318L, 22056L,
>           3670420L, 4815387L, 309L, 17L, 3143160L, 431L, 1164L, 33L, 5503L,
>           4166L)), .Names = "LENGTH", row.names = c(8283L, 8484L, 2591L,
>           7517L, 5808L, 4698L, 6665L, 1219L, 5944L, 6378L, 4140L, 6503L,
>           8452L, 2310L, 4180L, 8497L, 8842L, 1062L, 4293L, 5063L, 8168L,
>           1253L, 8932L, 8550L, 745L, 4643L, 3523L, 8177L, 4035L, 7545L,
>           6657L, 7319L, 3502L, 6181L, 36L, 7513L, 67L, 1873L, 8174L, 5516L,
>           3422L, 3928L, 338L, 8773L, 3891L, 8627L, 7997L, 5765L, 8745L,
>           5573L, 3003L, 3122L, 3588L, 7064L, 351L, 6739L, 6095L, 1541L,
>           2349L, 4628L, 6077L, 8839L, 6830L, 5094L, 7639L, 1704L, 2439L,
>           7443L, 6230L, 2162L, 387L, 1262L, 1944L, 4306L, 1773L, 6460L,
>           71L, 3371L, 4618L, 15L, 5220L, 1417L, 3222L, 5792L, 6960L, 5056L,
>           2096L, 807L, 768L, 2737L, 5983L, 3L, 1870L, 8361L, 8294L, 6577L,
>           2984L, 4614L, 6664L, 5545L, 5608L, 1945L, 1939L, 3482L, 8435L,
>           8615L, 6621L, 6561L, 4793L, 21L, 5447L, 7484L, 6721L, 4048L,
>           4790L, 4804L, 13L, 3179L, 5471L, 7407L, 3187L, 3669L, 5123L,
>           5267L, 6427L, 3527L, 8207L, 8593L, 2085L, 6467L, 8065L, 5385L,
>           5635L, 8363L, 7587L, 5172L, 7326L, 1015L, 6817L, 5560L, 1324L,
>           716L, 4136L, 6945L, 6536L, 7281L, 1516L, 8415L, 2616L, 1328L,
>           6406L, 2886L, 6933L, 3511L, 6040L, 6905L, 1672L, 259L, 1208L,
>           6051L, 8315L, 4896L, 5351L, 1752L, 4759L, 1597L, 4017L, 2818L,
>           1033L, 1654L, 6483L, 3659L, 3678L, 4266L, 3797L, 1212L, 7322L,
>           5258L, 7052L, 6826L, 8147L, 7655L, 2813L, 2300L, 6584L, 6629L,
>           8140L, 7034L, 1183L, 2551L, 1726L, 6950L, 1143L, 1144L, 641L,
>           471L, 4712L, 995L, 6582L, 6476L), class = "data.frame")
>
>
> ############################# display with PLOT FUNCTION:
>
>
> # saving files should be avoided in reproducible examples... especially
> files
> # that cannot be transmitted through the R-help mailing list such as pdf
> files
> #pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
> # Your original plot commands below create a fake impression of the data by
> # falsifying the axes. If you really are only interested in data points less
> # than 500, you should be explicit about creating a data set containing only
> # such constrained values before plotting them.
> plot(ecdf(DF$LENGTH), xlab="DEL SIZE",
>                      ylab="fraction of DEL",
>                      main="LENGTH of DEL",
>                      xlim=c(0,500),
>                      col = "dark red", axes = FALSE)
> ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
> axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
> ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
> axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-1.png)
>
> # my recommendation
> DF500 <- subset( DF, LENGTH < 500 )
> plot( ecdf( DF500$LENGTH )
>     , xlab = "DEL SIZE"
>     , ylab = "fraction of DEL"
>     , main = "LENGTH of DEL"
>     , col = "dark red"
>     )
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-2.png)
>
> # alternatively
> plot( ecdf( DF$LENGTH )
>     , xlab = "DEL SIZE"
>     , ylab = "fraction of DEL"
>     , main = "LENGTH of DEL"
>     , col = "dark red"
>     , xlim=c( 1, 1e9 )
>     , log="x"
>     )
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-3.png)
>
>
>
> #dev.off()
>
> ############################# display in GGPLOT2 :
>
> BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>            1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>
> barfill <- "#4271AE"
> barlines <- "#1F3552"
>
> #pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>
> # ggplot's limits behavior is enabling your false representation of the data,
> # but it warns you of the data removal
> ggplot(DF, aes(LENGTH)) +
>           stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>           scale_x_continuous(name = "LENGTH of DEL",
>                              breaks = BREAKS,
>                              limits=c(0, 500)
>                              ) +
>           scale_y_continuous(name = "FRACTION") +
>           ggtitle("ECDF of LENGTH") +
>           theme_bw() +
>           theme(legend.position = "bottom", legend.direction =
> "horizontal",
>                legend.box = "horizontal",
>                legend.key.size = unit(1, "cm"),
>                axis.title = element_text(size = 12),
>                legend.text = element_text(size = 9),
>                legend.title=element_text(face = "bold", size = 9))
> #> Warning: Removed 80 rows containing non-finite values (stat_ecdf).
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-4.png)
>
>
> # my recommendation
> ggplot(DF500, aes(LENGTH)) +
>   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>   scale_x_continuous(name = "LENGTH of DEL",
>                      breaks = BREAKS ) +
>   scale_y_continuous(name = "FRACTION") +
>   ggtitle("ECDF of LENGTH") +
>   theme_bw() +
>   theme(legend.position = "bottom", legend.direction = "horizontal",
>         legend.box = "horizontal",
>         legend.key.size = unit(1, "cm"),
>         axis.title = element_text(size = 12),
>         legend.text = element_text(size = 9),
>         legend.title=element_text(face = "bold", size = 9))
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-5.png)
>
> # or for the un-filtered data
> ggplot(DF, aes(LENGTH)) +
>   stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>   scale_x_log10( name = "LENGTH of DEL") +
>   scale_y_continuous(name = "FRACTION") +
>   ggtitle("ECDF of LENGTH") +
>   theme_bw() +
>   theme(legend.position = "bottom", legend.direction = "horizontal",
>         legend.box = "horizontal",
>         legend.key.size = unit(1, "cm"),
>         axis.title = element_text(size = 12),
>         legend.text = element_text(size = 9),
>         legend.title=element_text(face = "bold", size = 9))
>
> #' ![](file1f4143e5e164_reprex_files/figure-markdown_strict/reprex-body-6.png)
>
>
> #dev.off()
>
> #' Created on 2018-07-09 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
> #######
>
>
> On Sun, 8 Jul 2018, Bogdan Tanasa wrote:
>
>> Dear Jeff,
>> thank you for your email.
>>
>> Yes, in order to be more descriptive/comprehensive, please find attached to
>> my email the following files (my apologies ... I am sending these as
>> attachments, as I do not have a web server running at this moment) :
>>
>> -- the R script (R_script_display_ECDF.R) that reads the file "LENGTH" and
>> outputs ECDF figure by using the standard R function or ggplot2.
>>
>> -- the display of ECDF by using standard R function
>> ("display.R.ecdf.LENGTH.pdf")
>>
>> -- the display of ECDF by using ggplot2 ("display.ggplot2.ecdf.LENGTH.pdf")
>>
>> The ECDF over xlim(0,500) looks very different between plot(ecdf) and
>> ggplot2. Could you please advise why, and what I should change in my
>> ggplot2 code?
>>
>> thanks a lot,
>>
>> - bogdan
>>
>> ps : the R code is also written below :
>>
>>       library("ggplot2")
>>
>>       file <- read.delim("LENGTH", sep="\t", header=T, stringsAsFactors=F)
>>
>>       ############################# display with PLOT FUNCTION:
>>
>>       pdf("display.R.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>>
>>       plot(ecdf(file$LENGTH), xlab="DEL SIZE",
>>                            ylab="fraction of DEL",
>>                            main="LENGTH of DEL",
>>                            xlim=c(0,500),
>>                            col = "dark red", axes = FALSE)
>>
>>       ticks_y <- c(0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4)
>>       axis(2, at=ticks_y, labels=ticks_y, col.axis="red")
>>
>>       ticks_x <- c(0, 100, 200, 400, 500, 600, 700, 800)
>>       axis(1, at=ticks_x, labels=ticks_x, col.axis="blue")
>>
>>       dev.off()
>>
>>       ############################# display in GGPLOT2 :
>>
>>       BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>>                  1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>>
>>       barfill <- "#4271AE"
>>       barlines <- "#1F3552"
>>
>>       pdf("display.ggplot2.ecdf.LENGTH.pdf", width=10, height=6, paper='special')
>>
>>       ggplot(file, aes(LENGTH)) +
>>                 stat_ecdf(geom = "point", colour = barlines, fill = barfill) +
>>                 scale_x_continuous(name = "LENGTH of DEL",
>>                                    breaks = BREAKS,
>>                                    limits=c(0, 500)) +
>>                 scale_y_continuous(name = "FRACTION") +
>>                 ggtitle("ECDF of LENGTH") +
>>                 theme_bw() +
>>                 theme(legend.position = "bottom", legend.direction = "horizontal",
>>                      legend.box = "horizontal",
>>                      legend.key.size = unit(1, "cm"),
>>                      axis.title = element_text(size = 12),
>>                      legend.text = element_text(size = 9),
>>                      legend.title=element_text(face = "bold", size = 9))
>>
>>       dev.off()
>>
>>
>> On Sat, Jul 7, 2018 at 9:47 PM, Jeff Newmiller <[hidden email]>
>> wrote:
>>       It is a feature of ggplot that points excluded by limits raise
>>       warnings, while base graphics do not.
>>
>>       You may find that using coord_cartesian with the xlim=c(0,500)
>>       argument works better with ggplot by showing the consequences of
>>       points out of the limits on lines within the viewport.
>>
>>       There are other possible problems with your data that your
>>       non-reproducible example does not show, and sending R code in
>>       HTML-formatted email usually corrupts it.. so please follow the
>>       recommendations in the Posting Guide next time you post.
>>
>>       On July 6, 2018 4:32:41 PM PDT, Bogdan Tanasa <[hidden email]>
>>       wrote:
>>       >Dear all,
>>       >
>>       >I would appreciate having your advice/suggestions/comments on the
>>       >following :
>>       >
>>       >1 -- starting from a vector that contains LENGTHS (numerically, the
>>       >values are from 1 to 10 000)
>>       >
>>       >2 -- shall I display the ECDF by using the R code and some "limits" :
>>       >
>>       >BREAKS = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
>>       >         1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)
>>       >
>>       >ggplot(x, aes(LENGTH)) +
>>       >          stat_ecdf(geom = "point") +
>>       >          scale_x_continuous(name = "LENGTH of DEL",
>>       >                             breaks = BREAKS,
>>       >                             limits=c(0, 500))
>>       >
>>       >3 -- I am getting the following warning message : "Warning message:
>>       >Removed 109 rows containing non-finite values (stat_ecdf)."
>>       >
>>       >The question is : are these 109 values removed from VISUALIZATION as I
>>       >set up the "limits", or are these 109 values removed from statistical
>>       >CALCULATION?
>>       >
>>       >4 -- in contrast, shall I use the standard R functions plot(ecdf),
>>       >there is no "warning message"
>>       >
>>       >plot(ecdf(x$LENGTH), xlab="DEL LENGTH",
>>       >                     ylab="Fraction of DEL", main="DEL", xlim=c(0,500),
>>       >                     col = "dark red")
>>       >
>>       >Thanks a lot !
>>       >
>>       >-- bogdan
>>       >
>>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<[hidden email]>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------

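For readers following the thread, here is a small sketch of why the two plots differ. It uses made-up lengths (the DF below is hypothetical, not the data posted above) and assumes the usual ggplot2 behaviour that values outside the scale limits are dropped before stat_ecdf runs, which is what the "Removed ... rows" warning reports. Base plot(ecdf(...), xlim=...) only zooms the view, so the curve is still computed from all values.

```r
# Hypothetical lengths for illustration only (not the data from this thread)
DF <- data.frame(LENGTH = c(5, 20, 80, 150, 400, 900, 2500, 7000, 9000, 10000))

# Base R with xlim: the ECDF is computed from all 10 values, the view is zoomed
F_all <- ecdf(DF$LENGTH)

# Effect of scale limits c(0, 500): values outside the limits are dropped first,
# so the ECDF is computed from the 5 remaining values only
F_cut <- ecdf(DF$LENGTH[DF$LENGTH >= 0 & DF$LENGTH <= 500])

F_all(500)  # 0.5: half of all values are <= 500
F_cut(500)  # 1:   every retained value is <= 500

# ggplot2 counterpart of the base R zoom, run only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_zoom <- ggplot(DF, aes(LENGTH)) +
    stat_ecdf(geom = "point") +
    coord_cartesian(xlim = c(0, 500))  # zooms without discarding any rows
  print(p_zoom)
}
```

With coord_cartesian the curve should level off at 0.5 inside the window for this toy data, matching the base R picture, whereas limits=c(0, 500) on the scale would make it climb to 1. Which of the two is appropriate depends on whether the fraction is meant relative to all values or only to those up to 500.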