CRAN package sizes

CRAN package sizes

Prof Brian Ripley
Robin Hankin's post reminded me to post about the following recent
addition to 'Writing R Extensions', in the section on 'Submitting a
package to CRAN'

   Ensure that the package sources are not unnecessarily large. ...
   As a general rule, doc directories should not exceed 5Mb, and
   where data directories need to be 10Mb or more, consideration should
   be given to a separate package containing just the data. (Similarly
   for external data directories, large jar files and other libraries
   that need to be installed.)

With 2800 packages on CRAN, overall size is becoming a concern and
currently to install all of CRAN takes 4Gb.  As the attached (I hope)
graph shows, the 20 packages over 20Mb take a quarter, and those over
5Mb take half.  (And this is after we have removed 100Mb from the
largest installed package by re-compression, and archived the second
largest, so Robin's package is currently the largest.)  Some of the
largest packages are data/jar packages, but there are 55 packages with
'doc' directories over 5Mb.  To put that in perspective, PDFs of whole
books with lots of figures (MASS, Paul's R Graphics) are well under
5Mb.

R CMD check in R-devel reports on large packages, and expect in future
that submitted package sizes will be questioned more often.

There are lots of different reasons why doc directories are large, but
the major ones are

- installing files that are unneeded, such as Rplots.pdf and .eps
   figures.
- using PDF figures of images where PNG would be more appropriate.
- including less than relevant material (such as how to install R,
   with screenshots!)
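
A minimal sketch of how one might look for the first kind of problem in a package source tree, assuming a POSIX shell. The function name list_doc_bloat, its arguments and the default limit are illustrative and are not part of R or of any CRAN tooling:

```shell
# Hypothetical helper: list_doc_bloat DIR [LIMIT_KB]
# Prints likely vignette leftovers (Rplots.pdf, *.eps) and any file
# larger than LIMIT_KB kilobytes (default 1024) found under DIR.
list_doc_bloat() {
    # Leftover files of the kind named in the post.
    find "$1" -type f \( -name 'Rplots.pdf' -o -name '*.eps' \) |
        sed 's/^/leftover: /'
    # POSIX find measures -size in 512-byte blocks, hence KB * 2.
    find "$1" -type f -size +"$(( ${2:-1024} * 2 ))" |
        sed 's/^/large: /'
}

# Example (assumed layout): list_doc_bloat inst/doc 1024
```

The output is only a list of candidates; whether a given file is really unneeded still has to be decided by the package author.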

There are several ways to reduce the sizes of PDFs with no loss in
quality, e.g. Adobe Acrobat Standard/Pro.

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Attachment: sizes.png (31K)

Re: CRAN package sizes

Yihui Xie-2
Regarding the reasons that make the doc directory large, I wonder if
we can make some changes in R:

1. Use a null graphics device as the default device rather than pdf()
when running Sweave -- this can avoid the useless Rplots.pdf:

options(device = function(...) {
    .Call("R_GD_nullDevice", PACKAGE = "grDevices")
})

This can save some time in building the vignette(s) as well. (see
http://yihui.name/en/?p=673)

However, this undocumented null device may not work for certain
graphics. Here is an example where it fails for ggplot2:
http://stackoverflow.com/questions/4692974/ggplot2-code-that-works-interactively-rkward-crashes-under-lyx-pgfsweave-hint/4707745#4707745

Is it possible for someone to look into the null device (Dr Murrell?)
to make it stable enough?

2. Compress the PDF graphics and vignettes using third-party tools,
among which I recommend qpdf (it's free).

qpdf --stream-data=compress input.pdf output.pdf

This can reduce the size of PDF files a lot without quality loss. I'm
using this tool in the animation package to reduce the size of PDF
animations.

3. Sorry to bring up this issue again, but I don't understand why
Sweave could not implement the png() device along with pdf() and
postscript(). I'm willing to provide a patch if needed.

Thanks!

Regards,
Yihui
--
Yihui Xie <[hidden email]>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA



On Sun, Feb 13, 2011 at 6:30 AM, Prof Brian Ripley
<[hidden email]> wrote:

> [...]

Re: CRAN package sizes

Kevin Coombes
I think it would be even more useful if we could get Sweave to easily
produce PNG figures instead of just PDF/EPS.  In the current state of
things, making PNG versions is more cumbersome than making PDF versions,
so I'm not surprised that most people don't go to that trouble most of
the time.

I also know (from searching the archives when I wanted to try this
myself) that a couple of people have, in the past, modified Sweave so it
can generate PNG automatically.  However, the changes have never
migrated into the released version.  Perhaps the space constraints at
CRAN can convince Friedrich Leisch that the change would be a good idea....

     Kevin

On 2/13/2011 3:02 PM, Yihui Xie wrote:

> [...]

Re: CRAN package sizes

Prof Brian Ripley
On Sun, 13 Feb 2011, Yihui Xie wrote:

> Regarding the reasons that make the doc directory large, I wonder if
> we can make some changes in R:

'we' cannot: only core developers can.  However, end users can
contribute in many other ways: see below.

> 1. Use a null graphics device as the default device rather than pdf()
> when running Sweave -- this can avoid the useless Rplots.pdf:
>
> options(device = function(...) {
>    .Call("R_GD_nullDevice", PACKAGE = "grDevices")
> })
>
> This can save some time in building the vignette(s) as well. (see
> http://yihui.name/en/?p=673)
>
> However, this undocumented null device may not work for certain
> graphics. Here is an example where it fails for ggplot2:
> http://stackoverflow.com/questions/4692974/ggplot2-code-that-works-interactively-rkward-crashes-under-lyx-pgfsweave-hint/4707745#4707745
>
> Is it possible for someone to look into the null device (Dr Murrell?)
> to make it stable enough?

I don't see a bug report on that, and a patch would help expedite
this.

> 2. Compress the PDF graphics and vignettes using third-party tools,
> among which I recommend qpdf (it's free).
>
> qpdf --stream-data=compress input.pdf output.pdf
>
> This can reduce the size of PDF files a lot without quality loss. I'm
> using this tool in the animation package to reduce the size of PDF
> animations.

*Can*, but I did say

   'There are several ways to reduce the sizes of PDFs with no loss in
    quality, e.g. Adobe Acrobat Standard/Pro.'

and qpdf is often ineffective (or worse), e.g. on package mokken.  The
problem is that many of the large packages need images re-saved in
some other format (or preferably re-generated in some other format).

I've added a --compact-vignettes option to R CMD build (in R-devel).
At present it uses qpdf, but I will look out for better/additional
options.  (I use Acrobat 9 Pro on my Mac and that has always beaten
qpdf, often by a large margin.  But qpdf is perhaps the most readily
available of these tools.)
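
Since a compression tool can be ineffective or even make a file larger, a minimal sketch of a guard that keeps the rewritten PDF only when it is actually smaller. It assumes a POSIX shell; the function name keep_if_smaller and the file names in the example are illustrative, and this is not a description of how --compact-vignettes is implemented:

```shell
# Hypothetical helper: keep_if_smaller ORIGINAL CANDIDATE
# Replaces ORIGINAL with CANDIDATE only when CANDIDATE is non-empty and
# strictly smaller; otherwise CANDIDATE is discarded.
keep_if_smaller() {
    # $(( ... )) strips the padding some wc implementations print.
    orig_size=$(( $(wc -c < "$1") ))
    cand_size=$(( $(wc -c < "$2") ))
    if [ "$cand_size" -gt 0 ] && [ "$cand_size" -lt "$orig_size" ]; then
        mv "$2" "$1"
        echo "replaced: $1 ($orig_size -> $cand_size bytes)"
    else
        rm -f "$2"
        echo "kept: $1 ($orig_size bytes)"
    fi
}

# Example with the qpdf call quoted above (file names are placeholders):
#   qpdf --stream-data=compress vignette.pdf vignette.tmp.pdf &&
#       keep_if_smaller vignette.pdf vignette.tmp.pdf
```

The same guard works for any tool that writes its result to a separate file.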

> 3. Sorry to bring up this issue again, but I don't understand why
> Sweave could not implement the png() device along with pdf() and
> postscript(). I'm willing to provide a patch if needed.

Does it need changes to R?  I believe that it just needs a
different driver, something which could be provided in a package.

This has been raised several times (including recently) with the
Sweave maintainer, so maybe it will happen eventually.  But a package
would retrofit it to earlier versions of R.


vignette question was: CRAN package sizes

Claudia Beleites
Also I started doing my homework with regard to package size, and that is
mainly cleaning leftovers from vignette generation and compressing the pdfs.

For most of my vignettes, ghostscript (lossy) compression works very well.
I use the /screen settings and -dDownsampleColorImages=false:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE \
   -dQUIET -dBATCH -dDownsampleColorImages=false -dAutoRotatePages=/None

(-dDownsampleColorImages=false is important, as I found that otherwise some .png
images become completely useless. However, the pngs are saved with carefully
determined size and are pngs because the pdfs were too large, so I know that
the bitmap images are already "size-optimized".)
I wrote an inst/doc/makefile to do this and also to clean up a few more
"leftovers from the vignette".
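
The gs command in this post is given without its input and output files. A minimal sketch of a complete invocation, assuming Ghostscript is installed as gs and a POSIX shell; the wrapper name gs_screen_compress, the GS override variable and the file names are illustrative and not taken from the post:

```shell
# Hypothetical wrapper: gs_screen_compress INPUT.pdf OUTPUT.pdf
# Runs Ghostscript with the flags quoted in the post and adds the output
# and input file arguments. Set GS to use a different executable.
gs_screen_compress() {
    "${GS:-gs}" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
        -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH \
        -dDownsampleColorImages=false -dAutoRotatePages=/None \
        -sOutputFile="$2" "$1"
}

# Example (placeholder names):
#   gs_screen_compress vignette.pdf vignette-small.pdf
```

Because the /screen settings are lossy, it is worth checking the output by eye before replacing the original file.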

BTW: while compressing the final .pdf achieves better total compression, it
already helps a lot to compress the .pdf figures which can be done at the end of
the .Rnw.

qpdf didn't help for my vignettes.


One question remains, though. I have two vignettes for which I cannot put the
original data into the package. (The very first thing in each vignette is a link
to a zip file on r-forge that contains everything needed to reproduce it; I
think this is accessible enough for FOSS.)
I'd like to have these documents accessible via the usual vignette() mechanism
(this question has come up before, but I found only that 00Index.dcf no
longer works).
My second thought was to set up the Makefile so that instead of building the pdf
a message is printed and the available pdf is used.
This does not work, however: buildVignettes (which I guess does the work*) first
Sweaves the .Rnw file and only then replaces the texi2dvi() call by make.
Is this intended behaviour? If so, how do I make my vignette accessible?
(Obviously the "dummy .Rnw that includes the pdf" technique doesn't look quite
appropriate, as it leads to unnecessarily large package size.)

*I did not realise this from the Makefile discussion in the extensions manual
(nor does the help page of buildVignettes mention anything about this). Also,
I'd appreciate it very much if the extensions manual mentioned buildVignettes:
it took me quite a while to find out what code is used and why my Makefile
didn't lead to the desired results.

Thanks a lot for any ideas,

Claudia

--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: [hidden email]


Re: vignette question was: CRAN package sizes

Claudia Beleites
Please excuse the noise about dummy .Rnw.

On 02/15/2011 03:44 PM, Claudia Beleites wrote:

> [...]

> One question remains, though. I have two vignettes, where I cannot put the
> original data into the package (the very first thing in the vignette is the link
> to a zip file on r-forge that contains everything needed to reproduce the
> vignette, though. I think this is accessible enough for FOSS).
> I'd like to have these documents accessible via the usual vignette () mechanism
> (this question has come up before, but I found only that the 00Index.dcf does
> not work any longer).
> My second thought was to set up the Makefile so that instead of building the pdf
> a message is printed and the available pdf is used.
> This does not work, however: buildVignettes (which I guess does the work*) first
> Sweaves the .Rnw file and then replaces the texi2dvi () call by make.
> Is this intended behaviour? If so, how do I make my vignette accessible
> [obviously the "dummy .Rnw that includes the pdf"-technique doesn't look quite
> appropriate as it leads to unnecessarily large package size]?

A coffee break was helpful: of course I just need a dummy .Rnw that is processed
to a .tex but (via the Makefile) _not_ to pdf...

Sorry, Claudia