how to unbreak a circular package dependence (S4 class data)

Daniel Kelley
I have an issue with a circular package dependence that prevents building/checking, and I seek advice on breaking the circle so the packages can pass the build-check tests that are required for CRAN submission.

The package pair I'm working with is slow to build, but my tests suggest the issue may be general, and so I will explain it in general terms.

Suppose there are two packages:

1. Foo, a package that defines some data types with S4 classes.

2. Foodata, a package that provides such datasets, for use by Foo.

With this setup, it seems reasonable that Foo "depends" on Foodata, so the data can be used in Foo and its documentation.

Since the data within Foodata are S4 classes as defined in Foo, an attempt to build-check Foodata will produce an error unless Foo is present.  But Foo cannot be built unless Foodata exists, since it depends on it.  Thus neither Foo nor Foodata can be built and checked.

One solution would be to wrap the Foo documentation examples (and relevant Foo code) in require() blocks, and to make Foo "suggest" Foodata, not "depend" upon it.  My question is whether this is the recommended practice, or the common practice.

Thanks in advance to anyone who wishes to offer hints.

PS. The problem arose from an attempt to reduce CRAN load by extracting the datasets that had been contained within a previous version of Foo.

PPS. My (slow-building) packages are on GitHub, and I can supply details if needed.

Dan E. Kelley
Professor, Oceanography Department
Dalhousie University, Canada
[hidden email]



______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: how to unbreak a circular package dependence (S4 class data)

Kasper Daniel Hansen-2
This question is quite common in Bioconductor because of the extensive use
of S4 and because our data are often too big to stay within the size
limits on software packages. (We separate packages into software and
data; software packages are limited to 5MB for the final source tarball,
data packages are not.)

The solution we use is to let Foo suggest Foodata and then wrap every
example into

if(require(Foodata)) {
  CODE
}

This is exactly one of the possibilities you mention in your post.


As I see it, Foodata has to list Foo in Depends because it has data defined
using the classes in Foo.  R-exts 1.1.3 says (about the Suggests field)
"The 'Suggests' field uses the same syntax as 'Depends' and lists packages
that are not necessarily needed. This includes packages used only in
examples, tests or vignettes".
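
A minimal sketch of the guarded-example idea, assuming Foo's DESCRIPTION lists Foodata under Suggests and Foodata's lists Foo under Depends. The helper name run_foodata_example is hypothetical, and requireNamespace() is used here as a variant of the require() idiom shown above; it only checks that the package can be loaded and does not attach it.

```r
# Hypothetical layout (an assumption, not taken from a real package):
#   Foo's DESCRIPTION:      Suggests: Foodata
#   Foodata's DESCRIPTION:  Depends: Foo

# Guarded example code, as it might appear in one of Foo's help pages.
run_foodata_example <- function(pkg = "Foodata") {
  # requireNamespace() returns FALSE instead of failing when pkg is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; skipping example.")
    return(invisible(FALSE))
  }
  # CODE that uses the datasets would go here
  invisible(TRUE)
}

run_foodata_example()
```

With this shape, checking Foo succeeds whether or not Foodata is installed, so Foo can be built first and Foodata afterwards.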

Bioc packages I have authored that follow this setup are
  minfi/minfiData
  bsseq/bsseqData
but there are other examples by other authors (which I cannot recall off
the top of my head).

Best,
Kasper




Re: how to unbreak a circular package dependence (S4 class data)

Hervé Pagès
Hi Daniel,

On 01/28/2014 03:49 PM, Daniel Kelley wrote:

> Since the data within Foodata are S4 classes as defined in Foo, an attempt to build-check Foodata will produce an error unless Foo is present.  But Foo cannot be built unless Foodata exists, since it depends on it.  Thus neither Foo nor Foodata can be built and checked.

I've learned by experience that it's generally better (although not
always possible) to avoid putting serialized S4 objects in a data
package. They will break if you need to modify the internals of the
class even slightly (and chances are high that you will at some
point). Better to store the data in a format that is more or less
guaranteed to remain the same for years (SQLite, XML, hdf5, plain text,
serialized data frame, SAM/BAM, etc.) and try to come up with
a fast way to load the data and turn it into an S4 object on demand.

Not always possible if the data is huge... but for the purpose of using
it in Foo's examples and vignette do you really need huge data?

Another advantage of this approach is that the data can then be
more easily shared because it can be accessed with tools other
than yours, e.g. tools that don't know about S4 and even non-R
tools.
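
A minimal sketch of this approach, under stated assumptions: the class FooSeries, its slots, and the loader load_foo_series are hypothetical stand-ins for whatever Foo defines, and a temporary CSV file plays the role of a plain-text file shipped in the data package.

```r
library(methods)

# Hypothetical S4 class standing in for one defined by Foo
setClass("FooSeries", slots = c(time = "numeric", value = "numeric"))

# Stand-in for a plain-text file shipped with the data package
# (a real package would locate its file with system.file())
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(time = 1:5, value = c(2.1, 2.4, 2.2, 2.8, 3.0)),
          csv_path, row.names = FALSE)

# Build the S4 object on demand from the stable on-disk format
load_foo_series <- function(path) {
  raw <- read.csv(path)
  new("FooSeries",
      time = as.numeric(raw$time),
      value = as.numeric(raw$value))
}

x <- load_foo_series(csv_path)
```

If the class internals change later, only the loader needs updating; the file on disk stays as it is.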

Cheers,
H.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319


Re: how to unbreak a circular package dependence (S4 class data)

Kasper Daniel Hansen-2
This is a great comment if the primary use of the data is to make the data
available.

It is clear that a change in the internals of the class structure requires
changing the data package, and that is a clear drawback to my
recommendation.  I have had to do this on several occasions.

One issue with Hervé's recommendation arises when the same data structure is
used in several examples.  In that case, the conversion / parsing overhead
multiplies by the number of examples.  As an example, in minfiData I have
data on 6 samples on a somewhat large array.  Parsing the raw data files
for 3 of the 6 samples takes 16 seconds (this is the timing you get from
example(read.450k.exp), which does exactly that).  Loading all 6 arrays as
an R data structure takes 1.1 seconds.
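
One way the repeated cost might be limited, sketched under assumptions: if several examples run in the same R session, the converted object could be cached in an environment so the parsing happens only once. The names .cache, slow_parse and get_example_data are hypothetical, and Sys.sleep() merely stands in for a slow parser.

```r
# Hypothetical cache; in a package this would live in the namespace
.cache <- new.env(parent = emptyenv())

# Stand-in for an expensive parser; the counter records how often it runs
.n_parses <- 0L
slow_parse <- function() {
  .n_parses <<- .n_parses + 1L
  Sys.sleep(0.1)  # pretend this is slow raw-file parsing
  data.frame(sample = 1:6, signal = (1:6)^2)
}

# Parse on first request, then reuse the stored result
get_example_data <- function() {
  if (is.null(.cache$data)) {
    .cache$data <- slow_parse()
  }
  .cache$data
}

a <- get_example_data()  # parses
b <- get_example_data()  # served from the cache
```

This does not help across separate R sessions, where the parsing cost is paid again.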

I would generally recommend that a data package either includes a more raw
form of the data or has a script which makes the data easily retrievable.

Best,
Kasper



Re: how to unbreak a circular package dependence (S4 class data)

beleites,claudia
Kasper,

Here's how I deal with a largish data set (although I keep data and code
in one package, because of exactly that kind of circular dependency):

The data set is stored PCA-compressed (only the first few principal
components) in matrices plus some meta information (vector, list,
data.frame).

I then have an internal function that reconstructs my example data:

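# Rebuild the example data set from its PCA-compressed parts
.make.chondro <- function (){
  new ("hyperSpec",
       # scores %*% t (loadings), plus the center added back to every row
       spc =  (tcrossprod (.chondro.scores, .chondro.loadings) +
               rep (.chondro.center, each = nrow (.chondro.scores))),
       wavelength = .chondro.wl,
       data = .chondro.extra, labels = .chondro.labels)
}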

The result of that function is assigned when the data is first used:

delayedAssign ("chondro", .make.chondro ())
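
A self-contained toy version of the same idea, as a sketch only: the matrix, the number of retained components k, and the names make_data and toy are made up for illustration and are not part of hyperSpec. It compresses a small matrix with prcomp(), reconstructs it from the first few components, and uses delayedAssign() so the reconstruction runs only on first access.

```r
# Toy matrix standing in for a large spectra matrix (hypothetical data)
set.seed(1)
full <- matrix(rnorm(200), nrow = 20, ncol = 10)

# Compress: keep only the first k principal components
k <- 3
pca <- prcomp(full, center = TRUE, scale. = FALSE)
scores   <- pca$x[, 1:k, drop = FALSE]
loadings <- pca$rotation[, 1:k, drop = FALSE]
center   <- pca$center

# Reconstruct: scores %*% t(loadings), then add the column means back
n_calls <- 0L
make_data <- function() {
  n_calls <<- n_calls + 1L
  tcrossprod(scores, loadings) + rep(center, each = nrow(scores))
}

# The promise is evaluated only when 'toy' is first accessed
delayedAssign("toy", make_data())
calls_before <- n_calls  # reconstruction has not run yet
dims <- dim(toy)         # first access forces the promise
calls_after <- n_calls   # reconstruction has now run once
```

The reconstruction is lossy whenever k is smaller than the number of columns, which is the price of the compression.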

With that it should be possible to have the data package Suggests: the
main package, while the main package Depends: on the data (though I have
not yet found the time to separate the two).

Side note: the original raw data file (compressed ASCII) is available
together with a variety of other raw data files from the project
home page - interested users find download links in the
vignettes and help pages.

Best,

Claudia





--
Claudia Beleites, Chemist
Spectroscopy/Imaging
Institute of Photonic Technology
Albert-Einstein-Str. 9
07745 Jena
Germany

email: [hidden email]
phone: +49 3641 206-133
fax:   +49 2641 206-399
