Struggeling with svydesign()

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Struggeling with svydesign()

Thierry Onkelinx
Dear all,

We are analysing some survey data and we are not sure if we are using
the correct syntax for our design.

The population of interest is a set of 4416 polygons with different
sizes ranging from 0.003 to 45.6 ha, 7460 ha in total. Each polygon has
a binary attribute (presence/absence) and we want to estimate the
probability of presence in the population.

We used sampling with replacement weighted by the area of the polygon.
The population was stratified using 2 variables: block and type. Each of
the 14 blocks is a 20 by 50 km geographical region. Type is a two level
factor. Not every level is present in each block. Each block has a
Status attribute with two levels: medium (9 blocks) or good (5 blocks).
Besides the overall ratio, we would like the estimate the ratio per
Status.
The samplesize per stratum was calculated with epi.stratasize() from the
epiR package. The population size in the 21 strata ranges from 1 to
1158. The sample size ranges from 0 in the blocks with very few polygons
(<20), 1 in blocks with a low number of polygon (20 - 50) and up to 25
polygons in the largest strata.

Does the syntax below represents the data structure above? Any comments
are welcome.

library(survey)
svydesign(
        id = ~ 1, #no clustering
        weights = ~ Area, #weighted by the area of the polygon
        strata = ~ Status + Block + Type,
        nest = TRUE
)
# Is Area a correct weighting factor? Or should we use the area divided
by the sum of the total area (per stratum?)
# The code above runs. But when we omit "Status" from the strata, then
we get an error: "a stratum has only 1 PSU". Shouldn't we get the same
error with the code above?

#with finity population correction
svydesign(
        id = ~ 1, #no clustering
        weights = ~ Area, #weighted by the area of the polygon
        strata = ~ Status + Block + Type,
        fpc ~ nStatus + nBlock + nType,
        nest = TRUE
)
#We are not sure what to use for nStatus, nBlock and nType. Is it the
number of levels of that stratum (nStatus = 2)? The number of levels in
the stratum below (nStatus = length(unique(Block)) per level of Status,
nType = number of polygons per Status:Block:Type)? The total number of
polygons in that stratum?

Best regards,

Thierry

------------------------------------------------------------------------
----
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek
team Biometrie & Kwaliteitszorg
Gaverstraat 4
9500 Geraardsbergen
Belgium

Research Institute for Nature and Forest
team Biometrics & Quality Assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium

tel. + 32 54/436 185
[hidden email]
www.inbo.be

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey


Druk dit bericht a.u.b. niet onnodig af.
Please do not print this message unnecessarily.

Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver weer
en binden het INBO onder geen enkel beding, zolang dit bericht niet bevestigd is
door een geldig ondertekend document. The views expressed in  this message
and any annex are purely those of the writer and may not be regarded as stating
an official position of INBO, as long as the message is not confirmed by a duly
signed document.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Struggeling with svydesign()

Thomas Lumley
On Wed, 7 Apr 2010, ONKELINX, Thierry wrote:

> Dear all,
>
> We are analysing some survey data and we are not sure if we are using
> the correct syntax for our design.
>
> The population of interest is a set of 4416 polygons with different
> sizes ranging from 0.003 to 45.6 ha, 7460 ha in total. Each polygon has
> a binary attribute (presence/absence) and we want to estimate the
> probability of presence in the population.
>
> We used sampling with replacement weighted by the area of the polygon.
> The population was stratified using 2 variables: block and type. Each of
> the 14 blocks is a 20 by 50 km geographical region. Type is a two level
> factor. Not every level is present in each block. Each block has a
> Status attribute with two levels: medium (9 blocks) or good (5 blocks).
> Besides the overall ratio, we would like the estimate the ratio per
> Status.
> The samplesize per stratum was calculated with epi.stratasize() from the
> epiR package. The population size in the 21 strata ranges from 1 to
> 1158. The sample size ranges from 0 in the blocks with very few polygons
> (<20), 1 in blocks with a low number of polygon (20 - 50) and up to 25
> polygons in the largest strata.

That sounds strange.  If you have a stratified sample and have set the sample size in some strata to be zero, you cannot possibly learn anything about those strata and so you can't get unbiased population estimates.   In order to get unbiased estimates and valid standard errors you need at least two samples per stratum.

You're going to have to combine some of the strata so that each stratum has at least two observations.  Since your design only makes sense if you assume the small, unsampled, strata are similar to some of the larger strata, it should be possible for you to combine them.


> Does the syntax below represents the data structure above? Any comments
> are welcome.
>
> library(survey)
> svydesign(
> id = ~ 1, #no clustering
> weights = ~ Area, #weighted by the area of the polygon
> strata = ~ Status + Block + Type,
> nest = TRUE
> )

You want strata = ~interaction(Block,Type,drop=TRUE), which specifies a single stage of sampling in which the strata are combinations of Block and Type.  The fact that you need drop=TRUE is a bug, which I will fix.

> # Is Area a correct weighting factor? Or should we use the area divided
> by the sum of the total area (per stratum?)

It's not clear to me from your description whether the probability of sampling a particular region is proportional to its Area or inversely proportional to its Area.  If the probability is proportional to Area, the weight would be 1/Area

  svydesign(
  id = ~ 1, #no clustering
  weights = ~ I(1/Area), #weighted by the area of the polygon
  strata = ~ interaction(Block, Type,drop=TRUE),
  nest = TRUE
  )


> # The code above runs. But when we omit "Status" from the strata, then
> we get an error: "a stratum has only 1 PSU". Shouldn't we get the same
> error with the code above?
>
> #with finity population correction
> svydesign(
> id = ~ 1, #no clustering
> weights = ~ Area, #weighted by the area of the polygon
> strata = ~ Status + Block + Type,
> fpc ~ nStatus + nBlock + nType,
> nest = TRUE
> )
> #We are not sure what to use for nStatus, nBlock and nType. Is it the
> number of levels of that stratum (nStatus = 2)? The number of levels in
> the stratum below (nStatus = length(unique(Block)) per level of Status,
> nType = number of polygons per Status:Block:Type)? The total number of
> polygons in that stratum?

This is easier when you get the right strata.  There should be a single fpc variable, which should be equal to the number of polygons in the population for that stratum.


> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to
> say what the experiment died of.
> ~ Sir Ronald Aylmer Fisher

Indeed.


      -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Struggeling with svydesign()

Thierry Onkelinx
Dear Thomas,

Thank you for your informative answer. We used epi.stratasize() to
estimate the required sample size per stratum. Notice in the example
below that it can select a sample size smaller than 2 in the very small
strata. Would you recommend to sample at least two items per stratum or
rather to merge some strata a priori until the sample size is at least
2? Or is there a better way to estimate the sample size per stratum?
Note that the stratification only aims to get a good geographical
coverage (the strata a geographical regions). We are not interested in
estimates per stratum.

library(epiR)
N <- c(39, 270, 1060, 1336, 118, 26, 154, 10, 3)
epi.stratasize(strata.n = N, strata.mean = 0.9, epsilon = 0.05, method =
"proportion")
$strata.sample
[1]  2 15 57 72  6  1  8  1  0

$total.sample
[1] 162

The probability of sampling was proportional with the area (larger
polygons are more likely to be selected than smaller ones). So we will
use weights = I(1/Area), as you suggested.

Best regards,

Thierry

------------------------------------------------------------------------
----
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek
team Biometrie & Kwaliteitszorg
Gaverstraat 4
9500 Geraardsbergen
Belgium

Research Institute for Nature and Forest
team Biometrics & Quality Assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium

tel. + 32 54/436 185
[hidden email]
www.inbo.be

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey
 

> -----Oorspronkelijk bericht-----
> Van: Thomas Lumley [mailto:[hidden email]]
> Verzonden: woensdag 7 april 2010 18:51
> Aan: ONKELINX, Thierry
> CC: [hidden email]
> Onderwerp: Re: [R] Struggeling with svydesign()
>
> On Wed, 7 Apr 2010, ONKELINX, Thierry wrote:
>
> > Dear all,
> >
> > We are analysing some survey data and we are not sure if we
> are using
> > the correct syntax for our design.
> >
> > The population of interest is a set of 4416 polygons with different
> > sizes ranging from 0.003 to 45.6 ha, 7460 ha in total. Each polygon
> > has a binary attribute (presence/absence) and we want to
> estimate the
> > probability of presence in the population.
> >
> > We used sampling with replacement weighted by the area of
> the polygon.
> > The population was stratified using 2 variables: block and
> type. Each
> > of the 14 blocks is a 20 by 50 km geographical region. Type
> is a two
> > level factor. Not every level is present in each block.
> Each block has
> > a Status attribute with two levels: medium (9 blocks) or
> good (5 blocks).
> > Besides the overall ratio, we would like the estimate the ratio per
> > Status.
> > The samplesize per stratum was calculated with
> epi.stratasize() from
> > the epiR package. The population size in the 21 strata
> ranges from 1
> > to 1158. The sample size ranges from 0 in the blocks with very few
> > polygons (<20), 1 in blocks with a low number of polygon
> (20 - 50) and
> > up to 25 polygons in the largest strata.
>
> That sounds strange.  If you have a stratified sample and
> have set the sample size in some strata to be zero, you
> cannot possibly learn anything about those strata and so you
> can't get unbiased population estimates.   In order to get
> unbiased estimates and valid standard errors you need at
> least two samples per stratum.
>
> You're going to have to combine some of the strata so that
> each stratum has at least two observations.  Since your
> design only makes sense if you assume the small, unsampled,
> strata are similar to some of the larger strata, it should be
> possible for you to combine them.
>
>
> > Does the syntax below represents the data structure above? Any
> > comments are welcome.
> >
> > library(survey)
> > svydesign(
> > id = ~ 1, #no clustering
> > weights = ~ Area, #weighted by the area of the polygon
> > strata = ~ Status + Block + Type,
> > nest = TRUE
> > )
>
> You want strata = ~interaction(Block,Type,drop=TRUE), which
> specifies a single stage of sampling in which the strata are
> combinations of Block and Type.  The fact that you need
> drop=TRUE is a bug, which I will fix.
>
> > # Is Area a correct weighting factor? Or should we use the area
> > divided by the sum of the total area (per stratum?)
>
> It's not clear to me from your description whether the
> probability of sampling a particular region is proportional
> to its Area or inversely proportional to its Area.  If the
> probability is proportional to Area, the weight would be 1/Area
>
>   svydesign(
>   id = ~ 1, #no clustering
>   weights = ~ I(1/Area), #weighted by the area of the polygon
>   strata = ~ interaction(Block, Type,drop=TRUE),
>   nest = TRUE
>   )
>
>
> > # The code above runs. But when we omit "Status" from the
> strata, then
> > we get an error: "a stratum has only 1 PSU". Shouldn't we
> get the same
> > error with the code above?
> >
> > #with finity population correction
> > svydesign(
> > id = ~ 1, #no clustering
> > weights = ~ Area, #weighted by the area of the polygon
> > strata = ~ Status + Block + Type,
> > fpc ~ nStatus + nBlock + nType,
> > nest = TRUE
> > )
> > #We are not sure what to use for nStatus, nBlock and nType.
> Is it the
> > number of levels of that stratum (nStatus = 2)? The number
> of levels
> > in the stratum below (nStatus = length(unique(Block)) per level of
> > Status, nType = number of polygons per Status:Block:Type)?
> The total
> > number of polygons in that stratum?
>
> This is easier when you get the right strata.  There should
> be a single fpc variable, which should be equal to the number
> of polygons in the population for that stratum.
>
>
> > To call in the statistician after the experiment is done may be no
> > more than asking him to perform a post-mortem examination:
> he may be
> > able to say what the experiment died of.
> > ~ Sir Ronald Aylmer Fisher
>
> Indeed.
>
>
>       -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
> [hidden email] University of Washington, Seattle
>
>

Druk dit bericht a.u.b. niet onnodig af.
Please do not print this message unnecessarily.

Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver weer
en binden het INBO onder geen enkel beding, zolang dit bericht niet bevestigd is
door een geldig ondertekend document. The views expressed in  this message
and any annex are purely those of the writer and may not be regarded as stating
an official position of INBO, as long as the message is not confirmed by a duly
signed document.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Struggeling with svydesign()

Thomas Lumley
On Thu, 8 Apr 2010, ONKELINX, Thierry wrote:

> Dear Thomas,
>
> Thank you for your informative answer. We used epi.stratasize() to
> estimate the required sample size per stratum. Notice in the example
> below that it can select a sample size smaller than 2 in the very small
> strata. Would you recommend to sample at least two items per stratum or
> rather to merge some strata a priori until the sample size is at least
> 2?

Merging the strata would be best

> Or is there a better way to estimate the sample size per stratum?
> Note that the stratification only aims to get a good geographical
> coverage (the strata a geographical regions). We are not interested in
> estimates per stratum.
>
> library(epiR)
> N <- c(39, 270, 1060, 1336, 118, 26, 154, 10, 3)
> epi.stratasize(strata.n = N, strata.mean = 0.9, epsilon = 0.05, method =
> "proportion")
> $strata.sample
> [1]  2 15 57 72  6  1  8  1  0
>
> $total.sample
> [1] 162
>
> The probability of sampling was proportional with the area (larger
> polygons are more likely to be selected than smaller ones). So we will
> use weights = I(1/Area), as you suggested.

If you are using probability proportional to size and you want to use finite-population correctsions, you also need to specify the fpc= argument differently. The simplest version is an approximation that uses only the marginal sampling probabilities
   svydesign(id=~1, fpc=~p, pps="brewer", strata=~strat
where p is a variable with the actual sampling probability (not just proportional to sampling probability).

Also, how did you do the sampling?  It's quite hard to do unequal probability sampling without replacement (the R sample() function doesn't actually  do it, though the sampling package does).

     -thomas

Thomas Lumley Assoc. Professor, Biostatistics
[hidden email] University of Washington, Seattle

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.