Sampling problems


Sampling problems

Oritteropus
Hi,
I need to randomly sample my dataset 1000 times. Each sample needs to be 80% of the data. I know how to do that; my problem is that I need not only the 80% but also the corresponding 20% each time. Is there any way to do that?
Alternatively, I was thinking of something like the setdiff() function to compare my 80% sample with the original dataset and obtain the corresponding 20%. Unfortunately setdiff() works only for vectors; do you know a similar function for data frames?
Thanks

Re: Sampling problems

Sarah Goslee
You could make a vector containing the number of TRUE values that
makes up 80% of your data, and the number of FALSE values that makes
up 20% of your data. Use sample() to reorder it, then use it to divide
your dataset.
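
A minimal sketch of the approach described above, assuming a made-up data frame `dat` in place of the real data (the names `dat`, `keep`, `train` and `test` are hypothetical):

```r
set.seed(1)
dat <- data.frame(id = 1:50, value = rnorm(50))   # hypothetical stand-in data

n      <- nrow(dat)
n_keep <- round(0.8 * n)

# Logical vector with n_keep TRUEs and (n - n_keep) FALSEs, in random order
keep <- sample(rep(c(TRUE, FALSE), times = c(n_keep, n - n_keep)))

train <- dat[keep, ]    # the 80%
test  <- dat[!keep, ]   # the corresponding 20%
```

Because `keep` and `!keep` are complementary, every row lands in exactly one of the two pieces.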

If you had provided a reproducible example, I could write you code.

Sarah

On Wed, Mar 7, 2012 at 11:41 AM, Oritteropus <[hidden email]> wrote:

> Hi,
> I need to randomly sample my dataset 1000 times. Each sample needs to be
> 80% of the data. I know how to do that; my problem is that I need not only
> the 80% but also the corresponding 20% each time. Is there any way to do
> that?
> Alternatively, I was thinking of something like the setdiff() function to
> compare my 80% sample with the original dataset and obtain the corresponding
> 20%. Unfortunately setdiff() works only for vectors; do you know a similar
> function for data frames?
> Thanks
>

--
Sarah Goslee
http://www.functionaldiversity.org

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Sampling problems

Petr Savicky
In reply to this post by Oritteropus
On Wed, Mar 07, 2012 at 08:41:35AM -0800, Oritteropus wrote:
> Hi,
> I need to randomly sample my dataset 1000 times. Each sample needs to be
> 80% of the data. I know how to do that; my problem is that I need not only
> the 80% but also the corresponding 20% each time. Is there any way to do
> that?

Hi.

If you use sample() to get the 80% and store the indices, you
can also get the remaining cases with negative indexing:

  a <- matrix(1:30, ncol=3)
  i <- sample(10, 8)
  a[sort(i), ]

       [,1] [,2] [,3]
  [1,]    1   11   21
  [2,]    2   12   22
  [3,]    3   13   23
  [4,]    4   14   24
  [5,]    6   16   26
  [6,]    7   17   27
  [7,]    8   18   28
  [8,]   10   20   30

  a[-i, ]

       [,1] [,2] [,3]
  [1,]    5   15   25
  [2,]    9   19   29

Hope this helps.

Petr Savicky.


Re: Sampling problems

David Winsemius
In reply to this post by Oritteropus

On Mar 7, 2012, at 11:41 AM, Oritteropus wrote:

> Hi,
> I need to randomly sample my dataset 1000 times. Each sample needs to be
> 80% of the data. I know how to do that; my problem is that I need not only
> the 80% but also the corresponding 20% each time. Is there any way to do
> that?
> Alternatively, I was thinking of something like the setdiff() function to
> compare my 80% sample with the original dataset and obtain the corresponding
> 20%. Unfortunately setdiff() works only for vectors; do you know a similar
> function for data frames?

Create an index vector with runif or sample, then use it to get
your sample and use negative indexing to get the remainder.

idx <- sample(1:1000, 800)  # 800 row numbers, assuming x has 1000 rows
x[ idx, ]  # 80%
x[ -idx, ] # the other 20%

(I think this does presume you have not mucked with the default  
rownames.)
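
A sketch of how this could be repeated 1000 times, as the original question asks, assuming a made-up data frame `x` with 100 rows (the names `splits`, `n_rep` and `n_keep` are hypothetical). Integer indices select rows by position, so the negative index returns exactly the rows that were not drawn.

```r
set.seed(42)
x <- data.frame(id = 1:100, y = rnorm(100))   # hypothetical stand-in data

n_rep  <- 1000
n_keep <- round(0.8 * nrow(x))

# One list element per repetition, each holding the 80% and its 20%
splits <- lapply(seq_len(n_rep), function(k) {
  idx <- sample(nrow(x), n_keep)
  list(sample = x[idx, ], remainder = x[-idx, ])
})

nrow(splits[[1]]$sample)      # rows in the first 80% piece
nrow(splits[[1]]$remainder)   # rows in the matching 20% piece
```

Keeping each pair together in one list element makes it hard to mix up which 20% belongs to which 80%.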


--

David Winsemius, MD
West Hartford, CT


Re: Sampling problems

Oritteropus
Hi, thank you. It works for vectors and matrices but not for data frames; it gives me this error message:

MeanA <- read.csv("MeanAmf.csv",header=T)
mysample <- MeanA[sample(1:nrow(MeanA), 20, replace=FALSE),]
remainder<-MeanA[-mysample]
Error in `[.default`(MeanA, -mysample) : invalid subscript type 'list'
In Ops.factor(left) : - not meaningful for factors

Any other way?

Re: Sampling problems

Oritteropus
In reply to this post by Sarah Goslee
Hi Sarah, it is not clear to me how to do that. Can you show me, please?

Imagine I have a situation like this:

MeanA <- read.csv("MeanAmf.csv",header=T)
mysample <- MeanA[sample(1:nrow(MeanA), 20, replace=FALSE),]

Then?

Re: Sampling problems

PIKAL Petr
In reply to this post by Oritteropus
Hi

I have only a faint idea what your problem was, as there is no context in your
message, but maybe

remainder<-MeanA[-mysample, ]

could work.

Regards
Petr


Re: Sampling problems

PIKAL Petr
In reply to this post by Oritteropus
>
> Hi, thank you. It works for vectors and matrices but not for data frames;
> it gives me this error message:
>
> MeanA <- read.csv("MeanAmf.csv",header=T)
> mysample <- MeanA[sample(1:nrow(MeanA), 20, replace=FALSE),]

Well, maybe a slight correction:

mysample <- sample(1:nrow(MeanA), 20, replace=FALSE)  # row numbers only
chosen.one<-MeanA[mysample,]   # the sampled rows
remainder<-MeanA[-mysample,]   # all other rows
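
The difference from the failing code is that the stored object holds row numbers rather than rows, so it can be negated. A self-contained sketch, assuming a small made-up data frame in place of MeanAmf.csv (the 75 rows, the column contents and the name `rows` are hypothetical):

```r
set.seed(123)
MeanA <- data.frame(
  HRkm = rnorm(75),
  Loc  = factor(sample(c("A", "T"), 75, replace = TRUE))
)

rows       <- sample(nrow(MeanA), 20)   # integer row numbers
chosen.one <- MeanA[rows, ]             # data frame holding the 20 sampled rows
remainder  <- MeanA[-rows, ]            # data frame holding the other 55 rows

# chosen.one is a data frame, not a set of row numbers,
# so it cannot itself be used as a negative subscript.
```

Both pieces are ordinary data frames with all columns, factors included.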

Regards
Petr

> remainder<-MeanA[-mysample]
> Error in `[.default`(MeanA, -mysample) : invalid subscript type 'list'
> In Ops.factor(left) : - not meaningful for factors
>
> Any other way?

Re: Sampling problems

Oritteropus
In reply to this post by PIKAL Petr
Thanks, but it doesn't work either; it gives me the same error message.
It works only if my first sample is taken this way:

mysample <- sample(1:nrow(MeanA), 20, replace=FALSE)

However, this way it samples only the row numbers:
 [1] 71 24 12 36  2 39 69 62 43 38  9 44 13 54 50 63 67 66 37 28

but not the data inside.  I need to sample this way:

mysample <- MeanA[sample(1:nrow(MeanA), 20, replace=FALSE),]

to get a sample like this

HRkm   Mean.mf  Mean.mfm  Loc  Diet  Terr  Soc  Type  Soc.Ter  W.cat.0.25  W.cat.0.5
-2.49  -0.43    2.57      A    O     T     S    D     TS       b
23     -2.05    0.67      T    C     N     S    D     NS       A

This is an example of my dataframe

Re: Sampling problems

Michael Weylandt
Please use dput() to give a reproducible example: I can make this work
on a data frame quite easily --

x <- data.frame(1:10, letters[1:10], rnorm(10))
str(x)
print(x)
x[sample(nrow(x), 5), ]

So it's not a problem with something being a data frame or having factors.
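
If the sample has already been drawn as a data frame, its row names still identify the original rows, so setdiff() on the row names can recover the rest. A sketch with a made-up data frame (the names `samp` and `rest` and the column names are hypothetical), assuming sampling without replacement:

```r
set.seed(1)
x <- data.frame(id = 1:10, letter = letters[1:10], value = rnorm(10))

samp <- x[sample(nrow(x), 8), ]                    # sample taken as a data frame
rest <- x[setdiff(rownames(x), rownames(samp)), ]  # rows whose names are not in samp
```

Storing the indices and using negative indexing is usually simpler; this variant is only useful when the indices were not kept.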

Michael

On Thu, Mar 8, 2012 at 5:16 AM, Oritteropus <[hidden email]> wrote:

> Thanks, but it doesn't work either; it gives me the same error message.
> It works only if my first sample is taken this way:
>
> mysample <- sample(1:nrow(MeanA), 20, replace=FALSE)
>
> However, this way it samples only the row numbers:
>  [1] 71 24 12 36  2 39 69 62 43 38  9 44 13 54 50 63 67 66 37 28
>
> but not the data inside.  I need to sample this way:
>
> mysample <- MeanA[sample(1:nrow(MeanA), 20, replace=FALSE),]
>
> to get a sample like this
>
> HRkm   Mean.mf  Mean.mfm  Loc  Diet  Terr  Soc  Type  Soc.Ter  W.cat.0.25  W.cat.0.5
> -2.49  -0.43    2.57      A    O     T     S    D     TS       b
> 23     -2.05    0.67      T    C     N     S    D     NS       A
>
> This is an example of my dataframe

Simple solution

Oritteropus
In reply to this post by Oritteropus
Hi everybody,
Thank you all for your suggestions; you have been very helpful.
However, in the end I solved it this way:

mysample <- MaxDH[sample(1:nrow(MaxDH), 150, replace=FALSE),]
A<-mysample[1:120,]
B<-mysample[121:150,]
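
The same idea can be written without hard-coded sizes. This is a sketch only, assuming a made-up data frame `dat` standing in for MaxDH and an 80/20 split of all its rows (the names `dat`, `shuffled` and `n_a` are hypothetical):

```r
set.seed(7)
dat <- data.frame(id = 1:150, y = runif(150))   # hypothetical stand-in for MaxDH

shuffled <- dat[sample(nrow(dat)), ]            # all rows in random order
n_a      <- round(0.8 * nrow(shuffled))         # size of the 80% piece

A <- shuffled[seq_len(n_a), ]                   # first 80% of the shuffled rows
B <- shuffled[(n_a + 1):nrow(shuffled), ]       # remaining 20%
```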

So simple in the end...

Best,

Luca