Deleting duplicate rows in a matrix at random


jeff.m.ewers
Hello,

I am relatively new to R, and I've run into a problem formatting my data for input into the package RankAggreg.

I have a matrix of gene titles and P-values (weights) in two columns:

KCTD12 4.06904E-22
UNC93A 9.91852E-22
CDKN3 1.24695E-21
CLEC2B 4.71759E-21
DAB2 1.12062E-20
HSPB1 1.23125E-20
...

The data contains many, many duplicate gene titles, and I need to remove all but one of each, which must be chosen at random. I have looked for quite some time, and I've been unable to find a way to do this. Any help would be greatly appreciated!

Thanks,

Jeff
Re: Deleting duplicate rows in a matrix at random

Magnus Thor Torfason
 > I need to remove all but one of each [row in a matrix], which
 > must be chosen at random.

This request (included in full at the bottom) has been unanswered for a
while, but I had the same problem and ended up writing a function to
solve it. I call it "duplicated.random()", and it does exactly the same
thing as "duplicated()", except that the choice of which of the
duplicated observations gets a FALSE in the result is random rather
than always being the first. There is no way to specify selection
probabilities; each duplicated observation is equally likely to be
chosen.

The implementation permutes the original using "sample()", then runs
"duplicated()" on the permuted data, and finally reverses the
permutation on the result. So the randomization should have properties
similar to those of sample(), probably including reproducibility by
setting the random seed (although I haven't tested that explicitly).
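
The reproducibility point can be checked directly. The following is a minimal sketch of the vector case only; "random.dups" is a hypothetical stand-in name used for illustration, not the full function posted in this message:

```r
# Hypothetical compact version of the vector case, for illustration only
random.dups <- function(x) {
  perm     <- sample.int(length(x))   # random permutation of the indices
  dup.perm <- duplicated(x[perm])     # first occurrence in shuffled order is kept
  dup.perm[order(perm)]               # order(perm) is the inverse permutation
}

x <- c("a", "b", "a", "c", "b", "a")

set.seed(123); r1 <- random.dups(x)
set.seed(123); r2 <- random.dups(x)

identical(r1, r2)                     # same seed, same choice: TRUE
sum(!r1) == length(unique(x))         # one survivor per distinct value: TRUE
```

Since the only source of randomness is the call that draws the permutation, fixing the seed with set.seed() beforehand fixes which duplicate is kept.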

The function and some test code are included below. It handles vectors
and matrices for now, but adding other data structures that are handled
correctly by duplicated() should be a simple matter of ensuring that the
indexing is handled correctly in the permutation process. If anyone
makes any improvements to the function, I'd be grateful to be notified.

#############################################################

# This function returns a logical vector, the elements of which
# are FALSE, unless there are duplicated values in x, in which
# case all but one element is TRUE (for each set of duplicates).
# The only difference between this function and the duplicated()
# function is that rather than always returning FALSE for the first
# instance of a duplicated value, the choice of instance is random.
duplicated.random = function(x, incomparables = FALSE, ...)
{
     if ( is.vector(x) )
     {
         # sample.int() returns integer(0) for an empty x
         permutation = sample.int(length(x))
         x.perm      = x[permutation]
         result.perm = duplicated(x.perm, incomparables, ...)
         # order(permutation) is the inverse of the permutation
         result      = result.perm[order(permutation)]
         return(result)
     }
     else if ( is.matrix(x) )
     {
         permutation = sample.int(nrow(x))
         # drop = FALSE keeps a one-row matrix from collapsing to a vector
         x.perm      = x[permutation, , drop = FALSE]
         result.perm = duplicated(x.perm, incomparables, ...)
         result      = result.perm[order(permutation)]
         return(result)
     }
     else
     {
         stop(paste("duplicated.random() only supports vectors",
                "and matrices for now."))
     }
}

#############################################################

# Test code for vector case
x = sample(1:5, 10, replace = TRUE)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)
x[!d]
x[!r]

# Test code for matrix case
x = matrix(sample(1:2, 30, replace = TRUE), ncol = 3)
d = duplicated(x)
r = duplicated.random(x)
cbind(x,d,r)

#############################################################
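
For the gene list in the original question, duplicates are defined by the gene column alone, so the same permute / duplicated() / un-permute steps can be applied to that column and the result used to subset rows. This is a minimal sketch, assuming the data sit in a data frame; the names "genes", "perm" and "dup", and the repeated gene rows with their values, are made up for illustration:

```r
# Toy data in the shape described in the question; the repeated
# genes and their values are invented for illustration.
genes <- data.frame(
  gene   = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "KCTD12"),
  pvalue = c(4.1e-22, 9.9e-22, 1.2e-21, 4.7e-21, 1.1e-20, 1.2e-20),
  stringsAsFactors = FALSE
)

set.seed(2010)                            # optional, for a repeatable choice
perm     <- sample.int(nrow(genes))       # random row order
dup.perm <- duplicated(genes$gene[perm])  # duplicates in shuffled order
dup      <- dup.perm[order(perm)]         # map back to original row order

deduped <- genes[!dup, ]                  # one randomly chosen row per gene
deduped
```

With the function above, the equivalent call would be genes[!duplicated.random(genes$gene), ].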


On 3/24/2010 11:44 AM, jeff.m.ewers wrote:

>
> Hello,
>
> I am relatively new to R, and I've run into a problem formatting my data for
> input into the package RankAggreg.
>
> I have a matrix of gene titles and P-values (weights) in two columns:
>
> KCTD12 4.06904E-22
> UNC93A 9.91852E-22
> CDKN3 1.24695E-21
> CLEC2B 4.71759E-21
> DAB2 1.12062E-20
> HSPB1 1.23125E-20
> ...
>
> The data contains many, many duplicate gene titles, and I need to remove all
> but one of each, which must be chosen at random. I have looked for quite
> some time, and I've been unable to find a way to do this. Any help would be
> greatly appreciated!
>
> Thanks,
>
> Jeff

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.