> I need to remove all but one of each [row in a matrix], which
> must be chosen at random.

This request (included in full at the bottom) has been unanswered for a
while, but I had the same problem and ended up writing a function to
solve it. I call it "duplicated.random()", and it does exactly the same
thing as the "duplicated()" function, except that the choice of which of
the duplicated observations gets a FALSE in the result is random rather
than always being the first. There is no way to specify selection
probabilities; each duplicated observation is equally likely to be
chosen.

The implementation works by permuting the original using "sample()",
then running "duplicated()", and finally reversing the permutation on
the result. So the randomization should have properties similar to those
of sample(), probably including reproducibility by setting the random
seed (although I haven't tested that explicitly).
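
The permute, flag, un-permute idea can be tried in isolation. The following is only a minimal sketch on a made-up vector; the variable names (x, p, inv, r) and the seed value are chosen for illustration. It relies on the fact that order(p) is the inverse of a permutation p, and it also checks that re-seeding reproduces the same permutation.

```r
set.seed(1)                      # arbitrary seed, only so the run is repeatable
x <- c("a", "b", "a", "c", "b")  # made-up data for illustration

p   <- sample.int(length(x))     # random permutation of the positions
inv <- order(p)                  # inverse permutation: p[inv] is 1, 2, ..., n

# Applying the permutation and then its inverse recovers the original.
stopifnot(identical(x[p][inv], x))

# Run duplicated() on the shuffled data, then map the flags back
# to the original positions.
r <- duplicated(x[p])[inv]

# Exactly one element of each distinct value is kept (flagged FALSE).
stopifnot(identical(sort(x[!r]), sort(unique(x))))

# Re-seeding reproduces the same permutation, hence the same choice.
set.seed(1)
p2 <- sample.int(length(x))
stopifnot(identical(p, p2))
```

Which "a" and which "b" survive depends on the seed, but the set of kept values is always the set of distinct values.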

The function and some test code are included below. It handles vectors
and matrices for now, but adding other data structures that are handled
correctly by duplicated() should be a simple matter of ensuring that the
indexing is handled correctly in the permutation process. If anyone
makes any improvements to the function, I'd be grateful to be notified.
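
As an illustration of the kind of extension described above, a data frame variant might look like the following. This is a sketch only: the helper name duplicated_random_df and the sample data are made up here, and it assumes duplicates are defined over whole rows, which is how duplicated() treats data frames.

```r
# Hypothetical helper, not part of the original function: the same
# permute / duplicated() / un-permute steps, indexed by data frame rows.
duplicated_random_df <- function(x, ...) {
  stopifnot(is.data.frame(x))
  permutation <- sample.int(nrow(x))
  # drop = FALSE keeps a one-column data frame from collapsing to a vector
  x.perm <- x[permutation, , drop = FALSE]
  result.perm <- duplicated(x.perm, ...)
  # order(permutation) is the inverse permutation
  result.perm[order(permutation)]
}

# Made-up data: rows 1, 3, 4 are identical, as are rows 2 and 5.
df <- data.frame(gene  = c("A", "B", "A", "A", "B"),
                 score = c(1, 2, 1, 1, 2))
r <- duplicated_random_df(df)
df[!r, ]   # one randomly chosen row per distinct (gene, score) pair
```

The only change from the matrix case is the row indexing with drop = FALSE; the rest of the logic carries over unchanged.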

#############################################################

# This function returns a logical vector, the elements of which
# are FALSE, unless there are duplicated values in x, in which
# case all but one element of each set of duplicates is TRUE.
# The only difference between this function and the duplicated()
# function is that rather than always returning FALSE for the first
# instance of a duplicated value, the choice of instance is random.

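duplicated.random = function(x, incomparables = FALSE, ...)
{
    if ( is.vector(x) )
    {
        # sample.int(n) is what sample(n) calls for n >= 1, but it also
        # copes with n = 0 (sample(0) returns 0, not an empty vector)
        permutation = sample.int(length(x))
        x.perm = x[permutation]
        result.perm = duplicated(x.perm, incomparables, ...)
        # order(permutation) is the inverse permutation
        result = result.perm[order(permutation)]
        return(result)
    }
    else if ( is.matrix(x) )
    {
        permutation = sample.int(nrow(x))
        # drop = FALSE keeps a single-row or single-column matrix a matrix
        x.perm = x[permutation, , drop = FALSE]
        result.perm = duplicated(x.perm, incomparables, ...)
        result = result.perm[order(permutation)]
        return(result)
    }
    else
    {
        stop(paste("duplicated.random() only supports vectors",
                   "and matrices for now."))
    }
}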

#############################################################

# Test code for vector case
x = sample(1:5, 10, replace = TRUE)
d = duplicated(x)
r = duplicated.random(x)
cbind(x, d, r)
x[!d]
x[!r]

# Test code for matrix case
x = matrix(sample(1:2, 30, replace = TRUE), ncol = 3)
d = duplicated(x)
r = duplicated.random(x)
cbind(x, d, r)

#############################################################
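
For the original request quoted at the bottom, duplicates are defined by the gene title alone, so the same steps can be applied to that one column and the resulting logical vector used to subset the rows. A minimal sketch, assuming the data sit in a data frame: the column names gene and p.value, the repeated titles, and the p-values are all made up for illustration.

```r
# Made-up data: gene titles with duplicates and invented p-values.
genes <- data.frame(
  gene    = c("KCTD12", "UNC93A", "KCTD12", "CDKN3", "UNC93A", "KCTD12"),
  p.value = c(4e-22, 9e-22, 1e-21, 5e-21, 1e-20, 2e-20),
  stringsAsFactors = FALSE
)

# Shuffle the row positions, flag duplicated gene titles in the
# shuffled order, then map the flags back to the original row order.
permutation <- sample.int(nrow(genes))
dup <- duplicated(genes$gene[permutation])[order(permutation)]

# Keep the rows that were not flagged: one random row per gene title.
deduped <- genes[!dup, ]
deduped
```

Rows come back in their original relative order; only the choice of which duplicate survives is random.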

On 3/24/2010 11:44 AM, jeff.m.ewers wrote:

>
> Hello,
>
> I am relatively new to R, and I've run into a problem formatting my data for
> input into the package RankAggreg.
>
> I have a matrix of gene titles and P-values (weights) in two columns:
>
> KCTD12 4.06904E-22
> UNC93A 9.91852E-22
> CDKN3 1.24695E-21
> CLEC2B 4.71759E-21
> DAB2 1.12062E-20
> HSPB1 1.23125E-20
> ...
>
> The data contains many, many duplicate gene titles, and I need to remove all
> but one of each, which must be chosen at random. I have looked for quite
> some time, and I've been unable to find a way to do this. Any help would be
> greatly appreciated!
>
> Thanks,
>
> Jeff
