A distance measure between top-k lists

sayan dasgupta
Hi folks,
Here is the problem, with an example. I want a measure of similarity or
dissimilarity between the rankings that two judges give to the students of
one class (of size 50, say). If I observed the ranks of all 50 students I
could use a rank correlation measure, but what I have instead is two lists
of the top 20 students, one chosen by each judge.
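
To make the difficulty concrete, here is a small sketch with made-up top-3 lists (the data and the names judge1, judge2 and both are purely illustrative). Merging the two lists leaves missing ranks for every student who appears in only one list, so a rank correlation could only be computed on the overlap.

```r
# Toy top-3 lists from two judges (hypothetical data, not the real lists)
judge1 <- data.frame(name = c("A", "B", "C"), rank = 1:3)
judge2 <- data.frame(name = c("A", "C", "D"), rank = 1:3)

# Outer join on student name; the rank columns become rank.x and rank.y
both <- merge(judge1, judge2, by = "name", all = TRUE)
print(both)

# Students ranked by both judges; B and D each have one NA rank
n_complete <- sum(complete.cases(both))
```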

The following paper proposes several measures for this kind of problem:
www.almaden.ibm.com/cs/people/fagin/topk.pdf
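
As a rough summary, the Kendall distance with penalty parameter p from that paper can be written as below. The notation is my paraphrase and should be checked against the paper: tau_1 and tau_2 are the two top-k lists, D_1 and D_2 are the sets of items they contain, and the sum runs over unordered pairs from the union.

```latex
K^{(p)}(\tau_1,\tau_2) \;=\; \sum_{\{i,j\}\subseteq D_1\cup D_2} \bar K^{(p)}_{i,j}(\tau_1,\tau_2),
\qquad 0 \le p \le 1,

\bar K^{(p)}_{i,j}(\tau_1,\tau_2) =
\begin{cases}
0 & i,j \text{ in both lists, same order in both}\\
1 & i,j \text{ in both lists, opposite order}\\
0 & i,j \text{ in one list, only } i \text{ in the other, } i \text{ ahead of } j\\
1 & i,j \text{ in one list, only } i \text{ in the other, } j \text{ ahead of } i\\
1 & i \text{ only in one list, } j \text{ only in the other}\\
p & i,j \text{ in one list, neither in the other}
\end{cases}
```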

I have written code for the Kendall distance measure, with a penalty
parameter p.

Here is the code:

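topklist <- function(df1, df2, matchby = "name", rankby = "pat", p = 0.5,
                     normalize = TRUE) {
  library(gtools)  # provides combinations()
  # Rank each list separately: the largest 'rankby' value gets rank 1
  df1$rank <- rank(-df1[, rankby], ties.method = "first")
  df2$rank <- rank(-df2[, rankby], ties.method = "first")
  # Outer join: an item missing from a list gets NA for that list's rank
  dftmp <- merge(df1, df2, by = matchby, all = TRUE)
  rownames(dftmp) <- dftmp[, matchby]
  # All unordered pairs of items in the union of the two lists
  df <- combinations(length(dftmp[, matchby]), 2,
                     as.character(dftmp[, matchby]))
  # Penalty for a single pair x = c(item1, item2)
  concor <- function(x, dftmp, p) {
    # Largest rank in either list (also correct if the list sizes differ)
    n <- max(dftmp$rank.x, dftmp$rank.y, na.rm = TRUE)
    x <- dftmp[c(x[1], x[2]), c("rank.x", "rank.y")]
    if (all(!is.na(x$rank.x)) && all(is.na(x$rank.y))) {
      a <- p  # both items appear only in the first list
    } else if (all(is.na(x$rank.x)) && all(!is.na(x$rank.y))) {
      a <- p  # both items appear only in the second list
    } else {
      x[is.na(x)] <- n + 1  # a missing item ranks below every listed item
      a <- 1
      if ((x$rank.x[1] > x$rank.x[2] && x$rank.y[1] > x$rank.y[2]) ||
          (x$rank.x[1] < x$rank.x[2] && x$rank.y[1] < x$rank.y[2])) {
        a <- 0  # the pair is ordered the same way in both lists
      }
    }
    a
  }
  corr <- sum(apply(df, 1, function(x) concor(x, dftmp, p)))
  if (normalize) {
    # Largest possible value, reached when the two lists are disjoint
    dn <- p * choose(nrow(df1), 2) +
      p * choose(nrow(df2), 2) +
      nrow(df1) * nrow(df2)
    corr <- corr / dn
  }
  corr
}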
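
As a possible cross-check, here is a minimal base-R sketch of the same penalty scheme that works directly on two character vectors ordered from best to worst. The function name kendall_topk and its arguments are hypothetical and not part of the code above. It assumes no duplicates within a list and at least two distinct items overall.

```r
# Hypothetical helper: Kendall distance with penalty p between two top-k
# lists given as character vectors ordered from best to worst.
kendall_topk <- function(a, b, p = 0.5, normalize = TRUE) {
  items <- union(a, b)
  ra <- match(items, a)  # rank in list a, NA if absent
  rb <- match(items, b)  # rank in list b, NA if absent
  pairs <- combn(length(items), 2)  # index pairs, one pair per column
  i <- pairs[1, ]
  j <- pairs[2, ]
  # Pairs where both items sit in one list and neither is in the other
  only_a <- !is.na(ra[i]) & !is.na(ra[j]) & is.na(rb[i]) & is.na(rb[j])
  only_b <- is.na(ra[i]) & is.na(ra[j]) & !is.na(rb[i]) & !is.na(rb[j])
  # Otherwise a missing item is treated as ranked below every listed item
  big <- length(items) + 1
  ra[is.na(ra)] <- big
  rb[is.na(rb)] <- big
  # Penalty 1 unless the pair is strictly ordered the same way in both lists
  discordant <- (ra[i] - ra[j]) * (rb[i] - rb[j]) <= 0
  penalty <- ifelse(only_a | only_b, p, as.numeric(discordant))
  d <- sum(penalty)
  if (normalize) {
    # Divide by the value obtained for two disjoint lists
    d <- d / (p * choose(length(a), 2) + p * choose(length(b), 2) +
                length(a) * length(b))
  }
  d
}

# Small hand-checkable case: only the pairs (B, C) and (B, D) are penalised
kendall_topk(c("A", "B", "C"), c("A", "C", "D"), normalize = FALSE)
```

Taking a list from best to worst rather than a data frame keeps the sketch short; for the data frames in this post one could pass the name column after sorting by the ranking variable.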

Here is a sample use:
df1 <- structure(list(name = structure(c(21L, 12L, 3L, 16L, 15L, 5L,
8L, 23L, 7L, 18L, 4L, 17L, 2L, 6L, 22L, 20L, 10L, 1L, 19L, 14L
), .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
"K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W",
"X"), class = "factor"), pat = c(2051.55, 679.2, 502.77, 408.14,
278.62, 236.05, 232.44, 215.65, 202.92, 180.13, 172.82, 166.69,
152.82, 150.69, 130.69, 127.81, 121.59, 120.59, 120.42, 120.17
)), .Names = c("name", "pat"), row.names = c(NA, -20L), class =
"data.frame")

df2 <- structure(list(name = structure(c(21L, 12L, 16L, 7L, 5L, 3L,
8L, 4L, 23L, 9L, 15L, 10L, 17L, 14L, 11L, 22L, 24L, 20L, 1L,
13L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I",
"J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V",
"W", "X"), class = "factor"), pat = c(1604.25, 690.97, 463.64,
285.23, 280.3, 274.66, 261.84, 251.88, 234.94, 210.12, 202.89,
200.89, 185.43, 167.56, 161.1, 161.1, 155.47, 150.22, 121.19,
115.93)), .Names = c("name", "pat"), row.names = c(NA, -20L), class =
"data.frame")

Now we compute the distance between the two lists:
topklist(df1,df2,matchby="name",rankby="pat",p=0.5)

Note that the measure gives 0 for two identical lists:
topklist(df1,df1,matchby="name",rankby="pat",p=0.5)
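
At the other extreme, two disjoint lists of length k should give the largest unnormalized value, which is the denominator used when normalize=TRUE. A short worked calculation, assuming both lists have the same length k:

```latex
K^{(p)}_{\max} \;=\; p\binom{k}{2} + p\binom{k}{2} + k\cdot k
\;=\; 2p\binom{k}{2} + k^{2},
\qquad
k = 20,\ p = 0.5:\quad 2(0.5)(190) + 400 = 590 .
```

So with the defaults used here the normalized distance should lie between 0 (identical lists) and 1 (disjoint lists).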


So what do you think of this approach?

Thanks and regards,
Sayan Dasgupta
