Find the 50 highest values in a matrix

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Find the 50 highest values in a matrix

uschlecht
Hi,

I have a huge matrix (4000 * 2000 data points) and I would like to retrieve the coordinates (column and row) for the top 50 (or x) values. Some positions in the matrix have NA as a value. These should be discarded.

My current method is to replace all NAs by 0, then rank all the values and then extract the positions with the 50 highest ranks. It is very time-consuming!

Is there a simpler way to do this?

Thank you,
Ulrich
Reply | Threaded
Open this post in threaded view
|

Re: Find the 50 highest values in a matrix

Nikhil Kaza-2

Matrix is just a vector. So order should work
haven't verified the following code.

a <- matrix(rnorm(4000*2000), 4000, 2000)
b <- order(a, na.last=TRUE, decreasing=TRUE)[1:50]

use %% or %/% to get the row# and column #s


Nikhil Kaza
Asst. Professor,
City and Regional Planning
University of North Carolina

[hidden email]

On Jun 18, 2010, at 1:41 AM, uschlecht wrote:

>
> Hi,
>
> I have a huge matrix (4000 * 2000 data points) and I would like to  
> retrieve
> the coordinates (column and row) for the top 50 (or x) values. Some
> positions in the matrix have NA as a value. These should be discarded.
>
> My current method is to replace all NAs by 0, then rank all the  
> values and
> then extract the positions with the 50 highest ranks. It is very
> time-consuming!
>
> Is there a simpler way to do this?
>
> Thank you,
> Ulrich
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
b - b%%nrow(a)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Find the 50 highest values in a matrix

djmuseR
In reply to this post by uschlecht
Hi:

Here's a faked up example:

a <- matrix(rnorm(4000*2000), 4000, 2000)
# Generate some NAs in the matrix
nr <- sample(50, 1:4000)
nc <- sample(50, 1:2000)
a[nr, nc] <- NA

# convert to data frame:
b <- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000),
                          x = as.vector(a))
# relatively time consuming...about 13.5 s on my machine
bb <- b[rev(order(b$x, na.last = FALSE)), ]
> bb[1:10, ]
         row  col        x
691269  3269  173 5.103704
7815076 3076 1954 4.961544
4999621 3621 1250 4.953265
500469   469  126 4.937655
5878224 2224 1470 4.929150
4287270 3270 1072 4.913791
4442521 2521 1111 4.896869
4668867  867 1168 4.863504
5716575  575 1430 4.760778
3055274 3274  764 4.758995

HTH,
Dennis


On Thu, Jun 17, 2010 at 10:41 PM, uschlecht <[hidden email]>wrote:

>
> Hi,
>
> I have a huge matrix (4000 * 2000 data points) and I would like to retrieve
> the coordinates (column and row) for the top 50 (or x) values. Some
> positions in the matrix have NA as a value. These should be discarded.
>
> My current method is to replace all NAs by 0, then rank all the values and
> then extract the positions with the 50 highest ranks. It is very
> time-consuming!
>
> Is there a simpler way to do this?
>
> Thank you,
> Ulrich
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Find the 50 highest values in a matrix

Peter Ehlers

m <- matrix(round(rnorm(4000 * 2000), 4), nr = 4000)
is.na(m) <- sample(8e6, 1e6)

system.time(
   idx <- which(
     matrix(m %in% head(sort(m, TRUE), 50),
            nr = nrow(m)), arr.ind = TRUE))

#   user  system elapsed
#   3.12    0.19    3.18

   -Peter Ehlers


On 2010-06-18 5:13, Dennis Murphy wrote:

> Hi:
>
> Here's a faked up example:
>
> a<- matrix(rnorm(4000*2000), 4000, 2000)
> # Generate some NAs in the matrix
> nr<- sample(50, 1:4000)
> nc<- sample(50, 1:2000)
> a[nr, nc]<- NA
>
> # convert to data frame:
> b<- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000),
>                            x = as.vector(a))
> # relatively time consuming...about 13.5 s on my machine
> bb<- b[rev(order(b$x, na.last = FALSE)), ]
>> bb[1:10, ]
>           row  col        x
> 691269  3269  173 5.103704
> 7815076 3076 1954 4.961544
> 4999621 3621 1250 4.953265
> 500469   469  126 4.937655
> 5878224 2224 1470 4.929150
> 4287270 3270 1072 4.913791
> 4442521 2521 1111 4.896869
> 4668867  867 1168 4.863504
> 5716575  575 1430 4.760778
> 3055274 3274  764 4.758995
>
> HTH,
> Dennis
>
>
> On Thu, Jun 17, 2010 at 10:41 PM, uschlecht<[hidden email]>wrote:
>
>>
>> Hi,
>>
>> I have a huge matrix (4000 * 2000 data points) and I would like to retrieve
>> the coordinates (column and row) for the top 50 (or x) values. Some
>> positions in the matrix have NA as a value. These should be discarded.
>>
>> My current method is to replace all NAs by 0, then rank all the values and
>> then extract the positions with the 50 highest ranks. It is very
>> time-consuming!
>>
>> Is there a simpler way to do this?
>>
>> Thank you,
>> Ulrich
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> [hidden email] mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> [[alternative HTML version deleted]]
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Find the 50 highest values in a matrix

Henrik Bengtsson
You might also want to consider _partial sorting_ by using the
'partial' argument of sort(), especially when the number of data
points is really large.

Since argument 'decreasing=FALSE' is not supported when using
'partial', you have to flip it yourself by negating the values, e.g.

x <- rnorm(8e6);
is.na(x) <- sample(length(x), size=1e6);

n <- 50;
t1 <- system.time({
  x1 <- sort(x, decreasing=TRUE);
  x1h <- x1[1:n];
});

t2 <- system.time({
  x2 <- sort(-x, partial=n);
  x2h <- -sort(x2[1:n]);
});

stopifnot(identical(x2h, x1h));
print(t2/t1);
     user    system   elapsed
0.3076923 0.7777778 0.3491525

/Henrik

On Fri, Jun 18, 2010 at 1:20 PM, Peter Ehlers <[hidden email]> wrote:

>
> m <- matrix(round(rnorm(4000 * 2000), 4), nr = 4000)
> is.na(m) <- sample(8e6, 1e6)
>
> system.time(
>  idx <- which(
>    matrix(m %in% head(sort(m, TRUE), 50),
>           nr = nrow(m)), arr.ind = TRUE))
>
> #   user  system elapsed
> #   3.12    0.19    3.18
>
>  -Peter Ehlers
>
>
> On 2010-06-18 5:13, Dennis Murphy wrote:
>>
>> Hi:
>>
>> Here's a faked up example:
>>
>> a<- matrix(rnorm(4000*2000), 4000, 2000)
>> # Generate some NAs in the matrix
>> nr<- sample(50, 1:4000)
>> nc<- sample(50, 1:2000)
>> a[nr, nc]<- NA
>>
>> # convert to data frame:
>> b<- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000),
>>                           x = as.vector(a))
>> # relatively time consuming...about 13.5 s on my machine
>> bb<- b[rev(order(b$x, na.last = FALSE)), ]
>>>
>>> bb[1:10, ]
>>
>>          row  col        x
>> 691269  3269  173 5.103704
>> 7815076 3076 1954 4.961544
>> 4999621 3621 1250 4.953265
>> 500469   469  126 4.937655
>> 5878224 2224 1470 4.929150
>> 4287270 3270 1072 4.913791
>> 4442521 2521 1111 4.896869
>> 4668867  867 1168 4.863504
>> 5716575  575 1430 4.760778
>> 3055274 3274  764 4.758995
>>
>> HTH,
>> Dennis
>>
>>
>> On Thu, Jun 17, 2010 at 10:41 PM,
>> uschlecht<[hidden email]>wrote:
>>
>>>
>>> Hi,
>>>
>>> I have a huge matrix (4000 * 2000 data points) and I would like to
>>> retrieve
>>> the coordinates (column and row) for the top 50 (or x) values. Some
>>> positions in the matrix have NA as a value. These should be discarded.
>>>
>>> My current method is to replace all NAs by 0, then rank all the values
>>> and
>>> then extract the positions with the 50 highest ranks. It is very
>>> time-consuming!
>>>
>>> Is there a simpler way to do this?
>>>
>>> Thank you,
>>> Ulrich
>>>
>>> --
>>> View this message in context:
>>>
>>> http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> [hidden email] mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>        [[alternative HTML version deleted]]
>>
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Find the 50 highest values in a matrix

djmuseR
Hi:

>From what I can tell, Henrik efficiently finds the 50 largest values without
the matrix
indices and Peter efficiently finds the matrix indices without the
corresponding values.
Let's combine the two:

x <- rnorm(8e6)
is.na(x) <- sample(8e6, 1e6)
n <- 50
x1 <- sort(x, decreasing=TRUE)[1:n]
# Find the indices that match the 50 largest values
ix <- which(x %in% x1)
# Map these to rows and columns assuming that x is shaped
# into a 4000 * 2000 matrix columnwise:
ro <- ix %% 4000
ro <- ifelse(ro == 0L, 4000, ro)  # just in case we get row 4000
co <- 1 + ix %/% 4000
# Find the x values corresponding to the indices in ix, in their proper
order
x2 <- x[ix]
d <- data.frame(row = ro, col = co, x = x2)
d[rev(order(d$x)), ]    # sort in decreasing order

To make sure this works, set up
xm <- matrix(x, nrow = 4000)

In my test,
> head(d)     # before sorting
   row col        x
1  280  17 4.375632
2 1346 179 4.506947
3 2661 214 4.438979
4 1870 308 4.447505
5 1951 344 4.497253
6   12 365 4.465841
> xm[280, 17]
[1] 4.375632
> xm[1870, 308]
[1] 4.447505

Looks OK. Total time to run the entire block of code:
   user  system elapsed
   3.38    0.19    3.57

Now let's try it with Henrik's second method:

system.time({
x <- rnorm(8e6)
is.na(x) <- sample(8e6, 1e6)
n <- 50
x2 <-  sort(-x, partial = n)
x2h <-  -sort(x2[1:n])
ix <- which(x %in% x2h)
ro <- ix %% 4000
ro <- ifelse(ro == 0L, 4000, ro)
co <- 1 + ix %/% 4000
x3 <- x[ix]
d <- data.frame(row = ro, col = co, x = x3)
d[rev(order(d$x)), ]
})
   user  system elapsed
   2.24    0.17    2.42

Not bad for 8M observations :)

HTH,
Dennis

On Fri, Jun 18, 2010 at 4:39 AM, Henrik Bengtsson <[hidden email]>wrote:

> You might also want to consider _partial sorting_ by using the
> 'partial' argument of sort(), especially when the number of data
> points is really large.
>
> Since argument 'decreasing=FALSE' is not supported when using
> 'partial', you have to flip it yourself by negating the values, e.g.
>
> x <- rnorm(8e6);
> is.na(x) <- sample(length(x), size=1e6);
>
> n <- 50;
> t1 <- system.time({
>  x1 <- sort(x, decreasing=TRUE);
>  x1h <- x1[1:n];
> });
>
> t2 <- system.time({
>  x2 <- sort(-x, partial=n);
>  x2h <- -sort(x2[1:n]);
> });
>
> stopifnot(identical(x2h, x1h));
> print(t2/t1);
>     user    system   elapsed
> 0.3076923 0.7777778 0.3491525
>
> /Henrik
>
> On Fri, Jun 18, 2010 at 1:20 PM, Peter Ehlers <[hidden email]> wrote:
> >
> > m <- matrix(round(rnorm(4000 * 2000), 4), nr = 4000)
> > is.na(m) <- sample(8e6, 1e6)
> >
> > system.time(
> >  idx <- which(
> >    matrix(m %in% head(sort(m, TRUE), 50),
> >           nr = nrow(m)), arr.ind = TRUE))
> >
> > #   user  system elapsed
> > #   3.12    0.19    3.18
> >
> >  -Peter Ehlers
> >
> >
> > On 2010-06-18 5:13, Dennis Murphy wrote:
> >>
> >> Hi:
> >>
> >> Here's a faked up example:
> >>
> >> a<- matrix(rnorm(4000*2000), 4000, 2000)
> >> # Generate some NAs in the matrix
> >> nr<- sample(50, 1:4000)
> >> nc<- sample(50, 1:2000)
> >> a[nr, nc]<- NA
> >>
> >> # convert to data frame:
> >> b<- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000),
> >>                           x = as.vector(a))
> >> # relatively time consuming...about 13.5 s on my machine
> >> bb<- b[rev(order(b$x, na.last = FALSE)), ]
> >>>
> >>> bb[1:10, ]
> >>
> >>          row  col        x
> >> 691269  3269  173 5.103704
> >> 7815076 3076 1954 4.961544
> >> 4999621 3621 1250 4.953265
> >> 500469   469  126 4.937655
> >> 5878224 2224 1470 4.929150
> >> 4287270 3270 1072 4.913791
> >> 4442521 2521 1111 4.896869
> >> 4668867  867 1168 4.863504
> >> 5716575  575 1430 4.760778
> >> 3055274 3274  764 4.758995
> >>
> >> HTH,
> >> Dennis
> >>
> >>
> >> On Thu, Jun 17, 2010 at 10:41 PM,
> >> uschlecht<[hidden email]>wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>> I have a huge matrix (4000 * 2000 data points) and I would like to
> >>> retrieve
> >>> the coordinates (column and row) for the top 50 (or x) values. Some
> >>> positions in the matrix have NA as a value. These should be discarded.
> >>>
> >>> My current method is to replace all NAs by 0, then rank all the values
> >>> and
> >>> then extract the positions with the 50 highest ranks. It is very
> >>> time-consuming!
> >>>
> >>> Is there a simpler way to do this?
> >>>
> >>> Thank you,
> >>> Ulrich
> >>>
> >>> --
> >>> View this message in context:
> >>>
> >>>
> http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
> >>> Sent from the R help mailing list archive at Nabble.com.
> >>>
> >>> ______________________________________________
> >>> [hidden email] mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >
> > ______________________________________________
> > [hidden email] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.