vectorization with subset?

dlv04c
Hello,

I have a data frame (68,000 rows) of scores (V4) for a series of [genomic] coordinate ranges (V2 to V3).

###################################
> head(scores)
    V1      V2      V3       V4
1 chr1 2037651 2037700 1.474269
2 chr1 2037659 2037708 1.021012
3 chr1 2037677 2037726 1.180993
4 chr1 2037685 2037734 1.717131
5 chr1 2037703 2037752 2.361985
6 chr1 2037715 2037764 1.257013
###################################

I also have a data frame (1.2 million rows) of single [genomic] coordinates.  

###################################
> head(coord,n=20)
     V1      V2
1  chr1 2037652
2  chr1 2037653
3  chr1 2037654
4  chr1 2037655
5  chr1 2037656
6  chr1 2037657
7  chr1 2037658
8  chr1 2037659
9  chr1 2037660
10 chr1 2037661
11 chr1 2037662
12 chr1 2037663
13 chr1 2037664
14 chr1 2037665
15 chr1 2037666
16 chr1 2037667
17 chr1 2037668
18 chr1 2037669
19 chr1 2037670
20 chr1 2037671
###################################

For each genomic coordinate (in coord), I would like to determine the average of all scores whose genomic ranges (in scores) encompass the coordinate (in coord). To accomplish this, I tried:

###################################
> for (i in 1:nrow(coord)) {
+   range_scores <- subset(scores, scores$V1 == coord$V1[i] &
+                          scores$V2 <= coord$V2[i] & scores$V3 >= coord$V2[i])
+   coord$V3[i] <- mean(range_scores$V4)
+ }

> head(coord,n=20)
     V1      V2       V3
1  chr1 2037652 1.474269
2  chr1 2037653 1.474269
3  chr1 2037654 1.474269
4  chr1 2037655 1.474269
5  chr1 2037656 1.474269
6  chr1 2037657 1.474269
7  chr1 2037658 1.474269
8  chr1 2037659 1.247641
9  chr1 2037660 1.247641
10 chr1 2037661 1.247641
11 chr1 2037662 1.247641
12 chr1 2037663 1.247641
13 chr1 2037664 1.247641
14 chr1 2037665 1.247641
15 chr1 2037666 1.247641
16 chr1 2037667 1.247641
17 chr1 2037668 1.247641
18 chr1 2037669 1.247641
19 chr1 2037670 1.247641
20 chr1 2037671 1.247641

###################################

The function works, but is extremely slow.

It would take about 4 days for this to finish for a single data set, and I have 64 data sets.

Why does the rate at which coordinate averages are calculated increase when coord is smaller, but not when scores is smaller?

How can I accomplish the same thing more efficiently?

Thanks,

Dan
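
One possible explanation for the scaling question, offered as an assumption rather than something confirmed in this thread: assigning into coord$V3[i] inside the loop modifies the data frame on every iteration, which can trigger copying whose cost grows with the size of coord, and subset() adds data-frame overhead of its own. The sketch below keeps the same per-coordinate logic but works on plain vectors and writes the result back once. The helper name loop_means and the demo_ objects are hypothetical; the values are the rows printed in the post.

```r
# Hypothetical helper: same logic as the loop in the post, but on plain
# vectors, with a single assignment back into the data frame at the end.
loop_means <- function(coord, scores) {
  s_chr   <- as.character(scores$V1)
  s_start <- scores$V2
  s_end   <- scores$V3
  s_val   <- scores$V4
  c_chr   <- as.character(coord$V1)
  c_pos   <- coord$V2

  avg <- numeric(nrow(coord))          # preallocated result vector
  for (i in seq_along(avg)) {
    # ranges on the same chromosome that contain coordinate i
    hit <- s_chr == c_chr[i] & s_start <= c_pos[i] & s_end >= c_pos[i]
    avg[i] <- mean(s_val[hit])         # NaN when no range matches
  }
  avg
}

# Demo data copied from the head() output in the post.
demo_scores <- data.frame(
  V1 = "chr1",
  V2 = c(2037651, 2037659, 2037677, 2037685, 2037703, 2037715),
  V3 = c(2037700, 2037708, 2037726, 2037734, 2037752, 2037764),
  V4 = c(1.474269, 1.021012, 1.180993, 1.717131, 2.361985, 1.257013)
)
demo_coord <- data.frame(V1 = "chr1", V2 = 2037652:2037671)

demo_coord$V3 <- loop_means(demo_coord, demo_scores)
```

This is still one full pass over scores per coordinate, so it only removes the per-iteration data-frame overhead; it does not change the overall amount of comparison work.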

Re: vectorization with subset?

David Winsemius

On Jul 2, 2012, at 12:15 PM, dlv04c wrote:

> Hello,
>
> I have a data frame (68,000 rows) of scores (V4) for a series of
> [genomic] coordinate ranges (V2 to V3).
>
> I also have a data frame (1.2 million rows) of single [genomic]
> coordinates.
>
> For each genomic coordinate (in coord), I would like to determine the
> average of all scores whose genomic ranges (in scores) encompass the
> coordinate (in coord). To accomplish this, I tried:
>
> The function works, but is extremely slow.
>
> It would take about 4 days for this to finish for a single data set,
> and I have 64 data sets.
>
> Why does the rate at which coordinate averages are calculated
> increase when coord is smaller, but not when scores is smaller?
>
> How can I accomplish the same thing more efficiently?

You probably need to start by reading the vignettes for the IRanges
package. It's difficult to be sure, since you did not show the code for
what you are currently doing.

--

David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
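
To illustrate the kind of interval-overlap computation the reply above points toward, here is a base-R sketch that avoids the per-coordinate loop. It is not the IRanges API and not code from this thread. It relies on one observation: every range that ends before a coordinate x also starts before x, so the ranges covering x are those with start <= x minus those with end < x. Both groups can be counted and summed with findInterval() and cumsum(). The helper name mean_covering_scores and the demo_ objects are hypothetical, and the sketch assumes start <= end in every row of scores.

```r
# Hypothetical helper: mean score of all ranges covering each coordinate.
# Assumes scores has columns V1 (chromosome), V2 (start), V3 (end), V4 (score)
# with V2 <= V3, and coord has columns V1 (chromosome) and V2 (position).
mean_covering_scores <- function(coord, scores) {
  out <- rep(NaN, nrow(coord))
  for (chr in unique(as.character(coord$V1))) {
    ci <- which(coord$V1 == chr)
    s  <- scores[scores$V1 == chr, ]
    if (nrow(s) == 0) next               # no ranges on this chromosome

    by_start <- order(s$V2)
    by_end   <- order(s$V3)
    starts   <- s$V2[by_start]
    ends     <- s$V3[by_end]
    # running score totals, with a leading 0 for "no ranges yet"
    cum_start <- c(0, cumsum(s$V4[by_start]))
    cum_end   <- c(0, cumsum(s$V4[by_end]))

    x <- coord$V2[ci]
    n_started <- findInterval(x, starts)                 # ranges with start <= x
    n_ended   <- findInterval(x, ends, left.open = TRUE) # ranges with end < x

    count <- n_started - n_ended
    total <- cum_start[n_started + 1] - cum_end[n_ended + 1]
    # NaN where nothing covers x, matching mean() of an empty vector
    out[ci] <- ifelse(count > 0, total / count, NaN)
  }
  out
}

# Demo data copied from the head() output in the original post.
demo_scores <- data.frame(
  V1 = "chr1",
  V2 = c(2037651, 2037659, 2037677, 2037685, 2037703, 2037715),
  V3 = c(2037700, 2037708, 2037726, 2037734, 2037752, 2037764),
  V4 = c(1.474269, 1.021012, 1.180993, 1.717131, 2.361985, 1.257013)
)
demo_coord <- data.frame(V1 = "chr1", V2 = 2037652:2037671)

demo_means <- mean_covering_scores(demo_coord, demo_scores)
```

The only loop left runs once per chromosome, so the work is dominated by sorting scores and by the binary searches in findInterval(). Because the sums come from differences of running totals, results can differ from mean() in the last few digits.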

Re: vectorization with subset?

dlv04c
The code is in the original post, but here it is again:

> for (i in 1:nrow(coord)) {
+   range_scores <- subset(scores, scores$V1 == coord$V1[i] &
+                          scores$V2 <= coord$V2[i] & scores$V3 >= coord$V2[i])
+   coord$V3[i] <- mean(range_scores$V4)
+ }

Thanks,

Dan

Re: vectorization with subset?

David Winsemius

On Jul 2, 2012, at 5:16 PM, dlv04c wrote:

> The code is in the original post, but here it is again:
>

No code here or in original posting to rhelp. You are under the  
delusion that Nabble is R-help. It is not.

> --
> View this message in context: http://r.789695.n4.nabble.com/vectorization-with-subset-tp4635156p4635208.html
> Sent from the R help mailing list archive at Nabble.com.

This is the rhelp mailing list. Not a website.

--

David Winsemius, MD
West Hartford, CT
