# vectorization with subset?


## vectorization with subset?

Hello,

I have a data frame (68,000 rows) of scores (V4) for a series of [genomic] coordinate ranges (V2 to V3).

```
> head(scores)
    V1      V2      V3       V4
1 chr1 2037651 2037700 1.474269
2 chr1 2037659 2037708 1.021012
3 chr1 2037677 2037726 1.180993
4 chr1 2037685 2037734 1.717131
5 chr1 2037703 2037752 2.361985
6 chr1 2037715 2037764 1.257013
```

I also have a data frame (1.2 million rows) of single [genomic] coordinates.

```
> head(coord, n = 20)
     V1      V2
1  chr1 2037652
2  chr1 2037653
3  chr1 2037654
4  chr1 2037655
5  chr1 2037656
6  chr1 2037657
7  chr1 2037658
8  chr1 2037659
9  chr1 2037660
10 chr1 2037661
11 chr1 2037662
12 chr1 2037663
13 chr1 2037664
14 chr1 2037665
15 chr1 2037666
16 chr1 2037667
17 chr1 2037668
18 chr1 2037669
19 chr1 2037670
20 chr1 2037671
```

For each genomic coordinate (in coord), I would like to determine the average of all scores whose genomic ranges (in scores) encompass the coordinate (in coord). To accomplish this, I tried:

```
> for (i in 1:nrow(coord)) {
+   range_scores <- subset(scores, scores$V1 == coord$V1[i] &
+                                  scores$V2 <= coord$V2[i] &
+                                  scores$V3 >= coord$V2[i])
+   coord$V3[i] <- mean(range_scores$V4)
+ }
> head(coord, n = 20)
     V1      V2       V3
1  chr1 2037652 1.474269
2  chr1 2037653 1.474269
3  chr1 2037654 1.474269
4  chr1 2037655 1.474269
5  chr1 2037656 1.474269
6  chr1 2037657 1.474269
7  chr1 2037658 1.474269
8  chr1 2037659 1.247641
9  chr1 2037660 1.247641
10 chr1 2037661 1.247641
11 chr1 2037662 1.247641
12 chr1 2037663 1.247641
13 chr1 2037664 1.247641
14 chr1 2037665 1.247641
15 chr1 2037666 1.247641
16 chr1 2037667 1.247641
17 chr1 2037668 1.247641
18 chr1 2037669 1.247641
19 chr1 2037670 1.247641
20 chr1 2037671 1.247641
```

The loop works, but is extremely slow. It would take about 4 days for this to finish for a single data set, and I have 64 data sets.

Why does the rate at which coordinate averages are calculated increase when coord is smaller, but not when scores is smaller?

How can I accomplish the same thing more efficiently?

Thanks,
Dan

## Re: vectorization with subset?

On Jul 2, 2012, at 12:15 PM, dlv04c wrote:

> Hello,
>
> I have a data frame (68,000 rows) of scores (V4) for a series of [genomic] coordinates ranges (V2 to V3).
>
> I also have a data frame (1.2 million rows) of single [genomic] coordinates.
>
> For each genomic coordinate (in coord), I would like to determine the average of all scores whose genomic ranges (in scores) encompass the coordinate (in coord). To accomplish this, I tried:
>
> The function works, but is extremely slow.
>
> It would take about 4 days for this to finish for a single data set, and I have 64 data sets.
>
> Why does the rate at which coordinate averages are calculated increase when coord is smaller, but not when scores is smaller?
>
> How can I accomplish the same thing more efficiently?

You probably need to start by reading the vignettes for the IRanges package. It's difficult to be sure since you did not show the code for what you were doing currently.

--
David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
The code is in the original post, but here it is again:

`for(i in 1:nrow(coord)){range_scores<-subset(scores,scores$V1 == coord$V1[i] & scores$V2 <= coord$V2[i] & scores$V3 >= coord$V2[i]);coord$V3[i]<-mean(range_scores$V4)}`

Thanks,
Dan
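
As an alternative to the IRanges route suggested above, here is a minimal base-R sketch of a vectorized version of the loop. It is not code from the thread: the helper name `mean_covering_scores` is hypothetical, and it assumes every range has V2 <= V3 and that the column names match the data frames shown in the original post. The idea is that the ranges covering a position are those that have started (V2 <= position) minus those that have already ended (V3 < position), so two sorted lookups with `findInterval` plus cumulative sums can replace the per-row `subset` call.

```r
# Hypothetical helper (not from the thread): for each position in coord,
# return the mean V4 of all ranges in scores that contain it.
# Assumes V2 <= V3 for every row of scores.
mean_covering_scores <- function(scores, coord) {
  out <- rep(NaN, nrow(coord))  # NaN matches mean(numeric(0)) in the loop
  for (chr in unique(as.character(coord$V1))) {
    pos_idx <- which(coord$V1 == chr)
    s <- scores[scores$V1 == chr, , drop = FALSE]
    if (nrow(s) == 0) next
    pos <- coord$V2[pos_idx]

    by_start <- order(s$V2)
    by_end <- order(s$V3)

    # Number of ranges with start <= pos, and with end < pos.
    n_started <- findInterval(pos, s$V2[by_start])
    n_ended <- findInterval(pos, s$V3[by_end], left.open = TRUE)

    # Running score totals in start order and in end order;
    # the leading 0 handles positions where the count is zero.
    sum_started <- c(0, cumsum(s$V4[by_start]))[n_started + 1]
    sum_ended <- c(0, cumsum(s$V4[by_end]))[n_ended + 1]

    n_cover <- n_started - n_ended
    out[pos_idx] <- ifelse(n_cover > 0,
                           (sum_started - sum_ended) / n_cover,
                           NaN)
  }
  out
}

# The sample rows shown in the original post.
scores <- data.frame(
  V1 = "chr1",
  V2 = c(2037651, 2037659, 2037677, 2037685, 2037703, 2037715),
  V3 = c(2037700, 2037708, 2037726, 2037734, 2037752, 2037764),
  V4 = c(1.474269, 1.021012, 1.180993, 1.717131, 2.361985, 1.257013)
)
coord <- data.frame(V1 = "chr1", V2 = 2037652:2037671)

coord$V3 <- mean_covering_scores(scores, coord)
print(head(coord, n = 20))
```

On the sample rows this should reproduce the averages shown in the original post. Because the sums come from differences of cumulative totals, results can differ from `mean()` in the last few decimal places on large inputs. The sketch also assigns `coord$V3` once instead of element by element; assigning into a data frame column inside a loop can copy the data frame on each iteration, which may be part of why the loop's speed depended on the size of coord rather than scores.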