Needing a better solution to a lookup problem.

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Needing a better solution to a lookup problem.

Davis, Brian
I have a solution (actually a few) to this problem, but none are computationally efficient enough to be useful.  I'm hoping someone can enlighten me to a better solution.

I have data frame of chromosome/position pairs (along with other data for the location).  For each pair I need to determine if it is with in a given data frame of ranges.  I need to keep only the pairs that are within any of the ranges for further processing.

Example:
snps<-NULL
snps$CHR<-c("1","2","2","3","X")
snps$POS<-as.integer(c(295,640,670,100,1100))
snps$DAT<-seq(1:length(snps$CHR))
snps<-as.data.frame(snps, stringsAsFactors=FALSE)

 snps
  CHR  POS DAT
1   1  295   1
2   2  640   2
3   2  670   3
4   3  100   4
5   X 1100   5

region<-NULL
region$CHR<-c("1","1","2","2","2","X")
region$START<-as.integer(c(10,210,430,650,810,1090))
region$STOP<-as.integer(c(100,350,630,675,850,1111))
region<-as.data.frame(region, stringsAsFactors=FALSE)

region
  CHR START STOP
1   1    10  100
2   1   210  350
3   2   430  630
4   2   650  675
5   2   810  850
6   X  1090 1111


The result I need would look like

Res

 CHR  POS DAT
   1  295   1
   2  670   3
   X 1100   5


I have a solution that works reasonably well on small sets, but my current data set is ~100K snp entries, and my regions table has ~200K entries. I have ~1500 files to go through

I haven't found a good way to efficiently solve this problem.  I've tried various versions of mapply/lapply, for loops, etc which get the answer for small sets but takes hours (per file) on my real data.  Bioconductor seemed like the obvious place to look, but my GoogleFu must not be that great.  I never found anything relevant.

Any ideas or points to the right direction would be greatly appreciated.



Brian Davis

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Needing a better solution to a lookup problem.

David Winsemius

On Mar 14, 2012, at 3:27 PM, Davis, Brian wrote:

> I have a solution (actually a few) to this problem, but none are  
> computationally efficient enough to be useful.  I'm hoping someone  
> can enlighten me to a better solution.
>
> I have data frame of chromosome/position pairs (along with other  
> data for the location).  For each pair I need to determine if it is  
> with in a given data frame of ranges.  I need to keep only the pairs  
> that are within any of the ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq(1:length(snps$CHR))
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
> snps
>  CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>  CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
> CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my  
> current data set is ~100K snp entries, and my regions table has  
> ~200K entries. I have ~1500 files to go through
>
> I haven't found a good way to efficiently solve this problem.  I've  
> tried various versions of mapply/lapply, for loops, etc which get  
> the answer for small sets but takes hours (per file) on my real  
> data.  Bioconductor seemed like the obvious place to look, but my  
> GoogleFu must not be that great.  I never found anything relevant.
>
> Any ideas or points to the right direction would be greatly  
> appreciated.

The usual BioC recommendation for this sort of problem is package  
IRanges. And that mailing list probably has many readers who have used  
that package, unkike this mailing list.

It purported to handle overlapping ranges as well as the non-
overlapping problem you pose.

http://www.googlesyndicatedsearch.com/u/newcastlemaths?q=+chromosome+position+iranges&sa=Google+Search

--

David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Needing a better solution to a lookup problem.

Mikhail Titov-2
In reply to this post by Davis, Brian
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
On

> Behalf Of Davis, Brian
> Sent: Wednesday, March 14, 2012 2:28 PM
> To: [hidden email]
> Subject: [R] Needing a better solution to a lookup problem.
>
> I have a solution (actually a few) to this problem, but none are
> computationally efficient enough to be useful.  I'm hoping someone can
> enlighten me to a better solution.
> ...
> I have a solution that works reasonably well on small sets, but my current
> data set is ~100K snp entries, and my regions table has ~200K entries. I
have
> ~1500 files to go through
>
> I haven't found a good way to efficiently solve this problem.  I've tried
> various versions of mapply/lapply, for loops, etc which get the answer for
> small sets but takes hours (per file) on my real data.  Bioconductor
seemed
> like the obvious place to look, but my GoogleFu must not be that great.  I
> never found anything relevant.
>
> Any ideas or points to the right direction would be greatly appreciated.

Consider using a database. For instance PostgreSQL can easily handle large
amount of data and can restrict data set to only those that are within a
certain subset. While it requires some DB & SQL knowledge, it will pay off.
And you can query your data right from DB using RODBC or something. Solve
this problem in DB and use R for further analysis.

Mikhail

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Needing a better solution to a lookup problem.

ilai-2
In reply to this post by Davis, Brian
You could try doing it without a loop (.C or other):

 (rgnsnp <- merge(region,snps))
 (rgnsnp[with(rgnsnp,STOP>=POS & POS >= START),])

Here is my test for merge+search on 100k/200k:

fdf1 <- data.frame(chr=1:100000,p=runif(100000),d=sample(100000))
fdf2 <- data.frame(chr=rep(1:100000,2),s=runif(200000),t=runif(200000))
system.time(with(FDF <- merge(fdf2,fdf1),FDF[s>=p & p >= t,]))
   user  system elapsed
  2.560   0.152   2.905

 Hope this helps
Elai

On Wed, Mar 14, 2012 at 1:27 PM, Davis, Brian <[hidden email]> wrote:

> I have a solution (actually a few) to this problem, but none are computationally efficient enough to be useful.  I'm hoping someone can enlighten me to a better solution.
>
> I have data frame of chromosome/position pairs (along with other data for the location).  For each pair I need to determine if it is with in a given data frame of ranges.  I need to keep only the pairs that are within any of the ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq(1:length(snps$CHR))
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
>  snps
>  CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>  CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
>  CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my current data set is ~100K snp entries, and my regions table has ~200K entries. I have ~1500 files to go through
>
> I haven't found a good way to efficiently solve this problem.  I've tried various versions of mapply/lapply, for loops, etc which get the answer for small sets but takes hours (per file) on my real data.  Bioconductor seemed like the obvious place to look, but my GoogleFu must not be that great.  I never found anything relevant.
>
> Any ideas or points to the right direction would be greatly appreciated.
>
>
>
> Brian Davis
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Needing a better solution to a lookup problem.

Davis, Brian
Thanks for the idea.  I think this example works fast mainly due to the limited number of matches.  For each fdf1$chr there are only 2 potential matches in fdf2.  In reality there are only 24 possible values for chr (1-22 and X, Y).  When I replace the chr seq with more realistic values, I run out of memory.  I'll try it out our server and let you know how it goes.  Fast and memory intensive will get me over the hump for now.

fdf1 <- data.frame(chr=sort(sample(seq(1:24), 100000, replace=TRUE)),p=runif(100000),d=sample(100000))
fdf2 <- data.frame(chr=sort(sample(seq(1:24), 200000, replace=TRUE)),s=runif(200000),t=runif(200000))
system.time(with(FDF <- merge(fdf2,fdf1),FDF[s>=p & p >= t,]))

Thanks again,

Brian

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of ilai
Sent: Wednesday, March 14, 2012 3:26 PM
To: Davis, Brian
Cc: [hidden email]
Subject: Re: [R] Needing a better solution to a lookup problem.

You could try doing it without a loop (.C or other):

 (rgnsnp <- merge(region,snps))
 (rgnsnp[with(rgnsnp,STOP>=POS & POS >= START),])

Here is my test for merge+search on 100k/200k:

fdf1 <- data.frame(chr=1:100000,p=runif(100000),d=sample(100000))
fdf2 <- data.frame(chr=rep(1:100000,2),s=runif(200000),t=runif(200000))
system.time(with(FDF <- merge(fdf2,fdf1),FDF[s>=p & p >= t,]))
   user  system elapsed
  2.560   0.152   2.905

 Hope this helps
Elai

On Wed, Mar 14, 2012 at 1:27 PM, Davis, Brian <[hidden email]> wrote:

> I have a solution (actually a few) to this problem, but none are computationally efficient enough to be useful.  I'm hoping someone can enlighten me to a better solution.
>
> I have data frame of chromosome/position pairs (along with other data for the location).  For each pair I need to determine if it is with in a given data frame of ranges.  I need to keep only the pairs that are within any of the ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq(1:length(snps$CHR))
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
>  snps
>  CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>  CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
>  CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my current data set is ~100K snp entries, and my regions table has ~200K entries. I have ~1500 files to go through
>
> I haven't found a good way to efficiently solve this problem.  I've tried various versions of mapply/lapply, for loops, etc which get the answer for small sets but takes hours (per file) on my real data.  Bioconductor seemed like the obvious place to look, but my GoogleFu must not be that great.  I never found anything relevant.
>
> Any ideas or points to the right direction would be greatly appreciated.
>
>
>
> Brian Davis
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Needing a better solution to a lookup problem.

Davis, Brian
In reply to this post by David Winsemius
Thanks for the point in the right direction.  I now have a great solution.

> library(GenomicRanges)
> system.time({
+ snplist<-with(snp, GRanges(CHR, IRanges(POS, POS)))
+ locations<-with(targets, GRanges(CHR, IRanges(START, STOP)))
+ olaps<-findOverlaps(snplist, locations)
+ })
   user  system elapsed
   0.70    0.00    0.71

Brian

-----Original Message-----
From: David Winsemius [mailto:[hidden email]]
Sent: Wednesday, March 14, 2012 2:44 PM
To: Davis, Brian
Cc: [hidden email]
Subject: Re: [R] Needing a better solution to a lookup problem.


On Mar 14, 2012, at 3:27 PM, Davis, Brian wrote:

> I have a solution (actually a few) to this problem, but none are
> computationally efficient enough to be useful.  I'm hoping someone can
> enlighten me to a better solution.
>
> I have data frame of chromosome/position pairs (along with other data
> for the location).  For each pair I need to determine if it is with in
> a given data frame of ranges.  I need to keep only the pairs that are
> within any of the ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq(1:length(snps$CHR))
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
> snps
>  CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>  CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
> CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my
> current data set is ~100K snp entries, and my regions table has ~200K
> entries. I have ~1500 files to go through
>
> I haven't found a good way to efficiently solve this problem.  I've
> tried various versions of mapply/lapply, for loops, etc which get the
> answer for small sets but takes hours (per file) on my real data.  
> Bioconductor seemed like the obvious place to look, but my GoogleFu
> must not be that great.  I never found anything relevant.
>
> Any ideas or points to the right direction would be greatly
> appreciated.

The usual BioC recommendation for this sort of problem is package IRanges. And that mailing list probably has many readers who have used that package, unkike this mailing list.

It purported to handle overlapping ranges as well as the non- overlapping problem you pose.

http://www.googlesyndicatedsearch.com/u/newcastlemaths?q=+chromosome+position+iranges&sa=Google+Search

--

David Winsemius, MD
West Hartford, CT

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.