Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges


何尧
I have a set of genes (about 50,000) from the whole genome, read in as genomic ranges.

Each range (gene) can be seen as an observation with three columns: chromosome, start and end. For example:

       seqnames start end width strand
gene1     chr1     1   5     5      +
gene2     chr1    10  15     6      +
gene3     chr1    12  17     6      +
gene4     chr1    20  25     6      +
gene5     chr1    30  40    11      +

I am wondering whether there is an efficient way to find the overlapping, upstream and downstream genes for each gene in the GRanges object.

For example, assuming all_genes_gr is a GRanges object of about 50,000 genes, the result I want looks like this:

gene_name  upstream_gene  downstream_gene  overlapped_gene
gene1      NA             gene2            NA
gene2      gene1          gene4            gene3
gene3      gene1          gene4            gene2
gene4      gene3          gene5            NA

Currently, the strategy I use is this:

library(GenomicRanges)

# For the gene at position idx, look for overlaps with all other genes.
find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  # number of other genes that overlap the current gene
  n <- countOverlaps(curr_gene, other_genes)
  # the current gene, kept only if it overlaps at least one other gene
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}

system.time(lapply(1:100, function(idx) find_overlapped_gene(idx, all_genes_gr)))

However, for 100 genes this takes nearly 8 s according to system.time(). That means that with 50,000 genes it would take about an hour (roughly 4,000 s) just to find the overlapping genes.

I am wondering whether there is an algorithm or strategy to do this efficiently, perhaps 50,000 genes in about 10 minutes or even less.

 




______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges

David Winsemius

> On Apr 5, 2016, at 10:27 AM, 何尧 <[hidden email]> wrote:
>
> I am wondering whether there is an efficient way to find the overlapping, upstream and downstream genes for each gene in the GRanges object.

The data.table package (on CRAN) and the IRanges package (on Bioconductor) both provide formalized, efficient approaches to these problems.
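
As a minimal sketch of what a vectorized approach could look like, here is one possibility using GenomicRanges (which builds on IRanges), applied to the five toy genes from the question. The object names `hits`, `ov_by_gene` and `result` are made up for illustration, and joining several overlapping genes with a comma is an assumption about the desired output, not something the original post specifies.

```r
library(GenomicRanges)

# Toy data copied from the example in the question
all_genes_gr <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(1, 10, 12, 20, 30),
                     end   = c(5, 15, 17, 25, 40)),
  strand   = "+"
)
names(all_genes_gr) <- paste0("gene", 1:5)

# One call finds all pairwise overlaps; then drop each gene's hit against itself
hits <- findOverlaps(all_genes_gr, all_genes_gr)
hits <- hits[queryHits(hits) != subjectHits(hits)]

# Collect the names of overlapping genes per query gene (NA if there are none)
ov_by_gene <- split(names(all_genes_gr)[subjectHits(hits)],
                    factor(queryHits(hits), levels = seq_along(all_genes_gr)))
overlapped <- vapply(
  ov_by_gene,
  function(x) if (length(x)) paste(x, collapse = ",") else NA_character_,
  character(1)
)

# follow():  nearest non-overlapping range that each gene comes after (upstream)
# precede(): nearest non-overlapping range that each gene comes before (downstream)
# Both are strand-aware and return NA where no such range exists.
up   <- follow(all_genes_gr, all_genes_gr)
down <- precede(all_genes_gr, all_genes_gr)

result <- data.frame(
  gene_name        = names(all_genes_gr),
  upstream_gene    = names(all_genes_gr)[up],
  downstream_gene  = names(all_genes_gr)[down],
  overlapped_gene  = unname(overlapped),
  stringsAsFactors = FALSE
)
print(result)
```

Because every step operates on the whole GRanges object at once, there is no per-gene loop in R; the same calls apply unchanged to a larger object.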


> I am wondering whether there is an algorithm or strategy to do this efficiently, perhaps 50,000 genes in about 10 minutes or even less.
>
I suspect this would run much faster than that for such a small dataset.
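
One way to check this on your own machine is to time the vectorized calls on synthetic data of the same size. The sketch below is only an illustration: the 50,000 random ranges, the chromosome names, the widths and the object names (`gr`, `timing`) are all assumptions, not the poster's data, and no particular timing is claimed.

```r
library(GenomicRanges)

set.seed(1)
n <- 50000  # roughly the number of genes mentioned in the question

# Hypothetical random "genes" spread over 22 chromosomes
gr <- GRanges(
  seqnames = sample(paste0("chr", 1:22), n, replace = TRUE),
  ranges   = IRanges(start = sample.int(1e8, n),
                     width = sample(1000:50000, n, replace = TRUE)),
  strand   = sample(c("+", "-"), n, replace = TRUE)
)

timing <- system.time({
  # all pairwise overlaps in one call, minus self-hits
  hits <- findOverlaps(gr, gr)
  hits <- hits[queryHits(hits) != subjectHits(hits)]
  # nearest upstream and downstream neighbours (strand-aware)
  up   <- follow(gr, gr)
  down <- precede(gr, gr)
})
print(timing)
```

Comparing the printed elapsed time with the per-gene lapply() approach on the same object should show whether the loop itself is the bottleneck.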

--
David.




David Winsemius
Alameda, CA, USA

Re: Is there an efficient way to find the overlapping, upstream and downstream ranges for a bunch of ranges

Michael Lawrence-3
In reply to this post by 何尧
For the sake of posterity, this question was asked and answered here:
https://support.bioconductor.org/p/80448

