Quantcast

Memoize and vectorize a custom function

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Memoize and vectorize a custom function

Kamil Slowikowski
My goal is simple: calcuate GC content of each sequence in a list of
nucleotide
sequences. I have figured out how to vectorize, but all my attempts at
memoization failed.

Can you show me how to properly memoize my function?

There is a StackOverflow post on the subject of memoization, but it does not
help me:
http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r

I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions.

Here's my data:

    seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")

Some sequences are missing, so they're blank `""`.

I have a function for calculating GC content:

    > GC <- function(s) {
        if (!is.character(s)) return(NA)
        n <- nchar(s)
        if (n == 0) return(NA)
        m <- gregexpr('[GCSgcs]', s)[[1]]
        if (m[1] < 1) return(0)
        return(100.0 * length(m) / n)
    }

It works:

    > GC('')
    [1] NA
    > GC('G')
    [1] 100
    > GC('GAG')
    [1] 66.66667
    > sapply(seqs, GC)
                      G         C       CCC         T               TTCCT
                C
           NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
     NA 100.00000
          CTC
     66.66667

I want to memoize it. Then, I want to vectorize it. Should be easy, right?

Apparently, I must have the wrong mindset for using the `memoise` or
`R.cache`
R packages:

    > system.time(dummy <- sapply(rep(seqs,100), GC))
       user  system elapsed
      0.044   0.000   0.054
    >
    > library(memoise)
    > GCm1 <- memoise(GC)
    > system.time(dummy <- sapply(rep(seqs,100), GCm1))
       user  system elapsed
      0.164   0.000   0.173
    >
    > library(R.cache)
    > GCm2 <- addMemoization(GC)
    > system.time(dummy <- sapply(rep(seqs,100), GCm2))
       user  system elapsed
     10.601   0.252  10.926

Notice that the memoized functions are several orders of magnitude slower.

I tried the `hash` package, but things seem to be happening behind the
scenes
and I don't understand the output:

    > cache <- hash()
    > GCc <- function(s) {
        if (!is.character(s) || nchar(s) == 0) {
            return(NA)
        }
        if(exists(s, cache)) {
            return(cache[[s]])
        }
        result <- GC(s)
        cache[[s]] <<- result
        return(result)
    }
    > sapply(seqs,GCc)
    [[1]]
    [1] NA

    $G
    [1] 100

    $C
    NULL

    $CCC
    [1] 100

    $T
    NULL

    [[6]]
    [1] NA

    $TTCCT
    [1] 40

    [[8]]
    [1] NA

    $C
    NULL

    $CTC
    [1] 66.66667

At least I figured out how to vectorize:

    > GCv <- Vectorize(GC)
    > GCv(seqs)
                      G         C       CCC         T               TTCCT
                C
      0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
0.00000 100.00000
          CTC
     66.66667

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Memoize and vectorize a custom function

Martin Morgan
On 04/26/2012 03:21 PM, Kamil Slowikowski wrote:

> My goal is simple: calcuate GC content of each sequence in a list of
> nucleotide
> sequences. I have figured out how to vectorize, but all my attempts at
> memoization failed.
>
> Can you show me how to properly memoize my function?
>
> There is a StackOverflow post on the subject of memoization, but it does not
> help me:
> http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
>
> I haven't been able to find any other discussions on this subject. Searching
> for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
> for those keywords at http://r-project.markmail.org/ does not return helpful
> discussions.
>
> Here's my data:
>
>      seqs<- c("","G","C","CCC","T","","TTCCT","","C","CTC")
>
> Some sequences are missing, so they're blank `""`.
>
> I have a function for calculating GC content:
>
>      >  GC<- function(s) {
>          if (!is.character(s)) return(NA)
>          n<- nchar(s)
>          if (n == 0) return(NA)
>          m<- gregexpr('[GCSgcs]', s)[[1]]
>          if (m[1]<  1) return(0)
>          return(100.0 * length(m) / n)
>      }
>
> It works:
>
>      >  GC('')
>      [1] NA
>      >  GC('G')
>      [1] 100
>      >  GC('GAG')
>      [1] 66.66667
>      >  sapply(seqs, GC)
>                        G         C       CCC         T               TTCCT
>                  C
>             NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
>       NA 100.00000
>            CTC
>       66.66667
>
> I want to memoize it. Then, I want to vectorize it. Should be easy, right?

Hi Kamil --

Not really an answer to your question, but looking at

   http://bioconductor.org/packages/2.10/bioc/html/Biostrings.html

will tell you to install Biostrings with

   source("http://bioconductor.org/biocLite.R")
   biocLite("Biostrings")

and then

   library(Biostrings)
   dna = DNAStringSet(c("","G","C","CCC","T","","TTCCT","","C","CTC"))
   alf = alphabetFrequency(dna, as.prob=TRUE, baseOnly=TRUE)
   rowSums(alf[,c("G", "C")])

will give you GC content of each string.

 > rowSums(alf[,c("G", "C")])
  [1]       NaN 1.0000000 1.0000000 1.0000000 0.0000000       NaN 0.4000000
  [8]       NaN 1.0000000 0.6666667

this will be fast and scalable; Biostrings and other Bioconductor
(http://bioconductor.org) packages have many useful functions for
working with DNA.

See the Bioconductor mailing list for more help if this is a promising
direction.

   http://bioconductor.org/help/mailing-list/

Martin

>
> Apparently, I must have the wrong mindset for using the `memoise` or
> `R.cache`
> R packages:
>
>      >  system.time(dummy<- sapply(rep(seqs,100), GC))
>         user  system elapsed
>        0.044   0.000   0.054
>      >
>      >  library(memoise)
>      >  GCm1<- memoise(GC)
>      >  system.time(dummy<- sapply(rep(seqs,100), GCm1))
>         user  system elapsed
>        0.164   0.000   0.173
>      >
>      >  library(R.cache)
>      >  GCm2<- addMemoization(GC)
>      >  system.time(dummy<- sapply(rep(seqs,100), GCm2))
>         user  system elapsed
>       10.601   0.252  10.926
>
> Notice that the memoized functions are several orders of magnitude slower.
>
> I tried the `hash` package, but things seem to be happening behind the
> scenes
> and I don't understand the output:
>
>      >  cache<- hash()
>      >  GCc<- function(s) {
>          if (!is.character(s) || nchar(s) == 0) {
>              return(NA)
>          }
>          if(exists(s, cache)) {
>              return(cache[[s]])
>          }
>          result<- GC(s)
>          cache[[s]]<<- result
>          return(result)
>      }
>      >  sapply(seqs,GCc)
>      [[1]]
>      [1] NA
>
>      $G
>      [1] 100
>
>      $C
>      NULL
>
>      $CCC
>      [1] 100
>
>      $T
>      NULL
>
>      [[6]]
>      [1] NA
>
>      $TTCCT
>      [1] 40
>
>      [[8]]
>      [1] NA
>
>      $C
>      NULL
>
>      $CTC
>      [1] 66.66667
>
> At least I figured out how to vectorize:
>
>      >  GCv<- Vectorize(GC)
>      >  GCv(seqs)
>                        G         C       CCC         T               TTCCT
>                  C
>        0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
> 0.00000 100.00000
>            CTC
>       66.66667
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Memoize and vectorize a custom function

Henrik Bengtsson-3
In reply to this post by Kamil Slowikowski
On Thu, Apr 26, 2012 at 3:21 PM, Kamil Slowikowski
<[hidden email]> wrote:

> My goal is simple: calcuate GC content of each sequence in a list of
> nucleotide
> sequences. I have figured out how to vectorize, but all my attempts at
> memoization failed.
>
> Can you show me how to properly memoize my function?
>
> There is a StackOverflow post on the subject of memoization, but it does not
> help me:
> http://stackoverflow.com/questions/7262485/options-for-caching-memoization-hashing-in-r
>
> I haven't been able to find any other discussions on this subject. Searching
> for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
> for those keywords at http://r-project.markmail.org/ does not return helpful
> discussions.
>
> Here's my data:
>
>    seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
>
> Some sequences are missing, so they're blank `""`.
>
> I have a function for calculating GC content:
>
>    > GC <- function(s) {
>        if (!is.character(s)) return(NA)
>        n <- nchar(s)
>        if (n == 0) return(NA)
>        m <- gregexpr('[GCSgcs]', s)[[1]]
>        if (m[1] < 1) return(0)
>        return(100.0 * length(m) / n)
>    }
>
> It works:
>
>    > GC('')
>    [1] NA
>    > GC('G')
>    [1] 100
>    > GC('GAG')
>    [1] 66.66667
>    > sapply(seqs, GC)
>                      G         C       CCC         T               TTCCT
>                C
>           NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000
>     NA 100.00000
>          CTC
>     66.66667
>
> I want to memoize it. Then, I want to vectorize it. Should be easy, right?
>
> Apparently, I must have the wrong mindset for using the `memoise` or
> `R.cache`
> R packages:
>
>    > system.time(dummy <- sapply(rep(seqs,100), GC))
>       user  system elapsed
>      0.044   0.000   0.054
>    >
>    > library(memoise)
>    > GCm1 <- memoise(GC)
>    > system.time(dummy <- sapply(rep(seqs,100), GCm1))
>       user  system elapsed
>      0.164   0.000   0.173
>    >
>    > library(R.cache)
>    > GCm2 <- addMemoization(GC)
>    > system.time(dummy <- sapply(rep(seqs,100), GCm2))
>       user  system elapsed
>     10.601   0.252  10.926
>
> Notice that the memoized functions are several orders of magnitude slower.

About R.cache: All memoization by R.cache is currently done toward the
file system.  In other words, it is designed for larger objects (so
you cannot hold all of the cache in memory) and more computationally
expensive tasks.

/Henrik

>
> I tried the `hash` package, but things seem to be happening behind the
> scenes
> and I don't understand the output:
>
>    > cache <- hash()
>    > GCc <- function(s) {
>        if (!is.character(s) || nchar(s) == 0) {
>            return(NA)
>        }
>        if(exists(s, cache)) {
>            return(cache[[s]])
>        }
>        result <- GC(s)
>        cache[[s]] <<- result
>        return(result)
>    }
>    > sapply(seqs,GCc)
>    [[1]]
>    [1] NA
>
>    $G
>    [1] 100
>
>    $C
>    NULL
>
>    $CCC
>    [1] 100
>
>    $T
>    NULL
>
>    [[6]]
>    [1] NA
>
>    $TTCCT
>    [1] 40
>
>    [[8]]
>    [1] NA
>
>    $C
>    NULL
>
>    $CTC
>    [1] 66.66667
>
> At least I figured out how to vectorize:
>
>    > GCv <- Vectorize(GC)
>    > GCv(seqs)
>                      G         C       CCC         T               TTCCT
>                C
>      0.00000 100.00000 100.00000 100.00000   0.00000   0.00000  40.00000
> 0.00000 100.00000
>          CTC
>     66.66667
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Loading...