Maximum number of patterns and speed in grep

mdvaan
Hi,

I am using R's grep function to find patterns in vectors of strings. The number of patterns I would like to match is 7,700 (of different sizes). I noticed that I get an error message when I do the following:

data <- array()
for (j in 1:length(x))
{
array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"),  x[j], value = T))
}

When I break this up into 4 chunks of patterns it works:

data <- array()
for (j in 1:length(x))
{
array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),  x[j], value = T))
array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),  x[j], value = T))
array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),  x[j], value = T))
array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),  x[j], value = T))
}

My questions: what's the maximum size of the patterns argument in grep? Is there a way to do this faster? It is very slow.

Thanks.

Math

Sorry for not providing a reproducible example. It's a size issue which makes it difficult to provide an example.

 
Re: Maximum number of patterns and speed in grep

Sarah Goslee
Hi,

Given that you can't provide a full example, please at least provide
str() on your data, more complete information on the problem, and
ideally a small toy example that demonstrates precisely what you are
doing.

For instance, you tell us that you "get an error message" but you
never tell us what it is. Don't you think we might need to know what
the error is to be able to diagnose and fix it?

Also, note that your "working" example simply overwrites
array$chunk1[j] four times.

Sarah
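
A minimal sketch of keeping one result per chunk instead of overwriting the same element. The names `patterns` and `x` follow the original post, but the data here are toy stand-ins, and `chunk_size`, `chunks`, `hits` and `matched` are hypothetical names introduced only for illustration:

```r
# Toy stand-ins for the poster's objects (hypothetical data)
patterns <- c("abc", "xyz", "def", "ghi", "jkl")
x <- c("abcd", "z", "dbef", "xxghixx")

chunk_size <- 2  # the post used chunks of about 2500 patterns
chunks <- split(patterns, ceiling(seq_along(patterns) / chunk_size))

# One column per chunk: TRUE where any pattern of that chunk matches x[i].
# grepl() is vectorised over x, so no explicit loop over j is needed.
hits <- vapply(
  chunks,
  function(p) grepl(paste(p, collapse = "|"), x),
  logical(length(x))
)

# An element of x counts as matched if any chunk matched it
matched <- rowSums(hits) > 0
```

Because each chunk gets its own column, no result is lost, and the per-chunk columns can still be inspected separately if needed.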


--
Sarah Goslee
http://www.functionaldiversity.org

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Maximum number of patterns and speed in grep

mdvaan
Thanks for the quick response. I should rephrase my question, because everything is working fine; I am just trying to find a more efficient approach:

1. What's the maximum size of the patterns argument in grep? Can't find it online.
2. I am trying to match 7,700 character strings to about 10,000 vectors each containing about 5,000 strings using grep. Is there a way to do this faster? It is very slow.

Thanks
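
One option not raised in the thread, sketched here under the assumption that the patterns are plain strings rather than true regular expressions (the thread does not say): match each pattern literally with `fixed = TRUE`, which avoids compiling one very large alternation. The data below are hypothetical, and whether this is actually faster on the real data would need to be timed:

```r
# Hypothetical toy data; the poster's real patterns and text were not shared
patterns <- c("Santa Fe Gold Corp", "Starpharma Holdings", "A.B. Corp")
x <- c("the first is Santa Fe Gold Corp",
       "the second is Starpharma Holdings",
       "nothing to see here")

# fixed = TRUE treats each pattern as a literal string, so characters such
# as "." or "(" need no escaping and no large regex has to be compiled
hit_matrix <- vapply(patterns,
                     function(p) grepl(p, x, fixed = TRUE),
                     logical(length(x)))

# Number of patterns found in each element of x
n_matched <- rowSums(hit_matrix)
```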


Re: Maximum number of patterns and speed in grep

Gabor Grothendieck
In reply to this post by mdvaan
On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <[hidden email]> wrote:

> My questions: what's the maximum size of the patterns argument in grep? Is
> there a way to do this faster? It is very slow.

Try strapplyc in gsubfn and see
  http://gsubfn.googlecode.com
for more info.

# test data
x <- c("abcd", "z", "dbef")

# re is regexp with 7700 alternatives
#  to test with
g <- expand.grid(letters, letters, letters)
gp <- do.call("paste0", g)
gp7700 <- head(gp, 7700)
re <- paste(gp7700, collapse = "|")

# grep gives an error message; try() lets the rest of the script run
grep.out <- try(grep(re, x))

# strapplyc works
library(gsubfn)
which(sapply(strapplyc(x, re), length) > 0)


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


Re: Maximum number of patterns and speed in grep

mdvaan
Thanks, I see that it is working in the sample data. My data, however, gives me an error message:

data <- strapplyc(text, batch[[l]])
Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
  [tcl] couldn't compile regular expression pattern: parentheses () not balanced.

batch[[l]] is similar to your "re" string except that there is a larger variety of characters. I haven't been able to figure out which characters are causing trouble here. Any thoughts?

Thank you very much.

Math
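
One plausible cause of a "parentheses () not balanced" error is a name in the pattern list that contains a regex metacharacter such as "(". The real data are not shown, so this is an assumption. A minimal sketch of escaping such characters before building the alternation follows; `escape_regex`, `company_names` and `txt` are hypothetical names, not part of gsubfn or base R:

```r
# Hypothetical helper: put a backslash in front of each regex metacharacter
# so that every name is matched literally
escape_regex <- function(s) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", s)
}

# Hypothetical names, one of them with an unbalanced parenthesis
company_names <- c("Santa Fe Gold Corp", "Acme (Holdings", "A.B. Corp")
re <- paste(escape_regex(company_names), collapse = "|")

txt <- c("the first is Santa Fe Gold Corp",
         "owned by Acme (Holdings",
         "AxB Corp")

# The escaped pattern compiles, and "." no longer matches any character
hits <- grepl(re, txt)
```

Escaping keeps the names intact, whereas deleting the metacharacters changes what is being searched for.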




Re: Maximum number of patterns and speed in grep

Gabor Grothendieck
On Fri, Jul 13, 2012 at 9:40 AM, mdvaan <[hidden email]> wrote:

...
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Note part on last line about posting reproducible code.


Re: Maximum number of patterns and speed in grep

mdvaan
Here's some data (which should give you the error messages):

    library(gsubfn)  # provides strapplyc

    # read in data
    data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = TRUE, sep = ",")

    # first paste all data
    data1 <- paste(data[, 1], collapse = "|")

    # second paste subsets of the data
    data2a <- paste(data[1:750, 1], collapse = "|")
    data2b <- paste(data[751:1500, 1], collapse = "|")

    # define the object to be searched
    text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")

    # match
    strapplyc(text, data1)
    strapplyc(text, data2a)
    strapplyc(text, data2b)

Thanks in advance!

Math



Re: Maximum number of patterns and speed in grep

Gabor Grothendieck
On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <[hidden email]> wrote:

> Here's some data (which should give you the error messages):

Although strapplyc seems able to handle larger regular expressions
than grep in R, it seems neither can handle as many alternatives as in
your example, so process the patterns in chunks:

library(gsubfn)

k <- 3000 # chunk size

# match one chunk of patterns, starting at row `from` of data, against text
f <- function(from, text) {
        to <- min(from + k - 1, nrow(data))
        r <- paste(data[seq(from, to), 1], collapse = "|")
        # drop regex metacharacters so that the pattern compiles
        r <- gsub("[().*?+{}]", "", r)
        strapply(text, r)
}
ix <- seq(1, nrow(data), k) # first row of each chunk
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))
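
Since the Dropbox file may not be available, here is a self-contained sketch of the same chunking idea using only base R (`gregexpr` and `regmatches` in place of `strapply`). The vector `company_names` is a hypothetical stand-in for the first column of data.csv, and `extract_chunk`, `starts` and `txt` are names made up for this illustration:

```r
# Hypothetical stand-in for the first column of the poster's data.csv
company_names <- c("Santa Fe Gold Corp", "Starpharma Holdings",
                   "Acme (Holdings)", "Foo Bar Inc.")
txt <- c("the first is Santa Fe Gold Corp",
         "the second is Starpharma Holdings")

k <- 2  # chunk size; the reply above used 3000
starts <- seq(1, length(company_names), by = k)

# Extract all matches of one chunk of patterns from a single string s
extract_chunk <- function(from, s) {
  to <- min(from + k - 1, length(company_names))
  re <- paste(company_names[from:to], collapse = "|")
  re <- gsub("[().*?+{}]", "", re)  # same metacharacter stripping as above
  regmatches(s, gregexpr(re, s))[[1]]
}

# For each string, collect the matches from every chunk
out <- lapply(txt, function(s) unlist(lapply(starts, extract_chunk, s = s)))
```

Note that stripping the metacharacters changes the patterns themselves: "Foo Bar Inc." becomes "Foo Bar Inc", which also matches inside longer words.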



Re: Maximum number of patterns and speed in grep

mdvaan
Thanks! That worked like a charm.

Math


Re: Maximum number of patterns and speed in grep

mdvaan
In reply to this post by Gabor Grothendieck
Hi,

I have a minor follow-up question:

In the example below, "ann" and "nn" in the third element of text are matched. I would like to ignore all matches in which the character following the match is one of [:alpha:]. How do I do this without removing the "ignore.case = TRUE" argument of the strapply function?

So the output should be:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

[[3]]
NULL

Rather than:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

[[3]]
[1] "ann" "nn"

Thanks!


library(gsubfn)

# read in data
data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = TRUE, sep = ",")

# define the object to be searched
text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings", "the annual earnings exceed those of last year")

k <- 3000 # chunk size

# match one chunk of patterns, starting at row `from` of data, against text
f <- function(from, text) {
  to <- min(from + k - 1, nrow(data))
  r <- paste(data[seq(from, to), 1], collapse = "|")
  # drop regex metacharacters so that the pattern compiles
  r <- gsub("[().*?+{}]", "", r)
  strapply(text, r, ignore.case = TRUE)
}
ix <- seq(1, nrow(data), k) # first row of each chunk
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))
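
The thread records no answer to this follow-up. One possible approach, sketched here with base R rather than gsubfn, is a negative lookahead: wrap the alternation in a group and append `(?![[:alpha:]])`, which rejects any match that is immediately followed by a letter. This needs `perl = TRUE`. The vector `company_names` is a hypothetical stand-in for the data, chosen so that "ann" and "nn" are among the patterns:

```r
# Hypothetical patterns; "ann" and "nn" mimic the unwanted matches
company_names <- c("Santa Fe Gold Corp", "Starpharma Holdings", "ann", "nn")
txt <- c("the first is Santa Fe Gold Corp",
         "the second is Starpharma Holdings",
         "the annual earnings exceed those of last year")

# (?:...) groups the alternatives; (?![[:alpha:]]) fails the match when the
# next character is a letter
re <- paste0("(?:", paste(company_names, collapse = "|"), ")(?![[:alpha:]])")

m <- gregexpr(re, txt, perl = TRUE, ignore.case = TRUE)
out <- regmatches(txt, m)
```

With this sketch, elements without a match come back as `character(0)` rather than `NULL`. The lookahead checks only the character after the match, so a pattern preceded by letters can still match.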