Regex exercise

Bert Gunter
For regular expression aficionados, I'd like a cleverer solution to
the following problem (my solution works just fine for my needs; I'm
just trying to improve my regex skills):

Given the string (entered, say, at a readline prompt):

 "1   2 -5, 3- 6 4  8 5-7 10"   ## only integers will be entered

parse it to produce the numeric vector:

c(1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 8, 5, 6, 7, 10)

Note that "-" in the expression indicates a range of values, where R
itself would use ":".

Here's my UNclever solution:

First collapse runs of spaces to a single space, and then replace
"<any spaces>-<any spaces>" with ":" (resp holds the input string):

>  x1 <- gsub(" *- *",":",gsub(" +"," ",resp))  #giving
> x1
[1] "1 2:5, 3:6 4 8 5:7 10"    ## Note that the comma remains

Next convert the single string into a character vector via strsplit by
splitting on anything but ":" or a digit:

> x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]]   #giving
> x2
[1] "1"    "2:5"  "3:6" "4"    "8"    "5:7"  "10"

Finally, parse() the vector, eval() each element, and unlist() the
resulting list of numeric vectors:

>  unlist(lapply(parse(text=x2),eval)) #giving, as desired,
 [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
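
The three steps above can be collected into one function. This is only a sketch: the name parse_ranges_eval is made up for illustration, and it assumes the input contains nothing but non-negative integers, spaces, commas and hyphens.

```r
# Hypothetical helper (not from the original post) wrapping the three steps.
parse_ranges_eval <- function(resp) {
  # Step 1: collapse runs of spaces, then turn "a - b" into "a:b"
  x1 <- gsub(" *- *", ":", gsub(" +", " ", resp))
  # Step 2: split on anything that is neither ":" nor a digit
  x2 <- strsplit(x1, split = "[^:[:digit:]]+")[[1]]
  # Step 3: parse and evaluate each piece, then flatten the list
  unlist(lapply(parse(text = x2), eval))
}

parse_ranges_eval("1   2 -5, 3- 6 4  8 5-7 10")
```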


This seems far too clumsy and circumlocutory not to have a more
elegant solution from a true regex expert.

(Special note to Thomas Lumley: This seems one of the few instances
where eval(parse(...)) may actually be appropriate.)

Cheers to all,

Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Regex exercise

Richard M. Heiberger
Bert,

We can save a lot of time by pasting the pieces into a single c(...) call,
so that eval and parse are each called only once.

> x2 <- c("1",    "2:5",  "3:6", "4",    "8",    "5:7",  "10")
> system.time(for (i in 1:100)  unlist(lapply(parse(text=x2),eval)))
   user  system elapsed
   0.06    0.00    0.03
> system.time(for (i in 1:100)  eval(parse(text=paste("c(",paste(x2, collapse=","),")"))))
   user  system elapsed
   0.01    0.00    0.03
>

Rich

Re: Regex exercise

Thomas Lumley
In reply to this post by Bert Gunter
On Fri, 20 Aug 2010, Bert Gunter wrote:

> Given the string (entered, say, at a readline prompt):
>
> "1   2 -5, 3- 6 4  8 5-7 10"   ## only integers will be entered

Presumably only non-negative integers

> (Special note to Thomas Lumley: This seems one of the few instances
> where eval(parse..)) may actually be appropriate.)
>

Yes, implementing a new minilanguage is a valid use.  It isn't necessary, though, and the following could probably be improved on:

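s<-"1   2 -5, 3- 6 4  8 5-7 10"

## start position of every run of digits, and the numbers themselves
npos<-gregexpr("[0-9]+",s)[[1]]
numbers<-as.numeric(substring(s,npos,attr(npos,"match.length")+npos-1))
## for each hyphen, the index of the number just before it
hyphens<-findInterval(gregexpr("-",s)[[1]],npos)
nn<-as.list(numbers)
## the number after a hyphen is replaced by the rest of its range;
## SIMPLIFY=FALSE keeps a list even when all ranges have the same length
nn[hyphens+1]<-mapply(seq,numbers[hyphens]+1,numbers[hyphens+1],SIMPLIFY=FALSE)
unlist(nn)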
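
For comparison, here is a sketch of another way to avoid eval(parse()): pull out each token (a lone integer or an "a - b" range) in one regex pass and expand it with seq(). It assumes an R version that provides regmatches(), and the function name parse_ranges is made up for illustration.

```r
# Hypothetical helper: expand "a - b" ranges without eval(parse()).
parse_ranges <- function(s) {
  # A token is an integer, optionally followed by "-" and a second integer,
  # with any number of spaces around the hyphen.
  tokens <- regmatches(s, gregexpr("[0-9]+( *- *[0-9]+)?", s))[[1]]
  pieces <- lapply(tokens, function(tok) {
    # One endpoint for a lone integer, two for a range
    ends <- as.numeric(strsplit(tok, " *- *")[[1]])
    seq(ends[1], ends[length(ends)])
  })
  unlist(pieces)
}

parse_ranges("1   2 -5, 3- 6 4  8 5-7 10")
```

Because every endpoint goes through as.numeric(), nothing in the input is ever evaluated as R code.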


      -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

Re: Regex exercise

Marc Schwartz
In reply to this post by Richard M. Heiberger

On Aug 20, 2010, at 4:16 PM, RICHARD M. HEIBERGER wrote:

> Bert,
>
> we can save a lot of time by using paste and then only one call to eval and
> parse.
>
>> x2 <- c("1",    "2:5",  "3:6", "4",    "8",    "5:7",  "10")
>> system.time(for (i in 1:100)  unlist(lapply(parse(text=x2),eval)))
>   user  system elapsed
>   0.06    0.00    0.03
>> system.time(for (i in 1:100)  eval(parse(text=paste("c(",paste(x2,
> collapse=","),")"))))
>   user  system elapsed
>   0.01    0.00    0.03
>>
>
> Rich


Building on Rich's approach:

> x
[1] "1   2 -5, 3- 6 4  8 5-7 10"


> eval(parse(text = paste("c(", paste(strsplit(gsub(" *- *", ":", x), split = " +|, +")[[1]], collapse = ","), ")")))
 [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10


The key inner part is:

> strsplit(gsub(" *- *", ":", x), split = " +|, +")[[1]]
[1] "1"   "2:5" "3:6" "4"   "8"   "5:7" "10"
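
Since the string comes from a prompt and ends up in eval(parse()), it may be worth rejecting anything outside the expected alphabet first. The following is a sketch under that assumption; the function name parse_ranges_checked and the error message are made up for illustration.

```r
# Hypothetical wrapper: whitelist the input before handing it to eval(parse()).
parse_ranges_checked <- function(x) {
  # Allow only digits, spaces, commas and hyphens
  if (!grepl("^[0-9 ,-]*$", x)) {
    stop("input may contain only digits, spaces, commas and hyphens")
  }
  # Same transformation as above: "a - b" becomes "a:b", then split into tokens
  tokens <- strsplit(gsub(" *- *", ":", x), split = " +|, +")[[1]]
  eval(parse(text = paste("c(", paste(tokens, collapse = ","), ")")))
}

parse_ranges_checked("1   2 -5, 3- 6 4  8 5-7 10")
```

Malformed but harmless input (a trailing hyphen, say) still fails, but as a parse error rather than by running arbitrary code.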


HTH,

Marc Schwartz

Re: Regex exercise

Michael Hannon
In reply to this post by Bert Gunter
Howdy.  I don't know that I can produce anything less circumlocutory, but I
note that your "x2" form has a simple-enough structure that it can be further
parsed with regular expressions, i.e., as opposed to using parse and eval.  I
don't know that this is an improvement -- just a variation on the theme.

I've appended an example.

-- Mike

#### Original vector
x <- "1  2 -5, 3- 6 4  8 5-7 10"; x

#### Convert ranges to standard R form
x1 <- gsub("[ ]*-[ ]*", ":", x); x1

#### Get rid of the comma
x2 <- gsub(",", " ", x1); x2

#### Remove extra spaces
x3 <- gsub("[ ]+", " ", x2); x3

#### Split off elements, now in standard form
x4 <- unlist(strsplit(x3, " ")); x4

#### Use regular expression for simple parse of elements
x5 <- sapply(x4, function(a) {
          n1 <- gsub("([[:digit:]]+):[[:digit:]]+", "\\1", a)  # start of range; "+" allows multi-digit endpoints
          n2 <- gsub("[[:digit:]]+:([[:digit:]]+)", "\\1", a)  # end of range
          n1:n2}, USE.NAMES=FALSE); x5
x6 <- unlist(x5); x6

##########################################################

> #### Original vector
> x <- "1  2 -5, 3- 6 4  8 5-7 10"; x
[1] "1  2 -5, 3- 6 4  8 5-7 10"
>
> #### Convert ranges to standard R form
> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
[1] "1  2:5, 3:6 4  8 5:7 10"
>
> #### Get rid of the comma
> x2 <- gsub(",", " ", x1); x2
[1] "1  2:5  3:6 4  8 5:7 10"
>
> #### Remove extra spaces
> x3 <- gsub("[ ]+", " ", x2); x3
[1] "1 2:5 3:6 4 8 5:7 10"
>
> #### Split off elements, now in standard form
> x4 <- unlist(strsplit(x3, " ")); x4
[1] "1"   "2:5" "3:6" "4"   "8"   "5:7" "10"
>
> #### Use regular expression for simple parse of elements
> x5 <- sapply(x4, function(a) {
+           n1 <- gsub("([[:digit:]]+):[[:digit:]]+", "\\1", a)  # start of range; "+" allows multi-digit endpoints
+           n2 <- gsub("[[:digit:]]+:([[:digit:]]+)", "\\1", a)  # end of range
+           n1:n2}, USE.NAMES=FALSE); x5
[[1]]
[1] 1

[[2]]
[1] 2 3 4 5

[[3]]
[1] 3 4 5 6

[[4]]
[1] 4

[[5]]
[1] 8

[[6]]
[1] 5 6 7

[[7]]
[1] 10

> x6 <- unlist(x5); x6
 [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
>

Re: Regex exercise

Bert Gunter
Thanks Michael:

You are essentially doing the eval and parsing by hand instead of
letting eval(parse()) do the work. I prefer the latter.

However, your code did something that I did not expect and for which I
can find no documentation -- I would have thought it shouldn't work.

... and that is, the return of your sapply is n1:n2  where n1 and n2
are _character values_ (because that's what gsub returns, of course).
I would have thought this would give an error, but in fact it gives
the "correct" result. That is, to my complete surprise:

> "3":"5"
[1] 3 4 5
> seq(from= "3", to= "5")
[1] 3 4 5
> seq.int( "3", "5")
[1] 3 4 5
> "3":5
[1] 3 4 5

all work! Is this behavior documented anywhere and I've missed it, or
is it a secret "feature"?  And to what extent does it work, noting
that:

> seq(from="3.5",to="5.5",by="1")
Error in to - from : non-numeric argument to binary operator
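
As far as I can tell, the help page for the colon operator (?Colon) in current versions of R says that non-numeric arguments are coerced internally to numeric, which would account for the cases that work. A sketch that does not rely on that coercion converts the endpoints explicitly, which also handles the seq() call with by= that fails above. The variable names are only for illustration.

```r
# Convert the character endpoints explicitly instead of relying on ":" to coerce
from <- as.numeric("3")
to   <- as.numeric("5")
r1 <- from:to

# With explicit conversion, the non-integer case with by= works as well
r2 <- seq(as.numeric("3.5"), as.numeric("5.5"), by = as.numeric("1"))

r1
r2
```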


Cheers,
Bert

Re: Regex exercise

Michael Hannon
> You are essentially doing the eval and parsing by hand instead of
> letting eval(parse()) do the work. I prefer the latter.

Hi, Bert.  Yes, I agree with both your analysis and your preference.

> However, your code did something that I did not expect and for which I
> can find no documentation -- I would have thought it shouldn't work.
>
> ... and that is, the return of your sapply is n1:n2  where n1 and n2
> are _character values_ (because that's what gsub returns, of course).
> I would have thought this would give an error, but in fact it gives
> the "correct" result. That is, to my complete surprise:
>
> > "3":"5"
> [1] 3 4 5
> > seq(from= "3", to= "5")
> [1] 3 4 5
> > seq.int( "3", "5")
> [1] 3 4 5
> > "3":5
> [1] 3 4 5
>
> all work! Is this behavior documented anywhere and I've missed it or
> is it a secret "feature."  And to what extent does it work, noting
> that:
>
> seq(from="3.5",to="5.5",by="1")
> Error in to - from : non-numeric argument to binary operator

I think this worked because I started my variable names with 'n', the same as
in "number" ;-)

I just stumbled across this behavior, and I don't understand it either.  There
does seem to be an ethos in some parts of the R community that says "let's
make this sucker work, regardless of the type of garbage that's input", but it
certainly isn't applied consistently:

    > sin(pi)
    [1] 1.2246e-16

    > sin("3.1415927")
    Error in sin("3.1415927") : Non-numeric argument to mathematical function
    No suitable frames for recover()

Note that I'm not complaining about this, just noting an apparent
inconsistency.

Also, the details of the ':' function are not immediately obvious:

    > `:`
    .Primitive(":")
    >

I suppose the source code would reveal all, but I have other things to do.

-- Mike

Re: Regex exercise

Greg Snow
In reply to this post by Bert Gunter
How about:

x <- "1  2 -5, 3- 6 4  8 5-7 10"; x

library(gsubfn)

strapply( x, '(([0-9]+) *- *([0-9]+))|([0-9]+)',
        function(one,two,three,four) {
                if( nchar(four) > 0 ) return(as.numeric(four) )
                return( seq( from=as.numeric(two), to=as.numeric(three) ) )
        }
)[[1]]



If x is a vector of strings and you remove the [[1]] then you will get a list with each element corresponding to a string in x (unlisting will give a single vector).

This could easily be extended to handle floating point numbers instead of just integers, and even negative numbers (as long as you have a clear rule to distinguish a negative sign from the hyphen that marks a range).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111

