

For regular expression aficionados, I'd like a cleverer solution to
the following problem (my solution works just fine for my needs; I'm
just trying to improve my regex skills):

Given the string (entered, say, at a readline prompt):

"1 2 -5, 3- 6 4 8 5-7 10" ## only integers will be entered

parse it to produce the numeric vector:

c(1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 8, 5, 6, 7, 10)

Note that "-" in the expression is used to indicate a range of values
instead of ":"

Here's my UNclever solution:

First convert more than one space to a single space and then replace
"<any spaces>-<any spaces>" by ":" by:

> x1 <- gsub(" *- *",":",gsub(" +"," ",resp)) #giving
> x1
[1] "1 2:5, 3:6 4 8 5:7 10" ## Note that the comma remains

Next convert the single string into a character vector via strsplit by
splitting on anything but ":" or a digit:

> x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]] #giving
> x2
[1] "1" "2:5" "3:6" "4" "8" "5:7" "10"

Finally, parse() the vector, eval() each element, and unlist() the
resulting list of numeric vectors:

> unlist(lapply(parse(text=x2),eval)) #giving, as desired,
[1] 1 2 3 4 5 3 4 5 6 4 8 5 6 7 10

This seems far too clumsy and circumlocutory not to have a more
elegant solution from a true regex expert.

(Special note to Thomas Lumley: This seems one of the few instances
where eval(parse(...)) may actually be appropriate.)
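
For reference, the three steps can be collected into one function. This is only a sketch: the name parse_ranges and the up-front character check are assumptions added here, not part of the post. The check is there because the text comes from a prompt and ends up in eval().

```r
# Hypothetical wrapper around the three steps; the function name and the
# validation rule are illustrative assumptions.
parse_ranges <- function(resp) {
  # Refuse anything other than digits, spaces, commas and hyphens
  # before any text is handed to parse()/eval().
  if (grepl("[^0-9 ,-]", resp)) stop("unexpected character in input")
  # Step 1: collapse runs of spaces, then turn "a - b" into "a:b".
  x1 <- gsub(" *- *", ":", gsub(" +", " ", resp))
  # Step 2: split on anything that is not a digit or a colon.
  x2 <- strsplit(x1, split = "[^:[:digit:]]+")[[1]]
  x2 <- x2[nzchar(x2)]  # drop an empty field caused by a leading separator
  # Step 3: evaluate each piece ("4" or "2:5") and flatten.
  unlist(lapply(parse(text = x2), eval))
}

parse_ranges("1 2 -5, 3- 6 4 8 5-7 10")
```

Malformed ranges such as a trailing hyphen would still reach parse() and raise a syntax error, so a real prompt handler would probably want to catch that as well.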
Cheers to all,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Bert,
we can save a lot of time by using paste and then only one call to eval and
parse.

> x2 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10")
> system.time(for (i in 1:100) unlist(lapply(parse(text=x2),eval)))
   user  system elapsed
   0.06    0.00    0.03
> system.time(for (i in 1:100) eval(parse(text=paste("c(",paste(x2,
+ collapse=","),")"))))
   user  system elapsed
   0.01    0.00    0.03
>
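
To see why a single call is enough, it may help to look at the text that actually reaches parse(). The sketch below only rebuilds the string from the timing example; expr_text is a name introduced here for illustration.

```r
x2 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10")

# paste() glues the pieces into one call to c(); the default separator
# of paste() adds the spaces just inside the parentheses.
expr_text <- paste("c(", paste(x2, collapse = ","), ")")
expr_text
# [1] "c( 1,2:5,3:6,4,8,5:7,10 )"

# One parse() and one eval() now handle all seven elements.
eval(parse(text = expr_text))
```

The saving comes from parsing one expression rather than seven, which is why the gap should widen as the input gets longer.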
Rich


On Fri, 20 Aug 2010, Bert Gunter wrote:
> Given the string (entered, say, at a readline prompt):
>
> "1 2 -5, 3- 6 4 8 5-7 10" ## only integers will be entered

Presumably only non-negative integers.

> (Special note to Thomas Lumley: This seems one of the few instances
> where eval(parse(...)) may actually be appropriate.)
>

Yes, implementing a new mini-language is a valid use. It isn't necessary, though, and the following could probably be improved on:

s <- "1 2 -5, 3- 6 4 8 5-7 10"
## start position of every run of digits
npos <- gregexpr("[0-9]+", s)[[1]]
numbers <- as.numeric(substring(s, npos, attr(npos, "match.length") + npos - 1))
## index of the number immediately to the left of each hyphen
hyphens <- findInterval(gregexpr("-", s)[[1]], npos)
nn <- as.list(numbers)
## replace each right endpoint by the rest of its range (left endpoint + 1 onward)
nn[hyphens+1] <- mapply(seq, numbers[hyphens]+1, numbers[hyphens+1])
unlist(nn)
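
The findInterval() step is the least obvious part, so here is a small sketch of what it computes for this input. The variable hpos is introduced here for illustration, and the input string is the one used throughout the thread.

```r
s <- "1 2 -5, 3- 6 4 8 5-7 10"

npos <- gregexpr("[0-9]+", s)[[1]]  # where each number starts
hpos <- gregexpr("-", s)[[1]]       # where each hyphen sits

as.vector(npos)  # 1 3 6 9 12 14 16 18 20 22
as.vector(hpos)  # 5 10 19

# For each hyphen, findInterval() reports how many numbers start at or
# before it, i.e. the index of the number on the hyphen's left.
left_index <- findInterval(hpos, npos)
left_index       # 2 4 8
```

So the 2nd, 4th and 8th numbers (2, 3 and 5) open a range, and the numbers that follow them (5, 6 and 7) close it.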
thomas
Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle


On Aug 20, 2010, at 4:16 PM, RICHARD M. HEIBERGER wrote:
> Bert,
>
> we can save a lot of time by using paste and then only one call to eval and
> parse.
Building on Rich's approach:
> x
[1] "1 2 -5, 3- 6 4 8 5-7 10"

> eval(parse(text = paste("c(", paste(strsplit(gsub(" *- *", ":", x), split = " +|, +")[[1]], collapse = ","), ")")))
[1] 1 2 3 4 5 3 4 5 6 4 8 5 6 7 10

The key inner part is:

> strsplit(gsub(" *- *", ":", x), split = " +|, +")[[1]]
[1] "1" "2:5" "3:6" "4" "8" "5:7" "10"

HTH,
Marc Schwartz


Howdy. I don't know that I can produce anything less circumlocutory, but I
note that your "x2" form has a simple-enough structure that it can be further
parsed with regular expressions, i.e., as opposed to using parse and eval. I
don't know that this is an improvement -- just a variation on the theme.

I've appended an example.

-- Mike

#### Original vector
x <- "1 2 -5, 3- 6 4 8 5-7 10"; x

#### Convert ranges to standard R form
x1 <- gsub("[ ]*-[ ]*", ":", x); x1

#### Get rid of the comma
x2 <- gsub(",", " ", x1); x2

#### Remove extra spaces
x3 <- gsub("[ ]+", " ", x2); x3

#### Split off elements, now in standard form
x4 <- unlist(strsplit(x3, " ")); x4

#### Use regular expression for simple parse of elements
x5 <- sapply(x4, function(a) {
    n1 <- gsub("([[:digit:]]):[[:digit:]]", "\\1", a)
    n2 <- gsub("[[:digit:]]:([[:digit:]])", "\\1", a)
    n1:n2}, USE.NAMES=FALSE); x5

x6 <- unlist(x5); x6

##########################################################

> #### Original vector
> x <- "1 2 -5, 3- 6 4 8 5-7 10"; x
[1] "1 2 -5, 3- 6 4 8 5-7 10"
>
> #### Convert ranges to standard R form
> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
[1] "1 2:5, 3:6 4 8 5:7 10"
>
> #### Get rid of the comma
> x2 <- gsub(",", " ", x1); x2
[1] "1 2:5  3:6 4 8 5:7 10"
>
> #### Remove extra spaces
> x3 <- gsub("[ ]+", " ", x2); x3
[1] "1 2:5 3:6 4 8 5:7 10"
>
> #### Split off elements, now in standard form
> x4 <- unlist(strsplit(x3, " ")); x4
[1] "1" "2:5" "3:6" "4" "8" "5:7" "10"
>
> #### Use regular expression for simple parse of elements
> x5 <- sapply(x4, function(a) {
+     n1 <- gsub("([[:digit:]]):[[:digit:]]", "\\1", a)
+     n2 <- gsub("[[:digit:]]:([[:digit:]])", "\\1", a)
+     n1:n2}, USE.NAMES=FALSE); x5
[[1]]
[1] 1

[[2]]
[1] 2 3 4 5

[[3]]
[1] 3 4 5 6

[[4]]
[1] 4

[[5]]
[1] 8

[[6]]
[1] 5 6 7

[[7]]
[1] 10

> x6 <- unlist(x5); x6
[1] 1 2 3 4 5 3 4 5 6 4 8 5 6 7 10
>
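
One caveat on the gsub() step: the patterns match a single digit on each side of the colon, which is all the example input needs. A range with multi-digit endpoints would come out mangled. Here is a sketch of a variant that splits on the colon instead; expand_one and the sample vector with "12:15" are assumptions made up for illustration.

```r
# What the single-digit pattern does to a two-digit range:
gsub("([[:digit:]]):[[:digit:]]", "\\1", "12:15")
# [1] "125"

# Hypothetical alternative: split each element on ":" and convert the
# endpoints explicitly, so the number of digits no longer matters.
expand_one <- function(a) {
  ends <- as.integer(strsplit(a, ":", fixed = TRUE)[[1]])
  ends[1]:ends[length(ends)]  # a lone number gives ends[1]:ends[1]
}

x4 <- c("1", "2:5", "3:6", "4", "8", "12:15", "10")
unlist(lapply(x4, expand_one))
```

Using lapply() plus unlist() rather than sapply() also avoids sapply() simplifying to a matrix when every element happens to expand to the same length.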


Thanks Michael:
You are essentially doing the eval and parsing by hand instead of
letting eval(parse()) do the work. I prefer the latter.
However, your code did something that I did not expect and for which I
can find no documentation -- I would have thought it shouldn't work.
... and that is, the return of your sapply is n1:n2 where n1 and n2
are _character values_ (because that's what gsub returns, of course).
I would have thought this would give an error, but in fact it gives
the "correct" result. That is, to my complete surprise:
> "3":"5"
[1] 3 4 5
> seq(from= "3", to= "5")
[1] 3 4 5
> seq.int( "3", "5")
[1] 3 4 5
> "3":5
[1] 3 4 5
all work! Is this behavior documented anywhere and I've missed it, or
is it a secret "feature"? And to what extent does it work, noting
that:
> seq(from="3.5",to="5.5",by="1")
Error in to - from : non-numeric argument to binary operator
Cheers,
Bert
On Fri, Aug 20, 2010 at 4:39 PM, Michael Hannon <[hidden email]> wrote:
> Howdy. I don't know that I can produce anything less circumlocutory, but I
> note that your "x2" form has a simple-enough structure that it can be further
> parsed with regular expressions, i.e., as opposed to using parse and eval. I
> don't know that this is an improvement -- just a variation on the theme.
>
> I've appended an example.

Bert Gunter
Genentech Nonclinical Biostatistics
4677374
http://devo.gene.com/groups/devo/depts/ncb/home.shtml


> You are essentially doing the eval and parsing by hand instead of
> letting eval(parse()) do the work. I prefer the latter.
Hi, Bert. Yes, I agree with both your analysis and your preference.
> However, your code did something that I did not expect and for which I
> can find no documentation -- I would have thought it shouldn't work.
>
> ... and that is, the return of your sapply is n1:n2 where n1 and n2
> are _character values_ (because that's what gsub returns, of course).
> I would have thought this would give an error, but in fact it gives
> the "correct" result. That is, to my complete surprise:
>
> > "3":"5"
> [1] 3 4 5
> > seq(from= "3", to= "5")
> [1] 3 4 5
> > seq.int( "3", "5")
> [1] 3 4 5
> > "3":5
> [1] 3 4 5
>
> all work! Is this behavior documented anywhere and I've missed it or
> is it a secret "feature." And to what extent does it work, noting
> that:
>
> seq(from="3.5",to="5.5",by="1")
> Error in to - from : non-numeric argument to binary operator

I think this worked because I started my variable names with 'n', the same as
in "number" ;)
I just stumbled across this behavior, and I don't understand it either. There
does seem to be an ethos in some parts of the R community that says "let's
make this sucker work, regardless of the type of garbage that's input", but it
certainly isn't applied consistently:
> sin(pi)
[1] 1.2246e-16
> sin("3.1415927")
Error in sin("3.1415927") : Non-numeric argument to mathematical function
No suitable frames for recover()
Note that I'm not complaining about this, just noting an apparent
inconsistency.
Also, the details of the ':' function are not immediately obvious:
> `:`
.Primitive(":")
>
I suppose the source code would reveal all, but I have other things to do.
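
On the documentation question: as far as I can tell, the help page for the colon operator (?":") does say that non-numeric arguments are coerced internally to numeric, which would account for "3":"5". Code that should not depend on that can convert first. A minimal sketch, assuming the endpoints arrive as character strings the way gsub() returns them:

```r
n1 <- "3"; n2 <- "5"   # character values, as gsub() would return them

# Convert explicitly, then build the sequence from integers.
as.integer(n1):as.integer(n2)

# A non-numeric string becomes NA (with a warning) rather than being
# silently accepted, which makes bad input easier to spot.
bad <- suppressWarnings(as.integer("abc"))
is.na(bad)
```

The explicit conversion gives one place to check for NA before a sequence is ever built.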

-- Mike


How about:

x <- "1 2 -5, 3- 6 4 8 5-7 10"; x

library(gsubfn)
strapply( x, '(([0-9]+) *- *([0-9]+))|([0-9]+)',
    function(one,two,three,four) {
        if( nchar(four) > 0 ) return(as.numeric(four) )
        return( seq( from=as.numeric(two), to=as.numeric(three) ) )
    }
)[[1]]

If x is a vector of strings and you remove the [[1]], then you will get a list with each element corresponding to a string in x (unlisting will give a single vector).

This could easily be extended to handle floating-point numbers instead of just integers, and even negative numbers (as long as you have a clear rule to distinguish between a negative sign and the hyphen that marks a range).
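
For anyone without gsubfn installed, roughly the same idea can be sketched in base R with gregexpr() and regmatches(). This is an illustration only, not the strapply() implementation; tokens and expand_token are names invented here.

```r
x <- "1 2 -5, 3- 6 4 8 5-7 10"

# Pull out each number together with an optional "- number" tail, so a
# range arrives as one token ("2 -5") and a lone value as another ("4").
tokens <- regmatches(x, gregexpr("[0-9]+( *- *[0-9]+)?", x))[[1]]

# Hypothetical helper: split a token on the hyphen and expand it.
expand_token <- function(tok) {
  ends <- as.numeric(strsplit(tok, " *- *")[[1]])
  seq(from = ends[1], to = ends[length(ends)])
}

result <- unlist(lapply(tokens, expand_token))
result
```

As with the strapply() version, extending this to negative numbers would need a rule for telling a minus sign from a range hyphen.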

Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[hidden email]
801.408.8111
> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org]
> On Behalf Of Bert Gunter
> Sent: Friday, August 20, 2010 2:55 PM
> To: [hidden email]
> Subject: [R] Regex exercise
>

