extending strsplit(): supply pattern to keep, not to split by

Bill Dunlap
strsplit() is a convenient way to get a
list of items from a string when you
have a regular expression for what is not
an item.  E.g.,

   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
   [[1]]:
   [1] "1.2"    "34"     "1.7e-2"

However, sometimes it is more convenient to
give a pattern for the items you do want.
E.g., suppose you want to pull all the numbers
out of a string which contains a mix of numbers
and words.  Making a pattern for what a
number is is simpler than making a pattern
for what may come between the numbers.
   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
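
One way to see what this pattern treats as a number is to anchor it and
test whole strings against it.  The following is only a sketch: is_number()
is a hypothetical helper name introduced here for illustration, not
something from the original post or from base R.

```r
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

# Hypothetical helper: TRUE when the entire string is a number according
# to number.pattern (the pattern is wrapped in ^( ... )$ to anchor it).
is_number <- function(s) {
  grepl(paste0("^(", number.pattern, ")$"), s)
}

# Optional sign, mantissa with digits on at least one side of the point,
# optional exponent.
is_number(c("1.2", "34", "1.7e-2", ".4", "1.", "-3E+5"))

# No mantissa digits at all, or trailing non-numeric text.
is_number(c("mg", ".", "e3", "200mg"))
```

The first call should accept every element and the second should reject
every element, since the pattern requires at least one digit in the
mantissa.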

I propose adding a keep=FALSE argument to
strsplit() to do this.  If keep is FALSE,
then the split argument matches the stuff to
omit from the output; if keep is TRUE then
split matches the stuff to put into the
output.  Then we could do the following to
get a list of all the numbers in a string
(done in a version of strsplit() I'm working on
for S-PLUS):

   > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
   [[1]]:
   [1] "1.2"    "34"     "1.7e-2"

   > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
   [[1]]:
   [1] "200"
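
For readers who want to try the idea in base R, the keep=TRUE behaviour
can be approximated with gregexpr() and regmatches().  This is a minimal
sketch under stated assumptions: strsplit_keep() is a hypothetical name,
it is not the S-PLUS implementation mentioned above, and it ignores the
other arguments of strsplit().

```r
# Hypothetical wrapper, not part of base R or S-PLUS.
# keep = FALSE: `split` matches what to omit (ordinary strsplit()).
# keep = TRUE:  `split` matches what to return.
strsplit_keep <- function(x, split, keep = FALSE) {
  x <- as.character(x)
  if (!keep) {
    return(strsplit(x, split))
  }
  # gregexpr() locates every match in each string;
  # regmatches() extracts the matched substrings, one list element per string.
  regmatches(x, gregexpr(split, x))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

strsplit_keep("1.2, 34, 1.7e-2", number.pattern, keep = TRUE)
strsplit_keep("Ibuprofin 200mg", number.pattern, keep = TRUE)
```

Like strsplit(), the sketch is vectorized over x and returns a list with
one character vector per input string.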

Is this a reasonable thing to want strsplit to do?
Is this a reasonable parameterization of it?

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: extending strsplit(): supply pattern to keep, not to split by

Gabor Grothendieck
gsubfn in package gsubfn can do this.  See the examples
in ?gsubfn



Re: extending strsplit(): supply pattern to keep, not to split by

Bill Dunlap
On Tue, 4 Apr 2006, Gabor Grothendieck wrote:

> gsubfn in package gsubfn can do this.  See the examples
> in ?gsubfn

Thanks.  gsubfn looks useful, but may be overkill
for this, and it isn't vectorized.  To do what
strsplit(keep=T) would do I think you need to do something like:

   > findMatches<-function(strings, pattern){
        lapply(strings, function(string){
               v <- character()
               # use the pattern argument, not the global number.pattern
               gsubfn(pattern, function(x,...)v<<-c(v,x), string)
               v})
     }
   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
   > findMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
   [[1]]
   [1] "12" "34" "56" "89" "12"

   [[2]]
   [1] "1.2" ".4"  "1."  "1e3"

Is this worth encapsulating in a standard R function?
If so, is doing it via an extra argument to strsplit()
a reasonable way to do it?

   > strsplit(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern, keep=T)
   [[1]]:
   [1] "12" "34" "56" "89" "12"

   [[2]]:
   [1] "1.2" ".4"  "1."  "1e3"



Re: extending strsplit(): supply pattern to keep, not to split by

Gabor Grothendieck
On 4/4/06, Bill Dunlap <[hidden email]> wrote:
> On Tue, 4 Apr 2006, Gabor Grothendieck wrote:
>
> > gsubfn in package gsubfn can do this.  See the examples
> > in ?gsubfn
>
> Thanks.  gsubfn looks useful, but may be overkill
> for this, and it isn't vectorized.  To do what

gsubfn is vectorized.  It's just that you are not using the output of
gsubfn in this case.

> strsplit(keep=T) would do I think you need to do something like:
> [...]
> Is this worth encapsulating in a standard R function?

I will likely add a wrapper to the gsubfn package for this.

> If so, is doing it via an extra argument to strsplit()
> a reasonable way to do it?

My current thought was to create a strapply function to do that.


Re: extending strsplit(): supply pattern to keep, not to split by

Gabor Grothendieck
To follow up, strapply has been added to the
gsubfn package (gsubfn 0.1-1) which should make it
easier to address this problem.

It's basically just an sapply call around gsubfn which
returns the transformed matches rather than performing
substitution.  It's analogous to apply:

        apply(object, margin, function)
        strapply(object, pattern, function)

(The arguments shown above are not a complete list
nor are they the actual arg names but are simply
intended to show the close parallel between strapply
and apply.)

The default function in strapply returns its
first argument so for this problem we could omit
the function altogether and write:

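  library(gsubfn)  # ver 0.1-1 needed
  # number.pattern as defined earlier in this thread
  number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
  x <- c("12;34:56,89,,12", "1.2, .4, 1., 1e3")
  strapply(x, number.pattern)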

See ?strapply for more info.
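
To illustrate the apply-style idea without installing gsubfn, here is a
rough base-R sketch.  strapply_sketch() is a made-up name and is not how
gsubfn implements strapply; it only shows a function being applied to
each match, with a default function that returns the match unchanged.

```r
# Hypothetical stand-in for illustration; NOT the gsubfn implementation.
strapply_sketch <- function(x, pattern, FUN = function(m) m) {
  # One character vector of matches per input string.
  matches <- regmatches(x, gregexpr(pattern, x))
  # Apply FUN to each match and combine the results per string.
  lapply(matches, function(m) unlist(lapply(m, FUN)))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
x <- c("12;34:56,89,,12", "1.2, .4, 1., 1e3")

strapply_sketch(x, number.pattern)              # matches as character strings
strapply_sketch(x, number.pattern, as.numeric)  # matches converted to numbers
```

Passing a function such as as.numeric is where this differs from a plain
"keep" option on strsplit(): the matches can be transformed as they are
collected.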

