Feature request: non-dropping regmatches/strextract

Feature request: non-dropping regmatches/strextract

R devel mailing list
A very common use case for regmatches is to extract regex matches into a new column in a data.frame (or data.table, etc.) or otherwise use the extracted strings alongside the input. However, the default behavior is to drop empty matches, which results in mismatches in column length if reassignment is done without subsetting.
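
To make the length mismatch concrete, here is a minimal sketch. The data frame `df`, its column names, and the pattern are made up for illustration and do not come from the thread.

```r
# Illustrative data; `df` and the pattern are assumptions for this example.
df <- data.frame(id = c("a1", "b", "c3"), stringsAsFactors = FALSE)

m <- regexpr("[0-9]+", df$id)      # -1 marks the element with no match
digits <- regmatches(df$id, m)     # the non-matching element is dropped

length(digits)                     # 2, although nrow(df) is 3

# Assigning `digits` directly as a column would fail with a length error,
# so the usual workaround is to fill only the matching rows.
matched <- m != -1L
df$digits <- NA_character_
df$digits[matched] <- digits
df
```

The subsetting step in the last lines is the extra work that a non-dropping variant would make unnecessary.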

For consistency with other R functions and compatibility with this use case, it would be nice if regmatches did not automatically drop empty matches and would instead insert an NA_character_ value (similar to stringr::str_extract). This alternative behavior could be provided through an optional drop argument or a new function, or shown as an example in the documentation (a la resample in ?sample).

Alternatively, at the moment, there is a non-exported function strextract in utils which is very similar to stringr::str_extract. It would be great if this function, once exported, were to include a drop argument to prevent dropping positions with no matches. 

An example solution (last option):

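strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = TRUE) {
  # regexec() + regmatches() gives a list with one element per input;
  # inputs without a match yield character(0).
  m <- regexec(pattern, x, perl = perl, useBytes = useBytes)
  result <- regmatches(x, m)

  if (isTRUE(drop)) {
    # Drop positions with no match.
    unlist(result)
  } else if (isFALSE(drop)) {
    # Keep positions with no match as NA_character_.
    result[lengths(result) == 0] <- NA_character_
    unlist(result)
  } else {
    stop("Invalid argument for `drop`")
  }
}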

Based on Ricardo Saporta's response to "How to prevent regmatches drop non matches?"

--CG

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
Changing the default behavior of regmatches would break its use with
gregexpr, where the number of matches per input element varies, so a
zero-length character vector makes more sense than NA_character_.

> x <- c("John Doe", "e e cummings", "Juan de la Madrid")
> m <- gregexpr("[A-Z]", x)
> regmatches(x,m)
[[1]]
[1] "J" "D"

[[2]]
character(0)

[[3]]
[1] "J" "M"

> vapply(.Last.value, function(x)paste(paste0(x, "."),collapse=""), "")
[1] "J.D." "."    "J.M."

(We don't want e e cummings initials mapped to "NA.")
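
A small sketch of what that would look like if empty elements of the gregexpr-based list were filled with NA. The NA fill here is a hypothetical post-processing step written for illustration, not an existing regmatches option.

```r
x <- c("John Doe", "e e cummings", "Juan de la Madrid")
caps <- regmatches(x, gregexpr("[A-Z]", x))

# Hypothetical NA fill applied to the list output (illustration only).
filled <- caps
filled[lengths(filled) == 0L] <- NA_character_

initials <- function(v) paste(paste0(v, "."), collapse = "")
vapply(caps, initials, "")    # "J.D." "."   "J.M."
vapply(filled, initials, "")  # "J.D." "NA." "J.M."
```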

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Aug 15, 2019 at 12:15 AM Cyclic Group Z_1 via R-devel <
[hidden email]> wrote:

> A very common use case for regmatches is to extract regex matches into a
> new column in a data.frame (or data.table, etc.) or otherwise use the
> extracted strings alongside the input. However, the default behavior is to
> drop empty matches, which results in mismatches in column length if
> reassignment is done without subsetting.
>
> For consistency with other R functions and compatibility with this use
> case, it would be nice if regmatches did not automatically drop empty
> matches and would instead insert an NA_character_ value (similar to
> stringr::str_extract). This alternative regmatches could be implemented
> through an optional drop argument, a new function, or mentioned in the
> documentation (a la resample in ?sample).
>
> Alternatively, at the moment, there is a non-exported function strextract
> in utils which is very similar to stringr::str_extract. It would be great
> if this function, once exported, were to include a drop argument to prevent
> dropping positions with no matches.
>
> An example solution (last option):
>
> strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop =
> T) {
>  m <- regexec(pattern, x, perl=perl, useBytes=useBytes)
>  result <- regmatches(x, m)
>
>  if(isTRUE(drop)){
>  unlist(result)
>  } else if(isFALSE(drop)) {
>  unlist({result[lengths(result)==0] <- NA_character_; result})
>  } else {
>  stop("Invalid argument for `drop`")
>  }
> }
>
> Based on Ricardo Saporta's response to How to prevent regmatches drop non
> matches?
>
> --CG
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
I do think keeping the default behavior is desirable for backwards compatibility; my suggestion is not to change default behavior but to add an optional argument that allows a different behavior. Although this can be implemented in a user-defined function, retaining empty matches facilitates programmatic use, and seems to be something that should be available in base R. It is available, for example, in MATLAB, a comparable array language.

Alternatively, perhaps a nomatch (or maybe emptymatch) argument in the spirit of `[.data.table`? That is, an argument nomatch where nomatch = NULL (the default) results in drops for vector outputs and character(0) for list outputs and nomatch = NA results in insertion of NA_character_, and nomatch = '' results in insertion of empty string.
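
The nomatch semantics described above could be sketched for the regexpr case roughly as follows. The function name `regmatches_nomatch` and its argument order are hypothetical; this is a user-level wrapper for illustration, not a proposed patch to regmatches itself.

```r
# Hypothetical helper, not part of base R: wraps regexpr() + regmatches()
# and fills non-matching positions according to `nomatch`.
regmatches_nomatch <- function(x, pattern, nomatch = NULL, ...) {
  m <- regexpr(pattern, x, ...)
  if (is.null(nomatch)) {
    return(regmatches(x, m))         # current behavior: drop non-matches
  }
  # Start from the fill value, then overwrite the positions that matched.
  out <- rep(as.character(nomatch), length.out = length(x))
  hit <- !is.na(m) & m != -1L
  out[hit] <- regmatches(x, m)
  out
}

x <- c("a1", "b", "c3")
regmatches_nomatch(x, "[0-9]+")                # "1" "3"
regmatches_nomatch(x, "[0-9]+", nomatch = NA)  # "1" NA  "3"
regmatches_nomatch(x, "[0-9]+", nomatch = "")  # "1" ""  "3"
```

With nomatch = NA or nomatch = '' the result always has the same length as the input, which is what makes it safe to assign as a data.frame column.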

I can submit proposed patch code if others think this is a good idea.

What are your thoughts on the proposed alteration to (currently nonexported) strextract? I assume (maybe wrongly) that the plan is to eventually export that function.

Thank you,
CG

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
I don't care much for regmatches and haven't tried strextract, but I think
replacing the character(0) by NA_character_ is almost always inappropriate
if the match information comes from gregexpr.

I think strcapture() does a pretty good job of what I think you are trying
to do.  Perhaps adding an argument to map no match to NA instead of ""
would give you just what you wanted.

> x <- c("Groucho <[hidden email]>", "<[hidden email]>", "Harpo")
> d <- strcapture("([[:alpha:]]+)?( *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?", x,
+     proto=data.frame(Name=character(), Junk=character(),
+                      Address=character(), stringsAsFactors=FALSE))
> d[c("Name", "Address")]
     Name          Address
1 Groucho [hidden email]
2           [hidden email]
3   Harpo
> str(.Last.value)
'data.frame':   3 obs. of  2 variables:
 $ Name   : chr  "Groucho" "" "Harpo"
 $ Address: chr  "[hidden email]" "[hidden email]" ""
Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Aug 15, 2019 at 11:31 AM Cyclic Group Z_1 <[hidden email]>
wrote:

> I do think keeping the default behavior is desirable for backwards
> compatibility; my suggestion is not to change default behavior but to add
> an optional argument that allows a different behavior. Although this can be
> implemented in a user-defined function, retaining empty matches facilitates
> programmatic use, and seems to be something that should be available in
> base R. It is available, for example, in MATLAB, a comparable array
> language.
>
> Alternatively, perhaps a nomatch (or maybe emptymatch) argument in the
> spirit of `[.data.table`? That is, an argument nomatch where nomatch = NULL
> (the default) results in drops for vector outputs and character(0) for list
> outputs and nomatch = NA results in insertion of NA_character_, and nomatch
> = '' results in insertion of empty string.
>
> I can submit proposed patch code if others think this is a good idea.
>
> What are your thoughts on the proposed alteration to (currently
> nonexported) strextract? I assume (maybe wrongly) that the plan is to
> eventually export that function.
>
> Thank you,
> CG
>

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
Using a non-capturing group, "(?:...)" instead of "(...)", simplifies my
example a bit:

> x <- c("Groucho <[hidden email]>", "<[hidden email]>", "Harpo")
> strcapture("([[:alpha:]]+)?(?: *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?", x,
+     proto=data.frame(Name=character(), Address=character(),
+                      stringsAsFactors=FALSE))
     Name          Address
1 Groucho [hidden email]
2           [hidden email]
3   Harpo

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Aug 15, 2019 at 1:04 PM William Dunlap <[hidden email]> wrote:

> I don't care much for regmatches and haven't tried strextract, but I think
> replacing the character(0) by NA_character_ is almost always inappropriate
> if the match information comes from gregexpr.
>
> I think strcapture() does a pretty good job of what I think you are trying
> to do.  Perhaps adding an argument to map no match to NA instead of ""
> would give you just what you wanted.
>
> > x <- c("Groucho <[hidden email]>", "<[hidden email]>", "Harpo")
> > d <- strcapture("([[:alpha:]]+)?( *<([[:alpha:]. ]+@[[:alpha:]. ]+)>)?",
> x, proto=data.frame(Name=character(), Junk=character(),
> Address=character(), stringsAsFactors=FALSE))
> > d[c("Name", "Address")]
>      Name          Address
> 1 Groucho [hidden email]
> 2           [hidden email]
> 3   Harpo
> > str(.Last.value)
> 'data.frame':   3 obs. of  2 variables:
>  $ Name   : chr  "Groucho" "" "Harpo"
>  $ Address: chr  "[hidden email]" "[hidden email]" ""
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
Using strcapture seems like a great workaround for use cases of this kind, at least in base R. I agree as well that filling with NA for regmatches(..., gregexpr(...)) makes less sense, given the outputs are lists and thus are retained in the list.  Also, I suppose in the meantime the stringr package can be used when non-dropping vector outputs are desired.

However, I do think that non-dropping regex string extraction/matching in vector outputs from regmatches(..., regexpr(...)) or strextract would be a great (optional) design feature to have in base R for the sake of consistency with the rest of the language (missing values, denoted by NA, are generally not dropped from vectors elsewhere and seem to agree conceptually with empty matches) and would help R to reach greater feature parity with MATLAB and Pandas in this respect (granted, Pandas is not technically a language on its own).

Although I have written personal wrappers and used stringr to accomplish the non-dropping behavior in the past, I have nevertheless found the behavior of base R string operations mildly astonishing (in the sense of POLA) and think others may have as well. As the stringr documentation puts it, "they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R." Since consistent, robust string operations are often a standard base feature of other data science and scientific programming languages, I think this minor change would be a great improvement to the language and hopefully help promote adoption of R, especially given the surge in text-based data analysis in recent years.

Alternatively, although I generally don't use the Tidyverse packages very often, stringr seems like a great candidate for inclusion in base or recommended R if the R Core team and the package developer see fit (just a suggestion and probably a long shot).

However, I will try not to belabor this point further. In any case, thank you!

Best,
CG

Re: Feature request: non-dropping regmatches/strextract

Toby Hocking-2
In reply to this post by R devel mailing list
If you want "to extract regex matches into a new column in a data.frame",
then there are some package functions which do exactly that. Three examples
are namedCapture::df_match_variable, rematch2::bind_re_match, and
tidyr::extract. For a more detailed discussion, see my R Journal submission
(under review) about regular expression packages:
https://raw.githubusercontent.com/tdhock/namedCapture-article/master/RJwrapper.pdf
Comments/suggestions welcome.

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
Thank you, I am aware that there are packages that can accomplish this. I mentioned stringr::str_extract as a function that does not drop empty matches. I think that the behavior of regmatches(..., regexpr(...)) in base R should permit an option to prevent dropping of empty matches, both for the sake of consistency with the rest of the language (missing data does not yield a dropped index in other sorts of R functions, and an empty match conceptually corresponds with missing data) and for ease of use in data.frames. The behavior of regmatches(..., gregexpr(...)) is not objectionable to me, as lists do not drop indices when they contain character(0) vectors. Alternatively, perhaps this should be reflected in the (currently non-exported) strextract.

Best,
CG

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
I'd be happy to entertain patches or at least more specific
suggestions to improve strextract() and strcapture(). I hadn't
exported strextract(), because I wasn't quite sure how it should
behave. This feedback should be helpful.

Thanks,
Michael

On Thu, Aug 29, 2019 at 2:20 PM Cyclic Group Z_1 via R-devel
<[hidden email]> wrote:
>
> Thank you, I am aware that there are packages that can accomplish this. I mentioned stringr::str_extract as a function that does not drop empty matches. I think that the behavior of regmatches(..., regexpr(...)) in base R should permit an option to prevent dropping of empty matches both for sake of consistency with the rest of the language (missing data does not yield a dropped index in other sorts of R functions, and an empty match conceptually corresponds with missing data) and facility of use in data.frames. The behavior of regmatches(..., gregexpr(...)) is not objectionable to me, as lists do not drop indices when they contain character(0) vectors. Alternatively, perhaps this should be reflected in the (currently non-exported) strextract.
>
> Best,
> CG
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
[hidden email]

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
Thank you! I greatly appreciate your consideration, though of course it is up to you. I think many people switch to stringr/stringi simply because functions in those packages have some consistent design choices; for example, they do not drop empty/missing matches, which facilitates array-based programming. This matters when one needs to make a new column in a data.frame (data.table, tibble, etc.) of regex extractions, or in any other case where there needs to be an element-wise correspondence between input and output. I think insertion of NA_character_ to prevent dropping indices seems like the natural choice for an array language (which, I think, motivated the creation of stringr/stringi). While those are great packages and this behavior can be easily replicated with simple wrappers, string operations are normally easy to accomplish in base languages, so this seems like something that would be appropriate to have in base. For example, MATLAB and Pandas regex both allow non-dropping empty matches (though of course I acknowledge Pandas is not a base language).

Best,
CG

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
Just started thinking about this. The name of regmatches() suggests
that it will only extract the matches but not return anything for the
non-matches. We might need another function that returns a value for
non-matches. Perhaps the value should be the empty string for
non-matches and NA for matches to NA. The rationale is that we
delegate to regexpr() (at least conceptually), and it returns a
"matching region" which would be empty when there is no match. We
could allow strcapture() to accept an atomic vector as a prototype,
which would do what you want for regexec() (NA on no match, empty
string on empty capture). Then we could call the regexpr()-based
function strextract().
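
As a rough illustration of the regexpr()-based behavior described above (empty string for a non-match, NA for an NA input), one possible sketch follows. The name `strextract_sketch` is hypothetical, and the non-exported utils:::strextract may behave differently.

```r
# Hypothetical sketch of the regexpr()-based design; not the utils code.
strextract_sketch <- function(pattern, x, perl = FALSE, useBytes = FALSE) {
  m <- regexpr(pattern, x, perl = perl, useBytes = useBytes)
  out <- rep("", length(x))        # empty "matching region" when no match
  out[is.na(m)] <- NA_character_   # NA input propagates as NA
  hit <- !is.na(m) & m != -1L
  out[hit] <- regmatches(x, m)     # fill in the actual matches
  out
}

strextract_sketch("[0-9]+", c("a1", "b", NA, "c3"))  # "1" ""  NA  "3"
```

The output has one element per input, so it can be used alongside the input without subsetting.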

What do you think?

Michael

On Thu, Aug 29, 2019 at 3:27 PM Cyclic Group Z_1
<[hidden email]> wrote:
>
> Thank you! I greatly appreciate your consideration, though of course it is up to you. I think many people switch to stringr/stringi simply because functions in those packages have some consistent design choices, for example, they do not drop empty/missing matches, which facilitates array-based programming. For example, in the cases where one needs to make a new column in a data.frame (data.table, tibble, etc.) of regex extractions. Or in any other case where there needs to be an element-wise correspondence between input and output. I think insertion of NA_character_ to prevent dropping indices seems like the natural choice for an array language (which, I think, motivated the creation of stringr/stringi). While those are great packages and this behavior can be easily replicated with simple wrappers, string operations are normally easy to accomplish in base languages, so this seems like something that would be appropriate to have in base. For example, MATLAB and Pandas regex both allow non-dropping empty matches (though of course I acknowledge Pandas is not a base language).
>
> Best,
> CG



--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
[hidden email]

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
I think that's a good reason for not including this in regmatches; you're right, its name is somewhat suggestive of yielding matches. Also, that sounds like a great design for strcapture with an atomic prototype.

Best,
CG

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
After some discussion within R core, we decided that a "nomatch"
argument on regmatches() may be a good initial step. We might add a
new function later that combines the regexpr() and regmatches() steps.
The gregexpr() and regexec() inputs are both lists so it's not clear
whether a "nomatch" value would be relevant (the elements are empty)
in those cases.

On Mon, Sep 2, 2019 at 11:38 AM Cyclic Group Z_1
<[hidden email]> wrote:
>
> I think that's a good reason for not including this in regmatches; you're right, its name is somewhat suggestive of yielding matches. Also, that sounds like a great design for strcapture with an atomic prototype.
>
> Best,
> CG



--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
[hidden email]

Re: Feature request: non-dropping regmatches/strextract

R devel mailing list
That sounds great! Thank you for your consideration.

Best,
CG
