sub/grep question: extract year


R help mailing list-2
Hi everybody,

I have some questions about how sub works. I hope that someone has the
answers:

1/ Why does the second example not return an empty string? There is no
match.

subtext <- "-1980-"
sub(".*(1980).*", "\\1", subtext) # return 1980
sub(".*(1981).*", "\\1", subtext) # return -1980-

2/ According to the sub documentation, it replaces the first occurrence of
a pattern. Why does it not return 1980?

subtext <- " 1980 1981 "
sub(".*(198[01]).*", "\\1", subtext) # return 1981

3/ I want to extract a year from text; I use:

subtext <- "bla 1980 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1980
subtext <- "bla 2010 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 2010

but

subtext <- "bla 1010 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1010

I would like to exclude 1010 and similar cases.

The solution would be:

18[0-9][0-9] or 19[0-9][0-9] or 200[0-9] or 201[0-9]

Is there a way to write such a pattern for grep?

Thanks a lot

Marc

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: sub/grep question: extract year

Marc Girondot
I'll answer the third point myself.
This pattern is better for getting a year:

pattern.year <- ".*\\b(18|19|20)([0-9][0-9])\\b.*"

subtext <- "bla 1880 bla"
sub(pattern.year, "\\1\\2", subtext) # return 1880
subtext <- "bla 1980 bla"
sub(pattern.year, "\\1\\2", subtext) # return 1980
subtext <- "bla 2010 bla"
sub(pattern.year, "\\1\\2", subtext) # return 2010
subtext <- "bla 1010 bla"
sub(pattern.year, "\\1\\2", subtext) # return bla 1010 bla
subtext <- "bla 3010 bla"
sub(pattern.year, "\\1\\2", subtext) # return bla 3010 bla
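
Since sub() is vectorized over its input, the five checks above can be run in one call. The following is only a sketch: the pattern is the one from this message, while perl = TRUE and the variable names extracted and matched are assumptions added for illustration. Note that this pattern accepts any year from 1800 to 2099, which is wider than the 1800 to 2019 range asked for in the original question.

```r
pattern.year <- ".*\\b(18|19|20)([0-9][0-9])\\b.*"
subtext <- c("bla 1880 bla", "bla 1980 bla", "bla 2010 bla",
             "bla 1010 bla", "bla 3010 bla")

# perl = TRUE is an assumption made here so that \\b follows PCRE rules
extracted <- sub(pattern.year, "\\1\\2", subtext, perl = TRUE)
matched <- grepl(pattern.year, subtext, perl = TRUE)

# Elements without a match come back unchanged, so flag them explicitly
print(data.frame(input = subtext, matched = matched, extracted = extracted))
```

The matched column makes it easy to tell a real year from an input string that was merely passed through.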

Marc

In reply to the post of 9 Aug 2018, 09:57, by Marc Girondot via R-help.

--
__________________________________________________________
Marc Girondot, Pr

Laboratoire Ecologie, Systématique et Evolution
Equipe de Conservation des Populations et des Communautés
CNRS, AgroParisTech et Université Paris-Sud 11 , UMR 8079
Bâtiment 362
91405 Orsay Cedex, France

Tel:  33 1 (0)1.69.15.72.30   Fax: 33 1 (0)1.69.15.73.53
e-mail: [hidden email]
Web: http://www.ese.u-psud.fr/epc/conservation/Marc.html
Skype: girondot



Re: sub/grep question: extract year

R help mailing list-2
In reply to this post by R help mailing list-2
Hi Marc.
For question 1.
I know that in Perl, captured groups stay saved until they are overwritten.
\\1 is the capture variable in your R examples.

So the 2nd regular expression does not match, but \\1 still holds the 1980
captured by the previous expression, hence the result.

Maybe if you restart R and try your 2nd expression first, \\1 will be empty
or there will be no match result.

Just speculation :)

John



Re: sub/grep question: extract year

R help mailing list-2
So there is probably a command that resets the capture variables as I call
them. No doubt someone will write what it is.

In reply to the post of 9 Aug 2018, 10:36, by john matthew.


Re: sub/grep question: extract year

Enrico Schumann-2
In reply to this post by R help mailing list-2

Quoting Marc Girondot via R-help <[hidden email]>:

> Hi everybody,
>
> I have some questions about how sub works. I hope that someone has the
> answers:
>
> 1/ Why does the second example not return an empty string? There is no
> match.
>
> subtext <- "-1980-"
> sub(".*(1980).*", "\\1", subtext) # return 1980
> sub(".*(1981).*", "\\1", subtext) # return -1980-

This is as documented in ?sub:
    "Elements of character vectors x which are not
     substituted will be returned unchanged"
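
If an empty string is wanted when nothing matches, the match can be tested first. This is a minimal sketch; the helper name extract_or_empty is hypothetical and not part of base R. Each call to sub() is independent, so no captured group carries over from an earlier call.

```r
# Hypothetical helper: return the first capture group, or "" when the
# pattern does not match (sub() alone would return the input unchanged)
extract_or_empty <- function(pattern, x) {
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), "")
}

subtext <- "-1980-"
extract_or_empty(".*(1980).*", subtext)
## [1] "1980"
extract_or_empty(".*(1981).*", subtext)
## [1] ""
```

Replacing "" with NA_character_ in the helper would mark missing years as NA instead, which may be easier to filter on later.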

> 2/ According to the sub documentation, it replaces the first occurrence of
> a pattern. Why does it not return 1980?
>
> subtext <- " 1980 1981 "
> sub(".*(198[01]).*", "\\1", subtext) # return 1981

Because the pattern matches the whole string,
not just the year:

     regexpr(".*(198[01]).*", subtext)
     ## [1] 1
     ## attr(,"match.length")
     ## [1] 11
     ## attr(,"useBytes")
     ## [1] TRUE

From this match, the RE engine gives you what the capture group matched,
which is "1981": the leading ".*" is greedy and consumes as much of the
string as it can. If you want to _extract_ the first year, use a
non-greedy RE instead:

     sub(".*?(198[01]).*", "\\1", subtext)
     ## [1] "1980"

I say _extract_ because you may _replace_ the pattern, as expected:

     sub("198[01]", "YYYY", subtext)
     ## [1] " YYYY 1981 "

That is because the pattern does not match the whole string.
Perhaps this example makes it clearer:

     test <- "1 2 3 4 5"
     sub("([0-9])", "\\1\\1", test)
     ## [1] "11 2 3 4 5"
     sub(".*([0-9]).*", "\\1\\1", test)
     ## [1] "55"
     sub(".*?([0-9]).*", "\\1\\1", test)
     ## [1] "11"
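
An alternative to sub() for extraction, sketched here with base R's regexpr(), gregexpr() and regmatches(): match only the year itself, so that greediness of a surrounding ".*" never comes into play. The variable names first.year and all.years are chosen for illustration.

```r
subtext <- " 1980 1981 "

# regexpr() locates the first match only; regmatches() pulls it out
first.year <- regmatches(subtext, regexpr("198[01]", subtext))
first.year
## [1] "1980"

# gregexpr() locates every match; the result is a list, one entry per input
all.years <- regmatches(subtext, gregexpr("198[01]", subtext))[[1]]
all.years
## [1] "1980" "1981"
```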



> 3/ I want to extract a year from text; I use:
>
> subtext <- "bla 1980 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1980
> subtext <- "bla 2010 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 2010
>
> but
>
> subtext <- "bla 1010 bla"
> sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1", subtext) # return 1010
>
> I would like to exclude 1010 and similar cases.
>
> The solution would be:
>
> 18[0-9][0-9] or 19[0-9][0-9] or 200[0-9] or 201[0-9]
>
> Is there a way to write such a pattern for grep?

You answered this yourself, I think.
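
For the exact ranges listed in the question (1800 to 2019), the four alternatives can be written as one alternation inside a group. This is a sketch only: the name year.re, the test strings, and the use of perl = TRUE are assumptions made for illustration, and strings without a valid year are mapped to NA here rather than returned unchanged.

```r
# 18xx, 19xx, 200x or 201x, delimited by word boundaries
year.re <- "\\b(18[0-9]{2}|19[0-9]{2}|20[01][0-9])\\b"

subtext <- c("bla 1880 bla", "bla 1980 bla", "bla 2010 bla",
             "bla 2025 bla", "bla 1010 bla")

has.year <- grepl(year.re, subtext, perl = TRUE)

# regmatches() returns only the matched elements, so fill them in by index
years <- rep(NA_character_, length(subtext))
years[has.year] <- regmatches(subtext, regexpr(year.re, subtext, perl = TRUE))
years
## [1] "1880" "1980" "2010" NA     NA
```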


> Thanks a lot
>
> Marc
>


--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
