regexp mystery

PIKAL Petr
Dear all

I have a text file with lines like this:

> dput(x[9])
"PYedehYev:  300 s                    Záva~í: 2.160 kg"
> dput(x[11])
"et odYezko: 3                     \fas odYezku:   15 s"

I am able to extract some numbers, but others give me a headache.

gsub("^.*[^:] (\\d+.\\d+).*$", "\\1", x[9])
works for 300

gsub("^.*[:] (\\d+.\\d+).*$", "\\1", x[9])
works for 2.160

gsub("^.*: (\\d+).*$", "\\1", x[11])
works for 3

but only after many attempts I found that
gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
works for 15

Can somebody explain to me why the second item in line 11 requires almost the same regular expression as the first item in line 9, and why just

gsub("^.*[:] (\\d+).*$", "\\1", x[11])

does not produce 15 but 3???

Cheers
Petr

Osobní údaje: Informace o zpracování a ochraně osobních údajů obchodních partnerů PRECHEZA a.s. jsou zveřejněny na: https://www.precheza.cz/zasady-ochrany-osobnich-udaju/ | Information about processing and protection of business partner’s personal data are available on website: https://www.precheza.cz/en/personal-data-protection-principles/
Důvěrnost: Tento e-mail a jakékoliv k němu připojené dokumenty jsou důvěrné a podléhají tomuto právně závaznému prohlášení o vyloučení odpovědnosti: https://www.precheza.cz/01-dovetek/ | This email and any documents attached to it may be confidential and are subject to the legally binding disclaimer: https://www.precheza.cz/en/01-disclaimer/

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: regexp mystery

Ivan Krylov
On Tue, 16 Oct 2018 08:36:27 +0000
PIKAL Petr <[hidden email]> wrote:

> > dput(x[11])  
> "et odYezko: 3                     \fas odYezku:   15 s"

> gsub("^.*: (\\d+).*$", "\\1", x[11])
> works for 3

This regular expression only matches one space between the colon and
the number, but you have more than one of them before "15".

> gsub("^.*[^:] (\\d+).*$", "\\1", x[11])
> works for 15

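The match succeeds because a space is not a colon, and because the
greedy ^.* takes as much as it can and only backs off as far as needed:

 ^.* matches "et odYezko: 3                     \fas odYezku: "
 [^:] matches the second space " " after the colon
 the literal space " " matches the third space " "
 finally, (\\d+) matches "15"
 and .*$ matches " s"

"3" cannot match here, because the character before its single space is
the colon itself.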
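
The difference between the patterns can be checked directly. This is
only a minimal sketch: x11 is an assumed copy of the line shown by
dput(x[11]), and the last two patterns are illustrative variants that do
not appear in the original question.

```r
# Assumed copy of the line from dput(x[11]); "\f" is a form feed character
x11 <- "et odYezko: 3                     \fas odYezku:   15 s"

# ": " followed directly by digits occurs only in front of "3"
r_colon <- sub("^.*: (\\d+).*$", "\\1", x11)         # "3"

# "[^:] " accepts a space before the last space, so "15" matches;
# "3" does not, because the character before its space is the colon
r_notcolon <- sub("^.*[^:] (\\d+).*$", "\\1", x11)   # "15"

# Illustrative variant: allow any amount of whitespace after the colon.
# The greedy ^.* then settles on the last colon in the line.
r_last <- sub("^.*:\\s*(\\d+).*$", "\\1", x11)       # "15"

# Illustrative variant: "[^:]*" cannot cross a colon, so this stops at
# the first colon in the line.
r_first <- sub("^[^:]*:\\s*(\\d+).*$", "\\1", x11)   # "3"

print(c(r_colon, r_notcolon, r_last, r_first))
```

Anchoring on the colon and using \\s* makes the result independent of
how many spaces happen to precede the number.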

If you need just the numbers, you might have more success by extracting
matches directly with gregexpr and regmatches:

(
        function(s) regmatches(
                s,
                gregexpr("\\d+(\\.\\d+)?", s)
        )
)("et odYezko: 3                     \fas odYezku:   15 s")

[[1]]
[1] "3"  "15"

(I'm creating an anonymous function and evaluating it immediately
because I need to pass the same string to both gregexpr and regmatches.)
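
The same idea can be written without the anonymous function by storing
the lines in a variable first. The following is a sketch under the
assumption that x holds the two lines quoted in the question; the
variable names are made up for illustration.

```r
# Assumed copies of the two lines from the question
x <- c(
  "PYedehYev:  300 s                    Záva~í: 2.160 kg",
  "et odYezko: 3                     \fas odYezku:   15 s"
)

# gregexpr finds every match in every element of x;
# regmatches then cuts the matched substrings out of x
m <- gregexpr("\\d+(\\.\\d+)?", x)
found <- regmatches(x, m)

# Convert the extracted strings to numbers, one numeric vector per line
nums <- lapply(found, as.numeric)
print(nums)
```

Both gregexpr and regmatches are vectorised over x, so the result is a
list with one element per input line.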

If you need to capture numbers appearing in a specific context, a
better regular expression suiting your needs might be

":\\s*(\\d+(?:\\.\\d+)?)"

(A colon, followed by optional whitespace, followed by the number to
capture: one or more digits, optionally followed by a non-capturing
group consisting of a dot and more digits)

but I couldn't find a way to extract captures from repeated matches
using vanilla R pattern matching (it's either regexec, which returns
captures for the first match only, or gregexpr, which returns all
matches but without the captures). If you can load the stringr package,
it's very easy, though:

library(stringr)

str_match_all(
        c(
                "PYedehYev:  300 s              Záva~í: 2.160 kg",
                "et odYezko: 3               \fas odYezku:   15 s"
        ),
        ":\\s*(\\d+(?:\\.\\d+)?)"
)
[[1]]
     [,1]      [,2]  
[1,] ":  300"  "300"  
[2,] ": 2.160" "2.160"

[[2]]
     [,1]     [,2]
[1,] ": 3"    "3"
[2,] ":   15" "15"

Column 2 of each list item contains the requested captures.
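
For completeness, one possible base R route is sketched below. It relies
on the fact that gregexpr called with perl = TRUE attaches
"capture.start" and "capture.length" attributes to each result. The
helper name capture_all is hypothetical, and x is an assumed copy of the
two lines used above. Recent R versions (4.1.0 and later) also provide
gregexec, which may be a simpler option where available.

```r
# Assumed copies of the two lines used in the stringr example
x <- c(
  "PYedehYev:  300 s              Záva~í: 2.160 kg",
  "et odYezko: 3               \fas odYezku:   15 s"
)

# Hypothetical helper: return the first capture group of every match,
# as one character vector per input string
capture_all <- function(pattern, s) {
  m <- gregexpr(pattern, s, perl = TRUE)
  res <- Map(function(str, mi) {
    if (mi[1] == -1) return(character(0))      # no match in this string
    start <- attr(mi, "capture.start")[, 1]    # start of group 1, per match
    len   <- attr(mi, "capture.length")[, 1]   # length of group 1, per match
    substring(str, start, start + len - 1)
  }, s, m)
  unname(res)
}

caps <- capture_all(":\\s*(\\d+(?:\\.\\d+)?)", x)
print(caps)
```

Each list element corresponds to column 2 of the matching stringr
result.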

--
Best regards,
Ivan

Re: regexp mystery

PIKAL Petr
Hi

Thanks a lot for your insightful answer. I will need to study it in detail; gregexpr and regexpr seem to be quite handy for what I need.

Cheers
Petr
