Regular expression help

Duncan Murdoch-2
I have a file containing "words" like


a

a/b

a/b/c

where there may be multiple words on a line (separated by spaces).  The
a, b, and c strings can contain non-space, non-slash characters. I'd
like to use gsub() to extract the c strings (which should be empty if
there are none).

A real example is

"f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"

which I'd like to transform to

" 587 587 587 587"

Another real example is

"f 1067 28680 24462"

which should transform to "   ".

I've tried a few different regexprs, but am unable to find a way to say
"transform words by deleting everything up to and including the 2nd
slash" when there might be zero, one or two slashes.  Any suggestions?

Duncan Murdoch

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Regular expression help

Ulrik Stervbo-2
Hi Duncan,

Why not split on / and take the correct elements? It is not as elegant as
a regex, but it could do the trick.

Best,
Ulrik


Re: Regular expression help

Eric Berger
Hi Duncan,
You can try this:

library(readr)
f <- function(s) {
  t <- unlist(readr::tokenize(paste0(gsub(" ",",",s),"\n",collapse="")))
  i <- grep("[a-zA-Z0-9]*/[a-zA-Z0-9]*/",t)
  u <- sub("[a-zA-Z0-9]*/[a-zA-Z0-9]*/","",t[i])
  paste0(u,collapse=" ")
}

f("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587")
# "587 587 587 587"

f("f 1067 28680 24462")
# ""

HTH,
Eric



Re: Regular expression help

Peter Dalgaard-2
In reply to this post by Duncan Murdoch-2

> On 9 Oct 2017, at 17:02 , Duncan Murdoch <[hidden email]> wrote:
>
> I have a file containing "words" like
>
>
> a
>
> a/b
>
> a/b/c
>
> where there may be multiple words on a line (separated by spaces).  The a, b, and c strings can contain non-space, non-slash characters. I'd like to use gsub() to extract the c strings (which should be empty if there are none).
>
> A real example is
>
> "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
>
> which I'd like to transform to
>
> " 587 587 587 587"
>
> Another real example is
>
> "f 1067 28680 24462"
>
> which should transform to "   ".
>
> I've tried a few different regexprs, but am unable to find a way to say "transform words by deleting everything up to and including the 2nd slash" when there might be zero, one or two slashes.  Any suggestions?
>

I think you might need something like this:

s <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
l <- strsplit(s, " ")[[1]]
pat <- "[[:alnum:]]*/[[:alnum:]]*/([[:alnum:]]*)"
paste(ifelse(grepl(pat,l),gsub(pat, "\\1", l), ""), collapse=" ")
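
For use on many lines at once, the same idea can be wrapped in a function. This is only a sketch: the name third_fields is hypothetical, the pattern is the one given above, and sub() is used in place of gsub() on the assumption that each word needs at most one rewrite.

```r
# Sketch only: third_fields() is a hypothetical wrapper around the pattern above.
third_fields <- function(lines) {
  pat <- "[[:alnum:]]*/[[:alnum:]]*/([[:alnum:]]*)"
  vapply(strsplit(lines, " ", fixed = TRUE), function(l) {
    # Words containing two slashes keep their third field; all others become "".
    paste(ifelse(grepl(pat, l), sub(pat, "\\1", l), ""), collapse = " ")
  }, character(1))
}

third_fields(c("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587",
               "f 1067 28680 24462"))
```

Because [[:alnum:]] is narrower than "non-space, non-slash", words containing other characters may need a class such as [^/ ] instead; that substitution is an assumption and is not tested here.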

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: [hidden email]  Priv: [hidden email]


Re: Regular expression help

R help mailing list-2
In reply to this post by Duncan Murdoch-2
> x <- "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587"
> gsub("(^| *)([^/ ]*/?){0,2}", "\\1", x)
[1] " 587 587 587 587"
> y <- "aa aa/ aa/bb aa/bb/ aa/bb/cc aa/bb/cc/ aa/bb/cc/dd aa/bb/cc/dd/"
> gsub("(^| *)([^/ ]*/?){0,2}", "\\1", y)
[1] "    cc cc/ cc/dd cc/dd/"


Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: Regular expression help

R help mailing list-2
Replacing the first "*" with "+", giving "(^| +)([^/ ]*/?){0,2}", would be a
bit better.

Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: Regular expression help

Duncan Murdoch-2
In reply to this post by Ulrik Stervbo-2
On 09/10/2017 11:23 AM, Ulrik Stervbo wrote:
> Hi Duncan,
>
> why not split on / and take the correct elements? It is not as elegant
> as regex but could do the trick.

Thanks for the suggestion.  There are likely many thousands of lines of
data like the two real examples (which had about 5000 and 60000 lines
respectively), so I was thinking that would be too slow, as it would
involve nested strsplit() calls.  But in fact, it's not so bad, so I
might go with it.  Here's a stab at it:

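# The lines to be split, e.g. the lines starting with "f" in
# http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726
# (the two examples from the original post are used here)
lines <- c("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587",
           "f 1067 28680 24462")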

l2 <- strsplit(lines, " ")
l3 <- lapply(l2, function(x) {
         y <- strsplit(x, "/")
         sapply(y, function(z) if (length(z) == 3) z[3] else "")
       })
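
To get back the one-string-per-line form asked for in the original post, the per-word results can be pasted together again. The following is a sketch under assumptions: the function name extract_third is hypothetical, vapply() replaces sapply() so the result type is fixed, and the two example strings stand in for the real file.

```r
# Sketch only: extract_third() is a hypothetical name, not from the thread.
extract_third <- function(lines) {
  words <- strsplit(lines, " ", fixed = TRUE)
  vapply(words, function(w) {
    parts <- strsplit(w, "/", fixed = TRUE)
    # Keep the third field only when the word has exactly three fields.
    third <- vapply(parts,
                    function(p) if (length(p) == 3) p[3] else "",
                    character(1))
    paste(third, collapse = " ")
  }, character(1))
}

lines <- c("f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587",
           "f 1067 28680 24462")
extract_third(lines)
```

strsplit() keeps empty interior fields, so a word such as "a//c" still splits into three parts and yields "c".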

Duncan


Re: Regular expression help

Duncan Murdoch-2
In reply to this post by R help mailing list-2
On 09/10/2017 12:06 PM, William Dunlap wrote:
> "(^| +)([^/ ]*/?){0,2}", with the first "*" replaced by "+" would be a
> bit better.

Thanks!  I think I actually need the *, because theoretically the b part
of the word could be empty, i.e. "a//c" would be legal and should become
"c".

Duncan Murdoch


Re: Regular expression help

Georges Monette-2
In reply to this post by Duncan Murdoch-2
How about this (I'm showing it as a pipe because it's easier to read
that way):

library(magrittr)
"f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587" %>%
   strsplit(' ') %>%
   unlist %>%
   sub('^[^/]*/*','',.) %>%
   sub('^[^/]*/*','',.) %>%
   paste(collapse = ' ')
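
For readers without magrittr, the same two-pass idea can be written in base R. This is a sketch under assumptions that are not in the post above: the function name strip_two_fields is hypothetical, and the pattern uses '/?' rather than '/*' so that each pass removes at most one slash, which appears to be needed if a word with an empty middle part such as "a//c" should yield "c".

```r
# Sketch only: strip_two_fields() is a hypothetical name.
strip_two_fields <- function(words) {
  # One pass deletes the leading field and at most one following slash.
  once <- function(w) sub("^[^/]*/?", "", w)
  once(once(words))
}

words <- strsplit("f 147/1315/587 a//c a/b", " ", fixed = TRUE)[[1]]
paste(strip_two_fields(words), collapse = " ")
```

Words with fewer than two slashes are emptied by the two passes, so no separate check for the number of fields is needed.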

Georges Monette

--
Georges Monette, PhD P.Stat.(SSC) | Associate Professor. Faculty of Science, Department of Mathematics & Statistics | North 626 Ross Building | York University | 4700 Keele Street, Toronto, ON M3J 1P3 | Telephone: 416-736-5250 | Fax: 416-736-5757 | E-Mail: [hidden email]



Re: Regular expression help

David Winsemius

> On Oct 9, 2017, at 6:08 PM, Georges Monette <[hidden email]> wrote:
>
> How about this (I'm showing it as a pipe because it's easier to read that way):
>
> library(magrittr)
> "f 147/1315/587 2820/1320/587 3624/1321/587 1852/1322/587" %>%
>   strsplit(' ') %>%
>   unlist %>%
>   sub('^[^/]*/*','',.) %>%
>   sub('^[^/]*/*','',.) %>%
>   paste(collapse = ' ')

I'm old-school R, so I don't find that particularly readable. I read the later specification as saying that each line begins with an "f", so the fourth item after a strsplit() becomes the target.

This seemed more readable to me:

Lines <- readLines(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))
lines <- Lines[ grepl("^f", Lines) ]

str(lines)
# chr [1:62908] "f 14327 6959 18747" "f 8258 15598 18980" "f 27662 21871 21939" ...

l2 <- strsplit(lines, " ")  # in that file the separators were spaces
l3 <- sapply(l2[1:3], function(x) if (length(x) == 4) x[4] else "")
l3
#[1] "18747" "18980" "21939"

# Remove the `[1:3]` to get the entire result.


Best;
David.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law
