Split

Split

Val-17
Hi all,

I am trying to create new columns based on the string content of
another column. First I want to identify the rows that contain a
particular string. If a row contains it, I want to split the string and
create two variables.

Here is my sample of data.
F1<-read.table(text="ID1  ID2  text
A1 B1   NONE
A1 B1   cf_12
A1 B1   NONE
A2 B2   X2_25
A2 B3   fd_15  ",header=TRUE,stringsAsFactors=F)
If the variable "text" contains "_", I want to create an indicator
variable, as shown below:

F1$Y1 <- ifelse(grepl("_", F1$text),1,0)


Then I want to split that string into two parts, before and after the
"_", and create two variables, as shown below:
x1 <- strsplit(as.character(F1$text), "_", fixed = TRUE)

My problem is how to combine this with the original data frame. The
desired output is shown below:

ID1 ID2 Y1 X1   X2
A1  B1  0  NONE .
A1  B1  1  cf   12
A1  B1  0  NONE .
A2  B2  1  X2   25
A2  B3  1  fd   15

Any help?
Thank you.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Split

Rui Barradas
Hello,

Something like this?


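F1$Y1 <- +grepl("_", F1$text)   # unary plus turns TRUE/FALSE into 1/0
F1 <- F1[c(1, 2, 4, 3)]         # reorder columns: ID1, ID2, Y1, text
# split "text" at "_"; rows without "_" get NA in X2 (fill = "right")
F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_",
                      fill = "right")
F1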
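
For readers unfamiliar with the unary plus: it coerces the logical vector returned by grepl() to integer, so it yields the same 0/1 indicator as the ifelse() call in the question. A minimal sketch, using a stand-in vector `txt` (a name assumed here for illustration) in place of F1$text:

```r
# stand-in for F1$text (hypothetical name, same values as the sample data)
txt <- c("NONE", "cf_12", "NONE", "X2_25", "fd_15")

# unary plus coerces logical to integer
y_plus <- +grepl("_", txt, fixed = TRUE)

# the form used in the question; ifelse(cond, 1, 0) returns doubles
y_ifelse <- ifelse(grepl("_", txt, fixed = TRUE), 1, 0)

is.integer(y_plus)                 # the +grepl() result is an integer vector
all(y_plus == y_ifelse)            # both forms flag the same rows
```

The only difference is the storage type (integer versus double), which rarely matters for an indicator column.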


Hope this helps,

Rui Barradas


Re: Split

Rui Barradas
Hello,

A base R solution with strsplit, like in your code.

F1$Y1 <- +grepl("_", F1$text)

tmp <- strsplit(as.character(F1$text), "_")
tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
tmp <- do.call(rbind, tmp)
colnames(tmp) <- c("X1", "X2")
F1 <- cbind(F1[-3], tmp)    # remove the original column
rm(tmp)

F1
#  ID1 ID2 Y1   X1 X2
#1  A1  B1  0 NONE  .
#2  A1  B1  1   cf 12
#3  A1  B1  0 NONE  .
#4  A2  B2  1   X2 25
#5  A2  B3  1   fd 15


Note that cbind dispatches on F1, an object of class "data.frame", so
the method cbind.data.frame is called and the result is also a data
frame, even though tmp is a matrix.
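
The point about dispatch can be checked on a toy example. This is only a sketch; the objects `df` and `m` are made up for illustration and are not part of the thread's data:

```r
# a small data frame and a character matrix with column names
df <- data.frame(ID1 = c("A1", "A2"), Y1 = c(0L, 1L))
m  <- matrix(c("NONE", "X2", ".", "25"), ncol = 2,
             dimnames = list(NULL, c("X1", "X2")))

# a data frame among the arguments: the data frame method is used,
# and the matrix columns become ordinary columns of the result
res_df <- cbind(df, m)
class(res_df)
names(res_df)

# only matrices: the default method is used and the result stays a matrix
res_m <- cbind(m, m)
is.matrix(res_m)
```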


Hope this helps,

Rui Barradas



Re: Split

LMHmedchem
In reply to this post by Val-17
Sometimes it just makes more sense to pre-process your data and get it into the format you need. It
just depends on whether you are more comfortable programming in R or in some other text-manipulation
tool such as bash, sed, awk, or grep.

If you know how to do this with other tools, you could write a script and probably call the script
from R. I could post a sample if you are interested.

LMH



Re: Split

Bert Gunter-2
In reply to this post by Rui Barradas
To be clear, I think Rui's solution is perfectly fine and probably better
than what I offer below. But just for fun, I wanted to do it without the
lapply().  Here is one way. I think my comments suffice to explain.

> ## which are the  non "_" indices?
> wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
> ## paste "_." to these
> F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
> ## Now strsplit() and unlist() them to get a vector
> z <- unlist(strsplit(F1$text, "_"))
> ## now cbind() to the data frame
> F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> F1
  ID1 ID2   text    1  2
1  A1  B1 NONE_. NONE  .
2  A1  B1  cf_12   cf 12
3  A1  B1 NONE_. NONE  .
4  A2  B2  X2_25   X2 25
5  A2  B3  fd_15   fd 15
>## You can change the names of the 2 columns yourself
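
To get from this result to the layout requested in the question (Y1, X1, X2, without the text column), something like the following might do. It is a sketch only: the working copy `txt` and the matrix name `parts` are introduced here for illustration, and it assumes every string contains at most one "_" with a non-empty piece after it.

```r
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

# compute the indicator before any "_." padding is added
F1$Y1 <- +grepl("_", F1$text, fixed = TRUE)

# pad a working copy so every string splits into exactly two pieces
txt <- F1$text
wh <- grep("_", txt, fixed = TRUE, invert = TRUE)
txt[wh] <- paste(txt[wh], ".", sep = "_")

# split, flatten, and reshape into a named two-column matrix
parts <- matrix(unlist(strsplit(txt, "_", fixed = TRUE)),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("X1", "X2")))

# keep the ID columns and the indicator, then append the two pieces
out <- cbind(F1[c("ID1", "ID2", "Y1")], parts)
out
```

Padding a copy rather than F1$text itself leaves the original column untouched, which matters if the indicator or the raw text is needed later.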

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Re: Split

Val-17
Thank you all for the help!

LMH, yes, I would like to see the alternative. I am using this for a
large data set, and if the alternative is more efficient than this,
I would be happy.


Re: Split

Bert Gunter-2
Oh, if efficiency is a consideration, then in the timings below my code
is about 15 times as fast as Rui's:
> F2 <- F1[rep(1:5,1e6),]  ## 5 million rows
##Rui's
> system.time({
+     F2$Y1 <- +grepl("_", F2$text)
+     tmp <- strsplit(as.character(F2$text), "_")
+     tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
+     tmp <- do.call(rbind, tmp)
+     colnames(tmp) <- c("X1", "X2")
+     F2 <- cbind(F2[-3], tmp)    # remove the original column
+ })
   user  system elapsed
 20.072   0.625  20.786

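## my version
## (the timing below was recorded with strsplit(F1$text, "_") as posted, which
## splits only the 5-row F1; the line is corrected to F2$text here, so a rerun
## will take longer than shown)
> system.time({
+     wh <- grep("_",F2$text, fixed = TRUE, invert = TRUE)
+     F2[wh,"text"] <- paste(F2[wh,"text"],".",sep = "_")
+     z <- unlist(strsplit(F2$text,"_"))
+     F2 <- cbind(F2, matrix(z, ncol = 2, byrow = TRUE))
+     F2
+ })
   user  system elapsed
  1.256   0.019   1.281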
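
A comparison like this is easier to trust when both approaches run on the same input and their results are checked against each other. The following is a sketch under that assumption; the function names `split_lapply` and `split_pad` are made up here, the input is a plain character vector rather than the data frame, and no timings are claimed because they depend on the machine and R version.

```r
# 50,000 strings built from the five sample values
txt <- rep(c("NONE", "cf_12", "NONE", "X2_25", "fd_15"), 1e4)

# lapply()-based approach: pad short pieces one element at a time
split_lapply <- function(x) {
  tmp <- strsplit(x, "_", fixed = TRUE)
  tmp <- lapply(tmp, function(p) if (length(p) == 1) c(p, ".") else p)
  do.call(rbind, tmp)
}

# vectorised approach: append "_." first, then split everything at once
split_pad <- function(x) {
  wh <- grep("_", x, fixed = TRUE, invert = TRUE)
  x[wh] <- paste(x[wh], ".", sep = "_")
  matrix(unlist(strsplit(x, "_", fixed = TRUE)), ncol = 2, byrow = TRUE)
}

t_lapply <- system.time(m1 <- split_lapply(txt))
t_pad    <- system.time(m2 <- split_pad(txt))

# both functions should produce the same two-column character matrix
all(dim(m1) == dim(m2)) && all(m1 == m2)

# elapsed seconds for each approach on this machine
c(lapply = t_lapply[["elapsed"]], pad = t_pad[["elapsed"]])
```

Increasing the repetition count in rep() scales the test up toward the 5 million rows used above.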

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Re: Split

Bill Dunlap-2
In reply to this post by Bert Gunter-2
Another way to make columns out of the stuff before and after the
underscore, with NAs if there is no underscore, is

utils::strcapture("([^_]*)_(.*)", F1$text,
proto=data.frame(Before_=character(), After_=character()))

-Bill


Re: Split

Bert Gunter-2
That was still slower and doesn't quite give what was requested:

> cbind(F1,utils::strcapture("([^_]*)_(.*)", F1$text,
+     proto=data.frame(Before_=character(), After_=character())))
  ID1 ID2  text Before_ After_
1  A1  B1  NONE    <NA>   <NA>
2  A1  B1 cf_12      cf     12
3  A1  B1  NONE    <NA>   <NA>
4  A2  B2 X2_25      X2     25
5  A2  B3 fd_15      fd     15

> system.time({
+ cbind(F2,utils::strcapture("([^_]*)_(.*)", F2$text,
+     proto=data.frame(Before_=character(), After_=character())))
+ })
   user  system elapsed
 32.712   0.736  33.587
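
If the strcapture() route is preferred anyway, its NA rows can be filled in afterwards to match the requested layout. A minimal sketch, assuming the column names X1 and X2 from the question and a helper vector `no_match` introduced here for illustration:

```r
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

# capture the pieces before and after the first "_"; rows without "_" give NA
caps <- utils::strcapture("([^_]*)_(.*)", F1$text,
                          proto = data.frame(X1 = character(),
                                             X2 = character()))

# rows where the pattern did not match
no_match <- is.na(caps$X1)

# as.character() guards against factor columns on R versions before 4.0
caps$X1 <- ifelse(no_match, F1$text, as.character(caps$X1))
caps$X2 <- ifelse(no_match, ".", as.character(caps$X2))

# indicator is 1 where the pattern matched, 0 otherwise
out <- cbind(F1[c("ID1", "ID2")], Y1 = as.integer(!no_match), caps)
out
```

This only changes the shape of the result; it does nothing about the speed difference reported above.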

Cheers,
Bert




On Tue, Sep 22, 2020 at 5:45 PM Bill Dunlap <[hidden email]>
wrote:

> Another way to make columns out of the stuff before and after the
> underscore, with NAs if there is no underscore, is
>
> utils::strcapture("([^_]*)_(.*)", F1$text,
> proto=data.frame(Before_=character(), After_=character()))
>
> -Bill
>
> On Tue, Sep 22, 2020 at 4:25 PM Bert Gunter <[hidden email]>
> wrote:
>
>> To be clear, I think Rui's solution is perfectly fine and probably better
>> than what I offer below. But just for fun, I wanted to do it without the
>> lapply().  Here is one way. I think my comments suffice to explain.
>>
>> > ## which are the  non "_" indices?
>> > wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
>> > ## paste "_." to these
>> > F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
>> > ## Now strsplit() and unlist() them to get a vector
>> > z <- unlist(strsplit(F1$text, "_"))
>> > ## now cbind() to the data frame
>> > F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
>> > F1
>>   ID1 ID2   text    1  2
>> 1  A1  B1 NONE_. NONE  .
>> 2  A1  B1  cf_12   cf 12
>> 3  A1  B1 NONE_. NONE  .
>> 4  A2  B2  X2_25   X2 25
>> 5  A2  B3  fd_15   fd 15
>> >## You can change the names of the 2 columns yourself
>>
>> Cheers,
>> Bert
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along and
>> sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <[hidden email]>
>> wrote:
>>
>> > Hello,
>> >
>> > A base R solution with strsplit, like in your code.
>> >
>> > F1$Y1 <- +grepl("_", F1$text)
>> >
>> > tmp <- strsplit(as.character(F1$text), "_")
>> > tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
>> > tmp <- do.call(rbind, tmp)
>> > colnames(tmp) <- c("X1", "X2")
>> > F1 <- cbind(F1[-3], tmp)    # remove the original column
>> > rm(tmp)
>> >
>> > F1
>> > #  ID1 ID2 Y1   X1 X2
>> > #1  A1  B1  0 NONE  .
>> > #2  A1  B1  1   cf 12
>> > #3  A1  B1  0 NONE  .
>> > #4  A2  B2  1   X2 25
>> > #5  A2  B3  1   fd 15
>> >
>> >
>> > Note that cbind dispatches on F1, an object of class "data.frame".
>> > Therefore it's the method cbind.data.frame that is called and the result
>> > is also a df, though tmp is a "matrix".
>> >
>> >
>> > Hope this helps,
>> >
>> > Rui Barradas
>> >
>> >

Re: Split

Rui Barradas
Hello,

If speed is important, and following the previous discussion and Bert's
tests, here are two other alternatives, both faster.

1. Bert2 is Bert's original but with scan(., sep = "_") substituted for
unlist/strsplit.
2. A solution with package data.table. These are usually fast, often the
fastest, but they have the inconvenience of coercing the data to class
"data.table", so the rest of the code needs to be adapted to handle
data.tables. Namely, the second index in dt[i, j] is no longer a plain
column index.
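
To illustrate the point about dt[i, j], here is a small sketch assuming package data.table is installed. The object names dt, df and cols are invented for the example; only the indexing behaviour is the point.

```r
library(data.table)

dt <- data.table(ID1 = c("A1", "A2"), ID2 = c("B1", "B2"),
                 text = c("NONE", "X2_25"))
df <- as.data.frame(dt)

cols <- c("ID1", "text")   # column names held in a variable

# data.frame: the second index is looked up as column names or positions
from_df <- df[, cols]

# data.table: j is evaluated as an expression inside the table, so a
# variable holding column names needs the `..` prefix or with = FALSE
from_dt1 <- dt[, ..cols]
from_dt2 <- dt[, cols, with = FALSE]

# data.table: unquoted column names can be used directly in j
first_parts <- dt[, tstrsplit(text, "_", fixed = TRUE)[[1]]]
first_parts
```

Code written for data.frames that selects columns through a variable is the kind that typically needs adapting after the coercion.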

Unlike Bert, I also time my first code, the one with package tidyr, and
its performance clearly beats the second one.
I define a test function that runs several input sizes. It doesn't take
much time to complete, only several minutes. The time differences are
not as impressive as Bert's, probably because I am on a different OS. I'm
running R 4.0.2 on Ubuntu 20.04; sessionInfo is at the end.

Also, I find X$Y1 <- as.integer(grepl("_", X$text)) more readable than
coercion to numeric with +grepl(.).
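
As a quick check of the indicator column, the sketch below compares the variants seen in this thread on the sample strings. The variable names are made up for the example.

```r
text <- c("NONE", "cf_12", "NONE", "X2_25", "fd_15")
has_us <- grepl("_", text, fixed = TRUE)

y_plus    <- +has_us               # unary plus, as in the earlier replies
y_integer <- as.integer(has_us)    # explicit coercion
y_ifelse  <- ifelse(has_us, 1, 0)  # as in the original post

identical(y_plus, y_integer)  # TRUE: both are integer vectors
typeof(y_ifelse)              # "double", same values but a different type
```

So the choice between the first two is about readability only; the ifelse() form gives the same values stored as doubles.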



library(data.table)
library(microbenchmark)
library(ggplot2)

Rui1 <- function(X){
   #X$Y1 <- as.integer(grepl("_", X$text))
   tidyr::separate(X, text, into = c("X1", "X2"), sep = "_", fill = "right")
}
Bert <- function(X){
   ## which are the  non "_" indices?
   wh <- grep("_",X$text, fixed = TRUE, invert = TRUE)
   ## paste "_." to these
   X[wh,"text"] <- paste(X[wh,"text"],".",sep = "_")
   ## Now strsplit() and unlist() them to get a vector
   z <- unlist(strsplit(X$text, "_"))
   ## now cbind() to the data frame
   cbind(X, matrix(z, ncol = 2, byrow = TRUE))
}
Bert2 <- function(X){
   wh <- grep("_",X$text, fixed = TRUE, invert = TRUE)
   X[wh,"text"] <- paste(X[wh,"text"],".",sep = "_")
   z <- scan(what = character(), text = X$text, sep = "_")
   cbind(X, matrix(z, ncol = 2, byrow = TRUE))
}
DT <- function(X){
   Y <- as.data.table(X)
   Y[, c("X1", "X2") := tstrsplit(text, "_", fixed = TRUE)]
}

testSeparate <- function(X, size = 0:6, times = 10){
   row_nums <- seq_len(nrow(X))
   res <- lapply(size, function(s){
     Y <- X[rep(row_nums, 10^s), ]
     mb <- microbenchmark(
       Rui = Rui1(Y),
       Bert = Bert(Y),
       Bert2 = Bert2(Y),
       DT = DT(Y),
       times = times
     )
     mb$size <- s
     mb
   })
   # return median times
   res <- do.call(rbind, res)
   aggregate(time ~ size + expr, res, median)
}

F1 <- read.table(text="ID1  ID2  text
A1 B1   NONE
A1 B1   cf_12
A1 B1   NONE
A2 B2   X2_25
A2 B3   fd_15  ",header=TRUE,stringsAsFactors=F)

agg <- testSeparate(F1, times = 5)

ggplot(agg, aes(size, time, color = expr)) +
   geom_line() + geom_point() +
   scale_y_continuous(trans = "log10") +
   xlab(expression(log[10] ~ "(size)")) +
   ylab(expression(log[10] ~ "(time)"))


sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
  [1] LC_CTYPE=pt_PT.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=pt_PT.UTF-8        LC_COLLATE=pt_PT.UTF-8
  [5] LC_MONETARY=pt_PT.UTF-8    LC_MESSAGES=pt_PT.UTF-8
  [7] LC_PAPER=pt_PT.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_3.3.2        microbenchmark_1.4-7 data.table_1.12.8

loaded via a namespace (and not attached):
  [1] Rcpp_1.0.5       magrittr_1.5     tidyselect_1.1.0 munsell_0.5.0
  [5] colorspace_1.4-1 R6_2.4.1         rlang_0.4.7      dplyr_1.0.2
  [9] tools_4.0.2      grid_4.0.2       gtable_0.3.0     withr_2.2.0
[13] ellipsis_0.3.1   digest_0.6.25    tibble_3.0.3     lifecycle_0.2.0
[17] crayon_1.3.4     purrr_0.3.4      farver_2.0.3     tidyr_1.0.2
[21] vctrs_0.3.4      glue_1.4.2       labeling_0.3     stringi_1.4.6
[25] compiler_4.0.2   pillar_1.4.6     generics_0.0.2   scales_1.1.0
[29] pkgconfig_2.0.3


Hope this helps,

Rui Barradas


Re: Split

LMHmedchem
In reply to this post by Val-17
What is the delimiter in the input data? Is it tab, space, etc?

Is this going to be the same for the output data that you will use for R input?

LMH


Val wrote:

> Thank you all for the help!
>
> LMH, Yes I would like to see the alternative.  I am using this for a
> large data set and if the  alternative is more efficient than this
> then I would be happy.
>

Re: Split

LMHmedchem
In reply to this post by Val-17
Below is a script in bash that uses the awk tokenizer to do the work.

This assumes that your input and output delimiter is space. The number of consecutive delimiters in
the input is not important. This also assumes that the input file does not have a header row. That
is easy to modify if you want. I always keep header rows in my data files as I think that removing
them is asking for trouble down the road.

I added a NULL for cases where there is no value for the last field. You could use "." if you want.

You should be able to find how to run this from inside R if you want. You will, of course, need a
bash environment to run this, so if you are not in linux you will need cygwin or something similar.

This should be very fast, but let me know if it needs to be faster. If the X1_X2 variant occurs less
frequently than not then we should switch the order in which the logic evaluates the options.

LMH


#! /bin/bash

# input filename
input_file=$1

# output filename
output_file=$2

# make sure the input file exists
if [ ! -f "$input_file" ]; then
   echo "$input_file cannot be found" >&2
   exit 1
fi

# create the output file
touch "$output_file"

# make sure the output was created
if [ ! -f "$output_file" ]; then
   echo "$output_file was not created" >&2
   exit 1
fi

# write the header row
echo "ID1 ID2 Y1 X1 X2" >> "$output_file"

# character to find in the third token
look_for='_'

# process with awk
# if the 3rd token contains '_'
#   split the third token on '_' into F[1] and F[2]
#   print the first two tokens, the indicator value of 1, and the split fields F[1] and F[2]
# otherwise,
#   print the first two tokens, the indicator value of 0, the 3rd token, and NULL

awk -v find_char="$look_for" '{ if ($3 ~ find_char) { split($3, F, find_char)
                                                      print $1, $2, "1", F[1], F[2]
                                                    }
                                else { print $1, $2, "0", $3, "NULL" }
                              }' "$input_file" >> "$output_file"
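
One possible way to call the script above from R is sketched here. It assumes the script has been saved as split_text.sh in the working directory (a hypothetical file name) and that bash is on the PATH; if either is missing the call is simply skipped.

```r
# Sample data from the original post
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

in_file  <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")

# The script expects space-delimited input without a header row
write.table(F1, in_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

script <- "split_text.sh"   # hypothetical path to the saved script
if (file.exists(script) && nzchar(Sys.which("bash"))) {
  status <- system2("bash", c(script, shQuote(in_file), shQuote(out_file)))
  if (status == 0) {
    # The script writes its own header row: ID1 ID2 Y1 X1 X2
    res <- read.table(out_file, header = TRUE, stringsAsFactors = FALSE)
    print(res)
  }
}
```

For a large data set, the time spent writing and re-reading the files would have to be counted when comparing this route with the pure R solutions.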








Re: Split

Val-17
Thank you again for your help and for giving me the opportunity to choose
the most efficient method.  For a small data set there is no discernible
difference between the different approaches.  I will carry out a
comparison using the large data set.


On Wed, Sep 23, 2020 at 11:52 AM LMH <[hidden email]> wrote:

>
> Below is a script in bash the uses the awk tokenizer to do the work.
>
> This assumes that your input and output delimiter is space. The number of consecutive delimiters in
> the input is not important. This also assumes that the input file does not have a header row. That
> is easy to modify if you want. I always keep header rows in my data files as I think that removing
> them is asking for trouble down the road.
>
> I added a NULL for cases where there is no value for the last field. You could use "." if you want.
>
> You should be able to find how to run this from inside R if you want. You will, of course, need a
> bash environment to run this, so if you are not in linux you will need cygwin or something similar.
>
> This should be very fast, but let me know if needs to be faster. If the X1_X2 variant occurs less
> frequently than not then we should switch the order in which the logic evaluates the options.
>
> LMH
>
>
> #!/bin/bash
>
> # input filename
> input_file=$1
>
> # output filename
> output_file=$2
>
> # make sure the input file exists (report on stderr, exit non-zero on failure)
> if [ ! -f "$input_file" ]; then
>    echo "$input_file cannot be found" >&2
>    exit 1
> fi
>
> # create the output file
> touch "$output_file"
>
> # make sure the output was created
> if [ ! -f "$output_file" ]; then
>    echo "$output_file was not created" >&2
>    exit 1
> fi
>
> # write the header row
> echo "ID1 ID2 Y1 X1 X2" >> "$output_file"
>
> # character to find in the third token
> look_for='_'
>
> # process with awk
> # if the 3rd token contains find_char (literal match via index())
> #   split the third token on find_char into F[1] and F[2]
> #   print the first two tokens, the indicator value of 1, and the split fields F[1] and F[2]
> # otherwise,
> #   print the first two tokens, the indicator value of 0, the 3rd token, and NULL
>
> awk -v find_char="$look_for" '{
>     if (index($3, find_char)) {
>         split($3, F, find_char)
>         print $1, $2, "1", F[1], F[2]
>     }
>     else { print $1, $2, "0", $3, "NULL" }
> }' "$input_file" >> "$output_file"
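
For reference, here is a small sketch that applies the same awk logic to the five sample rows from this thread. It differs from the script above in a few ways that are my own assumptions: the input keeps its header row (replaced on output), "." is used instead of NULL to match the desired output, and the data goes through a temporary file created with mktemp. The variable names `sep`, `missing`, `tmp_in` and `table` are hypothetical.

```shell
#!/bin/bash
# Sketch: run the awk logic on the sample data posted in this thread.
tmp_in=$(mktemp)
cat > "$tmp_in" <<'EOF'
ID1  ID2  text
A1 B1   NONE
A1 B1   cf_12
A1 B1   NONE
A2 B2   X2_25
A2 B3   fd_15
EOF

table=$(awk -v sep='_' -v missing='.' '
  NR == 1 { print "ID1 ID2 Y1 X1 X2"; next }   # replace the header row
  index($3, sep) {                             # literal search for sep
      split($3, F, sep)
      print $1, $2, 1, F[1], F[2]
      next
  }
  { print $1, $2, 0, $3, missing }             # no separator in field 3
' "$tmp_in")
rm -f "$tmp_in"
printf '%s\n' "$table"
```

The printed table should have the same rows as the desired output in the original post, apart from the column spacing.
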
>
> Val wrote:
> > Thank you all for the help!
> >
> > LMH, yes, I would like to see the alternative. I am using this for a
> > large data set, and if the alternative is more efficient than this
> > then I would be happy.
> >
> > On Tue, Sep 22, 2020 at 6:25 PM Bert Gunter <[hidden email]> wrote:
> >>
> >> To be clear, I think Rui's solution is perfectly fine and probably better than what I offer below. But just for fun, I wanted to do it without the lapply().  Here is one way. I think my comments suffice to explain.
> >>
> >> > ## which are the non "_" indices?
> >> > wh <- grep("_", F1$text, fixed = TRUE, invert = TRUE)
> >> > ## paste "_." to these
> >> > F1[wh, "text"] <- paste(F1[wh, "text"], ".", sep = "_")
> >> > ## Now strsplit() and unlist() them to get a vector
> >> > z <- unlist(strsplit(F1$text, "_"))
> >> > ## now cbind() to the data frame
> >> > F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> >> > F1
> >>   ID1 ID2   text    1  2
> >> 1  A1  B1 NONE_. NONE  .
> >> 2  A1  B1  cf_12   cf 12
> >> 3  A1  B1 NONE_. NONE  .
> >> 4  A2  B2  X2_25   X2 25
> >> 5  A2  B3  fd_15   fd 15
> >> > ## You can change the names of the 2 columns yourself
> >>
> >> Cheers,
> >> Bert
> >>
> >> Bert Gunter
> >>
> >> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> >> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>
> >>
> >> On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <[hidden email]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> A base R solution with strsplit, like in your code.
> >>>
> >>> F1$Y1 <- +grepl("_", F1$text)
> >>>
> >>> tmp <- strsplit(as.character(F1$text), "_")
> >>> tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
> >>> tmp <- do.call(rbind, tmp)
> >>> colnames(tmp) <- c("X1", "X2")
> >>> F1 <- cbind(F1[-3], tmp)    # remove the original column
> >>> rm(tmp)
> >>>
> >>> F1
> >>> #  ID1 ID2 Y1   X1 X2
> >>> #1  A1  B1  0 NONE  .
> >>> #2  A1  B1  1   cf 12
> >>> #3  A1  B1  0 NONE  .
> >>> #4  A2  B2  1   X2 25
> >>> #5  A2  B3  1   fd 15
> >>>
> >>>
> >>> Note that cbind dispatches on F1, an object of class "data.frame".
> >>> Therefore it's the method cbind.data.frame that is called and the result
> >>> is also a df, though tmp is a "matrix".
> >>>
> >>>
> >>> Hope this helps,
> >>>
> >>> Rui Barradas
> >>>
> >>>
> >>> At 20:07 on 22/09/20, Rui Barradas wrote:
> >>>> Hello,
> >>>>
> >>>> Something like this?
> >>>>
> >>>>
> >>>> F1$Y1 <- +grepl("_", F1$text)
> >>>> F1 <- F1[c(1, 2, 4, 3)]
> >>>> F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_", fill =
> >>>> "right")
> >>>> F1
> >>>>
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Rui Barradas
> >>>>
> >>>> At 19:55 on 22/09/20, Val wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I am trying to create new columns based on the string content of
> >>>>> another column. First I want to identify rows that contain a
> >>>>> particular string. If a row contains it, I want to split the string
> >>>>> and create two variables.
> >>>>>
> >>>>> Here is a sample of my data.
> >>>>> F1<-read.table(text="ID1  ID2  text
> >>>>> A1 B1   NONE
> >>>>> A1 B1   cf_12
> >>>>> A1 B1   NONE
> >>>>> A2 B2   X2_25
> >>>>> A2 B3   fd_15  ",header=TRUE,stringsAsFactors=F)
> >>>>> If the variable "text" contains "_", I want to create an indicator
> >>>>> variable as shown below:
> >>>>>
> >>>>> F1$Y1 <- ifelse(grepl("_", F1$text),1,0)
> >>>>>
> >>>>>
> >>>>> Then I want to split that string in two, before and after the "_",
> >>>>> and create two variables as shown below:
> >>>>> x1= strsplit(as.character(F1$text),'_',2)
> >>>>>
> >>>>> My problem is how to combine this with the original data frame. The
> >>>>> desired output is shown below:
> >>>>>
> >>>>>
> >>>>> ID1  ID2  Y1  X1    X2
> >>>>> A1   B1   0   NONE  .
> >>>>> A1   B1   1   cf    12
> >>>>> A1   B1   0   NONE  .
> >>>>> A2   B2   1   X2    25
> >>>>> A2   B3   1   fd    15
> >>>>>
> >>>>> Any help?
> >>>>> Thank you.