Cumulative split of value in data frame column

Ravi Jeyaraman
Suppose I have a data frame like this:

df <- data.frame(ID=1:3, FOO=c('A_B','A_B_C','A_B_C_D_E'))

I want to do a 'cumulative split' of the values in column FOO based on the
delimiter '_'.  The end result should look like this:

ID  FOO        FOO_SPLIT1  FOO_SPLIT2  FOO_SPLIT3  FOO_SPLIT4  FOO_SPLIT5
1   A_B        A           A_B
2   A_B_C      A           A_B         A_B_C
3   A_B_C_D_E  A           A_B         A_B_C       A_B_C_D     A_B_C_D_E

Is there an efficient way to do this?


        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Cumulative split of value in data frame column

Bert Gunter-2
This is a *plain text* list. In future, please post in plain text so that
your post does not get mangled.

Anyway,...

I don't know about "efficient, optimized", but here's one simple way to do
it, using ?strsplit to split and then ?paste to recombine:

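# stringsAsFactors = FALSE keeps FOO as character in R < 4.0 (it is the default from 4.0 on)
df <- data.frame(ID=1:3, FOO=c('A_B','A_B_C','A_B_C_D_E'),
                 stringsAsFactors = FALSE)

# x: character vector of pieces; returns the cumulative pastes of those pieces
cumsplit <- function(x, split = "_"){
    w <- x[1]
    # append the previous cumulative string joined with the next piece
    for(i in seq_along(x)[-1])  w <- c(w, paste(w[i-1],x[i], sep = split))
    w
}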
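
The same accumulation can probably also be written without an explicit loop,
using ?Reduce with accumulate = TRUE. This is only a sketch; the name
cumsplit2 is made up for illustration and is not from the thread:

```r
# Hypothetical loop-free variant of cumsplit(); cumsplit2 is an illustrative name.
cumsplit2 <- function(x, split = "_") {
    # Reduce() with accumulate = TRUE keeps every intermediate result,
    # so each element is the previous result pasted with the next piece.
    Reduce(function(a, b) paste(a, b, sep = split), x, accumulate = TRUE)
}

cumsplit2(c("A", "B", "C"))
```

Like cumsplit(), it expects the already-split pieces of a single string, so it
would be called through lapply() in the same way.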

> lapply(strsplit(df$FOO, split = "_"), cumsplit)
[[1]]
[1] "A"   "A_B"

[[2]]
[1] "A"     "A_B"   "A_B_C"

[[3]]
[1] "A"         "A_B"       "A_B_C"     "A_B_C_D"   "A_B_C_D_E"
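
The result above is a list, whereas the original question asked for
FOO_SPLIT1 ... FOO_SPLIT5 columns. One possible way to get there is to pad
each vector with NA to a common length and bind the rows back onto df. This is
a sketch only; the names pieces, wide and result are illustrative assumptions:

```r
df <- data.frame(ID = 1:3, FOO = c("A_B", "A_B_C", "A_B_C_D_E"),
                 stringsAsFactors = FALSE)

# cumsplit() as defined in the thread
cumsplit <- function(x, split = "_") {
    w <- x[1]
    for (i in seq_along(x)[-1]) w <- c(w, paste(w[i - 1], x[i], sep = split))
    w
}

pieces <- lapply(strsplit(df$FOO, split = "_", fixed = TRUE), cumsplit)

# Indexing past the end of a vector yields NA, which pads the short rows.
n <- max(lengths(pieces))
wide <- do.call(rbind, lapply(pieces, `[`, seq_len(n)))
colnames(wide) <- paste0("FOO_SPLIT", seq_len(n))

# Attach the new columns to the original data frame.
result <- cbind(df, as.data.frame(wide, stringsAsFactors = FALSE))
result
```

Cells with no corresponding split come out as NA rather than blank.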

I wouldn't be surprised if clever use of regexes would be faster, but as I
said, this is simple.
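
As an illustration of the pattern-matching idea, one could locate the
delimiters with ?gregexpr and cut prefixes with ?substring, skipping the
split-and-paste round trip. This is a sketch under assumptions: cumsplit_pos is
a hypothetical name, the delimiter is matched literally, and whether it is
actually faster would have to be measured:

```r
# Hypothetical position-based variant; cumsplit_pos is an illustrative name.
cumsplit_pos <- function(s, split = "_") {
    # Start positions of every delimiter in each string
    # (fixed = TRUE: literal match; gregexpr returns -1 when there is none).
    hits <- gregexpr(split, s, fixed = TRUE)
    unname(Map(function(str, pos) {
        # Each prefix ends just before a delimiter; the full string comes last.
        ends <- c(pos[pos > 0] - 1L, nchar(str))
        substring(str, 1L, ends)
    }, s, hits))
}

cumsplit_pos(c("A_B", "A_B_C", "A_B_C_D_E"))
```

Unlike cumsplit(), this takes the unsplit strings directly, so no call to
strsplit() is needed.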

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Jun 5, 2020 at 9:33 AM Ravi Jeyaraman <[hidden email]> wrote:

> Assuming, I have a data frame like this ..
>
> df <- data.frame(ID=1:3, FOO=c('A_B','A_B_C','A_B_C_D_E'))
>
> I want to do a 'cumulative split' of the values in column FOO based on the
> delimiter '_'.  The end result should be like this ..
>
> ID  FOO        FOO_SPLIT1  FOO_SPLIT2  FOO_SPLIT3  FOO_SPLIT4  FOO_SPLIT5
> 1   A_B        A           A_B
> 2   A_B_C      A           A_B         A_B_C
> 3   A_B_C_D_E  A           A_B         A_B_C       A_B_C_D     A_B_C_D_E
>
> Any efficient, optimized way to do this?