How to create a new column based on the values from multiple columns which are matching a particular string?

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

How to create a new column based on the values from multiple columns which are matching a particular string?

anikaM
I have data frame which looks like this:

df=data.frame(
  eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
  eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
  eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
  eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
  eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))

In reality I have much more columns and they don't always match
"eye_problemsdisorders_f6148" this string, and there is much more rows.

What I would like to do is create a new column, say named "case" where I
would have value "1" for every row where string "A" appears at least once
in any column, if not the value would be "0". So in the above example
column "case" would have these values: 1,1,1,1,0
Thanks
Ana

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

Eric Berger
You may have a typo/misstatement in your question.
You define a data frame with 5 columns, each of which has 10 elements, so
your data frame has dimensions 10 x 5.
Then you request a new COLUMN which will have only 5 elements, which is not
allowed. All columns of a data frame
must have the same length.

On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]>
wrote:

> I have data frame which looks like this:
>
> df=data.frame(
>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>
> In reality I have much more columns and they don't always match
> "eye_problemsdisorders_f6148" this string, and there is much more rows.
>
> What I would like to do is create a new column, say named "case" where I
> would have value "1" for every row where string "A" appears at least once
> in any column, if not the value would be "0". So in the above example
> column "case" would have these values: 1,1,1,1,0
> Thanks
> Ana
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

anikaM
sorry my bad, here is the edited version:

so the data frame is this:

df=data.frame(
  eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
  eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
  eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
  eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
  eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")

and I would need to put inside the column which would be named "case" and
values inside would be: 1,1,0,1,1,1,0,0,1,1

so "case" column is where value "A" can be found in any column.

On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]> wrote:

> You may have a typo/misstatement in your question.
> You define a data frame with 5 columns, each of which has 10 elements, so
> your data frame has dimensions 10 x 5.
> Then you request a new COLUMN which will have only 5 elements, which is
> not allowed. All columns of a data frame
> must have the same length.
>
> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]>
> wrote:
>
>> I have data frame which looks like this:
>>
>> df=data.frame(
>>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>>
>> In reality I have much more columns and they don't always match
>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
>>
>> What I would like to do is create a new column, say named "case" where I
>> would have value "1" for every row where string "A" appears at least once
>> in any column, if not the value would be "0". So in the above example
>> column "case" would have these values: 1,1,1,1,0
>> Thanks
>> Ana
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

Eric Berger
df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })


On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[hidden email]>
wrote:

> sorry my bad, here is the edited version:
>
> so the data frame is this:
>
> df=data.frame(
>   eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
>   eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
>   eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
>   eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
>   eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
>
> and I would need to put inside the column which would be named "case" and
> values inside would be: 1,1,0,1,1,1,0,0,1,1
>
> so "case" column is where value "A" can be found in any column.
>
> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]>
> wrote:
>
>> You may have a typo/misstatement in your question.
>> You define a data frame with 5 columns, each of which has 10 elements, so
>> your data frame has dimensions 10 x 5.
>> Then you request a new COLUMN which will have only 5 elements, which is
>> not allowed. All columns of a data frame
>> must have the same length.
>>
>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]>
>> wrote:
>>
>>> I have data frame which looks like this:
>>>
>>> df=data.frame(
>>>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>>>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>>>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>>>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>>>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>>>
>>> In reality I have much more columns and they don't always match
>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
>>>
>>> What I would like to do is create a new column, say named "case" where I
>>> would have value "1" for every row where string "A" appears at least once
>>> in any column, if not the value would be "0". So in the above example
>>> column "case" would have these values: 1,1,1,1,0
>>> Thanks
>>> Ana
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

anikaM
Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should
appear at least once per row?

On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[hidden email]> wrote:

> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
>
>
> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[hidden email]>
> wrote:
>
>> sorry my bad, here is the edited version:
>>
>> so the data frame is this:
>>
>> df=data.frame(
>>
>> eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
>>   eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
>>
>> eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
>>
>> eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
>>   eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
>>
>> and I would need to put inside the column which would be named "case" and
>> values inside would be: 1,1,0,1,1,1,0,0,1,1
>>
>> so "case" column is where value "A" can be found in any column.
>>
>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]>
>> wrote:
>>
>>> You may have a typo/misstatement in your question.
>>> You define a data frame with 5 columns, each of which has 10 elements,
>>> so your data frame has dimensions 10 x 5.
>>> Then you request a new COLUMN which will have only 5 elements, which is
>>> not allowed. All columns of a data frame
>>> must have the same length.
>>>
>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]>
>>> wrote:
>>>
>>>> I have data frame which looks like this:
>>>>
>>>> df=data.frame(
>>>>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>>>>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>>>>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>>>>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>>>>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>>>>
>>>> In reality I have much more columns and they don't always match
>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
>>>>
>>>> What I would like to do is create a new column, say named "case" where I
>>>> would have value "1" for every row where string "A" appears at least
>>>> once
>>>> in any column, if not the value would be "0". So in the above example
>>>> column "case" would have these values: 1,1,1,1,0
>>>> Thanks
>>>> Ana
>>>>
>>>>         [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

Eric Berger
Read the help for apply and %in%

?apply
?%in%


Sent from my iPhone

> On 29 Jul 2019, at 22:23, Ana Marija <[hidden email]> wrote:
>
> Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
>
>> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[hidden email]> wrote:
>> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
>>
>>
>>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[hidden email]> wrote:
>>> sorry my bad, here is the edited version:
>>>
>>> so the data frame is this:
>>>
>>> df=data.frame(
>>>   eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
>>>   eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
>>>   eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
>>>   eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
>>>   eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
>>>
>>> and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
>>>
>>> so "case" column is where value "A" can be found in any column.
>>>
>>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]> wrote:
>>>> You may have a typo/misstatement in your question.
>>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
>>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
>>>> must have the same length.
>>>>
>>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]> wrote:
>>>>> I have data frame which looks like this:
>>>>>
>>>>> df=data.frame(
>>>>>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>>>>>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>>>>>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>>>>>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>>>>>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>>>>>
>>>>> In reality I have much more columns and they don't always match
>>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
>>>>>
>>>>> What I would like to do is create a new column, say named "case" where I
>>>>> would have value "1" for every row where string "A" appears at least once
>>>>> in any column, if not the value would be "0". So in the above example
>>>>> column "case" would have these values: 1,1,1,1,0
>>>>> Thanks
>>>>> Ana
>>>>>
>>>>>         [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.

        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

Bert Gunter-2
While Eric's solution is correct( mod "corner" cases like all NA's in
a row), it can be made considerably more efficient.

One minor improvement can be made by using the idiom
any(x == "A")
instead of matching via %in% for the simple case of matching just a
single value.

However, a considerable improvement can be made by getting fancy,
taking advantage of do.call() and the pmax() function to mostly
vectorize the calculation. Here are the details and timing on a large
data frame.

(Note: I removed the names in the %in% approach for simplicity. It has
almost no effect on timings.
I also moved the as.integer() call out of the function so that it is
called only once at the end, which improves efficiency a bit)

1. Eric's original:
fun1 <-function(df,what)
{
  as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
}

2. Using any( x == "A") instead:
fun2 <- function(df,what)
{
   as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what,
na.rm=TRUE))))
}

3. Getting fancy to use pmax()
fun3 <- function(df,what)
{
   z <- lapply(df,function(x)as.integer((x==what)))
   do.call(pmax,c(z,na.rm=TRUE))
}

Here are the timings:

> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
> dim(bigdf)
[1] 100000    250

> system.time(res1 <- fun1(bigdf, "A"))
   user  system elapsed
  2.204   0.432   2.637
>
> system.time(res2 <- fun2(bigdf, "A"))
   user  system elapsed
  1.898   0.403   2.302
>
> system.time(res3 <- fun3(bigdf, "A"))
   user  system elapsed
  0.187   0.048   0.235

## 10 times faster!

>
> all.equal(res1,res2)
[1] TRUE
> all.equal(res1,res3)
[1] TRUE


NB: I freely admit that Eric's original solution may well be perfectly
adequate, and the speed improvement is pointless. In that case, maybe
this is at least somewhat instructive for someone.

Nevertheless, I would welcome further suggestions for improvement, as
I suspect my "fancy" approach is still a ways from what one can do (in
R code, without resorting to C++).

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <[hidden email]> wrote:

>
> Read the help for apply and %in%
>
> ?apply
> ?%in%
>
>
> Sent from my iPhone
>
> > On 29 Jul 2019, at 22:23, Ana Marija <[hidden email]> wrote:
> >
> > Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
> >
> >> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[hidden email]> wrote:
> >> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
> >>
> >>
> >>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[hidden email]> wrote:
> >>> sorry my bad, here is the edited version:
> >>>
> >>> so the data frame is this:
> >>>
> >>> df=data.frame(
> >>>   eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
> >>>   eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
> >>>   eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
> >>>   eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
> >>>   eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
> >>>
> >>> and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
> >>>
> >>> so "case" column is where value "A" can be found in any column.
> >>>
> >>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]> wrote:
> >>>> You may have a typo/misstatement in your question.
> >>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
> >>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
> >>>> must have the same length.
> >>>>
> >>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]> wrote:
> >>>>> I have data frame which looks like this:
> >>>>>
> >>>>> df=data.frame(
> >>>>>   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
> >>>>>   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
> >>>>>   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
> >>>>>   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
> >>>>>   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
> >>>>>
> >>>>> In reality I have much more columns and they don't always match
> >>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
> >>>>>
> >>>>> What I would like to do is create a new column, say named "case" where I
> >>>>> would have value "1" for every row where string "A" appears at least once
> >>>>> in any column, if not the value would be "0". So in the above example
> >>>>> column "case" would have these values: 1,1,1,1,0
> >>>>> Thanks
> >>>>> Ana
> >>>>>
> >>>>>         [[alternative HTML version deleted]]
> >>>>>
> >>>>> ______________________________________________
> >>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

Dénes Tóth-2
Hi Bert,

see inline.

On 7/30/19 1:12 AM, Bert Gunter wrote:

> While Eric's solution is correct( mod "corner" cases like all NA's in
> a row), it can be made considerably more efficient.
>
> One minor improvement can be made by using the idiom
> any(x == "A")
> instead of matching via %in% for the simple case of matching just a
> single value.
>
> However, a considerable improvement can be made by getting fancy,
> taking advantage of do.call() and the pmax() function to mostly
> vectorize the calculation. Here are the details and timing on a large
> data frame.
>
> (Note: I removed the names in the %in% approach for simplicity. It has
> almost no effect on timings.
> I also moved the as.integer() call out of the function so that it is
> called only once at the end, which improves efficiency a bit)
>
> 1. Eric's original:
> fun1 <-function(df,what)
> {
>    as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
> }
>
> 2. Using any( x == "A") instead:
> fun2 <- function(df,what)
> {
>     as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what,
> na.rm=TRUE))))
> }
>
> 3. Getting fancy to use pmax()
> fun3 <- function(df,what)
> {
>     z <- lapply(df,function(x)as.integer((x==what)))
>     do.call(pmax,c(z,na.rm=TRUE))
> }
>
> Here are the timings:
>
>> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
>> dim(bigdf)
> [1] 100000    250
>
>> system.time(res1 <- fun1(bigdf, "A"))
>     user  system elapsed
>    2.204   0.432   2.637
>>
>> system.time(res2 <- fun2(bigdf, "A"))
>     user  system elapsed
>    1.898   0.403   2.302
>>
>> system.time(res3 <- fun3(bigdf, "A"))
>     user  system elapsed
>    0.187   0.048   0.235
>
> ## 10 times faster!
>
>>
>> all.equal(res1,res2)
> [1] TRUE
>> all.equal(res1,res3)
> [1] TRUE
>
>
> NB: I freely admit that Eric's original solution may well be perfectly
> adequate, and the speed improvement is pointless. In that case, maybe
> this is at least somewhat instructive for someone.
>
> Nevertheless, I would welcome further suggestions for improvement, as
> I suspect my "fancy" approach is still a ways from what one can do (in
> R code, without resorting to C++).

fun4 <- function(df, what)
{
   as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}

The function above works for data.frame and matrix inputs as well. It is
slower than fun3() if 'df' is a data.frame, but is faster if 'df' is a
matrix (which is a more efficient representation of the data if it
contains only character columns).

A note to Ana: 'df' is the name of a function in R (see ?stats::df); not
a perfect choice for a variable name.

Cheers,
Denes


>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <[hidden email]> wrote:
>>
>> Read the help for apply and %in%
>>
>> ?apply
>> ?%in%
>>
>>
>> Sent from my iPhone
>>
>>> On 29 Jul 2019, at 22:23, Ana Marija <[hidden email]> wrote:
>>>
>>> Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
>>>
>>>> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[hidden email]> wrote:
>>>> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
>>>>
>>>>
>>>>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[hidden email]> wrote:
>>>>> sorry my bad, here is the edited version:
>>>>>
>>>>> so the data frame is this:
>>>>>
>>>>> df=data.frame(
>>>>>    eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
>>>>>    eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
>>>>>    eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
>>>>>    eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
>>>>>    eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
>>>>>
>>>>> and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
>>>>>
>>>>> so "case" column is where value "A" can be found in any column.
>>>>>
>>>>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]> wrote:
>>>>>> You may have a typo/misstatement in your question.
>>>>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
>>>>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
>>>>>> must have the same length.
>>>>>>
>>>>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]> wrote:
>>>>>>> I have data frame which looks like this:
>>>>>>>
>>>>>>> df=data.frame(
>>>>>>>    eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
>>>>>>>    eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
>>>>>>>    eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
>>>>>>>    eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
>>>>>>>    eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
>>>>>>>
>>>>>>> In reality I have much more columns and they don't always match
>>>>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
>>>>>>>
>>>>>>> What I would like to do is create a new column, say named "case" where I
>>>>>>> would have value "1" for every row where string "A" appears at least once
>>>>>>> in any column, if not the value would be "0". So in the above example
>>>>>>> column "case" would have these values: 1,1,1,1,0
>>>>>>> Thanks
>>>>>>> Ana
>>>>>>>
>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>          [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: How to create a new column based on the values from multiple columns which are matching a particular string?

Bert Gunter-2
Thanks, Dénes. Good points.

Bert


On Mon, Jul 29, 2019 at 4:33 PM Dénes Tóth <[hidden email]> wrote:

>
> Hi Bert,
>
> see inline.
>
> On 7/30/19 1:12 AM, Bert Gunter wrote:
> > While Eric's solution is correct( mod "corner" cases like all NA's in
> > a row), it can be made considerably more efficient.
> >
> > One minor improvement can be made by using the idiom
> > any(x == "A")
> > instead of matching via %in% for the simple case of matching just a
> > single value.
> >
> > However, a considerable improvement can be made by getting fancy,
> > taking advantage of do.call() and the pmax() function to mostly
> > vectorize the calculation. Here are the details and timing on a large
> > data frame.
> >
> > (Note: I removed the names in the %in% approach for simplicity. It has
> > almost no effect on timings.
> > I also moved the as.integer() call out of the function so that it is
> > called only once at the end, which improves efficiency a bit)
> >
> > 1. Eric's original:
> > fun1 <-function(df,what)
> > {
> >    as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
> > }
> >
> > 2. Using any( x == "A") instead:
> > fun2 <- function(df,what)
> > {
> >     as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what,
> > na.rm=TRUE))))
> > }
> >
> > 3. Getting fancy to use pmax()
> > fun3 <- function(df,what)
> > {
> >     z <- lapply(df,function(x)as.integer((x==what)))
> >     do.call(pmax,c(z,na.rm=TRUE))
> > }
> >
> > Here are the timings:
> >
> >> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
> >> dim(bigdf)
> > [1] 100000    250
> >
> >> system.time(res1 <- fun1(bigdf, "A"))
> >     user  system elapsed
> >    2.204   0.432   2.637
> >>
> >> system.time(res2 <- fun2(bigdf, "A"))
> >     user  system elapsed
> >    1.898   0.403   2.302
> >>
> >> system.time(res3 <- fun3(bigdf, "A"))
> >     user  system elapsed
> >    0.187   0.048   0.235
> >
> > ## 10 times faster!
> >
> >>
> >> all.equal(res1,res2)
> > [1] TRUE
> >> all.equal(res1,res3)
> > [1] TRUE
> >
> >
> > NB: I freely admit that Eric's original solution may well be perfectly
> > adequate, and the speed improvement is pointless. In that case, maybe
> > this is at least somewhat instructive for someone.
> >
> > Nevertheless, I would welcome further suggestions for improvement, as
> > I suspect my "fancy" approach is still a ways from what one can do (in
> > R code, without resorting to C++).
>
> fun4 <- function(df, what)
> {
>    as.integer(rowSums(df == what, na.rm = TRUE) > 0)
> }
>
> The function above works for data.frame and matrix inputs as well. It is
> slower than fun3() if 'df' is a data.frame, but is faster if 'df' is a
> matrix (which is a more efficient representation of the data if it
> contains only character columns).
>
> A note to Ana: 'df' is the name of a function in R (see ?stats::df); not
> a perfect choice for a variable name.
>
> Cheers,
> Denes
>
>
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> > and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> > and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> >
> > On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <[hidden email]> wrote:
> >>
> >> Read the help for apply and %in%
> >>
> >> ?apply
> >> ?%in%
> >>
> >>
> >> Sent from my iPhone
> >>
> >>> On 29 Jul 2019, at 22:23, Ana Marija <[hidden email]> wrote:
> >>>
> >>> Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
> >>>
> >>>> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[hidden email]> wrote:
> >>>> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
> >>>>
> >>>>
> >>>>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[hidden email]> wrote:
> >>>>> sorry my bad, here is the edited version:
> >>>>>
> >>>>> so the data frame is this:
> >>>>>
> >>>>> df=data.frame(
> >>>>>    eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
> >>>>>    eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
> >>>>>    eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
> >>>>>    eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
> >>>>>    eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
> >>>>>
> >>>>> and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
> >>>>>
> >>>>> so "case" column is where value "A" can be found in any column.
> >>>>>
> >>>>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[hidden email]> wrote:
> >>>>>> You may have a typo/misstatement in your question.
> >>>>>> You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5.
> >>>>>> Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame
> >>>>>> must have the same length.
> >>>>>>
> >>>>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[hidden email]> wrote:
> >>>>>>> I have data frame which looks like this:
> >>>>>>>
> >>>>>>> df=data.frame(
> >>>>>>>    eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
> >>>>>>>    eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
> >>>>>>>    eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
> >>>>>>>    eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
> >>>>>>>    eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
> >>>>>>>
> >>>>>>> In reality I have much more columns and they don't always match
> >>>>>>> "eye_problemsdisorders_f6148" this string, and there is much more rows.
> >>>>>>>
> >>>>>>> What I would like to do is create a new column, say named "case" where I
> >>>>>>> would have value "1" for every row where string "A" appears at least once
> >>>>>>> in any column, if not the value would be "0". So in the above example
> >>>>>>> column "case" would have these values: 1,1,1,1,0
> >>>>>>> Thanks
> >>>>>>> Ana
> >>>>>>>
> >>>>>>>          [[alternative HTML version deleted]]
> >>>>>>>
> >>>>>>> ______________________________________________
> >>>>>>> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >>          [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > ______________________________________________
> > [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.