can not extract rows which match a string

can not extract rows which match a string

anikaM
Hello,

I have a dataframe (t1) with many columns, but the one I care about is this:
> unique(t1$sex_chromosome_aneuploidy_f22019_0_0)
[1] NA    "Yes"

it has these two values.

I would like to remove from my dataframe t1 all rows which have "Yes"
in t1$sex_chromosome_aneuploidy_f22019_0_0

I tried selecting those rows with "Yes" via:

t11=t1[t1$sex_chromosome_aneuploidy_f22019_0_0=="Yes",]

but I got t11 which has the exact same number of rows as t1.

If I do:
> table(t1$sex_chromosome_aneuploidy_f22019_0_0)

Yes
620

So there are for sure 620 rows which have "Yes". How can I remove those
from my t1 data frame?

Thanks
Ana

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: can not extract rows which match a string

Rui Barradas
Hello,

You have to use is.na to get the NA values.


t1 <- data.frame(sex_chromosome_aneuploidy_f22019_0_0 = c(NA, "Yes"),
                  other = 1:2)

i <- t1$sex_chromosome_aneuploidy_f22019_0_0 == "Yes" &
!is.na(t1$sex_chromosome_aneuploidy_f22019_0_0)
i
t1[i, ]


Hope this helps,

Rui Barradas


Re: can not extract rows which match a string

R help mailing list-2
In reply to this post by anikaM
Hello,

I expected the code you posted to work just as you presumed it would,
but without a reproducible example--I can only speculate as to why it
didn't.

In the t1 dataframe, if indeed you only want to remove rows of the
t1$sex_chromosome_aneuploidy_f22019_0_0 column which are undefined,
you could try the following:

> t11 <- t1[ !is.na(t1$sex_chromosome_aneuploidy_f22019_0_0), ]

HTH, Bill.

W. Michels, Ph.D.




Re: can not extract rows which match a string

Rui Barradas
In reply to this post by Rui Barradas
Hello,

Then it's easier, is.na alone will do it.

j <- is.na(t1$sex_chromosome_aneuploidy_f22019_0_0)
t1[j, ]


Hope this helps,

Rui Barradas


At 20:29 on 03/10/19, Ana Marija wrote:

> Hi Rui,
>
> sorry for confusion, I would only need to extract from my t1 dataframe
> rows which have NA in sex_chromosome_aneuploidy_f22019_0_0
> in other words to REMOVE rows with "Yes" and to keep rows with NA. How
> to do that?

Re: can not extract rows which match a string

Rui Barradas
Hello again,

Sometimes it's better to create indices for each condition and then
assemble them with logical operations as needed.


# %in% returns FALSE (not NA) for NA entries, so !i & j stays a clean TRUE/FALSE index
i <- t1$sex_chromosome_aneuploidy_f22019_0_0 %in% "Yes"
j <- is.na(t1$sex_chromosome_aneuploidy_f22019_0_0)

t1[!i & j, ]


j means is.na(.)
!i means (.) != "Yes"
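
When indices are combined with logical operators, it may help to check how NA behaves under `&`. The following is only a minimal sketch on a made-up vector (`x`, `i_eq`, `i_in` are illustrative names, not from the poster's data). It compares an index built with `==` against one built with `%in%` before both are combined with `is.na()`.

```r
# Toy column (an assumption for illustration): only NA and "Yes".
x <- c(NA, "Yes", NA, "Yes")

i_eq <- x == "Yes"     # NA wherever x is NA
i_in <- x %in% "Yes"   # FALSE wherever x is NA
j    <- is.na(x)

# NA & TRUE evaluates to NA, so the == based index is NA on the NA rows.
eq_combined <- !i_eq & j

# The %in% based index is TRUE exactly on the NA rows.
in_combined <- !i_in & j

print(eq_combined)
print(in_combined)
```

Only the second index is free of NA and therefore safe to use directly inside `t1[ , ]`.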


Hope this helps,

Rui Barradas


Re: can not extract rows which match a string

Pages, Herve
In reply to this post by anikaM
Hi,

On 10/3/19 11:58, Ana Marija wrote:

> Hello,
>
> I have a dataframe (t1) with many columns, but the one I care about it this:
>> unique(t1$sex_chromosome_aneuploidy_f22019_0_0)
> [1] NA    "Yes"
>
> it has these two values.
>
> I would like to remove from my dataframe t1 all rows which have "Yes"
> in t1$sex_chromosome_aneuploidy_f22019_0_0
>
> I tried selecting those rows with "Yes" via:
>
> t11=t1[t1$sex_chromosome_aneuploidy_f22019_0_0=="Yes",]

It's important that you realize that instead of removing rows with "Yes"
this actually keeps them.

>
> but I got t11 which has the exact same number of rows as t1.

which should not be outrageously unexpected. After all it's not entirely
impossible that when you selected the rows with "Yes" you selected them all.

>
> If I do:
>> table(t1$sex_chromosome_aneuploidy_f22019_0_0)
>
> Yes
> 620
>
> So there is for sure 620 rows which have "Yes".

This **seems** to indicate that all the rows contain "Yes". And this
would explain why when you selected the rows with "Yes" you selected
them all.

> How to remove those
> from my t1 data frame?

Unfortunately, this is a situation where we cannot trust the appearances.

Appearances: it **looks** like all the rows contain "Yes" and this seems
to be confirmed by the fact that selecting the rows with "Yes" didn't
drop any rows.

The truth: the truth is that there are some rows that don't contain
"Yes". However by default table() doesn't report counts for NAs so you
need to explicitly ask for that:

 > table(t1$sex_chromosome_aneuploidy_f22019_0_0, useNA="always")

  Yes <NA>
  620  111

So now you know how many rows to expect after removing those with "Yes".
Another complication is that the == operator propagates NAs so it tends
to return a subscript that is not safe to use for subsetting because
it's contaminated with NAs.
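
To make the point about NA-contaminated subscripts concrete, here is a small sketch on a hypothetical three-row data frame (`df`, `flag` and `id` are assumed names for illustration). It suggests why subsetting with `==` can return as many rows as the original: each NA in the index produces a row of NAs rather than dropping the row.

```r
# Hypothetical toy data; not the poster's t1.
df <- data.frame(flag = c(NA, "Yes", NA), id = 1:3)

idx <- df$flag == "Yes"        # NA TRUE NA

n_with_na <- nrow(df[idx, ])        # every NA in idx yields an all-NA row
n_clean   <- nrow(df[which(idx), ]) # which() keeps only the TRUE positions

print(n_with_na)
print(n_clean)
print(table(df$flag, useNA = "always"))
```

Wrapping the comparison in `which()` is one way to discard the NA positions; it is shown here only as an alternative and is not part of the advice given in this message.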

Other people have suggested that you use
is.na(t1$sex_chromosome_aneuploidy_f22019_0_0) or other more complicated
things (like t1$sex_chromosome_aneuploidy_f22019_0_0 != "Yes" &
is.na(t1$sex_chromosome_aneuploidy_f22019_0_0)) to work around this.
However the simplest and safest way to translate "compute the index of
the rows that match string 'babar'" into R code is with:

   t1$sex_chromosome_aneuploidy_f22019_0_0 %in% "babar"

Another advantage of using %in% is that you can have more than one
string on the right. For example

   t1$sex_chromosome_aneuploidy_f22019_0_0 %in% c("babar", "foo")

will produce an index that can be used to select the rows that match
"babar" or "foo". To remove these rows, use


   !(t1$sex_chromosome_aneuploidy_f22019_0_0 %in% c("babar", "foo"))

instead (parentheses around the %in% operation highly recommended for
readability).

The bottom line is that %in% is almost always better than == for
computing a subscript because it doesn't propagate NAs.
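
A minimal sketch of the %in% approach applied to the original question, using a made-up five-row data frame (`t1_demo`, `keep` and `t11_demo` are assumed names, and the data are invented for illustration):

```r
# Invented stand-in for t1; only the column name is taken from the thread.
t1_demo <- data.frame(
  sex_chromosome_aneuploidy_f22019_0_0 = c(NA, "Yes", NA, "Yes", NA),
  other = 1:5
)

# TRUE for every row that is not "Yes", including the NA rows.
keep <- !(t1_demo$sex_chromosome_aneuploidy_f22019_0_0 %in% "Yes")

t11_demo <- t1_demo[keep, ]
print(t11_demo)
```

Because `keep` contains no NA, the result has only the rows whose value is not "Yes", and no all-NA rows are introduced.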

Hope this helps,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [hidden email]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Re: can not extract rows which match a string

Richard O'Keefe-2
In reply to this post by anikaM
I think the problem may lie in your understanding of what "==" does with NA
and/or what "[]" does with NA.

> x <- c(NA, "Yes")
> x == "Yes"
[1]   NA TRUE

Since you say you DON'T want the rows with "Yes", you just want

   x[is.na(x)]

or in your case

   t11 <- t1[is.na(t1$sex_chromosome_aneuploidy_f22019_0_0),]

or, if there could be other values than "Yes" that you want to keep,

   is.definitely <- function (x, y) {
      !is.na(x) & !is.na(y) & x == y
   }
   t11 <- t1[!is.definitely(t1$sex_chromosome_aneuploidy_f22019_0_0, "Yes"),]
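
As a quick check of the helper above, the sketch below applies it to a made-up vector that also contains a third value, "No" (the vector `x_demo` and the value "No" are assumptions for illustration only):

```r
# Helper as given in the message above.
is.definitely <- function(x, y) {
  !is.na(x) & !is.na(y) & x == y
}

# Invented vector with NA, "Yes", and another value.
x_demo <- c(NA, "Yes", "No", "Yes")

hits <- is.definitely(x_demo, "Yes")   # never NA: FALSE & NA is FALSE
kept <- x_demo[!hits]                  # drops only the definite "Yes" entries

print(hits)
print(kept)
```

The index is free of NA because `FALSE & NA` evaluates to `FALSE`, so the leading `!is.na(x)` term masks the NA produced by `x == y`.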


Re: can not extract rows which match a string

R help mailing list-2
In reply to this post by Rui Barradas
Apologies Ana, of course Rui and Herve (and Richard) are correct here
in stating that NA values get 'carried through' when selecting using
the "==" operator.

To give an illustration of what (I believe) Herve means by "NAs
propagating", here's a small 11 x 8 dataframe ("zakaria") posted to
R-Help last year, which fortuitously has one column ("PO2T")
containing only the numeric value 50 as well as NAs. I compare
selecting with the "%in%" operator (as Herve suggests) and selecting
with the "==" operator. Notice the "propagating NAs" (last line of
code):

https://stat.ethz.ch/pipermail/r-help/2018-October/456798.html

> dim(zakaria)
[1] 11  8
> zakaria
   STUDENT_ID COURSE_CODE   PO1M PO1T PO2M PO2T  X X.1
1     AA15285     BAA1113 155.70  180   NA   NA NA  NA
2     AA15285     BAA1322  48.90   70   NA   NA NA  NA
3     AA15285     BAA2713  83.20  100   NA   NA NA  NA
4     AA15285     BAA2921     NA   NA   37   50 NA  NA
5     AA15285     BAA4273     NA   NA   NA   NA NA  NA
6     AA15285     BAA4513     NA   NA   NA   NA NA  NA
7     AA15286     BAA1322  48.05   70   NA   NA NA  NA
8     AA15286     BAA2113  68.40  100   NA   NA NA  NA
9     AA15286     BAA2513  41.65   60   NA   NA NA  NA
10    AA15286     BAA2713  82.35  100   NA   NA NA  NA
11    AA15286     BAA2921     NA   NA   41   50 NA  NA
> unique(zakaria$PO2T)
[1] NA 50
> table(zakaria$PO2T, exclude=NULL)

  50 <NA>
   2    9
> zakaria[!is.na(zakaria$PO2T), ]
   STUDENT_ID COURSE_CODE PO1M PO1T PO2M PO2T  X X.1
4     AA15285     BAA2921   NA   NA   37   50 NA  NA
11    AA15286     BAA2921   NA   NA   41   50 NA  NA
> zakaria[zakaria$PO2T %in% 50, ]
   STUDENT_ID COURSE_CODE PO1M PO1T PO2M PO2T  X X.1
4     AA15285     BAA2921   NA   NA   37   50 NA  NA
11    AA15286     BAA2921   NA   NA   41   50 NA  NA
> zakaria[zakaria$PO2T==50, ]
     STUDENT_ID COURSE_CODE PO1M PO1T PO2M PO2T  X X.1
NA         <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.1       <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.2       <NA>        <NA>   NA   NA   NA   NA NA  NA
4       AA15285     BAA2921   NA   NA   37   50 NA  NA
NA.3       <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.4       <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.5       <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.6       <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.7       <NA>        <NA>   NA   NA   NA   NA NA  NA
NA.8       <NA>        <NA>   NA   NA   NA   NA NA  NA
11      AA15286     BAA2921   NA   NA   41   50 NA  NA
>

I am certainly taking Herve's advice seriously, but I also believe
that when importing data into R, carefully setting parameters such as
the "na.strings" parameter of read.table() can help you avoid
surprises later on.
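
As a rough illustration of the na.strings suggestion, the sketch below reads a tiny inline table in which missing values are coded as "missing" or left blank. The data, the column names and the code "missing" are all invented for this example; a real file would need its own list of codes.

```r
# Invented inline data standing in for a file on disk.
raw <- "id,flag
1,Yes
2,missing
3,
4,Yes"

# Declare which strings should be read as NA at import time.
d <- read.table(text = raw, header = TRUE, sep = ",",
                na.strings = c("NA", "missing", ""),
                stringsAsFactors = FALSE)

print(table(d$flag, useNA = "always"))
```

Declaring the codes up front means that later calls such as `is.na()` or `table(..., useNA = "always")` see every missing entry as a true NA.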

HTH, Bill.

W. Michels, Ph.D.
