how to filter variables which appear in any row but do not include


anikaM
Hello.

I am trying to keep only the rows that contain ANY of these values:
E109, E119, E149

so I did:
controls=t %>% filter_all(any_vars(. %in% c("E109", "E119","E149")))

Then I checked what I got:
> s0 <- sapply(controls, function(x) grep('^E10', x, value = TRUE))
> d0=unlist(s0)
> d10=unique(d0)
> d10
 [1] "E10"  "E103" "E104" "E109" "E101" "E108" "E105" "E100" "E106" "E102"
[11] "E107"
> s1 <- sapply(controls, function(x) grep('^E11', x, value = TRUE))
> d1=unlist(s1)
> d11=unique(d1)
> d11
 [1] "E11"  "E119" "E113" "E115" "E111" "E114" "E110" "E118" "E116" "E112"
[11] "E117"

How can I change this command
controls=t %>% filter_all(any_vars(. %in% c("E109", "E119","E149")))

so that the output does not contain any rows that include E102 or E112?

Thanks
Ana

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: how to filter variables which appear in any row but do not include

Bert Gunter-2
I suggest that you forget all that fancy stuff  (and this is not a use case
for regular expressions).
Use %in%  with logical subscripting instead -- basic R functionality that
can be found in any good R tutorial.

> x <- c("ab","bc","cd")
> x[x %in% c("ab","cd")]
[1] "ab" "cd"
> x[!x %in% c("ab","cd")]
[1] "bc"


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Re: how to filter variables which appear in any row but do not include

anikaM
Hi Bert

The issue is that I have around 2000 columns, so I cannot check "by hand",
so to speak, whether those two values are present in each column of every
row. I need my output to be a data frame in which neither E102 nor E112 is
present. Basically, from the data frame I already created, I just want to
remove any row that contains either of those values.

Thanks
Ana


Re: how to filter variables which appear in any row but do not include

Rui Barradas
In reply to this post by anikaM
Hello,

If you want to filter out rows with any of the values in an 'unwanted'
vector, try the following.

First, create a test data set.

x <- scan(what = character(), text = '
"E10"  "E103" "E104" "E109" "E101" "E108" "E105" "E100" "E106" "E102"
"E107" "E11"  "E119" "E113" "E115" "E111" "E114" "E110" "E118" "E116" "E112"
"E117"
')

set.seed(2020)
dat <- replicate(5, sample(x, 20, TRUE))
dat <- as.data.frame(dat)


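Now, remove all rows that have at least one of "E102" or "E112".


unwanted <- c("E102", "E112")
# One logical column per column of dat: TRUE where the cell matches
# the pattern "E102|E112"
no <- sapply(dat, function(x){
   grepl(paste(unwanted, collapse = "|"), x)
})
# Collapse to one logical per row: TRUE if any cell in that row matched
no <- apply(no, 1, any)
dat[!no, ]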


That's it, if I understood the problem.
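
One caveat with the pattern above: grepl() matches substrings, so "E102|E112" would also flag longer codes that merely contain E102 or E112. A minimal sketch of the difference, assuming such longer codes could occur in the data; the vector `x` and the names `loose` and `strict` are made up for illustration:

```r
unwanted <- c("E102", "E112")

# Hypothetical codes, two of which only contain an unwanted code
x <- c("E102", "E1020", "AE112", "E11")

# Unanchored pattern, as in the message above: substring match
loose <- paste(unwanted, collapse = "|")

# Anchored pattern: the whole string must equal one of the codes
strict <- paste0("^(", loose, ")$")

grepl(loose, x)   # TRUE  TRUE  TRUE FALSE
grepl(strict, x)  # TRUE FALSE FALSE FALSE
```

With the test data in this message the two patterns give the same rows, because no code there contains another unwanted code as a substring.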


Hope this helps,

Rui Barradas



Re: how to filter variables which appear in any row but do not include

R help mailing list-2
In reply to this post by anikaM
#Below returns long list of TRUE/FALSE values,
#Note: "IDs" is a column name,
#Wrap with head() to shorten:
df$IDs %in% c("ident_1", "ident_2");

#Below returns index of IDs that are TRUE,
#Wrap with head() to shorten:
which(df$IDs %in% c("ident_1", "ident_2"));

#Below returns short TRUE/FALSE table:
table(df$IDs %in% c("ident_1", "ident_2"));

#Below check df to see unique IDs returned by "%in%" code above,
#(Good for identifying missing "desired" IDs):
unique(df[df$IDs %in% c("ident_1", "ident_2"), "IDs"]);

#Below returns dimensions of dataframe "filtered" (retained) by desired IDs,
#(Note rows below should equal number of TRUE in table above):
dim(df[df$IDs %in% c("ident_1", "ident_2"), ]);

#Create filtered dataframe object:
df_filtered  <-  df[df$IDs %in% c("ident_1", "ident_2"),  ];

#Below returns row counts per "IDs" ("ident_1", "ident_2", etc.) in df_filtered:
aggregate(df_filtered$IDs, by=list(df_filtered$IDs), FUN = "length");
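
The snippets above assume an existing data frame `df` with an `IDs` column. To try them out, something like the following toy data frame could be used; its contents are invented for illustration and do not come from the thread. The last line shows the negated form, which drops the listed IDs instead of keeping them.

```r
# Hypothetical data frame with an "IDs" column
df <- data.frame(
  IDs   = c("ident_1", "ident_3", "ident_2", "ident_1", "ident_4"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the ID is one of the desired ones
keep <- df$IDs %in% c("ident_1", "ident_2")

which(keep)   # 1 3 4
table(keep)   # 2 FALSE, 3 TRUE

# Rows retained by the desired IDs
df_filtered <- df[keep, ]

# Negating the same vector removes those IDs instead
df_dropped <- df[!keep, ]
```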


HTH, Bill.

W. Michels, Ph.D.






Re: how to filter variables which appear in any row but do not include

Bert Gunter-2
In reply to this post by Rui Barradas
Regexes are not needed. Using Rui's example:

> bad <- mapply(function(x) x %in% unwanted,dat)
> dat[!rowSums(bad),]

     V1   V2   V3   V4   V5
2  E117 E113 E119 E100  E10
4  E114  E11 E119 E119 E114
5  E109 E111 E103 E103 E100
7  E108 E113 E119 E117  E11
8  E114 E105  E10 E109 E110
9  E119 E116 E108 E118 E119
10 E100 E110 E104 E111 E101
13 E111 E116 E101 E110 E116
15 E103  E11 E108  E10 E113
16 E111 E117 E103 E115 E119
17 E104 E110 E104 E117 E114
19 E100 E108  E10 E111 E105
20 E109 E115 E117 E108 E106
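
For completeness, a minimal sketch of how both conditions from the original question might be combined in base R: keep rows that contain at least one wanted value, and drop those that also contain an unwanted one. The toy data frame and the helper `row_has_any` are made up for illustration and are not from the thread.

```r
# Hypothetical data: rows 1-4 each contain a wanted code,
# rows 2 and 3 also contain an unwanted one, row 5 has neither
dat <- data.frame(
  V1 = c("E109", "E101", "E119", "E103", "E101"),
  V2 = c("E100", "E102", "E112", "E149", "E100"),
  V3 = c("E105", "E109", "E104", "E106", "E105"),
  stringsAsFactors = FALSE
)

wanted   <- c("E109", "E119", "E149")
unwanted <- c("E102", "E112")

# TRUE for each row in which at least one cell equals one of `values`.
# Reduce() ORs the per-column logical vectors together, so this also
# works for a one-row data frame, where sapply() would return a vector.
row_has_any <- function(df, values) {
  Reduce(`|`, lapply(df, function(col) col %in% values))
}

controls <- dat[row_has_any(dat, wanted) & !row_has_any(dat, unwanted), ]
controls   # rows 1 and 4 remain
```

Because %in% compares whole values, a code such as "E10" is never confused with "E102" here.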

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Re: how to filter variables which appear in any row but do not include

anikaM
In reply to this post by Rui Barradas
Hi Rui,

thank you so much, that is exactly what I needed!

Cheers,
Ana


Re: how to filter variables which appear in any row but do not include

Rui Barradas
In reply to this post by Bert Gunter-2
Hello,

I forgot about %in%, maybe because there were regexes in the OP.
And rowSums is much faster than apply.

In my tests your version is 7 times faster than a variant of mine that
uses %in% instead of grepl but keeps apply(no, 1, any).
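
A rough way to check this kind of comparison on one's own machine is sketched below. The matrix size and the share of TRUE cells are arbitrary assumptions, and the timings will differ between systems, so no figures are shown.

```r
set.seed(1)

# Hypothetical logical matrix standing in for the per-cell match results
m <- matrix(runif(1000 * 200) < 0.01, nrow = 1000)

# Row-wise "any" computed two ways
by_apply   <- apply(m, 1, any)
by_rowsums <- rowSums(m) > 0

# Both approaches should flag exactly the same rows
identical(by_apply, by_rowsums)

# Time each approach over repeated runs
t_apply   <- system.time(for (i in 1:20) apply(m, 1, any))
t_rowsums <- system.time(for (i in 1:20) rowSums(m) > 0)
t_apply["elapsed"]
t_rowsums["elapsed"]
```

rowSums() works on the whole matrix in compiled code, whereas apply() calls any() once per row from R, which is the usual explanation for the gap.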

Hope this helps,

Rui Barradas
