HOW TO FILTER DATA

Saptorshee Kanto Chakraborty
Hello,

I have OECD patent data in delimited text format, with IPC as one of the
columns. I want to filter the data by keeping only the rows whose IPC code
is one of the codes I need and deleting all other rows. Can anybody guide
me in doing this? Note that the IPC codes are string variables.

The data looks like the sample below, but the full dataset is huge, with
more than 11 million rows.


Appln_id|Prio_Year|App_year|IPC
1|1999|2000|H04Q007/32
1|1999|2000|G06K019/077
1|1999|2000|H01R012/18
1|1999|2000|G06K017/00
1|1999|2000|H04M001/2745
1|1999|2000|G06K007/00
1|1999|2000|H04M001/02
1|1999|2000|H04M001/275
2|1991|1992|C12N015/62
2|1991|1992|C12N015/09
2|1991|1992|C07K019/00
2|1991|1992|C07K016/26



Thank you.

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: HOW TO FILTER DATA

Leilei Ruan
Try the code below:


df <- read_delim("C:/Users/lruan1/Desktop/1112.csv", "|",
                 escape_double = FALSE, trim_ws = TRUE)

df_new <- subset(df, df$IPC == 'H04M001/02' | df$IPC == 'C07K016/26')

You can add more conditions with "|" in the subset() call. Good luck!
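
For readers who do not have the package that provides read_delim installed, the same two steps can probably be done in base R. The following is only a minimal sketch, assuming a pipe-delimited file with a header row. The temporary file and the names `path`, `patents` and `wanted` are made up for illustration and are not from the thread.

```r
# Write a small pipe-delimited sample to a temporary file
# (a stand-in for the real data file)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Appln_id|Prio_Year|App_year|IPC",
  "1|1999|2000|H04Q007/32",
  "1|1999|2000|H04M001/02",
  "2|1991|1992|C12N015/62",
  "2|1991|1992|C07K016/26"
), path)

# Read it with base R; keep IPC as character rather than factor
patents <- read.table(path, header = TRUE, sep = "|",
                      stringsAsFactors = FALSE)

# Keep rows matching either code, with "|" between the conditions
wanted <- subset(patents, IPC == "H04M001/02" | IPC == "C07K016/26")
print(wanted)
```

For the real file, `path` would simply be the location of the delimited text file.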

On Wed, Jan 3, 2018 at 2:53 PM, Saptorshee Kanto Chakraborty <
[hidden email]> wrote:

Re: HOW TO FILTER DATA

Rui Barradas
In reply to this post by Saptorshee Kanto Chakraborty
Hello,

If you want to select rows with just one IPC, use `==`.
If you want to select rows with several IPCs, use `%in%`.
The code below shows both ways of doing this.


oecd <- read.table(text = "
Appln_id|Prio_Year|App_year|IPC
1|1999|2000|H04Q007/32
1|1999|2000|G06K019/077
1|1999|2000|H01R012/18
1|1999|2000|G06K017/00
1|1999|2000|H04M001/2745
1|1999|2000|G06K007/00
1|1999|2000|H04M001/02
1|1999|2000|H04M001/275
2|1991|1992|C12N015/62
2|1991|1992|C12N015/09
2|1991|1992|C07K019/00
2|1991|1992|C07K016/26
", header = TRUE, sep = "|")


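# One code to match exactly, and a vector of codes to match as a set
select_one <- "H04Q007/32"
select_many <- c("H04Q007/32", "H04M001/275")

# Rows whose IPC equals the single code
oecd2 <- subset(oecd, IPC == select_one)
# Rows whose IPC is any of the listed codes
oecd3 <- subset(oecd, IPC %in% select_many)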
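
With more than 11 million rows, reading the whole file before filtering may be slow or use a lot of memory. One possible approach, sketched here in base R, is to read the file in chunks through a connection and keep only the matching rows from each chunk. This assumes a single header line, "|" as the separator and no quoting in the file. The helper name `filter_in_chunks`, its arguments and the chunk size are hypothetical and not from the thread.

```r
# Hypothetical helper: stream a pipe-delimited file and keep only
# the rows whose IPC value is in `keep`
filter_in_chunks <- function(path, keep, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names
  col_names <- strsplit(readLines(con, n = 1L), "|", fixed = TRUE)[[1]]

  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break      # end of file reached
    lines <- lines[nzchar(lines)]       # drop empty lines
    if (length(lines) == 0L) next
    chunk <- read.table(text = lines, sep = "|", col.names = col_names,
                        quote = "", comment.char = "",
                        stringsAsFactors = FALSE)
    # Store only the matching rows of this chunk
    pieces[[length(pieces) + 1L]] <- chunk[chunk$IPC %in% keep, , drop = FALSE]
  }

  if (length(pieces) == 0L) return(NULL)
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

# Small demonstration on a temporary copy of the sample data
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Appln_id|Prio_Year|App_year|IPC",
  "1|1999|2000|H04Q007/32",
  "1|1999|2000|G06K019/077",
  "1|1999|2000|H01R012/18",
  "1|1999|2000|G06K017/00",
  "1|1999|2000|H04M001/2745",
  "1|1999|2000|G06K007/00",
  "1|1999|2000|H04M001/02",
  "1|1999|2000|H04M001/275",
  "2|1991|1992|C12N015/62",
  "2|1991|1992|C12N015/09",
  "2|1991|1992|C07K019/00",
  "2|1991|1992|C07K016/26"
), path)

wanted_codes <- c("H04Q007/32", "H04M001/275")
oecd_small <- filter_in_chunks(path, wanted_codes, chunk_size = 5L)
print(oecd_small)
```

The tiny chunk size is only there so that the demonstration runs through several chunks. Only the matching rows are held in memory at the end, so this helps most when the selected codes are a small share of the file.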


Hope this helps,

Rui Barradas

Re: HOW TO FILTER DATA

MacQueen, Don
In reply to this post by Leilei Ruan
Just a couple of minor comments:

> help.search('read_delim')
No vignettes or demos or help files found with alias or concept or
title matching 'read_delim' using regular expression matching.

read_delim is not part of base R; it comes from a non-base package that the reply does not name (the readr package provides a function of that name). For someone who is new to R, as I suspect the original poster is, I'd recommend using base R as much as possible.

The call to subset would be better written as

  df_new <- subset(df, IPC == 'H04M001/02' | IPC == 'C07K016/26')

instead of

  df_new <- subset(df, df$IPC == 'H04M001/02' | df$IPC == 'C07K016/26')

IPC is a variable within the data frame, so it is unnecessary to include the data frame's name in the logical expression.
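
The point about the data frame's name can be checked on a toy example. This is only an illustrative sketch: the small data frame and the names `a`, `b` and `d` are made up, not taken from the original data.

```r
# Toy data frame standing in for the patent data
df <- data.frame(
  Appln_id = c(1L, 1L, 2L, 2L),
  IPC = c("H04Q007/32", "H04M001/02", "C12N015/62", "C07K016/26"),
  stringsAsFactors = FALSE
)

# subset() looks up IPC inside df, so the bare column name is enough
a <- subset(df, IPC == "H04M001/02" | IPC == "C07K016/26")

# Prefixing with df$ selects the same rows; it is just redundant here
b <- subset(df, df$IPC == "H04M001/02" | df$IPC == "C07K016/26")

# Plain bracket indexing does no such lookup, so df$ is required there
d <- df[df$IPC == "H04M001/02" | df$IPC == "C07K016/26", ]

print(a)
```

All three forms select the same two rows. The bare-name form only works inside functions such as subset() that evaluate the condition within the data frame.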

-Don


--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 

On 1/3/18, 12:54 PM, "R-help on behalf of Leilei Ruan" <[hidden email] on behalf of [hidden email]> wrote:
