Help with Identify the number (Count) of values that are less than 5 char and replace with 99999

Bill Poling
#RStudio Version 1.2.5019
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 17134)

Good morning. I have a factor that contains 1,418,303 Current Procedural Terminology (CPT) codes.

A CPT code is 5 characters long. However, my data contain many shorter values (2, 3, or 4 characters), as well as NAs.
From the str() function I get the count of NAs: 58,481.

Using the nchar function (after first converting the factor to a character column) I only see the first 1,000 values, because the printed output stops at max.print.
(Perhaps the conversion is not necessary with an alternative function?)
# edt1a$ProcedureCode1 <- levels(edt1a$ProcedureCode)[edt1a$ProcedureCode]
#https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/nchar

[989] 5 5 5 5 5 5 5 5 5 5 5 5
 [ reached getOption("max.print") -- omitted 1417303 entries ]

What I would like to do is:

1. Identify the number (count) of values that are fewer than 5 characters long (e.g. 2 char = 150, 3 char = 925, 4 char = 1002).
The result would probably look something like this:
|Var1   |  Freq|
|:------|-----:|
|2      |   150|
|3      |   925|
|4      |  1002|
2. Replace those values with 99999, and also replace the NAs with 99999.

head(edt1a$ProcedureCode1, n= 50) #Not apparent in top 50 but they are there
 [1] "44207" "99478" "99478" "99479" "98927" "01610" "99396" "81025" "64645" "99478" "99479" "99479" "99479" "99479" "99479" "97110" "J1885" "19081" "99479"
[20] "99478" "99479" "99479" "99479" "99213" "99213" "98927" "96372" "92507" "99479" "99478" "99478" "99478" "99479" "77065" "19083" "95874" "99244" "A7034"
[39] "A7046" "71275" "J1170" "90471" "87591" "80053" "98926" "A4649" "A7033" "43644" "85025" "73080"

str(edt1a$ProcedureCode) #Factor w/ 6244
 Factor w/ 6244 levels "0003M","00100",..: 1775 4732 4732 4733 4586 147 4708 3108 2400 4732 ...
str(edt1a$ProcedureCode1)
 chr [1:1418303] "44207" "99478" "99478" "99479" "98927" "01610" "99396" "81025" "64645" "99478" "99479" "99479" "99479" "99479" "99479" "97110" "J1885" ...

#Some examples from using sink and knitr

sink("ProcCodeV2.txt")
knitr::kable(table(edt1a$ProcedureCode1))
closeAllConnections()

|Var1   |  Freq|
|:------|-----:|
|0003M  |     1|
|0110   |     4|<--
|0111   |     5|<--
|01112  |    11|
|0112   |    14|<--
|01120  |     3|
|0113   |     2|<--
|01130  |     1|
|0114   |     1|<--
|01160  |     3|
|01170  |     4|
|0120   |     7|<--
|01200  |     8|
|01202  |    26|
|0121   |     7|<--
|01210  |    19|
|01214  |   125|
|01215  |     5|
|0122   |     2|<--
|01220  |     2|
|01230  |    11|
|0124   |     5|<--
|171    |     1|<--
|17106  |     6|

Thank you for any help.

WHP


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Help with Identify the number (Count) of values that are less than 5 char and replace with 99999

Ivan Krylov
On Mon, 16 Dec 2019 13:24:36 +0000
Bill Poling <[hidden email]> wrote:

> Using the nchar function (I converted the Factor to a character
> column first) I get the first 1K values.

<...>

> 1. Identify the number (Count) of values that are less than 5 char
> (i.e. 2 char = 150, 3 char = 925, 4 char = 1002)

Use the table() function to get frequency counts for discrete-valued
data (such as the result of nchar()).
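As a minimal sketch of that suggestion, here is table() applied to nchar() on a small made-up vector. The name `codes` and its values are assumptions standing in for edt1a$ProcedureCode1; keepNA = TRUE is passed explicitly so that missing codes show up as NA lengths rather than being counted as 2 characters.

```r
# Toy data (hypothetical values) standing in for edt1a$ProcedureCode1
codes <- c("44207", "0110", NA, "171", "99478", "0110", "J1885", "12")

# Length of each code; missing codes get a length of NA
len <- nchar(codes, keepNA = TRUE)

# Frequency of every length, with the NAs shown as their own category
counts <- table(len, useNA = "ifany")
print(counts)

# Frequency of only the lengths shorter than 5
short_counts <- table(len[!is.na(len) & len < 5])
print(short_counts)
```

The second table has the same shape as the Var1/Freq layout asked for in the original post.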

> 2. Replace with 99999 as well as replace the NA's with 99999

An expression like `nchar(x) < 5` returns a boolean vector with TRUE
where the condition is, well, true, and FALSE otherwise. Use this vector
together with the subset operator (square brackets []) and assignment
operator (<-) to perform a subassignment of "99999" to the elements
of your dataset where the condition is true:

https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Index-vectors

See also: ?`[` and ?table
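A small sketch of the subassignment described above, on made-up data (the vector `x` is an assumption, not the real column). Because nchar() gives NA for a missing string, the length test is combined with is.na() so that the index vector contains no NAs.

```r
# Toy character vector (hypothetical values)
x <- c("44207", "0110", NA, "171", "99478")

# TRUE where the code is missing or shorter than 5 characters
bad <- is.na(x) | nchar(x, keepNA = TRUE) < 5
print(bad)

# How many values are about to be replaced
n_bad <- sum(bad)

# Subassignment: overwrite only the elements where 'bad' is TRUE
x[bad] <- "99999"
print(x)
```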

--
Best regards,
Ivan

Re: Help with Identify the number (Count) of values that are less than 5 char and replace with 99999

Rui Barradas
Hello,

To count the number of values with fewer than 5 characters, use nchar
together with table or aggregate.
Since nchar needs a character vector and you have a factor, first
convert with as.character.


edt1a$ProcedureCode <- as.character(edt1a$ProcedureCode)


1.
Now any of the next 3 instructions will table the vector by number of
characters.

table(nchar(edt1a$ProcedureCode))
aggregate(ProcedureCode ~ nchar(ProcedureCode), edt1a, length)
tapply(edt1a$ProcedureCode, nchar(edt1a$ProcedureCode), length)


2.
If you want to change the values with fewer than 5 chars, and all NAs, to
"99999", a vectorized logical operation is a good way of doing it.

n <- nchar(edt1a$ProcedureCode) < 5
na <- is.na(edt1a$ProcedureCode)
edt1a$ProcedureCode[n | na] <- "99999"


Now back to factor, with the new level "99999".


edt1a$ProcedureCode <- factor(edt1a$ProcedureCode)
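On the original question of whether the conversion to character is needed at all: one possible alternative, sketched here on a made-up factor `f` (an assumption, not the real data), is to edit the levels directly. Assigning duplicate labels to levels() merges those levels, so the length test only has to run on the unique codes rather than on every row.

```r
# Toy factor (hypothetical values) standing in for edt1a$ProcedureCode
f <- factor(c("44207", "0110", NA, "171", "99478", "0110", "J1885"))

# Relabel every level shorter than 5 characters as "99999"
lv <- levels(f)
lv[nchar(lv) < 5] <- "99999"
levels(f) <- lv                 # duplicated labels are merged into one level

# Make sure the level exists even if no short codes were present
if (!"99999" %in% levels(f)) levels(f) <- c(levels(f), "99999")

# Replace the missing values with the same level
f[is.na(f)] <- "99999"
print(table(f))
```

With about 6,000 levels against 1.4 million rows this touches far fewer strings, though the as.character() route above is simpler to read.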



Hope this helps,

Rui Barradas

