Subsetting a data.frame -> Read in with FWF format from .DAT file


RHelpPlease
Hi there,
I am having trouble subsetting a data frame by a condition on one of its (many) columns.

I read the file into R with "read.fwf," where I specified the column widths.  The original data is a .DAT file.  I then used the "names" function to assign the column headings.

I want to reduce the entire data set to only those rows where one column, PRVDR_NUM, satisfies PRVDR_NUM == 050108.  This is where I'm having trouble.

I've tried code like this:

newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
#OR
newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
#OR
providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
newinpatient <- merge(providernum, oldinpatient)

Checking "class" at one point, I gathered that R interprets PRVDR_NUM as a factor, not a number, which I take to be a potential reason the code above fails.  So, I then tried something like this:

newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM) [oldinpatient$PRVDR_NUM]))
numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
bestprvdr <- numericprvdr[,-2]

I thought that once PRVDR_NUM was converted to numeric, one of the three options above would work.  But that has not worked either.  (I did confirm that the factor -> numeric conversion itself worked.)

Though R reads the three options (above) with no errors, upon performing a "dim" check I receive the output: 0 93.  The columns are correct, but rows (obviously) are not.  (I did confirm that the desired value exists multiple times in the noted column, so 0 is definitely incorrect)

As well, I would like to work with PRVDR_NUM as a variable alone, but I've found that with any of these variables/column names, I have to use "allinpatient$PRVDR_NUM."  R does not recognize PRVDR_NUM alone.  Why?

More and more I suspect my problem is more foundational: should I be using the read.fwf function in the first place, or am I using it incorrectly?  I've made good progress with other variables & data sets of this type so far, but from now on I need to repeat this code many times, so help in better understanding my errors & a more elegant/efficient solution would be greatly appreciated.

Also note that R does not read all 93 columns as factors.  Why would R interpret this six-wide column as a factor, but the nine-wide column next door as numeric?

Your help is most appreciated!

Re: Subsetting a data.frame -> Read in with FWF format from .DAT file

Michael Weylandt
Inline.

On Fri, Mar 9, 2012 at 7:04 PM, RHelpPlease <[hidden email]> wrote:
> Hi there,
> I am having trouble subsetting a data frame by a conditional via one column
> (of many).
>
> I read the file into R through "read.fwf," where I specified column widths.
> Original data is .DAT.  I then utilized "names" function to read in column
> headings.

The easiest way for us to diagnose this is to see your data, and the
easiest way for you to show us your data is to use

dput(head(oldinpatient, 30))

so we can get a plain text (email friendly) version of it.
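
As a small illustration of what that produces (the data frame below is made up for the example and is not the poster's data), dput() writes an object out as R code that can be pasted straight back into another session, column classes and factor levels included:

```r
# Hypothetical stand-in for the real data
example_df <- data.frame(
  PRVDR_NUM = factor(c("050108", "05S108", "050108")),
  AMOUNT    = c(10, 20, 30)
)

# Prints a structure(...) call that recreates the first rows,
# including the factor levels
dput(head(example_df, 2))

# Round trip through a temporary file to show nothing is lost
tmp <- tempfile()
dput(example_df, file = tmp)
restored <- dget(tmp)
unlink(tmp)
all.equal(restored, example_df)
```

Because the output records that PRVDR_NUM is a factor, a reader of the list can spot a type problem without needing the original .DAT file.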

>
> For one column, PRVDR_NUM, I wish to further amend the entire data set, but
> only have PRVDR_NUM == 050108.  This is where I'm having trouble.
>
> I've tried code like this:
>
> newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
> #OR
> newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
> #OR

The two above have a chance of working and, once we figure out
what's going on, are good R idioms that should stay in your vocabulary
(though strictly speaking, the second "oldinpatient" in the first is
unnecessary, because subset() evaluates the condition inside the data
frame). The two lines below are no good, so don't try that anymore.

> providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
> newinpatient <- merge(providernum, oldinpatient)
>
> With checking "class" at one point, I gathered that R interprets PRVDR_NUM
> as a factor, not a number .. so I've understood a potential reason why I
> would have errors (with code above).  So, I then tried something like this:

Yes, it's an unfortunate legacy default. Most I/O functions let you
pass stringsAsFactors = FALSE to avoid this.
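
A minimal sketch of reading a fixed-width file without getting factors. The file contents, the widths c(6, 9) and the column name AMOUNT are assumptions made up for illustration; only PRVDR_NUM comes from the question. It uses colClasses (passed through read.fwf() to read.table()) rather than stringsAsFactors, so that the ID column stays character and keeps its leading zero:

```r
# Hypothetical three-record fixed-width file: a 6-wide provider ID
# followed by a 9-wide numeric field
tmp <- tempfile(fileext = ".DAT")
writeLines(c("050108000012345",
             "05S108000000678",
             "050108000000042"), tmp)

inpatient <- read.fwf(tmp,
                      widths     = c(6, 9),
                      col.names  = c("PRVDR_NUM", "AMOUNT"),
                      colClasses = c("character", "numeric"))
unlink(tmp)

class(inpatient$PRVDR_NUM)   # character, not factor
inpatient$AMOUNT             # ordinary numbers

# With a character column, compare against a quoted string
hits <- subset(inpatient, PRVDR_NUM == "050108")
nrow(hits)
```

Since R 4.0.0 the default is stringsAsFactors = FALSE, so on current versions the factor conversion no longer happens unless you ask for it; stating the column classes explicitly is still a reasonable habit for ID-like columns.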

>
> newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM)
> [oldinpatient$PRVDR_NUM]))

This is almost right, but format() converts the result back to
character, which undoes the as.numeric(). I find this idiom a little
clearer (though, admittedly, still strange):

as.numeric(as.character(oldinpatient$PRVDR_NUM))
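
A short sketch of the differences between these conversions, on a made-up factor (the values are illustrative only):

```r
# Made-up factor whose labels look like numbers
f <- factor(c("050108", "050234", "050108"))

# as.numeric() on a factor returns the internal level codes,
# not the numbers the labels spell out
codes <- as.numeric(f)

# Going through character first recovers the values
num <- as.numeric(as.character(f))

# Equivalent form built on levels(), as in the original attempt
num2 <- as.numeric(levels(f))[f]

# format() turns the numbers back into character strings
fmt <- format(num)

codes
num
class(fmt)
```

Both numeric forms give the same values; the one built on levels() converts each distinct label once and then indexes, which can matter for long columns.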

> numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
> bestprvdr <- numericprvdr[,-2]
>
> I thought that with converting PRVDR_NUM to numeric, then one of the three
> options above would be satisfied.  But, that has not worked either.  (I did
> confirm that the factor -> numeric worked, which it did)
>

If the column had really ended up numeric, these lines would not have
failed: I think your earlier attempts would have worked after
conversion to numeric, but the format() call gets you back in trouble.
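
A sketch of one way the comparison can silently return zero rows. The data are made up, and it assumes (this is not confirmed in the thread) that the factor labels keep their leading zero, e.g. "050108":

```r
# Made-up data: PRVDR_NUM is a factor whose labels keep the leading zero
oldinpatient <- data.frame(
  PRVDR_NUM = factor(c("050108", "05S108", "050108")),
  AMOUNT    = c(10, 20, 30)
)

# The literal 050108 is parsed as the number 50108
050108 == 50108

# Factor vs. number: the labels are compared as text, "050108" against
# "50108", so nothing matches and the result has zero rows
no_rows <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
dim(no_rows)

# Comparing against the label as a quoted string finds the rows
newinpatient <- subset(oldinpatient, PRVDR_NUM == "050108")
dim(newinpatient)

# Alternatively, compare numbers with numbers; IDs that are not
# numeric (here "05S108") become NA, and which() drops them
prvdr_numeric <- suppressWarnings(
  as.numeric(as.character(oldinpatient$PRVDR_NUM))
)
by_number <- oldinpatient[which(prvdr_numeric == 50108), ]
dim(by_number)
```

No error or warning is raised in the zero-row case, which matches the "0 93" symptom described in the question.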

> Though R reads the three options (above) with no errors, upon performing a
> "dim" check I receive the output: 0 93.  The columns are correct, but rows
> (obviously) are not.  (I did confirm that the desired value exists multiple
> times in the noted column, so 0 is definitely incorrect)
>
> As well, I would like to work with PRVDR_NUM as a variable alone, but I've
> found that with any of these variables/column names, I have to use
> "allinpatient$PRVDR_NUM."  R does not recognize PRVDR_NUM alone.  Why?

Different question: the short answer is that, unlike SAS/SPSS, R can
hold multiple data sets in memory at the same time, so you have to
tell it which one you mean. If you want to save keystrokes in a line
where you refer to a data set multiple times, you can use with(), e.g.,

DATS <- data.frame(x = 1:5, y = 1:5, z = 11:15)

DATS$x + DATS$y + DATS$z
with(DATS, x + y + z) # same

>
> More and more I think my problem is more foundational, meaning using the
> read.fwf function in the first place?  Not using the read.fwf function
> correctly?  Again, I've made enough progress with other variables & data
> sets of this type I've been fine so far, but now & future I need to repeat
> this code enough times where help in better understanding my errors & a more
> elegant/efficient solution would be greatly appreciated.

I think you're fine with the read.fwf() function -- though if .DAT is
a common file format, someone else might have done the heavy lifting
for you already. The definitive place to read all this is the R Data
Import/Export manual (http://cran.r-project.org/doc/manuals/R-data.html),
but it's not the easiest read.

>
> Also note that R does not read all 93 columns as factors.  Why would R
> interpret this six-wide column as a factor, but the nine-wide column next
> door as numeric?

It has to do with what appear to be strings and what appear to be
numbers (and that line is not where you may think) -- a single entry
that is not unambiguously numeric makes the whole column a string and,
by default, strings become factors -- hence, many factors.
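
A small sketch of that rule using type.convert(), which applies the same conversion read.table() uses when colClasses is not given. The values are made up; "05S108" stands in for whatever non-numeric entry might be hiding in the six-wide column:

```r
# Made-up column contents as they appear in the raw file
nine_wide <- c("000012345", "000000678")   # every entry is numeric
six_wide  <- c("050108", "05S108")         # one entry contains a letter

# All-numeric text becomes a number (leading zeros are dropped)
class(type.convert(nine_wide, as.is = TRUE))

# One non-numeric entry keeps the whole column as text ...
class(type.convert(six_wide, as.is = TRUE))

# ... and with as.is = FALSE (the old stringsAsFactors = TRUE
# behaviour) that text becomes a factor
class(type.convert(six_wide, as.is = FALSE))

# One way to look for the offending entries in a character vector
six_wide[!grepl("^[0-9]+$", six_wide)]
```

Running the last line on the real column (after as.character()) should show which provider numbers prevent it from being read as numeric.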

Michael

PS -- Thanks for showing what you've tried.

>
> Your help is most appreciated!
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Subsetting-a-data-frame-Read-in-with-FWF-format-from-DAT-file-tp4461051p4461051.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


Re: Subsetting a data.frame -> Read in with FWF format from .DAT file

RHelpPlease
Hi Michael,
Thanks so much for your detailed reply!  

I gained a better understanding of the read.fwf function, and I'll pay closer attention to how these read-in functions convert variables.  Your tip on removing "format" when converting the PRVDR_NUM variable from factor to numeric was just the ticket!  Your reply also helped me see which of the three subsetting options is the "best" to use.

At this time I've been able to successfully output the data file at hand.  Thanks again for your help!