I am having trouble subsetting a data frame by a conditional via one column (of many).

I read the file into R through "read.fwf," where I specified column widths. Original data is .DAT. I then utilized "names" function to read in column headings.

For one column, PRVDR_NUM, I wish to further amend the entire data set, but only have PRVDR_NUM == 050108. This is where I'm having trouble.

I've tried code like this:

newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
#OR
newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
#OR
providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
newinpatient <- merge(providernum, oldinpatient)

With checking "class" at one point, I gathered that R interprets PRVDR_NUM as a factor, not a number .. so I've understood a potential reason why I would have errors (with code above). So, I then tried something like this:

newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM)
[oldinpatient$PRVDR_NUM]))
numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
bestprvdr <- numericprvdr[,-2]

I thought that with converting PRVDR_NUM to numeric, then one of the three options above would be satisfied. But, that has not worked either. (I did confirm that the factor -> numeric worked, which it did)

Though R reads the three options (above) with no errors, upon performing a "dim" check I receive the output: 0 93. The columns are correct, but rows (obviously) are not. (I did confirm that the desired value exists multiple times in the noted column, so 0 is definitely incorrect)

As well, I would like to work with PRVDR_NUM as a variable alone, but I've found that with any of these variables/column names, I have to use "allinpatient$PRVDR_NUM." R does not recognize PRVDR_NUM alone. Why?

More and more I think my problem is more foundational, meaning using the read.fwf function in the first place? Not using the read.fwf function correctly? Again, I've made enough progress with other variables & data sets of this type I've been fine so far, but now & future I need to repeat this code enough times where help in better understanding my errors & a more elegant/efficient solution would be greatly appreciated.

Also note that R does not read all 93 columns as factors. Why would R interpret this six-wide column as a factor, but the nine-wide column next door as numeric?

Your help is most appreciated!
On Fri, Mar 9, 2012 at 7:04 PM, RHelpPlease <[hidden email]> wrote:
> Hi there,
> I am having trouble subsetting a data frame by a conditional via one column
> (of many).
>
> I read the file into R through "read.fwf," where I specified column widths.
> Original data is .DAT. I then utilized "names" function to read in column
> headings.

The easiest way for us to do diagnostics is if we can see your data: the easiest way for us to see your data is for you to use

dput(head(oldinpatient, 30))

so we can get a plain text (email friendly) version of it.

>
> For one column, PRVDR_NUM, I wish to further amend the entire data set, but
> only have PRVDR_NUM == 050108. This is where I'm having trouble.
>
> I've tried code like this:
>
> newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
> #OR
> newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
> #OR

The two above this have a chance of working (and, once we figure out what's going on, are good R idioms that should stay in your vocabulary, though strictly speaking, the second "oldinpatient" in the first is unnecessary due to some evaluation tricks); the two below are no good so don't try that anymore.

> providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
> newinpatient <- merge(providernum, oldinpatient)
>
> With checking "class" at one point, I gathered that R interprets PRVDR_NUM
> as a factor, not a number .. so I've understood a potential reason why I
> would have errors (with code above). So, I then tried something like this:

Yes, it's a terrible legacy.... most I/O functions let you set the option stringsAsFactors = FALSE to avoid this....

>
> newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM)
> [oldinpatient$PRVDR_NUM]))

This is almost right, though I think format sends things back to character (and undoes as.numeric) -- I find this idiom a little clearer (though, admittedly, still strange):

as.numeric(as.character(oldinpatient$PRVDR_NUM))

> numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
> bestprvdr <- numericprvdr[,-2]
>
> I thought that with converting PRVDR_NUM to numeric, then one of the three
> options above would be satisfied. But, that has not worked either. (I did
> confirm that the factor -> numeric worked, which it did)
>

If it did work, these lines wouldn't: I think your earlier attempts would have worked after conversion to numeric, but the format() gets you back in trouble.

> Though R reads the three options (above) with no errors, upon performing a
> "dim" check I receive the output: 0 93. The columns are correct, but rows
> (obviously) are not. (I did confirm that the desired value exists multiple
> times in the noted column, so 0 is definitely incorrect)
>
> As well, I would like to work with PRVDR_NUM as a variable alone, but I've
> found that with any of these variables/column names, I have to use
> "allinpatient$PRVDR_NUM." R does not recognize PRVDR_NUM alone. Why?

Different question: the short answer is that, unlike SAS/SPSS, R can take multiple data sets on at the same time, so you have to direct it to which one you want. If you want to save keystrokes in a line where you refer to a data set multiple times, you can use with(), e.g.,

DATS <- data.frame(x = 1:5, y = 1:5, z = 11:15)
DATS$x + DATS$y + DATS$z
with(DATS, x + y + z) # same

>
> More and more I think my problem is more foundational, meaning using the
> read.fwf function in the first place? Not using the read.fwf function
> correctly? Again, I've made enough progress with other variables & data
> sets of this type I've been fine so far, but now & future I need to repeat
> this code enough times where help in better understanding my errors & a more
> elegant/efficient solution would be greatly appreciated.

I think you're fine with the read.fwf() function -- though if .DAT is a common file format, someone else might have done the heavy lifting for you already. The definitive place to read all this is the R I/O manual -- http://cran.r-project.org/doc/manuals/R-data.html -- but it's not the easiest read.

>
> Also note that R does not read all 93 columns as factors. Why would R
> interpret this six-wide column as a factor, but the nine-wide column next
> door as numeric?

It has to do with what appear to be strings and what appear to be numbers (and that line is not where you may think) -- anything that is not totally unambiguously numeric becomes a string and, by default, strings become factors -- hence, many factors.

Michael

PS -- Thanks for showing what you've tried.
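
For readers following along, here is a minimal sketch of the conversion issue discussed above. The data frame is a made-up stand-in (the CLAIM_ID column, the provider numbers other than 050108, and the helper column PRVDR_NUM_n are all invented for illustration), and it assumes PRVDR_NUM really was read in as a factor whose labels keep the leading zero. It shows why the comparison against the bare literal 050108 can return zero rows, and why wrapping the conversion in format() undoes it.

```r
# Toy stand-in for the real data; all values here are invented.
oldinpatient <- data.frame(
  CLAIM_ID  = 1:5,
  PRVDR_NUM = factor(c("050108", "050108", "330101", "050108", "100007"))
)

# R parses the literal 050108 as the number 50108. Comparing a factor with
# a number compares the labels as text, and "050108" is not "50108".
n_literal <- nrow(oldinpatient[oldinpatient$PRVDR_NUM == 050108, ])   # 0

# Option A: compare against the label as a string.
by_label <- oldinpatient[oldinpatient$PRVDR_NUM == "050108", ]

# Option B: factor -> character -> numeric, then compare as numbers.
oldinpatient$PRVDR_NUM_n <- as.numeric(as.character(oldinpatient$PRVDR_NUM))
by_number <- subset(oldinpatient, PRVDR_NUM_n == 50108)

# format() returns character strings padded to a common width,
# so the numeric comparison stops matching again.
formatted   <- format(oldinpatient$PRVDR_NUM_n)
n_formatted <- sum(formatted == 50108)                                # 0
```

Both by_label and by_number pick out the same three rows here. Option A keeps the leading zero intact, which may matter if the provider number is really an identifier rather than a quantity.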
Hi Michael,
Thanks so much for your detailed reply! I gained a better understanding of the read.fwf function, along with ensuring I better note how these read-in functions convert variables, etc. As well, your tip on removing "format" while converting the PRVDR_NUM variable to numeric (from factor) is the ticket! Also, your reply aided with noting which code (of three options to subset data) is the "best" to use.

At this time I've been able to successfully output the data file at hand. Thanks again for your help!
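
One way to sidestep the factor conversion altogether is to tell read.fwf() which type each column should have when the file is read. The sketch below is only an illustration under assumed conditions: the two-column layout (six characters of provider number, nine of amount), the AMOUNT column name, and the file contents are invented, not taken from the poster's 93-column file. It relies on read.fwf() passing extra arguments such as colClasses through to read.table().

```r
# Write a tiny fixed-width file: 6 characters of provider number,
# then 9 characters of amount. Layout and values are invented.
tmp <- tempfile(fileext = ".DAT")
writeLines(c("050108000012345",
             "330101000000250",
             "050108000009900"), tmp)

# colClasses fixes the type of each column up front, so the provider
# number stays a character string and keeps its leading zero.
inpatient <- read.fwf(
  tmp,
  widths     = c(6, 9),
  col.names  = c("PRVDR_NUM", "AMOUNT"),
  colClasses = c("character", "numeric")
)
unlink(tmp)

# Subsetting is then a plain string comparison.
hits <- subset(inpatient, PRVDR_NUM == "050108")
```

With the column kept as character, no factor-to-numeric step is needed before subsetting, and the same call can be reused for each file of this type.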
