

I've been trying to get some data from the National Survey of Family Growth into R. However, the data is in a .dat file and the fields I need aren't separated by spaces or commas; instead, you have to look in the codebook to find which character positions along each line hold the data you need. The data I want are the following, where 1, 12, int means that the field I'm interested in starts in column 1, finishes in column 12 and is an integer.
('caseid', 1, 12, int),
('nbrnaliv', 22, 22, int),
('babysex', 56, 56, int),
('birthwgt_lb', 57, 58, int),
('birthwgt_oz', 59, 60, int),
('prglength', 275, 276, int),
('outcome', 277, 277, int),
('birthord', 278, 279, int),
('agepreg', 284, 287, int),
('finalwgt', 423, 440, float)
How can I do this using R? I've written a Python programme which basically does it, but it'd be nicer if I could skip the Python bit and just do it in R. Cheers for any help. Here's the data file in question if anyone's interested:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/2002FemPreg.dat


Tena koe (greetings)
?read.fwf
HTH ....
Peter Alspach
View this message in context: http://r.789695.n4.nabble.com/Getting-codebook-data-into-R-tp4374331p4374331.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


On Thu, Feb 9, 2012 at 2:51 PM, barny < [hidden email]> wrote:
> How can I do this using R? I've written a python programme which basically
> does it but it'd be nicer if I could skip the Python bit and just do it
> using R. Cheers for any help.
?read.fwf
You should realize that read.fwf is not overly smart about how it does
things. You may want to consider readLines to read each line as a
text string and then use substring to pull out the fields.
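For what it's worth, the readLines/substring route could be sketched roughly as below. This is only an illustration under stated assumptions: read_fixed is a made-up helper name, every field is treated as numeric, and the two-line toy file is invented so the example runs without the real data. The fields layout is copied from the original post.

```r
# Field layout copied from the original post (name, first column, last column).
fields <- data.frame(
  name  = c("caseid", "nbrnaliv", "babysex", "birthwgt_lb", "birthwgt_oz",
            "prglength", "outcome", "birthord", "agepreg", "finalwgt"),
  begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
  end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of R): read the whole file as text and cut
# each field out of every line with substring().
read_fixed <- function(path, layout) {
  lines <- readLines(path)
  cols <- lapply(seq_len(nrow(layout)), function(i) {
    # substring() is vectorised over the lines; as.numeric() copes with
    # padding blanks, and an all-blank field becomes NA.
    as.numeric(substring(lines, layout$begin[i], layout$end[i]))
  })
  names(cols) <- layout$name
  as.data.frame(cols)
}

# Invented toy layout and data, only to exercise the helper:
# id in columns 1-3, sex in column 6, wgt in columns 7-10.
toy_fields <- data.frame(
  name  = c("id", "sex", "wgt"),
  begin = c(1, 6, 7),
  end   = c(3, 6, 10),
  stringsAsFactors = FALSE
)
tmp <- tempfile()
writeLines(c("001xx1 8.5", "002xx2 7.0"), tmp)
toy <- read_fixed(tmp, toy_fields)
unlink(tmp)
print(toy)

# With the real file downloaded locally, the call would presumably be:
# preg <- read_fixed("2002FemPreg.dat", fields)
```

Because nothing is written to a temporary file and re-parsed, this tends to be easier to reason about than read.fwf, at the cost of handling the type conversion yourself.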
It's amazing how these old habits of storing data like this persist.
The reason for fixed-format records was that you couldn't read free
format in a Fortran program in a standard way before Fortran-77. And
35 years afterwards we are still jumping through hoops to read
fixed-format records. Sigh.


On Feb 9, 2012, at 22:47 , Douglas Bates wrote:
>
> It's amazing how these old habits of storing data like this persist.
> The reason for fixed-format records was that you couldn't read free
> format in a Fortran program in a standard way before Fortran-77. And
> 35 years afterwards we are still jumping through hoops to read
> fixed-format records. Sigh.
Actually, I think it was more because data entry was widely done using 80-column punch cards until at least the mid-80s. So you had questionnaires filled out by hand, and keypunch operators typing them in. The latter were paid by keypress, and there was a general push to cram as much information onto the cards as possible.
Fortran had some role in it, but as far as I remember, even in the days of Hollerith constants there was nothing in the Fortran formats that prevented you from having spaces between your data columns. So the fixed-width field convention may originally have had more to do with people being expected to write their data in aligned columns (which, for proofreading, is actually not too bad an idea).

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email] Priv: [hidden email]


On Feb 9, 2012, at 3:51 PM, barny wrote:
> ('caseid', 1, 12, int),
> ('nbrnaliv', 22, 22, int),
> ('babysex', 56, 56, int),
> ('birthwgt_lb', 57, 58, int),
> ('birthwgt_oz', 59, 60, int),
> ('prglength', 275, 276, int),
> ('outcome', 277, 277, int),
> ('birthord', 278, 279, int),
> ('agepreg', 284, 287, int),
> ('finalwgt', 423, 440, float)
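That's not the way read.fwf is set up to accept data: it wants a vector of
field widths, not start and end columns. You will need to loop over that
input and apply logic like the following, where first, last and name stand
for vectors holding the start column, end column and name of each field:
vec <- numeric(0)
nams <- character(0)
n <- length(first)
for (i in seq_len(n)) {
  getwidth <- last[i] - first[i] + 1      # width of the wanted field
  vec <- c(vec, getwidth)
  nams <- c(nams, name[i])
  # width of the unwanted gap before the next field (0 after the last one)
  getblank <- if (i < n) first[i + 1] - last[i] - 1 else 0
  vec <- c(vec, getblank)
  nams <- c(nams, paste0("junk", i))      # placeholder name for the gap
}
nams <- nams[vec != 0]
vec <- vec[vec != 0]
The last two lines remove all the zeros, and that will be your vector of
widths and your vector of col.names.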
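As an aside, read.fwf also accepts negative widths, which it treats as columns to skip, so the filler columns need not be read and named at all. A small sketch of that variant, assuming an invented three-field layout and a two-line toy file (neither comes from the NSFG codebook):

```r
# Toy layout, made up for illustration: id in columns 1-3, sex in column 6,
# wgt in columns 7-10; columns 4-5 are unwanted.
first <- c(1, 6, 7)
last  <- c(3, 6, 10)
name  <- c("id", "sex", "wgt")

width <- last - first + 1
# Gap between the end of one field and the start of the next
# (nothing to skip after the last field).
gap <- c(first[-1] - last[-length(last)] - 1, 0)

# Interleave width, -gap, width, -gap, ... and drop the zero-length gaps.
widths <- c(rbind(width, -gap))
widths <- widths[widths != 0]

tmp <- tempfile()
writeLines(c("001xx1 8.5", "002xx2 7.0"), tmp)
toy <- read.fwf(tmp, widths = widths, col.names = name)
unlink(tmp)

print(widths)
print(toy)
```

Since the skipped columns never reach the data frame, col.names only has to list the fields that are actually wanted.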
David Winsemius, MD
West Hartford, CT


Or, you can do it the lazy way...
Download spss errr... pspp at http://www.gnu.org/software/pspp/ and run
the spss code in which somebody else already figured all that out,
to create an spss file. Then use one of the spss importing
libraries. Lately I've become partial to memisc, but there are several
choices here.
eRic


> Original Message
> From: [hidden email] [mailto: [hidden email]]
> On Behalf Of barny
> Sent: Thursday, February 09, 2012 12:52 PM
> To: [hidden email]
> Subject: [R] Getting codebook data into R
>
> ('caseid', 1, 12, int),
> ('nbrnaliv', 22, 22, int),
> ('babysex', 56, 56, int),
> ('birthwgt_lb', 57, 58, int),
> ('birthwgt_oz', 59, 60, int),
> ('prglength', 275, 276, int),
> ('outcome', 277, 277, int),
> ('birthord', 278, 279, int),
> ('agepreg', 284, 287, int),
> ('finalwgt', 423, 440, float)
I didn't have time at work to look at this, but here is one possible approach. I did not look at how the code book file was actually structured; I just took what you presented above, cleaned it up a bit (like this)
'caseid',1,12,int
'nbrnaliv',22,22,int
'babysex',56,56,int
'birthwgt_lb',57,58,int
'birthwgt_oz',59,60,int
'prglength',275,276,int
'outcome',277,277,int
'birthord',278,279,int
'agepreg',284,287,int
'finalwgt',423,440,float
and copied it to the clipboard. Then read it in using the following syntax
## read in data layout
codebook <- read.table('clipboard', sep=',', as.is=TRUE)
I will leave it to you to determine how you want to get the code book into your R session. Having done this, one can compute the field widths and the number of columns to skip between fields, and then build a command to read in the data. Something like this should get you started
## get number of rows in code book
nr <- nrow(codebook)
## provide names for codebook layout data frame
names(codebook) <- c('variable','begin','end','type')
## compute number of columns to read (and skip) for each variable
## store in the vector read.col
# compute field widths
codebook$width <- codebook$end - codebook$begin + 1
# compute columns to skip between end of one field and
# beginning of next field
codebook$skip <- c(codebook$begin[-1] - codebook$end[-nr] - 1, 0)
## create zero length numeric vector for holding column widths
## (required by read.fwf) to read and skip, and populate the vector;
## a negative width tells read.fwf to skip that many columns
read.col <- numeric()
for(i in 1:nr){
  read.col <- c(read.col, codebook$width[i])
  if(codebook$skip[i] > 0) read.col <- c(read.col, -codebook$skip[i])
}
## recode type values to R classes
codebook$Rtype <- ifelse(codebook$type %in% c('int','float'), 'numeric', 'character')
## now read in the data
fwfdata <- read.fwf('c:/tmp/testpreg.txt', col.names=codebook$variable,
                    widths=read.col, colClasses=codebook$Rtype)
The code is clearly not bulletproof and there is no error checking, etc. However, it does the job, provided the information you gave is accurate. If you wanted, you could wrap it all up in a function and pass the data filename and code book name as parameters.
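A rough sketch of what such a wrapper might look like, with a little error checking added. The function name read_with_codebook, the toy code book and the toy data file are all invented for illustration; the only assumption carried over from the code above is a code book data frame with columns variable, begin, end and type.

```r
# Hypothetical wrapper; name and argument layout are illustrative only.
# `codebook` needs the columns variable, begin, end and type.
read_with_codebook <- function(datafile, codebook, ...) {
  stopifnot(all(c("variable", "begin", "end", "type") %in% names(codebook)))
  codebook <- codebook[order(codebook$begin), ]
  nr <- nrow(codebook)
  if (any(codebook$end < codebook$begin))
    stop("a field ends before it begins")
  if (nr > 1 && any(codebook$begin[-1] <= codebook$end[-nr]))
    stop("fields overlap")

  width <- codebook$end - codebook$begin + 1
  # Columns to skip *before* each field, so a layout that does not start
  # in column 1 is handled as well.
  skip <- codebook$begin - c(0, codebook$end[-nr]) - 1
  # Interleave -skip, width, -skip, width, ...; negative widths are skipped
  # by read.fwf, and zero-length skips are dropped.
  widths <- c(rbind(-skip, width))
  widths <- widths[widths != 0]

  classes <- ifelse(codebook$type %in% c("int", "float"),
                    "numeric", "character")
  read.fwf(datafile, widths = widths, col.names = codebook$variable,
           colClasses = classes, ...)
}

# Made-up code book and two-line data file, just to exercise the function.
toy_codebook <- data.frame(
  variable = c("id", "sex", "wgt", "code"),
  begin    = c(1, 6, 7, 11),
  end      = c(3, 6, 10, 12),
  type     = c("int", "int", "float", "str"),
  stringsAsFactors = FALSE
)
tmp <- tempfile()
writeLines(c("001xx1 8.5ab", "002xx2 7.0cd"), tmp)
toy <- read_with_codebook(tmp, toy_codebook)
unlink(tmp)
str(toy)
```

Extra arguments such as n = 10 are passed straight through to read.fwf, which is handy for checking the first few records before reading a large file.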
Hope this is helpful,
Dan
Daniel Nordlund
Bothell, WA USA


Hi Eric - after seeing the difficulty of inputting this kind of data into R I decided to use your method. It was rather painless using PSPP to do what I wanted - however, how do I now create an SPSS file and then use the memisc package to read it in?


Just to follow up on Dan's code - once you have a data.frame listing column positions, then it's just a couple of steps to download the file...
x <- data.frame(name=c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb','birthwgt_oz','prglength',
'outcome', 'birthord', 'agepreg', 'finalwgt'),
begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
end = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
)
x$width <- x$end - x$begin + 1
# negative values mark the columns to skip between fields
x$skip <- (-c(x$begin[-1] - x$end[-nrow(x)] - 1, 0))
# interleave width and skip, then drop the zero-length skips
widths <- c(t(x[,4:5]))
widths <- widths[widths!=0]
ftp <- "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/2002FemPreg.dat"
# drop the n=10 option to get all lines
y <- read.fwf(ftp, widths, n=10)
names(y) <- x$name
y
   caseid nbrnaliv babysex birthwgt_lb birthwgt_oz prglength outcome birthord agepreg  finalwgt
1       1        1       1           8          13        39       1        1    3316  6448.271
2       1        1       2           7          14        39       1        2    3925  6448.271
3       2        3       1           9           2        39       1        1    1433 12999.542
4       2        1       2           7           0        39       1        2    1783 12999.542
5       2        1       2           6           3        39       1        3    1833 12999.542
6       6        1       1           8           9        38       1        1    2700  8874.441
7       6        1       2           9           9        40       1        2    2883  8874.441
8       6        1       2           8           6        42       1        3    3016  8874.441
9       7        1       1           7           9        39       1        1    2808  6911.880
10      7        1       2           6          10        35       1        2    3233  6911.880
Chris Stubben

