Getting codebook data into R

Getting codebook data into R

barny
I've been trying to get some data from the National Survey of Family Growth into R. However, the data is in a .dat file and the fields I need aren't separated by spaces or commas; instead, you have to look in the codebook to find which character positions along each line hold the data you need. The data I want are the following, where 1, 12, int means that the field starts in column 1, finishes in column 12 and is an integer.

            ('caseid', 1, 12, int),
            ('nbrnaliv', 22, 22, int),
            ('babysex', 56, 56, int),
            ('birthwgt_lb', 57, 58, int),
            ('birthwgt_oz', 59, 60, int),
            ('prglength', 275, 276, int),
            ('outcome', 277, 277, int),
            ('birthord', 278, 279, int),
            ('agepreg', 284, 287, int),
            ('finalwgt', 423, 440, float)

How can I do this using R? I've written a python programme which basically does it but it'd be nicer if I could skip the Python bit and just do it using R. Cheers for any help, here's the data file in question if anyone's interested:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/2002FemPreg.dat
           
Re: Getting codebook data into R

Peter Alspach-2
Tena koe

?read.fwf

HTH ....

Peter Alspach

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of barny
Sent: Friday, 10 February 2012 9:52 a.m.
To: [hidden email]
Subject: [R] Getting codebook data into R


--
View this message in context: http://r.789695.n4.nabble.com/Getting-codebook-data-into-R-tp4374331p4374331.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: Getting codebook data into R

Douglas Bates-2
In reply to this post by barny
On Thu, Feb 9, 2012 at 2:51 PM, barny <[hidden email]> wrote:

> How can I do this using R? I've written a python programme which basically
> does it but it'd be nicer if I could skip the Python bit and just do it
> using R. Cheers for any help.

?read.fwf

You should realize that read.fwf is not overly smart about how it does
things.  You may want to consider readLines to read each line as a
text string and then use substring to pull out the fields.
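
The readLines/substring approach could look roughly like the sketch below. The layout is the one from the original post; read_fixed() is a made-up helper name (not part of base R), and the record is a synthetic stand-in for one line of the real file, so treat this as an illustration under those assumptions rather than code tested against 2002FemPreg.dat.

```r
## Field layout from the original post: name, first column, last column
layout <- data.frame(
  name  = c("caseid", "nbrnaliv", "babysex", "birthwgt_lb", "birthwgt_oz",
            "prglength", "outcome", "birthord", "agepreg", "finalwgt"),
  begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
  end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440),
  stringsAsFactors = FALSE
)

## read_fixed() is a hypothetical helper: one numeric column per layout row
read_fixed <- function(lines, layout) {
  fields <- lapply(seq_len(nrow(layout)), function(i) {
    ## substring() is vectorised over the lines; blank fields become NA
    as.numeric(substring(lines, layout$begin[i], layout$end[i]))
  })
  names(fields) <- layout$name
  as.data.frame(fields)
}

## Synthetic 440-character record standing in for one line of the .dat file
rec <- strrep(" ", 440)
substr(rec, 1, 12)    <- sprintf("%12d", 1L)
substr(rec, 57, 58)   <- sprintf("%2d", 8L)
substr(rec, 59, 60)   <- "13"
substr(rec, 423, 440) <- sprintf("%18.6f", 6448.271)

dat <- read_fixed(rec, layout)
```

For the real data the call would be along the lines of read_fixed(readLines("2002FemPreg.dat"), layout), with the file name adjusted to wherever the download was saved.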

It's amazing how these old habits of storing data like this persist.
The reason for fixed-format records was that you couldn't read free
format in a Fortran program in a standard way before Fortran-77.  And
35 years afterwards we are still jumping through hoops to read
fixed-format records.  Sigh.

Re: Getting codebook data into R

Peter Dalgaard-2

On Feb 9, 2012, at 22:47 , Douglas Bates wrote:
>
> It's amazing how these old habits of storing data like this persist.
> The reason for fixed-format records was that you couldn't read free
> format in a Fortran program in a standard way before Fortran-77.  And
> 35 years afterwards we are still jumping through hoops to read
> fixed-format records.  Sigh.

Actually, I think it was more because data entry was widely done using 80-column punch cards until at least the mid-80s. So you had questionnaires filled out by hand, and keypunch operators typing them in. The latter were paid by keypress, and there was a general push to cram as much information onto the cards as possible.

Fortran had some role in it, but as far as I remember, even in the days of Hollerith constants there was nothing in the Fortran formats that prevented you from having spaces between your data columns. So the fixed-width field convention may originally have had more to do with people being expected to write their data in aligned columns (which, for proofreading, is actually not too bad an idea).

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: [hidden email]  Priv: [hidden email]

Re: Getting codebook data into R

David Winsemius
In reply to this post by barny

On Feb 9, 2012, at 3:51 PM, barny wrote:

>            ('caseid', 1, 12, int),
>             ('nbrnaliv', 22, 22, int),
>            ('babysex', 56, 56, int),
>            ('birthwgt_lb', 57, 58, int),
>            ('birthwgt_oz', 59, 60, int),
>            ('prglength', 275, 276, int),
>            ('outcome', 277, 277, int),
>            ('birthord', 278, 279, int),
>            ('agepreg', 284, 287, int),
>            ('finalwgt', 423, 440, float)

That's not the form in which read.fwf accepts a layout. It wants a
vector of field widths, so you will need to loop over those entries and
apply logic like:

vec  <- numeric(0)
nams <- character(0)
# for each entry (<name>, first, last):
getwidth <- last - first + 1
vec  <- c(vec, getwidth)
nams <- c(nams, <name>)
# gap between this field and the next one:
getblank <- first.next - last - 1
vec  <- c(vec, getblank)
nams <- c(nams, <junk-name>)

Then remove all the zero widths (and their junk names) and that will be
your vector of widths and your vector of col.names.


David Winsemius, MD
West Hartford, CT

Re: Getting codebook data into R

rmailbox
Or, you can do it the lazy way...

Download SPSS, errr... PSPP, at http://www.gnu.org/software/pspp/ and run
the SPSS code in which somebody else has already figured all that out,
to create an SPSS file. Then use one of the SPSS importing
libraries. Lately I've become partial to memisc, but there are several
choices here.

eRic



Re: Getting codebook data into R

Daniel Nordlund-4
In reply to this post by barny

I didn't have time at work to look at this, but here is one possible approach.  I did not look at how the code book file was actually structured; I just took what you presented above, cleaned it up a bit (like this)

'caseid',1,12,int
'nbrnaliv',22,22,int
'babysex',56,56,int
'birthwgt_lb',57,58,int
'birthwgt_oz',59,60,int
'prglength',275,276,int
'outcome',277,277,int
'birthord',278,279,int
'agepreg',284,287,int
'finalwgt',423,440,float

and copied it to the clipboard.  Then read it in using the following syntax

## read in data layout
codebook <- read.table('clipboard', sep=',', as.is=TRUE)

I will leave it to you to determine how you want to get the code book into your R session.  Having done this, one can compute the field widths and the number of columns to skip between fields, and then build a command to read in the data.  Something like this should get you started:

## get number of rows in code book
nr <- nrow(codebook)
## provide names for codebook layout data frame
names(codebook) <- c('variable','begin','end','type')

## compute number of columns to read (and skip) for each variable
## store in the vector read.col
# compute field widths
codebook$width <- codebook$end - codebook$begin + 1

# compute columns to skip between end of one field and
# beginning of next field
codebook$skip <- c(codebook$begin[-1]-codebook$end[-nr]-1,0)

## create zero length numeric vector for holding column widths
## (required by read.fwf) to read and skip, and populate the vector
read.col <- numeric()
for(i in 1:nr){
  read.col <- c(read.col,codebook$width[i])
  if(codebook$skip[i] > 0) read.col <- c(read.col,-codebook$skip[i])
}

## recode type values to R classes
codebook$Rtype <- ifelse(codebook$type %in% c('int','float'),'numeric', 'character')

## now read in the data
fwfdata <- read.fwf('c:/tmp/testpreg.txt', col.names=codebook$variable,
                     widths=read.col, colClasses=codebook$Rtype)


The code is clearly not bulletproof and there is no error checking, etc.  However, it does the job, provided the information you supplied is accurate.  If you wanted, you could wrap it all up in a function and pass the data filename and code book as parameters.
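
Wrapping the steps above in a function might look something like this sketch. The name read_codebook_fwf() is made up for illustration, and the handling of a first field that does not start in column 1 is an addition that the code above does not have. It assumes the code book is a data frame with columns variable, begin, end and type, sorted by begin and with no overlapping fields.

```r
## read_codebook_fwf() is a hypothetical wrapper, not an existing function.
## `codebook` needs columns variable, begin, end and type, sorted by begin.
read_codebook_fwf <- function(file, codebook, ...) {
  nr <- nrow(codebook)
  width <- codebook$end - codebook$begin + 1
  ## gap between the end of each field and the start of the next
  skip <- c(codebook$begin[-1] - codebook$end[-nr] - 1, 0)
  stopifnot(all(width > 0), all(skip >= 0))
  ## columns before the first field have to be skipped as well
  lead <- codebook$begin[1] - 1
  ## interleave widths and negative skips, then drop zero-length entries
  widths <- c(-lead, rbind(width, -skip))
  widths <- widths[widths != 0]
  classes <- ifelse(codebook$type %in% c("int", "float"),
                    "numeric", "character")
  read.fwf(file, widths = widths, col.names = codebook$variable,
           colClasses = classes, ...)
}

## Small synthetic layout and data file, used only to exercise the function
codebook <- data.frame(
  variable = c("id", "sex", "wgt", "grp"),
  begin    = c(1, 6, 7, 12),
  end      = c(3, 6, 10, 14),
  type     = c("int", "int", "float", "str"),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".dat")
writeLines(c("001--1 7.5-abc",
             "002--212.0-xyz"), tmp)
out <- read_codebook_fwf(tmp, codebook)
```

Extra arguments such as n = 10 are passed through to read.fwf, which is handy for checking the first few records before reading the whole file.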


Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA

Re: Getting codebook data into R

barny
In reply to this post by rmailbox
Hi Eric - after seeing the difficulty of inputting this kind of data into R I decided to use your method. It was rather painless using PSPP to do what I wanted - however, how do I now create an SPSS file and then use the memisc package to read it in?
Reply | Threaded
Open this post in threaded view
|

Re: Getting codebook data into R

Daniel Nordlund-4
> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> On Behalf Of barny
> Sent: Saturday, February 11, 2012 10:04 AM
> To: [hidden email]
> Subject: Re: [R] Getting codebook data into R
>
> Hi Eric - after seeing the difficulty of inputting this kind of data into
> R I
> decided to use your method. It was rather painless using PSPP to do what I
> wanted - however, how do I now create an SPSS file and then use the memisc
> package to read it in?
>

There is SPSS code for reading the files on the codebook page

http://www.cdc.gov/nchs/nsfg/nsfg_2006_2010_puf.htm#codebooks

hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA

Re: Getting codebook data into R

rmailbox
In reply to this post by barny
This is how I get whole SPSS data files into R. You specifically asked about the codebook, so this may not be exactly what you are after.

library(memisc)
spssFileInfo <- spss.system.file(file = "path to my SPSS file")
spssDataSet <- as.data.set(spssFileInfo)
spssDataFrame <- as.data.frame(spssDataSet)

(Not tested. Adapted from working code.)

memisc documentation has more info about doing this and how it works.

eRic




Re: Getting codebook data into R

Chris Stubben
In reply to this post by Daniel Nordlund-4
Just to follow up on Dan's code - once you have a data.frame listing column positions, then it's just a couple steps to download the file...

x <- data.frame(name=c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb','birthwgt_oz','prglength',
'outcome', 'birthord',  'agepreg',  'finalwgt'),
begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
end =  c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
)


x$width <- x$end - x$begin + 1
x$skip <-  (-c(x$begin[-1]-x$end[-nrow(x)]-1,0))

widths <- c(t(x[,4:5]))
widths <- widths[widths!=0]

ftp<- "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/2002FemPreg.dat"
# drop the n=10 option to get all lines
y<- read.fwf(ftp, widths, n=10)
names(y) <- x$name
y
   caseid nbrnaliv babysex birthwgt_lb birthwgt_oz prglength outcome birthord agepreg  finalwgt
1       1        1       1           8          13        39       1        1    3316  6448.271
2       1        1       2           7          14        39       1        2    3925  6448.271
3       2        3       1           9           2        39       1        1    1433 12999.542
4       2        1       2           7           0        39       1        2    1783 12999.542
5       2        1       2           6           3        39       1        3    1833 12999.542
6       6        1       1           8           9        38       1        1    2700  8874.441
7       6        1       2           9           9        40       1        2    2883  8874.441
8       6        1       2           8           6        42       1        3    3016  8874.441
9       7        1       1           7           9        39       1        1    2808  6911.880
10      7        1       2           6          10        35       1        2    3233  6911.880


Chris Stubben


