Getting information encoded in a SAS, SPSS or Stata command file into R.


andrewH
Dear folks –
I have a large (26 gig) ASCII flat file in fixed-width format with about 10 million observations of roughly 400 variables.  (It is 51 years of Current Population Survey microdata from IPUMS, roughly half the fields for each record.)  The file was produced by an automated process in response to a data request of mine.

The file is not accompanied by a human-readable file giving the field names and starting positions for each field.  Instead it comes with three command files that describe the file, one each for SAS, SPSS, and Stata. I do not have ready access to any of these programs.  I understand that these files also include the equivalent of the levels attribute for the coded data.  I might be able to hand-extract the information I need from the command files, but this would involve days of tedious work that I am hoping to avoid.

I have read through the R Data Import/Export manual and the foreign package documentation, and I do not see anything that would allow me to extract the necessary information from these command files. Does anyone know of any R package or other non-proprietary tool that would allow me to get this data set from its current form into any of the following formats?
SAS, SPSS, or Stata binary files readable by R.
A MySQL database.
An ffdf object readable using the ff package.

My ultimate goal is to get the data into an ffdf object so that I can manipulate it in R, perhaps by way of a database. In any given analysis I will probably be using no more than 20 variables at a time, probably a bit under a gig. I am working on a machine with three gigs of RAM.

(I have seen some suggestions that data.table also provides memory-efficient, database-like operations, but I am unsure whether it would let me cope with an object of this size.)

Any help or suggestions anyone could offer would be very much appreciated.

Warmest regards, andrewH

Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

Jan
Hi,

If your objective is to get the data into an ffdf, I suggest you look at the SAS/SPSS/Stata code to see where each column starts. Next, try out the LaF package, which allows you to read in large fixed-width format files. Once you have that up and running, you can use the laf_to_ffdf function in the ffbase package, which works well with LaF and lets you import the flat file directly into an ffdf for further processing.
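A minimal sketch, assuming you have already transcribed a few column names, widths, and types from the command file (the values below are invented):

library(LaF)
library(ffbase)

# open the fixed-width file lazily; nothing is read into RAM yet
dat <- laf_open_fwf("cps.dat",
                    column_types  = c("integer", "categorical", "integer"),
                    column_widths = c(4, 2, 2),
                    column_names  = c("year", "region", "age"))

# laf_to_ffdf() streams the file block-wise into an ffdf, so the
# 26-gig file never has to fit in memory at once
cps <- laf_to_ffdf(dat)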

hope that helps,
Jan

Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

Ista Zahn
In reply to this post by andrewH
Hi Andrew,

You may be able to run the SPSS syntax file using GNU PSPP
(http://www.gnu.org/software/pspp/)
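Something like this might work (untested; the syntax file name is hypothetical, and CSV export needs a reasonably recent PSPP):

# append an export step to the SPSS syntax file so PSPP writes a csv
cat('SAVE TRANSLATE /OUTFILE="cps.csv" /TYPE=CSV /FIELDNAMES.\n',
    file = "cps_00001.sps", append = TRUE)

# run the syntax non-interactively; PSPP reads the fixed-width
# file and writes cps.csv, which R can then import
system("pspp cps_00001.sps")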

Best,
Ista


Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

ajdamico
In reply to this post by andrewH
Hi Andrew, to work with the Current Population Survey in R, your best
bet is to use a variant of my SAScii package that works with a SQLite
database (and therefore doesn't overload RAM).

I have written obsessively-documented code about how to work with the CPS
in R here..

http://usgsd.blogspot.com/search/label/current%20population%20survey%20%28cps%29

..but that example only loads one year of data at a time.  The function
read.SAScii.sqlite() used in that code can be run on a 51-year data set
just the same.
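The CRAN side of it looks roughly like this (file names invented; read.SAScii.sqlite() itself is defined in those usgsd scripts, not in the CRAN package):

library(SAScii)

# parse.SAScii() pulls each variable's name and width out of the
# INPUT block of the SAS script that ships with the extract
layout <- parse.SAScii("cps_00001.sas")
head(layout)

# for a single year that fits in RAM, read.SAScii() reads the ASCII
# file directly into a data.frame using that same layout
one.year <- read.SAScii("cps_year.dat", "cps_00001.sas")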

If you need to generate standard errors, confidence intervals, or
variances, I don't recommend using ffdf for complex sample surveys -- in my
experience it doesn't work well with R's survey package.

These scripts use the Census Bureau version of the CPS, but you can make
some slight changes and get it working on IPUMS files too..  Let me know if
you run into any trouble.  :)

Anthony




Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

David Winsemius


I'd like to take this opportunity to thank Anthony for his work on this dataset as well as on several others. The ones I am most interested in are the NHANES-III and Continuous NHANES datasets, and he has the 2009-2010 set from the Continuous NHANES series represented in his examples. Scraping the list of datasets from his website:

available data

        • area resource file (arf) (1)
        • consumer expenditure survey (ce) (1)
        • current population survey (cps) (1)
        • general social survey (gss) (1)
        • national health and nutrition examination survey (nhanes) (1)
        • national health interview survey (nhis) (1)
        • national study of drug use and health (nsduh) (1)

And thanks to you for this question, andrewH;

... it prompted a response from Jan pointing to the LaF package by Jan van der Laan, which had subsequent links (via a reverse-Depends citation) to the SEERaBomb package by Tomas Radivoyevitch that provides examples of handling the SEER datasets, at least the hematologic tumors dataset. My experience with SEER data in the past has been entirely mediated through SEER*Stat, which is a (somewhat) user-friendly Windows package for working with the SEER fixed-field formats, but it will be exciting to see another accessible avenue through R.

Thanks, Anthony, Jan, and andrewH, and further thanks to Thomas Lumley, on whose work I believe Anthony's package Depends because of the need for proper handling of the sampling weights.



David Winsemius, MD
Alameda, CA, USA


Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

andrewH
In reply to this post by ajdamico
Wow!  After reading Jan's post, I said "Great, I'll do that," because it was the closest to what I originally had in mind. Then I read Ista's post, and said "I think I'll try that first," because it got me back on the track of following directions in the R Data Import/Export manual. Then I read Anthony's post. Now I am not so thrilled to go the database route, because frankly I have hardly ever used databases before, and this would make an already complex project take longer.

But I know that I will need to use the survey package for what I am trying to do. So I think I am going to try to get the data into SQLite format, and just hope the effort builds character.  Anthony, I have not used your packages yet, but they look great!

It will probably be more than a week before I get all this worked out and implemented. Given how much work this will be, I do not want to do it twice, so I think I will go back to IPUMS, get the rest of the variables, and break the file up into smaller chunks at the same time, both so I really have the whole thing and so that it is easier to work with.  The IPUMS version of the file is rectangular (it duplicates the household data in each individual record), and IPUMS has done a lot of valuable work in cleaning the data and harmonizing variable names and definitions that have changed over the history of the CPS. (Annoyingly, however, they have not connected the cross-sections between years. All the CPS samples consist of two sets of four consecutive months, eight months apart, so the March Supplement always consists half of people who were interviewed in the previous year and half of people who will be interviewed in the next year, barring turnover.)

Anyway, when I have figured out my route to import I will report back here. In the meantime, I have three more questions that one of you may be able to answer:
1. Anthony, does the read.SAScii.sqlite function preserve the label names for factors in a data frame it imports into SQLite, when those labels are coded in the command file?
2. If I want to make the resulting SQLite database available to the R community, is there a good place for me to put it? Assume it is 10-20 gigs in size.  Ideally, it would be set up so that it could be queried remotely and extracts downloaded. Setting this up is beyond my competence today, but maybe not in a couple of months.  (I'd like to do the same thing with the 30 years of Consumer Expenditure Survey data I have. I don't have access to SAS any more, but I converted it all to flat files while I still did. Currently the BLS only makes 2011 microdata available free. Earlier years on CD are $200/year. But they have told me that they have no objection to my making them available.)
3. I have not yet been able to determine whether CPS microdata from the period 1940-1961 exists. Does anyone know? It is not on http://thedataweb.rm.census.gov/ftp/cps_ftp.html, and IPUMS and NBER (http://www.nber.org/data/current-population-survey-data.html) both only give data back to 1962. I wrote to Census a week ago, but I have not heard back from them, and in the past they have not been very helpful about historical microdata.

Thanks to all! Andrew

Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

ajdamico
Hi Andrew, great to hear from you  :)

You really ought to review the (100% R-specific) US Government Survey
Datasets already available at http://usgsd.blogspot.com/ and contact me
directly if you hit a problem -- I am furiously working on a few right now
(ACS, SIPP, BSAPUFs, BRFSS, MEPS), and am open to focusing on others if
there's good reason.


> 1. Anthony, does the read.SAScii.sqlite function preserve the label names
> for factors in a data frame it imports into SQLite, when those labels are
> coded in the command file?

SAScii doesn't touch labels.  I haven't really worried about them, but I'd
consider your suggestions about how to incorporate them into the package.



> 2.   If I want to make the resulting SQLite database available to the R
> community, is there a good place for me to put it? Assume it is 10-20 gigs
> in size.  Ideally, it would be set up so that it could be queried remotely
> and extracts downloaded. Setting this up is beyond my competence today, but
> maybe not in a couple of months.



I don't recommend this..  it probably violates the IPUMS terms of use.
Besides, it's not very hard for individuals to get IPUMS data into R.  See
the ?read.SAScii example at the bottom of page 12 of
http://cran.r-project.org/web/packages/SAScii/SAScii.pdf.

That said, I wouldn't use SAScii for IPUMS data myself.  IPUMS
recently started allowing extracts to be downloaded as csvs, which means
analysts have many more options than just read.SAScii() for smaller data
sets and read.SAScii.sqlite() for larger ones -- read.csv() and the sqldf
package's read.csv.sql(), for example.

SAScii is just a giant workaround big enough for its own R package.
read.csv() and read.csv.sql() are much more developed, de-bugged, and
widely used.

If your computer has enough RAM to hold the IPUMS file (with replicate
weights -- which generally double the file size), skip SQLite altogether.
The example CPS code on usgsd uses SQLite so that it works on computers
with as little as 4GB of RAM, but it should be easy for you to alter that
code and skip the database components altogether.  Storing each year of
CPS data as an R data file (.rda) will speed everything up -- roughly like
the sketch below.
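A minimal sketch of that csv route (the file name and variable names here are invented, not from a real extract):

library(sqldf)

# read.csv.sql() stages the csv in a temporary SQLite database and
# pulls only the selected columns/rows into RAM
cps79 <- read.csv.sql(
    "cps_00001.csv",
    sql    = "select YEAR, AGE, SEX, INCWAGE from file where YEAR = 1979",
    dbname = tempfile()
)

# once a year is in memory, an .rda file reloads much faster than re-parsing
save(cps79, file = "cps79.rda")
# load("cps79.rda")   # in later sessions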


> (I'd like to do the same thing with the 30 years of Consumer Expenditure
> Survey data I have. I don't have access to SAS any more, but I converted
> it all to flat files while I still did. Currently the BLS only makes 2011
> microdata available free. Earlier years on CD are $200/year. But they
> have told me that they have no objection to my making them available.)

You might want to browse around the rest of
http://usgsd.blogspot.com/ before re-doing stuff; I have already done
that for 2011.  ;)

Getting the Consumer Expenditure Survey working properly in R was pretty
challenging.  But it's done now, with boatloads of detailed comments, and
nobody ever has to do it again..

http://usgsd.blogspot.com/search/label/consumer%20expenditure%20survey%20%28ce%29


BLS is slowly releasing the public use microdata on their own, so I am
waiting till they get back to 1996.  That way, everything is reproducible
-- everyone starts from the same BLS files, matches the same BLS
publications (to confirm the methodology is sound), and starts their own
analyses from the same complex sample survey design object.

They talk about their public release schedule on
http://www.bls.gov/cex/pumdhome.htm#online.  If you can't wait for them to
release it, you'll still probably want to link to the CE code I've
written.  Creating that survey object was tough stuff.




> 3. I have not yet been able to determine whether CPS microdata from the
> period 1940-1961 exists. Does anyone know? It is not on
> http://thedataweb.rm.census.gov/ftp/cps_ftp.html, and IPUMS and NBER
> (http://www.nber.org/data/current-population-survey-data.html) both only
> give data back to 1962. I wrote to Census a week ago, but I have not heard
> back from them, and in the past they have not been very helpful about
> historical microdata.

idk, sorry.  But Census is generally very responsive -- ping 'em in another
week.  :)


Good luck and keep in touch!


Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

andrewH
Dear Anthony –

On closer examination, what I am talking about is not factor levels, but something different (though analogous). All of the categorical data has integer codes, so the file is entirely numeric. The SAS PROC FORMAT then assigns a text string to each code of each categorical variable, like this:

value REGION_f
  11 = "New England Division"
  12 = "Middle Atlantic Division"
  21 = "East North Central Division"
  22 = "West North Central Division"
  31 = "South Atlantic Division"
  32 = "East South Central Division"
  33 = "West South Central Division"
  41 = "Mountain Division"
  42 = "Pacific Division"
  97 = "State not identified"

So it would make sense to have a lookup table of these codes linked to the variables. I’m not sure whether it makes more sense for that table to live in R or in the database. For R purposes, I imagine it would make sense to convert these integer-valued variables into factors, along the lines sketched below.
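For instance, taking the REGION codes above (and inventing a few data values), the conversion I have in mind would look like this:

# lookup table transcribed from the PROC FORMAT block above
region.codes  <- c(11, 12, 21, 22, 31, 32, 33, 41, 42, 97)
region.labels <- c("New England Division", "Middle Atlantic Division",
                   "East North Central Division", "West North Central Division",
                   "South Atlantic Division", "East South Central Division",
                   "West South Central Division", "Mountain Division",
                   "Pacific Division", "State not identified")

# 'region' stands in for the integer-coded column read from the flat file
region <- c(11, 97, 42, 31)

# factor() maps each code to its label in one step
region.f <- factor(region, levels = region.codes, labels = region.labels)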

What I do not understand is how SAS knows where the variables begin and end. I managed to break off a little hunk of the beginning of my file and look at it in an editor, and it is numbers without any obvious delimiters. Is the delimiter a particular numeric string? I thought the SAS command file would contain the starting location for each of the fixed-length fields, but I do not see anything in the file that could be interpreted that way – just a little wraparound code and then a long list of variable names followed by triplets of a code, an equals sign, and a text string, terminating with a semicolon.

I’m sorry if I am being obtuse. When I said before that I had saved the SAS files as flat files, what I really meant was that I had an intern do it. When I was doing my own analysis, I mainly used TSP, before I switched to R about a year ago. I’ve never used SAS.

I find your data project very interesting.  Very.  It is not actually necessary to wait for BLS to release the older CEX files if you can lay your hands on the CDs. I spoke to the BLS data products office about two years ago, and they have no problem with people republishing purchased data in any format they like, including simple duplication.  In fact, they seemed to like the idea.  I think the sale of data was forced on them by some kind of mandate from above.

I'll be playing with your code (which is a model of readability, and a lesson to me on same, BTW) and keep you posted on my progress.

Warmly, Andrew

Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

David Winsemius

On Nov 14, 2012, at 2:33 PM, andrewH wrote:

> Dear Anthony –
>
> On closer examination, what I am talking about is not factor levels, but
> something different (but analogous). The data that is categorical all has
> integer codes, so the file is entirely numeric. The SAS proc format then
> gives text strings for each code for each categorical variable. Like this:
>
> value REGION_f
>  11 = "New England Division"
>  12 = "Middle Atlantic Division"
>  21 = "East North Central Division"
>  22 = "West North Central Division"
>  31 = "South Atlantic Division"
>  32 = "East South Central Division"
>  33 = "West South Central Division"
>  41 = "Mountain Division"
>  42 = "Pacific Division"
>  97 = "State not identified"

There will be a semi-colon to mark the end of the <integer = quoted-values> pairs.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473472.htm

I agree it might be nice to have a function that would take this and use match(), something like:

newfac <- val_strings[ match(REGION_f, convtbl) ]

###----code----#
# read the code/label pairs straight from the PROC FORMAT text;
# sep="=" splits each line into an integer code and a quoted label
conv <- read.table(text = '11 = "New England Division"
  12 = "Middle Atlantic Division"
  21 = "East North Central Division"
  22 = "West North Central Division"
  31 = "South Atlantic Division"
  32 = "East South Central Division"
  33 = "West South Central Division"
  41 = "Mountain Division"
  42 = "Pacific Division"
  97 = "State not identified"', sep = "=", stringsAsFactors = FALSE)

# name the columns and drop the stray blanks left around the separator
names(conv) <- c("code", "label")
conv$label <- gsub("^ +| +$", "", conv$label)

# look up arbitrary codes against the table
conv$label[ match(c(11, 97, 42, 31), conv$code) ]
# [1] "New England Division"    "State not identified"
# [3] "Pacific Division"        "South Atlantic Division"

To pretty it up further, readLines() could be set up to pull everything between "value" and the next semi-colon.


>
> So it would make sense to have a lookup table of these codes linked to the
> variables. I’m not sure if it makes more sense to have that table live in R
> or in the database. For R purposes, I imagine it would make sense to convert
> these integer-valued variables into factors.
>
> What I do not understand is how SAS knows where the variables begin and end.
> I managed to break off a little hunk of the beginning of my file and look at
> it in an editor, and it is numbers without any obvious delimiters. Is the
> delimiter a particular numeric string?

Probably a fixed-field format: each variable occupies a fixed range of columns in every record, so the positions themselves do the delimiting and no delimiter characters are needed.
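A tiny illustration (the widths are invented):

# columns 1-4 = year, 5-6 = region, 7-8 = age; no delimiters anywhere
txt <- c("20121142", "20129733")
read.fwf(textConnection(txt), widths = c(4, 2, 2),
         col.names = c("year", "region", "age"))
#   year region age
# 1 2012     11  42
# 2 2012     97  33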


> I thought the SAS command file would
> contain the starting location for each of the fixed-length fields, but I do
> not see anything in the file that could be interpreted that way – just a
> little wraparound code and then a long list of variable names followed by
> triplets of a code, an equals sign, and a text string, terminating with a
> semicolon.
>
Exactly

> I’m sorry if I am being obtuse. When I said before that I had saved the SAS
> files as flat files, what I really meant was that I had an intern do it.
> When I was doing my own analysis, I mainly used TSP, before I switched to R
> about a year ago. I’ve never used SAS.
>
> I find your data project very interesting.  Very.   It is not actually
> necessary to wait for BLS to release the older CEX files, if you can lay
> your hands on the CDs. I spoke to the BLS data products office about  2
> years ago, and they have no problem with people republishing purchased data
> in any format they like, including simple duplication.  In fact, they seemed
> to like the idea.  I think the sale of data was forced on them by some kind
> of mandate from above.

Your legal status will not depend on conversations with staff, but rather on your user-agreement.

--
David Winsemius, MD
Alameda, CA, USA


Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

Daniel Nordlund-4
In reply to this post by andrewH

Andrew,

R-help is not really the venue for discussing SAS programming and how the SAS data step reads fixed-width files. If you want to email me (off-list) the SAS program/script for reading the data, I would be willing to explain what it is doing.

Dan

Daniel Nordlund
Bothell, WA USA
 


Re: Getting information encoded in a SAS, SPSS or Stata command file into R.

ajdamico
In reply to this post by andrewH
> What I do not understand is how SAS knows where the variables begin and
> end. I managed to break off a little hunk of the beginning of my file and
> look at it in an editor, and it is numbers without any obvious delimiters.
> Is the delimiter a particular numeric string? I thought the SAS command
> file would contain the starting location for each of the fixed-length
> fields, but I do not see anything in the file that could be interpreted
> that way – just a little wraparound code and then a long list of variable
> names followed by triplets of a code, an equals sign, and a text string,
> terminating with a semicolon.
>

Search around for the word INPUT.  If that doesn't exist, you're probably
looking at a formatting-only script -- and the delimiting happens elsewhere.

Here are some examples of what the R SAScii package looks for when it uses
a SAS importation script to read in ASCII data:

library(SAScii)
?parse.SAScii

note that there are additional examples under ?read.SAScii that include
IPUMS data  ;)
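And a self-contained toy, if you want to see what parse.SAScii() returns -- the INPUT block below is made up purely to show the shape SAScii looks for:

library(SAScii)

# write a minimal SAS importation script to a temp file
sas.txt <- tempfile(fileext = ".sas")
writeLines("INPUT
  @1 YEAR   4.
  @5 REGION 2.
  @7 AGE    2.
;", sas.txt)

# one row back per variable: name, width, decimal divisor, character flag
parse.SAScii(sas.txt)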

also: David's idea is totally awesome
