Summary tables of large datasets including character and numerical variables

sparandekar
Hello!

I am attempting to switch from being a long-time SAS user to R, and would really appreciate a bit of help! The first thing I do on getting a large dataset (thousands of observations and hundreds of variables) is to run the SAS command PROC CONTENTS VARNUM, which gives me a table with the name of each variable, its type and its length. Then I run PROC MEANS, which for numerical variables gives me a table with the number of non-missing values, min, max, mean and std. dev. My data usually has errors, and this first step helps me to spot the errors and 'clean' the dataset.

The 'summary' function in R and other functions from the Hmisc or psych packages do not work for me.

How can I get a table from an R data.frame that has the following structure (header row and example)?

Rowname  Character/Integer  Length  Non-Missing  Minimum       Maximum       Mean           SD
HHID     Integer            12      32,344       114455007701  514756007812  2.345 x 10^10  1.456 x 10^10
Head     Character          38      24,566       -             -             -              -

Thank you very much.

Re: Summary tables of large datasets including character and numerical variables

John Kane-2
It would be very helpful to have an actual sample of your data.

As usual in R, there are probably several different ways to approach the problem,
but a small sample of the data or a mock-up would be most helpful.


Probably the easiest way to supply some data would be something like

df1 <- mydata[1:100, ]   # first 100 rows of the data frame

dput(df1)


and paste the resulting output into a message.  

Sorry I'm not more helpful but seeing real data makes life a lot easier.
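
As a rough illustration of the dput() suggestion above, here is a minimal sketch using a made-up data frame. The object mock_data and its values are hypothetical, loosely modelled on the HHID and Head variables in the question. It shows that the text printed by dput() can be pasted into a message and turned back into an equivalent object by the reader.

```r
# Hypothetical mock-up; the names and values are invented for illustration.
mock_data <- data.frame(
  HHID = c(114455007701, 214756007812, NA),
  Head = c("A. Smith", NA, "C. Jones"),
  stringsAsFactors = FALSE
)

# Take a small sample (here the first two rows) and print its R representation.
sample_rows <- head(mock_data, 2)
dput(sample_rows)

# The printed text can be parsed back into an equivalent object,
# which is what a reader does when pasting it into an R session.
txt <- paste(capture.output(dput(sample_rows)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
```

Because the pasted text carries the column types and missing values with it, helpers on the list can reproduce the problem without access to the original file.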



----- Original Message -----
From: sparandekar <[hidden email]>
To: [hidden email]
Cc:
Sent: Monday, December 26, 2011 5:44:53 AM
Subject: [R] Summary tables of large datasets including character and numerical variables

--
View this message in context: http://r.789695.n4.nabble.com/Summary-tables-of-large-datasets-including-character-and-numerical-variables-tp4234296p4234296.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Summary tables of large datasets including character and numerical variables

Duncan Murdoch-2
In reply to this post by sparandekar
Using the tables package, you can get something like that as follows,
assuming that "df" is your dataframe:

library(tables)

nonmissing <- function(x) sum(!is.na(x))

tabular(All(df, character=TRUE) ~ (typeof + length + nonmissing + min +
max + mean + sd))

It isn't perfect:  it will skip anything that isn't numeric or character
(e.g. factors).  There are ways to work around that, but they aren't as
simple as you might like.  You can also use

sapply(df, class)

to see the classes of all the columns.
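
For comparison, here is a base-R sketch that builds a table of the shape requested in the question without any add-on package. It is only one possible approach and not the tables-based method above. The function name summarize_columns and the mock data are hypothetical. "Length" is assumed here to mean the width of the longest value when converted to text, which may not match what SAS reports, and factors are handled like character columns.

```r
# Hypothetical helper: returns one row per column of a data frame.
summarize_columns <- function(dat) {
  rows <- lapply(names(dat), function(nm) {
    x <- dat[[nm]]
    is_num <- is.numeric(x)
    vals <- x[!is.na(x)]
    # Numeric statistics are computed only for numeric columns with data.
    stat <- function(f) if (is_num && length(vals) > 0) f(vals) else NA_real_
    # Assumption: "Length" is the widest value after conversion to text.
    widths <- nchar(as.character(vals))
    data.frame(
      Rowname    = nm,
      Type       = class(x)[1],
      Length     = if (length(widths) > 0) max(widths) else NA_integer_,
      NonMissing = length(vals),
      Minimum    = stat(min),
      Maximum    = stat(max),
      Mean       = stat(mean),
      SD         = stat(sd),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up data, for illustration only.
mock_data <- data.frame(
  HHID = c(114455007701, 214756007812, NA),
  Head = c("A. Smith", NA, "C. Jones"),
  Size = c(3L, 5L, 4L),
  stringsAsFactors = FALSE
)

result <- summarize_columns(mock_data)
print(result)
```

Character and factor columns get NA in the four numeric columns rather than a dash, so the result stays a data frame with numeric statistics that can be sorted or filtered when hunting for errors.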

Duncan Murdoch

Re: Summary tables of large datasets including character and numerical variables

David Winsemius
In reply to this post by sparandekar

I generally use (in order of increasing information content and
increasing length of output):

names(dfrm)

str(dfrm)

Hmisc::describe(dfrm)

(Several other packages have their own versions of 'describe'.)

--

David Winsemius, MD
West Hartford, CT

Re: Summary tables of large datasets including character and numerical variables

John Kane-2
In reply to this post by sparandekar
A function like the one below will give you the class and number of valid
entries for each variable in a dataset.  A sample data set would help determine whether it works.
It works on a simple data set I created and on one from the ggplot2 package, but it is
not really tested.

With your data set as df1, something like
xx <- msum(df1)
should supply you with the information.

Other than that, I really don't see why Hmisc::describe() would not give you
most or all of the information you seek, with the exception of the "length" thing, which
I really don't understand. Actually, I believe describe() will give you
the number of non-NA entries for a variable.


==========================================================
msum <- function(dat1) {
  # first class of each column (a column may carry more than one class)
  cl <- vapply(dat1, function(x) class(x)[1], character(1))
  # number of non-missing entries in each column
  nas <- colSums(!is.na(dat1))
  mysum <- data.frame(cl, nas)
  names(mysum) <- c("class", "non-missing")
  mysum
}

===============================================================
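
A quick way to check a function of this kind is to run it on a tiny made-up data frame. The sketch below restates the function compactly so that the example is self-contained. The mock data and its column names are hypothetical, and taking only the first class of each column is an assumption made here so that a column with several classes still yields exactly one row.

```r
msum <- function(dat1) {
  # First class of each column, so that every column yields exactly one entry.
  cl <- vapply(dat1, function(x) class(x)[1], character(1))
  # Number of non-missing entries per column.
  nas <- colSums(!is.na(dat1))
  mysum <- data.frame(cl, nas, stringsAsFactors = FALSE)
  names(mysum) <- c("class", "non-missing")
  mysum
}

# Made-up data, for illustration only.
df1 <- data.frame(
  HHID = c(114455007701, 214756007812, NA),
  Head = c("A. Smith", NA, "C. Jones"),
  stringsAsFactors = FALSE
)

xx <- msum(df1)
print(xx)
```

The result has one row per variable, with the variable names as row names, so it can be inspected or written out like any other data frame.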


