Efficient way to determine if a data frame has missing observations

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Efficient way to determine if a data frame has missing observations

Al Roark

I have a data set covering a large number of cities with values for characteristics such as land area, population, and employment. The problem I have is that some cities lack observations for some of the characteristics and I'd like a quick way to determine which cities have missing data.  For example:

city<-c("A","A","A","B","B","C")
var<-c("sqmi","pop","emp","pop","emp","pop")
value<-c(10,100,40,30,10,20)
df<-data.frame(city,var,value)

In this data frame, city A has complete data for the three variables, while city B is missing land area, and city C only has population data. In the full data frame, my approach to finding the missing observations has been to create a data frame with all combinations of 'city' and 'var', merge this onto the original data frame, and then extract the observations with missing data for 'value':

city_unq<-c("A","B","C")
var_unq<-c("sqmi","pop","emp")
comb<-expand.grid(city=city_unq,var=var_unq)

mrg<-merge(comb,df,by=c("city","var"),all=T)
missing<-mrg[is.na(mrg$value),]

This works, but on a large dataset it gets slow and I'm looking for a a more efficient way to achieve this same result.  Any suggestions would be much appreciated.

Cheers
     
        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Efficient way to determine if a data frame has missing observations

Erik Iverson-3


H Roark wrote:

> I have a data set covering a large number of cities with values for characteristics such as land area, population, and employment. The problem I have is that some cities lack observations for some of the characteristics and I'd like a quick way to determine which cities have missing data.  For example:
>
> city<-c("A","A","A","B","B","C")
> var<-c("sqmi","pop","emp","pop","emp","pop")
> value<-c(10,100,40,30,10,20)
> df<-data.frame(city,var,value)
>
> In this data frame, city A has complete data for the three variables, while city B is missing land area, and city C only has population data. In the full data frame, my approach to finding the missing observations has been to create a data frame with all combinations of 'city' and 'var', merge this onto the original data frame, and then extract the observations with missing data for 'value':
>
> city_unq<-c("A","B","C")
> var_unq<-c("sqmi","pop","emp")
> comb<-expand.grid(city=city_unq,var=var_unq)
>
> mrg<-merge(comb,df,by=c("city","var"),all=T)
> missing<-mrg[is.na(mrg$value),]

Perhaps the following, or a variation thereof?

subset(as.data.frame(table(city = df$city, var = df$var)), Freq == 0)

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Efficient way to determine if a data frame has missing observations

Henrique Dallazuanna
In reply to this post by Al Roark
Try this:

subset(as.data.frame(xtabs( ~ city + var, df)), !Freq)


On Wed, Feb 2, 2011 at 1:49 PM, H Roark <[hidden email]> wrote:

>
> I have a data set covering a large number of cities with values for
> characteristics such as land area, population, and employment. The problem I
> have is that some cities lack observations for some of the characteristics
> and I'd like a quick way to determine which cities have missing data.  For
> example:
>
> city<-c("A","A","A","B","B","C")
> var<-c("sqmi","pop","emp","pop","emp","pop")
> value<-c(10,100,40,30,10,20)
> df<-data.frame(city,var,value)
>
> In this data frame, city A has complete data for the three variables, while
> city B is missing land area, and city C only has population data. In the
> full data frame, my approach to finding the missing observations has been to
> create a data frame with all combinations of 'city' and 'var', merge this
> onto the original data frame, and then extract the observations with missing
> data for 'value':
>
> city_unq<-c("A","B","C")
> var_unq<-c("sqmi","pop","emp")
> comb<-expand.grid(city=city_unq,var=var_unq)
>
> mrg<-merge(comb,df,by=c("city","var"),all=T)
> missing<-mrg[is.na(mrg$value),]
>
> This works, but on a large dataset it gets slow and I'm looking for a a
> more efficient way to achieve this same result.  Any suggestions would be
> much appreciated.
>
> Cheers
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

        [[alternative HTML version deleted]]


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.