Hi:

On Fri, Jan 7, 2011 at 7:04 AM, wangwallace <

[hidden email]> wrote:

>

> hey, folks,

>

> I have two very simple questions. I am not quite sure if I can do this

> using

> R.

>

> so, I am analyzing a large data frame with thousands of variables. For

> example:

>

> Dependent variables: d1, d2, d3 (i.e., there are three dependent variables)

>

> Independent variables: s1, s2, s3, ......s1000 (i.e., there are 1000

> independent variables)

>

> now I want to run simple linear regression analyses of dependent variables

> on independent variables. A laborious way to do this is running 1000 linear

> regression models for each dependent variable separately. This would take a

> lot of time. My question is:

>

> 1) is there a easy way to run all 1000 linear regression analyses for each

> dependent variable at once?

>

Yes.

> 2) after running all 1000 linear regression analyses at once, how can I

> write 3000 regression weights (1000 regression weights for each dependent

> variable) and its significance level in to a file (e.g., csv, excel, ect).

>

Define 'weights'. Do you mean the parameter estimates?

Here's a simplified example, but the output is a separate list object for

each response. Each component of each list is an lm object, produced from

the lm() function.

# Generate some fake data: three responses and eight covariates

df <- data.frame(y1 = rnorm(50), y2 = rnorm(50), y3 = rnorm(50),

x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50),

x4 = rpois(50, 30), x5 = rpois(50, 20), x6 = rpois(50,

10),

x7 = runif(50), x8 = runif(50))

# Create a vector of covariate names

xs <- paste('x', 1:8, sep = '')

# Initialize a list whose length is that of the vector xs

rl1 <- vector('list', 8)

rl2 <- vector('list', 8)

rl3 <- vector('list', 8)

# The following loop regresses all three responses individually on a single

covariate x[i]

# and exports the results to separate lists for each response

# The first statement creates a formula with the name of the i-th covariate

# The second statement does the regression and assigns the output to the

i-th

# component of the list corresponding to the j-th response (j = 1, 2, 3)

for(i in 1:8) { fm1<- as.formula(paste('y1', xs[i], sep = '~'))

fm2 <- as.formula(paste('y2', xs[i], sep = '~'))

fm3 <- as.formula(paste('y3', xs[i], sep = '~'))

rl1[[i]] <- lm(fm1, data = df)

rl2[[i]] <- lm(fm2, data = df)

rl3[[i]] <- lm(fm3, data = df)

}

# The print method of lm() applied to the first component is

rl1[[1]]

Each component of each list will resemble the output object you would get

from running lm() on a single response with one explanatory variable at a

time. In each list, there are as many components as there are covariates.

You can extract all sorts of things from each component of the list; I

prefer using

the ldply() function from package plyr, but there are other ways with base

functions.

Here are some examples:

library(plyr)

# R^2 values:

ldply(rl1, function(x) summary(x)$r.squared)

# Model coefficients:

ldply(rl1, function(x) coef(x))

# p-values of significance tests for intercept and slope

ldply(rl1, function(x) summary(x)$coefficients[, 4])

# residuals from each model

res1 <- t(ldply(rl1, function(x) resid(x))) # produces a matrix

Some comments:

1. If you want to run multivariate regression on the response matrix [y1 y2

y3],

you only need one list object for output. Substitute 'cbind(y1, y2, y3)'

for

yi when creating the formula. Only one call is needed for multivariate

regression rather than three for the univariate regressions. However, you

would then need to be more careful about output structures and pulling

them

together over list components. For instance, in a multivariate

regression, the

coefficients are output as a matrix rather than a vector, so to combine

them

all, you would need to use a 3D array rather than a data frame.

2. The 'x' in the anonymous functions in ldply() corresponds to a generic

list

component. In this context, x is an lm object, so anything you could do

with

the output of an lm object could be mapped componentwise with ldply. This

approach is very convenient if you want to pick off individual output

pieces

(e.g., R^2 or the root MSE) for all the regressions at once.

3. To associate the covariates with rows of output in each generated data

frame above, you could use, e.g.,

r2y1 <- ldply(rl1, function(x) summary(x)$r.squared)

rownames(r2y1) <- xs

In the residual example, the data frame is 8 x 50; using the t() function

[for transpose], the result is a 50 x 8 matrix whose columns correspond

to

the individual covariates, so in this case

colnames(res1) <- xs

4. Make sure you organize your code and desired output in advance; with 1000

covariates, things could get messy if you're not careful.

HTH,

Dennis

[[alternative HTML version deleted]]

______________________________________________

[hidden email] mailing list

https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide

http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.