Request for functions to calculate correlated factors influencing an outcome.


Request for functions to calculate correlated factors influencing an outcome.

Lalitha Viswanathan
Hi
I am sorry; I saved the file after removing the dot after Disp (I was going
wrong on a read.delim call that threw an error about !header, etc.). The dot
was not the culprit, but I have continued to leave it out.
Let me paste the full code here.
x <- read.table("/Users/Documents/StatsTest/fuelEfficiency.txt",
                header = TRUE, sep = "\t")
x <- data.frame(x)   # redundant: read.table already returns a data frame
# print each country's subset
for (i in unique(x$Country)) { print(i); y <- subset(x, x$Country == i); print(y) }
# keep the columns of interest
newx <- subset(x, select = c(Price, Reliability, Mileage, Weight, Disp, HP))
cor(newx, method = "pearson")
my.cor <- cor.test(newx$Weight, newx$Price, method = "spearman")
my.cor <- cor.test(newx$Weight, newx$HP, method = "spearman")
my.cor <- cor.test(newx$Disp, newx$HP, method = "spearman")
Setting exact=NULL still doesn't remove the warning:
my.cor <- cor.test(newx$Disp, newx$HP, method = "kendall", exact = NULL)
I tried to find the correlation coefficients for various combinations of
variables, but am unable to interpret the results. (Results are pasted below
from an earlier post.)
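
A side note on the two issues above, as a minimal sketch (how the missing
Reliability values ought to be handled is an assumption on my part):

# Reliability contains NAs, so cor() returns NA for that column unless
# you tell it how to treat missing values:
cor(newx, method = "pearson", use = "pairwise.complete.obs")

# exact = NULL is already the default for cor.test(); with ties an exact
# p-value cannot be computed, so request the asymptotic one instead:
cor.test(newx$Disp, newx$HP, method = "kendall", exact = FALSE)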

I followed that up with normality tests:
shapiro.test(newx$Disp)
shapiro.test(newx$HP)

Then I decided to run kruskal.test(newx), with the result
Kruskal-Wallis chi-squared = 328.94, df = 5, p-value < 2.2e-16
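
A note on that result: when kruskal.test() is given a whole data frame, it
treats each column as a separate group, so the df = 5 above compares the six
variables with each other rather than saying anything about Mileage. A
Kruskal-Wallis test of Mileage across a grouping variable would look more
like this sketch (assuming Country is the grouping of interest):

# compare the distribution of Mileage across countries (Country assumed
# to be the grouping of interest)
kruskal.test(Mileage ~ Country, data = x)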

My question is: I am trying to find the factors influencing efficiency (in
this case, Mileage).

What range of functions / examples should I be looking at to find a factor,
or combination of factors, influencing efficiency?

Any pointers will be helpful

Thanks
Lalitha

On Sun, May 3, 2015 at 2:49 PM, Lalitha Viswanathan <
[hidden email]> wrote:

> Hi
> I have a dataset of the type attached.
> Here's my code thus far.
> dataset <-data.frame(read.delim("data", sep="\t", header=TRUE));
> newData<-subset(dataset, select = c(Price, Reliability, Mileage, Weight,
> Disp, HP));
> cor(newData, method="pearson");
> Results are:
>                   Price Reliability    Mileage     Weight       Disp         HP
> Price         1.0000000          NA -0.6537541  0.7017999  0.4856769  0.6536433
> Reliability          NA           1         NA         NA         NA         NA
> Mileage      -0.6537541          NA  1.0000000 -0.8478541 -0.6931928 -0.6667146
> Weight        0.7017999          NA -0.8478541  1.0000000  0.8032804  0.7629322
> Disp          0.4856769          NA -0.6931928  0.8032804  1.0000000  0.8181881
> HP            0.6536433          NA -0.6667146  0.7629322  0.8181881  1.0000000
>
> It appears that Wt and Price, Wt and Disp, Wt and HP, Disp and HP, HP and
> Price are strongly correlated.
> To find the statistical significance,
> I am trying  sample.correln<-cor.test(newData$Disp, newData$HP,
> method="kendall", exact=NULL)
> Kendall's rank correlation tau
>
> data:  newx$Disp and newx$HP
> z = 7.2192, p-value = 5.229e-13
> alternative hypothesis: true tau is not equal to 0
> sample estimates:
>       tau
> 0.6563871
>
> If I try the same with
> sample.correln <- cor.test(newData$Disp, newData$HP, method="spearman",
> exact=NULL)
> I get a warning message:
> In cor.test.default(newx$Disp, newx$HP, method = "spearman", exact = NULL) :
>   Cannot compute exact p-value with ties
> > sample.correln
>
> Spearman's rank correlation rho
>
> data:  newx$Disp and newx$HP
> S = 5716.8, p-value < 2.2e-16
> alternative hypothesis: true rho is not equal to 0
> sample estimates:
>       rho
> 0.8411566
>
> I am not sure how to interpret these values.
> Basically, I am trying to figure out which combination of factors
> influences efficiency.
>
> Thanks
> Lalitha
>


Re: Request for functions to calculate correlated factors influencing an outcome.

Prashant Sethi
Hi,

I'm not an expert in data analysis (a beginner still learning the tricks of
the trade), but since you're trying to determine how a dependent variable
relates to a number of explanatory variables, I believe you should try a
regression analysis of your model. The function you'll use for that is lm().
You can use forward selection or backward elimination to build a model that
keeps the most important factors.

Maybe you can give it a try.
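
For example, something along these lines (just a sketch, assuming the newx
data frame from your earlier post; Reliability is left out here only because
of its missing values):

# fit Mileage against the candidate predictors
fit <- lm(Mileage ~ Price + Weight + Disp + HP, data = newx)
summary(fit)        # coefficients, p-values, R-squared

# backward elimination based on AIC
reduced <- step(fit, direction = "backward")
summary(reduced)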

Thanks and regards,
Prashant Sethi

Re: Request for functions to calculate correlated factors influencing an outcome.

Lalitha Viswanathan
Hi
I used the MASS package, after reading the examples at
http://www.statmethods.net/stats/regression.html :

library(MASS)
fit <- lm(Mileage ~ Disp + HP + Weight + Reliability, data = newx)
step <- stepAIC(fit, direction = "both")
step$anova  # display results

That showed the most relevant variables affecting Mileage. While it is a
start, I am looking for a model that fits the entire dataset (including
Mileage), not just the factors that influence Mileage: in other words,
multi-model inference / selection.

I was reading about glmulti. Are there any other packages I could look at
for inferring models that best fit the data?

To use nlm / nls, I need to supply a formula as one of the parameters, and I
am looking for functions that will help infer that formula from the data.
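
For reference, a rough base-R sketch of comparing a few hand-picked candidate
formulas by AIC; this is manual model selection, not the automated inference
being asked about:

# a few candidate models for Mileage, compared by AIC
candidates <- list(
  Mileage ~ Weight,
  Mileage ~ Weight + HP,
  Mileage ~ Weight + Disp + HP,
  Mileage ~ Price + Weight + Disp + HP
)
fits <- lapply(candidates, lm, data = newx)
data.frame(formula = sapply(candidates, deparse),
           AIC     = sapply(fits, AIC))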

Thanks
lalitha


Re: Request for functions to calculate correlated factors influencing an outcome.

Prashant Sethi
Hi,

From my understanding of a model, you need one or more dependent variables
(y1, y2, etc.) and explanatory variables (x1, x2, etc.), which you fit along
with certain coefficients so as to predict the dependent variable with as
little error as possible. I don't think a factor can be on both sides of the
model equation.
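
In R terms, that looks something like this sketch (assuming the newx data
frame from your earlier code; the dot on the right-hand side means "all the
other columns as explanatory variables"):

# Mileage as the dependent variable, everything else as predictors
fit <- lm(Mileage ~ ., data = newx)
summary(fit)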

In your earlier email you had mentioned: "My question is: I am trying to
find the factors influencing efficiency (in this case, Mileage)."

If this is indeed your question, then I believe the lm() function is what
you need. Otherwise, I think you need to reformulate your query to further
clarify what you are looking for.

Thanks and regards,
Prashant



Re: Request for functions to calculate correlated factors influencing an outcome.

Bert Gunter
In reply to this post by Lalitha Viswanathan
This would be better posted on a statistical list like
stats.stackexchange.com, as it is largely about statistical
methodology, not R code. Once you have determined what kinds of
methods you want, you might then post back here -- or better yet, just
search! -- for packages that implement those methods in R.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll



