predict() for LDA and GLM


predict() for LDA and GLM

palanski
Hi!

I'm using GLM, LDA and NaiveBayes for binomial classification. My training set is 70 rows long with 32 features, and my test set is 30 rows long with 32 features.

Using Naive Bayes, I can train a model, and then predict the test set with it like so:

ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:3])
table(predict(ass4q1.dNB, ass4q1.testSetDF[,2:3]), ass4q1.testSetDF[,1])


However, when the same is done for LDA or GLM, it tells me that the number of rows doesn't match and doesn't predict my test data. The error for GLM, as an example, is:

Error in table(predict(ass4q1.dGLM, ass4q1.testSetDF[, 2:3]), ass4q1.testSetDF[,  :
  all arguments must have the same length
In addition: Warning message:
'newdata' had 30 rows but variable(s) found have 70 rows

What am I missing?

Re: predict() for LDA and GLM

Uwe Ligges-3


On 22.03.2012 03:24, palanski wrote:

> Hi!
>
> I'm using GLM, LDA and NaiveBayes for binomial classification. My training
> set is 70 rows long with 32 features, and my test set is 30 rows long with
> 32 features.
>
> Using Naive Bayes, I can train a model, and then predict the test set with
> it like so:
>
> ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:3])
> table(predict(ass4q1.dNB, ass4q1.testSetDF[,2:3]), ass4q1.testSetDF[,1])
>
>
> However, when the same is done for LDA or GLM, suddenly it tells me that the
> number of rows doesn't match and doesn't predict my test data. The error for
> GLM, as an example, is:
>
> Error in table(predict(ass4q1.dGLM, ass4q1.testSetDF[, 2:3]),
> ass4q1.testSetDF[,  :
>    all arguments must have the same length
> In addition: Warning message:
> 'newdata' had 30 rows but variable(s) found have 70 rows
>
>
> What am I missing?

A correct formula describing the model with separate variables, with the
data.frame passed to the data argument of the lda() function.
A reproducible example is missing, hence this is just a guess.
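
For example, the fix might look like this (a sketch only, assuming the data
frames keep the default read.delim column names V1..V33, with V1 as the
class label):

ass4q1.dLDA = lda(V1 ~ V2 + V3, data = ass4q1.trainSetDF)
table(predict(ass4q1.dLDA, newdata = ass4q1.testSetDF)$class, ass4q1.testSetDF$V1)

Fitted this way, the model stores the variable names, so predict() looks
them up in the 30-row newdata instead of reusing the 70 training values
captured by ass4q1.trainSet[,2:3].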

Uwe Ligges




Re: predict() for LDA and GLM

palanski
Here is the full code. Look at the last part, denoted #(f), for the question being asked in this post:

#(a) Split datapoints into training (70 points) and test (30 points) sets.
#Read in ass4-data.txt and ass3-phodata.txt
ass4data = read.delim('http://www.moseslab.csb.utoronto.ca/alan/ass4-data.txt', header = FALSE, sep = "\t")

#Separate all positive and negative hits
ass4q1.neg = ass4data[which(ass4data[,1] == 0),]
ass4q1.pos = ass4data[which(ass4data[,1] == 1),]

#Reset row names
rownames(ass4q1.neg) = NULL
rownames(ass4q1.pos) = NULL

#Sample 70% (35 out of 50 in each positive/negative set) for training set, rest for testing set
ass4q1.negRid = sample(1:nrow(ass4q1.neg),floor(0.7*nrow(ass4q1.neg)))
ass4q1.posRid = sample(1:nrow(ass4q1.pos),floor(0.7*nrow(ass4q1.pos)))

#Combine negative and positive values from each data set to create training and testing arrays
ass4q1.trainSet = as.matrix(rbind(ass4q1.neg[ass4q1.negRid,], ass4q1.pos[ass4q1.posRid,]))
ass4q1.testSet = rbind(ass4q1.neg[-(ass4q1.negRid),],ass4q1.pos[-(ass4q1.posRid),])

#Reset row names
rownames(ass4q1.trainSet) = NULL
rownames(ass4q1.testSet) = NULL

ass4q1.trainSetDF = as.data.frame(ass4q1.trainSet)
ass4q1.trainSetDF$V1 = factor(ass4q1.trainSetDF$V1)

ass4q1.testSetDF = as.data.frame(ass4q1.testSet)
ass4q1.testSetDF$V1 = factor(ass4q1.testSetDF$V1)


##############
#(b)Load MASS, e1071 and glmnet
library(MASS)
library(e1071)
library(glmnet)

#############
#(c)How many features does the data contain?
#The data contains 32 features (columns of data)

#############
#(d)How does the number of parameters required for Naïve Bayes, LDA, and Logistic
#Regression (unregularized) scale as a function of the number of features?

#If Y is binary with features <X1 ... Xp>, let N(p) be the number of
#parameters each method must estimate.

#Naive Bayes
#One class prior, plus a mean and a standard deviation per class for each
#of the p features (mu(Y=0), mu(Y=1), sigma(Y=0), sigma(Y=1)):
#N(p) = 1 + 4p

#Linear Discriminant Analysis
#Have to estimate one shared covariance matrix and p mean values for each class.
#The covariance matrix is p x p, but it is symmetric, so only the upper (or
#lower) triangle including the diagonal is needed, i.e. p(p + 1)/2 entries.
#Estimating p mean values for each of the 2 classes of binary Y adds 2p.
#Thus:
#N(p) = p(p + 1)/2 + 2p
#(e.g. for p = 2 this is 3 + 4 = 7 parameters)

#Logistic Regression
#An intercept plus one coefficient per feature:
#N(p) = 1 + p

#To plot the relationship:
ass4q1.dVS = matrix(, ncol(ass4q1.trainSet)-1, 3)

for (p in 1:(ncol(ass4q1.trainSet)-1)){
ass4q1.dVS[p,1] = (1 + (4*p))
ass4q1.dVS[p,2] = ((p *(p + 1) / 2) + 2*p)
ass4q1.dVS[p,3] = (1 + p)
}


png('ass4q1.dVS.png')
plot(ass4q1.dVS[,2], type="o", col="blue",ylim=c(0,max(ass4q1.dVS)), ann=FALSE)
lines(ass4q1.dVS[,1], type="o", pch=22, lty=2, col="red")
lines(ass4q1.dVS[,3], type="o", pch=23, lty=3, col="green")
title(main = "Number of parameters as a function of features", col.main="red", font.main=4)
title(xlab= "Features", col.lab="red")
title(ylab= "Parameters", col.lab="red")
legend(1, max(ass4q1.dVS), c("LDA", "Naive Bayes", "Logistic Regression"), cex=0.8, col=c("blue","red","green"), pch=21:23, lty=1:3)
dev.off()

#############
#(e)Train Naïve Bayes, LDA and Logistic Regression to classify the training data
#using the first two, four, eight, 16 or 32 features, starting from the left of the file. Plot
#the classification error (FP + FN)/(TP+FP+TN+FN) on the training data as a function
#of the number of parameters for each method.

#Contingency table organized as:
#TN FN
#FP TP

#Organize tables to store data:
ass4q1.dNBtable = matrix(,5,2)
ass4q1.dLDAtable = matrix(,5,2)
ass4q1.dGLMtable = matrix(,5,2)

i = 1
for(p in c(2,4,8,16,32)){
        ass4q1.dNBtable[i,1] = (1 + (4*p))
        ass4q1.dLDAtable[i,1] = ((p *(p + 1) / 2) + 2*p)
        ass4q1.dGLMtable[i,1] = (1+p)
        i = i+1
}

#Copying blank tables for part (f)
ass4q1.dNBtable.testData = ass4q1.dNBtable
ass4q1.dLDAtable.testData = ass4q1.dLDAtable
ass4q1.dGLMtable.testData = ass4q1.dGLMtable


#############
#2 Features
#2 features for NaiveBayes
ass4q1.dNB = naiveBayes(ass4q1.trainSetDF[,2:3], ass4q1.trainSetDF[,1])
ass4q1.dNB.cTable = table(predict(ass4q1.dNB, ass4q1.trainSetDF[,2:3]), ass4q1.trainSetDF[,1])

ass4q1.dNBtable[1,2] = (ass4q1.dNB.cTable[2,1] + ass4q1.dNB.cTable [1,2])/(sum(ass4q1.dNB.cTable))

#2 features for LDA
ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:3])
ass4q1.dLDA.cTable = table(predict(ass4q1.dLDA, ass4q1.trainSetDF[,2:3])$class, ass4q1.trainSetDF[,1])

ass4q1.dLDAtable[1,2] = (ass4q1.dLDA.cTable[2,1] + ass4q1.dLDA.cTable [1,2])/(sum(ass4q1.dLDA.cTable))

#2 features for GLM
ass4q1.dGLM = glm(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:3], family = "binomial")
ass4q1.dGLM.cTable = table(predict(ass4q1.dGLM, ass4q1.trainSetDF[,2:3]),ass4q1.trainSetDF[,1])

ass4q1.dGLMtable[1,2] = ((35-sum(ass4q1.dGLM.cTable[1:35,1]) + (35-sum(ass4q1.dGLM.cTable[36:70,2]))) / 70)
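
#Note: predict() on a glm object returns values on the linear-predictor
#(logit) scale by default, which is why the table above has one row per
#fitted value instead of a 2x2 layout. A sketch of an alternative (the
#names ass4q1.dGLM.pred and ass4q1.dGLM.cTable2 are new here): threshold
#type = "response" probabilities at 0.5 to get a table like the other methods.
ass4q1.dGLM.pred = ifelse(predict(ass4q1.dGLM, type = "response") > 0.5, 1, 0)
ass4q1.dGLM.cTable2 = table(ass4q1.dGLM.pred, ass4q1.trainSetDF[,1])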

#############
#4 Features
#4 features for NaiveBayes
ass4q1.dNB = naiveBayes(ass4q1.trainSetDF[,2:5], ass4q1.trainSetDF[,1])
ass4q1.dNB.cTable = table(predict(ass4q1.dNB, ass4q1.trainSetDF[,2:5]), ass4q1.trainSetDF[,1])

ass4q1.dNBtable[2,2] = (ass4q1.dNB.cTable[2,1] + ass4q1.dNB.cTable [1,2])/(sum(ass4q1.dNB.cTable))

#4 features for LDA
ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:5])
ass4q1.dLDA.cTable = table(predict(ass4q1.dLDA, ass4q1.trainSetDF[,2:5])$class, ass4q1.trainSetDF[,1])

ass4q1.dLDAtable[2,2] = (ass4q1.dLDA.cTable[2,1] + ass4q1.dLDA.cTable [1,2])/(sum(ass4q1.dLDA.cTable))

#4 features for GLM
ass4q1.dGLM = glm(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:5], family = "binomial")
ass4q1.dGLM.cTable = table(predict(ass4q1.dGLM, ass4q1.trainSetDF[,2:5]),ass4q1.trainSetDF[,1])

ass4q1.dGLMtable[2,2] = ((35-sum(ass4q1.dGLM.cTable[1:35,1]) + (35-sum(ass4q1.dGLM.cTable[36:70,2]))) / 70)

#############
#8 Features
#8 features for NaiveBayes
ass4q1.dNB = naiveBayes(ass4q1.trainSetDF[,2:9], ass4q1.trainSetDF[,1])
ass4q1.dNB.cTable = table(predict(ass4q1.dNB, ass4q1.trainSetDF[,2:9]), ass4q1.trainSetDF[,1])

ass4q1.dNBtable[3,2] = (ass4q1.dNB.cTable[2,1] + ass4q1.dNB.cTable [1,2])/(sum(ass4q1.dNB.cTable))

#8 features for LDA
ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:9])
ass4q1.dLDA.cTable = table(predict(ass4q1.dLDA, ass4q1.trainSetDF[,2:9])$class, ass4q1.trainSetDF[,1])

ass4q1.dLDAtable[3,2] = (ass4q1.dLDA.cTable[2,1] + ass4q1.dLDA.cTable [1,2])/(sum(ass4q1.dLDA.cTable))

#8 features for GLM
ass4q1.dGLM = glm(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:9], family = "binomial")
ass4q1.dGLM.cTable = table(predict(ass4q1.dGLM, ass4q1.trainSetDF[,2:9]),ass4q1.trainSetDF[,1])

ass4q1.dGLMtable[3,2] = ((35-sum(ass4q1.dGLM.cTable[1:35,1]) + (35-sum(ass4q1.dGLM.cTable[36:70,2]))) / 70)

#############
#16 Features
#16 features for NaiveBayes
ass4q1.dNB = naiveBayes(ass4q1.trainSetDF[,2:17], ass4q1.trainSetDF[,1])
ass4q1.dNB.cTable = table(predict(ass4q1.dNB, ass4q1.trainSetDF[,2:17]), ass4q1.trainSetDF[,1])

ass4q1.dNBtable[4,2] = (ass4q1.dNB.cTable[2,1] + ass4q1.dNB.cTable [1,2])/(sum(ass4q1.dNB.cTable))

#16 features for LDA
ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:17])
ass4q1.dLDA.cTable = table(predict(ass4q1.dLDA, ass4q1.trainSetDF[,2:17])$class, ass4q1.trainSetDF[,1])

ass4q1.dLDAtable[4,2] = (ass4q1.dLDA.cTable[2,1] + ass4q1.dLDA.cTable [1,2])/(sum(ass4q1.dLDA.cTable))

#16 features for GLM
ass4q1.dGLM = glm(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:17], family = "binomial")
ass4q1.dGLM.cTable = table(predict(ass4q1.dGLM, ass4q1.trainSetDF[,2:17]),ass4q1.trainSetDF[,1])

ass4q1.dGLMtable[4,2] = ((35-sum(ass4q1.dGLM.cTable[1:35,1]) + (35-sum(ass4q1.dGLM.cTable[36:70,2]))) / 70)

#############
#32 Features
#32 features for NaiveBayes
ass4q1.dNB = naiveBayes(ass4q1.trainSetDF[,2:33], ass4q1.trainSetDF[,1])
ass4q1.dNB.cTable = table(predict(ass4q1.dNB, ass4q1.trainSetDF[,2:33]), ass4q1.trainSetDF[,1])

ass4q1.dNBtable[5,2] = (ass4q1.dNB.cTable[2,1] + ass4q1.dNB.cTable [1,2])/(sum(ass4q1.dNB.cTable))

#32 features for LDA
ass4q1.dLDA = lda(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:33])
ass4q1.dLDA.cTable = table(predict(ass4q1.dLDA, ass4q1.trainSetDF[,2:33])$class, ass4q1.trainSetDF[,1])

ass4q1.dLDAtable[5,2] = (ass4q1.dLDA.cTable[2,1] + ass4q1.dLDA.cTable [1,2])/(sum(ass4q1.dLDA.cTable))

#32 features for GLM
ass4q1.dGLM = glm(ass4q1.trainSet[,1]~ass4q1.trainSet[,2:33], family = "binomial")
ass4q1.dGLM.cTable = table(predict(ass4q1.dGLM, ass4q1.trainSetDF[,2:33]),ass4q1.trainSetDF[,1])

ass4q1.dGLMtable[5,2] = ((35-sum(ass4q1.dGLM.cTable[1:35,1]) + (35-sum(ass4q1.dGLM.cTable[36:70,2]))) / 70)

png('ass4q1.dTables.png')
plot(ass4q1.dLDAtable[,1],ass4q1.dLDAtable[,2], type="o", col="blue",ylim=c(0,.4), ann=FALSE)
lines(ass4q1.dNBtable[,1], ass4q1.dNBtable[,2], type="o", pch=22, lty=2, col="red")
lines(ass4q1.dGLMtable[,1], ass4q1.dGLMtable[,2], type="o", pch=23, lty=3, col="green")
title(main = "Classification error as a function of number of parameters", col.main="red", font.main=4)
title(xlab= "Parameters", col.lab="red")
title(ylab= "(FP + FN)/(TP+FP+TN+FN)", col.lab="red")
legend(300, .4, c("LDA", "Naive Bayes", "Logistic Regression"), cex=0.8, col=c("blue","red","green"), pch=21:23, lty=1:3)
dev.off()

#############
#(f)Plot the classification error as a function of the number of parameters on the test
#data for each method. Does this differ from your answer in part (e)? Explain why.

#############
#2 Features
#2 features for NaiveBayes

ass4q1.dNB.cTable = table(predict(ass4q1.dNB, ass4q1.testSetDF[,2:3]), ass4q1.testSetDF[,1])

ass4q1.dNBtable.testData[1,2] = (ass4q1.dNB.cTable[2,1] + ass4q1.dNB.cTable [1,2])/(sum(ass4q1.dNB.cTable))

#WORKS!


#2 features for LDA
ass4q1.dLDA.cTable = table(predict(ass4q1.dLDA, ass4q1.testSetDF[,2:3])$class, ass4q1.testSetDF[,1])

#DOESN'T WORK!

ass4q1.dLDAtable.testData[1,2] = (ass4q1.dLDA.cTable[2,1] + ass4q1.dLDA.cTable [1,2])/(sum(ass4q1.dLDA.cTable))


#2 features for GLM
ass4q1.dGLM.cTable = table(predict(ass4q1.dGLM, ass4q1.testSetDF[,2:3]), ass4q1.testSetDF[,1])
#DOESN'T WORK!

ass4q1.dGLMtable.testData[1,2] = ((35-sum(ass4q1.dGLM.cTable[1:35,1]) + (35-sum(ass4q1.dGLM.cTable[36:70,2]))) / 70)

Re: predict() for LDA and GLM

Uwe Ligges-3
1. Not reproducible for me (gives an ERROR).
2. Please try to make examples "minimal", as the posting guide suggests.
3. Please follow my advice and provide "a correct formula describing the
model with separate variables, with the data.frame passed to the data
argument of the lda() function."

That means something like:

lda(Species ~ Sepal.Length, data=iris)

and the same for predict() afterwards.
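
For instance, carrying that example through (a minimal sketch):

fit = lda(Species ~ Sepal.Length, data = iris)
table(predict(fit, newdata = iris)$class, iris$Species)

Here predict() finds Sepal.Length by name in newdata, which is exactly what
fails when the formula contains matrix subsets like ass4q1.trainSet[,2:3].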


Best,
Uwe Ligges

