correlation matrix between data from different files

correlation matrix between data from different files

jeff6868
Dear users,

I'm a fairly new R user from France, and I have a problem building a correlation matrix.
I have temperature data for each weather station in my study area and for each year (for example, one data file for weather station N°1 for the year 2009, one data file for station N°2 for the year 2010, ...). So I have 70 weather stations with one data file per year since 2005. Each station has 4 temperature sensors.
Each data file has exactly the same structure: date and hour, sensor1, sensor2, sensor3, sensor4. Here's an example:

time              sensor1  sensor2  sensor3  sensor4
01/01/2008 00:00    -0.25    -2.43    -3.25    -2.37
01/01/2008 00:15    -0.18    -2.37    -3.18    -2.25
01/01/2008 00:30    -0.25    -2.5     -3.37    -2.56
01/01/2008 00:45    -0.25    -2.37    -3.31    -2.37

I need to build a correlation matrix between the same sensor across the different stations (one correlation matrix between sensor 1 of all 70 stations, another one for sensor 2, ...).
For each year and each station, I have to find the best correlation. For example, which of the 70 weather stations is best correlated with station 1 for sensor 1? And with station 2? ... and so on for each sensor and each station.

Example:

Sensor 1 for the year 2009

            Station 1   Station 2   Station 3   [...]
Station 1       1         0.910       0.748
Station 2     0.910         1         0.6
Station 3     0.748       0.6           1
[...]

And the same for the years 2005, 2006, 2007, 2008, 2009, 2010 and 2011, for each of the 4 sensors.

Do you have any idea how I can do this in R?
Should I first merge all the sensors into one file, or can I do it with the data in separate files (as I have at the moment)?
Thank you very much for all your answers!

Re: correlation matrix between data from different files

Rui Barradas
Hello,

You don't need to merge all the files, but you must do some preprocessing.
If you put all the data for one year into a 3d array, you can then simply use 'cor'.

I've made up some fake data, in files named "station1_2009.dat", etc. (only 6 stations),
each of them with the same number of observations. If you have 70 stations per year, you'll
need an automated process to access them. Something like the function below would solve
part of that problem.
What follows assumes that the number of observations is the same in all files.

# This function gives file names with the pattern above
filenames <- function(y, n=70){
    tmp <- paste("station", seq_len(n), sep="")
    tmp <- paste(tmp, y, sep="_")
    paste(tmp, "dat", sep=".")
}


Sensors <- paste("sensor", 1:4, sep="")
Stations <- paste("station", 1:6, sep="")

nsensors <- length(Sensors)
nstations <- length(Stations)

year <- 2009
fnames <- filenames(year, nstations)

# If nobs is the same in all files, any one will do.
nobs <- nrow(read.table(fnames[1], header=TRUE))

yr2009 <- array(NA, dim=c(nobs, nsensors, nstations))
for(i in seq_len(nstations)){
    tmp <- read.table(fnames[i], header=TRUE)
    yr2009[ , , i] <- as.matrix(tmp[, Sensors])
}

dimnames(yr2009) <- list(seq.int(nobs), Sensors, Stations)

# correlations for sensor 1
cor(yr2009[ , 1, ])

# a list of correlations for the 4 sensors
cor2009 <- lapply(Sensors, function(s) cor(yr2009[ , s, ]))
names(cor2009) <- Sensors
cor2009$sensor1


Don't pay much attention to the file-handling part; what's relevant is creating and filling the array.
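
To try the array step without any files on disk, here is a minimal sketch that swaps the read.table() call for simulated readings. The helper fake_station(), the 3 stations and the 50 observations are assumptions made up for illustration, not part of the original data.

```r
set.seed(1)                                   # reproducible fake data
Sensors  <- paste("sensor", 1:4, sep = "")
Stations <- paste("station", 1:3, sep = "")   # hypothetical: only 3 stations
nobs <- 50                                    # hypothetical number of observations

# Hypothetical stand-in for read.table(): fake readings for one station
fake_station <- function() {
    as.data.frame(matrix(rnorm(nobs * length(Sensors)), nrow = nobs,
                         dimnames = list(NULL, Sensors)))
}

# observations x sensors x stations
yr <- array(NA_real_, dim = c(nobs, length(Sensors), length(Stations)),
            dimnames = list(NULL, Sensors, Stations))
for (i in seq_along(Stations)) {
    yr[ , , i] <- as.matrix(fake_station()[, Sensors])
}

# Slicing one sensor leaves an observations x stations matrix,
# so cor() returns a stations x stations correlation matrix.
cor_s1 <- cor(yr[ , "sensor1", ])
```

Because the fake readings are independent random numbers, the off-diagonal correlations will be close to zero; with real station data they should look more like the example matrix in the question.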

Hope this helps,

Rui Barradas

Re: correlation matrix between data from different files

jeff6868
Hello Rui,

Thanks a lot for your answer.

You hoped that your script would help me?
My answer: it is WON-DER-FUL!
It works very well! At first I had some difficulty adapting it to my data, but I succeeded afterwards when I ran a test between 2 stations.
It's not perfect yet (I still have to modify my data a bit because it doesn't recognize the time column, and I have some problems automating it according to the file names of each station), but the main problem (the correlation matrix) seems to be solved thanks to you!

Thanks a lot again!
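
For the time column issue mentioned above, one possible approach is sketched here. The date and the hour are separated by a space, so with whitespace-separated files read.table() sees them as two fields. The sketch assumes files laid out like the example in the first post and a day/month/year date order. The column names "date" and "hour" and the UTC time zone are choices made for illustration.

```r
# Sample lines copied from the example in the first post
txt <- "time sensor1 sensor2 sensor3 sensor4
01/01/2008 00:00 -0.25 -2.43 -3.25 -2.37
01/01/2008 00:15 -0.18 -2.37 -3.18 -2.25
01/01/2008 00:30 -0.25 -2.5 -3.37 -2.56
01/01/2008 00:45 -0.25 -2.37 -3.31 -2.37"

Sensors <- paste("sensor", 1:4, sep = "")

# Skip the 5-name header and supply 6 column names instead,
# because every data row has 6 whitespace-separated fields.
dat <- read.table(text = txt, header = FALSE, skip = 1,
                  col.names = c("date", "hour", Sensors),
                  stringsAsFactors = FALSE)

# Rebuild a single date-time column (assumed format: day/month/year hour:minute)
dat$time <- as.POSIXct(paste(dat$date, dat$hour),
                       format = "%d/%m/%Y %H:%M", tz = "UTC")
```

With a real file, replace text = txt with the file name. The sensor columns can then be selected with dat[, Sensors] exactly as in the script above.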

Re: correlation matrix between data from different files

jeff6868
Yesterday I improved your script a bit (mostly to automate the handling of station numbers). Here's the final version. Thanks again!

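# All 2008 files in the working directory (the dot is escaped so it matches literally)
filenames <- list.files(pattern="_2008_reconstruit\\.csv$")

Sensors <- paste("capteur_", 1:4, sep="")

# The first 5 characters of each file name are used as the station label
Stations <- substr(filenames, 1, 5)

nsensors <- length(Sensors)
nstations <- length(Stations)

# Assumes the number of observations is the same in all files
nobs <- nrow(read.table(filenames[1], header=TRUE, sep=";"))

# observations x sensors x stations
yr2008 <- array(NA, dim=c(nobs, nsensors, nstations))

for(i in seq_len(nstations)){
    tmp <- read.table(filenames[i], header=TRUE, sep=";")
    yr2008[ , , i] <- as.matrix(tmp[, Sensors])
}

dimnames(yr2008) <- list(seq.int(nobs), Sensors, Stations)

# One station-by-station correlation matrix per sensor;
# rows with a missing value in any station are dropped
cor2008 <- lapply(Sensors, function(s) cor(yr2008[ , s, ], use="complete.obs"))
names(cor2008) <- Sensors
cor2008$capteur_1
cor2008$capteur_2
cor2008$capteur_3
cor2008$capteur_4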
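
The first post also asked which station is best correlated with each station, a step the scripts above do not cover. A minimal sketch of one way to do it, using the three-station example matrix from the question; the station labels are made up for illustration.

```r
st <- paste("station", 1:3, sep = "")   # hypothetical labels

# Example correlation matrix taken from the question (sensor 1, year 2009)
cm <- matrix(c(1,     0.910, 0.748,
               0.910, 1,     0.6,
               0.748, 0.6,   1),
             nrow = 3, byrow = TRUE, dimnames = list(st, st))

# A station always correlates perfectly with itself, so mask the diagonal
diag(cm) <- NA

# For each row, the name of the column with the highest correlation
best   <- apply(cm, 1, function(r) names(which.max(r)))
best_r <- apply(cm, 1, max, na.rm = TRUE)

data.frame(station = rownames(cm), best_match = best, r = best_r,
           row.names = NULL)
```

Here station1 and station2 pick each other (0.910) and station3 picks station1 (0.748). The same lines could be applied to each element of a per-sensor list of matrices, for example with lapply().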