comparing two data files

comparing two data files

Nicole Brandt
I have two large data files that I need to compare, finding the differences between data file x and data file y in order to correct data entry errors. In theory, the two files should be identical. I am trying to figure out a way to do this in R. Any help would be great!
______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: comparing two data files

Henrique Dallazuanna
Here are some ways:

all.equal(readLines(file1), readLines(file2))
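
all.equal() reports whether the two files match but not where they differ. A minimal sketch of a line-by-line comparison follows, assuming both files hold the same records in the same order. The helper name diff_lines and the temporary demo files are made up for illustration and are not part of the original reply.

```r
# Hypothetical helper: list the line numbers at which two text files differ.
diff_lines <- function(file1, file2) {
  x <- readLines(file1)
  y <- readLines(file2)
  n <- max(length(x), length(y))
  # Pad the shorter file with NA so that extra lines count as differences
  length(x) <- n
  length(y) <- n
  differs <- is.na(x) | is.na(y) | x != y
  idx <- which(differs)
  data.frame(line = idx, file1 = x[idx], file2 = y[idx],
             stringsAsFactors = FALSE)
}

# Demo on two small temporary files (made-up data)
f1 <- tempfile()
f2 <- tempfile()
writeLines(c("id,value", "1,10", "2,20", "3,30"), f1)
writeLines(c("id,value", "1,10", "2,25", "3,30"), f2)
res <- diff_lines(f1, f2)
print(res)
```

This only works well when a data entry error changes a line in place. If a line was inserted or deleted, every later line will be reported as different.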

You could also try comparing the md5sums of the files:

library(tools)

identical(md5sum(file1), md5sum(file2))

On Tue, Oct 19, 2010 at 8:23 PM, Nicole Brandt <[hidden email]> wrote:

> I have 2 large data files that I need to compare and find the differences
> between data file x and data file y in order to correct data entry error.
> Theoretically both data files should be identical. I am trying to figure out
> a way to do this in R. Any help would be great!


--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

Re: comparing two data files

Mike Marchywka
In reply to this post by Nicole Brandt

----------------------------------------
> From: [hidden email]
> Date: Tue, 19 Oct 2010 18:23:27 -0400
> To: [hidden email]
> Subject: [R] comparing two data files
>
> I have 2 large data files that I need to compare and find the differences between data file x and data file y in order to correct data entry error. Theoretically both data files should be identical. I am trying to figure ou[[elided Hotmail spam]]


I'm not sure why you want to use R for this. There may be very good reasons,
but generally I use text processing utilities like "diff" (see the Linux
or Cygwin docs) along with grep, sed, awk, and maybe perl.
Generally these are not sophisticated with numbers and just process
strings, so if your validation and correction rely on R features, doing it
in R may be worthwhile. If you are really just looking for diffs in strings,
these other tools could be a good alternative and possibly worth the learning curve
for you, if your largest motivation for doing this in R is to learn more R.
I guess the next question is, "what do you want to do if they are not equal?"
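
As a rough illustration of the point about numbers: string tools see "10.0" and "10" as different, while R can compare the parsed values. The sketch below assumes both files are CSVs with the same columns and the same row order. The helper name diff_cells and the inline demo data are hypothetical.

```r
# Hypothetical helper: report the cells in which two data frames differ.
# Assumes identical dimensions and column order; NA vs NA counts as equal.
diff_cells <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  neq <- matrix(FALSE, nrow = nrow(a), ncol = ncol(a))
  for (j in seq_along(a)) {
    u <- a[[j]]
    v <- b[[j]]
    # Different if exactly one value is NA, or both are present and unequal
    neq[, j] <- (is.na(u) != is.na(v)) | (!is.na(u) & !is.na(v) & u != v)
  }
  idx <- which(neq, arr.ind = TRUE)
  data.frame(row = idx[, "row"], column = names(a)[idx[, "col"]],
             stringsAsFactors = FALSE)
}

# Demo: "10.0" and "10" parse to the same number, so only row 2 is flagged
x <- read.csv(text = "id,value\n1,10.0\n2,20\n3,30")
y <- read.csv(text = "id,value\n1,10\n2,25\n3,30")
res <- diff_cells(x, y)
print(res)
```

The flagged rows and columns can then be looked up in both data frames to decide which entry is the correct one.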


     
Re: comparing two data files

alizee
This post has NOT been accepted by the mailing list yet.
In reply to this post by Henrique Dallazuanna
You don't want to use identical() here, because the names of the objects returned by md5sum() will be different: each result is named after its file path. Thus

Henrique Dallazuanna wrote
identical(md5sum(file1), md5sum(file2))

will return FALSE whenever file1 and file2 are different paths, even if the contents match. Here is a helper function that does the job:

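compareFiles <- function(file1, file2) {
  # Call tools::md5sum() directly rather than attaching the package in the function.
  # "==" compares only the checksum values; unname() drops the leftover path names.
  unname(tools::md5sum(file1) == tools::md5sum(file2))
}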
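
To see the names issue in isolation, here is a small sketch using two temporary files with identical contents. The file names and variables are made up for the demonstration.

```r
library(tools)

# Two temporary files with the same contents but different paths
f1 <- tempfile()
f2 <- tempfile()
writeLines(c("a", "b"), f1)
writeLines(c("a", "b"), f2)

m1 <- md5sum(f1)
m2 <- md5sum(f2)

# identical() also compares the names attribute (the file paths)
strict <- identical(m1, m2)
# Dropping the names leaves only the checksums to compare
loose <- identical(unname(m1), unname(m2))

print(c(strict = strict, loose = loose))
```

Here strict is FALSE and loose is TRUE, since the checksums agree and only the path names differ.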