Data file verification protocol

wolfste4
Hi R users,

This isn’t an R-specific issue per se, but I thought this list would have some helpful input on the topic.  First, a bit of background: I am working on a project that follows approximately 1000 students each semester and collects about 15 different measurements on each student.  These are both numeric and text, for example grades in a course, race, gender, etc.

I am looking for a verification protocol which can look at a data file and see if it has been modified.  Ideally, this should be something that I can check the file with to see if the file has been changed or corrupted and incorporate into my analysis workflow.  (i.e., every time I look at my data, I can run this protocol to ensure the file hasn’t changed.)

Thanks!
-Steve

--
Steven F. Wolf
Postdoctoral Research Associate
CREATE for STEM Institute
Michigan State University


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Data file verification protocol

barry rowlingson
On Wed, Mar 19, 2014 at 4:03 AM, Wolf, Steven <[hidden email]> wrote:

> Hi R users,
>
> This isn't a R-specific issue, per-se, but I thought that this list would
> have some helpful input on this topic.  First, a bit of background.  I am
> working on a project which is interested in following approx 1000 students
> each semester, and collects about 15 different measurements about each
> student.  These are both numeric and text, for example grades in a course,
> race, gender, etc.
>
> I am looking for a verification protocol which can look at a data file and
> see if it has been modified.  Ideally, this should be something that I can
> check the file with to see if the file has been changed or corrupted and
> incorporate into my analysis workflow.  (i.e., every time I look at my
> data, I can run this protocol to ensure the file hasn't changed.)
>

 Operating systems will keep the last modification time of a file, and you
can use the file.info function in R to check that. However, if someone just
opens and re-saves the file without changing it that will usually trigger
an update of the modification time.
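
As a rough sketch of the timestamp check, the snippet below records a file's
modification time with file.info and compares it later. The helper names
record_mtime and mtime_changed, and the throwaway CSV file, are assumptions
made up for this illustration.

```r
# Sketch: detect a change via the modification time (file.info is base R).
# record_mtime() and mtime_changed() are hypothetical helper names.
record_mtime <- function(path) {
  file.info(path)$mtime
}

mtime_changed <- function(path, recorded) {
  # TRUE when the current mtime differs from the recorded one.
  # A re-save with identical contents also counts as "changed" here.
  as.numeric(file.info(path)$mtime) != as.numeric(recorded)
}

# Illustrative usage on a throwaway file
data_file <- tempfile(fileext = ".csv")
writeLines(c("id,grade", "1,3.5"), data_file)
stamp <- record_mtime(data_file)
unchanged_now <- !mtime_changed(data_file, stamp)
```

In a real workflow the recorded time would have to be saved somewhere between
sessions (for example with saveRDS), which this sketch leaves out.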

 The big question you haven't answered is "has the file been changed since
when?". Since you last ran your analysis? This then looks like a job for
the 'make' utility. You specify rules in a 'Makefile' that specify how to
create "targets" based on "dependencies". For example:

results.txt: data.dat process.R
	Rscript process.R

- says that "results.txt" depends on "data.dat" (your input data) and
"process.R" (your R code that creates results.txt from data.dat), and would
run "Rscript process.R" if data.dat or process.R have a newer modification
time than results.txt. Run twice in rapid succession, this makefile
wouldn't run R the second time because results.txt would be newer than its
dependencies since it was just created.
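
For readers who want to see the rule make applies, here is a minimal sketch
of the same staleness test written in R. It is not how make is implemented;
needs_rebuild is a hypothetical helper, and the files live in a temporary
directory purely for the demonstration.

```r
# Sketch of make's rule: rebuild when the target is missing or when any
# dependency has a newer modification time than the target.
# needs_rebuild() is a hypothetical helper, not part of make or base R.
needs_rebuild <- function(target, deps) {
  stopifnot(all(file.exists(deps)))          # dependencies must exist
  if (!file.exists(target)) return(TRUE)     # nothing built yet
  any(file.info(deps)$mtime > file.info(target)$mtime)
}

# Demonstration with throwaway files
demo_dir <- tempfile("make-demo")
dir.create(demo_dir)
dep    <- file.path(demo_dir, "data.dat")
target <- file.path(demo_dir, "results.txt")

writeLines("input", dep)
before_build <- needs_rebuild(target, dep)   # target does not exist yet

writeLines("output", target)
# Force the target to be clearly newer than its dependency
Sys.setFileTime(target, file.info(dep)$mtime + 10)
after_build <- needs_rebuild(target, dep)
```

Like make itself, this only compares timestamps, so it shares the weakness
mentioned earlier: re-saving an unchanged file still triggers a rebuild.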

 Objects within R don't have timestamps, so it's not possible to
conditionally run an R function if its parameter objects are newer than the
result object. But if you save R objects as .RData files, you can use
"make" based on the timestamps of the .RData files.

 Alternatively you can just keep recent versions of the file hanging around
(1000x15 is pretty small, even multiplied by another 1000 is still not
exactly Big Data) and compare them. In a unix environment the "cmp" command
will quickly test two files for equality, or if you don't want to store
copies of your file you simply compute a checksum or digest and compare
digests. In a unix environment you'd typically use the "md5sum" command,
which prints a 128-bit checksum (32 hexadecimal characters) for each file
argument. If the checksum is different, then the file is different.
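
The same checksum comparison can be done from inside R. The sketch below uses
tools::md5sum, which ships with R; the helper names save_checksum and
verify_checksum and the ".md5" sidecar file are assumptions made for this
example, not an established convention.

```r
# Sketch: store a checksum once, then compare against it on later runs.
# tools::md5sum is part of R; the helper names and the sidecar ".md5"
# file are hypothetical choices for this illustration.
save_checksum <- function(path, sum_file = paste0(path, ".md5")) {
  writeLines(unname(tools::md5sum(path)), sum_file)
  invisible(sum_file)
}

verify_checksum <- function(path, sum_file = paste0(path, ".md5")) {
  recorded <- readLines(sum_file, n = 1)
  identical(unname(tools::md5sum(path)), recorded)
}

# Demonstration on a throwaway file
data_file <- tempfile(fileext = ".csv")
writeLines(c("id,grade", "1,3.5"), data_file)
save_checksum(data_file)
ok_before <- verify_checksum(data_file)   # contents untouched

cat("2,4.0\n", file = data_file, append = TRUE)
ok_after <- verify_checksum(data_file)    # contents modified
```

A call such as stopifnot(verify_checksum("mydata.csv")) at the top of an
analysis script would halt the run when the file no longer matches. Note that
a checksum stored next to the data guards against accidental changes or
corruption, not against someone who deliberately edits both files.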

 Your use case is still a bit vague - for example you haven't said what the
file format is, or how it's being updated.

Barry

Re: Data file verification protocol

Rex-2
Wolf, Steven <[hidden email]> [2014-03-18 21:05]:
> I am looking for a verification protocol which can look at a data file and see if it has been modified.  Ideally, this should be something that I can check the file with to see if the file has been changed or corrupted and incorporate into my analysis workflow.  (i.e., every time I look at my data, I can run this protocol to ensure the file hasn't changed.)

http://dirk.eddelbuettel.com/code/digest.html

Overview

digest provides `hash' function summaries for GNU R objects. The md5,
sha-1, sha-256 and crc32 hash functions are available. The md5
algorithm by Ron Rivest is specified in RFC 1321; the SHA-1 and
SHA-256 algorithms are specified in FIPS-180-1 and FIPS-180-2,
respectively, and the crc32 algorithm is described in here. For md5,
sha-1 and sha-256, this package uses small standalone C
implementations that were provided by Christophe Devine. For crc32,
code from the zlib library is used. For sha-512, a routine by Aaron
Gifford is used. Please note that this package is not meant to be
deployed for cryptographic purposes for which more comprehensive (and
widely tested) libraries such as OpenSSL should be used.

Example

The following verbatim R session loads digest and runs the example()
from the corresponding help page:

> library(digest)
> example(digest)

digest> md5Input <- c("", "a", "abc", "message digest", "abcdefghijklmnopqrstuvwxyz",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
    paste("12345678901234567890123456789012345678901234567890123456789012",
        "34567890123456 ..." ... [TRUNCATED]
[...]
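
For the original question (hashing a data file rather than an R object), a
minimal sketch might look like the following. It assumes the digest package
is installed; file_sha256 is a hypothetical wrapper name, and the temporary
CSV files stand in for the real data.

```r
# Sketch: hash a file's contents with the digest package.
# Assumes digest is installed; file = TRUE tells digest() to read the
# file named by its first argument instead of hashing the path string.
# file_sha256() is a hypothetical wrapper name.
file_sha256 <- function(path) {
  digest::digest(path, algo = "sha256", file = TRUE)
}

if (requireNamespace("digest", quietly = TRUE)) {
  original <- tempfile(fileext = ".csv")
  writeLines(c("id,grade", "1,3.5"), original)

  copy <- tempfile(fileext = ".csv")
  file.copy(original, copy)
  same <- identical(file_sha256(original), file_sha256(copy))

  # Append a row to the copy; its hash should no longer match
  cat("2,4.0\n", file = copy, append = TRUE)
  differs <- !identical(file_sha256(original), file_sha256(copy))
}
```

Recording the hash when the data file is finalized and comparing against it
at the start of each analysis gives the "has it changed?" check asked for.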

HTH,

-rex
--
"...I paid a visit to Schrodinger in his Vienna apartment before his death...
There were no cats. I was told he did not like cats." -quantam leaps,
bernstein.
