How to benchmark speed of load/readRDS correctly

How to benchmark speed of load/readRDS correctly

raphael.felber-2
Dear all

I was thinking about how to read data into R efficiently and tried several ways to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data, the first created with save(d, file = 'file.Rdata', compress = FALSE) and the second with saveRDS(d, 'file.rds', compress = FALSE).

First I used the function microbenchmark() and was astonished by the max value of the output.

FIRST TEST:
> library(microbenchmark)
> microbenchmark(
+   n <- readRDS('file.rds'),
+   load('file.Rdata')
+ )
Unit: milliseconds
              expr      min       lq     mean   median       uq       max neval
 n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
         load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100

It looks like the max value is an outlier.

So I tried:
SECOND TEST:
> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
  10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
   1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30

This confirmed my suspicion: loading the data the first time takes much longer than the following times. I suspect that this has something to do with how the data is assigned, and that R doesn't have to 'fully' read the data when it is read a second time.

So the question remains, how can I make a realistic benchmark test? From the first test I would conclude that reading the *.rds file is faster. But this holds only for a large number of neval. If I set times = 1 then reading the *.Rdata would be faster (as also indicated by the second test).

Thanks for any help or comments.

Kind regards

Raphael
------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution

Federal Department of Economic Affairs,
Education and Research EAER
Agroscope
Research Division, Agroecology and Environment

Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax     +41 58 468 72 01
[hidden email]
www.agroscope.ch




______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: How to benchmark speed of load/readRDS correctly

Jeff Newmiller
You need to study how reading files works in your operating system. This question is not about R.
--
Sent from my phone. Please excuse my brevity.

On August 22, 2017 5:53:09 AM PDT, [hidden email] wrote:

>So the question remains, how can I make a realistic benchmark test?

Re: How to benchmark speed of load/readRDS correctly

J C Nash
Not convinced Jeff is completely right about this not concerning R, since I've found that the application language (R,
perl, etc.) makes a difference in how files are accessed by/to OS. He is certainly correct that OS (and versions) are
where the actual reading and writing happens, but sometimes the call to those can be inefficient. (Sorry, I've not got
examples specifically for file reads, but had a case in computation where there was an 800% i.e., 80000 fold difference
in timing with R, which rather took my breath away. That's probably been sorted now.) The difficulty in making general
statements is that a rather full set of comparisons over different commands, datasets, OS and version variants is needed
before the general picture can emerge. Using microbenchmark when you need to find the bottlenecks is how I'd proceed,
which OP is doing.

About 30 years ago, I did write up some preliminary work, never published, on estimating the two halves of a copy, that
is, the reading from file and storing to "memory" or a different storage location. This was via regression with a
singular design matrix, but one can get a minimal length least squares solution via svd. Possibly relevant today to try
to get at slow links on a network.
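
A minimal sketch of the minimal-length least squares idea mentioned above, not a reconstruction of the unpublished work. It assumes a made-up setup with two source and two destination devices, where each copy time is modelled as a read time plus a write time. The device names and the timings in `y` are hypothetical. The design matrix is singular because a constant can be moved from the read terms to the write terms without changing any fitted value, so the SVD is used to pick the solution of smallest length.

```r
# Hypothetical setup: 2 source devices (A, B) and 2 destination devices (C, D).
# Each copy time is modelled as read_time[source] + write_time[destination].
X <- cbind(read_A  = c(1, 1, 0, 0),
           read_B  = c(0, 0, 1, 1),
           write_C = c(1, 0, 1, 0),
           write_D = c(0, 1, 0, 1))
y <- c(5, 7, 6, 8)   # made-up copy times in seconds, not measurements

# The design is singular: the read columns and the write columns both sum to 1.
s    <- svd(X)
tol  <- max(dim(X)) * max(s$d) * .Machine$double.eps
keep <- s$d > tol                       # numerically non-zero singular values

# Minimal-length least squares solution: b = V D^+ U' y
b <- s$v[, keep, drop = FALSE] %*%
  ((t(s$u[, keep, drop = FALSE]) %*% y) / s$d[keep])
rownames(b) <- colnames(X)

rank_X <- sum(keep)       # rank of the design matrix
fitted <- drop(X %*% b)   # fitted copy times
print(b)
```

Only differences such as read_B minus read_A are identified by data of this kind; the individual values in `b` depend on the minimal-length convention.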

JN

On 2017-08-22 09:07 AM, Jeff Newmiller wrote:
> You need to study how reading files works in your operating system. This question is not about R.
>

Re: How to benchmark speed of load/readRDS correctly

R help mailing list-2
In reply to this post by raphael.felber-2
The large value for maximum time may be due to garbage collection, which
happens periodically.   E.g., try the following, where the
unlist(as.list()) creates a lot of garbage.  I get a very large time every
102 or 51 iterations and a moderately large time more often

mb <- microbenchmark::microbenchmark(
  { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
  times = 1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#       0%       50%       75%       90%       95%       99%      100%
# 59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102  51 102 102 102 102 102 102  51
diff(which(mb$time > quantile(mb$time, .95)))
#  [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
# [26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22



Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: How to benchmark speed of load/readRDS correctly

Jeff Newmiller
In reply to this post by J C Nash
Caching happens, both within the operating system and within the C standard library. Ostensibly the intent for those caches is to help performance, but you are right that different low-level caching algorithms can be a poor match for specific application level use cases such as copying files or parsing text syntax. However, the OS and even the specific file system drivers (e.g. ext4 on flash disk or FAT32 on magnetic media) can behave quite differently for the same application level use case, so a generic discussion at the R language level (this mailing list) can be almost impossible to sort out intelligently.
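
One way to keep the caching effect visible in a benchmark is to time every read separately and report the first read apart from the rest. The following is a minimal sketch under stated assumptions: the data frame `d`, the temporary file names and the helpers `time_reads()` and `summarise_times()` are made up for illustration. Note that a file written moments earlier is probably still in the operating system's cache, so the "first" read here is not a true cold read. A genuinely cold read would need an OS-specific step (or a reboot) that is outside R.

```r
# Hypothetical example data; the object `d` from the thread is not shown there.
set.seed(1)
d <- data.frame(x = rnorm(1e5), y = runif(1e5))

rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")
saveRDS(d, rds_file, compress = FALSE)
save(d, file = rdata_file, compress = FALSE)

# Time each read individually so the first read can be inspected on its own.
time_reads <- function(read_fun, n = 10) {
  vapply(seq_len(n),
         function(i) system.time(read_fun())[["elapsed"]],
         numeric(1))
}

t_rds   <- time_reads(function() readRDS(rds_file))
# Load into a throwaway environment so the workspace is not overwritten.
t_rdata <- time_reads(function() load(rdata_file, envir = new.env()))

# Report the first read separately from the median of the later reads.
summarise_times <- function(t) c(first = t[1], median_rest = median(t[-1]))
res <- rbind(readRDS = summarise_times(t_rds),
             load    = summarise_times(t_rdata))
print(res)
```

Which of the two numbers is the realistic one depends on the use case: a script that reads a file once per session mostly sees the first-read cost, while repeated reads see the cached cost.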
--
Sent from my phone. Please excuse my brevity.

On August 22, 2017 7:11:39 AM PDT, J C Nash <[hidden email]> wrote:

>Not convinced Jeff is completely right about this not concerning R,
>since I've found that the application language (R,
>perl, etc.) makes a difference in how files are accessed by/to OS.

Re: How to benchmark speed of load/readRDS correctly

R help mailing list-2
In reply to this post by R help mailing list-2
Note that if you force a garbage collection each iteration the times are
more stable.  However, on the average it is faster to let the garbage
collector decide when to leap into action.

mb_gc <- microbenchmark::microbenchmark(
  gc(),
  { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
  times = 1000, control = list(order = "inorder"))
with(mb_gc, plot(time[expr != "gc()"]))
with(mb_gc, quantile(1e-6 * time[expr != "gc()"], c(0, .5, .75, .9, .95, .99, 1)))
#       0%       50%       75%       90%       95%       99%      100%
# 59.33450  61.33954  63.43457  66.23331  68.93746  74.45629 158.09799
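
The same idea can be tried on file reads using base R only. This is a rough sketch, not a recommendation: the test file and the helper `time_reads_gc()` are hypothetical, and whether forcing a collection changes the spread noticeably will depend on the machine and the data.

```r
# Hypothetical test file; any uncompressed .rds file would do.
f <- tempfile(fileext = ".rds")
saveRDS(as.list(sin(1:1e5)), f, compress = FALSE)

# Time n reads; optionally run gc() (untimed) before each one.
time_reads_gc <- function(n, force_gc) {
  vapply(seq_len(n), function(i) {
    if (force_gc) gc(verbose = FALSE)
    system.time(readRDS(f))[["elapsed"]]
  }, numeric(1))
}

t_free   <- time_reads_gc(20, force_gc = FALSE)
t_forced <- time_reads_gc(20, force_gc = TRUE)

# Compare minimum, median and maximum of the two runs.
spread <- rbind(free   = quantile(t_free,   c(0, .5, 1)),
                forced = quantile(t_forced, c(0, .5, 1)))
print(spread)
```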



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <[hidden email]> wrote:

> The large value for maximum time may be due to garbage collection, which
> happens periodically.   E.g., try the following, where the
> unlist(as.list()) creates a lot of garbage.  I get a very large time every
> 102 or 51 iterations and a moderately large time more often

Convert Factor to Date

Patrick Casimir
In reply to this post by Jeff Newmiller
Dear R Fellows,


I have a dataset (data1) with 2 date columns that are stored as factors. How can I convert them to Date, then compare them and keep only the later date in a new column? I tried as.Date to change the class to Date, but the data become NA.


Much Thanks


COL1    COL2
Apr-16  1-Nov-16
May-16  1-Nov-16
Jun-16  1-Nov-16
Jul-16  1-Nov-16
Aug-16  1-Nov-16
Sep-16  1-Nov-16
Oct-16  1-Nov-16
Nov-16  1-Nov-16
Dec-16  1-Nov-16
Jan-17  1-Nov-16
Feb-17  1-Nov-16
Mar-17  1-Nov-16
Apr-17  1-Nov-16
May-17  1-Nov-16
Jun-17  1-Nov-16
Jul-17  1-Nov-16
Aug-17  1-Nov-16
Sep-17  1-Nov-16



Re: Convert Factor to Date

Spencer Graves-4


On 2017-08-22 1:30 PM, Patrick Casimir wrote:
> Dear R Fellows,
>
>
> I Have a dataset( data1) with 2 columns of date showing a class of factor. How to convert them to date? Then compare them, keep the greater date only in a new column. Using as.Date to change the class to Date but the data becomes NA.


       When I specified a format with the second date, I got the desired
behavior:


 > as.Date(factor('1-Nov-16'), '%d-%b-%y')
[1] "2016-11-01"
 > as.Date('Nov-16', '%b-%y')
[1] NA
 > as.Date(factor('Nov-16'), '%b-%y')
[1] NA
 > as.Date('Nov-16', '%b-%y')
[1] NA


       To convert the first column, I pasted "1-" in front:


as.Date(paste0('1-', factor('Nov-16')), '%d-%b-%y')
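
The '%b-%y' attempts return NA because a month and a year alone do not determine a date; supplying a day fixes that. Applied to whole columns, the approach might look like the sketch below. It assumes the column names COL1 and COL2 from the question, uses only three of the posted rows, and introduces a new column name, LATEST, purely for illustration. The call to Sys.setlocale() is an added assumption so that the English month abbreviations parse regardless of the local language settings.

```r
# %b is locale-dependent; the "C" locale uses English month abbreviations.
Sys.setlocale("LC_TIME", "C")

# A few rows from the question, rebuilt as factor columns (as in `data1`).
data1 <- data.frame(
  COL1 = factor(c("Apr-16", "Nov-16", "Sep-17")),
  COL2 = factor(c("1-Nov-16", "1-Nov-16", "1-Nov-16"))
)

# COL1 has no day, so prepend "1-" before parsing; COL2 parses directly.
d1 <- as.Date(paste0("1-", as.character(data1$COL1)), format = "%d-%b-%y")
d2 <- as.Date(as.character(data1$COL2), format = "%d-%b-%y")

# pmax() compares element-wise and keeps the later of the two dates.
data1$LATEST <- pmax(d1, d2)
print(data1)
```

Because every step is vectorised, the same lines work unchanged on a large data set and for any mix of months and years in the two columns.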


       Hope this helps.  Spencer

Re: Convert Factor to Date

Patrick Casimir
This is a large data set, Spencer. What about when the dates change, as below:


COL1    COL2
Jan-14  1-Aug-16
Feb-14  1-Aug-16
Mar-14  1-Aug-16
Apr-14  1-Aug-16
May-14  1-Aug-16
Jun-14  1-Aug-16
Jul-14  1-Aug-16
Aug-14  1-Aug-16
Sep-14  1-Aug-16
Oct-14  1-Aug-16
Nov-14  1-Aug-16
Dec-14  1-Aug-16
Jan-15  1-Aug-16
Feb-15  1-Aug-16
Mar-15  1-Aug-16
Apr-15  1-Aug-16
May-15  1-Aug-16
Jun-15  1-Aug-16
Jul-15  1-Aug-16
Aug-15  1-Aug-16
Sep-15  1-Aug-16
Oct-15  1-Aug-16
Nov-15  1-Aug-16
Dec-15  1-Aug-16
Jan-16  1-Aug-16
Feb-16  1-Aug-16
Mar-16  1-Aug-16
Apr-16  1-Aug-16
May-16  1-Aug-16
Jun-16  1-Aug-16
Jul-16  1-Aug-16
Aug-16  1-Aug-16
Sep-16  1-Aug-16
Oct-16  1-Aug-16







Re: Convert Factor to Date

Spencer Graves-4


On 2017-08-22 2:04 PM, Patrick Casimir wrote:
>
> This is large data set Spencer. What about when the dates change as below:
>


       Have you tried what I suggested?  What were the results?  Spencer

>
> COL1 COL2
> Jan-14 1-Aug-16
> Feb-14 1-Aug-16
> Mar-14 1-Aug-16
> Apr-14 1-Aug-16
> May-14 1-Aug-16
> Jun-14 1-Aug-16
> Jul-14 1-Aug-16
> Aug-14 1-Aug-16
> Sep-14 1-Aug-16
> Oct-14 1-Aug-16
> Nov-14 1-Aug-16
> Dec-14 1-Aug-16
> Jan-15 1-Aug-16
> Feb-15 1-Aug-16
> Mar-15 1-Aug-16
> Apr-15 1-Aug-16
> May-15 1-Aug-16
> Jun-15 1-Aug-16
> Jul-15 1-Aug-16
> Aug-15 1-Aug-16
> Sep-15 1-Aug-16
> Oct-15 1-Aug-16
> Nov-15 1-Aug-16
> Dec-15 1-Aug-16
> Jan-16 1-Aug-16
> Feb-16 1-Aug-16
> Mar-16 1-Aug-16
> Apr-16 1-Aug-16
> May-16 1-Aug-16
> Jun-16 1-Aug-16
> Jul-16 1-Aug-16
> Aug-16 1-Aug-16
> Sep-16 1-Aug-16
> Oct-16 1-Aug-16
>
>
>
>
>
>
>
> ------------------------------------------------------------------------
> *From:* R-help <[hidden email]> on behalf of Spencer
> Graves <[hidden email]>
> *Sent:* Tuesday, August 22, 2017 2:49 PM
> *To:* [hidden email]
> *Subject:* Re: [R] Convert Factor to Date
>
>
> On 2017-08-22 1:30 PM, Patrick Casimir wrote:
> > Dear R Fellows,
> >
> >
> > I Have a dataset( data1) with 2 columns of date showing a class of
> factor. How to convert them to date? Then compare them, keep the
> greater date only in a new column. Using as.Date to change the class
> to Date but the data becomes NA.
>
>
>       When I specified a format with the second date, I got the desired
> behavior:
>
>
>  > as.Date(factor('1-Nov-16'), '%d-%b-%y')
> [1] "2016-11-01"
>  > as.Date('Nov-16', '%b-%y')
> [1] NA
>  > as.Date(factor('Nov-16'), '%b-%y')
> [1] NA
>  > as.Date('Nov-16', '%b-%y')
> [1] NA
>
>
>       To convert the first column, I pasted "1-" in front:
>
>
> as.Date(paste0('1-', factor('Nov-16')), '%d-%b-%y')
>
>
>       Hope this helps.  Spencer
>
> > Much Thanks
> >
> >
> > COL1    COL2
> > Apr-16  1-Nov-16
> > May-16  1-Nov-16
> > Jun-16  1-Nov-16
> > Jul-16  1-Nov-16
> > Aug-16  1-Nov-16
> > Sep-16  1-Nov-16
> > Oct-16  1-Nov-16
> > Nov-16  1-Nov-16
> > Dec-16  1-Nov-16
> > Jan-17  1-Nov-16
> > Feb-17  1-Nov-16
> > Mar-17  1-Nov-16
> > Apr-17  1-Nov-16
> > May-17  1-Nov-16
> > Jun-17  1-Nov-16
> > Jul-17  1-Nov-16
> > Aug-17  1-Nov-16
> > Sep-17  1-Nov-16
> >

Re: How to benchmark speed of load/readRDS correctly

raphael.felber-2
In reply to this post by R help mailing list-2
Hi Bill

Thanks for your answer and the explanations. I tried to use garbage collection but I'm still not satisfied with the result. Maybe the question was not stated clearly enough. I want to test the speed of reading/loading data into R when a 'fresh' R session is started (or even after a restart of the computer).

To understand what really happens, I tried:
r1 <- sapply(1:10000, function(x) { gc(); t <- system.time(n <- readRDS('file.rds'))[3]; rm(n); gc(); return(t)})

and found behavior similar to yours; now and then the time is much larger, but the times are not as stable as in your example. The highest values are up to 50 times larger than most of the other times (8 sec vs 0.15 sec), even with garbage collection. I assume that with the code above the time spent on garbage collection isn't measured.

However, the first iteration always takes the longest. I'm wondering if I should take the first value as the best guess.
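
A minimal sketch of how the first read can be reported separately from the later ones, using base R only. The data set, the temporary file names and the helper names (time_reads, summarise_reads) are made up for illustration. Note the assumption: because the files are written just before they are read, the operating system may already hold them in its cache, so even the first read here is not a true cold read. A cold read would need a fresh boot or an OS-level cache flush done outside R.

```r
# Build a throwaway data set and write it uncompressed in both formats.
d <- data.frame(x = rnorm(1e5), y = sample(letters, 1e5, replace = TRUE))
rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")
saveRDS(d, rds_file, compress = FALSE)
save(d, file = rdata_file, compress = FALSE)

# Time n_rep reads; gc() runs before each timed call, outside system.time().
time_reads <- function(read_fun, n_rep = 10) {
  vapply(seq_len(n_rep), function(i) {
    gc()
    unname(system.time(read_fun())["elapsed"])
  }, numeric(1))
}

t_rds   <- time_reads(function() readRDS(rds_file))
t_rdata <- time_reads(function() load(rdata_file, envir = new.env()))

# Keep the first read apart from the rest instead of averaging over all of them.
summarise_reads <- function(times) {
  c(first = times[1], median_rest = median(times[-1]))
}
print(rbind(readRDS = summarise_reads(t_rds), load = summarise_reads(t_rdata)))
```

Whether the first value or the median of the rest is the "right" number depends on which situation you care about: a single read in a new session, or repeated reads of the same file.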

Cheers Raphael
From: William Dunlap [mailto:[hidden email]]
Sent: Tuesday, 22 August 2017 19:13
To: Felber Raphael Agroscope <[hidden email]>
Cc: [hidden email]
Subject: Re: [R] How to benchmark speed of load/readRDS correctly

Note that if you force a garbage collection each iteration the times are more stable.  However, on the average it is faster to let the garbage collector decide when to leap into action.

mb_gc <- microbenchmark::microbenchmark(gc(), { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5) ; sum(x) }, times=1000, control=list(order="inorder"))
with(mb_gc, plot(time[expr!="gc()"]))
with(mb_gc, quantile(1e-6*time[expr!="gc()"], c(0, .5, .75, .9, .95, .99, 1)))
#       0%       50%       75%       90%       95%       99%      100%
# 59.33450  61.33954  63.43457  66.23331  68.93746  74.45629 158.09799



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <[hidden email]> wrote:
The large value for maximum time may be due to garbage collection, which happens periodically.   E.g., try the following, where the unlist(as.list()) creates a lot of garbage.  I get a very large time every 102 or 51 iterations and a moderately large time more often

mb <- microbenchmark::microbenchmark({ x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5) ; sum(x) }, times=1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#       0%       50%       75%       90%       95%       99%      100%
# 59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102  51 102 102 102 102 102 102  51
diff(which(mb$time > quantile(mb$time, .95)))
# [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
#[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Aug 22, 2017 at 5:53 AM, <[hidden email]> wrote:
Dear all

I was thinking about efficient ways of reading data into R and tried several ways to test whether load(file.Rdata) or readRDS(file.rds) is faster. The files file.Rdata and file.rds contain the same data, the first created with save(d, file='file.Rdata', compress=F) and the second with saveRDS(d, 'file.rds', compress=F).

First I used the function microbenchmark() and was astonished by the max value of the output.

FIRST TEST:
> library(microbenchmark)
> microbenchmark(
+   n <- readRDS('file.rds'),
+   load('file.Rdata')
+ )
Unit: milliseconds
              expr      min       lq     mean   median       uq       max neval
 n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
         load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100

It looks like the max value is an outlier.

So I tried:
SECOND TEST:
> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
  10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
   1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30

This confirmed my suspicion: the first time, loading the data takes much longer than the following times. I suspect that this has something to do with how the data is assigned, and that R doesn't have to 'fully' read the data when it is read a second time.

So the question remains: how can I make a realistic benchmark test? From the first test I would conclude that reading the *.rds file is faster. But this holds only for a large number of evaluations (neval). If I set times = 1, then reading the *.Rdata would be faster (as also indicated by the second test).

Thanks for any help or comments.

Kind regards

Raphael
------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution

Federal Department of Economic Affairs,
Education and Research EAER
Agroscope
Research Division, Agroecology and Environment

Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax     +41 58 468 72 01
[hidden email]
www.agroscope.ch



Re: How to benchmark speed of load/readRDS correctly

raphael.felber-2
In reply to this post by Jeff Newmiller
Hi there

Thanks for your answers. I didn't expect that this would be so complex. Honestly, I don't understand everything you wrote since I'm not an IT specialist. But I read somewhere that reading *.rds files is faster than loading *.Rdata, and I wanted to prove that for my system and R version. But thanks anyway for your time.

Cheers Raphael


> -----Original Message-----
> From: Jeff Newmiller [mailto:[hidden email]]
> Sent: Tuesday, 22 August 2017 18:33
> To: J C Nash <[hidden email]>; [hidden email]; Felber Raphael
> Agroscope <[hidden email]>
> Subject: Re: [R] How to benchmark speed of load/readRDS correctly
>
> Caching happens, both within the operating system and within the C
> standard library. Ostensibly the intent for those caches is to help
> performance, but you are right that different low-level caching algorithms
> can be a poor match for specific application level use cases such as copying
> files or parsing text syntax. However, the OS and even the specific file
> system drivers (e.g. ext4 on flash disk or FAT32 on magnetic media) can
> behave quite differently for the same application level use case, so a generic
> discussion at the R language level (this mailing list) can be almost impossible
> to sort out intelligently.
> --
> Sent from my phone. Please excuse my brevity.
>
> On August 22, 2017 7:11:39 AM PDT, J C Nash <[hidden email]>
> wrote:
> >Not convinced Jeff is completely right about this not concerning R,
> >since I've found that the application language (R, perl, etc.) makes a
> >difference in how files are accessed by/to OS. He is certainly correct
> >that OS (and versions) are where the actual reading and writing
> >happens, but sometimes the call to those can be inefficient. (Sorry,
> >I've not got examples specifically for file reads, but had a case in
> >computation where there was an 800% i.e., 80000 fold difference in
> >timing with R, which rather took my breath away. That's probably been
> >sorted now.) The difficulty in making general statements is that a
> >rather full set of comparisons over different commands, datasets, OS
> >and version variants is needed before the general picture can emerge.
> >Using microbenchmark when you need to find the bottlenecks is how I'd
> >proceed, which OP is doing.
> >
> >About 30 years ago, I did write up some preliminary work, never
> >published, on estimating the two halves of a copy, that is, the reading
> >from file and storing to "memory" or a different storage location. This
> >was via regression with a singular design matrix, but one can get a
> >minimal length least squares solution via svd. Possibly relevant today
> >to try to get at slow links on a network.
> >
> >JN
> >
> >On 2017-08-22 09:07 AM, Jeff Newmiller wrote:
> >> You need to study how reading files works in your operating system.
> >This question is not about R.
> >>

Re: How to benchmark speed of load/readRDS correctly

Ismail SEZEN
First of all I want to mention the _warmup_ parameter in the _control_ argument of the microbenchmark function. The default value is 2, and the function runs the code 2 times before it starts counting the time intervals.
See ?microbenchmark

> However, the first iteration always takes the longest. I'm wondering if I should take the first value as best guess.

So, at least for the microbenchmark function, the maximum iteration time in the microbenchmark result is not related to the first iteration, but may be related to other processes or factors that access the file system at the same time as your code.
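
A small sketch of how the warm-up count can be set, assuming the microbenchmark package is installed (the code is skipped otherwise). The temporary file and the values 50 and 10 are arbitrary choices for illustration; what exactly the warm-up iterations do is described in ?microbenchmark.

```r
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  # Hypothetical test file, written uncompressed as in the original question.
  f <- tempfile(fileext = ".rds")
  saveRDS(data.frame(x = rnorm(1e4)), f, compress = FALSE)

  mb <- microbenchmark::microbenchmark(
    readRDS(f),
    times   = 50,
    control = list(warmup = 10)   # the documented default is 2
  )
  print(mb)

  # Raw per-evaluation timings are stored in nanoseconds in mb$time;
  # inspecting them directly shows where the large values occur.
  print(summary(mb$time * 1e-6))  # milliseconds
}
```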

We can also examine the underlying code of the load and readRDS functions. Simply type _load_ and _readRDS_ at the R console to see the code.

# readRDS
function (file, refhook = NULL)
{
    if (is.character(file)) {
        con <- gzfile(file, "rb")
        on.exit(close(con))
    }
    else if (inherits(file, "connection"))
        con <- if (inherits(file, "gzfile") || inherits(file,
            "gzcon"))
            file
        else gzcon(file)
    else stop("bad 'file' argument")
    .Internal(unserializeFromConn(con, refhook))
}

link to unserializeFromConn -> https://github.com/wch/r-source/blob/94cd276ed0eef865e01fcf4e96925d9373cc5799/src/main/serialize.c#L2246

# load
function (file, envir = parent.frame(), verbose = FALSE)
{
    if (is.character(file)) {
        con <- gzfile(file)
        on.exit(close(con))
        magic <- readChar(con, 5L, useBytes = TRUE)
        if (!length(magic))
            stop("empty (zero-byte) input file")
        if (!grepl("RD[AX]2\n", magic)) {
            if (grepl("RD[ABX][12]\r", magic))
                stop("input has been corrupted, with LF replaced by CR")
            warning(sprintf("file %s has magic number '%s'\n",
                sQuote(basename(file)), gsub("[\n\r]*", "", magic)),
                "  ", "Use of save versions prior to 2 is deprecated",
                domain = NA, call. = FALSE)
            return(.Internal(load(file, envir)))
        }
    }
    else if (inherits(file, "connection")) {
        con <- if (inherits(file, "gzfile") || inherits(file,
            "gzcon"))
            file
        else gzcon(file)
    }
    else stop("bad 'file' argument")
    if (verbose)
        cat("Loading objects:\n")
    .Internal(loadFromConn2(con, envir, verbose))
}

link to loadFromConn2 -> https://github.com/wch/r-source/blob/c1093fa1073fef6404869f26a1be6ef5bd2aa0fd/src/main/saveload.c#L2329

Both use an internal function: "unserializeFromConn" for readRDS and "loadFromConn2" for load. You can examine them at the links given above.

Even if we don't know C/C++, we can conclude that both functions use similar code to read the data. Also, the _load_ function has more lines of code than the _readRDS_ function because it checks some bytes first. (This may also create a small difference, as you found in your tests; see mean and median.)

Additionally, I want to discuss another aspect. Why are there 2 functions called _readRDS_ and _load_?

Because they have different purposes. You use the _load_ function to read several variables saved together by the _save_ function, and you use the _readRDS_ function to read a single variable saved by the _saveRDS_ function. So it is inevitable that the two functions are optimized for different purposes. This is like comparing an apple and a pear: both are fruits, but sometimes you want to eat an apple and sometimes a pear.

As a result, if I need to save and read a single R object, I prefer the readRDS/saveRDS pair. If I need to serialize multiple objects to a file, I use the load/save pair.
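
To illustrate the difference in purpose, here is a short sketch using temporary files. The object names x and y and the file names are made up for the example.

```r
x <- 1:10
y <- letters

rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")

# saveRDS/readRDS: one object per file; the reader chooses the name.
saveRDS(x, rds_file)
x_copy <- readRDS(rds_file)

# save/load: several objects per file; they come back under their saved names.
save(x, y, file = rdata_file)
e <- new.env()
loaded <- load(rdata_file, envir = e)  # load() returns the names it restored

print(loaded)
print(identical(x_copy, x))
```

Loading into a separate environment, as above, avoids silently overwriting objects of the same name in the workspace.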

> On 23 Aug 2017, at 15:40, <[hidden email]> <[hidden email]> wrote:
>
> Hi there
>
> Thanks for your answers. I didn't expect that this would be so complex. Honestly, I don't understand everything you wrote since I'm not an IT specialist. But I read something that reading *.rds files is faster than loading *.Rdata and I wanted to proof that for my system and R version. But thanks anyway for your time.
>
> Cheers Raphael
>
>
>> -----Original Message-----
>> From: Jeff Newmiller [mailto:[hidden email]]
>> Sent: Tuesday, 22 August 2017 18:33
>> To: J C Nash <[hidden email]>; [hidden email]; Felber Raphael
>> Agroscope <[hidden email]>
>> Subject: Re: [R] How to benchmark speed of load/readRDS correctly
>>
>> Caching happens, both within the operating system and within the C
>> standard library. Ostensibly the intent for those caches is to help
>> performance, but you are right that different low-level caching algorithms
>> can be a poor match for specific application level use cases such as copying
>> files or parsing text syntax. However, the OS and even the specific file
>> system drivers (e.g. ext4 on flash disk or FAT32 on magnetic media) can
>> behave quite differently for the same application level use case, so a generic
>> discussion at the R language level (this mailing list) can be almost impossible
>> to sort out intelligently.
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On August 22, 2017 7:11:39 AM PDT, J C Nash <[hidden email]>
>> wrote:
>>> Not convinced Jeff is completely right about this not concerning R,
>>> since I've found that the application language (R, perl, etc.) makes a
>>> difference in how files are accessed by/to OS. He is certainly correct
>>> that OS (and versions) are where the actual reading and writing
>>> happens, but sometimes the call to those can be inefficient. (Sorry,
>>> I've not got examples specifically for file reads, but had a case in
>>> computation where there was an 800% i.e., 80000 fold difference in
>>> timing with R, which rather took my breath away. That's probably been
>>> sorted now.) The difficulty in making general statements is that a
>>> rather full set of comparisons over different commands, datasets, OS
>>> and version variants is needed before the general picture can emerge.
>>> Using microbenchmark when you need to find the bottlenecks is how I'd
>>> proceed, which OP is doing.
>>>
>>> About 30 years ago, I did write up some preliminary work, never
>>> published, on estimating the two halves of a copy, that is, the reading
>>> from file and storing to "memory" or a different storage location. This
>>> was via regression with a singular design matrix, but one can get a
>>> minimal length least squares solution via svd. Possibly relevant today
>>> to try to get at slow links on a network.
>>>
>>> JN
>>>
>>> On 2017-08-22 09:07 AM, Jeff Newmiller wrote:
>>>> You need to study how reading files works in your operating system.
>>> This question is not about R.
>>>>

Comparing 2 dale columns

Patrick Casimir
In reply to this post by Patrick Casimir
Dear R fellows,


I created a new column Date_Flag to compare the dates of COL1 and COL2 using the code
below. But it showed that 5/1/15 is greater than 6/1/2014 and 5/1/2015 greater than
7/1/2014 despite the year being greater. How do I fix that? I tried formatting as
%y/%m/%d, but that does not fix it.


data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)


COL1       COL2
6/1/14     5/1/15
7/1/14     5/1/15


data$COL2<- as.Date(as.character(data$COL2, format="%y/%m/%d"))
data$COL1<- as.Date(as.character(data$COL1, format="%y/%m/%d"))



Re: Comparing 2 dale columns

PIKAL Petr
Hi

your code is wrong.

I get
> test<-read.table("clipboard", header=T)
> str(test)
'data.frame':   2 obs. of  2 variables:
 $ COL1: Factor w/ 2 levels "6/1/14","7/1/14": 1 2
 $ COL2: Factor w/ 1 level "5/1/15": 1 1
> test$COL2<- as.Date(as.character(test$COL2, format="%y/%m/%d"))
> test$COL1<- as.Date(as.character(test$COL1, format="%y/%m/%d"))
                                              ^^^^^^^^^^^^^^^^^
incorrect parentheses position, wrong y,m,d

Using correct syntax I get correct result.

> test$COL2<- as.Date(test$COL2, format="%d/%m/%y")
> test$COL1<- as.Date(test$COL1, format="%d/%m/%y")
>
> test$COL2 > test$COL1
[1] TRUE TRUE
> test
        COL1       COL2
1 2014-01-06 2015-01-05
2 2014-01-07 2015-01-05
>
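
The same fix as a self-contained sketch that does not need the clipboard. It assumes, as in the session above, that the strings are day/month/two-digit-year; if they are US-style month/day/year, the format would be "%m/%d/%y" instead. The variable names fmt, col1 and col2 are made up for the example.

```r
test <- data.frame(COL1 = factor(c("6/1/14", "7/1/14")),
                   COL2 = factor(c("5/1/15", "5/1/15")))

# 'format' has to be an argument of as.Date(), not of as.character().
fmt  <- "%d/%m/%y"   # assumption: day/month/year with a two-digit year
col1 <- as.Date(as.character(test$COL1), format = fmt)
col2 <- as.Date(as.character(test$COL2), format = fmt)

str(col1)            # confirm the class is Date before comparing
print(col2 > col1)
```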

Cheers
Petr

> -----Original Message-----
> From: R-help [mailto:[hidden email]] On Behalf Of Patrick
> Casimir
> Sent: Wednesday, August 23, 2017 4:54 PM
> To: [hidden email]
> Subject: [R] Comparing 2 dale columns
>
> Dear R fellows,
>
>
> I created a new column Date_flag to compare the dates of COL1 and COL2
> using the code below. But it showed that 5/1/15 is greater than 6/1/2014 and
> 5/1/2015 greater than
> 7/1/2014 despite the year is greater. How do I fix that? I did try to format as
> %y/%m/%d
>
>  but it does not fix that.
>
>
> data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)
>
>
> COL1       COL2
> 6/1/14     5/1/15
> 7/1/14     5/1/15
>
>
> data$COL2<- as.Date(as.character(data$COL2, format="%y/%m/%d"))
> data$COL1<- as.Date(as.character(data$COL1, format="%y/%m/%d"))
>

Re: Comparing 2 dale columns

R. Mark Sharp
In reply to this post by Patrick Casimir
Patrick,

## Run the following script and notice the different values of the dataframe "data" in each instance.

# I understand you have done something like the following:
data <- data.frame(COL1 = c("6/1/14", "7/1/14"),
                   COL2 = c("5/1/15", "5/1/15"), stringsAsFactors = FALSE)
data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)
data
data$COL2 <- as.Date(as.character(data$COL2, format = "%y/%m/%d"))
data$COL1 <- as.Date(as.character(data$COL1, format = "%y/%m/%d"))
data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)
data

# What you may want instead is the following:
data <- data.frame(COL1 = c("6/1/14", "7/1/14"),
                   COL2 = c("5/1/15", "5/1/15"), stringsAsFactors = FALSE)
data
## strptime() converts the character vector to POSIXlt so you do not necessarily
## need the as.Date. However, they are not the same and you may need the Date
## class.
data$COL2 <- as.Date(strptime(data$COL2, format = "%m/%d/%y"))
data$COL1 <- as.Date(strptime(data$COL1, format = "%m/%d/%y"))
data
data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)
data
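
Once both columns are of class Date, the later of the two dates can also be kept in a new column, as asked in the earlier "Convert Factor to Date" thread. A sketch, assuming the month/day/year format used above; the column name Greater is made up for the example.

```r
data <- data.frame(COL1 = c("6/1/14", "7/1/14"),
                   COL2 = c("5/1/15", "5/1/15"), stringsAsFactors = FALSE)

# as.Date() accepts character input directly when a format is given.
data$COL1 <- as.Date(data$COL1, format = "%m/%d/%y")
data$COL2 <- as.Date(data$COL2, format = "%m/%d/%y")

# Same flag convention as above: 0 when COL2 is the later date, 1 otherwise.
data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0, 1)

# pmax() compares element by element and keeps the Date class.
data$Greater <- pmax(data$COL1, data$COL2)
print(data)
```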


R. Mark Sharp, Ph.D.
[hidden email]





> On Aug 23, 2017, at 9:53 AM, Patrick Casimir <[hidden email]> wrote:
>
> data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)
>
>
> COL1       COL2
> 6/1/14     5/1/15
> 7/1/14     5/1/15
>
>
> data$COL2<- as.Date(as.character(data$COL2, format="%y/%m/%d"))
> data$COL1<- as.Date(as.character(data$COL1, format="%y/%m/%d"))
>


Re: Comparing 2 dale columns

Patrick Casimir
Hi Mark,


Instead of 1 and 0, it generates NA:

 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
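
as.Date() returns NA for every string that does not match the given format, so all-NA output usually means the real data are not in the format that was passed. A sketch of how the offending values could be listed; the vector raw is a made-up mix of the formats that appear in this thread, not the actual data.

```r
# Hypothetical raw values mixing the formats seen in this thread.
raw <- c("6/1/14", "7/1/14", "1-Aug-16", "Jan-14", NA)

parsed <- as.Date(raw, format = "%m/%d/%y")

# Values that were present but could not be parsed with this format.
failed <- is.na(parsed) & !is.na(raw)
print(unique(raw[failed]))

# If the column is a factor, inspect as.character(x), not the integer codes.
str(parsed)
```

Month abbreviations such as "Aug" are matched against the current locale, so formats using %b can also give NA on a system that is not set to English.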




Re: Comparing 2 dale columns

Patrick Casimir
In reply to this post by PIKAL Petr

Thanks. But when I apply your code I get all NA instead of TRUE and FALSE.

________________________________
From: PIKAL Petr <[hidden email]>
Sent: Wednesday, August 23, 2017 11:20:00 AM
To: Patrick Casimir; [hidden email]
Subject: RE: Comparing 2 dale columns

Hi

your code is wrong.

I get
> test<-read.table("clipboard", header=T)
> str(test)
'data.frame':   2 obs. of  2 variables:
 $ COL1: Factor w/ 2 levels "6/1/14","7/1/14": 1 2
 $ COL2: Factor w/ 1 level "5/1/15": 1 1
> test$COL2<- as.Date(as.character(test$COL2, format="%y/%m/%d"))
> test$COL1<- as.Date(as.character(test$COL1, format="%y/%m/%d"))
                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^
incorrect parentheses position, wrong y,m,d

Using correct syntax I get correct result.

> test$COL2<- as.Date(test$COL2, format="%d/%m/%y")
> test$COL1<- as.Date(test$COL1, format="%d/%m/%y")
>
> test$COL2 > test$COL1
[1] TRUE TRUE
> test
        COL1       COL2
1 2014-01-06 2015-01-05
2 2014-01-07 2015-01-05
>

Cheers
Petr
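
A short sketch of why the parenthesis position matters, assuming factor columns as in the str() output above. The vector `x` is hypothetical. With format= inside as.character() the argument is ignored, so as.Date() falls back to its default formats ("%Y-%m-%d", then "%Y/%m/%d"), and in current R versions "6/1/14" is then expected to be read as year 6, month 1, day 14.

```r
# Hypothetical factor column, as read.table() produced in the thread
x <- factor(c("6/1/14", "7/1/14"))

# format= goes to as.character(), which ignores it
s <- as.character(x, format = "%y/%m/%d")
identical(s, c("6/1/14", "7/1/14"))

# as.Date() gets no format and tries its defaults,
# so the leading number is taken as the year
wrong <- as.Date(s)
wrong

# format= passed to as.Date() itself, read as day/month/two-digit year
right <- as.Date(x, format = "%d/%m/%y")
right
```

Whether "%d/%m/%y" or "%m/%d/%y" is the right order depends on how the dates were recorded; the thread uses both.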

> -----Original Message-----
> From: R-help [mailto:[hidden email]] On Behalf Of Patrick
> Casimir
> Sent: Wednesday, August 23, 2017 4:54 PM
> To: [hidden email]
> Subject: [R] Comparing 2 dale columns
>
> Dear R fellows,
>
>
> I created a new column Date_flag to compare the dates of COL1 and COL2
> using the code below. But it showed that 5/1/15 is greater than 6/1/2014 and
> 5/1/2015 greater than
> 7/1/2014 despite the year is greater. How do I fix that? I did try to format as
> %y/%m/%d
>
>  but it does not fix that.
>
>
> data$Date_Flag <- ifelse(data$COL2 > data$COL1, 0,1)
>
>
> COL1       COL2
> 6/1/14     5/1/15
> 7/1/14     5/1/15
>
>
> data$COL2<- as.Date(as.character(data$COL2, format="%y/%m/%d"))
> data$COL1<- as.Date(as.character(data$COL1, format="%y/%m/%d"))
>
>

Re: Comparing 2 dale columns

R help mailing list-2
> Thanks. But when I apply your codes I get all NA instead of TRUE and FALSE

You need to show a self-contained example of your problem, one that others
can copy and paste into their R sessions.  E.g.,

data <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
COL1       COL2
6/1/14     5/1/15
7/1/14     5/1/15")
data$COL1<- as.Date(data$COL1, format="%d/%m/%y")
data$COL2<- as.Date(data$COL2, format="%d/%m/%y")
str(data)
#'data.frame':   2 obs. of  2 variables:
# $ COL1: Date, format: "2014-01-06" "2014-01-07"
# $ COL2: Date, format: "2015-01-05" "2015-01-05"
transform(data, `Col2>Col1` = COL2 > COL1, `Col2-Col1` = COL2 - COL1,
check.names=FALSE)
#        COL1       COL2 Col2>Col1 Col2-Col1
#1 2014-01-06 2015-01-05      TRUE  364 days
#2 2014-01-07 2015-01-05      TRUE  363 days


Bill Dunlap
TIBCO Software
wdunlap tibco.com
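
One possible cause of an all-NA result, not confirmed anywhere in the thread, is a format string that does not match the data. A minimal sketch with a hypothetical vector `x`, showing how a mismatch yields NA and how that NA carries through the comparison:

```r
x <- c("6/1/14", "5/1/15")

ok  <- as.Date(x, format = "%d/%m/%y")   # separators match the data
bad <- as.Date(x, format = "%d-%m-%y")   # separators do not match, so NA

anyNA(ok)
is.na(bad)

# NA propagates through the comparison and through ifelse()
ifelse(bad > ok, 0, 1)
```

Running str() on the columns before and after conversion is a quick way to see which step produced the NA values.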


Re: Comparing 2 dale columns

PIKAL Petr
In reply to this post by Patrick Casimir
Hm.

I used the code from Mark's response and it works correctly. So either you messed up the data or you are not telling us the whole story.

data
> dput(data)
structure(list(COL1 = structure(c(16222, 16252), class = "Date"),
    COL2 = structure(c(16556, 16556), class = "Date"), Date_Flag = c(0,
    0)), .Names = c("COL1", "COL2", "Date_Flag"), row.names = c(NA,
-2L), class = "data.frame")

Instead of ifelse you can use

data$Date_Flag_calc <- (data$COL2 < data$COL1)*1
data
        COL1       COL2 Date_Flag Date_Flag_calc
1 2014-06-01 2015-05-01         0              0
2 2014-07-01 2015-05-01         0              0


What is the result of

str(data)

in your case?


Cheers
Petr
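
A small sketch comparing the ifelse() flag with the logical-arithmetic version, using a hypothetical data frame `d` that includes a row with equal dates. The two agree on the posted data, but `<` and "not `>`" differ when the dates are equal; `<=` reproduces the ifelse() result in this example.

```r
# Hypothetical data: COL2 later, equal, and earlier than COL1
d <- data.frame(
  COL1 = as.Date(c("2014-06-01", "2015-05-01", "2016-01-01")),
  COL2 = as.Date(c("2015-05-01", "2015-05-01", "2015-05-01"))
)

flag_ifelse <- ifelse(d$COL2 > d$COL1, 0, 1)  # 1 unless COL2 is later
flag_lt     <- (d$COL2 <  d$COL1) * 1         # 1 only if COL2 is earlier
flag_le     <- (d$COL2 <= d$COL1) * 1         # 1 if COL2 is earlier or equal

cbind(d, flag_ifelse, flag_lt, flag_le)
```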
