[bug] in cut.POSIXt(..., breaks = <numeric>)

Xianghui Dong
The exact error was reported before in Bug 14288
<https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14288>, "bug in
cut.POSIXt(..., breaks = <numeric>) and cut.Date". But the fix in that bug
report only covered the simplest case.

This is the error I ran into:
-----------------------------

x <- structure(c(1057067700, 1057215720, 1060597800, 1061470800, 1061911680,
                 1062048000, 1062137880, 1064479440, 1064926380, 1064995140,
                 1066822800, 1068033720, 1070869740, 1070939820, 1071030540,
                 1074244560, 1077545880, 1078449720, 1084955460, 1129020000,
                 1130324280, 1130404800, 1131519420, 1132640100, 1133772000,
                 1137567960, 1138952640, 1141810380, 1147444200, 1161643440,
                 1164086160),
               class = c("POSIXct", "POSIXt"), tzone = "UTC")

> cut(x, 20)
Error in `levels<-.factor`(`*tmp*`, value = as.character(if
(is.numeric(breaks)) x[!duplicated(res)] else breaks[-length(breaks)])) :
  number of levels differs
-----------------------------

The cause of the bug is that the input has widely spread date-time values,
so only 10 of the 20 intervals actually contain any values.
-------------------

cut_n <- cut(as.numeric(x), 20)

> unique(cut_n)
 [1] (1.057e+09,1.062e+09] (1.062e+09,1.068e+09] (1.068e+09,1.073e+09]
(1.073e+09,1.078e+09]
 [5] (1.084e+09,1.089e+09] (1.127e+09,1.132e+09] (1.132e+09,1.137e+09]
(1.137e+09,1.143e+09]
 [9] (1.143e+09,1.148e+09] (1.159e+09,1.164e+09]
20 Levels: (1.057e+09,1.062e+09] (1.062e+09,1.068e+09]
(1.068e+09,1.073e+09] ... (1.159e+09,1.164e+09]
------------------------
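The mismatch can be checked directly by comparing the number of levels with
the number of intervals that are actually occupied. A minimal sketch, reusing
the vector from above (the variable names are only for illustration):

```r
x <- structure(c(1057067700, 1057215720, 1060597800, 1061470800, 1061911680,
                 1062048000, 1062137880, 1064479440, 1064926380, 1064995140,
                 1066822800, 1068033720, 1070869740, 1070939820, 1071030540,
                 1074244560, 1077545880, 1078449720, 1084955460, 1129020000,
                 1130324280, 1130404800, 1131519420, 1132640100, 1133772000,
                 1137567960, 1138952640, 1141810380, 1147444200, 1161643440,
                 1164086160),
               class = c("POSIXct", "POSIXt"), tzone = "UTC")

cut_n <- cut(as.numeric(x), 20)

n_levels   <- nlevels(cut_n)          # intervals defined by cut()
n_occupied <- sum(!duplicated(cut_n)) # intervals that contain a value

# x[!duplicated(res)] in cut.POSIXt has n_occupied elements, but it is
# assigned to a factor with n_levels levels.
c(levels = n_levels, occupied = n_occupied)
```

Whenever the two counts differ, the replacement labels have the wrong length,
which matches the "number of levels differs" error.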
To get proper labels for all 20 intervals, the breaks need to be formatted
from numbers to date-time strings. The current code doesn't really convert
the breaks; instead it just uses the original date-time values from the input
data (x[!duplicated(res)]). This will not work if the interval boundaries
don't happen to correspond one-to-one with the input values. For an even
simpler example from the original bug report:
-----------------------
x <- seq(as.POSIXct("2000-01-01"), by = "days", length = 20)
> cut(x, breaks = 30)
Error in `levels<-.factor`(`*tmp*`, value = as.character(if
(is.numeric(breaks)) x[!duplicated(res)] else breaks[-length(breaks)])) :
  number of levels differs
---------------------

I think fixing the bug will require either
- getting the actual numeric values of the breaks from "cut" (modifying "cut"
if needed), then converting the numeric values back to date-time,
- or using a regex to extract the break values from the labels, then
converting them to date-time.
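
A rough sketch of the second (regex) option, only to illustrate the idea and
not a proposed patch. It assumes UTC, and that dig.lab = 12 keeps enough
digits in the labels to recover each break to within a second; both are my
assumptions for this example:

```r
x <- seq(as.POSIXct("2000-01-01", tz = "UTC"), by = "days", length.out = 20)

# Cut the underlying numbers; labels look like "(lower,upper]".
res <- cut(as.numeric(x), breaks = 30, dig.lab = 12)

# Capture the lower bound of every interval label, occupied or not.
lower <- as.numeric(sub("^\\(([^,]+),.*$", "\\1", levels(res)))

# Convert the bounds back to date-time and use them as the labels.
levels(res) <- format(as.POSIXct(lower, origin = "1970-01-01", tz = "UTC"))

nlevels(res)
```

This round-trips the breaks through text, so it loses precision and depends on
the label format; the first option avoids both problems.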

Best,
Xianghui Dong

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [bug] in cut.POSIXt(..., breaks = <numeric>)

Xianghui Dong
This is a simple fix. I just extracted the part of cut.R that calculates the
breaks from a number, then convert the breaks to date-times and provide them
manually to cut again. I used lubridate's as_datetime because it's simpler.
Of course it can be replaced with as.POSIXct.

The breaks are always formatted in one way, but the user can format them any
way he/she wants by just using divide. I find the return value of divide
often very useful, so it's worth extracting as an individual function.

------------------------------------
# Focused on one case: cut x into a given number of intervals.
library(lubridate)  # provides as_datetime()

# Divide x into interval_count intervals. Taken from
# https://github.com/wch/r-source/blob/trunk/src/library/base/R/cut.R
divide <-
  function (x, interval_count)
  {
      if (is.na(interval_count) || interval_count < 2L)
        stop("invalid number of intervals")
      nb <- as.integer(interval_count + 1) # one more than #{intervals}
      dx <- diff(rx <- range(x, na.rm = TRUE))
      if(dx == 0) {
        dx <- abs(rx[1L])
        breaks <- seq.int(rx[1L] - dx/1000, rx[2L] + dx/1000,
                          length.out = nb)
      } else {
        breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
        breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)
      }
    return(breaks)
  }

cut_date_time <- function(x, interval_count) {
  brks <- divide(as.numeric(x), interval_count)
  return(cut(x, as_datetime(brks)))
}

divide_date_time <- function(x, interval_count) {
  return(as_datetime(divide(as.numeric(x), interval_count)))
}
--------------------
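
For reference, a base-R-only sketch of the same idea, with as.POSIXct in place
of as_datetime, plus one way to supply custom labels. numeric_breaks is a
hypothetical helper that copies only the non-degenerate branch of divide()
above (it assumes x has more than one distinct value), and the time zone is
fixed to UTC for the example:

```r
# Hypothetical helper: non-degenerate branch of divide() above.
numeric_breaks <- function(x, interval_count) {
  nb <- as.integer(interval_count + 1)  # one more break than intervals
  rx <- range(x, na.rm = TRUE)
  dx <- diff(rx)
  breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
  # Widen the outer breaks slightly so the extremes fall inside.
  breaks[c(1L, nb)] <- c(rx[1L] - dx / 1000, rx[2L] + dx / 1000)
  breaks
}

x <- seq(as.POSIXct("2000-01-01", tz = "UTC"), by = "days", length.out = 20)

# Convert the numeric breaks back to date-time with base R only.
brks <- as.POSIXct(numeric_breaks(as.numeric(x), 30),
                   origin = "1970-01-01", tz = "UTC")

# Default labels: one per interval, including the empty ones.
res <- cut(x, brks)

# Custom labels: format the lower bound of each interval as desired.
res_custom <- cut(x, brks,
                  labels = format(brks[-length(brks)],
                                  format = "%Y-%m-%d %H:%M"))

nlevels(res)
nlevels(res_custom)
```

Because the breaks are passed as date-times, cut.POSIXt labels the levels from
the breaks themselves rather than from the input values.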

Best,
Xianghui Dong

On Thu, Apr 6, 2017 at 3:37 PM, Xianghui Dong <[hidden email]> wrote:

> [...]
