jitter-bug? problematic behaviour of the jitter function

Martin Keller-Ressel-2
Dear all,

I have noticed some strange behaviour in the "jitter" function in R.
The help page for jitter states that

"The result, say r, is r <- x + runif(n, -a, a) where n <- length(x) and a is the amount argument (if specified)."

and

"If amount is NULL (default), we set a <- factor * d/5 where d is the smallest difference between adjacent unique (apart from fuzz) x values."

This works fine as long as there is no (very) large outlier:

> jitter(c(1,2,10^4))  # desired behaviour
[1]    1.083243    1.851571 9999.942716

But for very large outliers the added noise suddenly 'jumps' to a much larger scale:

> jitter(c(1,2,10^5)) # bad behaviour
[1] -19535.649   9578.702 115693.854
# Noise should be of order (2-1)/5  = 0.2 but is of much larger order.

This probably does not matter much when jitter is used for plotting, but it can cause problems when jitter is used to break ties.
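
The jump in scale can be checked without relying on a single random draw by measuring the largest perturbation over many calls. This is only a sketch: the helper name max_noise and the number of replications are illustrative choices, not part of the original report.

```r
# Largest absolute perturbation that jitter() applies to x over `reps` calls.
# max_noise is a hypothetical helper used only for this illustration.
max_noise <- function(x, reps = 1000) {
  max(replicate(reps, max(abs(jitter(x) - x))))
}

set.seed(1)
small <- max_noise(c(1, 2, 10^4))  # bounded by (2 - 1)/5 = 0.2
large <- max_noise(c(1, 2, 10^5))  # several orders of magnitude larger
c(small = small, large = large)
```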

best regards,
Martin

--------------------------------
Martin Keller-Ressel
Professor für Stochastische Analysis und Finanzmathematik
Technische Universität Dresden
Institut für Mathematische Stochastik
Willersbau B 316, Zellescher Weg 12-14
01062 Dresden
--------------------------------


______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: jitter-bug? problematic behaviour of the jitter function

Duncan Murdoch-2
On 23/09/2020 6:32 a.m., Martin Keller-Ressel wrote:

>> jitter(c(1,2,10^4))  # desired behaviour
> [1]    1.083243    1.851571 9999.942716
>
> But for very large outliers the added noise suddenly 'jumps' to a much larger scale:
>
>> jitter(c(1,2,10^5)) # bad behaviour
> [1] -19535.649   9578.702 115693.854
> # Noise should be of order (2-1)/5  = 0.2 but is of much larger order.

I think this is kind of documented:  "apart from fuzz" is what counts.
If you look at the code for jitter, you'll see this important line:

  d <- diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))

By the time you get here, z is the length of the range of the data, so
it's 99999 in your example, and the rounding is done to
3 - floor(log10(99999)) = -1 digits.  That changes your values to
0, 0, 1e5, so the smallest difference is 1e5.
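
The computation described above can be traced one step at a time. The sketch below only unpacks the quoted line for the example vector; the intermediate variable names (digits, rounded, amount) are chosen here for readability and are not taken from the jitter source.

```r
x <- c(1, 2, 10^5)
z <- diff(range(x))              # length of the range: 99999
digits <- 3 - floor(log10(z))    # 3 - 4 = -1, i.e. round to the nearest 10
rounded <- round(x, digits)      # 0, 0, 1e5
xx <- unique(sort.int(rounded))  # 0, 1e5: the values 1 and 2 have merged
d <- min(diff(xx))               # 1e5
amount <- d / 5                  # 20000, with the default factor = 1
```

The noise is then drawn from runif(n, -amount, amount), which matches the scale of the output in the original report.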

Duncan Murdoch

Re: jitter-bug? problematic behaviour of the jitter function

Rui Barradas
Hello,

I believe that Duncan's explanation is right, but it does not explain
the value of the digits argument. round makes the first 2 numbers 0, but
why? The function below prints the digits argument and then outputs d.
The code is taken from jitter.


f <- function(x){
   z <- diff(r <- range(x[is.finite(x)]))
   cat("digits:", 3 - floor(log10(z)), "\n")
   diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))
}


Now see what cat outputs for 'digits'.


f(c(1,2,10^4))  # desired behaviour
#digits: 0
#[1]    1 9998
f(c(0,1,10^4))  # bad behaviour
#digits: -1
#[1] 10000
f(c(-1,0,10^4))  # bad behaviour
#digits: -1
#[1] 10000
f(c(1,2,10^5))  # bad behaviour
#digits: -1
#[1] 1e+05



And according to the documentation of ?round, negative digits are allowed:


Rounding to a negative number of digits means rounding to a power of
ten, so for example round(x, digits = -2) rounds to the nearest hundred.


But in this case two of the numbers are closer to 0 than they are to 10,
so both round to 0. unique then keeps only 0 and the largest value, and
the resulting diff is big.


round(c(1,2,10^4),0)  # desired behaviour
#[1]     1     2 10000
round(c(0,1,10^4),-1)  # bad behaviour
#[1]     0     0 10000
round(c(-1,0,10^4),-1)  # bad behaviour
#[1]     0     0 10000
round(c(1,2,10^5),-1)  # bad behaviour
#[1] 0e+00 0e+00 1e+05



Isn't it still a bug?

Rui Barradas


Re: jitter-bug? problematic behaviour of the jitter function

Duncan Murdoch-2
On 23/09/2020 4:03 p.m., Rui Barradas wrote:
> Hello,
>
> I believe that though Duncan's explanation is right it is also not
> explaining the value of the digits argument. round makes the first 2
> numbers 0 but why?

If the values had been computed with rounding error, you might see a
difference like 1e-15 between values that should be equal.  You wouldn't
want to use that for the scale of jittering, so some rounding is needed.

I think the documentation for the function is poor, but the intention
was probably to use the function in graphics (as the references did),
and in that case, any values too close together should be treated as
equal and jittering should separate them.  The particular computation
used says that if the range is in [1, 10), values that agree to 3
decimal places are too close and need separation.
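
As a rough illustration of that rule, the number of rounding digits can be tabulated against the range of the data. The helper name fuzz_digits is hypothetical and simply wraps the expression from the quoted line of jitter.

```r
# Rounding digits that the quoted line derives from the range z of the data.
# fuzz_digits is a hypothetical name used only for this illustration.
fuzz_digits <- function(z) 3 - floor(log10(z))

digits_by_range <- fuzz_digits(c(5, 50, 9999, 99999))
digits_by_range  # 3, 2, 0, -1

# With a range in [1, 10), values that agree to 3 decimal places merge:
x <- c(1.0001, 1.0004, 9)
merged <- unique(round(x, fuzz_digits(diff(range(x)))))
merged           # 1 and 9
```

Each additional power of ten in the range removes one digit, which is why a single large outlier is enough to push the rounding past the decimal point.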

So I don't think this is a bug, but it might be a valid wishlist item:
document what "apart from fuzz" means, and perhaps allow it to be
controlled by the user.
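
One way such a user-controlled version might look is sketched below. This is an assumption about a possible interface, not an existing or proposed one: default_amount and its digits argument are hypothetical, and the function only mirrors the quoted line so that the result can be passed to jitter through its existing amount argument.

```r
# default_amount is a hypothetical helper, not part of base R. It mirrors the
# quoted line from jitter() but lets the caller choose the rounding digits.
default_amount <- function(x, factor = 1, digits = NULL) {
  x <- x[is.finite(x)]
  z <- diff(range(x))
  if (z == 0) stop("need at least two distinct finite values")
  if (is.null(digits)) digits <- 3 - floor(log10(z))  # the rule quoted above
  d <- diff(unique(sort.int(round(x, digits))))
  if (length(d) == 0L) stop("all values are equal after rounding")
  factor / 5 * min(d)
}

x <- c(1, 2, 10^5)
a_default <- default_amount(x)              # 20000, as in the report
a_fine    <- default_amount(x, digits = 3)  # 0.2, the scale Martin expected
y <- jitter(x, amount = a_fine)
```

Supplying amount explicitly bypasses the fuzz computation inside jitter, so this works with the function as it stands.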

Duncan Murdoch



Re: jitter-bug? problematic behaviour of the jitter function

Rui Barradas
Hello,

Thanks for the further explanation.
I agree that it would be a good idea to document more clearly that
"apart from fuzz" refers to a rounding operation. It is only mentioned in
passing, and its meaning is not clear.

Rui Barradas

Re: jitter-bug? problematic behaviour of the jitter function

Martin Keller-Ressel-2
Dear Duncan, Dear Rui,

Thanks for the responses and for pointing out that it is the 'fuzz' part that is causing the problem. I agree that this is not a bug, but it could be undesirable or surprising behaviour, since it causes a large 'discontinuity' in the jitter function's output depending on the input data.

I was (ab?)using the jitter function to break ties, where the desired behaviour would be to add noise just small enough to make all values unique. (Such a function can easily be hand-coded, of course.)
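
A minimal sketch of such a hand-coded tie breaker follows. It is one possible approach under stated assumptions, not the code Martin used: break_ties and its fraction argument are hypothetical names, and the gaps are taken from the exact values with no rounding.

```r
# break_ties is a hypothetical helper, not part of base R. It adds uniform
# noise smaller than half the smallest gap between distinct values, so tied
# values become distinct while the order of distinct values is preserved.
break_ties <- function(x, fraction = 0.2) {
  stopifnot(is.numeric(x), fraction > 0, fraction < 0.5)
  gaps <- diff(sort(unique(x)))            # exact gaps, no rounding
  d <- if (length(gaps)) min(gaps) else 1  # arbitrary scale if all values tie
  x + runif(length(x), -fraction * d, fraction * d)
}

set.seed(42)
x <- c(1, 1, 2, 2, 10^5)
y <- break_ties(x)
anyDuplicated(y)  # 0: no ties remain
```

Because no rounding is applied, values that differ only by numerical error (say 1e-15) would set the noise scale, which is the situation the fuzz in jitter is meant to avoid. The noise can also be lost entirely when it is smaller than the floating-point spacing around a very large value.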

best regards,
Martin

Re: jitter-bug? problematic behaviour of the jitter function

Koenker, Roger W
FWIW, there is a similar function called “dither” in the quantreg package.

Re: jitter-bug? problematic behaviour of the jitter function

Bert Gunter-2
Folks: Please note:

There is *no* way to "jitter" the 3 values 1,2, and 1e5 so that:

a) the jittered values differ from the original ones by a fraction of their
original value;
b) the plotting symbols for the jittered values will be distinguishable on
a linear scale holding all 3 values.
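
The arithmetic behind this point can be made concrete. The plot width of 1000 pixels is an assumption chosen purely for illustration; the jitter amount of 0.2 is the value expected in the original report.

```r
axis_pixels <- 1000                          # assumed plot width, for illustration
units_per_pixel <- (10^5 - 1) / axis_pixels  # about 100 data units per pixel
shift_in_pixels <- 0.2 / units_per_pixel     # about 0.002 pixels
c(units_per_pixel = units_per_pixel, shift_in_pixels = shift_in_pixels)
```

A displacement of a small fraction of a pixel cannot be seen, so on this scale the points at 1 and 2 overlap whether or not they are jittered by 0.2.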

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Thu, Sep 24, 2020 at 8:39 AM Martin Keller-Ressel <
[hidden email]> wrote:

> Dear Duncan, Dear Rui,
>
> thanks for the responses and for pointing out that it is the ‚fuzz‘ part
> that is causing the problem. I agree that this is not a bug, but could be
> undesirable/surprising behaviour, since it causes a large ‚discontinuity‘
> in the jitter functions output depending on the input data.
>
> I was (ab?)using the jitter function to break ties, where the desired
> behaviour would be to add noise just small enough to make all values
> unique. (Such a function can easily be hand coded of course.)
>
> best regards,
> Martin
>
> Am 23.09.2020 um 22:25 schrieb Duncan Murdoch <[hidden email]
> <mailto:[hidden email]>>:
>
> On 23/09/2020 4:03 p.m., Rui Barradas wrote:
> Hello,
> I believe that though Duncan's explanation is right it is also not
> explaining the value of the digits argument. round makes the first 2
> numbers 0 but why?
>
> If there had been rounding in their computation, you might see a
> difference like 1e-15.  You wouldn't want to use that for the scale of
> jittering, so some rounding is needed.
>
> I think the documentation for the function is poor, but the intention was
> probably to use the function in graphics (as the references did), and in
> that case, any values too close together should be treated as equal and
> jittering should separate them.  The particular computation used says that
> if the range is in [1, 10), values equal to 3 decimal places will be too
> close and need separation.
>
> So I don't think this is a bug, but it might be a valid wishlist item:
> document what "apart from fuzz" means, and perhaps allow it to be
> controlled by the user.
>
> Duncan Murdoch
>
>
>
> The function below prints the digits argument and
> then outputs d. The code is taken from jitter.
> f <- function(x){
>    z <- diff(r <- range(x[is.finite(x)]))
>    cat("digits:", 3 - floor(log10(z)), "\n")
>    diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))
> }
> Now see what cat outputs for 'digits'.
> f(c(1,2,10^4))  # desired behaviour
> #digits: 0
> #[1]    1 9998
> f(c(0,1,10^4))  # bad behaviour
> #digits: -1
> #[1] 10000
> f(c(-1,0,10^4))  # bad behaviour
> #digits: -1
> #[1] 10000
> f(c(1,2,10^5))  # bad behaviour
> #digits: -1
> #[1] 1e+05
> And according to the documentation of ?round, negative digits are allowed:
> Rounding to a negative number of digits means rounding to a power of
> ten, so for example round(x, digits = -2) rounds to the nearest hundred.
> But in this case two of the numbers are closer to 0 than they are of 10.
> And unique keeps only 0 and the largest, then diff is big.
> round(c(1,2,10^4),0)  # desired behaviour
> #[1]     1     2 10000
> round(c(0,1,10^4),-1)  # bad behaviour
> #[1]     0     0 10000
> round(c(-1,0,10^4),-1)  # bad behaviour
> #[1]     0     0 10000
> round(c(1,2,10^5),-1)  # bad behaviour
> #[1] 0e+00 0e+00 1e+05
> Isn't it still a bug?
> Rui Barradas
> Às 15:57 de 23/09/20, Duncan Murdoch escreveu:
> On 23/09/2020 6:32 a.m., Martin Keller-Ressel wrote:
> Dear all,
>
> i have noticed some strange behaviour in the „jitter“ function in R.
> On the help page for jitter it is stated that
>
> "The result, say r, is r <- x + runif(n, -a, a) where n <- length(x)
> and a is the amount argument (if specified).“
>
> and
>
> "If amount is NULL (default), we set a <- factor * d/5 where d is the
> smallest difference between adjacent unique (apart from fuzz) x values.“
>
> This works fine as long as there is no (very) large outlier
>
> jitter(c(1,2,10^4))  # desired behaviour
> [1]    1.083243    1.851571 9999.942716
>
> But for very large outliers the added noise suddenly ‚jumps‘ to a much
> larger scale:
>
> jitter(c(1,2,10^5)) # bad behaviour
> [1] -19535.649   9578.702 115693.854
> # Noise should be of order (2-1)/5  = 0.2 but is of much larger order.
>
> This probably does not matter much when jitter is used for plotting,
> but it can cause problems when jitter is used to break ties.
>
> I think this is kind of documented:  "apart from fuzz" is what counts.
> If you look at the code for jitter, you'll see this important line:
>
>   d <- diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))
>
> By the time you get here, z is the length of the range of the data, so
> it's 99999 in your example.  The rounding changes your values to
> 0,0,1e5, so the smallest difference is 1e5.
>
> Duncan Murdoch
>
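
The rounding step Duncan describes above can be traced by hand. The following is a minimal sketch, assuming the quoted line from jitter() together with the documented default a <- factor * d/5; default_jitter_amount() is a hypothetical helper name used only for illustration, and it assumes finite input with a non-zero range and at least two distinct values after rounding.

```r
# Hypothetical helper (not part of base R): reproduces the default
# "amount" that the quoted line of jitter() leads to.
default_jitter_amount <- function(x, factor = 1) {
  z <- diff(range(x[is.finite(x)]))         # width of the data range
  digits <- 3 - floor(log10(z))             # rounding precision (the "fuzz")
  xx <- unique(sort.int(round(x, digits)))  # values still distinct after rounding
  d <- min(diff(xx))                        # smallest gap between them
  c(digits = digits, d = d, amount = factor * d / 5)
}

default_jitter_amount(c(1, 2, 10^4))  # digits = 0,  d = 1,   amount = 0.2
default_jitter_amount(c(1, 2, 10^5))  # digits = -1, d = 1e5, amount = 20000
```

An amount of 20000 is consistent with the jittered values reported in the original post, which all lie within 20000 of 1, 2 and 1e5.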

Re: jitter-bug? problematic behaviour of the jitter function

Duncan Murdoch-2
Those seem like useful properties if jitter() is used in plotting (as it
was originally intended), but that use isn't even mentioned in the help
page.  Martin wanted to "add a small amount of noise to a numeric
vector" "in order to break ties" (quoting from that help page).

For Martin's use, it sounds as though quantreg::dither might be a better
solution (though I think it won't work when numerical error splits ties,
so some differences are extremely small, if the scale of the values
varies too much, but I'd guess that's a fairly rare circumstance).

Duncan Murdoch
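
For the tie-breaking use discussed in this thread, a hand-coded alternative could look like the sketch below. It is not the code of jitter() or of quantreg::dither; break_ties() and its fraction argument are hypothetical names chosen for illustration. The idea is to take the smallest gap between distinct values without any rounding, and to add uniform noise that is a small fraction of that gap.

```r
# Hypothetical helper (not from base R or quantreg): adds uniform noise
# smaller than half the smallest gap between distinct values, so ties are
# broken without changing the order of values that were already distinct.
break_ties <- function(x, fraction = 0.1) {
  stopifnot(is.numeric(x), all(is.finite(x)), fraction > 0, fraction < 0.5)
  gaps <- diff(sort(unique(x)))
  # Smallest gap between distinct values; fall back to 1 if all values are equal.
  d <- if (length(gaps)) min(gaps) else 1
  x + runif(length(x), -fraction * d, fraction * d)
}

set.seed(1)
break_ties(c(1, 1, 2, 2, 10^5))  # noise stays within +/- 0.1 of each value
```

Because no rounding is applied, a pair of values that differ only by numerical error (say 1e-15) would set the noise scale, and noise that small may be lost entirely when added to large values. That is the same kind of rare circumstance mentioned above.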

On 24/09/2020 1:03 p.m., Bert Gunter wrote:

> Folks: Please note:
>
> There is *no* way to "jitter" the 3 values 1,2, and 1e5 so that:
>
> a) the jittered values differ from the original ones by a fraction of
> their original value;
> b) the plotting symbols for the jittered values will be distinguishable
> on a linear scale holding all 3 values.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Thu, Sep 24, 2020 at 8:39 AM Martin Keller-Ressel wrote:
>
>     Dear Duncan, Dear Rui,
>
>     thanks for the responses and for pointing out that it is the 'fuzz'
>     part that is causing the problem. I agree that this is not a bug,
>     but it could be undesirable/surprising behaviour, since it causes a
>     large 'discontinuity' in the jitter function's output depending on
>     the input data.
>
>     I was (ab?)using the jitter function to break ties, where the
>     desired behaviour would be to add noise just small enough to make
>     all values unique. (Such a function can easily be hand coded of course.)
>
>     best regards,
>     Martin
>
>     On 23.09.2020 at 22:25, Duncan Murdoch wrote:
>
>     On 23/09/2020 4:03 p.m., Rui Barradas wrote:
>     Hello,
>     I believe that though Duncan's explanation is right it is also not
>     explaining the value of the digits argument. round makes the first 2
>     numbers 0 but why?
>
>     If there had been rounding in their computation, you might see a
>     difference like 1e-15.  You wouldn't want to use that for the scale
>     of jittering, so some rounding is needed.
>
>     I think the documentation for the function is poor, but the
>     intention was probably to use the function in graphics (as the
>     references did), and in that case, any values too close together
>     should be treated as equal and jittering should separate them.  The
>     particular computation used says that if the range is in [1, 10),
>     values equal to 3 decimal places will be too close and need separation.
>
>     So I don't think this is a bug, but it might be a valid wishlist
>     item: document what "apart from fuzz" means, and perhaps allow it to
>     be controlled by the user.
>
>     Duncan Murdoch