'by' on a numeric column produces inconsistent output


Kevin Ushey
I'm cross-posting this from the GitHub mirror:
https://github.com/arunsrinivasan/datatable/issues/2

For reference, I only see this with the latest RForge version of
data.table (1.8.11), not the CRAN version of data.table.

-----

library(data.table, lib="/Users/kevinushey/Library/R/3.1/library")
set.seed(32)
n <- 3
dt <- data.table(
  y=rnorm(n),
  by=round( rnorm(n), 1)
)

dt[,
  list(max=max(y, na.rm=TRUE)),
  by=list(by)
]

dt[,
  list(max=max(y, na.rm=TRUE)),
  by=list(by)
]

produces the output

> dt[,
+   list(max=max(y, na.rm=TRUE)),
+   by=list(by)
+ ]
    by         max
1: 0.4  0.01464054
2: 0.4  0.87328871
3: 0.7 -1.02794620
>
> dt[,
+   list(max=max(y, na.rm=TRUE)),
+   by=list(by)
+ ]
    by        max
1: 0.4  0.8732887
2: 0.7 -1.0279462

For some reason, the first return is wrong, while the second (and all
subsequent) output is correct. Any idea what's going on?
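
One way to check for this kind of run-to-run inconsistency is to run the same grouped query twice and compare both the results and the input against a saved copy. The sketch below is only illustrative: it hardcodes the three rows as they are printed elsewhere in this thread instead of relying on `set.seed(32)`, and the names `snapshot`, `run1`, `run2` and `ref` are made up for the example.

```r
library(data.table)

# Values as printed in this thread for set.seed(32), n = 3.
dt <- data.table(
  y  = c(0.01464054, 0.87328871, -1.02794620),
  by = c(0.7, 0.4, 0.4)
)
snapshot <- copy(dt)  # deep copy, used to detect modification by reference

run1 <- dt[, list(max = max(y, na.rm = TRUE)), by = list(by)]
run2 <- dt[, list(max = max(y, na.rm = TRUE)), by = list(by)]

# Both runs should agree, and the query should leave the input untouched.
same_result     <- isTRUE(all.equal(run1, run2))
input_untouched <- isTRUE(all.equal(dt, snapshot))

# Independent reference computed with base R (groups sorted by value).
ref <- tapply(snapshot$y, snapshot$by, max)
matches_ref <- isTRUE(all.equal(run1$max[order(run1$by)], as.vector(ref)))
```

On a build that behaves correctly all three flags should be `TRUE`; a `FALSE` in `input_untouched` would point at the input being altered by the query itself.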

> sessionInfo()
R Under development (unstable) (2013-12-12 r64453)
Platform: x86_64-apple-darwin13.0.0 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.11    knitr_1.5            devtools_1.4.1.99
BiocInstaller_1.13.3

loaded via a namespace (and not attached):
 [1] compiler_3.1.0 digest_0.6.4   evaluate_0.5.1 formatR_0.10
httr_0.2       memoise_0.1
 [7] parallel_3.1.0 plyr_1.8       RCurl_1.95-4.1 reshape2_1.2.2
stringr_0.6.2  tools_3.1.0
[13] whisker_0.3-2

---

Kevin
_______________________________________________
datatable-help mailing list
[hidden email]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: 'by' on a numeric column produces inconsistent output

Michael Nelson
Using
data.table 1.8.11 (Fresh install from r-forge today)
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

I get

    by         max
1: 0.7  0.01464054
2: 0.4  0.87328871
3: 0.4 -1.02794620

On both runs.




________________________________________
From: [hidden email] [[hidden email]] on behalf of Kevin Ushey [[hidden email]]
Sent: Thursday, 19 December 2013 12:54 PM
To: [hidden email]
Subject: [datatable-help] 'by' on a numeric column produces inconsistent        output

<...snip...>

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
Not sure how to debug without being able to reproduce. Tried on Mac OS X 10.8.5 and Debian GNU/Linux 7 (wheezy). I don't have access to a Windows machine. It consistently gives me this:

> dt[,
+    list(max=max(y, na.rm=TRUE)),
+    by=list(by)
+    ]
    by        max
1: 0.7 0.01464054
2: 0.4 0.87328871
> dt[,
+    list(max=max(y, na.rm=TRUE)),
+    by=list(by)
+    ]
    by        max
1: 0.7 0.01464054
2: 0.4 0.87328871

Can either of you provide me with the output of these steps in cases where there's an error? I've commented the output I get for each step. 

byval <- list(by=dt$by) 
o__ <- data.table:::fastorder(byval) # 2,3,1
f__ = data.table:::uniqlist(byval, order=o__) # 1,3
len__ = data.table:::uniqlengths(f__, nrow(dt)) # 2,1
firstofeachgroup = o__[f__] # 2,1
origorder = data.table:::iradixorder(firstofeachgroup) # 2,1
f__ = f__[origorder] # 3,1
len__ = len__[origorder] # 1,2
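
For anyone without the 1.8.11 internals at hand, the bookkeeping in these steps can be approximated in base R. This is a sketch under assumptions, not data.table's actual code: `order()` stands in for both `fastorder` and `iradixorder`, group boundaries are found with an exact `!=` comparison, and all variable names are hypothetical.

```r
by <- c(0.7, 0.4, 0.4)  # the 'by' column as printed in this thread

o   <- order(by)        # row indices in sorted order
srt <- by[o]            # the grouping values, sorted

# Position (within the sorted order) where each group starts.
f <- which(c(TRUE, srt[-1] != srt[-length(srt)]))

# Number of rows in each group.
len <- diff(c(f, length(by) + 1L))

# Original row number of the first member of each group.
first_of_each_group <- o[f]

# Reorder the groups by first appearance in the original data.
orig_order <- order(first_of_each_group)
f   <- f[orig_order]
len <- len[orig_order]

# One row per group, in first-appearance order.
groups <- data.frame(by = srt[f], n = len)
```

With the three rows from the thread this yields two groups, 0.7 with one row and then 0.4 with two rows, which is the shape of the correct output.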


Arun

On Thursday, December 19, 2013 at 3:50 AM, Michael Nelson wrote:

<...snip...>

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
In reply to this post by Kevin Ushey
Kevin, your output looks sorted by the "by" column, which shouldn't happen either. So I would consider even the second output wrong, unless you're setting a key on "by".
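
To illustrate the ordering point, here is a small sketch using the thread's three rows (hardcoded as an assumption, in place of `set.seed(32)`). It contrasts `by=`, which returns groups in order of first appearance, with `keyby=`, which returns them sorted and keyed. The names `unsorted` and `sorted` are made up for the example.

```r
library(data.table)

dt <- data.table(
  y  = c(0.01464054, 0.87328871, -1.02794620),
  by = c(0.7, 0.4, 0.4)
)

# by=: groups come back in order of first appearance (0.7, then 0.4).
unsorted <- dt[, list(max = max(y, na.rm = TRUE)), by = list(by)]

# keyby=: groups come back sorted by the grouping column (0.4, then 0.7),
# and the result is keyed on that column.
sorted <- dt[, list(max = max(y, na.rm = TRUE)), keyby = list(by)]
```

So a sorted result from a plain `by=` on an unkeyed table is itself a hint that something has gone wrong.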

Arun

On Thursday, December 19, 2013 at 2:54 AM, Kevin Ushey wrote:

<...snip...>

Re: 'by' on a numeric column produces inconsistent output

Kevin Ushey
In reply to this post by Arunkumar Srinivasan
Hi Arun,

Here's the output on my machine -- other information missing from
before; it's with OSX Mavericks, with R and data.table compiled with
Apple clang.

---

> library(data.table, lib="/Users/kevinushey/Library/R/3.1/library")
> set.seed(32)
> n <- 3
> dt <- data.table(
+   y=rnorm(n),
+   by=round( rnorm(n), 1)
+ )
>
## run one
> byval <- list(by=dt$by)
> (o__ <- data.table:::fastorder(byval)) # 2,3,1
[1] 2 3 1
> (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
[1] 1 2 3
> (len__ = data.table:::uniqlengths(f__, nrow(dt))) # 2,1
[1] 1 1 1
> (firstofeachgroup = o__[f__]) # 2,1
[1] 2 3 1
> (origorder = data.table:::iradixorder(firstofeachgroup)) # 2,1
[1] 3 1 2
> (f__ = f__[origorder]) # 3,1
[1] 3 1 2
> (len__ = len__[origorder]) # 2,1
[1] 1 1 1

## run two
> (o__ <- data.table:::fastorder(byval)) # 2,3,1
[1] 1 2 3
> (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
[1] 1 3
> (len__ = data.table:::uniqlengths(f__, nrow(dt))) # 2,1
[1] 2 1
> (firstofeachgroup = o__[f__]) # 2,1
[1] 1 3
> (origorder = data.table:::iradixorder(firstofeachgroup)) # 2,1
[1] 1 2
> (f__ = f__[origorder]) # 3,1
[1] 1 3
> (len__ = len__[origorder]) # 2,1
[1] 2 1

On Wed, Dec 18, 2013 at 11:22 PM, Arunkumar Srinivasan
<[hidden email]> wrote:

> <...snip...>

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
Aha, the issue seems to be with 'uniqlist'. Not sure why
(f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
gives 1,2,3 for you and 1,3 consistently for me. I'll revert this back to `duplist` for now. Not sure how to solve this though. I've tried it so far on 3 machines:

1) OS X 10.8.5 + llvm (gcc)
2) OS X Mavericks + Clang
3) Debian Wheezy + gcc

All of them give consistent output. Man this is such a drag.

Arun

On Thursday, December 19, 2013 at 8:37 AM, Kevin Ushey wrote:

<...snip...>

Re: 'by' on a numeric column produces inconsistent output

szehnder@uni-bonn.de
Arun,

if you could send me the reproducible code in copyable form I can as well try it on Mac OS X Mavericks with gcc 4.8.

Best

Simon

On 19 Dec 2013, at 08:44, Arunkumar Srinivasan <[hidden email]> wrote:

> <...snip...>

Re: 'by' on a numeric column produces inconsistent output

Kevin Ushey
In reply to this post by Arunkumar Srinivasan
Hmm, I am seeing that after the data.table:::fastorder call, the dt
itself is modified. Notice that 'by' is rearranged without modifying
'y'.

> dt
             y  by
1:  0.01464054 0.7
2:  0.87328871 0.4
3: -1.02794620 0.4
> (o__ <- data.table:::fastorder(byval)) # 2,3,1
[1] 2 3 1
> dt
             y  by
1:  0.01464054 0.4
2:  0.87328871 0.4
3: -1.02794620 0.7
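
A side note on why an in-place sort can reach back into `dt`: wrapping a column in a list does not copy it, so `byval$by` and `dt$by` refer to the same vector. The sketch below checks this with data.table's exported `address()` and `copy()`. The variable names are hypothetical, and passing a copy is shown only as one possible guard, not as the fix that was actually committed.

```r
library(data.table)

dt <- data.table(
  y  = c(0.01464054, 0.87328871, -1.02794620),
  by = c(0.7, 0.4, 0.4)
)

# list() stores the column itself, not a duplicate of it.
byval_shared <- list(by = dt$by)
shared <- identical(address(byval_shared$by), address(dt$by))

# copy() duplicates the vector, so a routine that reorders its input
# in place would no longer touch the table's own column.
byval_copied <- list(by = copy(dt$by))
separate <- !identical(address(byval_copied$by), address(dt$by))
```

Here `shared` being `TRUE` means any C-level routine that reorders `byval_shared$by` in place also reorders `dt$by`, leaving `y` out of step with it.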

On Wed, Dec 18, 2013 at 11:44 PM, Arunkumar Srinivasan
<[hidden email]> wrote:

> <...snip...>

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
Ah, that explains it as well. So a copy is not being sent to fastorder, but that only happens the first time... I'll write again if there are more questions.

Thanks again Kevin.


Arun

On Thursday, December 19, 2013 at 8:55 AM, Kevin Ushey wrote:

<...snip...>

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
In reply to this post by szehnder@uni-bonn.de
Simon, sure.

set.seed(32)
n <- 3
dt <- data.table(
y=rnorm(n),
by=round( rnorm(n), 1)
)

dt[,
list(max=max(y, na.rm=TRUE)),
by=list(by)
]

dt[,
list(max=max(y, na.rm=TRUE)),
by=list(by)
]



Arun

On Thursday, December 19, 2013 at 8:49 AM, Simon Zehnder wrote:

<...snip...>

Re: 'by' on a numeric column produces inconsistent output

szehnder@uni-bonn.de
Hi Arun,

here are the results on Mac OS X Mavericks with gcc 4.8.2:

data.table 1.8.10:

> set.seed(32)
> n <- 3
> dt <- data.table(
+ y=rnorm(n),
+ by=round( rnorm(n), 1)
+ )
>
> dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
    by        max
1: 0.7 0.01464054
2: 0.4 0.87328871
>
> dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
    by        max
1: 0.7 0.01464054
2: 0.4 0.87328871

data.table 1.8.11:

> set.seed(32)
> n <- 3
> dt <- data.table(
+ y=rnorm(n),
+ by=round( rnorm(n), 1)
+ )
>
> dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
    by        max
1: 0.7 0.01464054
2: 0.4 0.87328871
>
> dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
    by        max
1: 0.7 0.01464054
2: 0.4 0.87328871

Best

Simon


On 19 Dec 2013, at 09:05, Arunkumar Srinivasan <[hidden email]> wrote:

> Simon, sure.
>
> set.seed(32)
> n <- 3
> dt <- data.table(
> y=rnorm(n),
> by=round( rnorm(n), 1)
> )
>
> dt[,
> list(max=max(y, na.rm=TRUE)),
> by=list(by)
> ]
>
> dt[,
> list(max=max(y, na.rm=TRUE)),
> by=list(by)
> ]
>
>
>
> Arun
>
> On Thursday, December 19, 2013 at 8:49 AM, Simon Zehnder wrote:
>
>> Arun,
>>
>> if you could send me the reproducible code in copyable form I can as well try it on Mac OS X Mavericks with gcc 4.8.
>>
>> Best
>>
>> Simon
>>
>> On 19 Dec 2013, at 08:44, Arunkumar Srinivasan <[hidden email]> wrote:
>>
>>> Aha, the issue seems to be with 'uniqlist', not sure why it gives
>>>> (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
>>> 1,2,3 for you and 1,3 consistently for me. I'll revert this back to `duplist` for now. Not sure how to solve this though. I've tried it so far on 3 machines:
>>>
>>> 1) OS X 10.8.5 + libvm (gcc)
>>> 2) OS X Mavericks + Clang
>>> 3) Debian Weezy + gcc
>>>
>>> All of them give consistent output. Man this is such a drag.
>>>
>>> Arun
>>>
>>> On Thursday, December 19, 2013 at 8:37 AM, Kevin Ushey wrote:
>>>
>>>> Hi Arun,
>>>>
>>>> Here's the output on my machine -- other information missing from
>>>> before; it's with OSX Mavericks, with R and data.table compiled with
>>>> Apple clang.
>>>>
>>>> ---
>>>>
>>>>> library(data.table, lib="/Users/kevinushey/Library/R/3.1/library")
>>>>> set.seed(32)
>>>>> n <- 3
>>>>> dt <- data.table(
>>>> + y=rnorm(n),
>>>> + by=round( rnorm(n), 1)
>>>> + )
>>>> ## run one
>>>>> byval <- list(by=dt$by)
>>>>> (o__ <- data.table:::fastorder(byval)) # 2,3,1
>>>> [1] 2 3 1
>>>>> (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
>>>> [1] 1 2 3
>>>>> (len__ = data.table:::uniqlengths(f__, nrow(dt))) # 2,1
>>>> [1] 1 1 1
>>>>> (firstofeachgroup = o__[f__]) # 2,1
>>>> [1] 2 3 1
>>>>> (origorder = data.table:::iradixorder(firstofeachgroup)) # 2,1
>>>> [1] 3 1 2
>>>>> (f__ = f__[origorder]) # 3,1
>>>> [1] 3 1 2
>>>>> (len__ = len__[origorder]) # 2,1
>>>> [1] 1 1 1
>>>>
>>>> ## run two
>>>>> (o__ <- data.table:::fastorder(byval)) # 2,3,1
>>>> [1] 1 2 3
>>>>> (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
>>>> [1] 1 3
>>>>> (len__ = data.table:::uniqlengths(f__, nrow(dt))) # 2,1
>>>> [1] 2 1
>>>>> (firstofeachgroup = o__[f__]) # 2,1
>>>> [1] 1 3
>>>>> (origorder = data.table:::iradixorder(firstofeachgroup)) # 2,1
>>>> [1] 1 2
>>>>> (f__ = f__[origorder]) # 3,1
>>>> [1] 1 3
>>>>> (len__ = len__[origorder]) # 2,1
>>>> [1] 2 1
>>>>
>>>> On Wed, Dec 18, 2013 at 11:22 PM, Arunkumar Srinivasan
>>>> <[hidden email]> wrote:
>>>>> Not sure how to debug without being able to reproduce. Tried on Mac OS X
>>>>> 10.8.5 and Debian GNU/Linux 7 (wheezy). I don't have access to a Windows
>>>>> machine. It consistently gives me this:
>>>>>
>>>>>> dt[,
>>>>> + list(max=max(y, na.rm=TRUE)),
>>>>> + by=list(by)
>>>>> + ]
>>>>> by max
>>>>> 1: 0.7 0.01464054
>>>>> 2: 0.4 0.87328871
>>>>>>
>>>>>> dt[,
>>>>> + list(max=max(y, na.rm=TRUE)),
>>>>> + by=list(by)
>>>>> + ]
>>>>> by max
>>>>> 1: 0.7 0.01464054
>>>>> 2: 0.4 0.87328871
>>>>>
>>>>> Can either of you provide me with the output of these steps in cases where
>>>>> there's an error? I've commented the output I get for each step.
>>>>>
>>>>> byval <- list(by=dt$by)
>>>>> o__ <- data.table:::fastorder(byval) # 2,3,1
>>>>> f__ = data.table:::uniqlist(byval, order=o__) # 1,3
>>>>> len__ = data.table:::uniqlengths(f__, nrow(dt)) # 2,1
>>>>> firstofeachgroup = o__[f__] # 2,1
>>>>> origorder = data.table:::iradixorder(firstofeachgroup) # 2,1
>>>>> f__ = f__[origorder] # 3,1
>>>>> len__ = len__[origorder] # 2,1
>>>>>
>>>>>
>>>>> Arun
>>>>>
>>>>> <...snip...>

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
In reply to this post by Michael Nelson
@mnel, I'm not sure I understand your output. It differs from the correct output, but it also differs from Kevin's. Basically, dt[, max(y), by=by] has no effect for you and just returns dt back?
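
For reference, the grouped maximum that the correct output corresponds to can be recomputed without data.table. This is a minimal base R sketch: the `y` and `by` values are typed in from the rounded printouts in this thread (assuming the three-row output is simply `dt` itself), and `grouped_max` is a hypothetical helper, not part of data.table.

```r
# Values copied from the printed (rounded) output in this thread; an assumption.
y  <- c(0.01464054, 0.87328871, -1.02794620)
by <- c(0.7, 0.4, 0.4)

# Hypothetical helper: max of y per distinct value of by,
# with groups listed in order of first appearance.
grouped_max <- function(y, by) {
  groups <- unique(by)
  data.frame(
    by  = groups,
    max = vapply(groups, function(g) max(y[by == g], na.rm = TRUE), numeric(1))
  )
}

res <- grouped_max(y, by)
print(res)
```

This gives two rows, (0.7, 0.01464054) and (0.4, 0.87328871), which matches the output reported as correct in the thread.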

Arun

On Thursday, December 19, 2013 at 3:50 AM, Michael Nelson wrote:

Using
data.table 1.8.11 (Fresh install from r-forge today)
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

I get

    by         max
1: 0.7  0.01464054
2: 0.4  0.87328871
3: 0.4 -1.02794620

On both runs.




________________________________________
From: [hidden email] [[hidden email]] on behalf of Kevin Ushey [[hidden email]]
Sent: Thursday, 19 December 2013 12:54 PM
Subject: [datatable-help] 'by' on a numeric column produces inconsistent output

I'm cross-posting this from the GitHub mirror:
https://github.com/arunsrinivasan/datatable/issues/2

For reference, I only see this with the latest RForge version of
data.table (1.8.11), not the CRAN version of data.table.

-----

library(data.table, lib="/Users/kevinushey/Library/R/3.1/library")
set.seed(32)
n <- 3
dt <- data.table(
y=rnorm(n),
by=round( rnorm(n), 1)
)

dt[,
list(max=max(y, na.rm=TRUE)),
by=list(by)
]

dt[,
list(max=max(y, na.rm=TRUE)),
by=list(by)
]

produces the output

dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
by max
1: 0.4 0.01464054
2: 0.4 0.87328871
3: 0.7 -1.02794620

dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
by max
1: 0.4 0.8732887
2: 0.7 -1.0279462

For some reason, the first return is wrong, while the second (and all
subsequent) output is correct. Any idea what's going on?

sessionInfo()
R Under development (unstable) (2013-12-12 r64453)
Platform: x86_64-apple-darwin13.0.0 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.8.11 knitr_1.5 devtools_1.4.1.99
BiocInstaller_1.13.3

loaded via a namespace (and not attached):
[1] compiler_3.1.0 digest_0.6.4 evaluate_0.5.1 formatR_0.10
httr_0.2 memoise_0.1
[7] parallel_3.1.0 plyr_1.8 RCurl_1.95-4.1 reshape2_1.2.2
stringr_0.6.2 tools_3.1.0
[13] whisker_0.3-2

---

Kevin

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
As I was just writing to Kevin, I think (if @mnel could verify that his output is correct) the reason is that Kevin is using R-devel...

If you run the code shown below, `xx` should *not* have the same address as `dt$by` (and for me it does not). For Kevin, however, they seem to point to the same location, and I can't figure out why they would or should, given how R has worked so far.
byval <- list(by=dt$by)
address(dt$by)
# [1] "0x7fa848ad8608"

address(byval)
# [1] "0x7fa84a93fa68"

xx = byval[[1L]]
address(xx)
# [1] "0x7fa848e3fc48"

address(list(xx))
# [1] "0x7fa84aa1ba78"

data.table:::dradixorder(xx)
# [1] 2 3 1

byval
# $by
# [1] 0.7 0.4 0.4

Arun

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
In reply to this post by szehnder@uni-bonn.de
Simon,

Thanks. One more data point in my favour :). I think we've narrowed the problem down to the R-devel version. I'll write again once I've discussed it with Kevin.

Arun

On Thursday, December 19, 2013 at 9:26 AM, Simon Zehnder wrote:

Hi Arun,

Here are the results on Mac OS X Mavericks with gcc 4.8.2:

data.table 1.8.10:

set.seed(32)
n <- 3
dt <- data.table(
+ y=rnorm(n),
+ by=round( rnorm(n), 1)
+ )

dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
by max
1: 0.7 0.01464054
2: 0.4 0.87328871

dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
by max
1: 0.7 0.01464054
2: 0.4 0.87328871

data.table 1.8.11:

set.seed(32)
n <- 3
dt <- data.table(
+ y=rnorm(n),
+ by=round( rnorm(n), 1)
+ )

dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
by max
1: 0.7 0.01464054
2: 0.4 0.87328871

dt[,
+ list(max=max(y, na.rm=TRUE)),
+ by=list(by)
+ ]
by max
1: 0.7 0.01464054
2: 0.4 0.87328871

Best

Simon


Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
Just tested this on the devel version (today's). And yes, this issue happens. But I'm not sure if this is an issue with 'data.table' per se:

On a clean session, if you do this:

require(data.table)
set.seed(32)
n <- 3
dt <- data.table(y=rnorm(n), by=round( rnorm(n), 1))

ll <- list(dt$by)
yy <- ll[[1L]]
address(dt$by) # [1] "0x7fad3c524a40"
address(ll[[1L]]) # [1] "0x7fad3c524a40"
address(yy) # [1] "0x7fad3c524a40"

You see that all three are pointing to the same address. That's why the result is wrong: internally, "yy" will be changed by reference during "fastorder". "yy" is *not* supposed to point to the same memory as dt$by; a copy should have been made.
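
The effect on grouping can be imitated in base R. This is only a sketch under assumptions: `group_starts` is a hypothetical stand-in for what `uniqlist` appears to compute, judging by the diagnostic output in this thread (positions where the ordered values change), and the in-place sort is simulated with `sort()` rather than done by reference.

```r
# Hypothetical stand-in for the group-start computation: positions in the
# ordered vector x[o] where the value differs from the previous one.
group_starts <- function(x, o) {
  xs <- x[o]
  which(c(TRUE, xs[-1L] != xs[-length(xs)]))
}

by_orig <- c(0.7, 0.4, 0.4)
o <- order(by_orig)          # ordering computed from the original values

# Intended case: the grouping vector is untouched.
ok <- group_starts(by_orig, o)

# Aliased case (simulated): the same memory was sorted in place while the
# ordering was computed, so the stale ordering is applied to sorted data.
by_clobbered <- sort(by_orig)
bad <- group_starts(by_clobbered, o)

print(o)
print(ok)
print(bad)
```

Here `o` is 2 3 1, `ok` is 1 3 and `bad` is 1 2 3, the same pattern as in the first run of the diagnostic steps reported earlier in the thread.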

After doing it the first time, the addresses go back to how they are in R-stable. Not sure if this is desirable. This should probably be reported on R-devel.

On R-3.0.2, the same commands as above on a clean session:

require(data.table)
set.seed(32)
n <- 3
dt <- data.table(y=rnorm(n), by=round( rnorm(n), 1))

ll <- list(dt$by)
yy <- ll[[1L]]
address(dt$by) # [1] "0x7fc35b640408"
address(ll[[1L]]) # [1] "0x7fc35a0ec838"
address(yy) # [1] "0x7fc35a0ec838"
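
At the R level, shared addresses like these are normally harmless, because R copies a vector before modifying it when the data is referenced from more than one place; only code that writes in place can expose the sharing. A small base R sketch of that guarantee (the variable names are illustrative and no data.table internals are involved):

```r
x  <- c(0.7, 0.4, 0.4)
ll <- list(x)
yy <- ll[[1L]]      # may or may not share memory with x, depending on R version

yy[1L] <- 99        # R-level modification: R copies first if the data is shared

print(x)            # x is unchanged
print(ll[[1L]])     # the list element is unchanged too
print(yy)           # only yy carries the new value
```

Routines that sort by reference bypass this protection, which is presumably why the sharing matters for "fastorder".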


Arun

Re: 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan
The issue has now been fixed in commit 1054. Once the R-Forge build kicks in, be sure to update, especially if you're working with the R-devel version.

Arun
