How to handle INT8 data

Nicolas Paris
Hello R users,

I have to deal with int8 (8-byte, i.e. 64-bit integer) data in R. AFAIK, R
only handles int4 (32-bit integers), via the `as.integer` function [1]. I wonder:
1. What is the best way to handle int8? `as.character`?
`as.numeric`?
2. Is there any plan to support int8 in the future? As you may know,
int4 is already too small to count the Earth's population.

Thanks for your ideas,

int8 examples:

     human_id      
----------------------
 -1311071933951566764
 -4708675461424073238
 -6865005668390999818
  5578000650960353108
 -3219674686933841021
 -6469229889308771589
  -606871692563545028
 -8199987422425699249
  -463287495999648233
  7675955260644241951

reference:
1. https://www.r-bloggers.com/r-in-a-64-bit-world/

--
Nicolas PARIS

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: How to handle INT8 data

Gabriel Becker
I am not on R-core, so I cannot speak to future plans to support int8
internally (though my impression is that there aren't any, at least none
that are close to fruition).

The standard way of dealing with whole numbers too big to fit in an integer
is to put them in a numeric (a double, down in C land). This can represent
integers up to 2^53 without loss of precision (see
http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
This is how long vector indices are (currently) implemented in R. If it's
good enough for indices, it's probably good enough for whatever you need
them for.
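
To make the 2^53 limit concrete, here is a minimal sketch in base R. The variable names are illustrative and not from the thread. It checks that neighbouring whole numbers are still distinct just below 2^53 and stop being distinguishable just above it.

```r
# Doubles carry a 53-bit significand, so every whole number up to 2^53
# is represented exactly.
limit <- 2^53

# Just below the limit, neighbouring whole numbers are still distinct.
below_is_exact <- (limit - 1) != (limit - 2)

# Just above it, adding 1 is lost to rounding.
above_collides <- (limit + 1) == limit

# R's 32-bit integer type tops out far lower.
int_max <- .Machine$integer.max

print(c(below_is_exact = below_is_exact, above_collides = above_collides))
print(int_max)
```

Whether this range is enough depends on the data: counters and indices fit comfortably, while arbitrary 64-bit values may not.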

Hope that helps.

~G


On Fri, Jan 20, 2017 at 6:33 AM, Nicolas Paris <[hidden email]>
wrote:

> Hello R users,
>
> I have to deal with int8 (8-byte, i.e. 64-bit integer) data in R. AFAIK, R
> only handles int4 (32-bit integers), via the `as.integer` function [1]. I wonder:
> 1. What is the best way to handle int8? `as.character`?
> `as.numeric`?
> 2. Is there any plan to support int8 in the future? As you may know,
> int4 is already too small to count the Earth's population.

--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research


Re: How to handle INT8 data

Murray Stokely
The lack of 64-bit integer support causes lots of problems when dealing
with certain types of data where the loss of precision from coercing to the
53 bits of a double is unacceptable.

Two packages were developed to deal with this: int64 and bit64.

You may need to find archived versions of these packages if they've fallen
off CRAN.
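
As a rough sketch of what the bit64 route might look like, assuming the bit64 package is installed, identifiers can be parsed from strings into its `integer64` class and converted back without losing digits. The names `ids_chr`, `ids64` and `round_trip_ok` are made up for this illustration, and the block skips the conversion when the package is not available.

```r
# Two of the identifiers from the original post, kept as strings so that
# no digits are lost before conversion.
ids_chr <- c("-1311071933951566764", "7675955260644241951")

if (requireNamespace("bit64", quietly = TRUE)) {
  # Parse the strings into 64-bit integers (assumes bit64 is installed).
  ids64 <- bit64::as.integer64(ids_chr)

  # Converting back to text should reproduce every digit.
  round_trip_ok <- all(as.character(ids64) == ids_chr)
} else {
  # Package not available: nothing to check.
  round_trip_ok <- NA
}

print(round_trip_ok)
```

Parsing from strings rather than from numeric literals matters here, because a numeric literal has already been rounded to a double by the time any conversion function sees it.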

Murray (mobile phone)

On Jan 20, 2017 7:20 AM, "Gabriel Becker" <[hidden email]> wrote:

> I am not on R-core, so cannot speak to future plans to internally support
> int8 (though my impression is that there aren't any, at least none that are
> close to fruition).
>
> The standard way of dealing with whole numbers too big to fit in an integer
> is to put them in a numeric (double down in C land). this can represent
> integers up to 2^53 without loss of precision see (
> http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
> This is how long vector indices are (currently) implemented in R. If it's
> good enough for indices it's probably good enough for whatever you need
> them for.


Re: How to handle INT8 data

R devel mailing list
In reply to this post by Nicolas Paris
If these are identifiers, store them as strings.  If not, what sort of
calculations do you plan on doing with them?
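
A minimal sketch of the string approach in base R follows. The table and column names (`people`, `visits`, `visit`) are hypothetical; the identifiers are taken from the original post. Because the keys are compared as text, the join is exact.

```r
# Two small tables keyed by the same 64-bit identifiers, held as strings.
people <- data.frame(
  human_id = c("-1311071933951566764", "5578000650960353108"),
  name     = c("toto", "tata"),
  stringsAsFactors = FALSE
)
visits <- data.frame(
  human_id = c("5578000650960353108", "-1311071933951566764",
               "5578000650960353108"),
  visit    = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# String comparison is exact, so the join cannot confuse nearby ids.
joined <- merge(people, visits, by = "human_id")
print(joined)
```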
Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: How to handle INT8 data

Nicolas Paris
In reply to this post by Murray Stokely
On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> The lack of 64 bit integer support causes lots of problems when dealing with
> certain types of data where the loss of precision from coercing to 53 bits with
> double is unacceptable.

Hello Murray,
Do you mean that, e.g., -1311071933951566764 loses precision when it goes
through as.numeric(-1311071933951566764)?
Thanks,


Re: How to handle INT8 data

Nicolas Paris
In reply to this post by R devel mailing list
Right, they are identifiers.

Storing them as strings has drawbacks:
- huge to store in memory
- slow to process
- huge to index (e.g. by data.table column indexes)

Why not store them as numeric?

Thanks,

On 20 Jan 2017 at 18:16, William Dunlap wrote:

> If these are identifiers, store them as strings.  If not, what sort of
> calculations do you plan on doing with them?
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com

--
Nicolas PARIS
Head of R&D
WIND - PACTE, Hôpital Rothschild (RTH)
Email: [hidden email]
Tel: 01 48 04 21 07


Re: How to handle INT8 data

Murray Stokely
In reply to this post by Nicolas Paris
2^53 == 2^53+1
TRUE

Which makes joining or grouping data sets with 64 bit identifiers
problematic.
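
A small base-R sketch of why this hurts joins and grouping: the two identifiers below are made up for illustration (they differ only in the last digit), and the variable names are assumptions as well. Near 1.3e18 adjacent doubles are 256 apart, so both strings parse to the same double.

```r
# Two different 64-bit identifiers that differ only in the last digit.
id_a <- "1311071933951566764"
id_b <- "1311071933951566765"

# As strings they are distinct...
distinct_as_text <- id_a != id_b

# ...but both parse to the same double, so a join or group-by on the
# numeric column would silently merge them.
same_as_double <- as.numeric(id_a) == as.numeric(id_b)
groups_as_double <- length(unique(as.numeric(c(id_a, id_b))))

print(c(distinct_as_text = distinct_as_text, same_as_double = same_as_double))
print(groups_as_double)
```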

Murray (mobile)

On Jan 20, 2017 9:15 AM, "Nicolas Paris" <[hidden email]> wrote:

> On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> > The lack of 64 bit integer support causes lots of problems when dealing
> > with certain types of data where the loss of precision from coercing to 53
> > bits with double is unacceptable.
>
> Hello Murray,
> Do you mean that, e.g., -1311071933951566764 loses precision when it goes
> through as.numeric(-1311071933951566764)?
> Thanks,

Re: How to handle INT8 data

Nicolas Paris
Well, I definitely cannot use them as numeric, because joining is the main
purpose of those identifiers.

The int64 and bit64 packages are not a solution either, because I am
releasing a dataset for external users. I cannot ask them to install a
package in order to use it.

I have to be very careful when releasing the data. If a user just uses the
read.csv family of functions, the identifiers are cast to numeric by default.

$ more res.csv
"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"


> read.table("res.csv", sep=";", header=TRUE)
           col1 col2
1 -1.311072e+18 toto
2 -1.311072e+18 tata

> sapply(read.table("res.csv", sep=";", header=TRUE), class)
     col1      col2
"numeric"  "factor"

> read.table("res.csv", sep=";", header=TRUE, colClasses="character")
                  col1 col2
1 -1311071933951566764 toto
2 -1311071933951566764 tata

Am I condemned to provide an R script along with the data so that people can use the dataset?
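
One possible shape for such a script, as a sketch only: it recreates the sample file above in a temporary location (the path and the names `csv_path` and `res` are assumptions) and pins just the identifier column to character by naming it in `colClasses`, leaving the other columns to R's usual type detection.

```r
# Recreate the two-row file from the message in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"col1";"col2"',
  '"-1311071933951566764";"toto"',
  '"-1311071933951566764";"tata"'
), csv_path)

# Naming the column in colClasses pins only col1 to character and
# leaves the remaining columns to the default type detection.
res <- read.table(csv_path, sep = ";", header = TRUE,
                  colClasses = c(col1 = "character"))

print(res$col1)
unlink(csv_path)
```

Compared with `colClasses="character"`, this keeps genuinely numeric columns numeric, at the cost of the reader having to know the identifier column's name.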

On 20 Jan 2017 at 18:29, Murray Stokely wrote:

> 2^53 == 2^53+1
> TRUE
>
> Which makes joining or grouping data sets with 64 bit identifiers problematic.

Re: How to handle INT8 data

Gabriel Becker
How many unique identifiers do you have?

If they are large (in terms of bytes) but you don't have that many of them
(e.g. the total possible number you'll ever have is < INT_MAX), you could
store them as factors. You get the speed of integers but the labeling of
full "precision" strings.  Factors are fast for joins.
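
The factor idea can be sketched in base R as follows. The identifiers come from the original post; the variable names are illustrative. Each distinct string is stored once as a level, and every row holds only a 32-bit integer code.

```r
# Identifier column as it would arrive from a file: strings, with repeats.
ids_chr <- c("-1311071933951566764", "5578000650960353108",
             "-1311071933951566764", "7675955260644241951")

# A factor stores each distinct string once (the levels) plus one
# 32-bit integer code per row.
ids_fct <- factor(ids_chr)

codes  <- as.integer(ids_fct)      # compact integer codes
labels <- as.character(ids_fct)    # full-precision strings recovered

print(codes)
print(levels(ids_fct))
```

Note that the integer codes are only meaningful relative to one factor's levels, so two tables need matching levels (or a conversion back to character) before their codes can be compared.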

~G

On Fri, Jan 20, 2017 at 9:47 AM, Nicolas Paris <[hidden email]>
wrote:

> Well I definitely cannot use them as numeric because join is the main
> reason of those identifiers.

Re: How to handle INT8 data

Peter Haverty
In reply to this post by Nicolas Paris
For what it is worth, I would be extremely pleased to see R's integer type go
to 64-bit.  A signed 32-bit integer is just a bit too small to index into the
~3 billion position human genome.  The workarounds that have arisen for
this specific issue are surprisingly complex.
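
The size mismatch is easy to check in base R. In this sketch `genome_pos` is an illustrative round number, not an actual coordinate, and the other names are assumptions too.

```r
# Largest value R's 32-bit integer type can hold.
int_max <- .Machine$integer.max

# A position near the end of a ~3 billion base genome (illustrative value).
genome_pos <- 3e9

fits_in_integer <- genome_pos <= int_max

# Coercing it to integer yields NA (with a warning); held as a double the
# position is still exact, which is the usual workaround.
as_int <- suppressWarnings(as.integer(genome_pos))
exact_as_double <- (genome_pos + 1) - genome_pos == 1

print(c(fits_in_integer = fits_in_integer, exact_as_double = exact_as_double))
print(as_int)
```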

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
[hidden email]


Re: How to handle INT8 data

Nicolas Paris
In reply to this post by Gabriel Becker
Hi,

I do have < INT_MAX.
This looks attractive, but since they are unique identifiers, storing
them as a factor is likely to be counter-productive (a string
version plus an int32 for each).

I was looking to https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good feet for my needs.
Any chances such an external package for int64 would be integrated in core ?


On 20 Jan 2017 at 18:57, Gabriel Becker wrote:

> How many unique identifiers do you have?
>
> If they are large (in terms of bytes) but you don't have that many of them (eg
> the total possible number you'll ever have is < INT_MAX), you could store them
> as factors. You get the speed of integers but the labeling of full "precision"
> strings.  Factors are fast for joins.
>
> ~G

--
Nicolas PARIS
Head of R&D
WIND - PACTE, Hôpital Rothschild ( RTH )
Email : [hidden email]
Tel : 01 48 04 21 07


Re: How to handle INT8 data

Gabriel Becker
Again, I can't speak for R-core, so I may be wrong about any of this (and
they are welcome to correct me), but it seems unlikely that they would
integrate a package that defines 64 bit integers into the core of R
without making the changes necessary to provide 64 bit integers as a
fundamental (atomic vector) type. I know this has come up before and they
have been reluctant to make those changes.

As Pete points out, they could "simply" change integers in R to always be
64 bit, though that would make (to an extent) all integer vectors in R
take up twice as much memory as they do now.

I should also mention that even if R-core did take up this cause, it
wouldn't happen quickly enough for what you probably need. I would guess we
would be talking months or years (i.e. the next non-patch R version at
the earliest, and likely the one after that, more than a year out).

One pragmatic solution (other than the factors, which is what I would
probably do) would be to distribute your data only as an R data package
which depends on csvread or similar.
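
For what it's worth, the factor route can be sketched in a few lines of base R. This is only an illustration: the identifiers are two of the ones from the original post, and the variable names are made up.

```r
# Identifiers arrive as strings and are stored as a factor in memory.
ids <- c("-1311071933951566764", "-4708675461424073238",
         "-1311071933951566764")
human_id <- factor(ids)

typeof(human_id)        # the factor is integer codes underneath
levels(human_id)        # each distinct identifier is kept once, as text
as.character(human_id)  # converts back to the original strings
```

Whether this saves anything depends on how often identifiers repeat; if every identifier is unique, the factor carries the full set of strings plus one integer code per row.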

~G

--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research


Re: How to handle INT8 data

Willem Ligtenberg-2
In reply to this post by Nicolas Paris
You might want to use a data.table then.
It will automatically detect that it is a 64 bit int.
Although also in that case the user will have to install the data.table
package.
(which is a good idea anyway in my opinion :) )

It will then obviously allow you to join tables.

Willem

On 20-01-17 18:47, Nicolas Paris wrote:

> Well, I definitely cannot use them as numeric because joins are the main
> reason for those identifiers.
>
> About the int64 and bit64 packages: they are not a solution, because I am
> releasing a dataset for external users. I cannot ask them to install a
> package in order to use it.
>
> I have to be very careful when releasing the data. If a user just uses
> the read.csv functions, they cast the identifiers to numeric by default.
>
> $ more res.csv
> "col1";"col2"
> "-1311071933951566764";"toto"
> "-1311071933951566764";"tata"
>
> > read.table("res.csv",sep=";",header=T)
>            col1 col2
> 1 -1.311072e+18 toto
> 2 -1.311072e+18 tata
>
> > sapply(read.table("res.csv",sep=";",header=T),class)
>      col1      col2
> "numeric"  "factor"
>
> > read.table("res.csv",sep=";",header=T,colClasses="character")
>                   col1 col2
> 1 -1311071933951566764 toto
> 2 -1311071933951566764 tata
>
> Am I condemned to provide an R script with the data in order to use the dataset?

Re: How to handle INT8 data

Dirk Eddelbuettel
In reply to this post by Nicolas Paris

Not sure how we got from int8 to int64 ... but for what it is worth, I
recently a) needed 64-bit integers to represent nanosecond timestamps (which
then became the still new-ish CRAN package 'nanotime') and b) found the
support in package bit64 for its bit64::integer64 to be easy to use and
performant -- plus c) the data.table package reads/writes these well.

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: How to handle INT8 data

Kasper Daniel Hansen-2
In reply to this post by Nicolas Paris
Have you benchmarked these potential drawbacks for your use case? E.g. memory
depends on the structure of the identifiers, given how R stores characters
internally.

Given all the issues raised here, I would 100% provide a script for reading
the data into R, if this is for distribution.
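
Such a script can be quite short. The sketch below is one possible shape, not a tested recipe for the real dataset: the helper name `read_ids_csv` is made up, and the column names and the `;` separator are assumptions taken from the small `res.csv` example quoted earlier in the thread.

```r
# Hypothetical helper shipped alongside the data; the name read_ids_csv
# and the column name col1 are assumptions based on the example file.
read_ids_csv <- function(path) {
  read.table(path, sep = ";", header = TRUE,
             colClasses = c(col1 = "character"),  # keep identifiers as text
             stringsAsFactors = FALSE)
}

# Recreate the small example file from the thread and read it back.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566764";"tata"'), tmp)
dat <- read_ids_csv(tmp)
class(dat$col1)  # character, so no digits are lost
dat$col1
```

Only the identifier column is forced to character here; the other columns still go through the usual type detection.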

Best,
Kasper

On Fri, Jan 20, 2017 at 12:28 PM, Nicolas Paris <[hidden email]>
wrote:

> Right, they are identifiers.
>
> Storing them as strings has drawbacks:
> - huge to store in memory
> - slow to process
> - huge to index (e.g. by data.table column indexes)
>
> Why not store them as numeric?
>
> Thanks,
>
> On 20 Jan 2017 at 18:16, William Dunlap wrote:
> > If these are identifiers, store them as strings.  If not, what sort of
> > calculations do you plan on doing with them?
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com

Re: How to handle INT8 data

Jeroen Ooms
In reply to this post by Murray Stokely
On Fri, Jan 20, 2017 at 6:09 PM, Murray Stokely <[hidden email]> wrote:
> The lack of 64 bit integer support causes lots of problems when dealing
> with certain types of data where the loss of precision from coercing to 53
> bits with double is unacceptable.
>
> Two packages were developed to deal with this:  int64 and bit64.

Don't forget packages for arbitrarily large numbers, such as Rmpfr
and openssl.

  x <- openssl::bignum("12345678987654321")
  x^10

The risk of storing int64 as a double (e.g. in bit64) is that it might
easily be mistaken for a completely different value via unclass() or
Rf_isNumeric() or so.


Re: How to handle INT8 data

hadley wickham
In reply to this post by Gabriel Becker
To summarise this thread, there are basically three ways of handling int64 in R:

* coerce to character
* coerce to double
* store in double

There is no ideal solution, and each has pros and cons that I've
attempted to summarise below.

## Coerce to character

This is the easiest approach if the data is used as identifiers. It
will have some performance drawbacks when loading and will require
additional memory. It should not have negative performance
implications once the data has been loaded, because R has a global
string pool, so string comparisons only require a single pointer
comparison (assuming they have the same encoding).
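
As a rough illustration of joining on character identifiers with base R only (the two data frames and their column names are invented for the example; the identifiers come from the original post):

```r
# Two small tables keyed by the identifier, stored as character.
people <- data.frame(human_id = c("-1311071933951566764",
                                  "5578000650960353108"),
                     name = c("toto", "tata"),
                     stringsAsFactors = FALSE)
visits <- data.frame(human_id = c("5578000650960353108",
                                  "-1311071933951566764",
                                  "5578000650960353108"),
                     visit = 1:3,
                     stringsAsFactors = FALSE)

# The join matches on the exact strings, so every digit counts.
joined <- merge(people, visits, by = "human_id")
joined
```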

## Coerce to double

This is the easiest approach if your integers are in the range
[-(2^53), 2^53] or you can tolerate some minor loss of precision.
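
To make the loss of precision concrete, here is a small check using one identifier from the original post and a made-up neighbour that differs only in the last digit:

```r
# Two different identifiers that differ only in the last digit.
a <- "-1311071933951566764"
b <- "-1311071933951566765"

a == b                          # FALSE: as strings they are distinct
as.numeric(a) == as.numeric(b)  # TRUE: both round to the same double
2^53 == 2^53 + 1                # TRUE: the case Murray pointed out
```

At this magnitude neighbouring doubles are 256 apart, so distinct identifiers can collide once coerced, which is what breaks joins and grouping.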

## Store in a double

This technique takes advantage of the fact that doubles and int64s are
the same size, so you can store the binary representation of an int64
in a double. This will effectively be garbage if you treat the vector
as if it is a double, so it requires adding an S3 class and overriding
every generic function with a custom method. Not all functions are
generic, and internal C code will not know about the special class, so
this has the danger of code silently interpreting the data
incorrectly.
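
A base R sketch of why the raw bits are garbage when read as a double. It does not use bit64 itself; it just reinterprets the eight bytes of the 64-bit integer 1 (written out by hand, little-endian) as a double:

```r
# The eight bytes of the 64-bit integer 1, least significant byte first.
int64_one <- as.raw(c(1, 0, 0, 0, 0, 0, 0, 0))

# Read the same bytes back as if they were a double.
as_double <- readBin(int64_one, what = "double", n = 1, size = 8,
                     endian = "little")
as_double   # a tiny denormal number, nowhere near 1
```

Any code that drops the class and does arithmetic on the underlying doubles is working with numbers like this one.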

This is the approach taken by the bit64 package (and, I believe, the
int64 package, but since that's been archived it's not worth
considering).

Hadley

--
http://hadley.nz


Re: How to handle INT8 data

Dirk Eddelbuettel

On 21 January 2017 at 10:56, Hadley Wickham wrote:
| To summarise this thread, there are basically three ways of handling int64 in R:
|
| * coerce to character
| * coerce to double
| * store in double
|
| ## Coerce to character

Serious performance loss.
 
| ## Coerce to double

Serious precision + functionality loss.

Remember, int64, not int53, is what we are after. That is what the other
systems we want to interoperate with have (bigtable indices).

| ## Store in a double

Best approach in my book, and done in bit64::integer64.

| This is the approach taken by the bit64 package (and, I believe, the

Incorrect.

The int64 package used an S4 class with two int32 values. The bit64 package
has a bit on the comparison. But as int64 is abandonware it doesn't matter either way.

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]
