Suggestion: Create On-Disk Dataframes


Juan Telleria
Dear R Developers,

I would like to suggest creating a new S4 class for on-disk data.frames that do
not fit in RAM, which could be called disk.data.frame().

It could be based on RSQLite, for example (by translating R syntax into SQL),
and the syntax and behaviour of the disk.data.frame() class could be exactly
the same as for data.frame objects.

When the session ends, if such disk.data.frames have not been saved, an
implicit DROP TABLE could be run on all the tables created through RSQLite.
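For concreteness, here is a minimal sketch of how such a class might look on
top of DBI + RSQLite. The class name, its slots, and the single `[` method are
hypothetical illustrations of the idea, not an existing API:

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite backend

# Hypothetical S4 class: the data live in an SQLite table, not in RAM.
setClass("disk.data.frame",
         slots = c(con = "SQLiteConnection", table = "character"))

disk.data.frame <- function(df, dbfile = tempfile(fileext = ".sqlite")) {
  con <- dbConnect(RSQLite::SQLite(), dbfile)
  dbWriteTable(con, "data", df)
  new("disk.data.frame", con = con, table = "data")
}

# Sketch of translating R row indexing into SQL: fetch only the rows asked for.
setMethod("[", "disk.data.frame", function(x, i, j, ..., drop = TRUE) {
  dbGetQuery(x@con, paste("SELECT * FROM", x@table, "LIMIT", max(i)))[i, ]
})
```

Usage would then mirror a plain data.frame, e.g. `ddf <- disk.data.frame(mtcars);
ddf[1:3, ]`, with the indexing silently running a query against the on-disk
table.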

Nowadays, with SSD drives, such a new data.frame class could make sense,
especially when dealing with Big Data.

It is true that this new class might be slower than the regular data.frame,
data.table, or tibble classes, but we would be able to handle much more data,
even if at the cost of speed.

We could also do all this work with data sampling and a regular ODBC
connection, but for people who do not know how to use an RDBMS, or the
special-purpose R packages for this job, this class could help.

Another option would be to base this new S4 class on Feather files, but
building it on RSQLite is probably simpler.

A GitHub project could be created for this purpose, so that the whole
community can contribute (myself included :D ).

Thank you,
Juan


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Suggestion: Create On-Disk Dataframes

Suzen, Mehmet
It is not needed. There is a large community of developers using SparkR.
https://spark.apache.org/docs/latest/sparkr.html
It does exactly what you want.
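For readers who have not used it, the linked docs show the basic pattern.
This is an illustrative sketch only, assuming a local Spark installation and
the SparkR package that ships with it:

```r
library(SparkR)

sparkR.session(master = "local[*]")  # local Spark session, no cluster needed

df <- as.DataFrame(faithful)         # a Spark DataFrame: data stay outside R's heap
head(filter(df, df$waiting > 70))    # operations are translated into Spark jobs

sparkR.session.stop()
```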



Re: Suggestion: Create On-Disk Dataframes

Dirk Eddelbuettel

On 4 September 2017 at 11:35, Suzen, Mehmet wrote:
| It is not needed. There is a large community of developer using SparkR.
| https://spark.apache.org/docs/latest/sparkr.html
| It does exactly what you want.

I hope you are not going to mail a SparkR commercial to this list every day.
As the count is now at two, this may be an excellent time to stop.

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | [hidden email]


Re: Suggestion: Create On-Disk Dataframes; SparkR

frederik
What's wrong with SparkR? I'd never heard of either Spark or SparkR.

For on-disk dataframes there is a package called 'ff'. I looked into using
it; it works well, but there are some drawbacks in the implementation. I
think it should be possible to mmap an object from disk and use it as a
vector, but 'ff' is doing something else:

https://github.com/edwindj/ffbase/issues/52
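As a quick illustration of 'ff' (a sketch, assuming the package is installed):
the data live in a memory-mapped file on disk, and only the chunks you touch
are pulled into RAM.

```r
library(ff)

x <- ff(vmode = "double", length = 1e6)  # on-disk numeric vector, not in RAM
x[1:5] <- sqrt(1:5)                      # reads/writes go through chunked file I/O
fdf <- ffdf(x = x)                       # the on-disk data.frame analogue
filename(x)                              # path of the file backing the vector
```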

I think you'd need something called a "weak reference" to do this
properly:

http://homepage.divms.uiowa.edu/~luke/R/references/weakfinex.html
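True weak references live only in R's C API (R_MakeWeakRef, as Luke Tierney's
notes describe); the related facility exposed at the R level is
reg.finalizer(), which runs a function once an object becomes unreachable. A
small illustration:

```r
flag <- new.env()
flag$collected <- FALSE

e <- new.env()
# Register a finalizer: it fires when 'e' is garbage collected.
reg.finalizer(e, function(obj) flag$collected <- TRUE)

rm(e)
invisible(gc())  # collection triggers the finalizer
flag$collected   # TRUE once the environment has been collected
```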

I don't know what SparkR is doing under the hood.

Then again, I was mostly interested in having large data sets that persist
across R sessions, while Juan seems to be interested in supporting data that
doesn't fit in RAM. But if something doesn't fit in RAM, can't it be swapped
out to disk by the OS? So I'm not sure why you'd want a special interface for
that situation, aside from giving the programmer more control.

Thanks,

Frederick
