First time r user

First time r user

dreloyd
CONTENTS DELETED
The author has deleted this message.

Re: First time r user

Rainer Schuermann
It would be helpful if

- you give us some sample data:
   dput( head( myData ) )

- tell us what kind of function you want to apply, or
   what the result you want to achieve should look like

- show us what you have done so far,
   and where you are stuck
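On the first point, dput() prints a plain-text representation of an object that others can paste straight into their own R session. A minimal illustration with a made-up myData (the name and columns are placeholders, not from the thread):

```r
# Small stand-in data set; 'myData' is hypothetical
myData <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.0))

# dput() emits code that reconstructs the object exactly,
# so helpers can reproduce your data with a single paste
dput(head(myData))
```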




On Saturday 17 August 2013 19:33:08 Dylan Doyle wrote:

>
> Hello R users,
>
>
> I have recently begun a project to analyze a large data set of approximately 1.5 million rows and 9 columns. My objective is to locate particular subsets within this data, i.e. take all rows sharing the same value in column 9 and apply a function to that subset. It was suggested to me that I use the ddply() function from the plyr package. Any advice would be greatly appreciated.
>
>
> Thanks much,
>
> Dylan

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: First time r user

Steve Lianoglou-2
In reply to this post by dreloyd
Hi,

In addition to Rainer's suggestions (which are to give a small example
of what your input data look like and an example of the output you
want), given the size of your input data, you might want to try the
data.table package instead of plyr::ddply -- especially while you are
exploring different combinations/calculations over your data.

Usually, the equivalent data.table approach (to the ddply one) tends
to be orders of magnitude faster and more memory efficient.

When my data are small, I often use both (I think the plyr/ddply
"language" is rather beautiful), but when my data get into the
thousands of rows, I'll universally switch to data.table.
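By way of illustration, here is the same grouped summary written three ways on toy data. This is only a sketch (the column names are invented), and the commented-out plyr/data.table lines assume those packages are installed:

```r
# Toy stand-in for a large data set: group by one column, summarise another
df <- data.frame(grp = c("a", "a", "b", "b"), val = c(1, 2, 10, 20))

# Base R:
aggregate(val ~ grp, data = df, FUN = mean)

# plyr (if installed):
# library(plyr)
# ddply(df, "grp", summarise, avg = mean(val))

# data.table (if installed) -- typically much faster on millions of rows:
# library(data.table)
# dt <- as.data.table(df)
# dt[, .(avg = mean(val)), by = grp]
```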

HTH,
-steve





--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech


Re: First time r user

Steve Lianoglou-2
Hi Paul,

First: please keep your replies on list (use reply-all when replying
to R-help lists) so that others can help but also the lists can be
used as a resource for others.

Now:

On Aug 18, 2013, at 12:20 AM, Paul Bernal <[hidden email]> wrote:

> Can R really handle millions of rows of data?

Yup.

> I thought it was not possible.

Surprise :-)

As I type, I'm working with a ~5.5 million row data.table pretty effortlessly.

Columns matter too, of course -- RAM is RAM, after all, and you've got
to be able to fit the whole thing into it if you want to use
data.table. Once loaded, though, data.table lets you do
split/apply/combine calculations over these data quite efficiently.
The first time I used it, I was honestly blown away.

If you find yourself wanting to work with such data, you could do
worse than read through data.table's vignette and FAQ and give it a
spin.

HTH,

-steve


Re: First time r user

PaulJr
Thanks a lot for the valuable information.

Now my question would be: how many columns can R handle, provided that
I have millions of rows? And, in general, what's the maximum number of
rows and columns that R can handle effortlessly?

Best regards and again thank you for the help,

Paul

Re: First time r user

Steve Lianoglou-2
Hi Paul,


This is all determined by your RAM.

Prior to R-3.0, R could only handle vectors of length up to 2^31 - 1.
If you were working with a matrix, that meant you could only have that
many elements in the entire matrix.

If you were working with a data.frame, you could have 2^31 - 1 rows
and, I'd guess, as many columns: since a data.frame is really a list
of vectors, the entire thing doesn't have to sit in one contiguous
(and contiguously addressable) block of memory.

R-3.0 introduced "long vectors" (search for that section in the
release notes):

https://stat.ethz.ch/pipermail/r-announce/2013/000561.html

This raises the maximum vector length to 2^52 elements (assuming you
are running 64-bit R). So, if you've got the RAM, you can in theory
have a data.frame/data.table with billions of rows.

To figure out how much data you can handle on your machine, you need
to know the size of real/integer/whatever and the number of elements
of those you will have so you can calculate the amount of RAM you need
to load it all up.

Lastly, I should mention there are packages that let you work with
"out of memory" data, like bigmemory, biglm, ff. Look at the HPC Task
view for more info along those lines:

http://cran.r-project.org/web/views/HighPerformanceComputing.html
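As a rough back-of-envelope for the 1.5-million-row, 9-column set mentioned earlier in this thread (assuming, for illustration, that every column is stored as an 8-byte double; character and factor columns behave differently):

```r
rows  <- 1.5e6
cols  <- 9
bytes <- rows * cols * 8   # 8 bytes per double
bytes / 2^20               # ~103 MB -- comfortably within a few GB of RAM
```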



Re: First time r user

PaulJr
Thank you so much Steve.

The computer I'm currently working with runs 32-bit Windows 7, and it
only has 4 GB of RAM, so I guess that's a big limitation.

Re: First time r user

dreloyd
CONTENTS DELETED
The author has deleted this message.

Re: First time r user

Bert Gunter
This is ridiculous!

Please read "An Introduction to R" (ships with R) or other online R
tutorial. There are many good ones. There are also probably online
courses. Please make an effort to learn the basics before posting
further here.

-- Bert



On Sun, Aug 18, 2013 at 7:13 AM, Dylan Doyle <[hidden email]> wrote:

> Hello all, thank you for your speedy replies.
>
> Here are the first few lines from head():
>
>   brewery_id            brewery_name review_time review_overall review_aroma review_appearance review_profilename
> 1      10325         Vecchio Birraio  1234817823            1.5          2.0               2.5            stcules
> 2      10325         Vecchio Birraio  1235915097            3.0          2.5               3.0            stcules
> 3      10325         Vecchio Birraio  1235916604            3.0          2.5               3.0            stcules
> 4      10325         Vecchio Birraio  1234725145            3.0          3.0               3.5            stcules
> 5       1075 Caldera Brewing Company  1293735206            4.0          4.5               4.0     johnmichaelsen
> 6       1075 Caldera Brewing Company  1325524659            3.0          3.5               3.5            oline73
>
>                       beer_style review_palate review_taste              beer_name beer_abv beer_beerid
> 1                     Hefeweizen           1.5          1.5           Sausa Weizen      5.0       47986
> 2             English Strong Ale           3.0          3.0               Red Moon      6.2       48213
> 3         Foreign / Export Stout           3.0          3.0 Black Horse Black Beer      6.5       48215
> 4                German Pilsener           2.5          3.0             Sausa Pils      5.0       47969
> 5 American Double / Imperial IPA           4.0          4.5          Cauldron DIPA      7.7       64883
> 6           Herbed / Spiced Beer           3.0          3.5    Caldera Ginger Beer      4.7       52159
>
> So far I have only worked out how to import the data set and run some
> basic R functions on it. My goal is to be able to answer questions like:
> what are the top 10 pilsners, or which brewer has the highest average
> ABV? Also, using two factors such as aroma and appearance, which beer
> style should I try? Let me know if I can give you any more information
> you might need to help me.
>
> Thanks again,
>
> Dylan
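Once the basics are in place, the two questions quoted above are plain grouped summaries. A base-R sketch on a toy stand-in (the `beers` data frame and its values are invented here; only the column names follow the head() output in Dylan's message):

```r
# Toy stand-in using a few of the columns from the head() output above
beers <- data.frame(
  brewery_name   = c("A", "A", "B", "B"),
  beer_style     = c("German Pilsener", "German Pilsener",
                     "Hefeweizen", "German Pilsener"),
  beer_name      = c("P1", "P2", "H1", "P3"),
  review_overall = c(4.5, 3.0, 4.0, 5.0),
  beer_abv       = c(5.0, 5.2, 5.5, 4.8)
)

# Top-rated pilsners (take head(..., 10) of this on the real data)
pils <- beers[beers$beer_style == "German Pilsener", ]
pils[order(-pils$review_overall), c("beer_name", "review_overall")]

# Brewer with the highest average ABV
avg_abv <- aggregate(beer_abv ~ brewery_name, data = beers, FUN = mean)
avg_abv[which.max(avg_abv$beer_abv), ]
```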



--

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


Re: First time r user

Steve Lianoglou-2
Yes, please do some reading and take a crack at your data first.

This will only be a fruitful endeavor once you have some working
knowledge of R.

Hadley is compiling a nice book online that I think is very helpful to
read through:
https://github.com/hadley/devtools/wiki/Introduction

The section on "functional looping patterns" will be immediately
useful (once you have a bit more background working with R):
http://github.com/hadley/devtools/wiki/functionals#looping-patterns

It's really a great resource and you should spend the time to read
through it. Once you read and understand the looping-patterns section,
you'll be able to handle your data like a pro and you can move on to
asking more interesting questions ;-)

If something is unclear there, though, please do raise that issue.

HTH,
-steve



Re: First time r user

JohnDee
In reply to this post by PaulJr
On Sun, 18 Aug 2013 02:56:56 -0500
Paul Bernal <[hidden email]> wrote:

Paul,

I would suggest acquiring at least a small library of books about R
and reading them.  For starters I would recommend "An Introduction to
R" and "R Data Import/Export" (both available on the R Project site in
both PDF and HTML), "Introductory Statistics with R", Venables and
Ripley, and "R in a Nutshell".  It is pointless to answer some of
these questions here when the answers are there for the taking.  You
should also look through the archives of previous discussions in the
various R user groups, accessible through the project site, since
memory limitations have often been discussed.

JWDougherty
