Execution speed in randomForest


Jason & Caroline Shaw
I am using the randomForest package.  I have found that multiple runs
of precisely the same command can generate drastically different run
times.  Can anyone with knowledge of this package provide some insight
as to why this would happen and whether there's anything I can do
about it?  Here are some details of what I'm doing:

- Data: ~80,000 rows, with 10 columns (one of which is the class label)
- I randomly select 90% of the data to use to build 500 trees.

And this is what I find:

- Execution times of randomForest() using the entire dataset (in
seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22
- Execution times of randomForest() using the 90% selection: 17.78,
17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 <-- Note the 3rd,
4th, and 7th.
- When the speed is slow, it often stutters, with one or a few trees
being produced very quickly, followed by a slow build taking 10 or 20
seconds.
- The oob results are indistinguishable between the fast and slow runs.

I select the 90% of my data by using sample() to generate indices and
then subsetting, like: selection <- data[sample,].  I thought perhaps
this subsetting was getting repeated, rather than storing in memory a
new copy of all that data, so I tried circumventing this with
eval(data[sample,]).  Probably barking up the wrong tree -- it had no
effect, and doesn't explain the run-to-run variation (really, I'm just
not clear on what eval() is for).  I have also tried garbage
collecting with gc() between each run, and adding a Sys.sleep() for 5
seconds, but neither of these has helped either.
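
For reference, a minimal sketch of the 90% selection described above, using a small synthetic data frame in place of the real data. The names `dat`, `idx`, and `train` are illustrative assumptions, not taken from the original code. An assignment like this evaluates the subset once and stores the result, so later uses of `train` should not repeat the subsetting.

```r
# Synthetic stand-in for the real data: 9 predictors plus a class label.
set.seed(1)   # fix the RNG so the same rows are drawn on every run
n   <- 1000
dat <- data.frame(matrix(rnorm(n * 9), ncol = 9),
                  label = factor(sample(c("a", "b"), n, replace = TRUE)))

# Draw 90% of the row indices without replacement.
# Using a name other than `sample` avoids shadowing the base function.
idx <- sample(nrow(dat), size = floor(0.9 * nrow(dat)))

# The subset is computed here, once; `train` is an ordinary data frame.
train <- dat[idx, ]

dim(train)
```

Fixing the seed also makes it possible to check whether a slow run is tied to a particular draw of rows or recurs with identical input.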

Any ideas?

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Execution speed in randomForest

jholtman
Are you looking at the CPU or the elapsed time?  If it is the elapsed
time, then also capture the CPU time to see if it is different.  Also
consider the use of the Rprof function to see where time is being
spent.  What else is running on the machine?  Are you doing any
paging?  What type of system are you running on?  Use some of the
system level profiling tools.  If on Windows, then use perfmon.
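
One way to capture CPU and elapsed time side by side is sketched below. The helper `time_runs` and the placeholder `workload` are hypothetical names introduced for illustration; the real `randomForest()` call would take the place of the placeholder.

```r
# Run a zero-argument function `f` n times and record, for each run, the
# user CPU, system CPU, and elapsed (wall-clock) times from system.time().
time_runs <- function(f, n = 5) {
  timings <- vapply(seq_len(n), function(i) {
    gc()  # start each run from a similar memory state
    unclass(system.time(f()))[c("user.self", "sys.self", "elapsed")]
  }, numeric(3))
  as.data.frame(t(timings))  # one row per run, one column per timing
}

# Placeholder workload; replace with the real model-fitting call.
workload <- function() sum(sort(runif(1e5)))

res <- time_runs(workload, n = 3)
print(res)
```

If elapsed time is much larger than user plus system time in the slow runs, the process is waiting on something (paging, I/O, other jobs); if they agree, the extra time is being spent computing.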

On Fri, Apr 6, 2012 at 11:28 AM, Jason & Caroline Shaw
<[hidden email]> wrote:

> [...]



--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


Re: Execution speed in randomForest

Jason & Caroline Shaw
The CPU time and elapsed time are essentially identical. (That is, the
system time is negligible.)

Using Rprof, I just ran the code twice.  The first time, while
randomForest is doing its thing, there are 850 consecutive lines which
read:
".C" "randomForest.default" "randomForest" "randomForest.formula" "randomForest"
Upon running it a second time, this time taking 285 seconds to
complete, there are 14201 such lines, with nothing intervening.
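
As a rough cross-check, assuming Rprof was run with its default sampling interval of 0.02 seconds, each line in the profile corresponds to about 0.02 s of run time. The line counts above can then be converted to seconds (the variable names here are illustrative):

```r
interval <- 0.02                          # default `interval` of Rprof(), in seconds
samples  <- c(fast = 850, slow = 14201)   # consecutive profile lines in each run

est_seconds <- samples * interval
est_seconds
```

Under that assumption the slow run works out to about 284 s, in line with the 285 seconds observed, and all of it is attributed to the `.C` call inside `randomForest.default`.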

There shouldn't be interference from elsewhere on the machine.  This
is the only memory- and CPU-intensive process.  I don't know how to
check what kind of paging is going on, but since the machine has 16GB
of memory and I am using maybe 3 or 4 at most, I hope paging is not an
issue.

I'm on a CentOS 5 box running R 2.15.0.

On Fri, Apr 6, 2012 at 12:45 PM, jim holtman <[hidden email]> wrote:

> [...]


Re: Execution speed in randomForest

Liaw, Andy
Without seeing your code, it's hard to say much more, but do avoid using the formula interface when you have large data.

Andy
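
A minimal sketch of the two calling styles, for comparison. The Rprof output quoted in this thread shows `randomForest.formula` in the call stack, so the formula interface was in use. The data and the names `train`, `predictors`, `fit_formula`, and `fit_xy` are synthetic and illustrative; the sketch assumes the randomForest package is installed and does nothing otherwise.

```r
# Assumption: the randomForest package is installed; otherwise this is a no-op.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 500
  # Synthetic data: 9 numeric predictors and a two-level class label.
  train <- data.frame(matrix(rnorm(n * 9), ncol = 9),
                      label = factor(sample(c("a", "b"), n, replace = TRUE)))
  predictors <- setdiff(names(train), "label")

  # Formula interface: convenient, but a model frame is built first.
  fit_formula <- randomForest::randomForest(label ~ ., data = train, ntree = 50)

  # x/y interface: predictors and response are passed directly.
  fit_xy <- randomForest::randomForest(x = train[, predictors],
                                       y = train$label, ntree = 50)
}
```

Whether switching interfaces removes the run-to-run variation reported here is not established in this thread; the sketch only shows what the suggested change looks like.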

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Jason & Caroline Shaw
Sent: Friday, April 06, 2012 1:20 PM
To: jim holtman
Cc: [hidden email]
Subject: Re: [R] Execution speed in randomForest

[...]