|
I am using the randomForest package. I have found that multiple runs
of precisely the same command can generate drastically different run
times. Can anyone with knowledge of this package provide some insight
as to why this would happen and whether there's anything I can do
about it? Here are some details of what I'm doing:

- Data: ~80,000 rows, with 10 columns (one of which is the class label)
- I randomly select 90% of the data to use to build 500 trees.

And this is what I find:

- Execution times of randomForest() using the entire dataset (in
  seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22
- Execution times of randomForest() using the 90% selection: 17.78,
  17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 <-- Note the 3rd,
  4th, and 7th.
- When the speed is slow, it often stutters, with one or a few trees
  being produced very quickly, followed by a slow build taking 10 or 20
  seconds
- The oob results are indistinguishable between the fast and slow runs.

I select the 90% of my data by using sample() to generate indices and
then subsetting, like: selection <- data[sample,]. I thought perhaps
this subsetting was getting repeated, rather than storing in memory a
new copy of all that data, so I tried circumventing this with
eval(data[sample,]). Probably barking up the wrong tree -- it had no
effect, and doesn't explain the run-to-run variation (really, I'm just
not clear on what eval() is for). I have also tried garbage
collecting with gc() between each run, and adding a Sys.sleep() for 5
seconds, but neither of these has helped either.

Any ideas?

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
|
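On the eval() question above, a minimal sketch may help. It assumes a small made-up data frame in place of the real ~80,000-row data, and the name idx is hypothetical (chosen to avoid confusion with the sample() function). The point it illustrates is that the subset is computed once, at assignment, so wrapping the same expression in eval() should give the same object and is unlikely to affect timing.

```r
set.seed(42)

# Small stand-in for the data described above (9 predictors + class label).
n <- 1000
data <- data.frame(matrix(rnorm(n * 9), ncol = 9),
                   label = factor(sample(c("a", "b"), n, replace = TRUE)))

# Draw 90% of the row indices without replacement.
idx <- sample(nrow(data), size = floor(0.9 * nrow(data)))

# The subsetting happens once, here; selection is then an ordinary data frame.
selection <- data[idx, ]

# The argument is evaluated before eval() sees it, and eval() of a plain
# value just returns that value, so both routes give the same object.
same <- identical(selection, eval(data[idx, ]))
print(same)
print(dim(selection))
```

Since selection is a separate object after the assignment, later calls that use it do not repeat the subsetting.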
|
Are you looking at the CPU time or the elapsed time? If it is the elapsed
time, then also capture the CPU time to see whether the two differ. Also
consider using the Rprof function to see where the time is being
spent.

What else is running on the machine? Are you doing any
paging? What type of system are you running on? Use some of the
system-level profiling tools. If you are on Windows, use perfmon.

On Fri, Apr 6, 2012 at 11:28 AM, Jason & Caroline Shaw <[hidden email]> wrote:
> I am using the randomForest package. I have found that multiple runs
> of precisely the same command can generate drastically different run
> times. [...]

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
|
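One way to capture CPU and elapsed time side by side, as suggested above, is sketched here. The helper time_runs() and the workload() function are hypothetical names; workload() is a base-R stand-in for the real randomForest() call. Recording the seed with each run is an added assumption, meant to make a slow run repeatable.

```r
# Time repeated runs of f(), recording the seed so a slow run can be rerun.
time_runs <- function(f, seeds) {
  rows <- lapply(seeds, function(s) {
    set.seed(s)
    t <- system.time(f())
    data.frame(seed    = s,
               user    = t[["user.self"]],  # CPU time spent in R code
               system  = t[["sys.self"]],   # CPU time spent in the kernel
               elapsed = t[["elapsed"]])    # wall-clock time
  })
  do.call(rbind, rows)
}

# Stand-in workload; replace the body with the model fit being investigated.
workload <- function() {
  x <- matrix(rnorm(200 * 200), nrow = 200)
  invisible(solve(crossprod(x) + diag(200)))
}

timings <- time_runs(workload, seeds = 1:5)
print(timings)
```

If elapsed is much larger than user plus system for some runs, the process is probably waiting on something (paging, I/O, other processes); if the three move together, the extra time is being spent computing.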
|
The CPU time and elapsed time are essentially identical. (That is, the
system time is negligible.)

Using Rprof, I just ran the code twice. The first time, while
randomForest is doing its thing, there are 850 consecutive lines which
read:

".C" "randomForest.default" "randomForest" "randomForest.formula" "randomForest"

Upon running it a second time, this time taking 285 seconds to
complete, there are 14201 such lines, with nothing intervening.

There shouldn't be interference from elsewhere on the machine. This is
the only memory- and CPU-intensive process. I don't know how to check
what kind of paging is going on, but since the machine has 16 GB of
memory and I am using maybe 3 or 4 GB at most, I hope paging is not an
issue. I'm on a CentOS 5 box running R 2.15.0.

On Fri, Apr 6, 2012 at 12:45 PM, jim holtman <[hidden email]> wrote:
> Are you looking at the CPU or the elapsed time? If it is the elapsed
> time, then also capture the CPU time to see if it is different. [...]
|
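The line counts reported above can be turned into time estimates, since each line in raw Rprof() output is one sample. The sketch below assumes the default sampling interval of 0.02 seconds was used (the thread does not say). The helper estimate_seconds() is a hypothetical name, and the profile file is hand-written for illustration rather than real output.

```r
# Estimate seconds spent in stacks matching `pattern` from a raw Rprof file.
estimate_seconds <- function(path, pattern) {
  lines <- readLines(path)
  # The first line is a header such as "sample.interval=20000" (microseconds).
  interval <- as.numeric(sub("^.*sample.interval=", "", lines[1])) / 1e6
  hits <- sum(grepl(pattern, lines[-1], fixed = TRUE))
  hits * interval
}

# Hand-written stand-in for a profile file: a header plus 850 copies of
# the call stack quoted in the message above.
stack <- '".C" "randomForest.default" "randomForest" "randomForest.formula" "randomForest" '
demo_file <- tempfile()
writeLines(c("sample.interval=20000", rep(stack, 850)), demo_file)
fast_seconds <- estimate_seconds(demo_file, '".C"')

# The same arithmetic applied directly to the two reported counts.
reported <- c(fast = 850, slow = 14201) * 0.02
print(fast_seconds)
print(reported)
```

Under that assumption, 850 samples correspond to about 17 seconds and 14201 samples to about 284 seconds, which matches the fast run times and the 285-second slow run. That would suggest essentially all of the extra time is spent inside the compiled code reached through .C, not in R-level overhead.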
|
Without seeing your code, it's hard to say much more, but do avoid using the formula interface when you have large data.

Andy

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Jason & Caroline Shaw
Sent: Friday, April 06, 2012 1:20 PM
To: jim holtman
Cc: [hidden email]
Subject: Re: [R] Execution speed in randomForest

The CPU time and elapsed time are essentially identical. (That is, the
system time is negligible.) [...]
|
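The profiler output quoted earlier in the thread shows randomForest.formula in the call stack, so the formula interface was in use. A minimal sketch of the alternative, passing predictors and response directly as x and y, follows. The data are made up, the names d, train, x and y are hypothetical, and ntree = 50 is a smaller stand-in for the 500 trees in the thread. The randomForest help page, as I understand it, carries a similar note about formula-handling overhead on large data sets. Whether this accounts for the run-to-run variation is not established here.

```r
set.seed(1)

# Made-up stand-in data: 9 numeric predictors and a two-level class label.
n <- 2000
d <- data.frame(matrix(rnorm(n * 9), ncol = 9))
d$label <- factor(ifelse(d$X1 + rnorm(n) > 0, "a", "b"))

# 90% training selection, as in the original post.
idx <- sample(nrow(d), size = floor(0.9 * nrow(d)))
train <- d[idx, ]

# Split predictors and response once, up front.
x <- train[, setdiff(names(train), "label")]
y <- train$label

# Only fit if the package is installed; both calls fit a classification
# forest on the same training rows.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_xy      <- randomForest::randomForest(x = x, y = y, ntree = 50)
  fit_formula <- randomForest::randomForest(label ~ ., data = train, ntree = 50)
  print(fit_xy$ntree)
}
```

The x/y call skips the model-frame construction that the formula method performs before handing off to randomForest.default, so it is the cheaper entry point when the data are already in predictor/response form.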
