nested for loops too slow


nested for loops too slow

emorway
The small example below works lightning-fast; however, when I run the same
script on my real problem, a 1 GB text file, the for loops have been running
for over 24 hrs and I have no idea whether the processing is 10% done or 90%
done.  I have not been able to figure out a better way to code up the
material within the for loops at the end of the example below.  The
contents of divChng, the final product, are exactly what I'm after, but I
need help formulating a more efficient R script; I've got two more 1 GB
files to process after the current one finishes, whenever that is...

I appreciate any insights/solutions, Eric

dat <- read.table(textConnection("ISEG  IRCH  div  gw
1  1  265  229
1  2  260  298
1  3  234  196
54  1  432  485
54  39  467  485
54  40  468  468
54  41  460  381
54  42  489  502
1  1  265  317
1  2  276  225
1  3  217  164
54  1  430  489
54  39  456  495
54  40  507  607
54  41  483  424
54  42  457  404
1  1  265  278
1  2  287  370
1  3  224  274
54  1  412  585
54  39  473  532
54  40  502  595
54  41  497  441
54  42  447  467
1  1  230  258
1  2  251  152
1  3  199  179
54  1  412  415
54  39  439  538
54  40  474  486
54  41  477  484
54  42  413  346
1  1  230  171
1  2  262  171
1  3  217  263
54  1  432  485
54  39  455  482
54  40  493  419
54  41  489  536
54  42  431  504
1  1  1002  1090
1  2  1222  1178
1  3  1198  1177
54  1  1432  1485
54  39  1876  1975
54  40  1565  1646
54  41  1455  1451
54  42  1427  1524
1  1  1002  968
1  2  1246  1306
1  3  1153  1158
54  1  1532  1585
54  39  1790  1889
54  40  1490  1461
54  41  1518  1536
54  42  1486  1585
1  1  1002  1081
1  2  1229  1262
1  3  1142  1241
54  1  1632  1659
54  39  1797  1730
54  40  1517  1466
54  41  1527  1589
54  42  1514  1612"),header=TRUE)

dat$seq <- ifelse(dat$ISEG==1 & dat$IRCH==1, 1, 0)  # flag the (ISEG 1, IRCH 1) row that starts each block
tmp <- diff(dat[dat$seq==1,]$div)!=0                # TRUE where 'div' changes between flagged rows
dat$idx <- 0
dat[dat$seq==1,][c(TRUE,tmp),]$idx <- 1             # mark the start of each new time step
dat$ts <- cumsum(dat$idx)                           # time-step counter
dat$iter <- ave(dat$seq, dat$ts, FUN=cumsum)        # iteration counter within each time step
dat$ct <- seq_len(nrow(dat))                        # row counter

timeStep <- unique(dat$ts)
SEG <- unique(dat$ISEG)
divChng <- data.frame(ts=NA, ISEG=NA, divChng=NA, gwChng=NA, iter=NA)

# Can the following be rewritten to better harness R's processing power?

for (i in 1:length(timeStep)){
  for (j in 1:length(SEG)){
    datTS <- subset(dat, ts==timeStep[i] & ISEG==SEG[j] & IRCH==1)  # first-reach rows only
    datGW <- subset(dat, ts==timeStep[i] & ISEG==SEG[j])            # all rows for this ts/segment
    grw <- aggregate(gw ~ iter, datGW, sum)                         # groundwater total per iteration

    DC <- max(datTS$div) - min(datTS$div)   # spread of 'div' across iterations
    GRW <- max(grw$gw) - min(grw$gw)        # spread of the per-iteration gw totals
    divChng <- rbind(divChng, c(datTS$ts[1], SEG[j], DC, GRW, max(datTS$iter)))
  }
}
divChng <- divChng[!is.na(divChng$ISEG),]  # drop the all-NA seed row


Re: nested for loops too slow

J Robertson-Burns
You are certainly in Circle 2 of 'The R Inferno',
which I suspect is where almost all of the
computation time is coming from.

Instead of doing:

divChng <- rbind(divChng, c(datTS$ts[1], SEG[j], DC, GRW, max(datTS$iter)))

it would be much better to create 'divChng' at its final length
and then, inside the loop, do:

divChng[count, ] <- c(datTS$ts[1], SEG[j], DC, GRW, max(datTS$iter))
count <- count + 1
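
For the example in your post, the preallocated version might look like
this (a sketch: the same computations as your loop, only the result
handling changes):

n <- length(timeStep) * length(SEG)
# Preallocate the result at its final size, then fill rows by index.
divChng <- data.frame(ts = numeric(n), ISEG = numeric(n), divChng = numeric(n),
                      gwChng = numeric(n), iter = numeric(n))
count <- 1
for (i in seq_along(timeStep)) {
  for (j in seq_along(SEG)) {
    datTS <- subset(dat, ts == timeStep[i] & ISEG == SEG[j] & IRCH == 1)
    datGW <- subset(dat, ts == timeStep[i] & ISEG == SEG[j])
    grw <- aggregate(gw ~ iter, datGW, sum)
    divChng[count, ] <- c(datTS$ts[1], SEG[j],
                          max(datTS$div) - min(datTS$div),
                          max(grw$gw) - min(grw$gw),
                          max(datTS$iter))
    count <- count + 1
  }
}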

You may also want to explore 'tapply'.
If this is something you will be doing
a lot of, then you should probably learn
the 'data.table' package.
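
For example, here is an untested data.table sketch of the whole summary,
using the column names from your post (the tapply() call is the sort of
thing I mean, doing the per-iteration groundwater sums within each group):

library(data.table)
DT <- as.data.table(dat)
# Per (ts, ISEG): spread of 'div' over the IRCH == 1 rows, spread of the
# per-iteration groundwater sums, and the largest iteration reached.
res <- DT[, .(
  divChng = max(div[IRCH == 1]) - min(div[IRCH == 1]),
  gwChng  = {s <- tapply(gw, iter, sum); max(s) - min(s)},
  iter    = max(iter[IRCH == 1])
), by = .(ts, ISEG)]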

http://www.burns-stat.com/documents/books/the-r-inferno/

Pat


Re: nested for loops too slow

Thierry Onkelinx
In reply to this post by emorway
You don't need loops at all.

    # Groundwater total per (ts, ISEG, iter), then its spread per (ts, ISEG)
    grw <- aggregate(gw ~ ts + ISEG + iter, data = dat, FUN = sum)
    GRW <- aggregate(gw ~ ts + ISEG, data = grw,
                     FUN = function(x) max(x) - min(x))
    # Spread of 'div' and the largest iteration, using only the IRCH == 1 rows
    DC <- aggregate(div ~ ts + ISEG, data = subset(dat, IRCH == 1),
                    FUN = function(x) max(x) - min(x))
    iter <- aggregate(iter ~ ts + ISEG, data = subset(dat, IRCH == 1),
                      FUN = max)
    tmp <- merge(DC, iter)
    merge(tmp, GRW)

Another option is to use the plyr package:

    library(plyr)
    merge(
      ddply(
        subset(dat, IRCH == 1),
        c("ts", "ISEG"),
        summarize,
        divChng = max(div) - min(div),
        max.iter = max(iter)
      ),
      ddply(
        dat,
        c("ts", "ISEG"),
        summarize,
        gwChng = diff(range(ave(gw, iter, FUN = sum)))
      )
    )

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey


Re: nested for loops too slow

Bert Gunter
Well, sort of...

aggregate() is basically a wrapper for lapply(), which ultimately must loop
over the function call at the R interpreter level, as opposed to vectorized
functions that loop at the C level and hence can be orders of magnitude
faster. As a result, there is often little difference in efficiency between
apply-type calls and explicit *smart* for() looping (smart in the sense Pat
Burns has already pointed out: not growing structures at each iteration,
among other things). For some of us, the chief advantage of the *apply
idioms is that the code is more readable and maintainable, with R handling
fussy details such as loop indexing; lapply() is also more in keeping with
the functional programming paradigm. Others find both these "virtues" to be
annoyances, however, and prefer explicit *smart* looping. Chacun à son goût
(to each his own).

None of which necessarily denies the wisdom of the approach you've
suggested, however. It may indeed be considerably faster, but timing will
have to tell. I am just trying to correct (again) the widely held
misperception that you seem to express.

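To illustrate the distinction on a toy example (a sketch, nothing to do
with the poster's data): tapply() and a preallocated explicit loop over
split() do comparable amounts of interpreter-level work, while rowsum(),
which does the grouped summation at C level, is typically much faster:

set.seed(1)
x <- rnorm(1e6)
g <- factor(sample(1e4, 1e6, replace = TRUE))

system.time(s1 <- tapply(x, g, sum))    # apply-family: loops at the R level
system.time({                           # explicit *smart* (preallocated) loop
  xs <- split(x, g)
  s2 <- numeric(length(xs))
  for (k in seq_along(xs)) s2[k] <- sum(xs[[k]])
})
system.time(s3 <- rowsum(x, g))         # grouped sums computed at C level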

Cheers,
Bert


Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is
certainly not wisdom."
Clifford Stoll


